This is to certify that the dissertation entitled "Network Routing and Planning in Switch-Based Networks," presented by Wenjian Qiao, has been accepted towards fulfillment of the requirements for the Doctoral degree in Philosophy. Major professor. Date: Sept. 9, 1997.

NETWORK ROUTING AND PLANNING IN SWITCH-BASED NETWORKS

By

Wenjian Qiao

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1997

ABSTRACT

NETWORK ROUTING AND PLANNING IN SWITCH-BASED NETWORKS

By Wenjian Qiao

In recent years, switch-based networks have received much attention in both local area networks (LANs) and wide area networks (WANs) due to their higher network bandwidth and throughput, greater interconnect scalability and flexibility, and better fault handling capability than shared-medium networks. For cost and performance reasons, switch-based networks should support arbitrary topologies. Although arbitrary topologies do provide the needed flexibility and incremental scalability, network routing and planning in such networks are not trivial and have a great impact on the network performance. This thesis studies issues of switching techniques, routing algorithms, topology design, network tuning, and performance evaluation in switch-based networks. Three switching schemes are reviewed and compared.
Analytical models are presented for average delay analysis, and comprehensive simulations are conducted to verify the analytical models and evaluate network performance under different switching schemes and buffer architectures. Two novel deadlock-free routing algorithms are proposed to allow arbitrary interconnection of cut-through (wormhole) switches. In the proposed routing algorithms, two opposite unidirectional Eulerian trails are used to avoid deadlock, and shortcuts are used to provide shorter routes. For initial network topology design, a simple method based on block design is proposed to efficiently minimize path length. To improve network performance, a systematic method is proposed for network tuning and incremental reconfiguration. Our design principles for network planning and tuning include minimizing routing path length, exploiting traffic locality, and balancing channel utilization. To obtain a practical insight into the network behavior and performance under various networking environments, a flexible and extensible simulator has been implemented. Comprehensive simulation experiments have been conducted to demonstrate network behavior under different situations and evaluate the network performance of our proposed algorithms.

© Copyright 1997 by Wenjian Qiao. All Rights Reserved.

To my parents and my husband

ACKNOWLEDGMENTS

I would like to take this opportunity to express my appreciation to those who have contributed to the completion of this dissertation. I am very grateful to my advisor, Professor Lionel M. Ni, for exposing me to a broad range of exciting issues in computer networks and distributed systems. Without his patient instruction and imaginative enlightening, this dissertation would not have been possible. He has provided me with a great deal of encouragement, a direction for my research, and a great sense of organization.

I would like to thank the other members of my dissertation committee, Professors Abdol-Hossein Esfahanian, Tien-Yien Li, and Matt W. Mutka, for their valuable suggestions on this dissertation, for their careful review of the manuscript, and for their time and support. Special thanks to Professor Esfahanian for his very good suggestions to improve this dissertation.

I wish to thank all the people who helped me during my stay at Michigan State University. Thanks especially to Natawut Nupairoj, Mingyao Yang, Dr. Hong Xu, Sherry Moore, Wei-kuo Liao, and Xipeng Xiao for their valuable technical discussions and sharing their views on research with me.

I also wish to acknowledge the love and support of my parents and my parents-in-law. My parents have provided their constant love, encouragement, and support. My parents-in-law offered their continuous support and patient understanding throughout my doctorate work.

Finally and foremost, I wish to acknowledge and thank the most important person in my life, my husband Dai Wan. He deserves most credit, for loving me, for providing support in time of need, and for putting up with me during all of my ups and downs.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Motivations
1.2 Problem Statement and Research Contributions
1.3 Organization of the Thesis
2 Switch Design and Switch Interconnects
2.1 Switching Technology
2.2 Considerations on Switch Design
2.2.1 Nonblocking vs. Blocking
2.2.2 Scalability and Incremental Expandability
2.2.3 Switching Techniques
2.2.4 Buffer Architecture and Channel Arbitration
2.2.5 Other Issues
2.3 Example Switches
2.4 Design Considerations on Switch Interconnects
2.4.1 Routing Algorithms
2.4.2 Topology Design and Incremental Reconfiguration
2.4.3 Other Issues
2.5 Traffic Characteristics and Measurement
2.6 Performance Metrics and Evaluation
3 Performance Evaluation of Switching Techniques
3.1 Analytical Model
3.1.1 Delay Analysis
3.1.2 Analytical Model for Store-and-Forward Switching
3.1.3 Analytical Model for Virtual Cut-Through Switching
3.1.4 Cut-Through Probability
3.2 Simulation Results
3.2.1 Simulation Environment
3.2.2 Verification of Analytical Models
3.2.3 Performance Evaluation
4 Deadlock-Free Routing in Arbitrary Switch-Based Networks
4.1 Related Work
4.2 Definitions and Eulerian Trail
4.2.1 Definitions
4.2.2 Eulerian Trail
4.3 Eulerian-Trail Routing Algorithm (ETR)
4.3.1 Routing Tables
4.3.2 Time Complexity
4.4 Adaptive-Trail Routing Algorithm (ATR)
4.4.1 Adaptive Trail
4.4.2 Routing Tables
4.4.3 Time Complexity
4.5 Heuristics
4.5.1 Heuristics about Degree of Adaptivity
4.5.2 Selection of Eulerian Trails
4.6 The Proof of Deadlock Freedom
4.7 Performance Evaluation
4.7.1 Network Workload and Parameters
4.7.2 Network Topologies
4.7.3 Uniform Traffic
4.7.4 Client/Server Traffic
5 Network Planning and Tuning in Switch-Based LANs
5.1 Design Considerations of Switch-Based Networks
5.1.1 Routing Path Length
5.1.2 Local Traffic, Remote Traffic and Inter-Switch Traffic
5.1.3 Balance Channel Utilization
5.2 Model Description
5.2.1 Basic Parameters
5.2.2 Design Variables
5.3 Initial Topology Design
5.3.1 From Block Design to Network Design
5.3.2 Construction of Blocks
5.3.3 Construction of B0
5.3.4 Considerations of Traffic Distribution
5.4 Incremental Reconfiguration
5.4.1 Overall Considerations
5.4.2 Add A New Client
5.4.3 Add A New Server
5.4.4 Network Tuning
5.5 Simulations
5.5.1 Simulation Environment
5.5.2 Performance Improvement by Network Tuning
5.5.3 Server Bottleneck
5.5.4 Fairness
5.6 Evaluation of the Network Planning Algorithm
6 A Flexible Simulation Model for Performance Evaluation
6.1 Overall Structure
6.2 Design and Implementation
6.2.1 Routing Algorithm
6.2.2 Switch Architecture
6.2.3 Incoming and Outgoing Channels
6.2.4 Message Generation and Reception
6.2.5 Flow Control
6.2.6 Basic Parameters
6.2.7 Performance Metrics
6.3 Workload and Topology Support
6.4 Flexibility and Expandability
6.5 Simulation Experiments
7 Conclusions
7.1 Contributions of the Thesis
7.1.1 Performance Evaluation of Switching Techniques
7.1.2 Routing Algorithm
7.1.3 Network Planning and Tuning
7.1.4 Simulations
7.2 Future Work
BIBLIOGRAPHY

LIST OF TABLES

4.1 Example routing table entries in Switch 8 for deterministic ETR routing
4.2 Example routing table entries for a first visited node
4.3 Example routing table entries for a multiply visited node
4.4 The network configurations for simulation
4.5 Three models to study the effect of server locations for A and D
5.1 List of important symbols in Chapter 5
5.2 Blocks of symmetric BIBD (7,3,1)
5.3 Blocks of BBD (14,3,1)
5.4 Blocks of CBD (14,4+2,≥1)
5.5 The two columns determined by i and j
5.6 The two columns determined by i and j (after rotation)
5.7 Relationship between block size and number of distinct differences
5.8 Minimum r to yield all differences in a BBD (s, r, ≥1)
5.9 Blocks of BBD (10, 4, ≥1): before reorder and after reorder
5.10 Number of clients in each workgroup
5.11 Traffic distribution of pattern 1
5.12 Traffic distribution of pattern 2
6.1 List of major parameters in SimNet

LIST OF FIGURES

1.1 A switch-based network with nine 8-port switches
2.1 A high-level view of switch architecture
2.2 A network with three switches and two channels
2.3 An example of latency/throughput performance curve
3.1 Four networks for verification and evaluation
3.2 Average latency of store-and-forward switching under uniform traffic
3.3 Average latency of virtual cut-through switching under uniform traffic
3.4 Average latency of store-and-forward switching under client/server traffic
3.5 Average latency of virtual cut-through switching under client/server traffic
3.6 Comparison of switching schemes under uniform traffic
3.7 Comparison of switching schemes under client/server traffic
3.8 Comparison of FIFO and shared buffering under uniform traffic
3.9 Comparison of FIFO and shared buffering under client/server traffic
3.10 Effect of packet size
3.11 Comparison of switching schemes under uniform traffic
3.12 Comparison of switching schemes under client/server traffic
3.13 Comparison of buffer management under uniform traffic
3.14 Comparison of buffer management under client/server traffic
4.1 An example of an arbitrary network topology
4.2 An example of deadlock
4.3 Illustration for the generation of adaptive trails
4.4 All cases where f-shortcuts from u to v should be removed to avoid deadlock
4.5 The deadlock case due to source shortcut
4.6 Six topologies considered in simulation
4.7 Comparisons of ETR and ATR under virtual cut-through switching
4.8 Performance under uniform traffic with fixed message distribution
4.9 Performance under uniform traffic with bursty message distribution
4.10 Effect of buffer size under uniform traffic
4.11 Performance under client/server traffic
4.12 Effect of server locations and number of servers
4.13 Effect of buffer size under client/server traffic
5.1 From block design to the network
5.2 The algorithm to construct B0
5.3 Algorithm to add a new client to the network
5.4 Algorithm to add a new server to the network
5.5 Algorithm for network tuning
5.6 Initial topology used for simulation
5.7 Performance improvement by applying network tuning
5.8 Performance comparison of traffic pattern 1 and traffic pattern 2
5.9 The importance of higher bandwidth to release server bottleneck
5.10 Comparison of fairness ratio for workgroup 2 with and without a higher bandwidth channel
5.11 Five non-isomorphic switch-based networks with 6 switches and 12 channels
5.12 Nine non-isomorphic switch-based networks with 6 switches and 11 channels
5.13 Performance comparisons of networks (6 switches and 12 channels) under uniform traffic
5.14 Performance comparisons of networks (6 switches and 11 channels) under uniform traffic
5.15 Performance comparison of networks (6 switches and 12 channels) under client/server traffic
5.16 Performance comparisons of networks using high bandwidth server ports under client/server traffic
6.1 The structure and modular relationship in SimNet
6.2 Data structure RoutEntry
6.3 A generic model of a switch
6.4 Definition of class Switch
6.5 Algorithm to simulate switch forwarding in each cycle
6.6 Class Definitions of InChannel and OutChannel

Chapter 1

Introduction

In recent years, switch-based networks have received much attention in both local area networks (LANs) and wide area networks (WANs) due to their higher network bandwidth and throughput, greater interconnect scalability and flexibility, and better fault handling capability than shared-medium networks. Traditional shared-medium LANs do not provide satisfactory throughput and latency for some communication-intensive applications, especially the emerging multimedia applications. For example, a fast workstation can easily choke a 10 Mbps Ethernet. On a shared-medium network, only one host can be a sender at a given moment and all other hosts can only be receivers. Switch-based networks make such applications possible by providing virtual point-to-point communication and achieving higher bandwidth and lower latency. Security is also improved over shared-medium networks because data on the physical link is no longer always available to every host on the subnet. Switched LANs are also an incremental approach to upgrade LAN performance, since the investment in existing adapters, cables, and drivers is preserved and switched LANs still conform to widely accepted network standards.

An example of a switch-based network is shown in Figure 1.1, where nine 8-port switches are interconnected. A port is either connected to a node or left open for future usage. A node can be a workstation, a multi-processor, or a gateway to another network. In a switch-based network, each host computer has a network adapter connecting to a network switch. When the scale of the network increases due to the increasing number of host computers and the increasing demand for aggregate network bandwidth, more switches and channels can be added to the network to accommodate the increasing communication demands. The interconnection of switches defines various network topologies. For cost and performance reasons, switch-based networks should support arbitrary topologies. Although arbitrary topologies do provide the needed flexibility and incremental scalability [1], network routing and planning in such networks are not trivial and have a great impact on the network performance.

Figure 1.1: A switch-based network with nine 8-port switches

A switch-based network provides an excellent platform for distributed multimedia applications and high performance computing based on networks of workstations. Obviously, good switch design and appropriate interconnection of switches play significant roles in improving network performance and meeting performance requirements. This thesis addresses the challenging issues in switch design and switch interconnects, presents practical and efficient methods for network routing, planning, and tuning, and evaluates network performance based on comprehensive simulation results. Analytical modeling is also presented to analyze the average latency for store-and-forward and virtual cut-through switching.

There are two common forwarding schemes used in building LAN switches: store-and-forward and cut-through.
Store-and-forward switching must keep a whole frame of data buffered before forwarding the data. The disadvantages of this approach include increased latency and buffer space, especially for frame-switching networks. In contrast, cut-through switching can send packets in a pipeline fashion and forward partially received data as soon as the packet header is received and decoded, and thus promises low-latency delivery. On the other hand, store-and-forward switching can make better decisions for corrupted or truncated frames by receiving a whole packet, whereas cut-through switching simply propagates a frame all the way to its destination. If a frame is corrupted or truncated, it may not be possible to remove it fully from the network until it reaches the destination, thus wasting network bandwidth. However, in a LAN environment the error rate for a data frame is generally very low. Therefore, latency-sensitive real-time applications and high performance computing will greatly benefit from cut-through switching because of its shorter latency.

Cut-through switching can be further distinguished as virtual cut-through switching [2] and wormhole switching [3]. Virtual cut-through switching buffers the entire packet in a large buffer when the outgoing channel is in use, effectively degrading to the store-and-forward mode. In contrast, a blocked wormhole packet stalls in the network, holding previously acquired channels in its route until the outgoing channel becomes available. Since a wormhole switch does not need to store entire packets, it is enough to have a small buffer holding a few flits of the packet, which relieves the burden on buffer space. Furthermore, wormhole switching can provide primitive link-to-link flow control capability due to the back pressure mechanism on each link. Wormhole switching can achieve low latency, particularly in a lightly loaded network; however, a blocked wormhole packet may stall other traffic on its route because it holds the channel(s). This limits network throughput and complicates the effort to avoid packet deadlock. Wormhole switched networks can improve throughput by including additional flit buffers or by incorporating virtual channels [4] on each physical channel to allow other traffic to bypass a blocked packet.

Many new generation switches support cut-through switching, such as the DEC GIGAswitch [5] for FDDI networks, the Ancor FCS 266/1062 [6] switches for FCS networks, the Myricom Myrinet [7], Ethernet switches [8], and Sun Microsystems' S-Connect [9]. Unlike cell-based switches such as ATM switches, the above frame-based switches allow large packets and can increase the effective channel utilization. This research concentrates on such frame-based switches, which can support frames of varying size.

There are design and performance tradeoffs among the different switching techniques. Although cut-through switching schemes have the potential to reduce communication latency, store-and-forward switching can simplify operation over different channel bandwidths and reduce flow-control requirements. Ideally, the selection of a specific switching scheme should depend on network conditions, dynamically changing the switching mode to provide the best network performance. Cut-through switching could be applied to achieve low latency under light workload, while store-and-forward switching may be applied to achieve high throughput under heavy workload.
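To make the latency difference concrete, the following sketch (my own illustration, not one of the analytical models developed in Chapter 3) compares the first-order no-load latency of the two schemes for a packet of L bytes with an Lh-byte header crossing D switches over channels of bandwidth B bytes per cycle; contention, switch delay, and flow control are ignored, and the parameter names are assumptions of the example.

    // Illustrative no-load latency comparison.
    // Store-and-forward: the whole packet is retransmitted on every hop.
    // Cut-through: only the header is delayed per hop; the body is pipelined.
    #include <iostream>

    double saf_latency(double L, double B, int D) {
        return D * (L / B);            // D full packet transmission times
    }

    double cut_through_latency(double L, double Lh, double B, int D) {
        return D * (Lh / B) + L / B;   // per-hop header delay + one packet time
    }

    int main() {
        double L = 1024, Lh = 8, B = 1;   // bytes, bytes, bytes per cycle
        for (int D = 1; D <= 4; ++D)
            std::cout << D << " hops: SAF = " << saf_latency(L, B, D)
                      << " cycles, cut-through = "
                      << cut_through_latency(L, Lh, B, D) << " cycles\n";
        return 0;
    }

Under these assumptions the store-and-forward latency grows with the product of distance and packet length, while the cut-through latency is nearly independent of distance for long packets, which is the basic argument for cut-through switching in latency-sensitive settings.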
1.1 Motivations

High-speed networks are demanded for various applications, such as high-performance computing, database applications, and distributed multimedia communications. A high-speed network should provide high throughput, low latency, and high scalability for various applications. However, current network systems may not satisfy these requirements. System area networks (e.g., some MPP machines) may provide high throughput due to high-bandwidth channels and low latency due to wormhole routing. But they usually have the constraints of regular topologies and small physical distance between switches (routers), which result in poor incremental scalability in general. Shared-medium LANs (e.g., Ethernet) do not provide scalable bandwidth to accommodate the increasing communication demands. In traditional store-and-forward switched networks, the communication latency is high because packets have to be entirely buffered in each intermediate switch before forwarding.

Therefore, the question is how to design a high-speed network that deals with the following requirements. First, it should provide high throughput and low latency, which means cut-through switching is the choice in such a network. Second, it should be able to cover longer distances within a local area; thus, system area networks may not be a good candidate. Third, it should be incrementally scalable. Because regular topologies impose stringent interconnect constraints and may not be incrementally scalable, arbitrary topologies are needed for economic reasons, technical requirements, and engineering wiring constraints. Fourth, multiple paths between switches may be needed in order to provide fault tolerance. It is the above requirements that trigger our research to study issues in switch-based networks and allow arbitrary switch interconnects.

As we mentioned earlier, switch-based networks are becoming popular to meet the increasing demand for higher network performance. Meanwhile, there are still many challenging issues open to researchers. A good method for performance evaluation is necessary to understand the network behavior and guide network design under various network environments. It is not trivial to allow deadlock-free routing in an arbitrary network, especially for a wormhole switched network. Without topology constraints, network planning and incremental reconfiguration become important in order to provide reasonable network performance. In order to increase sustained throughput, channel utilization should be balanced under the expected traffic workload. Some other challenges include fault-tolerant routing and efficient multicast communication.

1.2 Problem Statement and Research Contributions

This thesis considers high-speed switch-based LANs, where wiring length is short and the cost of channels is relatively insignificant compared with the cost of switches. Such networks provide lower network latency, high incremental scalability, and very infrequent transmission errors. Due to the high reliability of transmission, link-level acknowledgment is eliminated and only end-to-end acknowledgment is supported. Such a network has fully incremental scalability, where channels and switches may be added, removed, or reconfigured to adapt to a particular network traffic pattern of a given environment.

In order to reduce communication latency, the trend of new generation switches is to support cut-through switching.
Recently, cut-through switching is being applied to switch-based interconnects like Myrinet [7] and ServerNet [10] to build networks of workstations in LANs. The concept of wormhole switching, as a special case of cut-through switching, has been used in new generation scalable parallel computers, such as the IBM SP [11], Cray T3D [12], MIT J-machine [13], Ncube-3 [14], and Intel Paragon [15]. In order to avoid deadlock, those scalable parallel computers have to use regular network topologies, such as meshes and tori. Many deadlock-free routing algorithms have been proposed for such regular network topologies; interested readers may refer to [16] for a detailed survey. However, using regular network topologies is impractical and uneconomical for high-speed switched networks, as regular networks impose stringent interconnect constraints and are not incrementally scalable. Thus, a natural approach is to allow arbitrary interconnection of switches. A major difficulty in constructing a large-scale network with wormhole switches is to avoid deadlock among simultaneously transmitting frames. An efficient deadlock-free routing algorithm is highly demanded for switch-based networks using wormhole switches. Deadlock-free routing algorithms based on regular topologies cannot be applied to an arbitrary topology. Although some related algorithms have been proposed for arbitrary topologies, they either are too complex or easily cause traffic bottlenecks.

While arbitrary topologies do provide flexibility and incremental scalability [1], the design of an appropriate network topology can be a complex undertaking. After the network becomes operational, network tuning and upgrading may be needed to maintain high performance. Network topology design has been an important issue for many years. In the past, considerable research effort has been spent on optimal topological design [17, 18, 19, 20], which was based on some assumptions including known traffic distributions and a homogeneous environment. Although these algorithms do make important contributions to topological design, they may not be applicable to real network environments due to unrealistic assumptions and very time-consuming algorithms.

In this thesis, we study network routing and planning issues in switch-based networks and evaluate network performance under different switching techniques, different routing algorithms, different topologies, different traffic distributions, and different network parameters. The major contributions are summarized as follows.

1. Both analytical modeling and simulation results are used to evaluate performance in switch-based networks. Analytical modeling provides a cost-effective way to predict the average latency for store-and-forward switching and virtual cut-through switching. However, analytical models usually require simplifying assumptions and may not be able to give a practical insight into network behavior due to the dynamic nature of networks. Therefore, simulation results are measured to evaluate performance under various network environments that are difficult to estimate by analytical modeling. Simulation results are also used to verify the analytical models.

2. Two novel deadlock-free routing algorithms are proposed for arbitrary networks using wormhole switches [21]. In order to avoid deadlock, an Eulerian trail is used to help maintain an order of channel dependency and allow reasonable routing.
The critical issue is how to reduce network latency due to channel blocking while at the same time avoiding deadlock. In order to reduce message blocking time and allow more and shorter routing paths, shortcuts are added to the Eulerian trail. To avoid deadlock, some shortcuts have to be removed or used in a restricted way.

3. We study design principles and propose efficient and practical methods for network planning and tuning in switch-based LANs. To provide good network performance, our design criteria include minimizing routing path length, exploiting traffic locality, and balancing channel utilization. The idea of block design is applied to initial topology design to efficiently minimize routing path length. For network tuning and incremental reconfiguration, we consider traffic distributions in the network and present a systematic method to reduce message latency and increase network throughput. Simulation experiments are conducted to demonstrate the performance improvement obtained by applying our method.

4. We have implemented a flexible and extensible simulator for the purpose of performance evaluation in switch-based networks. Our simulator enables comprehensive experiments with a variety of routing algorithms, network topologies, switching techniques, and buffer architectures under diverse traffic patterns. Due to the dynamic nature of networking environments, simulation results should give us more practical insight into network behavior and performance than analytical modeling. Intensive simulation experiments are conducted under various network environments, and network performance is compared for different routing algorithms, different topologies, different traffic distributions, and many other network parameters.

1.3 Organization of the Thesis

This thesis presents effective techniques for the design and evaluation of switching techniques, routing algorithms, and network topologies in switch-based networks. Buffer architectures, traffic distributions, and many other network parameters are considered as well.

Chapter 2 investigates major issues in switch design and switch interconnects. It addresses fundamental issues such as scalability, switching techniques, deadlock-free routing, network planning and tuning, traffic characteristics, and performance metrics.

In Chapter 3, we compare different switching schemes and buffer architectures based on both analytical models and simulation results. Analytical modeling provides an easy way to estimate the average latency in the network, but it may depend on some unrealistic assumptions. Comprehensive simulation results can demonstrate the effect of different network environments on performance.

Chapter 4 presents two novel deadlock-free routing algorithms for arbitrary wormhole switched networks. The routing algorithms are based on an Eulerian trail and added shortcuts. Several related algorithms are reviewed and compared with our algorithms based on comprehensive simulation results.

In Chapter 5, the idea of block design is proposed for initial topology design to efficiently minimize path length, and a systematic method is proposed for network tuning to improve performance. Design principles considered in these methods include minimizing routing path length, exploiting traffic locality, and balancing channel utilization.
Chapter 6 describes a flexible and extensible simulation model, which enables experiments with different routing algorithms, topology designs, switching techniques, and buffer management under various network workloads.

Chapter 7 is the conclusion of this thesis, where the major contributions of this research are summarized and some interesting topics are indicated for future work.

Chapter 2

Switch Design and Switch Interconnects

Switch-based networks are becoming popular to meet the increasing demand for higher network performance. To meet the performance requirements of many communication-intensive applications, good switch design and appropriate interconnection of switches are critical to deliver the demanded network performance. Many challenging issues are open to researchers, and major design considerations have been discussed in [22]. In this chapter, we review major issues and design principles in the design of switches and switched networks.

2.1 Switching Technology

Switching technology operates at layer 2 of the OSI Reference Model. A switch makes a relatively simple forwarding decision based on the destination MAC address contained in each packet [23]. Switches allow bandwidth to be scaled and can alleviate traffic bottlenecks in traditional shared-medium LANs. Today, commercial switching products are available for Ethernet, Fast Ethernet, ATM and FDDI.

Switches provide many internetworking benefits. Switches economically segment the network into smaller collision domains, providing a higher percentage of bandwidth to each station. Switches have protocol transparency, which allows them to be installed in multiple protocols with little or no software configuration [23]. Ethernet switches are fully compatible with the existing cables, hubs, and network adapters without expensive hardware upgrades. The use of application-specific integrated circuit (ASIC) technology allows a switch to provide greater performance by supplying high packet throughput with extremely low latency. Therefore, it is possible to simultaneously forward packets across all ports at wire speed. For example, a single Ethernet interface can support a maximum throughput of 10 Mbps. A twelve-port "wire speed" Ethernet switch, supporting twelve concurrent streams, must provide an aggregate throughput of 12 × 10 = 120 Mbps. Switches are designed and tuned to address LAN performance problems resulting from bandwidth shortages and network bottlenecks. Switches solve these problems by providing high aggregate bandwidth and low latency at a low cost per port.

2.2 Considerations on Switch Design

A high-level view of switch architecture is shown in Figure 2.1. Its function is to forward packets from input ports to output ports. Each port is associated with a pair of input and output channels (or a bidirectional channel). A port may connect to a node which generates and consumes messages, connect to a port of another switch which defines the network topology, or be left open for a future connection. A node (or end node) can be a workstation, a multi-processor system, or a gateway to another network. To clarify our description, a port connecting to another switch is called a trunk port and a port connecting to a node is called a terminal port.
Figure 2.1: A high-level view of switch architecture (switch boards built from switch modules and switching elements, with input and output buffers, an intra-switch interconnect, and a control unit)

The switch fabric is based on many small switching elements. Switch modules are built from switching elements, where a switch module is usually a physical chip. A switch board is built from a number of interconnected switch modules. A switch consists of many stacked switch boards, where the number of switch boards is determined by the switch capacity, i.e., the number of ports.

When a MAC-layer frame arrives, a routing tag is added to the front of the frame once the destination MAC address is available for the routing decision. Then the frame is routed through the switch to the output port using the routing tag. The routing tag is stripped at the output port, and the frame is delivered to a destination host or another switch. An internal packet includes a routing tag, the MAC-layer header and trailer, and the payload data. There are some input FIFO buffers for packets which cannot be delivered immediately because of port/channel contention in the switch. Some output FIFO buffers are available to accommodate network links of different speeds. A shared buffer pool could also be applied in switch design. Cut-through switching may pipeline a packet through a sequence of switches without fully buffering the packet at each switch.

2.2.1 Nonblocking vs. Blocking

A nonblocking switch guarantees that any routing request to any free output port can be supported without interference from other packets being routed. Nonblocking switches have dedicated bandwidth for each port, which helps guarantee the quality of service for some applications.

A nonblocking switch is also important for inter-switch deadlock freedom when transmitting data through several interconnected wormhole switches without buffering messages between switches. For a network environment which only requires intra-switch deadlock freedom, blocking switches may well meet the performance requirement. Nonblocking switches help in an environment which requires inter-switch deadlock freedom.

2.2.2 Scalability and Incremental Expandability

A scalable switch means that the same switch architecture can be applied when higher performance is demanded from the switch. Therefore, bottlenecks existing in the switch architecture will severely degrade its scalability. For example, a design based on a high-speed bus is not scalable. The bandwidth of a bus is constrained by some physical limits, and the bus becomes a bottleneck when the bandwidth for each port needs to be increased beyond some point. A design based on shared memory is not scalable either, because the memory latency and bandwidth of dual-ported memory modules limit the performance of the switch.

The scale of a switch tends to grow with the increasing number of hosts connected to it. To achieve good expandability in switches, a modular design is usually followed [25]. A switch is constructed from switch boards which are in turn built from some building modules. Ideally, the minimum requirement to expand a switch to the next allowed configuration is a single switch board. For example, if a switch board has four input and output ports, it is more economical to expand the switch by adding four ports, that is, one switch board at a time, rather than having to add at least eight or even more extra ports.
2.2.3 Switching Techniques

Switches forward traffic based on one of two forwarding models:

- Store-and-forward switching
- Cut-through switching

As we mentioned earlier, a store-and-forward switch reads and validates the entire packet before initiating the forwarding process. This allows the switch to discard corrupted packets and permits the network manager to define custom packet filters to control traffic through the switch. The disadvantage of store-and-forward switching is that latency increases in proportion to the size of the packet.

A cut-through switch (virtual cut-through or wormhole) starts the forwarding process immediately upon receipt of enough header information to make a routing decision. Since the switch only has to read the destination MAC address before it begins to forward a frame, packets are processed faster and latency stays at the same low level for both short and long packets. The result is a dramatic reduction in network latency over the traditional store-and-forward switching techniques under light to moderate traffic. The major disadvantage of pure cut-through switching is that corrupted frames are also forwarded by the switch. However, this should not be a big problem in LANs, where the transmission error rate is extremely low.

Although cut-through switching techniques have the potential to reduce network latency, most existing local and wide area networks employ packet switching to reduce flow-control requirements, simplify routing algorithms, and simplify transmission over a range of different channel bandwidths. Still, wormhole and virtual cut-through switching are viable options in more homogeneous, tightly coupled domains, such as system networks and high-speed switch designs [26, 27]. Research projects like Shrimp [28] and Atomic [29] as well as commercial products like Sun Microsystems' S-Connect [9] and Myricom's Myrinet [7] have been the results of this paradigm.

2.2.4 Buffer Architecture and Channel Arbitration

Buffers in a switch can be organized in various ways: input, output, cross-point, and shared. FIFO queues or non-FIFO buffers can be used to manage buffers. Input FIFO queuing is the simplest but worst performing architecture. In such a case, only the front packet of each input queue is scheduled for possible routing, which is called head-of-line (HOL) blocking and limits throughput due to output contention. In [30], the performance of nonblocking switches with FIFO input buffers is surveyed. The analysis shows that the maximum throughput of a nonblocking switch with FIFO input buffers is 58.6% (2 − √2) [31, 32]. One approach to improving switch performance is to reduce or eliminate HOL blocking. A non-FIFO input buffering architecture provides better performance than input queuing [33]. Use of output buffers or a shared buffer improves the throughput by eliminating HOL blocking. However, these solutions are complicated to implement in practice and usually can only handle fixed-size frames (packets) [34]. In output queuing, each buffer is associated with one outgoing link and must be able to accept, in the worst case, packets arriving simultaneously from all inputs [35].

If more than one incoming packet competes for the same outgoing channel, the arbitration scheme coordinates access amongst the competing packets. Round-robin arbitration is a common demand-driven approach to divide bandwidth fairly among incoming traffic. Priority-based schemes may improve performance by favoring longer queues or packets with stricter delay requirements [36].
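As an illustration of demand-driven arbitration (a generic sketch, not a design taken from any of the switches surveyed here), the following code grants one of N competing input requests to an output port in round-robin order; the class name, the template parameter, and the per-cycle calling convention are assumptions of the example.

    #include <array>
    #include <optional>

    // Minimal round-robin arbiter for one output port of an N-input switch.
    // Each cycle, grant() starts scanning one position past the last winner,
    // so every persistently requesting input is served within N cycles.
    template <int N>
    class RoundRobinArbiter {
    public:
        std::optional<int> grant(const std::array<bool, N>& request) {
            for (int i = 1; i <= N; ++i) {
                int candidate = (last_grant_ + i) % N;
                if (request[candidate]) {
                    last_grant_ = candidate;   // remember winner for fairness
                    return candidate;          // input port granted this cycle
                }
            }
            return std::nullopt;               // no input is requesting
        }
    private:
        int last_grant_ = N - 1;               // so the first scan starts at input 0
    };

Because the scan pointer always advances past the most recent winner, no input can starve another, which is the fairness property the text attributes to round-robin arbitration.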
2.2.5 Other Issues

Besides the issues discussed above, many other important issues also need to be studied, and some of them are addressed in the following.

Deadlock-Free Routing

Deadlock is a well-known issue in wormhole switching with back pressure flow control. Because a message can hold some channels while waiting for another channel, deadlock may occur if there is a cycle in the channel dependency graph [16]. Deadlock will cause messages to be blocked in the network forever. Therefore, it is important that the routing algorithm is deadlock free. In practice, even if the routing algorithm is deadlock free, deadlock may still occur due to some hardware or software errors, such as a lost trailer signal. A long-period timeout that forces a packet out of the network can be used to recover from deadlock completely for such rare cases [37].

There are two levels of deadlock freedom in a switch-based network. One guarantees that within a switch the routing is deadlock free, i.e., if the whole network is connected to a single switch, deadlock never happens. This may be referred to as intra-switch deadlock freedom. The other guarantees that switching in a network interconnected by more than one switch is deadlock free. This may be referred to as inter-switch deadlock freedom. To achieve inter-switch deadlock freedom, careful considerations need to be made on switch design, network topology, and routing algorithm.
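A routing function can be checked for this property by building its channel dependency graph and testing the graph for cycles. The sketch below is my own illustration (the construction of the graph from a concrete routing table is omitted, and the adjacency-list representation is an assumption); it uses an iterative depth-first search with three colors to detect a cycle.

    #include <utility>
    #include <vector>

    // Channel dependency graph: node i is a channel; adj[i] lists the channels
    // that a packet may request while still holding channel i under the routing
    // function. By the classical sufficient condition, the routing function is
    // deadlock free if this graph is acyclic.
    bool hasCycle(const std::vector<std::vector<int>>& adj) {
        enum Color { White, Gray, Black };
        int n = static_cast<int>(adj.size());
        std::vector<Color> color(n, White);

        for (int start = 0; start < n; ++start) {
            if (color[start] != White) continue;
            // Explicit stack of (node, next-neighbor-index) pairs.
            std::vector<std::pair<int, int>> stack{{start, 0}};
            color[start] = Gray;
            while (!stack.empty()) {
                auto& [u, idx] = stack.back();
                if (idx < static_cast<int>(adj[u].size())) {
                    int v = adj[u][idx++];
                    if (color[v] == Gray) return true;   // back edge: cycle found
                    if (color[v] == White) {
                        color[v] = Gray;
                        stack.push_back({v, 0});
                    }
                } else {
                    color[u] = Black;                    // channel fully explored
                    stack.pop_back();
                }
            }
        }
        return false;                                    // graph is acyclic
    }

The test runs in time linear in the number of channels and dependencies, so it is cheap enough to rerun whenever routing tables are recomputed.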
Link-level Back Pressure Flow Control

Link-level back pressure flow control is used in wormhole switches. When the input or output buffers in a switch are nearly full, back pressure is generated. This back pressure can be propagated to the sending source to throttle it from sending more data while a packet is blocked. Thus, there is less packet loss caused by buffer overflow if this mechanism is utilized. In a local environment, most packet loss is due to buffer overflow instead of transmission errors. Packet loss is generally handled by high-level network protocols, which involve timeouts and other overhead. Therefore, both network throughput and latency benefit from the reduced packet loss provided by wormhole switching with link-level flow control.

Packaging and Layout Issues

For ease of maintenance and expandability, the switch design must follow a modular approach. There may be many hierarchies of modules, from chips to switch boards. How to efficiently design the 2D layout on a board and stack switch boards in 3D is a critical engineering issue. When a switch is expanded or reconfigured due to a fault, it is desirable that a minimum amount of rewiring is required. The packaging of a switch decides the physical space consumed, the wiring complexity, the wire length, and whether a switch is stackable, easy to configure and maintain, and ready for expansion [24].

Hardware Multicast

Multicast communications are critical in teleconferencing, video-on-demand, distributed computing, and so forth. Conventional shared-medium networks have a very natural way to support multicast/broadcast communications. In a switch-based network, however, each switch must possess multicast capability. Hardware multicast in networks based on cut-through switching with back pressure flow control is not a trivial task because it is very difficult to prevent deadlock. A multicast-capable cut-through switch design was proposed in [24].

Fault Tolerance

It is likely that one or more of the building modules of a switch will become faulty at some point. A fault-tolerant switch either can continue to function well regardless of the faulty module(s), or is able to function well by making minimal reconfiguration to the existing boards, without adding new good boards. We call the former on-line fault tolerance and the latter board-level fault tolerance. On-line fault tolerance can tolerate faulty boards without loss of service. Board-level fault tolerance may lose some service upon failure but allows quick and easy recovery.

2.3 Example Switches

As mentioned before, many switches have become commercially available. In this section, we review the design characteristics of some commercial switches.

Myrinet Switch

Myrinet [7] supports cut-through switching in arbitrary local network topologies. It was developed from two research projects: Caltech Mosaic [38], which was a multicomputer, and USC/ISI ATOMIC [29], which was a high-speed local area network built using Mosaic components.

A Myrinet network consists of a number of Myrinet switches and host interfaces. Myrinet can be both a LAN of relatively small physical diameter and a System Area Network (SAN) that clusters computer components. A Myrinet link has a full-duplex pair of Myrinet channels. Each channel carries a sequence of 9-bit characters. A SAN channel operates at 1.28 Gb/s up to 3 meters. A LAN channel operates at 1.28 Gb/s up to 10 meters, or at half this speed up to 25 meters. The core of a Myrinet switch is a pipelined crossbar for cut-through switching. A custom VLSI CMOS chip that implements the link-level protocol and flow control provides the SAN ports, and another custom VLSI CMOS chip can translate SAN ports to LAN ports. Flow control is done on each physical channel with STOP and GO control characters and a "slack buffer" at the receiver. A slack buffer has 80 bytes. A STOP control character is sent to the sender when the slack buffer begins to be filled with more than 48 bytes, and a GO control character is sent to the sender when the data in the slack buffer drops to less than 32 bytes. An 8-bit CRC character is computed on each packet. Low-latency and low-error-rate communication is achieved in Myrinet with its cut-through switching, flow control, and error control.

A Myrinet Control Program (MCP) runs on each host interface. The MCP interacts with the host to send packets from the host to the network or deliver incoming packets to the host. The MCP also maps network addresses to routes. The routing algorithm is designed to be deadlock free; up*/down* routing [39] is currently being used. Deadlock due to hardware failure is guarded against with a long timeout of 50 ms.
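The STOP/GO slack-buffer mechanism can be expressed as a small piece of receiver-side logic. The sketch below only illustrates the thresholds quoted above (an 80-byte slack buffer, STOP above 48 bytes, GO below 32 bytes); the class and member names are assumptions of the example, not Myrinet's implementation.

    // Illustrative receiver-side slack buffer with STOP/GO flow control.
    class SlackBuffer {
    public:
        enum Signal { None, Stop, Go };

        // Called as bytes arrive from the link and are drained toward the crossbar.
        // Returns the control character that should be sent back, if any.
        Signal update(int bytes_in, int bytes_out) {
            occupancy_ += bytes_in - bytes_out;
            if (!stopped_ && occupancy_ > kStopThreshold) {
                stopped_ = true;
                return Stop;                       // ask the sender to pause
            }
            if (stopped_ && occupancy_ < kGoThreshold) {
                stopped_ = false;
                return Go;                         // sender may resume
            }
            return None;
        }

    private:
        static constexpr int kCapacity      = 80;  // total bytes of slack
        static constexpr int kStopThreshold = 48;  // send STOP above this level
        static constexpr int kGoThreshold   = 32;  // send GO below this level
        int  occupancy_ = 0;
        bool stopped_   = false;
    };

The gap between the STOP threshold and the buffer capacity absorbs the bytes already in flight on the wire when STOP is issued, which is why the slack buffer is larger than the STOP threshold.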
ServerNet Switch

ServerNet [40] is a SAN which integrates I/O transfers and inter-processor communication within the same network. Scalable I/O is achieved by decoupling processor buses from I/O buses with an interconnection network between processors and I/O components. The basic components of ServerNet include six-way crossbar routers, dual-ported processor interfaces on a PCI card, and low-cost peripheral-device interfaces. The addressing and routing provided by ServerNet can support communications between CPUs, from CPUs to I/O devices, or between I/O devices. ServerNet allows reliable, scalable and powerful computing systems to be built using clustered commodity components.

A ServerNet router has FIFO buffers on the inputs, logic for arbitration and flow control, a routing table, and a crossbar switch. Cut-through switching is implemented by the routers to provide low-latency communication. Link-level flow control prevents packet loss during network congestion. Six-port switches allow various network topologies such as hypercube, mesh, and tree. Routing is based on table lookup: the 1024-entry routing table maps the destination address in the packet header to the output port. Deadlock freedom is achieved by routing table entry design, and configurations with circular dependencies are disallowed. New topologies called "fractahedrons" [10] were proposed for deadlock avoidance in ServerNet.
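Table-lookup forwarding of this kind has a very simple fast path. The following sketch is an illustration only; the table size follows the 1024-entry description above, but the entry layout, names, and installation interface are assumptions, not ServerNet's actual format.

    #include <array>
    #include <cstdint>

    // Illustrative table-lookup forwarding for a six-port router.
    struct RouteEntry {
        uint8_t out_port;   // 0..5 for a six-way crossbar
        bool    valid;      // invalid entries mean "no route"
    };

    class RouterTable {
    public:
        // Destination IDs index directly into a 1024-entry table.
        int lookup(uint16_t dest_id) const {
            const RouteEntry& e = table_[dest_id & 0x3FF];   // 10-bit index
            return e.valid ? e.out_port : -1;                // -1: drop or report error
        }

        // Routes are installed by management software; deadlock freedom is
        // preserved by never installing entries that close a dependency cycle.
        void install(uint16_t dest_id, uint8_t out_port) {
            table_[dest_id & 0x3FF] = RouteEntry{out_port, true};
        }

    private:
        std::array<RouteEntry, 1024> table_{};
    };

Because the lookup is a single indexed read, the routing decision can be made as soon as the destination field of the header has been received, which is what makes cut-through forwarding practical in such routers.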
Ports on the switch are grouped 25 into separate logical networks. Virtual LAN limits broadcast traffic and offers higher secu- rity. 2.4 Design Considerations on Switch Interconnects The interconnection of those switches defines various network topologies. Regular net- work topologies do not have good incremental scalability due to the stringent intercon- nection constraints and may not be able to use the original routing schemes with faulty nodes. Therefore, a natural approach is to allow arbitrary switch interconnects like Autonet [41], Myrinet, and ServerNet. Arbitrary switched networks are truly incrementally scalable and have potential to be reconfigured to adapt to the dynamics of network traffic. Many challenging issues need to be studied in a switched network. Some major tasks include routing and flow control, topology design and incremental reconfiguration, traffic pattern, performance evaluation, reliability, scalability, and fault tolerance. 2.4.1 Routing Algorithms Routing plays an important role in a switched network. Routing decision is usually im- plemented based on table lookup or on a finite state machine. An important engineering issue is to make the routing decision as fast as possible. In this dissertation, we focus on routing in wormhole switched networks. A major difficulty in constructing a large-scale network with wormhole switches is to avoid deadlock among simultaneously transmitting frames. Arbitrary switch interconnects make the problem even harder. Three philosophies have been used to develop deadlock-free routing algorithms. One is based on deadlock 26 prevention which never allows the formation of a cycle and is quite conservative. Another approach allows the existence of cycles [42, 43], but does provide a path to escape from the cycle. The third and more aggressive approach is based on deadlock detection [44, 45]. Practical, convincing, and meaningful quantitative evidence is needed to demonstrate the superiority of one approach over other approaches. For regular topologies like k—ary n-cube, many deadlock-free routing algorithms have been proposed in literature. For arbitrary switched networks, the first known deadlock- free routing algorithm is probably the up*/down* routing used in the DEC AN 1 system [41]. A similar approach is used in the Myrinet. In the up*/down* algorithm, a breadth- first spanning tree is constructed from a selected root. Based on the direction to the root node, the network is divided into two acyclic subnetworks (up and down). When a packet enters the down subnetwork, it will never enter the up subnetwork to ensure cycle free. The algorithm is simple and easy to understand, but it may introduce severe traffic congestion near the root and allow a routing path as twice long as a shortest path. Some other routing algorithms have been proposed as well. Adaptive-trail routing (ATR) was proposed in [21] to allow deadlock-free adaptive routing in arbitrary switched networks. The basic idea of ATR is using an Eulerian trail as a ‘base line’ to maintain chan- nel dependency order and allow reasonable routing. To minimize the path length, shortcuts are added to the Eulerian trail. The smart routing (SR) [46] created a channel dependency graph from the network topology and broke dependency cycles one by one, which mini- mized a heuristic cost function based on average path length. The routing is represented by the final non-cyclic channel dependency graph. 
Some other routing algorithms have been proposed as well. Adaptive-trail routing (ATR) was proposed in [21] to allow deadlock-free adaptive routing in arbitrary switched networks. The basic idea of ATR is to use an Eulerian trail as a 'base line' to maintain the channel dependency order while still allowing reasonable routing; to minimize path length, shortcuts are added to the Eulerian trail. Smart routing (SR) [46] creates a channel dependency graph from the network topology and breaks dependency cycles one by one, minimizing a heuristic cost function based on average path length. The routing is represented by the final acyclic channel dependency graph. The traffic balancing, which assumes uniform traffic, depends on a linear programming solver and is the most time-consuming part. Graph partitioning is applied in [47] to avoid deadlock; to evaluate it in arbitrary networks, a simple and efficient implementation has to be devised, which may not be intuitive. A recent study [48] proposed a general methodology for the design of adaptive routing algorithms in arbitrary topologies. In this method, all physical channels are split into two virtual channels: 'original channels' and 'new channels'. For the original channels, a deadlock-free routing algorithm such as up*/down* is defined. A message can only leave its source on a new channel, but it can use original channels at intermediate switches if no new channel is available. Once a message is transmitted on any original channel, it cannot go back to new channels.

2.4.2 Topology Design and Incremental Reconfiguration

The design of an appropriate network topology can be a complex undertaking, and the configuration of a traffic-balanced network is not an easy task. Most existing research efforts have been spent on optimal topological design [17, 18, 19, 20], based on assumptions such as known traffic distributions and a homogeneous environment. Although these algorithms make important contributions to topological design, they may not be applicable to real network environments due to unrealistic assumptions and very time-consuming algorithms. A systematic method [49] should be applied in topology design, which concerns some basic parameters. The key parameters are the number of switches, the number of ports per switch and the bandwidth of each port, the required number of host computers, the estimated traffic distribution, and the average path length.

When designing a network topology, a primary consideration should be minimizing the diameter and the average path length: longer paths mean more network resources are consumed per packet sent. In addition to minimizing average path length, there should be no obvious bottleneck or small bisection in the network. (Bisection is defined as the minimum number of channels that must be removed in order to cut a network into two disjoint subnetworks with an equal number of switches in each subnetwork.) Network planners need to be responsive to the demands of client/server traffic. Server contention may become a main bottleneck because many clients communicate with the server, so a key part of scaling network performance is to eliminate bottlenecks to servers; based on a server's throughput capacity, dedicated bandwidth is needed to relieve the server bottleneck. Traffic locality is an important traffic characteristic in LANs, and we should maximize traffic locality to benefit intra-switch traffic. If a significant fraction of traffic flows among a set of end computers (i.e., a workgroup), we should consider placing them on the same switch if possible, or on two switches connected by at most one hop.

To provide good network performance, a key point is to balance traffic flow among channels. It is important to know the real or estimated traffic distribution. For a particular topology, given the routing algorithm and an estimated traffic distribution, the traffic flow on each channel and the maximum attainable aggregate flow can be calculated with a multi-commodity flow solver. This solution will also identify bottleneck channels; a simple sketch of such a per-channel flow calculation is given below.
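Under the simplifying assumption of deterministic (single-path) routing, per-channel flow can be accumulated directly from the routing paths and the traffic matrix, without a full multi-commodity flow solver. The sketch below makes that assumption explicit; the input formats and the 95% threshold are illustrative choices.

from collections import defaultdict

def channel_flows(routes, traffic):
    """Accumulate traffic on each unidirectional channel.
    routes:  dict (src_switch, dst_switch) -> list of switches on the path.
    traffic: dict (src_switch, dst_switch) -> offered load in bytes/s.
    Assumes deterministic routing (one path per pair)."""
    flow = defaultdict(float)
    for (src, dst), rate in traffic.items():
        for u, v in zip(routes[(src, dst)], routes[(src, dst)][1:]):
            flow[(u, v)] += rate
    return flow

def bottleneck_channels(flow, capacity, threshold=0.95):
    """Report channels whose utilization exceeds the threshold (assumed 95%)."""
    return [(ch, f / capacity[ch]) for ch, f in flow.items()
            if f / capacity[ch] > threshold]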
After a network is operational, network tuning may be needed to accommodate network changes and improve network performance. Many things may change in a network; however, network reconfiguration should be incremental and should not involve major changes if possible. Therefore, one practical question is how to make a minor change to a network in order to obtain a significant improvement in performance. The first step is to diagnose what causes the problem. If the real traffic distribution is much different from the estimated traffic distribution, the topology and routing algorithm may need to be redesigned. If the average path length of routing is much longer than the average path length of the base topology, chances are that experimenting with some additional routing strategies might improve the performance significantly. If neither the routing nor the base topology is the problem, additional resources may need to be added. If there are spare ports on any switches, additional channels might be added to yield a performance increase. In general, additional channels should be added between switches that are a maximum distance apart (taking some long paths and making them shorter), or between switches that have few channels attached already.

If adding one or more channels is not possible or does not provide adequate performance improvement, one or more switches can be added. To add a switch to an existing network, it may be necessary to connect a number of channels to switches that are as widely spaced (with respect to path length) as possible. We may also consider redistributing some end nodes to the new switch. When adding channels or switches, the routing should be recalculated from scratch, because such a change has a profound impact on path lengths and potential deadlock in the system. Since most of these switches have configurable routing tables, this yields a practical solution.

2.4.3 Other Issues

Topology design is further complicated by topological constraints (such as reliability or physical layout constraints) upon the network. To address reliability, a traditional method is to make more than one switch-disjoint path available between each pair of sources and destinations.

It is clear that efficient multicast communication support is critical to the performance of many applications. A recent paper [50] proposed heuristic multicast algorithms in switched networks, assuming the underlying routing algorithm is up*/down* routing. While efficient multicast communication is becoming very important and is receiving much attention, multiple simultaneous multicasts and the effect of background traffic should be considered as well. Since the performance of multicast algorithms varies across platforms, a practical approach is to consider some critical machine-specific parameters in the design of platform-independent algorithms, as proposed in [51].

In large-scale networks, the probability of having a faulty channel or switch is relatively high. An intuitive solution is to simply shut down the system and replace the faulty elements. In addition to simple solutions like redundancy, a practical and useful scheme is needed to automatically detect faulty elements, handle dynamic faults, properly deal with in-transit messages, isolate faulty elements for live replacement, and avoid routing deadlock [1].

Flow control determines the number of packets that can be injected into the network in order to avoid congestion. As indicated in Figure 2.3, without proper flow control the network throughput will decrease after reaching its maximum. In addition to local flow control, such as link-level back-pressure in cut-through switching, global flow control (e.g., end-to-end flow control) is also needed to limit the total number of packets that can be handled by the network simultaneously. If the flow control mechanism works properly, further input to the network should be stopped well before the network is over-saturated. A minimal sketch of such an end-to-end limit is shown below.
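One simple way to realize an end-to-end limit is a per-source window on unacknowledged packets; this is a generic sketch under that assumption, not a mechanism taken from the switches discussed here, and the window size is an arbitrary illustrative value.

class EndToEndWindow:
    """Limit the number of unacknowledged packets a source may have in flight."""

    def __init__(self, window=8):
        self.window = window       # assumed, illustrative window size
        self.outstanding = 0

    def can_send(self):
        # Hold further input well before the network saturates.
        return self.outstanding < self.window

    def on_send(self):
        assert self.can_send()
        self.outstanding += 1

    def on_ack(self):
        # An acknowledgment from the destination frees one window slot.
        self.outstanding = max(0, self.outstanding - 1)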
2.5 Traffic Characteristics and Measurement

To design an appropriate network, we should understand the traffic characteristics in LANs. LAN traffic analysis can be done by collecting LAN traffic, and some common characteristics have been observed in previous work [52, 53]. In the following, we summarize the most important ones.

• Client/server traffic: every server provides services to a number of clients, and there may be more than one server in a LAN. Therefore, messages generated by a node may not be uniformly destined to the other nodes, and messages received by a node may not be uniformly originated from the other nodes.

• Workgroups: the existence of workgroups means that a large portion of traffic is localized among relatively small workgroups of nodes. Such a feature is called 'traffic locality'.

• Bursty traffic: burstiness may cause congestion within a network, resulting in poor overall network performance.

Network planners need to be responsive to the demands of client/server traffic, the locality of traffic, and the challenge of traffic burstiness. A server usually carries much more traffic than a client, so dedicated bandwidth is needed to relieve server bottlenecks. Traffic within a workgroup is usually very high, so nodes within each workgroup should be placed on the same switch (if possible) to maximize traffic locality. To deal with burstiness, we have to compromise between the average rate and the peak rate: if a network is designed based on the peak rate, it may incur excessive cost; on the other hand, if the design considers only the average rate, the network may be under-engineered. How to balance peak rate and average rate is still an open issue.

Knowledge of the traffic distribution is very helpful for topology design and tuning. There are two methods to obtain the traffic distribution in a network. Before a network is operational, the only method is based on a good understanding of the network applications and the communication demands of the future network; for example, we may be able to identify workgroups, servers, and their clients. However, in general, the traffic distribution is hard to estimate accurately a priori. Thus, network tuning and reconfiguration may be needed to improve network performance after the network is operational. After the network is operational, the traffic distribution can be measured by monitoring the total number of transmitted packets over the network, the size of each packet, and the source and destination information associated with each packet (e.g., via SNMP management [54]). Note that the measured traffic rate should consider both the average rate and the bursty rate.

Many LAN switches support standards-based SNMP management. For example, both HP's AdvanceStack Assistant [55] and 3Com's Transcend Enterprise Manager [56] provide SNMP-based network management. An SNMP agent is a program running on the managed device that collects information about device operations.
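The per-packet measurement just described can be reduced to a source-destination traffic matrix for use in topology design and tuning. The record format below is an assumption for illustration, not an SNMP data structure; real deployments would obtain the underlying counters through the switch's management agent.

from collections import defaultdict

def build_traffic_matrix(packet_records, interval_seconds):
    """Estimate a traffic distribution matrix (bytes/s) from measured packets.
    packet_records: iterable of (src_node, dst_node, size_in_bytes) tuples
    collected over interval_seconds (an assumed record format)."""
    totals = defaultdict(float)
    for src, dst, size in packet_records:
        totals[(src, dst)] += size
    return {pair: total / interval_seconds for pair, total in totals.items()}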
For example, if the managed object is a switch, the agent can collect information about the network traffic passing through each port within the switch and about the behavior of the switch itself under different load conditions. The SNMP agent maintains a database called the Management Information Base (MIB), which it uses to track and systematically update this data.

2.6 Performance Metrics and Evaluation

From an application program's point of view, the most important performance metric of a network is communication latency, the time interval from the instant a send command is issued until the instant the message is completely received by the recipient. Communication latency consists of network latency (including the blocking time due to unavailable output channels) and software latency, the software overhead introduced by the sender and receiver.

From the network point of view, an important metric is network throughput, defined as the total amount of traffic transmitted in the network per unit of time. The sustained network throughput depends on the communication pattern, the message size distribution, and the rate at which each node is able to inject messages into the network. Due to network contention and output contention, the maximum sustained network throughput is usually much smaller than the maximum theoretical network throughput.

Fairness is also an important factor in network design, although it is often ignored [57]. Without some global fairness mechanism, switched networks can be naturally unfair, servicing some injection ports better than others, especially when the network workload is heavy. For example, consider the simple network shown in Figure 2.2, where the center switch 0 is connected to 10 nodes and the two end switches 1 and 2 are connected to 11 nodes each. Assume all nodes inject as fast as they can, the destinations are uniformly distributed, and the center switch 0 is fair to all of its input ports. From our simulation, we observe that each node on switch 0 injects about 11 times as much traffic as a node on switch 1 or 2. The aggregate throughput is only 9% of the maximum throughput, and it reflects almost entirely traffic originating at the central switch. The reason is that the center switch treats the traffic from the 11 nodes on each end switch in the same way as it treats the traffic from each individual node on the center switch.

[Figure 2.2: A network with three switches and two channels]

After we restrict the injection rate of each node on the center switch, the network behaves more fairly and the aggregate throughput is higher. Fairness can be significantly impacted by the topology, routing, and traffic distribution in a network. For instance, a fully connected topology of three switches (i.e., a triangle) under uniform traffic shows no fairness problems of the sort exhibited by the network in Figure 2.2.

Besides message latency, network throughput, and fairness, there are other metrics such as reliability and fault tolerance; however, we will not address them in this dissertation. For a given network workload, the performance of a switched network is best represented by its latency-throughput curve, as shown in Figure 2.3. The figure shows that the network throughput can reach up to 19% of the network's maximum throughput, and that the average network latency increases as the throughput increases. However, when more messages are injected into the network after the network has reached its maximum sustainable throughput, the network throughput will decrease while the average network latency continues to increase, creating the so-called over-saturated or unstable region. In such a situation, fairness may become a major concern; some ports will essentially starve while others continue to inject more traffic. In such a case, the aggregate throughput may not be enough to indicate the actual network performance. Thus, the fairness ratio is introduced as the ratio of the maximum traffic generated by any node to the minimum traffic generated by any node. When the fairness ratio is over 2, the network is considered to be in an unfair situation, and we mark such a performance point with a shaded rectangle in Figure 2.3.
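The fairness ratio just defined is straightforward to compute from per-node injected traffic counts; a minimal sketch under that definition follows (the sample numbers are purely illustrative).

def fairness_ratio(injected_bytes_per_node):
    """Ratio of the maximum to the minimum traffic generated by any node.
    A ratio above 2 marks the operating point as unfair, per the definition above.
    injected_bytes_per_node: dict mapping node id -> injected traffic."""
    values = list(injected_bytes_per_node.values())
    return max(values) / min(values)

# Illustrative numbers: a center-switch node injecting far more than end-switch
# nodes yields a ratio well above 2, so the operating point is flagged as unfair.
is_unfair = fairness_ratio({"n0": 1100.0, "n1": 100.0, "n2": 95.0}) > 2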
[Figure 2.3: An example of a latency/throughput performance curve: average latency versus network throughput (bytes per time unit) under uniform traffic and fixed-size messages.]

Both analytic models and simulation results can be used to evaluate network performance. By assuming knowledge of the network workload and applying a multi-commodity flow model and queuing theory, a good analytic model may indicate message latency and bottleneck channels and obtain optimal or sub-optimal solutions for network design. However, due to the dynamic nature of networking environments, analytic modeling is unlikely to give practical insight into network behavior and performance, so simulation experiments should be conducted to evaluate the performance. But simulation results cannot be trusted blindly either. For example, most network simulation studies do not consider software latency and assume an infinite input source, which may lead researchers to investigate situations that may never (or very rarely) occur in an actual network. Even worse, most of the models used to characterize the network workload are far from realistic. Trace-driven or execution-driven simulation based on real applications should be considered [58], and a set of benchmark applications covering different communication patterns has to be identified.

Chapter 3

Performance Evaluation of Switching Techniques

Many researchers have studied performance evaluation of interconnection networks. Store-and-forward packet-switching networks were described as an M-channel, N-node model, and average packet latency was analytically studied [59, 60]. Performance evaluation of cut-through switching began with the work in [2], which presented a mean-value analysis of end-to-end latency under deterministic routing for virtual cut-through switching, derived from the M/M/1 model for packet-switched networks. Extensions of this M/M/1 analysis can be found in [61, 62, 63, 26]. In [64], an asymptotic performance analysis was proposed for packet-switching multistage interconnection networks. The delay analysis of k-ary n-cube direct networks was studied in [65, 66] for wormhole switching. However, these analytical models do not consider arbitrary topologies.

Both analytical models and simulation experiments have been presented in the literature for the purpose of performance evaluation. Effective analytical modeling provides a cost-effective way to predict the behavior of large networks and helps balance the cost-performance tradeoff in switch-based networks.
However, analytical models usually require simplifying assumptions about the network configuration and traffic patterns for the sake of tractability. Simulation studies can characterize the effects of these assumptions and give insight into network behavior, especially when analytical models cannot accurately estimate real performance.

This chapter compares the performance of switch-based networks for different switching modes and buffer architectures, using both analytical models and simulation experiments. Analytical models based on M/M/1 queuing theory are presented for average packet delay analysis under certain assumptions such as deterministic routing and infinite buffer size. For cut-through switching, a packet can cut through a switch if its outgoing channel is idle, which is reflected in the analytical model by deducting the time saved by cutting through. The analytical models have been verified by simulation results. As mentioned earlier, analytical modeling needs simplifying assumptions and may not give accurate estimates in all network environments; therefore, simulation experiments are conducted to evaluate performance under network environments that can hardly be estimated accurately by the analytical models.

3.1 Analytical Model

Our goal is to develop an analytical model for average packet latency in a switch-based network. A basic queuing model of computer networks decomposes the interconnection network into a collection of independent channels and queues. The packet delay in store-and-forward packet-switching networks has been expressed using the M/M/1 model in terms of source-destination paths through the network [60]. An arriving packet is forwarded in a switch based on the switching mode. Topologies, routing algorithms, and traffic distributions differ in how they affect the amount of traffic on each channel and therefore the delay waiting for the channel.

3.1.1 Delay Analysis

The average packet latency T for a computer-communication network has been defined as follows in terms of the average source-to-destination packet delay [59]:

    T = \frac{1}{\gamma} \sum_{i,j} \gamma_{ij} Z_{ij}    (3.1)

where Z_{ij} is the average packet delay (queuing and transmission) from source i to destination j, γ_{ij} is the average packet rate from i to j, and γ = Σ γ_{ij}.

By applying Little's result, which relates the average number in the system to the average arrival rate and the average service time, a useful expression was given in [60, 67]:

    T = \frac{1}{\gamma} \sum_{k} \lambda_k T_k    (3.2)

where λ_k is the average packet rate on channel k and T_k is the average queuing plus transmission delay on channel k. Given this useful but very general expression, we may not be able to evaluate λ_k and T_k in general. However, as indicated in [17], if we make the following assumptions, the network channels can be considered as independent M/M/1 queues by Jackson's theorem [68]:

1. Poisson packet arrival distribution
2. exponential packet length distribution
3. infinite buffer space
4. fixed (deterministic) routing
5. error-free channels
6. no switch delay
7. independence between interarrival times and transmission times on each channel

Therefore, the average delay T_k on channel k can be evaluated as follows based on M/M/1 queuing theory:

    T_k = \frac{1}{\mu C_k - \lambda_k}    (3.3)

where 1/μ is the average packet length (bytes/packet), C_k is the capacity of channel k (bytes/s), and λ_k is the average packet rate on channel k (packets/s). The average rates λ_k can be computed from the routing tables and the traffic distribution matrix; a small computational sketch follows.
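The sketch below evaluates Equations (3.2) and (3.3) numerically: given per-channel packet rates and capacities, it computes each channel's M/M/1 delay and the traffic-weighted network average. The input dictionaries and parameter names are illustrative assumptions.

def channel_delay(capacity, packet_rate, mean_packet_len):
    """Equation (3.3): M/M/1 queuing-plus-transmission delay on one channel.
    capacity in bytes/s, packet_rate in packets/s, mean_packet_len in bytes."""
    mu = 1.0 / mean_packet_len               # packets per byte
    service_rate = mu * capacity             # packets/s the channel can serve
    assert packet_rate < service_rate, "channel is saturated"
    return 1.0 / (service_rate - packet_rate)

def average_packet_delay(gamma, channel_rates, channel_capacities, mean_packet_len):
    """Equation (3.2): T = (1/gamma) * sum_k lambda_k * T_k.
    gamma: total end-to-end packet rate (sum of gamma_ij over all pairs).
    channel_rates: dict channel -> lambda_k (packets/s), obtained by summing
    the traffic matrix over the routing paths that cross each channel."""
    total = sum(rate * channel_delay(channel_capacities[ch], rate, mean_packet_len)
                for ch, rate in channel_rates.items())
    return total / gamma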
By plugging Equation (3.3) into (3.2) and letting f_k = λ_k/μ, i.e., the average bit rate on channel k (bytes/s), we have the following equation for the average packet delay T:

    T = \frac{1}{\gamma} \sum_k \frac{f_k}{C_k - f_k}    (3.4)

Considering propagation delay and switch processing delay, a more accurate result can be expressed as:

    T = \frac{1}{\gamma} \sum_k \lambda_k \left( T_k + P_k + S_k \right)    (3.5)

where P_k is the propagation delay (seconds) on channel k and S_k is the processing time (seconds/packet) in the switch at which channel k terminates.

Although more accurate expressions may be obtained by considering more general distributions, the validity of the above assumptions and of the M/M/1 delay analysis has been tested and verified through simulations and measurements by many authors in a variety of applications [69, 59]. For packet-switching networks, M/M/1 theory has been applied to delay analysis, capacity assignment, flow assignment, and topology design [60]; an application example is ARPANET design and performance evaluation. Furthermore, the solutions of some of the design problems appear to be rather insensitive to the introduction of additional details in the delay formula [17]. In this chapter, without loss of generality, we limit our consideration to the delay formula in (3.4), which means that we do not consider propagation delay or switch processing time.

3.1.2 Analytical Model for Store-and-Forward Switching

Given the network topology, the fixed routing algorithm, and the traffic distribution, we are able to calculate the traffic rate f_k on each channel and the total traffic rate γ, and therefore the average delay in (3.4). However, (3.4) was proposed for networks where only one processor connects to a switch/router. In switch-based networks, a packet has to compete with other packets for the same destination node even in the final destination switch, so the packet delay should include this waiting time, resulting in a modified packet delay formula:

    T_s = \frac{1}{\gamma} \left( \sum_k \frac{f_k}{C_k - f_k} + \sum_v \frac{f'_v}{C'_v - f'_v} \right)    (3.6)

where C'_v and f'_v are the port bandwidth and the average traffic rate to node v, and T_s denotes the delay in store-and-forward switching.

Under uniform traffic, Equation (3.6) can be further simplified by using the average channel utilization and the average routing path length. To make it even simpler, we consider a homogeneous environment, where the bandwidth of all switch ports is the same (represented by C bytes/s). For the purpose of analysis, we introduce the following notation:

• N: total number of nodes in the network.
• M: total number of bidirectional channels in the network.
• \bar{\ell}: mean packet length (bytes/packet), exponentially distributed.
• p_{u,v}: routing path length from node u to node v.
• \bar{p}: average path length of packet routing.
• λ: packet generation rate (packets/s), Poisson distributed.
• ρ_k: utilization of the k-th channel, ρ_k = f_k / C_k.
• ρ: average channel utilization.
• ξ_v: utilization of node v.
• ξ: average utilization of node bandwidth.
• T_s: average packet delay in store-and-forward switching.
• T_c: average packet delay in cut-through switching.

Since the traffic distribution is uniform, we can calculate the average path length \bar{p} as follows:
    \bar{p} = \frac{\sum_u \sum_v p_{u,v}}{\mbox{number of node pairs}}    (3.7)

The average channel utilization under uniform traffic is expressed as:

    \rho = \frac{N \lambda \bar{p} \bar{\ell}}{2 M C}    (3.8)

The traffic arrival rate at each node should also be uniform, which gives the average node bandwidth utilization:

    \xi = \frac{\lambda \bar{\ell}}{C}    (3.9)

Therefore, given the average path length \bar{p}, the average channel utilization ρ, the average packet length \bar{\ell}, and the average node bandwidth utilization ξ, the average packet latency is

    T_s = \frac{\bar{p}\,\bar{\ell}}{C(1 - \rho)} + \frac{\bar{\ell}}{C(1 - \xi)}    (3.10)

Comparing (3.6) and (3.10), the former is more general and can handle different network environments, while the latter is easier to calculate. Note that (3.10) can only be applied to a homogeneous network, where traffic is uniform and channels are evenly utilized; it may not accurately estimate latency if traffic is non-uniform, or if the selected topology and routing algorithm cannot utilize channels evenly.

3.1.3 Analytical Model for Virtual Cut-Through Switching

For store-and-forward switching, we have analyzed the average latency using M/M/1 queuing theory. For cut-through switching, an M/M/1 model is still applicable; however, we should deduct the packet buffering time whenever the packet cuts through an intermediate switch. Recall that we have assumed infinite buffer space in order to use M/M/1 queuing theory, which means the model describes virtual cut-through. Since the network channels are independent, a packet traveling h hops has h − 1 independent opportunities to cut through intermediate switches, resulting in a binomial distribution for the number of cut-throughs [17]. Based on this binomial distribution, analytical models can calculate the average latency for cut-through networks. From (3.6), we have

    T_c = \frac{1}{\gamma} \left[ \sum_k \frac{f_k}{C_k - f_k} - \sum_k P^c_k \alpha_k \lambda_k \frac{\bar{\ell} - e}{C_k} + \sum_v \frac{f'_v}{C'_v - f'_v} - \sum_v R^c_v \beta_v \lambda'_v \frac{\bar{\ell} - e}{C'_v} \right]    (3.11)

where P^c_k is the cut-through probability for a packet forwarded to channel k; R^c_v is the cut-through probability for a packet forwarded to node v; α_k is the fraction of f_k that has the opportunity to cut through the corresponding switch; β_v is the fraction of f'_v that has the opportunity to cut through the corresponding switch; λ_k = f_k/\bar{\ell} and λ'_v = f'_v/\bar{\ell}; and e is the header length of a packet (included in the packet). For the first hop, no packet has the opportunity to cut through to the next channel or node, which is why α_k and β_v are needed. From (3.10), we have

    T_c = \frac{\bar{p}\,\bar{\ell}}{C(1 - \rho)} - P^c (\bar{p} - 1) \frac{\bar{\ell} - e}{C} + \frac{\bar{\ell}}{C(1 - \xi)} - R^c \beta \frac{\bar{\ell} - e}{C}    (3.12)

where β is the average fraction of packets that have the opportunity to be delivered to a node in a cut-through way; this excludes packets whose source and destination switches are the same. Here C is the channel bandwidth (bytes/s). Since we assume fixed routing, we have P^c_k = 1 − ρ_k and R^c_v = 1 − ξ_v.

3.1.4 Cut-Through Probability

The cut-through probability depends on the channel utilization and the number of routing options. For fixed routing, P^c_k can be simply expressed as 1 − ρ_k. For adaptive routing, it depends on the number of routing options available to a packet. Suppose the average channel utilization is ρ and P_i is the probability that a packet has i routing options; the cut-through probability can then be calculated as follows:

    P^c = (1 - \rho) P_1 + (1 - \rho^2) P_2 + (1 - \rho^3) P_3 + \cdots    (3.13)

We have calculated the cut-through probability only under fixed routing algorithms. Under an adaptive routing algorithm, we may be able to derive an expression for the average cut-through probability in some symmetric topologies with the assumption of a uniform traffic distribution. A numerical sketch of the uniform-traffic expressions above is given below.
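As a small numerical illustration of Equations (3.10) and (3.13), the following sketch plugs in arbitrary example values; the specific numbers are illustrative assumptions, not results from this dissertation.

def store_and_forward_latency(p_bar, l_bar, C, rho, xi):
    """Equation (3.10): average store-and-forward delay under uniform traffic.
    p_bar: average path length (hops); l_bar: mean packet length (bytes);
    C: port/channel bandwidth (bytes per time unit); rho, xi: average channel
    and node-port utilizations from Equations (3.8) and (3.9)."""
    return p_bar * l_bar / (C * (1 - rho)) + l_bar / (C * (1 - xi))

def cut_through_probability(rho, option_probs):
    """Equation (3.13): average cut-through probability under adaptive routing.
    option_probs[i] is the probability that a packet has i+1 routing options."""
    return sum((1 - rho ** (i + 1)) * p for i, p in enumerate(option_probs))

# Illustrative numbers only: 100-byte packets, 1 byte/time-unit ports,
# average path length 2.2 hops, 40% channel and 30% port utilization.
T_s = store_and_forward_latency(p_bar=2.2, l_bar=100.0, C=1.0, rho=0.4, xi=0.3)
P_c = cut_through_probability(rho=0.4, option_probs=[0.7, 0.2, 0.1])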
As an example, cut-through probabilities for adaptive routing algorithms in 2D meshes were analyzed and compared under uniform traffic [26]. For switch-based networks, however, we may not be able to calculate the average cut-through probability even under uniform traffic, because the topology is very likely to be asymmetric and adaptive routing may offer different numbers of routing options at different switches and channels.

Under adaptive routing, we are still able to calculate the routing options, traffic rate, and cut-through probability for each individual source-destination pair. However, these calculations may not capture the accurate channel utilization and cut-through probability, because the adaptive routing options are selected dynamically based on the status of each channel. Therefore, we do not present analytical models for adaptive routing in this dissertation.

3.2 Simulation Results

To verify the analytical models and evaluate network performance, simulation experiments have been conducted based on the simulation model presented in Chapter 6.

3.2.1 Simulation Environment

The simulation environment is the same as the one presented in Chapter 6. Although the simulation model can handle many network parameters and their combinations, we consider only a subset of these parameters in this chapter. Unless otherwise specified, the packet (message) size is exponentially distributed with a mean of 100 bytes, including a 3-byte packet header. The buffer size is 2000 bytes per port in store-and-forward and virtual cut-through switches and 60 bytes per port in wormhole switches. For simplicity, all ports have the same bandwidth, which is 1 byte per time unit. Due to the arbitrary topologies in switch-based networks, we consider the four networks shown in Figure 3.1. In each network there are 10 switches, and each switch is assumed to have 12 bidirectional ports. There are 75 nodes in each network, distributed as evenly as possible among the switches. Under client/server traffic, 4 of the 75 nodes are considered servers. A server communicates uniformly with all the other nodes. For a client, 50% of its traffic goes uniformly to the 4 servers and the rest of its traffic goes uniformly to all the other clients.

As we will present in Chapter 6, a finite source model is applied in the simulations. For each source node there is only one outstanding message (packet), and the next message is not generated until the current message has completely left the source node. Note that in this dissertation we do not consider packetization, so packets and messages are used interchangeably. Theoretically, each node is supposed to have the same traffic generation rate. However, when the traffic load is heavy, traffic congestion and buffer overflow may reduce the traffic generation rate of certain nodes via back-pressure flow control. This phenomenon may cause the offered load to differ from the theoretical calculation, resulting in inaccurate mathematical analysis.

3.2.2 Verification of Analytical Models

The analytical expressions for average packet latency estimate the network performance, but it is difficult to capture inter-switch dependencies mathematically.

[Figure 3.1: Four networks for verification and evaluation]

To verify the analytical expressions, simulation experiments are conducted to compare the simulation results and the analytical calculations under both uniform traffic and non-uniform traffic.
Since the analytical models assume fixed routing, we use the deterministic ETR routing (which will be presented in Chapter 4) in the simulations to verify the analytical models.

Uniform Traffic

Under uniform traffic, traffic is supposed to be evenly distributed among channels. Therefore, Equations (3.10) and (3.12) may be simple and reasonable estimations based on the average channel utilization and the average cut-through probability. Equations (3.6) and (3.11) estimate the average latency by considering each individual channel's utilization and cut-through probability, which should be more accurate but is more complex than Equations (3.10) and (3.12).

As shown in Figures 3.2 and 3.3, the analytical expressions closely match the simulation results under light to moderate workload, although the simulation results are slightly higher than the analytical calculations. It is not surprising to observe that Equations (3.10) and (3.12) underestimate latency relative to Equations (3.6) and (3.11), because the former two equations use averaged data. However, when the offered network load is equal to or greater than the load marked 'high', the analytical models cannot accurately estimate the average latency.

[Figure 3.2: Average latency of store-and-forward switching under uniform traffic. Each panel plots average latency against offered load (low to high) for topologies A through D, comparing the analysis of Eq. (3.10), the analysis of Eq. (3.6), and simulation.]

For all plots in this chapter, the offered load marked 'high' means that, at the same or higher offered load, one or more channels have a channel utilization over 95% (or even over 100%). Under such high workloads, some channels are over-utilized, so it is hard to capture the real traffic rate on those channels. Moreover, since we use back-pressure flow control, traffic congestion on overloaded channels may further influence traffic on other channels and eventually reduce the real traffic generation rate at the source nodes. This means that the offered load used for the analysis is different from the real network load, which is difficult to know mathematically.
[Figure 3.3: Average latency of virtual cut-through switching under uniform traffic. Each panel plots average latency against offered load (low to high) for topologies A through D, comparing the analysis of Eq. (3.12), the analysis of Eq. (3.11), and simulation.]

Non-Uniform Traffic

Under non-uniform traffic, i.e., the client/server traffic described in Chapter 3.2.1, Equations (3.10) and (3.12) are no longer valid because traffic is not uniformly distributed. As shown in Figures 3.4 and 3.5, Equations (3.6) and (3.11) are used to estimate the average latency. Again, it is observed that the analytical models closely match the simulation results under light to moderate workload. When the offered load is equal to or over the point marked 'high', the analytical models cannot accurately estimate the latency, although they still capture the performance trend.

[Figure 3.4: Average latency of store-and-forward switching under client/server traffic. Each panel plots average latency against offered load (low to high) for topologies A through D, comparing the analysis of Eq. (3.6) and simulation.]

Recall that when the offered load is equal to or over 'high', some channels have a utilization over 95% or even over 100%, which means that those channels are over-saturated. The traffic congestion may influence traffic on other channels and the traffic generation rate. Unfortunately, such traffic inter-dependency in a network is very difficult to capture mathematically. In short, if the offered load does not cause high utilization on any channel, our analytical expressions closely match the simulation results; otherwise, the analytical results may not be close to the simulation results, but they still show the same performance trend.

[Figure 3.5: Average latency of virtual cut-through switching under client/server traffic. Each panel plots average latency against offered load (low to high) for topologies A through D, comparing the analysis of Eq. (3.11) and simulation.]

3.2.3 Performance Evaluation

To understand the effect of switching modes, buffer management, and other network parameters on network performance, simulation results are measured to make comprehensive comparisons.
Comparison of Switching Schemes

There are three different switching modes in commercial switches: store-and-forward, virtual cut-through, and wormhole. Figures 3.6 and 3.7 demonstrate their performance tradeoffs under both uniform and client/server traffic. Under light to moderate workload, virtual cut-through and wormhole switches provide low packet latency because they do not need to buffer the whole packet before forwarding. However, wormhole switches do not provide high throughput under heavy workload, because we use a small buffer space for wormhole switches, which is not big enough to store most packets. If wormhole switches had bigger buffer space and still allowed pipelined forwarding, they should achieve higher throughput. Store-and-forward switches cause high latency even under light workload due to the overhead of buffering packets. On the other hand, store-and-forward switching provides as high a throughput as virtual cut-through switching, because both have large buffers to hold blocked packets.

Given the benefits of the pipelined fashion of cut-through switching and the large buffer space of store-and-forward switching, a promising switch design should integrate them and dynamically change the switching mode for each port. Some recent router/switch designs have supported hybrid switching modes. As an example, the Cray T3D router [70] implements wormhole switching with two pairs of virtual channels. Each virtual channel has enough buffer space to store a small packet, while longer packets spread across multiple routers (channels); therefore, a small packet will not stall in a channel. The IBM SP2 switch [71] reduces network congestion through a shared output buffer; its central queue dynamically buffers blocked wormhole packets in 8-flit chunks.

[Figure 3.6: Comparison of switching schemes under uniform traffic. Each panel plots average latency against normalized network throughput for topologies A through D, comparing store-and-forward, virtual cut-through, and wormhole switching.]
[Figure 3.7: Comparison of switching schemes under client/server traffic. Each panel plots average latency against normalized network throughput for topologies A through D, comparing store-and-forward, virtual cut-through, and wormhole switching.]

Comparison of FIFO and Shared Buffering

It is well known that FIFO input buffering can cause poor performance due to head-of-line blocking. In Figures 3.8 and 3.9, we use simulation results to compare the performance of FIFO input buffers and shared buffers under store-and-forward switching. For shared buffering, each input port has a buffer to temporarily hold the incoming packet, and each switch has a 2000-byte shared buffer pool. It is not surprising to observe that shared buffering provides lower latency and higher network throughput than FIFO input buffering, even though the shared buffer pool is smaller than the sum of all input FIFO buffers.

[Figure 3.8: Comparison of FIFO and shared buffering under uniform traffic. Each panel plots average latency against normalized network throughput for topologies A through D.]

[Figure 3.9: Comparison of FIFO and shared buffering under client/server traffic. Each panel plots average latency against normalized network throughput for topologies A through D.]

Effect of Packet Size

To understand the effect of packet size on network performance, we also measure the performance when the mean packet size is 500 bytes. Under the same packet generation rate (bytes/s), the performance for different packet sizes is compared in Figure 3.10. It shows that the performance trends are similar; the higher latency for the larger packet size is expected.

[Figure 3.10: Effect of packet size. The two panels plot average latency against normalized network throughput for message sizes of 100 bytes and 500 bytes.]
Adaptive Routing

All of the above performance data are measured under a deterministic routing algorithm. Adaptive routing algorithms have also been proposed to provide routing flexibility and to take advantage of multiple routing paths. We have implemented an adaptive routing algorithm (ATR), which will be presented in Chapter 4, in our simulations. Under adaptive routing, we again measure performance and compare the different switching modes and buffer management schemes. As shown in Figures 3.11, 3.12, 3.13, and 3.14, the same performance trends observed under deterministic routing can also be observed under adaptive routing.

[Figure 3.11: Comparison of switching schemes under uniform traffic. Panels: topologies A and C with adaptive routing; average latency versus normalized network throughput.]

[Figure 3.12: Comparison of switching schemes under client/server traffic. Panels: topologies A and C with adaptive routing; average latency versus normalized network throughput.]

[Figure 3.13: Comparison of buffer management under uniform traffic. Panels: topologies A and C with adaptive routing; average latency versus normalized network throughput.]

[Figure 3.14: Comparison of buffer management under client/server traffic. Panels: topologies A and C with adaptive routing; average latency versus normalized network throughput.]

Chapter 4

Deadlock-Free Routing in Arbitrary Switch-Based Networks

The main contributions of this chapter are two proposed routing algorithms and a comprehensive performance evaluation of three routing algorithms in arbitrary networks. Due to the arbitrary network topologies, it is not trivial to develop an efficient routing algorithm with low transmission latency and high network throughput. In this chapter, we consider wormhole switching, a special case of cut-through switching. A main difficulty for wormhole switched networks is to design a routing algorithm that avoids deadlock. To deal with this challenge, we propose two simple and efficient deadlock-free routing algorithms for arbitrary networks using wormhole switches. In order to avoid deadlock, we use two unidirectional Eulerian trails to maintain an order of channel dependency while still allowing reasonable routing. Shortcuts are added to the Eulerian trails to provide more and shorter routing paths. Heuristics are suggested in order to provide better performance.
4.1 Related Work

In a wormhole switch, based on the header information of an incoming packet, a routing algorithm selects an outgoing channel to deliver or forward the packet. A critical issue in designing an appropriate routing algorithm for wormhole switched networks is to avoid channel deadlock, because a message can hold some channels while waiting for another channel. A deadlock occurs when there is a cyclic dependency among occupied channels and requested channels. A well-known solution for deadlock avoidance was proposed in [72] based on restricting the routing algorithm; for example, the turn model proposed in [73] provides deadlock freedom by prohibiting certain turns. Some general methodologies [42, 43, 45, 74] have been proposed to allow deadlock-free routing in arbitrary networks. However, it is difficult to directly apply those general ideas to an arbitrary network; the examples given in those papers are all for regular topologies. There are other methods of preventing deadlock, such as dropping packets for which there is insufficient buffer space, or using virtual channels [4] to allow more routing freedom. We assume that these methods are unacceptable or impractical for the switch-based networks we consider.

For the purpose of evaluation and comparison, we consider two other existing routing algorithms which can be directly applied to arbitrary networks using wormhole switches. These two algorithms are the up*/down* routing (UDR) used in the DEC AN1 system [41] and the smart routing (SR) [46] proposed by HP researchers. In order to avoid deadlock, UDR and SR adopt the concept of deadlock prevention, which never allows the formation of a dependency cycle. Our proposed routing algorithm can be used in two ways: a simple algorithm called Eulerian trail routing (ETR) and a more elaborate algorithm called adaptive trail routing (ATR). Details of the two routing algorithms are described in Chapters 4.3 and 4.4. While ETR never allows a dependency cycle, ATR uses the philosophy in [42, 43], which allows the existence of cycles but provides a channel to escape from the cycles. Neither UDR nor SR considers virtual channels. For simplicity and practicality, we do not consider virtual channels in our routing algorithms either, but this is an interesting issue worthy of further study.

The UDR algorithm. A breadth-first spanning tree is constructed from a specified root. Each channel is assigned a direction based on the spanning tree, with 'up' meaning 'toward the root'; a tie is resolved by comparing the ids of the two end switches of a channel. With this assignment, the directed channels do not form loops. A legal route is defined as one that never uses a channel in the 'up' direction after it has used one in the 'down' direction. The UDR routing is easy to understand and implement, but it may concentrate traffic around the root switch and allow unnecessarily long routing paths. Thus, its performance depends greatly on the network topology and the selected spanning tree. Since all routing paths satisfying the up*/down* requirement are allowed in the routing table, many very long routing paths may exist and cause poor performance in many cases. We have modified the original algorithm in the following way: if there are multiple routing choices in a routing table entry and the shortest distance among these paths is h, longer routing paths, which have more than two hops and are of length 2h or greater, are discarded. The modified UDR is called MUDR; the filtering rule is sketched below.
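A minimal sketch of the MUDR filtering rule just stated, applied to one routing-table entry; the entry representation (a list of candidate paths) is an illustrative assumption.

def filter_mudr_paths(paths):
    """Keep the candidate paths of one routing-table entry, discarding paths
    that have more than two hops and are at least twice as long as the
    shortest choice. paths: list of routing paths, each a list of switch ids."""
    shortest = min(len(p) - 1 for p in paths)          # hop count of best path
    return [p for p in paths
            if not (len(p) - 1 > 2 and len(p) - 1 >= 2 * shortest)]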
A similar idea is used in Myrinet [37], where a spanning tree is built for routing. However, Myrinet uses source routing instead of adaptive routing based on table lookup.

The SR algorithm. Based on the network topology, the SR algorithm builds an explicit channel dependency graph and searches the graph for cycles. For each cycle, a dependency is broken so as to minimize a heuristic cost function. The procedure terminates when the channel dependency graph has no cycles; the routing is represented by the resulting channel dependency graph. SR can be used as adaptive routing (SRA) or deterministic routing (SRD). For SRA, an adaptive routing table, which allows multiple routing paths between switch pairs, is created in each switch. For SRD, a deterministic routing table, which provides only a single path between any two switches, is created in each switch. In this chapter, SR refers to both SRA and SRD. Smart routing does a good job of balancing channel utilization under uniform traffic. However, the traffic balancing, which depends on a linear programming solver, is the most time-consuming part of computing the routing tables. Such complexity may not yield a significant performance benefit in real networks, since non-uniform (e.g., client/server) traffic is very commonly observed.

Extensive simulation experiments have been conducted to evaluate the performance of up*/down*, ATR, and SR under different topologies and workloads [57]. It is observed in the simulations that a premature bottleneck channel is one main cause of poor network performance. SR is the only algorithm that puts significant effort and computation time into balancing channel traffic. Note that traffic balancing should be guided by the real traffic pattern in the network; SR assumes an unrealistic traffic pattern, uniform traffic, so it may not provide better performance under non-uniform (e.g., client/server) traffic. A good heuristic routing algorithm may provide good network performance under many network conditions. For example, the much simpler ATR algorithm offers performance as good as SR in most cases [57]. However, to achieve near-optimal performance, traffic balancing should be considered if routing table computation time is not critical.

4.2 Definitions and Eulerian Trail

As we mentioned earlier, regular network topologies do not have good incremental scalability due to their stringent interconnection constraints [1] and may not be able to use the original routing algorithms in the presence of faulty nodes. Thus, arbitrary network topologies with good network performance are highly desirable. In this chapter, we use the same network shown in Figure 1.1 to illustrate our proposed algorithms. The network of nine 8-port wormhole switches is shown again in Figure 4.1(a). The graph in Figure 4.1(b) is used to model this network, where each vertex corresponds to a switch and each edge corresponds to a channel connecting a switch pair. This graph representation is used to represent network topologies and to derive channel dependency graphs. All channels (edges) are assumed to be bidirectional, and multiple channels (edges) between a switch pair are allowed.

Our proposed algorithm [21, 75] is applicable to any network topology with an Eulerian trail. Its basic idea is to find two opposite unidirectional Eulerian trails to provide reasonable routing paths and control the order of channel dependency.
An Eulerian trail is a sequence of channels that visits each channel once and exactly once, so it can maintain the order of channel dependency. In order to maximize channel utilization and allow more and shorter routing paths, shortcuts may be added to the two unidirectional Eulerian trails; the two unidirectional trails with shortcuts are called adaptive trails.

To avoid deadlock, shortcuts have to be used in a restricted way (e.g., some shortcuts can only be used for messages coming from terminal ports), based on the channel dependencies along the adaptive trails. However, a dependency cycle is allowed in adaptive trails as long as there is an escape channel [42, 43] for that cycle. Based on the unidirectional Eulerian trails and adaptive trails, we have two routing algorithms: one is called Eulerian trail routing (ETR), based on the two unidirectional Eulerian trails, and the other is called adaptive trail routing (ATR), based on the two unidirectional adaptive trails. ETR is easier to implement, but it may cause longer routing paths and imbalanced channel utilization due to the channels' different locations in the trail. ATR is more complex, but it routes messages more evenly among all channels. For both the ETR and ATR algorithms, routing information is carried in a static routing table in each switch, which provides local routing information for that switch.

[Figure 4.1: An example of an arbitrary network topology: (a) a network of nine 8-port wormhole switches; (b) its graph model.]

4.2.1 Definitions

In order to simplify our presentation, we introduce the following definitions.

Definition 1 (Trail, path, and cycle). A trail is a finite sequence of edges (channels) of the form v_0 → v_1 → ⋯ → v_{m−1} → v_m in which all the edges are distinct. If the vertices v_0, v_1, ⋯, v_{m−1}, v_m are also distinct, it is called a path. When v_0 = v_m, it is called a cycle.

A graph G is said to be connected if every pair of its vertices is joined by a path. Obviously, all graphs considered in this thesis are connected. The degree of a vertex is defined as the number of edges connected to the vertex. For a pair of vertices u, v in G, d_G(u, v) denotes the length of a shortest path from u to v in G. The diameter of a graph is the maximum d_G(u, v).

Definition 2 (Eulerian trail and Eulerian graph). An Eulerian trail of a connected graph is a trail that contains all the edges of the graph. A graph is called an Eulerian graph if and only if it has an Eulerian trail. The necessary and sufficient condition is that all vertices have even degree or exactly two vertices have odd degree.

Definition 3 (Channel, shortcut, and index in a unidirectional Eulerian trail). Given a unidirectional Eulerian trail ET: v_0 → ⋯ → v_i → ⋯ → v_j → ⋯ → v_m in graph G, each vertex (switch) along the trail is given an index beginning from 0. In trail ET, the channel between any two subsequent vertices v_i and v_{i+1} is denoted c_{i,i+1} (i.e., v_i → v_{i+1}). Given an Eulerian trail ET, a shortcut s_{i,j} between two non-subsequent vertices v_i and v_j exists if j > i + 1, v_j ≠ v_{i+1}, and the channel v_i → v_j is in the edge set of G. A shortcut s_{i,i+1} exists between two subsequent vertices v_i and v_{i+1} if there are multiple channels between v_i and v_{i+1} in the network.
Any channel or shortcut corresponds to a unidirectional physical link in the network. If v_i and v_j represent the same switch in the network, we say v_i = v_j. A shortcut s_{p,q} ∈ C(ET) if and only if there is a k such that v_k = v_p and v_{k+1} = v_q. We use cs_{i,j} to represent either c_{i,j} or s_{i,j}, whichever is applicable. cs_{i,j} and cs_{p,q} represent the same physical link if v_i = v_p, v_j = v_q, and there is no multiple channel between the two switches.

4.2.2 Eulerian Trail

Our routing algorithm can be applied to any network that has an Eulerian trail. The algorithm to find an Eulerian trail is well known; we adopt the same algorithm as in the book [76]. A graph may have more than one Eulerian trail, and our routing algorithm can be based on any of them. A heuristic is suggested in Section 4.5.2 to select an Eulerian trail that may provide good performance. In this section, we assume a randomly chosen Eulerian trail.

Since bidirectional channels are used to interconnect switches, an Eulerian trail is considered as two opposite unidirectional Eulerian trails, called ET1 and ET2, respectively. Unless otherwise specified, any channel or trail mentioned in the proposed routing algorithms means a unidirectional channel or trail. As illustrated in Figure 4.3, one Eulerian trail, ET1, for Figure 4.1(b) is 0 → 5 → 4 → 3 → 8 → 7 → 1 -e'→ 5 → 2 → 3 → 1 -e→ 5 → 6 → 0 → 1 → 8. We use e and e' to distinguish the two physical links between switches 1 and 5. Another Eulerian trail, ET2, is obtained by reading the above trail in the reverse order: 8 → 1 → 0 → 6 → 5 -e→ 1 → 3 → 2 → 5 -e'→ 1 → 7 → 8 → 3 → 4 → 5 → 0. If a channel is in ET1, it cannot be a channel in ET2.

4.3 Eulerian-Trail Routing Algorithm (ETR)

Given ET1 and ET2, an easy approach is to route messages along the two trails. To allow more and shorter routing paths, the two Eulerian trails are concatenated into a single unidirectional trail. This routing method is called Eulerian trail routing (ETR), where messages are routed on the concatenated unidirectional trail. Given ET1 and ET2, we can easily concatenate them into a single unidirectional trail as follows:
0 → 5 → 4 → 3 → 8 → 7 → 1 -e'→ 5 → 2 → 3 → 1 -e→ 5 → 6 → 0 → 1 → 8 → 1 → 0 → 6 → 5 -e→ 1 → 3 → 2 → 5 -e'→ 1 → 7 → 8 → 3 → 4 → 5 → 0.

Message routing follows the unidirectional trail. For example, a message from switch 4 to switch 8 will take the path 4 → 3 → 8. The ETR routing algorithm takes reasonable paths that follow the channel dependency order in the concatenated trail. Note that a message does not have to go through every channel, and it never passes the same switch more than once. For example, consider a message from 0 to 6. It will take 0 → 5 → 6 instead of 0 → 5 → 4 → 3 → 8 → 7 → 1 -e'→ 5 → 2 → 3 → 1 -e→ 5 → 6, because the latter routing path passes switch 5 multiple times, increasing path length and wasting channel bandwidth.

The ETR can be used as deterministic routing or adaptive routing. If used as deterministic routing, only the shortest path in the trail is allowed. If used as adaptive routing, multiple routing choices can be kept; the heuristics suggested in Section 4.5.1 can be applied to avoid longer routing paths. Different Eulerian trails generate different routing tables, which may provide different performance, and a heuristic method is suggested in Section 4.5.2 to select an Eulerian trail providing better performance.

Like the UDR routing algorithm, ETR routing is very easy to understand and very simple to implement. No shortcut is added to the Eulerian trail and no channel dependency cycle exists. Unlike up*/down* routing, however, ETR uses an Eulerian trail to provide reasonable routes and avoid obvious traffic congestion.
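The Eulerian trail that both ETR and ATR start from can be produced with the classical Hierholzer construction, which is the "well-known algorithm" referred to above. The following is a minimal Python sketch, not the thesis implementation: it assumes the topology is given as a list of undirected edges (parallel channels such as e and e' allowed) and that the graph is connected, as all graphs in this thesis are.

```python
from collections import defaultdict

# A minimal sketch (not the thesis code) of the classical Hierholzer construction
# for an Eulerian trail in an undirected multigraph.  Reading the returned vertex
# sequence forward gives ET1; reading it backward gives ET2.

def eulerian_trail(edges):
    """edges: list of (u, v) pairs; parallel edges such as e and e' are allowed."""
    adj = defaultdict(list)                  # vertex -> list of (neighbor, edge id)
    for eid, (u, v) in enumerate(edges):
        adj[u].append((v, eid))
        adj[v].append((u, eid))

    odd = [v for v in adj if len(adj[v]) % 2 == 1]
    if len(odd) not in (0, 2):
        raise ValueError("graph has no Eulerian trail")
    start = odd[0] if odd else next(iter(adj))

    used = [False] * len(edges)
    stack, trail = [start], []
    while stack:                             # iterative Hierholzer
        v = stack[-1]
        while adj[v] and used[adj[v][-1][1]]:
            adj[v].pop()                     # discard edges already traversed
        if adj[v]:
            w, eid = adj[v].pop()
            used[eid] = True
            stack.append(w)
        else:
            trail.append(stack.pop())
    return trail[::-1]                       # e.g. [0, 5, 4, 3, 8, 7, 1, 5, ...]
```

For the topology of Figure 4.1(b) this yields a trail such as the ET1 shown above; which particular trail is produced depends on the order in which edges are listed, which is exactly why the selection heuristic of Section 4.5.2 is useful.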
4.3.1 Routing Tables

Based on the routing paths allowed in the concatenated trail, a routing table is created in each switch to carry routing information for the local switch. A routing table has many entries, and each entry is indexed by the destination switch and the incoming port (channel). If adaptive routing is allowed, more than one routing option may be listed in an entry. As an example, the routing table in switch 8 for deterministic ETR is shown in Table 4.1. Note that if a message cannot match any entry, it has somehow taken a wrong route. For example, if a message in switch 8 has destination switch 0, it should not have come from switch 7, because a message from 7 to 0 should take 7 → 1 → 0.

    Dst | Incoming Channels       | Outgoing Channel
    0   | 3 → 8, terminal ports   | 8 → 1
    1   | 3 → 8, terminal ports   | 8 → 1
    2   | terminal ports          | 8 → 1
    3   | 7 → 8, terminal ports   | 8 → 3
    4   | 7 → 8, terminal ports   | 8 → 3
    5   | terminal ports          | 8 → 7
    6   | terminal ports          | 8 → 1
    7   | 3 → 8, terminal ports   | 8 → 7

Table 4.1: Example routing table entries in switch 8 for deterministic ETR routing

4.3.2 Time Complexity

The ETR algorithm is simple to implement. Let m be the number of bidirectional channels and n be the number of switches in a network. Finding an Eulerian trail takes O(m) time, and creating the routing tables in all switches takes O(mn) time.

4.4 Adaptive-Trail Routing Algorithm (ATR)

The ETR algorithm is very easy to implement and also provides reasonable performance. Deadlock can never happen in the ETR because there is no channel dependency cycle. However, the ETR may have two drawbacks. First, some messages are not able to use the shortest paths to reach their destinations. For example, from switch 6 to switch 2, a message has to go through 6 → 5 → 1 → 3 → 2 (four hops) although the shortest path is 6 → 5 → 2. Second, the earlier a channel appears in the final concatenated Eulerian trail, the fewer nodes can utilize it. For example, channel 0 → 5 can only be used by nodes in switch 0 because it is the first channel in the trail. This may cause unbalanced channel utilization. Therefore, we propose a more complex algorithm, ATR, to avoid these problems. In the ATR, shortcuts are added to each Eulerian trail. The trails with shortcuts are called adaptive trails; the adaptive trails derived from ET1 and ET2 are called AT1 and AT2, respectively.

4.4.1 Adaptive Trail

Any channel in the network may be a shortcut in an adaptive trail. However, deadlock is possible if shortcuts are added without any restriction. A deadlock example is shown in Figure 4.2, where a shortcut (8 → 3) is added to AT2 and used in the routing:

    m1: 8 → 3 → 2, waiting for 2 → 5
    m2: 2 → 5 → 1, waiting for 1 → 7
    m3: 1 → 7 → 8, waiting for 8 → 3

Figure 4.2: An example of deadlock

In order to avoid deadlock, shortcuts have to be used with some restrictions. The concept of channel dependency was first presented in [72]. A routing algorithm is deadlock-free if there is no cyclic channel dependency in the network [72]. However, such a condition is too restrictive for adaptive routing, where more than one outgoing channel is offered at many switches. It has been shown in both [43] and [42] that an adaptive routing algorithm with cyclic channel dependency can still be deadlock-free as long as some necessary conditions are satisfied.
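The cyclic-channel-dependency check that both the SR algorithm and the shortcut rules below rely on can be performed mechanically: treat each unidirectional channel as a node, add an arc whenever the routing function lets a packet hold one channel while requesting another, and search for cycles. A small illustrative sketch follows; the explicit dependency list is an assumption of the example, not how the thesis tools represent routing.

```python
# A small sketch of cycle detection in a channel dependency graph.  Each node is a
# unidirectional channel (u, v); an arc ((u, v), (v, w)) means the routing function
# lets a packet hold channel u->v while requesting channel v->w.

def find_dependency_cycle(dependencies):
    """dependencies: iterable of (held_channel, requested_channel) pairs.
    Returns one dependency cycle as a list of channels, or None."""
    graph = {}
    for held, requested in dependencies:
        graph.setdefault(held, []).append(requested)
        graph.setdefault(requested, [])

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in graph}
    stack = []

    def dfs(c):
        color[c] = GRAY
        stack.append(c)
        for nxt in graph[c]:
            if color[nxt] == GRAY:                 # back edge -> cycle found
                return stack[stack.index(nxt):]
            if color[nxt] == WHITE:
                cycle = dfs(nxt)
                if cycle:
                    return cycle
        color[c] = BLACK
        stack.pop()
        return None

    for c in graph:
        if color[c] == WHITE:
            cycle = dfs(c)
            if cycle:
                return cycle
    return None

# Example: the three blocked messages of Figure 4.2 induce this dependency cycle.
deps = [(("8", "3"), ("3", "2")), (("3", "2"), ("2", "5")),
        (("2", "5"), ("5", "1")), (("5", "1"), ("1", "7")),
        (("1", "7"), ("7", "8")), (("7", "8"), ("8", "3"))]
print(find_dependency_cycle(deps))
```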
For any detected dependency cycle, we break it if it may cause deadlock, but we allow it if it cannot cause deadlock. The shortcuts are categorized into three types: free-style shortcuts, destination shortcuts, and source shortcuts. Figure 4.3 shows how shortcuts are added step by step. For a partial trail like ... → u_1 → ... → u_j → ... → v_1 → ... → v_k → ..., where u_i = u (1 ≤ i ≤ j), v_i = v (1 ≤ i ≤ k), and no occurrence of v lies between u_1 and u_j, any allowed shortcut u → v is drawn from u_j to v_1. The order in which the different types of shortcuts are added facilitates deadlock avoidance.

Free-style shortcut: Given an Eulerian trail, a shortcut s_{p,q} is a free-style shortcut (or f-shortcut) if there is an i < p such that v_i = v_p and v_{i+1} = v_q. A free-style shortcut must be in C(ET). To create the adaptive trails, free-style shortcuts are first added to each Eulerian trail, as illustrated in Figure 4.3 (first step). Note that in ET1, the free-style shortcut s_{10,11} from v_10 (i.e., switch 1) to v_11 (i.e., switch 5) is channel e', while channel c_{10,11} is e. A free-style shortcut can be used by any message if it is on the routing path.

Destination shortcut: Given an Eulerian trail ET, any shortcut of ET is a destination shortcut (or d-shortcut). For example, in AT1 after the second step in Figure 4.3, there are two destination shortcuts to v_9 (i.e., switch 3), from v_4 (i.e., switch 8) and v_6 (i.e., switch 1), respectively. A destination shortcut s_{p,q} can only be used by messages whose destination is v_q. Thus, whenever a message acquires a destination shortcut, it will eventually go to its destination and release the channel.

Source shortcut: Given an Eulerian trail ET, any shortcut of ET that is not in C(ET) is a source shortcut (or s-shortcut). For example, in AT1 after the third step in Figure 4.3, there are two source shortcuts from v_6 (i.e., switch 1), to v_9 (i.e., switch 3) and to v_13 (i.e., switch 0), respectively. Note that s_{6,15} (i.e., 1 → 8) on AT1 is not a source shortcut since 1 → 8 is in C(ET1). A source shortcut s_{p,q} can only be used by messages whose source is a local processor connected to switch v_p. No message can hold a channel and request a source shortcut.

    Index: 0   1   2   3   4   5   6      7   8   9   10     11  12  13  14  15
    ET1:   0 → 5 → 4 → 3 → 8 → 7 → 1 -e'→ 5 → 2 → 3 → 1 -e→  5 → 6 → 0 → 1 → 8
    ET2:   8 → 1 → 0 → 6 → 5 -e→ 1 → 3 → 2 → 5 -e'→ 1 → 7 → 8 → 3 → 4 → 5 → 0

    First step:  add free-style shortcut channels.
    Second step: add destination shortcut channels, and remove a free-style shortcut if it would cause deadlock.
    Third step:  add source shortcut channels, and remove those of them that would cause deadlock.

    1. Free-style shortcuts are drawn as dashed arcs above the Eulerian trails.
    2. Destination shortcuts are drawn as solid arcs below the Eulerian trails.
    3. Source shortcuts are drawn as solid arcs above the Eulerian trails.

Figure 4.3: Illustration of the generation of adaptive trails

After adding the different shortcuts, we have to detect cyclic channel dependencies. Any routing path allowed in the corresponding Eulerian trail is kept. Whenever a dependency cycle has to be broken to avoid deadlock, we always remove a shortcut to break the cycle.
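The three shortcut categories just defined can be enumerated directly from a unidirectional Eulerian trail. The sketch below is illustrative only (it ignores parallel channels such as e and e', and shortcuts between subsequent vertices); the function and variable names are not from the thesis, and the set of unidirectional links is assumed to be given explicitly.

```python
# A rough sketch (not the thesis code) of classifying the candidate shortcuts of one
# unidirectional Eulerian trail by the three types defined above.  'trail' is the
# vertex sequence, e.g. [0, 5, 4, 3, 8, 7, 1, 5, 2, 3, 1, 5, 6, 0, 1, 8]; 'links' is
# the set of unidirectional physical links (u, v) present in the network.
# Parallel channels and shortcuts between subsequent vertices are omitted for brevity.

def classify_shortcuts(trail, links):
    m = len(trail)
    trail_links = {(trail[i], trail[i + 1]) for i in range(m - 1)}   # C(ET)
    shortcuts = []
    for p in range(m):
        for q in range(p + 2, m):            # non-subsequent positions only
            u, v = trail[p], trail[q]
            if v == trail[p + 1] or (u, v) not in links:
                continue                     # not a shortcut by Definition 3
            kinds = {"d"}                    # every shortcut may serve as a d-shortcut
            if any(trail[i] == u and trail[i + 1] == v for i in range(p)):
                kinds.add("f")               # same link appears earlier as a trail channel
            if (u, v) not in trail_links:
                kinds.add("s")               # link is not in C(ET): source shortcut
            shortcuts.append((p, q, u, v, kinds))
    return shortcuts
```

Which of these candidates survive into AT1 and AT2 is then decided by the deadlock rules discussed next.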
As we mentioned above, different types of shortcuts are used for different purposes. A destination shortcut cannot cause deadlock, because whenever it is occupied as a destination shortcut, it delivers a message to the destination and is eventually released. In Figure 4.3, there is a dependency 3(v_3) → 8(v_4) -d-shortcut→ 1(v_6) on AT1 and a dependency 8(v_0) → 1(v_1) → 0(v_2) → 6(v_3) → 5(v_4) → 1(v_5) → 3(v_6) -d-shortcut→ 8(v_11) on AT2. The two dependencies form a dependency cycle, but the cycle cannot cause deadlock, for the following reason: if 8 → 1 is used as a destination shortcut on AT1, it will eventually be released; if 8 → 1 is used on AT2, there is another path from 8 to 1 (8 → 7 → 1) available in AT1.

Dependency cycles due to free-style shortcuts may cause deadlock. Consider a trail of the form ... → u → v → a → ... → d → ... → x → y → ..., where u and x represent the same switch and v and y represent the same switch. Assume there is a free-style shortcut x → y. Because u → v and x -f-shortcut→ y represent the same physical link, there is no way to distinguish them when a message arrives at switch v from u. Since u → v → a → ... → d is allowed by the Eulerian trail, a message from source x to destination d may take the path x -f-shortcut→ y → a → ... → d. Such a path, which breaks the channel dependency order given by the Eulerian trail, forms a dependency cycle together with the dependency a → ... → d → ... → x -f-shortcut→ y. It may cause deadlock, so x -f-shortcut→ y has to be removed. As an example, let us see why 5(v_8) -f-shortcut(e)→ 1(v_9) on AT2 may cause deadlock and has to be removed. If we allow this free-style shortcut, we have 5(v_4) -e→ 1(v_5) → 3(v_6) -d-shortcut→ 8(v_11) and 5(v_8) -f-shortcut(e)→ 1(v_9) → 7(v_10) → 8(v_11). At switch 1, if a message coming from 2(v_7) → 5(v_8) -f-shortcut(e)→ 1(v_9) wants to go to switch 8, it may take 1(v_5) → 3(v_6) and wait on 3(v_6) -d-shortcut→ 8(v_11). However, if 3(v_6) -d-shortcut→ 8(v_11) is not available, it may take 3(v_6) → 2(v_7) instead and have to wait on 2(v_7) → 5(v_8), which is occupied by the same message. Thus, deadlock happens. Figure 4.4 shows all the configurations in which a shortcut from u to v cannot be used as a free-style shortcut.

Figure 4.4: The four configurations in which an f-shortcut from u to v must be removed to avoid deadlock (the over-arrow arcs in the figure denote f-shortcuts or d-shortcuts)

Deadlock may also be caused by a dependency cycle due to source shortcuts. In Figure 4.5, there is a dependency a -s-shortcut→ b → ... → c → d on AT1 and a dependency c -s-shortcut→ d → ... → a → b on AT2. The two dependencies create a cycle and may cause deadlock. Therefore, a source shortcut (e.g., c → d on AT2) has to be removed to break the cycle. Without loss of generality, we assume AT2 is used to break such a cycle. For example, if 2 → 3 on ET1 is added to AT2 as a source shortcut, there will be a dependency 2(v_7) -s-shortcut→ 3(v_12) → 4(v_13) → 5(v_14) on AT2. Since there is a dependency 4(v_2) -s-shortcut→ 5(v_7) → 2(v_8) → 3(v_9) on AT1, a dependency cycle exists and may cause deadlock. Therefore, the shortcut 2(v_7) -s-shortcut→ 3(v_12) has to be removed from AT2.
Figure 4.5: The deadlock case due to a source shortcut

Figures 4.4 and 4.5 show all the possible deadlock configurations caused by adding shortcuts. The deadlock detection algorithm detects them and breaks each deadlock cycle by removing a shortcut. The detailed proof given in Section 4.6 shows that adaptive-trail routing is deadlock-free. By routing messages along the two adaptive trails, we guarantee deadlock-free adaptive routing.

Note that a shortcut s_{i,j} may have more than one shortcut type. For example, s_{3,8} (3 → 2) on AT1 is both a source shortcut and a destination shortcut. In this case, s_{3,8} can be used for both purposes.

4.4.2 Routing Tables

Like the ETR, the ATR also creates a static routing table in each switch, based on the two adaptive trails. A routing table has many entries, and each entry is indexed by both the destination switch and the incoming channel (port). Some entries may have multiple options of outgoing channels, which are ordered by a shortest-path-first policy. We first create a routing table for each trail separately, then combine them and remove redundant entries, if any. The routing table may be adjusted according to the heuristics in Section 4.5.1.

Our scheme scans each trail beginning from the largest index and proceeding to the smallest index. When a switch is visited, the different types of shortcuts are used in different ways. In order to use a source shortcut as an outgoing channel, the incoming channel has to be a terminal port. A destination shortcut s_{p,q} cannot be used in an entry unless the destination is v_q. Channels c_{i,i+1} and free-style shortcuts are used without limitation. Table 4.2 shows the routing table created in switch 5 when it (i.e., v_11) is first visited along AT1 of Figure 4.3. Note that 5(v_11) -d-shortcut→ 0(v_13) and 5(v_11) -d-shortcut→ 1(v_14) can only be used for destination switches 0 and 1, respectively.

    Dst | Incoming Channels              | OutCh 1 | OutCh 2
    0   | 1 -e→ 5, 2 → 5, terminal ports | 5 → 0   | 5 → 6
    1   | 2 → 5, terminal ports          | 5 → 1   | 5 → 6
    6   | 1 -e→ 5, 2 → 5, terminal ports | 5 → 6   | none
    8   | 1 -e→ 5, 2 → 5, terminal ports | 5 → 6   | none

Table 4.2: Example routing table entries for a first-visited node

When a switch is visited again, the key point is to use the output channels available at its current occurrence and at its previously visited occurrence(s), which have larger indices. However, we should avoid routing a message through any switch more than once. Consider switch 5 when it (i.e., v_7) is visited for the second time along AT1 in Figure 4.3. The new entries in switch 5 are shown in Table 4.3. Note that for destination switch 6, we do not allow the path 5(v_7) → 2(v_8) → 3(v_9) → 1(v_10) -e→ 5(v_11) → 6(v_12) because it passes switch 5 twice.

    Dst | Incoming Channels               | OutCh 1 | OutCh 2 | OutCh 3
    0   | 4 → 5, terminal ports           | 5 → 0   | 5 → 6   | 5 → 2
    0   | 1 -e'→ 5                        | 5 → 0   | 5 → 6   | none
    1   | 4 → 5, terminal ports           | 5 → 1   | 5 → 2   | 5 → 6
    2   | 1 -e'→ 5, 4 → 5, terminal ports | 5 → 2   | none    |
    3   | 1 -e'→ 5, 4 → 5, terminal ports | 5 → 2   | none    |
    6   | 1 -e'→ 5, 4 → 5, terminal ports | 5 → 6   | none    |
    8   | 1 -e'→ 5, 4 → 5, terminal ports | 5 → 2   | 5 → 6   | none

Table 4.3: Example routing table entries for a multiply visited node

A shortcut offers a shorter routing path, but it may not be available when a message is routed. Consider a message M that comes from 5(v_7) → 2(v_8) → 3(v_9) → 1(v_10) (on AT1) and goes to switch 0. Although 1(v_10) -d-shortcut→ 0(v_13) is the shortest path to M's destination, M may take 1(v_10) -e→ 5(v_11) → 6(v_12) → 0(v_13) if 1(v_10) -d-shortcut→ 0(v_13) is not available when M is routed from switch 1. This results in passing switch 5 twice and wastes channel bandwidth. For this reason, a heuristic in Section 4.5.1 is suggested to handle such a case.
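The per-switch tables described above are simple associative structures: each entry is keyed by (destination switch, incoming port) and holds an ordered list of outgoing channels, and routing takes the first free option. A minimal sketch of that structure follows; the dictionary representation and the `is_free` predicate (standing in for whatever channel-status information the switch provides) are illustrative assumptions, not the thesis implementation.

```python
# A minimal sketch of the per-switch routing table: entries indexed by
# (destination switch, incoming port), each holding an ordered list of outgoing
# channels, shortest path first.  'is_free' is a placeholder for the switch's
# channel-status check.

class SwitchRoutingTable:
    def __init__(self):
        self.entries = {}                     # (dst, in_port) -> [out_channel, ...]

    def add_option(self, dst, in_port, out_channel):
        self.entries.setdefault((dst, in_port), []).append(out_channel)

    def route(self, dst, in_port, is_free):
        options = self.entries.get((dst, in_port))
        if options is None:                   # no entry: the message took a wrong route
            raise LookupError("no routing entry for (%s, %s)" % (dst, in_port))
        for ch in options:
            if is_free(ch):
                return ch                     # adaptive routing: first free option
        return options[0]                     # all busy: wait on the preferred option

# Example: part of Table 4.2 (switch 5, destination 0, message from a terminal port).
t = SwitchRoutingTable()
t.add_option(0, "terminal", ("5", "0"))
t.add_option(0, "terminal", ("5", "6"))
print(t.route(0, "terminal", is_free=lambda ch: ch == ("5", "6")))   # -> ('5', '6')
```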
4.4.3 Time Complexity

The whole algorithm is simple to implement. Let m be the number of trunk channels and n be the number of switches in a network. As shown in [75], the time complexities of finding an Eulerian trail, adding all shortcuts, removing free-style shortcuts for deadlock avoidance, removing source shortcuts for deadlock avoidance, and creating the routing tables are O(m), O(m^2), O(m^2), O(m^2), and O(nm), respectively. Thus, the total time complexity is O(m^2).

4.5 Heuristics

To provide potentially good network performance, some heuristics are suggested in the following to avoid longer routing paths and to guide the selection of an Eulerian trail.

4.5.1 Heuristics about the Degree of Adaptivity

If a switch appears more than once along an Eulerian trail, there will be more than one outgoing channel from the switch to some destinations. Multiple routing paths from a switch to a destination switch offer a higher degree of adaptivity and redundant routing choices for fault tolerance. However, if a path is very long, many messages may be blocked along it; simulation results show that allowing long paths causes low network throughput and high message latency. Our heuristics consider the multiple routing options in a switch and decide whether or not to keep a longer routing path. Two heuristics are suggested: the dual-path heuristic and the source heuristic.

Dual-path heuristic: Suppose switch v appears more than once along the trail, and let v_i and v_j (j > i) denote two of its occurrences. If v_i and v_j have different outgoing channels for a destination d, the dual-path heuristic decides whether the longer path is allowed.

One case is that d appears once: ... → v_i → a → ... → v_j → b → ... → d → .... In this case, a message at v_i can go to d either via v_j → b or via v_i → a plus some shortcut(s), if any. Suppose the latter choice is used. The shortcut may not actually be usable when a message is scheduled, which results in a longer path that visits switch v twice. For example, a packet may take the path 5(v_7) → 2(v_8) → 3(v_9) → 1(v_10) -e→ 5(v_11) → 6(v_12) → 0(v_13) instead of 5(v_7) → 2(v_8) → 3(v_9) → 1(v_10) → 0(v_13) (on AT1). Our dual-path heuristic will not keep such a path unless its length is no longer than that of the path from v_j.

The other case is that d appears twice: ... → v_i → a → ... → d → ... → v_j → b → ... → d → .... When a message comes in on v_i's incoming channel, it may use either v_i → a or v_j → b. If one path is much longer than the other, the longer path may not be good for performance because it consumes many channel resources. Suppose the length of the shorter path is h; our dual-path heuristic discards the longer path if its length is greater than 2h.

Source heuristic: There may be more than one routing path between a source and a destination. Suppose the shortest distance among these options is h; our heuristic will not keep an option in the corresponding entry if its length is greater than or equal to 2h. This heuristic is called the source heuristic, and it is only applied to terminal incoming ports. In Figure 4.3, for example, since the shortest path from switch 0 to switch 1 is 0 → 1 (on AT1), the longer path 0 → 6 → 5 → 1 (on AT2) will not be used by messages coming from a terminal port in switch 0.
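Both heuristics boil down to the same length-based pruning of a table entry's options. The sketch below captures that pruning under the thresholds stated in the text (drop a dual path longer than 2h, and drop terminal-port options of length at least 2h); the function name, the option representation, and the final shortest-path-first ordering are illustrative assumptions.

```python
# A small sketch of the length-based pruning shared by the dual-path and source
# heuristics above: keep an option only if it is not too long relative to the
# shortest one available for the same (destination, incoming port) entry.

def prune_options(options, from_terminal_port):
    """options: list of (out_channel, path_length) for one routing-table entry."""
    if not options:
        return options
    h = min(length for _, length in options)           # shortest available option
    kept = []
    for out_channel, length in options:
        if from_terminal_port:
            keep = length < 2 * h                       # source heuristic: drop >= 2h
        else:
            keep = length <= 2 * h                      # dual-path heuristic: drop > 2h
        if keep:
            kept.append((out_channel, length))
    return sorted(kept, key=lambda opt: opt[1])         # shortest path first

# Example: a 1-hop and a 3-hop option from a terminal port -> the 3-hop one is dropped.
print(prune_options([(("0", "1"), 1), (("0", "6"), 3)], from_terminal_port=True))
```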
4.5.2 Selection of Eulerian Trails

So far, we have used the well-known algorithm to choose an Eulerian trail at random. Different Eulerian trails may result in different performance, which has been observed in our simulations [21, 75]. In order to obtain better performance, a heuristic for finding an appropriate Eulerian trail is needed.

Given two bidirectional Eulerian trails, the following method is used to choose the one that may provide better performance. Let r(u, v) be the shortest distance from u to v allowed by the routing algorithm; note that the adaptive trails are constructed first if the ATR is used. Let δ(u, v) = r(u, v) / d_G(u, v), where d_G(u, v) is the shortest distance from u to v allowed by the network. Let Λ be the sum of all δ(u, v). In general, an Eulerian trail with a smaller Λ will have better performance. However, it is not guaranteed that the trail with minimum Λ will provide the best performance, and it is still an open issue how to construct an Eulerian trail with the minimum Λ. In our algorithms, we select several bidirectional Eulerian trails by varying the starting switch and the way the next channel is selected. For each bidirectional Eulerian trail, we create the corresponding adaptive trails and compute Λ; an Eulerian trail with the smallest Λ is selected. We use the following methods to select the next channel for an Eulerian trail:

- Select a channel randomly.
- A table is used to maintain the possible shortest distance along the trail for each pair consisting of a visited node and an un-visited node. Whenever a new channel (and a node) is visited, the table is updated. Suppose the pair of nodes s (visited) and d (un-visited) has the longest distance in the table; we then select the next channel whose newly visited node is on the shortest path from s to d.

4.6 The Proof of Deadlock Freedom

The deadlock freedom of ETR routing is straightforward because there is no channel dependency cycle. In this section, we show that the ATR algorithm is deadlock-free as well. In order to simplify the proof, we need the following definition of a deadlock-immune channel, which was first given in [43].

Definition 4 A channel c is deadlock-immune for a packet M if and only if, once M occupies the channel c, M will eventually leave channel c and release it. For convenience, if a packet M can never use channel c, c is also defined to be deadlock-immune for packet M. A channel is said to be deadlock-immune if and only if it is deadlock-immune for all packets.

As shown in [43], a routing algorithm is deadlock-free if and only if every channel is deadlock-immune. This is obvious because the channels involved in a deadlock could not be deadlock-immune.

In the ATR routing algorithm, a packet is transmitted to its destination along AT1 and/or AT2. Although the routing tables are created separately based on each adaptive trail, they are finally combined into integrated routing tables. Due to the use of source shortcuts, it is possible for a packet to start on one trail and change its route to the other trail. Note that this change will not happen unless there is a channel on its routing path which is also used as a source shortcut on the other trail. Let ET1 be v_0 → v_1 → ... → v_m and ET2 be u_0 → u_1 → ... → u_m, where u_i = v_{m−i}.
In the following, c^1_{i,i+1} and s^1_{p,q} denote a channel in C(ET1) and a shortcut of AT1, respectively, and c^2_{i,i+1} and s^2_{p,q} denote a channel in C(ET2) and a shortcut of AT2, respectively.

Lemma 1 If none of the channels c^1_{i,i+1} (i ≥ k) in C(ET1) is a source shortcut in AT2, then all channels c^1_{i,i+1} (i ≥ k) are deadlock-immune.

Proof: Consider any packet, no matter whether it is originally routed along AT1 or AT2. There are three cases:

1. The packet holds a channel c^1_{i,i+1} (i ≥ k) while it is routed along AT1. Then its route cannot change to AT2 any more, because none of the c^1_{i,i+1} (i ≥ k) is a source shortcut in AT2. If any packet is holding c^1_{m−1,m}, the packet will arrive at destination v_m and release c^1_{m−1,m}. After c^1_{m−1,m} is released, any packet holding c^1_{m−2,m−1} will go to its destination (either v_m or v_{m−1}) and release c^1_{m−2,m−1}. Eventually, c^1_{m−3,m−2}, ..., and c^1_{k,k+1} will be released, because any packet holding c^1_{i,i+1} (i ≥ k) can always reach its destination via the path v_{i+1} → v_{i+2} → ... → destination. Therefore, all channels c^1_{i,i+1} (i ≥ k) are deadlock-immune for such a packet.

2. The packet holds a channel c^1_{i,i+1} (i ≥ k) while it is routed along AT2. This means that c^1_{i,i+1} is used as a destination shortcut in AT2 (because it is not a source shortcut) for the packet. The packet will go to its destination and release the channel eventually. Therefore, each c^1_{i,i+1} (i ≥ k) is deadlock-immune for such a packet.

3. The packet never takes any channel c^1_{i,i+1} (i ≥ k). By definition, each c^1_{i,i+1} (i ≥ k) is deadlock-immune for such a packet.

Therefore, the channels c^1_{i,i+1} (i ≥ k) are deadlock-immune. □

Lemma 2 If none of the channels c^2_{i,i+1} (i ≥ k) in C(ET2) is a source shortcut in AT1, then all channels c^2_{i,i+1} (i ≥ k) are deadlock-immune.

The proof is similar to that of Lemma 1.

Lemma 3 Channel c^1_{m−1,m} is deadlock-immune.

Proof: If c^1_{m−1,m} is not a source shortcut in AT2, it is deadlock-immune by Lemma 1. Otherwise, let s^2_{p,q} be the source shortcut in AT2 that corresponds to c^1_{m−1,m}. We claim that none of the channels c^2_{j,j+1} (j ≥ q) can be a source shortcut in AT1, because otherwise there would be a dependency cycle of the following form, which is the same deadlock configuration shown in Figure 4.5 and is not allowed in our adaptive trails:

AT1: v_0 → ... → v_x → ... → v_y → ... → v_{m−1} → v_m (where s^1_{x,y} = c^2_{j,j+1})
AT2: u_0 → ... → u_p → ... → u_q → ... → u_j → u_{j+1} → ... → u_m (where s^2_{p,q} = c^1_{m−1,m})

Therefore, all c^2_{j,j+1} (j ≥ q) are deadlock-immune by Lemma 2. In this case, if any packet holds s^2_{p,q} as a source shortcut, it will go to its destination via u_q → u_{q+1} → ... → destination and eventually release s^2_{p,q}. So c^1_{m−1,m} is deadlock-immune for any packet. □

Lemma 4 Channel c^2_{m−1,m} is deadlock-immune.

The proof is similar to that of Lemma 3.

Lemma 5 For any channel c^1_{k,k+1} in C(ET1), if all channels c^1_{i,i+1} (i > k) are deadlock-immune, then c^1_{k,k+1} is deadlock-immune.

Proof: For any packet, there are the following four cases:

1. The packet never takes channel c^1_{k,k+1}. Then, by definition, c^1_{k,k+1} is deadlock-immune for the packet.

2. The packet holds channel c^1_{k,k+1} and its destination is switch v_{k+1}. In this case, the packet will finally reach its destination and free c^1_{k,k+1}. So c^1_{k,k+1} is deadlock-immune for the packet.

3. The packet holds c^1_{k,k+1} and requests c^1_{k+1,k+2} in AT1. Since the channels c^1_{i,i+1} (i > k) are deadlock-immune, they will become available to the packet. Therefore, the packet will eventually go through v_{k+1} → v_{k+2} → ... → destination and reach its destination.
In this case, c^1_{k,k+1} is deadlock-immune for the packet.

4. The packet holds c^1_{k,k+1} as a source shortcut in AT2. Suppose the source shortcut is s^2_{p,q}. Then the packet requests c^2_{q,q+1}. If none of the c^2_{j,j+1} (j ≥ q) is a source shortcut in AT1, the c^2_{j,j+1} are deadlock-immune by Lemma 2, which means the packet can reach its destination via the c^2_{j,j+1}. If some c^2_{j,j+1} is a source shortcut in AT1, let s^1_{x,y} be the source shortcut in AT1 corresponding to c^2_{j,j+1}; then we must have y > k. Otherwise, there would be a dependency cycle of the following form, which is the same deadlock configuration shown in Figure 4.5 and is not allowed in our adaptive trails:

AT1: v_0 → ... → v_x → ... → v_y → ... → v_k → v_{k+1} → ... → v_m (where s^1_{x,y} = c^2_{j,j+1})
AT2: u_0 → ... → u_p → ... → u_q → ... → u_j → u_{j+1} → ... → u_m (where s^2_{p,q} = c^1_{k,k+1})

Therefore, we must have y > k. Any packet holding s^1_{x,y} (y > k) will go to its destination by the path v_y → v_{y+1} → ... → destination, because all channels c^1_{i,i+1} (i > k) are deadlock-immune. This means that s^1_{x,y} (i.e., c^2_{j,j+1}) is deadlock-immune. Therefore, the packet holding s^2_{p,q} (i.e., c^1_{k,k+1}) can go to its destination via u_q → u_{q+1} → ... → destination and release s^2_{p,q}. So c^1_{k,k+1} is deadlock-immune for the packet in this case.

In summary, c^1_{k,k+1} is deadlock-immune. □

Lemma 6 For any channel c^2_{k,k+1} in C(ET2), if all channels c^2_{j,j+1} (j > k) are deadlock-immune, then c^2_{k,k+1} is deadlock-immune.

The proof is similar to that of Lemma 5.

Theorem 1 The ATR routing algorithm is deadlock-free.

Proof: By Lemmas 3, 4, 5, and 6, the channels c^1_{m−1,m}, c^1_{m−2,m−1}, ..., c^1_{1,2}, c^1_{0,1} and the channels c^2_{m−1,m}, c^2_{m−2,m−1}, ..., c^2_{1,2}, c^2_{0,1} are proved to be deadlock-immune one by one. Therefore, the routing algorithm is deadlock-free; otherwise, some channels would not be deadlock-immune, because the channels involved in a deadlock cannot be deadlock-immune. □

4.7 Performance Evaluation

This section shows some typical simulation results based on the simulation environment described in Chapter 6. Let x be the time duration between the time when the current message has completely left the source buffer and the time when the next message is generated in each node. Thus, x is a random variable following an exponential distribution. To demonstrate the performance behavior of different routing algorithms, we run the simulations under different workloads by varying the value of x from large to small: the smaller x is, the heavier the workload. To make a fair comparison, we use the same x for all algorithms under the same combination of network parameters. Note that the actual effective network workload, which depends both on x and on how fast a message leaves its source in each switch, may not be exactly the same for different algorithms; however, this is sufficient to evaluate the performance behavior of each algorithm. For each run, at least 30,000 messages have been received, and the error on the average message latency is less than 5%. The first 1,000 messages are discarded to eliminate the start-up transient effect. Latency/throughput curves are used as our performance curves. In order to show the fairness ratio, we use a star to mark each point on the performance curves where the fairness ratio is greater than 2 for the corresponding workload.

4.7.1 Network Workload and Parameters

Two destination distributions (i.e., traffic patterns) are considered:

1. Uniform traffic: a node uniformly communicates with any other node. Two nodes connected to the same switch may communicate with each other.
2. Client/server traffic: k nodes are servers and all the other nodes are clients. For a client, a certain fraction (75% here) of its traffic is sent uniformly among the servers and the rest is sent uniformly among the clients. A server sends messages to the other clients and servers with a uniform distribution.

In a real network, a client may not communicate uniformly with all servers; instead, it may communicate with a specific server more frequently. However, there are too many cases in such a non-uniform client/server pattern, so for simplicity we choose the above model. In terms of k, we use 4 or 5 servers out of 75 nodes in our simulations.

We use the following two message size distributions:

1. Fixed-size messages: the message size is always 1000 bytes.

2. Bursty messages: bursty traffic happens at a specific rate, and the bursty message size is 10,000 bytes. Other messages are 100 bytes long. The bursty rate is 5% in this chapter.

The fixed message size and the uniform traffic distribution are combined to demonstrate the maximum sustained throughput and worst-case latency. A client/server distribution is more realistic than a uniform distribution and is useful for modeling network hot spots. In actual network applications, the bursty message distribution and the hot-spot destination pattern are commonly observed.

A time unit is defined as the time needed to transmit one byte via a channel, which is determined by the channel bandwidth. We assume the header of each message is 10 bytes. Thus, the switch delay for a routing decision is 10 time units (i.e., the input buffer should have at least 10 bytes to hold the complete header field). The input FIFO buffer capacity is 100 bytes by default; however, we also measure the performance with a 1000-byte buffer capacity to show the effect of buffer size. In a wormhole switch, it is not necessary to limit the packet size to the input buffer capacity; we assume variable-size packets and consider each message to be one packet. Because the ATR uses escape channels to avoid deadlock, a buffer cannot be assigned to a new packet unless it is empty. This is not necessary for the ETR, SRA, SRD, UDR, and MUDR. Thus, the SRA, SRD, UDR, and MUDR can take more benefit from a large buffer capacity, since a single buffer can hold more than one packet. But this benefit may not be significant if the buffer capacity is much smaller than the message size (e.g., 1000-byte messages and 100-byte buffer capacity).

4.7.2 Network Topologies

The topologies used to measure network performance are shown in Figure 4.6. In order to make a rather fair comparison, all topologies have the same number of switches and nodes: there are 10 switches and 75 nodes in each network. The number of channels, the node distributions, the diameters, and the average path lengths are shown in Table 4.4. We assume that all switches have the same number of bidirectional ports and that the number of used ports in each switch is similar; open ports are left for future use. Based on the node distribution, each switch should have at least 12 ports.

Figure 4.6: Six topologies considered in the simulation

The path length from node u to node v is defined as the number of trunk channels traversed by the path. For nodes u and v attached to the same switch, the path length between them is defined as 0. The average path length is the average, over all pairs of nodes, of the shortest path length between them, which is defined as (Σ_{i,j} path_length(i, j)) / (number_of_nodes)^2.
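The average-path-length formula just given is straightforward to evaluate from a topology: a breadth-first search over the switch graph gives the switch-to-switch distances, and each node pair contributes the distance between its switches (0 if they share a switch). The sketch below assumes the sum ranges over all ordered node pairs, as in the formula; the data-structure names are illustrative.

```python
from collections import deque

# A short sketch of the average-path-length computation defined above.
# switch_adj: dict switch -> list of neighboring switches (trunk channels);
# nodes_per_switch: dict switch -> number of attached nodes.

def average_path_length(switch_adj, nodes_per_switch):
    def bfs_dist(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            s = q.popleft()
            for t in switch_adj[s]:
                if t not in dist:
                    dist[t] = dist[s] + 1
                    q.append(t)
        return dist

    total, n_nodes = 0, sum(nodes_per_switch.values())
    for s in switch_adj:
        dist = bfs_dist(s)
        for t in switch_adj:
            # every node on s paired with every node on t contributes dist[t] hops
            total += nodes_per_switch[s] * nodes_per_switch[t] * dist[t]
    return total / (n_nodes ** 2)
```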
Note that the average length of the paths allowed by a routing algorithm may be longer than the average path length of the network, due to deadlock avoidance and adaptive routing. A good routing algorithm should keep the average length of its routing paths as small as possible.

    Switch               |   A   |   B   |   C   |   D   |   E   |   F
    0                    |   7   |   7   |   7   |   3   |   3   |   8
    1                    |   8   |   8   |   8   |   8   |   9   |   7
    2                    |   8   |   7   |   8   |   8   |   9   |   8
    3                    |   8   |   8   |   7   |   8   |   8   |   8
    4                    |   7   |   8   |   8   |   8   |   9   |   8
    5                    |   8   |   8   |   7   |   8   |   9   |   7
    6                    |   7   |   7   |   7   |   8   |   9   |   7
    7                    |   7   |   7   |   8   |   8   |   8   |   7
    8                    |   8   |   8   |   8   |   8   |   8   |   8
    9                    |   7   |   7   |   7   |   8   |   3   |   7
    total nodes          |  75   |  75   |  75   |  75   |  75   |  75
    total trunk channels |  20   |  20   |  20   |  18   |  17   |  19
    diameter             |   2   |   3   |   3   |   2   |   2   |   3
    avg. path length     | 1.399 | 1.442 | 1.50  | 1.510 | 1.631 | 1.454

Table 4.4: The network configurations for simulation (number of nodes per switch and summary statistics)

Although A, B, and C have the same number of trunk channels per switch, they have different diameters and average path lengths. The trunk channels are unevenly distributed among the switches in D and E. The channels in F are basically evenly distributed among the switches, except that switch 9 has only two trunk channels, which may cause a bottleneck for traffic coming from or going to switch 9.

To avoid showing too many curves in one figure, we do not show the performance of the ETR routing. As shown in Figure 4.7, the general trend is that the ATR provides better performance than the ETR when the ETR has a much longer average routing path length than the ATR. However, if the ETR can allow most of the shortest paths, it can perform as well as the ATR.

Figure 4.7: Comparison of ETR and ATR under virtual cut-through switching (average latency versus normalized network throughput for topologies A and C, under uniform and client/server traffic)

In the following, we compare the performance of the ATR, UDR, MUDR, SRA, and SRD routing algorithms. The performance of the UDR and the ATR depends on the selection of the root switch and the selection of the Eulerian trails, respectively. For the ATR, we adopt all the heuristics in [21] and select the bidirectional Eulerian trail that minimizes a heuristic function based on average path length [21] among three randomly selected bidirectional Eulerian trails. For the UDR, we select the switch with the maximum number of trunk channels as the root (e.g., switch 0 is the root in D). If more than one switch has the highest degree, we construct three different trees with different roots and show the one with the best performance. To assign the direction of a channel whose end switches are at the same tree level, we assume that the 'up' direction is always from the switch with the smaller id to the switch with the larger id; if these channel directions are assigned randomly instead, the performance may be somewhat better, but the difference is not large. The MUDR selects the same root as the UDR in the same topology.
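The "heuristic function based on average path length" used to pick among candidate Eulerian trails is the Λ metric of Section 4.5.2: the routed distance of every switch pair is divided by its graph distance and the trail whose adaptive trails give the smallest sum is kept. A minimal sketch follows, assuming the routed and graph distances are already available as dictionaries; the names are illustrative, not the simulator's code.

```python
# A small sketch of the trail-selection heuristic of Section 4.5.2:
# Lambda = sum over switch pairs of r(u, v) / d_G(u, v); keep the trail with the
# smallest Lambda.  'routed_dist' and 'graph_dist' map (u, v) pairs to hop counts.

def lambda_metric(routed_dist, graph_dist):
    return sum(routed_dist[u, v] / graph_dist[u, v]
               for (u, v) in graph_dist if u != v)

def select_trail(candidates):
    """candidates: list of (trail, routed_dist, graph_dist) triples, where
    routed_dist reflects the adaptive trails built from that Eulerian trail."""
    return min(candidates, key=lambda c: lambda_metric(c[1], c[2]))[0]
```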
4.7.3 Uniform Traffic

In order to fully understand the effect of topologies and routing algorithms on network performance under uniform traffic, we have measured the network performance of all six network topologies with the fixed message size distribution and with the bursty message size distribution.

Figure 4.8 shows the performance curves for the fixed message size distribution. It is observed that most performance curves reach their maximum sustained throughput under a certain workload; beyond that point, the throughput decreases while the latency keeps increasing as the workload is further increased. The curves that do not show this behavior would do so if a smaller x were applied. Because channel contention is very serious under such over-saturated workloads, many messages have to wait a long time before they are able to use free channels, which causes higher transmission latency and lower network throughput. In a real network, a good flow control mechanism should be applied to avoid over-saturated workloads. The same range of x is used for all topologies except E, where three additional smaller values of x are applied in order to show the trend of the performance.

Figure 4.8: Performance under uniform traffic with the fixed message size distribution (one panel per topology A–F, plotting average latency versus network throughput in bytes per time unit for ATR, SRA, SRD, UDR, and MUDR; unfair points are marked)

The Effect of Topologies

As shown in Figure 4.8, the network performance varies across the different topologies. If we consider both the average latency and the network throughput, A is probably the best choice among the six networks, although E has higher throughput for some algorithms under very heavy workload and delivers the best performance for the UDR algorithm. The reasons that A delivers better performance are manifold. First, A has the shortest average path length. Second, A's channels are evenly distributed, and more than one shortest path exists between most switch pairs; channels are thus shared among traffic flows, and most traffic can use a shortest path to deliver messages, which reduces channel contention under heavy workload. Although B and C have the same number of trunk channels per switch as A, they both perform worse than A. There is no simple explanation for why B and C perform worse; besides their longer average path lengths, one fact is that B and C both have a bisection of only six channels (the bisection is defined as the minimum number of channels that must be removed to cut a network into two disjoint subnetworks with an equal number of switches in each).
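For networks of this size, the bisection just mentioned can be computed by brute force: enumerate every balanced split of the switches and count the trunk channels that cross it. A small sketch follows; it is fine for the 10-switch topologies considered here but clearly not scalable to large networks, and the function name is illustrative.

```python
from itertools import combinations

# A brute-force sketch of the bisection computation: try every split of the switches
# into two equal halves and count the crossing trunk channels; the minimum over all
# splits is the bisection.

def bisection(switches, channels):
    """switches: list of switch ids (even count); channels: list of (u, v) trunk links."""
    half = len(switches) // 2
    best = None
    for group in combinations(switches, half):
        side = set(group)
        crossing = sum(1 for u, v in channels if (u in side) != (v in side))
        if best is None or crossing < best:
            best = crossing
    return best
```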
F does not show surprising behavior given its topology and number of trunk channels; note that switch 9 in F has only two trunk channels, which may become a bottleneck. An interesting fact is that E has relatively good performance, especially under heavy workload, although it has only 17 trunk channels. In E, most of the shortest paths can be allowed by a routing algorithm without causing deadlock. Thus, the average path length of routing in E can be very close to the average path length of the network, which may not be true for the other five topologies. Moreover, each channel in E is used to route messages only between a restricted set of pairs; for example, a channel v → 0 may be used only for messages from switch v to another switch. Such a feature is likely to provide better flow control and reduce channel contention under heavy workload. Although D also has a central switch connected to all other switches, it shows different performance curves than E. In D, the channels between the central switch and the non-central switches are heavily used. Although there are extra channels between non-central switches in D, those extra channels do little to increase throughput because they are not well shared among traffic. This unbalanced channel utilization causes the poor performance of D.

Adaptive Routing vs. Deterministic Routing

It is interesting to compare the performance of the SRA and the SRD. Both are developed with the same philosophy and are designed to balance channel utilization under uniform traffic. For all six topologies, the SRA achieves lower latency under light workload but suffers worse performance under heavy workload. This is not a surprising observation, because adaptive routing can provide alternative routing paths and take advantage of multiple paths between pairs under light workload. However, adaptive routing may take a longer path when shorter paths are not available; this misrouting can cause more channel contention under heavy workload and therefore result in poor performance. On the other hand, deterministic routing, especially traffic-balanced deterministic routing such as the SRD, always selects a unique, short path for a source-destination pair. Thus, a message has to wait on a busy channel even while a free alternative path exists, which results in higher latency under light workload. Under heavy workload, this disadvantage disappears, because all channels are busy transmitting and a shorter path is the best choice.

The Effect of Routing Algorithms

Although these routing algorithms behave differently under different topologies, there is still common behavior for each algorithm. As mentioned above, the SRD provides the best performance under heavy workload for most topologies. With the computation time spent on balancing channel utilization, the SR achieves good performance in most of the uniform-traffic cases. The ATR is much simpler than the SR, but it provides performance very close to (and sometimes better than) the SR in all topologies except E. The good performance of the ATR is because it uses Eulerian trails and shortcuts to provide reasonable, shorter paths. The reason for the worse performance in E lies in the way deadlock is avoided: the ATR uses two opposite Eulerian trails to avoid deadlock.
For E, some longer routing paths along the Eulerian trails have to be kept in order to avoid deadlock. Those longer paths consume more channel resources and may cause poor performance.

The UDR provides worse performance in most topologies except E. First, it may concentrate traffic near the root switch; if the root switch does not have enough trunk channels to carry this traffic, serious contention causes poor performance, which is the case for A, B, C, and F. Second, the UDR allows any routing path that never uses an 'up' channel after using a 'down' channel, which may permit very long routing paths. For example, in E, 3 → 4 → 5 → 7 → 0 may be a legal path if switch 0 is the root and 4 → 5 and 5 → 7 are considered 'up' channels. It is this second reason that makes the performance difference between the MUDR and the UDR. The reason that the UDR performs very well in E is that it avoids both of the above shortcomings there.

Fairness

It is observed that the fairness ratio of some routing algorithms may exceed 2 once the network traffic becomes heavier than a certain point. Fairness is affected by both the topology and the routing algorithm. For A, B, C, and F, the UDR shows a higher fairness ratio than the SR and the ATR under the same workload, because the UDR favors the nodes at the root switch. In terms of the topologies, the fairness ratios of A and E are seldom over 2 compared with the other topologies under the same workload.

When the network is overloaded, it may lose fairness for terminal nodes and suffer from decreasing performance. Thus, we should keep the network traffic below the over-saturated workload. Wormhole switching can partially control the network workload: when the workload reaches the saturation region, where increased latency and decreased throughput are observed, the serious channel contention blocks messages and eventually prevents the injection of more messages in our finite-source model.

Bursty Messages

As shown in Figure 4.9, the performance comparison under bursty messages is similar to that under uniform traffic with fixed-size messages, because both have the same destination distribution. When long messages are transmitted in a network, some channels are occupied by these long messages and block many other messages, which causes higher network latency and lower network throughput for all topologies. The SRD incurs higher latency because a message may have to wait a long time on a busy channel while that channel is occupied by a long message.
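The workload model behind these curves, the exponentially distributed inter-generation time x from Section 4.7 combined with the two message-size distributions of Section 4.7.1, is easy to state as a small generator. In the sketch below, the 5% burst rate and the 100/1,000/10,000-byte sizes come from the text, while the function names and the choice of mean value in the example are illustrative assumptions.

```python
import random

# A small sketch of the workload model used in these experiments: after a message
# has completely left the source buffer, the next message is generated after an
# exponentially distributed delay x; message sizes follow either the fixed-size or
# the bursty distribution described in Section 4.7.1.

def next_interarrival(mean_x):
    return random.expovariate(1.0 / mean_x)      # smaller mean_x -> heavier workload

def message_size(distribution):
    if distribution == "fixed":
        return 1000                              # bytes
    if distribution == "bursty":
        return 10000 if random.random() < 0.05 else 100
    raise ValueError(distribution)

# Example: one node's next message under the bursty distribution.
print(next_interarrival(mean_x=200.0), message_size("bursty"))
```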
Figure 4.9: Performance under uniform traffic with the bursty message size distribution (one panel per topology A–F, average latency versus network throughput in bytes per time unit for ATR, SRA, SRD, and UDR/MUDR; unfair points are marked)

Buffer Size

In order to demonstrate the effect of buffer size on performance, we also measure the performance with a larger buffer capacity of 1000 bytes. As shown in Figure 4.10, the larger buffer capacity does improve the performance of all the algorithms under uniform traffic with both fixed-size messages and bursty messages. This is because a larger buffer holds more bytes and reduces the number of channels occupied by a blocked message. Note that the UDR and the SR gain more performance benefit from the larger buffer than the ATR, especially for bursty traffic. This is due to the empty-buffer requirement of the ATR: for the bursty message distribution, a 1000-byte buffer may hold up to ten 100-byte messages for the UDR and SR, while these ten messages have to occupy ten different channels for the ATR. Such an improvement can be observed, more or less, in the other topologies as well. Further simulation results show that the benefit of larger buffer space is no longer significant once the buffer is big enough to hold the maximum-size message.

Figure 4.10: Effect of buffer size under uniform traffic (topology B; average latency versus network throughput for ATR, UDR, SRA, and SRD with 100-byte and 1000-byte buffers, under fixed-size and bursty messages)

4.7.4 Client/Server Traffic

The performance under client/server traffic is shown in Figure 4.11. Let us look at Figure 4.6 again. Four servers are assumed in each network shown in Figure 4.6, where two server nodes are located at each of the shaded switches with the wider border and one server node is located at each of the shaded switches with the thinner border.

Under client/server traffic, the routing algorithms may show different performance behavior than under uniform traffic. For example, the ATR has the worst performance under uniform traffic in E, but it provides better performance than the UDR and the SRA in the selected client/server model. The reason is that many messages in such a client/server model go to switches 0 and 9, and the ATR does not allow very long routing paths for these two destinations; thus, messages to 0 and 9 do not waste channel utilization on long paths. This observation shows that we have to consider different traffic patterns and workloads when we evaluate the performance of a network.

Figure 4.11: Performance under client/server traffic (one panel per topology A–F, average latency versus network throughput for ATR, SRA, SRD, and UDR/MUDR)

The Effect of Topologies

In a client/server traffic pattern, the output contention of the servers is the main bottleneck for performance. In addition, the trunk channels near the servers are contended for among messages. E provides the best performance in our selected client/server model, because its topology favors messages to switches 0 and 9 and performs well when most messages are destined to switches 0 and 9. However, output contention limits its performance. How to reduce output contention is a very important issue under client/server traffic. One way is to use multiple ports for the servers, but then multiple network adapters and network addresses are needed, which is a complicated design in practice.

The Effect of Routing Algorithms

As shown in Figure 4.11, the performance difference among the four algorithms under client/server traffic is not as significant as that under uniform traffic shown in Figure 4.8. This is due to the limitation of output contention. An interesting observation is that the SRD delivers the best performance for E but the worst performance for D. This is due to the locations of the servers: the SRD is designed for uniform traffic, and it cannot adjust its routing to the traffic pattern because of its deterministic nature. Thus, it may cause poor performance when it cannot deal with a specific traffic pattern. The SRA does not show such extreme behavior although it is also designed for uniform traffic, because the SRA has more flexibility to deal with various traffic patterns due to its adaptive nature.
Figure 4.12: Effect of server locations and number of servers (average latency versus network throughput under client/server traffic with fixed-size messages; panel A compares SRD under models 1-3, panel D compares ATR under models 1-3).

For clients located at the same switch as one of the servers, messages from them to that server do not suffer from trunk channel contention, and so more messages are sent out from these clients.

Buffer Size

The effect of buffer size under client/server traffic is shown in Figure 4.13. There is no big difference between small and large buffer capacities. This is because the output contention and the channel contention near the servers are the main bottlenecks in such an environment. To improve performance under client/server traffic, output contention should be addressed, and multiple ports may be used for a server to reduce it.

Figure 4.13: Effect of buffer size under client/server traffic (average latency versus network throughput for topology D; curves compare routing algorithms with different buffer sizes).

Chapter 5

Network Planning and Tuning in Switch-Based LANs

This chapter studies network planning and tuning issues in switch-based LANs [77]. Our purpose is not to find an optimal solution for network design, which can be an NP-complete problem [20]. Instead, we propose a simple method for initial topology design and an incremental method for network tuning and reconfiguration after the network becomes operational. To provide low message latency and high network throughput, our criteria are to minimize routing path length, maximize traffic locality, reduce inter-switch traffic, and balance channel utilization. For the initial topology design, we use block design to efficiently minimize routing path length by covering as many switch pairs as possible in blocks. For incremental reconfiguration and network tuning, we propose a systematic method to deal with various situations in a network.

In this chapter, we first describe general considerations on topology design and then present the design variables and analytic models used in our method for network planning and tuning. A simple method based on block design is proposed for the initial topology design, where the number of switches, the number of computers, and the estimated traffic pattern are given. We also propose solutions to handle various changes in the network environment and use simulation results to demonstrate the performance improvement after network tuning. Finally, we present some preliminary results for the evaluation of our topology design algorithm.
5.1 Design Considerations of Switch-Based Networks

A switch-based network consists of a number of switches. Each switch has a certain number of ports and forwards packets from input ports to output ports. Switches are interconnected via their ports. A node, which can be a workstation, a multi-processor system, or a gateway to another network, also connects to a switch port. Although there are different switching techniques (e.g., store-and-forward and cut-through), our method does not assume any specific switching technique.

The topology design and node placement in a switch-based network can significantly influence network performance. To achieve good performance, we need to consider the LAN traffic characteristics mentioned in Chapter 2.5. Traffic locality, workgroup traffic, and client/server traffic should be exploited in the design of switch-based networks.

5.1.1 Routing Path Length

In general, a high-performance network should have as low an average path length as possible. Longer paths mean higher transmission latency and more channel bandwidth consumed per packet sent, resulting in lower network throughput and higher message latency in most cases. Thus, when designing a network topology, a primary consideration should be minimizing the average path length and the maximum path length.

Given the number of switches, the number of ports per switch, and the required number of nodes, it is in general a difficult problem to find a network with minimal average path length. However, some basic ideas make the problem easier. In general, each channel should run between a unique pair of switches; extra channels between a particular pair of switches do not decrease the path length, and more performance can almost always be gained by connecting two otherwise unconnected switches. Connecting two distant switches with a new channel is usually a better idea than connecting two nearby switches. In addition to minimizing average path length, we have to ensure that there is no obvious bottleneck. Such a bottleneck might be a single switch connected by only one or two channels; in general, the channel connections should be spread fairly uniformly among the switches. We also have to ensure that the bisection (defined as the minimum number of channels that must be removed to cut the network into two disjoint subnetworks with an equal number of switches in each) is not too small, because such a 'channel cut' is usually a performance bottleneck.

5.1.2 Local Traffic, Remote Traffic, and Inter-Switch Traffic

We define traffic within a switch as 'local traffic' and traffic between switches as 'remote traffic'. 'Inter-switch traffic' denotes the traffic between any two switches. To utilize channel bandwidth efficiently, an important design principle is to maximize local traffic and minimize remote traffic. Exploiting traffic locality benefits intra-workgroup communication and efficiently reduces the average path length. A small average path length usually means less traffic passing through intermediate switches and more efficient utilization of the network bandwidth. Therefore, if a significant fraction of the traffic is transmitted among a set of nodes, we should consider placing them on the same switch if possible, or on two switches connected by at most one hop.
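To make the path-length criterion concrete, the short sketch below (our own illustration, not code from the dissertation; the names adj and path_length_stats are hypothetical) estimates the average and maximum switch-to-switch path lengths of a candidate topology with breadth-first search, assuming shortest-path, hop-count routing.

    #include <vector>
    #include <queue>

    // Estimate average and maximum switch-to-switch path lengths (in hops)
    // for a topology given as adjacency lists over the switches.
    void path_length_stats(const std::vector<std::vector<int> >& adj,
                           double& avg, int& max_len) {
        int s = adj.size();
        long long sum = 0, pairs = 0;
        max_len = 0;
        for (int src = 0; src < s; ++src) {
            std::vector<int> dist(s, -1);
            std::queue<int> q;
            dist[src] = 0; q.push(src);
            while (!q.empty()) {                      // plain BFS over switches
                int u = q.front(); q.pop();
                for (size_t n = 0; n < adj[u].size(); ++n) {
                    int v = adj[u][n];
                    if (dist[v] < 0) { dist[v] = dist[u] + 1; q.push(v); }
                }
            }
            for (int dst = 0; dst < s; ++dst)
                if (dst != src && dist[dst] > 0) {
                    sum += dist[dst]; ++pairs;
                    if (dist[dst] > max_len) max_len = dist[dst];
                }
        }
        avg = pairs ? (double)sum / pairs : 0.0;
    }

Running such a check before and after a candidate change (for example, adding a channel between two distant switches) gives a quick estimate of whether the change actually shortens paths.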
5.1.3 Balance Channel Utilization

The message delay T_k on channel k can be estimated with a standard M/M/1 queuing model:

    T_k = (m * t_k) / (1 - u_k)    (5.1)

where t_k is the time required to transmit a byte on channel k, m is the average message size in bytes, and u_k is the utilization of channel k. Equation 5.1 shows that T_k becomes very large as u_k approaches 1. Our goal is to keep max(u_k) as small as possible, because the channel with maximum utilization is usually the traffic bottleneck.

To reduce max(u_k), it is important to reduce the total inter-switch traffic and to balance channel utilization. There are three ways to balance traffic: change the topology, change the routing, and balance the inter-switch traffic. Changing the topology, which should be integrated with the routing algorithm, usually needs many iterations to reach a near-optimal solution. On the other hand, it is usually efficient to redirect traffic from overloaded channels to under-loaded channels by taking alternate routing paths and balancing inter-switch traffic.

5.2 Model Description

This section considers the basic parameters and derives the design variables for topology design and incremental reconfiguration. To design network topologies, fixed routing is generally assumed because it makes it easy to evaluate channel flows as a function of the routing tables and the traffic distribution [17]. Although we assume fixed routing algorithms to simplify our analysis, our ideas can also be applied to adaptive routing algorithms, which provide routing flexibility and can adjust network utilization.

5.2.1 Basic Parameters

The design of an appropriate network topology can be a complex undertaking that involves several basic parameters. We define the following given parameters:

- N: the total number of nodes in the network.
- S (or s): the number of switches.
- c_i: the number of ports in switch i.
- B_k: the bandwidth (bytes/sec) of channel k (each unidirectional channel has a unique id k).

To design or tune a network topology, we define two traffic distribution matrices to represent the traffic patterns and workloads in the corresponding network. Specifically, we define an N x N traffic matrix X, where x_{u,v} represents the traffic rate (bytes/sec) from node u to node v, and an S x S traffic matrix Y, where y_{i,j} represents the traffic rate (bytes/sec) from switch i to switch j. X and Y reflect both the message generation rate and the message size, and they may use the peak rate, the average rate, or a value in between, depending on the design requirement. Note that Y is derived from X, and y_{i,j} is the inter-switch traffic rate from switch i to switch j. The offered workload W of the network is

    W = sum_{u=1..N} sum_{v=1..N} x_{u,v} = sum_{i=1..S} sum_{j=1..S} y_{i,j}    (5.2)

As mentioned in Section 2, before a network is operational the traffic distribution can only be estimated from knowledge of the expected network applications, which may not be accurate. After the network is operational, the traffic distribution may be measured by monitoring network traffic, which should be close to the real traffic.
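The following fragment is a small illustration (ours, not SimNet code) of how the quantities just defined fit together: it aggregates the node-level matrix X into the switch-level matrix Y, computes the offered workload W of Equation 5.2, and applies the M/M/1 delay estimate of Equation 5.1. The names derive_Y, offered_workload, channel_delay, and switch_of are assumptions made for the example.

    #include <vector>
    typedef std::vector<std::vector<double> > Matrix;

    // Aggregate the N x N node matrix X into the S x S switch matrix Y.
    Matrix derive_Y(const Matrix& X, const std::vector<int>& switch_of, int S) {
        Matrix Y(S, std::vector<double>(S, 0.0));
        for (size_t u = 0; u < X.size(); ++u)
            for (size_t v = 0; v < X.size(); ++v)
                Y[switch_of[u]][switch_of[v]] += X[u][v];
        return Y;
    }

    // Offered workload W of Equation 5.2 (identical whether summed over X or Y).
    double offered_workload(const Matrix& X) {
        double W = 0.0;
        for (size_t u = 0; u < X.size(); ++u)
            for (size_t v = 0; v < X.size(); ++v)
                W += X[u][v];
        return W;
    }

    // Per-channel delay estimate of Equation 5.1: T_k = m * t_k / (1 - u_k).
    double channel_delay(double t_k, double m, double u_k) {
        return (u_k < 1.0) ? (m * t_k) / (1.0 - u_k) : 1e30;  // unbounded as u_k -> 1
    }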
5.2.2 Design Variables

For the purpose of analysis, the following design variables are defined:

- n_i: the number of nodes in switch i, with N = sum_{i=1..S} n_i.
- m_i: the number of channels in switch i, with n_i + m_i <= c_i.
- M: the total number of bidirectional channels in the network, M = (sum_{i=1..S} m_i) / 2.
- p_{u,v}: the length of the routing path from source node u to destination node v in hops, defined as the number of channels traversed by the path. For nodes u and v in the same switch, the path length between them is defined as 0. Note that path length should also be weighted by channel bandwidth: a higher-bandwidth channel contributes less path length.
- p̂: the maximum path length, i.e., max(p_{u,v}).
- p̄: the arithmetic average of all p_{u,v}, i.e., p̄ = (sum_{u=1..N} sum_{v=1..N} p_{u,v}) / (number of (u,v) pairs).
- P̄: the average path length weighted by the amount of traffic between each pair, defined as P̄ = (sum_{u=1..N} sum_{v=1..N} p_{u,v} * x_{u,v}) / W. Note that p̄ and P̄ are not equal unless the traffic is uniform; P̄ reflects both path length and traffic distribution more accurately.
- f_k: the average rate of traffic flow on channel k.
- ρ_k: the utilization of channel k, calculated as f_k / B_k.

Given a network topology and a fixed routing algorithm, it is easy to compute the route and its path length for each source-destination pair. Because of the bandwidth requirement, we have P̄ * W <= sum_k B_k. Therefore, the upper bound on W is

    W <= (sum_k B_k) / P̄    (5.3)

Equation 5.3 shows that a smaller P̄ results in a larger upper bound on W. Note that this upper bound is obtained without considering channel contention, switch delay, or software latency. It gives us a metric to evaluate the throughput of a network in terms of offered load, but it can hardly be reached in a real network.

One of the design criteria is to maximize traffic locality. For the purpose of design and analysis, we define 'local traffic' L as the total traffic transmitted within switches and 'remote traffic' R as the total traffic transmitted between switches. Formally, L and R are defined in Equations 5.4 and 5.5. Note that R is weighted by routing path length, which is more reasonable than a simple sum because a longer routing path consumes more channel bandwidth per packet sent. Increasing L and reducing routing path length both decrease R.

    L = sum_{u=1..N} sum_{v=1..N} x_{u,v}, where u and v are in the same switch    (5.4)

    R = sum_{u=1..N} sum_{v=1..N} x_{u,v} * p_{u,v}, where u and v are not in the same switch    (5.5)

To express the fraction of W that is local, we define the 'traffic locality ratio' α = L / W. However, simply maximizing α may cause unbalanced inter-switch traffic and unbalanced channel utilization, even though a bigger α benefits local traffic. We should also balance inter-switch traffic and avoid channel bottlenecks caused by large inter-switch traffic. For example, a group of nodes may be placed in switch i to maximize α. Meanwhile, these nodes also communicate with a server in switch j, but they cannot be placed in switch j due to the limited number of available ports. If the inter-switch traffic y_{i,j} contributes a significant fraction of the traffic, the path from switch i to switch j may become overloaded and turn into a traffic bottleneck. In such a case, we may redistribute some nodes from switch i to other switches, resulting in a smaller α but more balanced channel utilization and better overall network performance. The criterion is which is bigger: the traffic among these nodes or the traffic from these nodes to the server.

Given X and Y, we can compute the traffic flow on each channel for a given topology and its routing algorithm. Our choice of a fixed routing algorithm makes this calculation unambiguous. A decision variable π_{i,j,k} and the traffic flow f_k are calculated as follows:

    π_{i,j,k} = 1 if the route from switch i to switch j passes channel k, and 0 otherwise    (5.6)

    f_k = sum_{i=1..S} sum_{j=1..S} (y_{i,j} * π_{i,j,k})    (5.7)

The calculations of f_k and ρ_k help us identify traffic bottlenecks under the current network configuration. The channel with max(ρ_k) is usually the bottleneck channel. To improve network performance, we should minimize max(ρ_k). Besides balancing channel utilization by adjusting routing paths, we also need to reduce the total remote traffic and balance the inter-switch traffic.
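As an illustration of how these design variables can be computed in practice, the sketch below (our own code, not from the dissertation; all names are hypothetical) evaluates p̄, P̄, α, the bound of Equation 5.3, and the per-channel flows f_k and utilizations ρ_k of Equation 5.7, given the traffic matrices and the routes produced by a fixed routing algorithm.

    #include <vector>
    typedef std::vector<std::vector<double> > Matrix;

    struct ChannelStats { std::vector<double> f, rho; };

    // f_k and rho_k of Equation 5.7, assuming route[i][j] lists the channel ids
    // on the fixed route from switch i to switch j (pi_{i,j,k} = 1 for those k).
    ChannelStats channel_flows(const Matrix& Y,
                               const std::vector<std::vector<std::vector<int> > >& route,
                               const std::vector<double>& B) {
        ChannelStats cs;
        cs.f.assign(B.size(), 0.0);
        cs.rho.assign(B.size(), 0.0);
        for (size_t i = 0; i < Y.size(); ++i)
            for (size_t j = 0; j < Y.size(); ++j)
                for (size_t n = 0; n < route[i][j].size(); ++n)
                    cs.f[route[i][j][n]] += Y[i][j];      // f_k = sum y_ij * pi_ijk
        for (size_t k = 0; k < B.size(); ++k)
            cs.rho[k] = cs.f[k] / B[k];                   // rho_k = f_k / B_k
        return cs;
    }

    // p-bar, P-bar, alpha, and the throughput bound of Equation 5.3.
    void design_metrics(const Matrix& X, const Matrix& p,
                        const std::vector<int>& switch_of, double B_total,
                        double& p_bar, double& P_bar, double& alpha, double& W_bound) {
        double W = 0, L = 0, sum_p = 0, sum_px = 0;
        long pairs = 0;
        for (size_t u = 0; u < X.size(); ++u)
            for (size_t v = 0; v < X.size(); ++v) {
                if (u == v) continue;
                W += X[u][v];
                sum_p += p[u][v]; ++pairs;
                sum_px += p[u][v] * X[u][v];
                if (switch_of[u] == switch_of[v]) L += X[u][v];   // local traffic
            }
        p_bar = sum_p / pairs;
        P_bar = sum_px / W;
        alpha = L / W;
        W_bound = B_total / P_bar;                         // Equation 5.3
    }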
For convenience, the important symbols used in this chapter are listed in Table 5.1.

α        traffic locality ratio
B_k      bandwidth of channel k
λ        minimum number of occurrences of a pair in a block design
B_i      a block in a block design
c, c_i   number of ports in a switch
f_k      traffic flow rate (bytes/s) of channel k
L        local traffic within switches
M        total number of bidirectional channels
m_i      number of channels in switch i
N        number of nodes
n_i      number of nodes in switch i
p_{u,v}  routing path length from node u to node v
p̂        maximum routing path length
p̄        average routing path length (not considering traffic)
P̄        average routing path length (considering traffic)
R        remote traffic between switches
s, S     number of switches (objects)
ρ_k      channel utilization
X        traffic distribution matrix between nodes
x_{u,v}  traffic rate from node u to node v
Y        traffic distribution matrix between switches
y_{i,j}  inter-switch traffic rate from switch i to switch j

Table 5.1: List of important symbols in Chapter 5

5.3 Initial Topology Design

In this section, we propose a simple and efficient method for initial topology design. As mentioned earlier, a primary consideration for topology design is minimizing both the maximum and the average path length. With shorter routing paths, fewer channel resources are consumed per packet sent, resulting in lower message latency and higher network throughput. To design a topology with short routing paths, we adopt the idea of block design, which was previously suggested for multiprocessor networks [78] and switch-based LANs [79].

5.3.1 From Block Design to Network Design

Combinatorial design provides methodologies for distributing a set of objects into blocks that satisfy certain properties. One method is the balanced incomplete block design (BIBD), which distributes s objects into blocks of size d such that (1) each pair of objects occurs in exactly λ blocks, (2) each object occurs in r blocks, and (3) there are b blocks. A BIBD with s = b and r = d is called symmetric and can be denoted by (s, r, λ). For example, a symmetric BIBD (7, 3, 1) has seven objects 0, 1, ..., 6 and seven blocks B_0, B_1, ..., B_6, as shown in Table 5.2, where each block has three objects.

B_0 = (0 1 3)    B_4 = (4 5 0)
B_1 = (1 2 4)    B_5 = (5 6 1)
B_2 = (2 3 5)    B_6 = (6 0 2)
B_3 = (3 4 6)

Table 5.2: Blocks of the symmetric BIBD (7, 3, 1)

To design a switch-based network from the block design, we use the following mapping between the blocks and the network:

1. Each object corresponds to a switch, and there are s switches in total.
2. Each block defines a path for its objects (switches). For example, B_0 determines the bidirectional path 0 <-> 1 <-> 3.
3. All paths within blocks are connected to construct a single bidirectional trail, which serves as an Eulerian trail of the network (a trail that traverses each channel once and exactly once). The network topology is then decided by the Eulerian trail. For the block design in Table 5.2, the Eulerian trail and topology are shown in Figure 5.1.
4. The average path length p̄ and the maximum path length p̂ depend on d.
5. The degree of each switch is 2r - 2.
6. The number of routing paths between a switch pair is related to λ.

B_0: 0 <-> 1 <-> 3;  B_1: 1 <-> 2 <-> 4;  B_2: 2 <-> 3 <-> 5;  B_3: 3 <-> 4 <-> 6;
B_4: 4 <-> 5 <-> 0;  B_5: 5 <-> 6 <-> 1;  B_6: 6 <-> 0 <-> 2;
Eulerian trail: 0 <-> 1 <-> 3 <-> 4 <-> 6 <-> 0 <-> 2 <-> 3 <-> 5 <-> 6 <-> 1 <-> 2 <-> 4 <-> 5 <-> 0

Figure 5.1: From block design to the network
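The mapping above translates directly into code. The sketch below (illustrative only; none of these names come from the dissertation or SimNet) turns a list of blocks into the bidirectional channel list of the resulting topology by connecting consecutive objects inside each block, which is exactly how B_0 = (0 1 3) yields the path 0 <-> 1 <-> 3.

    #include <vector>
    #include <utility>

    // Build the channel (edge) list of the network induced by a set of blocks:
    // consecutive objects in a block are joined by a bidirectional channel.
    std::vector<std::pair<int,int> >
    blocks_to_channels(const std::vector<std::vector<int> >& blocks) {
        std::vector<std::pair<int,int> > channels;
        for (size_t b = 0; b < blocks.size(); ++b)
            for (size_t j = 0; j + 1 < blocks[b].size(); ++j)
                channels.push_back(std::make_pair(blocks[b][j], blocks[b][j + 1]));
        return channels;
    }
    // For the blocks of Table 5.2 this produces the 14 channels of Figure 5.1,
    // a network in which every switch has degree 2r - 2 = 4.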
Given the paths within the blocks and the Eulerian trail, it is easy to design a routing algorithm. One way is to route messages along one of the two unidirectional Eulerian trails. A message routed from switch i to switch j can take the path in the corresponding block, because each pair occurs in exactly one block; thus, the maximum routing path length is d - 1. Shortcuts may be used to provide shorter routing paths. We can also adopt existing routing algorithms such as the Eulerian trail routing in [21] or the virtual ring routing in [79]. Note that we do not consider switching techniques here. For wormhole switching, special considerations may be needed to avoid deadlock [72, 42].

5.3.2 Construction of Blocks

In a topology design, we are given the number of switches (s), the number of ports per switch (c_i), and the required number of nodes (N). To simplify the description, we assume that all c_i are equal to c. The average degree k of the switches can be calculated from the port budget cs >= 2M + N, where M is the number of bidirectional channels (so k = 2M/s). Given k, we have r = ceil(k/2) + 1.

Combinatorial design theory provides block design constructions for special parameters. For symmetric BIBDs, an important property is:

Property 1: An (s, r, 1)-symmetric block design exists when (r - 1) is a power of a prime number and s = r^2 - r + 1.

This property means that we may not be able to find a symmetric BIBD for arbitrary s and r. Since r determines the degree 2r - 2 of each switch, it should not be very large in our design. It is pointed out in [79] that the set of integers s for which a constructive symmetric design with λ = 1 and 3 <= r <= 20 exists is limited to {7, 13, 21, 31, 57, 73, 91, 133, 273, 307, 381}. To design a network for any reasonable s and r, we introduce the following definitions as variations of the symmetric block design.

Definition 5 (Basic block design, BBD): In a basic block design (s, r, >= λ), there are s objects and s blocks, each block has r objects, each object occurs in r blocks, and each pair of objects occurs in at least λ blocks. Note that λ may be 0 and that the numbers of occurrences of different pairs may differ.

A BBD (14, 3, >= 0) is shown in Table 5.3, where some pairs occur in one of the blocks but other pairs (e.g., 5 and 10) occur in none.

B_0 = (0 1 3)     B_7 = (7 8 10)
B_1 = (1 2 4)     B_8 = (8 9 11)
B_2 = (2 3 5)     B_9 = (9 10 12)
B_3 = (3 4 6)     B_10 = (10 11 13)
B_4 = (4 5 7)     B_11 = (11 12 0)
B_5 = (5 6 8)     B_12 = (12 13 1)
B_6 = (6 7 9)     B_13 = (13 0 2)

Table 5.3: Blocks of BBD (14, 3, >= 0)

Definition 6 (Combined block design, CBD): A combined block design (s, r_0 + r_1 + ... + r_{g-1}, >= λ) is a combination of g BBDs (s, r_0, >= λ_0), (s, r_1, >= λ_1), ..., (s, r_{g-1}, >= λ_{g-1}), where all the BBDs have the same set of objects. In the CBD there are s objects and gs blocks. Each pair occurs in at least λ blocks, and obviously λ >= λ_i for each i.

A CBD (14, 4+2, >= 1) is shown in Table 5.4; it consists of two BBDs, (14, 4, >= 0) and (14, 2, >= 0). For a CBD, the Eulerian trail is constructed by integrating the Eulerian trails of the individual BBDs, and the switch degree is 2r - 2g. If λ >= 1, the maximum routing path length is max(r_i) - 1. In the network derived from Table 5.4, the switch degree is 8 and the maximum path length is 4 - 1 = 3.

Given s and r, we use the following method to construct a BBD. Switches are represented by the numbers 0, 1, 2, ..., s - 1.
Suppose B_0 is (b_0, b_1, ..., b_{r-1}), where each b_i represents a switch number. All the other blocks are constructed so that block B_i is ((b_0 + i) mod s, (b_1 + i) mod s, ..., (b_{r-1} + i) mod s) for 0 < i < s. This kind of block design has a cyclic automorphism, which permutes the objects of all blocks in a cycle of length s. Therefore, once B_0 is decided, all the other blocks are determined. Whether or not each pair occurs in one of the blocks, messages can be routed along the Eulerian trail; however, a message may have to travel through more than one block if its source and destination switch pair does not occur in any block, in which case the maximum path length may be greater than the block size r.

B_{0,0} = (0 1 3 7)      B_{1,0} = (0 5)
B_{0,1} = (1 2 4 8)      B_{1,1} = (1 6)
B_{0,2} = (2 3 5 9)      B_{1,2} = (2 7)
B_{0,3} = (3 4 6 10)     B_{1,3} = (3 8)
B_{0,4} = (4 5 7 11)     B_{1,4} = (4 9)
B_{0,5} = (5 6 8 12)     B_{1,5} = (5 10)
B_{0,6} = (6 7 9 13)     B_{1,6} = (6 11)
B_{0,7} = (7 8 10 0)     B_{1,7} = (7 12)
B_{0,8} = (8 9 11 1)     B_{1,8} = (8 13)
B_{0,9} = (9 10 12 2)    B_{1,9} = (9 0)
B_{0,10} = (10 11 13 3)  B_{1,10} = (10 1)
B_{0,11} = (11 12 0 4)   B_{1,11} = (11 2)
B_{0,12} = (12 13 1 5)   B_{1,12} = (12 3)
B_{0,13} = (13 0 2 6)    B_{1,13} = (13 4)

Table 5.4: Blocks of CBD (14, 4+2, >= 1)

To cover as many pairs as possible in a BBD, we use differences to construct B_0 from s and r. A difference of two switches is defined as the difference of their numbers. For s switches, there are s - 1 distinct nonzero differences: 1, 2, ..., s - 1. Note that any two switches i and j yield two s-complement differences: (i - j) mod s and (j - i) mod s. For example, 0 and 1 yield the differences 1 and s - 1 (i.e., -1 mod s). Due to the way a BBD is constructed, all B_i yield the same set of differences. We are interested in BBDs (s, r, >= 1), in which each pair occurs in at least one block so that the maximum path length is bounded by r - 1. We have the following lemma.

Lemma 7: Each pair occurs in at least one block of the corresponding BBD if and only if B_0 yields all differences 1, 2, ..., s - 1.

Proof: Step 1 ('if'): Consider a BBD (s, r, >= λ) in which B_0 yields all differences 1, 2, ..., s - 1. We show that each object (switch) pair occurs in at least one block of the BBD (i.e., λ >= 1). Consider any two objects i and j in B_0; without loss of generality, let j be greater than i. Let α = j - i and β = (i - j) mod s; then α + β = s. In the BBD we have s rows (blocks) and r columns. Consider the two columns determined by i and j, which are shown in Table 5.5.

i        j
i+1      j+1
...      ...
s-α-1    s-1
s-α      0
s-α+1    1
...      ...
s-1      α-1
0        α
1        α+1
...      ...
i-1      j-1

Table 5.5: The two columns determined by i and j

If we rotate the rows so that 0 and α form the first row, we obtain Table 5.6.

0        α
1        α+1
2        α+2
...      ...
s-α-1    s-1
s-α      0
s-α+1    1
...      ...
s-1      α-1

Table 5.6: The two columns determined by i and j (after rotation)

These two columns clearly cover all object (switch) pairs whose difference is α (e.g., j - i) and all object (switch) pairs whose difference is β = s - α (e.g., i - j). Therefore, given any difference x generated by two objects in B_0, all pairs whose differences are either x or s - x must occur in the BBD. Since B_0 is given to generate all differences, all pairs occur in the BBD.

Step 2 ('only if'): Suppose each object (switch) pair occurs in at least one block of the BBD (s, r, >= 1). We show that B_0 yields all differences 1, 2, ..., s - 1. Consider a difference α.
Since each object pair occurs in at least one block of the BBD, there must be two objects i and j in one block whose differences are α and s - α. Note that the differences i - j and j - i are s-complements. Because of the cyclic addition used to construct all blocks, every pair in the same two columns as i and j yields the same differences α and s - α; in particular, the pair in those columns of B_0 also yields the differences α and s - α. Therefore, for any α we can find two such objects i and j and the corresponding pair in B_0 whose differences are α and s - α.  []

A CBD can be created by combining the corresponding BBDs. Without loss of generality, suppose a CBD consists of two BBDs, BBD_0 and BBD_1. In BBD_0 (s, r_0, >= λ_0) we construct B_{0,0}, which yields some of the differences. In BBD_1 (s, r_1, >= λ_1) we construct B_{1,0}, which tries to yield as many of the remaining differences as possible. Ideally, B_{1,0} yields all the differences not yielded by B_{0,0}. If each difference is yielded by either B_{0,0} or B_{1,0}, each pair will occur in either BBD_0 or BBD_1. Examples of BBD and CBD constructions have been shown in Tables 5.2, 5.3, and 5.4. In Table 5.4, B_{0,0} of BBD_0 yields the differences 1, 13, 2, 12, 3, 11, 4, 10, 6, 8, and 7. Since the differences 5 and 9 are left, B_{1,0} is constructed to yield differences 5 and 9. This results in the CBD (14, 4+2, >= 1).

5.3.3 Construction of B_0

Because all the other blocks are determined by B_0, we only need to show how B_0 is constructed. Our objective is to yield as many distinct differences as possible in B_0; the more distinct differences B_0 yields, the more pairs the BBD covers. As mentioned earlier, each pair occurs in at least one block if and only if B_0 yields all differences. Block B_0 of a BBD (s, r, >= λ) yields r(r - 1) differences, which may or may not be mutually distinct depending on its objects. If some of the differences yielded by B_0 are duplicated, the total number of distinct differences is less than r(r - 1). Since there are s - 1 nonzero differences (modulo s) in total, we have the relationship shown in Table 5.7.

r               r(r - 1)             Yields all differences?
r <= sqrt(s)    < s - 1              no
r > sqrt(s)     can be >= s - 1      maybe or maybe not

Table 5.7: Relationship between block size and number of distinct differences

In general, a larger r yields more distinct differences in a block and more distinct pairs in the BBD. Given a big enough r, we can always find a block of size r that yields all distinct differences of s objects. However, r also determines the maximum length of the routing paths, so r should be as small as possible. Moreover, r is equal to ceil(k/2) + 1, where k is the average switch degree in the network, and the limit on the number of channels connected to a switch means that r may not be very big. Given s and r, a best solution for B_0, covering as many distinct differences as possible, can always be found by exhaustive search; however, such a search may take too much time for large s and r. Therefore, we propose a heuristic algorithm, sketched in Figure 5.2, to construct B_0. As shown in Table 5.7, to cover all pairs in a BBD, r must be greater than sqrt(s). We have run our algorithm for 3 <= s <= 400 and found the minimum r yielding all differences for each s. Although the algorithm considers only local optima, the results turn out to be quite good. As shown in Table 5.8, when s is less than or equal to 100, we can construct a network with an r of no more than ceil(sqrt(s)) + 2. A smaller r guarantees shorter routing paths between pairs.
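Before turning to the case where r exceeds these minima, the following sketch gathers the constructions of Section 5.3 in code. It is our own illustration: make_blocks implements the cyclic construction B_i = (B_0 + i) mod s, yields_all_differences is the test of Lemma 7, and greedy_B0 is only one plausible greedy realization of the heuristic referenced as Figure 5.2, not the dissertation's exact algorithm.

    #include <vector>

    // Cyclic construction: block B_i is B_0 shifted by i, modulo s.
    std::vector<std::vector<int> > make_blocks(const std::vector<int>& B0, int s) {
        std::vector<std::vector<int> > blocks(s);
        for (int i = 0; i < s; ++i) {
            blocks[i].resize(B0.size());
            for (size_t j = 0; j < B0.size(); ++j)
                blocks[i][j] = (B0[j] + i) % s;
        }
        return blocks;
    }

    // Number of distinct nonzero differences (mod s) yielded by a candidate block.
    int distinct_differences(const std::vector<int>& B, int s) {
        std::vector<bool> seen(s, false);
        int count = 0;
        for (size_t a = 0; a < B.size(); ++a)
            for (size_t b = 0; b < B.size(); ++b)
                if (a != b) {
                    int d = ((B[a] - B[b]) % s + s) % s;
                    if (!seen[d]) { seen[d] = true; ++count; }
                }
        return count;
    }

    // Lemma 7: every switch pair is covered iff B_0 yields all differences 1..s-1.
    bool yields_all_differences(const std::vector<int>& B0, int s) {
        return distinct_differences(B0, s) == s - 1;
    }

    // Greedy sketch in the spirit of Figure 5.2: start from b_0 = 0 and, for each
    // remaining position, pick the object that adds the most new differences
    // (a locally optimal choice, as discussed above).
    std::vector<int> greedy_B0(int s, int r) {
        std::vector<int> B0(1, 0);
        for (int pos = 1; pos < r; ++pos) {
            int best = -1, best_gain = -1;
            for (int cand = 1; cand < s; ++cand) {
                bool used = false;
                for (size_t j = 0; j < B0.size(); ++j) if (B0[j] == cand) used = true;
                if (used) continue;
                B0.push_back(cand);
                int gain = distinct_differences(B0, s);
                B0.pop_back();
                if (gain > best_gain) { best_gain = gain; best = cand; }
            }
            B0.push_back(best);
        }
        return B0;
    }

With this sketch, greedy_B0(7, 3) returns (0, 1, 3), which yields_all_differences confirms covers every pair, matching block B_0 of the symmetric BIBD (7, 3, 1) in Table 5.2.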
If r is allowed to be bigger than the value listed in Table 5.8 for a given s, a CBD can be constructed to reduce the path length. In most cases, r = ceil(sqrt(s)) + 1 can yield at least 85% of the differences, so it is easy to use another BBD with a smaller block size to yield the remaining differences; the two BBDs can then be combined into a CBD covering all pairs. There is a tradeoff between a BBD and a CBD. For the same s and r, a CBD has a smaller block size than a BBD, which gives shorter routing paths for pairs within blocks. If a CBD can cover all pairs, we may choose the CBD instead of a BBD because smaller path lengths are promised in a CBD. However, the number of pairs covered by a CBD is smaller than that covered by a BBD for the same s and r. Therefore, if r is not big enough to yield all differences in one block, we do not consider partitioning r for a CBD unless there is a good reason.

s          r
3-100      <= ceil(sqrt(s)) + 2
101-200    <= ceil(sqrt(s)) + 4
201-300    <= ceil(sqrt(s)) + 6
301-400    <= ceil(sqrt(s)) + 7

Table 5.8: Minimum r to yield all differences in a BBD (s, r, >= 1)

After all objects have been selected for B_0, the order of these objects may need to be reorganized in the following cases. Our purposes are to avoid double channels between any two switches and to reduce the maximum path length if possible. Since double channels between a switch pair do not reduce routing path length, we try to avoid them and save the channel for two otherwise unconnected switches, which results in shorter path lengths.

- If s is an even number, the two objects whose difference is s/2 should not be adjacent in B_0. Otherwise, there will be double channels between switches i and i + s/2, because the two s-complement differences yielded by the two switches are the same. For example, in Table 5.4, if block B_{0,0} were (0 7 1 3), where 0 and 7 have difference s/2 and are adjacent in the block, there would be double channels between switches 0 and 7, 1 and 8, ..., and 6 and 13.
- If more than one pair in B_0 yields the same differences, try to place one of these pairs at the two boundaries of B_0 so that the maximum path length is less than r - 1.
- If more than one pair in B_0 yields the same differences, try to place at most one of these pairs adjacent in B_0. Otherwise, there will be double channels between some switches.

The heuristic of Figure 5.2 takes s and r as input and outputs a block B_0 = (b_0, b_1, ..., b_{r-1}). It initializes b_0 = 0 and the set of uncovered differences DiffSet = {1, 2, ..., s - 1}, and then iterates, choosing the next object of B_0 at each step.

If ρ_k > δ, channel k is considered overloaded. For most designs, δ should be about 50-60%. We try to avoid overloaded channels by reducing the remote traffic R and balancing channel utilization. The algorithm is sketched in Figure 5.5 and includes the following four major steps; at any step, if no ρ_k is greater than δ, the algorithm can be stopped.

1. Increase the local traffic L and balance the inter-switch traffic between switch pairs. The method is to move or exchange node placements among switches, guided by the traffic distribution matrix. Note that maximizing L efficiently reduces the total remote traffic R.

2. Decrease the remote traffic R. Our method is to reduce the routing path length for switch pairs with larger inter-switch traffic by moving and/or adding channels. If our block design method is used for the initial topology design, these channel changes are easy to make: we can change the topology and the routing path lengths by rearranging the order of objects in blocks.
For example, if block (0 1 3 7) is changed to (0 3 1 7), the changed topology will have two more channels, between switches 0 and 3 and between switches 1 and 7, but lose two channels, between switches 0 and 1 and between switches 3 and 7. We make these channel changes if the inter-switch traffic y_{0,3} + y_{1,7} is more than y_{0,1} + y_{3,7}, because the remote traffic will then be reduced. Since we use an Eulerian trail to design the network topology and the routing algorithm, we do not change boundary objects in this step; otherwise, the maximum path length may increase. Moreover, we try to avoid double channels between any two switches, because one of the double channels can be saved to connect two otherwise unconnected switches. Note that this change influences only the corresponding block, not all the blocks.

Input: current topology, routing tables, and traffic distribution matrix
Output: network after network tuning
Algorithm:
/* Step 1: increase L and balance y_{i,j} */
increase local traffic L by moving/exchanging node placements;
balance inter-switch traffic between switch pairs by moving/exchanging node placements;
/* Step 2: decrease R */
minimize R by moving and/or adding channels;
recalculate f_k and ρ_k on the influenced channels;
if all ρ_k < δ, done;
/* Step 3: balance channel utilization by changing routing paths */
calculate or measure the f_k's and ρ_k's;
BusySet = all channels whose utilization is over δ;
for each channel BusyCh in BusySet {
    BusyUtil = BusyCh's utilization;
    PathSet = all paths which include BusyCh;
    try to find an alternate routing path for each path in PathSet which
        reduces BusyCh's utilization,
        does not increase max(ρ_k), and
        does not make another channel's utilization exceed BusyUtil;
}
/* Step 4: balance channel utilization by changing node placement */
recalculate f_k and ρ_k;
BusySet = all channels whose utilization is over δ;
for each channel BusyCh in BusySet {
    BusyUtil = BusyCh's utilization;
    find all source and destination switches whose path includes BusyCh;
    try to move nodes in these switches so that the new node placement
        reduces BusyCh's utilization,
        does not increase max(ρ_k), and
        does not make another channel's utilization exceed BusyUtil;
}

Figure 5.5: Algorithm for network tuning

3. Balance traffic by changing routing paths if possible. We need to know the traffic flow f_k and the channel utilization ρ_k of each channel, which can be obtained either by theoretical calculation or by network measurement. Because theoretical calculations may differ from measurements, due to simplifying assumptions in the calculations and the dynamic nature of the networking environment, measured channel utilization is usually more accurate and is suggested whenever applicable. If any ρ_k is over δ, the corresponding channel is marked as overloaded. Our algorithm searches all the paths that use overloaded channels. If a routing path includes at least one overloaded channel, we try to find an alternate path so that the traffic on the corresponding overloaded channel(s) is reduced. The routing change must also not cause a new overloaded channel or a larger max(ρ_k). If there is more than one alternate path, we pick the one that yields the smallest max(ρ_k) and the most balanced channel utilization.

4. Balance traffic by moving or exchanging nodes among switches so that the traffic on heavily utilized channels is reduced. Suppose channel k is overloaded and the inter-switch traffic from switch i to switch j passes through channel k.
If we can move or exchange node placements to reduce y_{i,j}, the utilization of channel k will be reduced, especially when y_{i,j} contributes a significant portion of the traffic flow on channel k. Of course, changing node placement has to be considered from a global point of view and should not create a new bottleneck channel. Usually, changing routing tables already balances traffic quite well; this step helps only when the initial node placement causes very unbalanced inter-switch traffic. We also note that it may not always be straightforward to redistribute nodes from one switch to another. Even in a LAN, the distances between switches may be large enough to make it difficult (or impossible) to reconnect a given host to another switch. Therefore, the feasibility of end-node replacement may be limited by the real network environment.

None of the above steps has a completely straightforward implementation. Moreover, we have to compromise between the criteria if we cannot achieve all of them at the same time; the most important criterion is minimizing max(ρ_k). Note that we first need to check whether the average channel utilization is less than δ. If the average channel utilization is over δ, the channel bandwidth may simply not be enough to handle the offered workload, and more network resources, such as new channels and/or switches, are needed. Even when the average channel utilization is under δ, it is still possible that some ρ_k remain greater than δ after applying the algorithm, due to the traffic distribution and network configuration. If this happens, we may need to add more channels or upgrade to higher bandwidth ports (switches).

5.5 Simulations

A simulator has been implemented to evaluate the network performance before and after network tuning under various topologies and traffic workloads. There are many network parameters (e.g., topology, placement of nodes, traffic distribution, message size, buffer size, message injection rate, and location of servers) and combinations of these parameters. Due to space limits, we show the performance only under a limited subset of parameter combinations.

5.5.1 Simulation Environment

Without loss of generality, the performance data shown in this section are measured on a switch-based network with 10 switches and 200 nodes. The initial topology is created by a BBD (10, 4, >= 1), shown on the right side of Table 5.9. The corresponding initial topology is shown in Figure 5.6; the network has 30 bidirectional channels and the degree of each switch is 6. The deterministic ETR is used as the routing algorithm. All calculations of traffic rates and channel utilization are based on this topology and routing algorithm.

B_0: 3 <-> 1 <-> 0 <-> 6;  B_1: 4 <-> 2 <-> 1 <-> 7;  B_2: 5 <-> 3 <-> 2 <-> 8;  B_3: 6 <-> 4 <-> 3 <-> 9;
B_4: 7 <-> 5 <-> 4 <-> 0;  B_5: 8 <-> 6 <-> 5 <-> 1;  B_6: 9 <-> 7 <-> 6 <-> 2;  B_7: 0 <-> 8 <-> 7 <-> 3;
B_8: 1 <-> 9 <-> 8 <-> 4;  B_9: 2 <-> 0 <-> 9 <-> 5;

Figure 5.6: Initial topology used for simulation

In order to evaluate interconnection networks, a simulator should consider the following aspects of the workload: traffic pattern, message size distribution, and temporal distribution [80]. Two traffic patterns are used in this section; for both, there are 5 servers and 5 client workgroups among the 200 nodes. The number of clients in each workgroup is listed in Table 5.10, and the traffic distribution of each pattern is shown in Tables 5.11 and 5.12.

Group 1   Group 2   Group 3   Group 4   Group 5
50        35        25        45        40

Table 5.10: Number of clients in each workgroup
Each server or client has 20-30% global uniform traffic as background traffic. Traffic pattern 1 and traffic pattern 2 are similar, except that pattern 2 has more workgroup traffic and less client/server traffic; our intention is to evaluate the effect of workgroup traffic and client/server traffic on network performance. For simplicity, we choose fixed-size messages, and the message size is always 128 bytes. Although we show network performance only for this specific topology and these two traffic patterns, similar performance trends can be observed for other topologies and network workloads. The switching technique is assumed to be cut-through switching and each port has an input FIFO buffer of 1000 bytes. Back-pressure flow control is used to avoid buffer overflow. Again, similar performance trends can be observed in store-and-forward switched networks.

srv 1:    10%  40%  30%  20%
srv 2:    20%  40%  20%  20%
srv 3:    60%  20%  20%
srv 4:    10%  70%  20%
srv 5:    10%  70%  20%
group 1:  30%  30%  20%  20%
group 2:  50%  30%  20%
group 3:  30%  50%  20%
group 4:  20%  40%  20%  20%
group 5:  10%  40%  30%  20%

Table 5.11: Traffic distribution of pattern 1 (each row gives a source's traffic shares among the destination categories srv1-srv5, grp1-grp5, and global uniform traffic)

srv 1:    10%  40%  30%  20%
srv 2:    20%  40%  20%  20%
srv 3:    50%  30%  20%
srv 4:    10%  60%  30%
srv 5:    10%  60%  30%
group 1:  20%  20%  40%  20%
group 2:  30%  50%  20%
group 3:  20%  60%  20%
group 4:  10%  30%  40%  20%
group 5:  10%  30%  40%  20%

Table 5.12: Traffic distribution of pattern 2 (each row gives a source's traffic shares among the destination categories srv1-srv5, grp1-grp5, and global uniform traffic)

We use a finite input source to model the message generation in each node, because it is more realistic than an infinite source model. A finite source model may have up to K outstanding messages. In high-performance computing, most parallel programs use blocking sends to deliver messages, so the next send is not initiated until the previous one has completed. In a client/server environment, a server may receive several requests and have several messages waiting in its queue to send. For simplicity, we consider the case K = 1, which means that the next message cannot be generated until the previous message has completely left the source.
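The finite-source, K = 1 generation process described above can be sketched as follows (an illustration of the model, not SimNet's MsgSend code; the class name, members, and the random-number choice are ours). After a message has completely left the source, the node waits an exponentially distributed think time x before generating the next message, so a smaller mean of x produces a heavier workload.

    #include <random>

    // One node's K = 1 finite-source generator: at most one outstanding message.
    class FiniteSource {
    public:
        FiniteSource(double mean_think_time, unsigned seed)
            : exp_(1.0 / mean_think_time), rng_(seed) {}

        // Called when the previous message has completely left the source buffer;
        // returns the think time x until the next message is generated.
        double next_think_time() { return exp_(rng_); }

    private:
        std::exponential_distribution<double> exp_;
        std::mt19937 rng_;
    };

    // Example: FiniteSource src(500.0, 1); double x = src.next_think_time();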
The network throughput is normalized by the theoretically maximum throughput, which is 200 bytes per time unit in the selected network. 5.5.2 Performance Improvement by Network 'Itming In this section, we use simulation results to show the performance improvement by applying our algorithm for network tuning. We use the method of block design in Section 4 to plan initial topology and adopt Eulerian trail routing [21] as our routing algorithm. Our algorithm is implemented based on the way we design topology and routing algorithm. Due to the space limit, we cannot address our algorithm in detail. (should address the algorithm a little bit) In our simulations, we measure the network performance under the following four cases, which are categorized into four levels based on how much network tuning has been done. In tuning levels 0-2, each switch has exactly 20 nodes and at most one of the 20 nodes is a server. In tuning level 3, a switch can have at most 22 nodes. This is intended to 146 simulate that each switch has a limited number of ports. Allowing at most one server in a switch is helpful to balance inter-switch traffic. 1. Tuning level 0: initial topology; random node placement; and no network tuning. 2. Tuning level 1: initial topology; nodes are placed in a way to maximize local traffic and balance inter-switch traffic; and no other network tuning. 3. Tuning level 2: based on level 1, try to minimize remote traffic R by changing the order of objects in each block, resulting in new network topology and new routing path length. No further traffic balance. 4. Tuning level 3: based on level 2, try to balance channel utilization by changing rout- ing paths and node placement. As shown in Figure 5.7, each level of network tuning can achieve significant perfor- mance benefits. When new clients or servers are added, similar performance benefits can be achieved as long as we follow the algorithms. The better performance after network tuning demonstrates the importance of network tuning and the efficiency of our algorithm. In tuning level 3, we use measured channel utilization because it is more accurate than cal- culations and can be measured after a network is operational. All the other network tuning levels are based on calculation. To evaluate the effect of workgroup traffic and client/server traffic on network perfor- mance, we compare the performance between traffic pattern 1 and traffic pattern 2. As shown in Figure 5.8, traffic pattern 2 provides lower message latency and higher network throughput than pattern 1. For example, maximum sustained network throughput is about 147 Traffic pattern 1 I I I I 1600 - . Level 0 ..— > 1400 - teve11 -+--- 1 eve 2+1- 5 1200 P Level 3 -..--. ‘ 3 1000 L 1,. q . “8 80° ’ 1tWM E 1 +’ 1‘ i‘ 3 600 - .1”, ”23m .2" . 90’ ,B ,,...x' 400 - p-‘G x .1 .a ------- x‘ 200 '- :B&HR:;?C:BKX~&/T 4 .. 0.05 0.1 0.15 0.2 0.25 Normalized Network Throughput Traffic pattern 2 1600 h Level 0 —.—_ ‘ 1400 - Levei1 -+--- . 5‘ Leve|2 ----a g 1200 " Leve| 3 ...x ....... . E 1000 - +1 3 . L. ‘4: U x 3 600 - ,1, a, 1x . <1: x.) 400 - 31"3 ....... x", -- . 200 " a figgzglge ”Nix”. " l .( 0.05 0.10 0.15 0.20 0.25 Normalized Network Throughput Figure 5.7: Performance improvement by apply network tuning 200 x 22% = 44 bytes per time unit for pattern 1 and 200 x 24% = 48 bytes per time unit for pattern 2. This is because pattern 2 has more local traffic within workgroups but less traffic to servers. Therefore, server bottleneck in pattern 2 is not as much as that in pattern 1. 
Figure 5.8: Performance comparison of traffic pattern 1 and traffic pattern 2 (average latency versus normalized network throughput for tuning levels 2 and 3 under both patterns).

5.5.3 Server Bottleneck

To avoid a server bottleneck, we assume that server ports have a higher bandwidth (10 bytes per time unit). We have also measured the network performance when all server ports have the same bandwidth (1 byte per time unit) as the other ports. As shown in Figure 5.9, the network performance decreases dramatically without higher bandwidth ports for the servers, and further network tuning cannot improve performance very much because of the server bottleneck. This result emphasizes the importance of relieving the server bottleneck in a client/server environment. We discuss the server bottleneck further in Section 5.5.4.

Figure 5.9: The importance of higher bandwidth in relieving the server bottleneck (average latency versus normalized network throughput for traffic pattern 1 at tuning level 1, with low and high bandwidth server ports).

5.5.4 Fairness

In Chapter 2.6, we discussed the fairness issue. To evaluate fairness in a switch-based network, we define the fairness ratio as

    fairness_ratio = (maximum traffic generated by a node) / (minimum traffic generated by a node)

The bigger the fairness ratio, the less fair the network. We have measured the traffic generated by each node and calculated the fairness ratio for each client workgroup. It turns out that the fairness ratio is usually below 2 under light to medium workload, but it can be as high as 7 under heavy workload. Network tuning can reduce the fairness ratio of each workgroup and improve the fairness of the whole network.

In our selected traffic patterns, the main cause of a high fairness ratio is the different locations of the clients in a workgroup. Due to the limited number of ports per switch, no workgroup can be placed entirely in the same switch as its server. For a client located in the same switch as its server, messages to the server do not compete for channel bandwidth with other traffic; they are transmitted to the server very quickly and the next message is generated very soon. However, for a client located in a different switch from its server, messages have to compete for at least one channel before they reach the server. Since a client communicates with its server frequently, this location difference can make a big difference in transmission latency and in the number of messages generated by a node. To avoid such an unfair situation, we may consider using a higher bandwidth channel to connect two switches if a significant portion of the traffic is transmitted between them, so that the channel can avoid congestion. As an example, we use a higher bandwidth channel (10 bytes per time unit) to connect the two switches that host the clients of workgroup 2. As shown in Figure 5.10, under the new network configuration the fairness ratio of workgroup 2 is significantly reduced under heavy workload.

Figure 5.10: Comparison of the fairness ratio of workgroup 2 with and without a higher bandwidth channel (traffic pattern 1, tuning level 1; fairness ratio versus workload from low to high).
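The fairness ratio just defined is straightforward to compute from per-node traffic counters; the short sketch below (ours, with hypothetical names) does so for one workgroup.

    #include <vector>

    // Fairness ratio of a group of nodes: max over min of the traffic (bytes)
    // generated by each node during the measurement interval.
    double fairness_ratio(const std::vector<double>& bytes_generated) {
        double mx = bytes_generated[0], mn = bytes_generated[0];
        for (size_t i = 1; i < bytes_generated.size(); ++i) {
            if (bytes_generated[i] > mx) mx = bytes_generated[i];
            if (bytes_generated[i] < mn) mn = bytes_generated[i];
        }
        return (mn > 0.0) ? mx / mn : mx;   // a ratio near 1 indicates a fair network
    }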
5.6 Evaluation of the Network Planning Algorithm

In this chapter, we have proposed a topology design algorithm based on block design and a systematic method for network tuning. Since the proposed solutions are heuristic, it is important to show how good they are. However, it is difficult to compare our methods with the optimal design and/or with other existing topology design algorithms, for the following reasons. First, optimal topology design is NP-complete, which makes it unrealistic to find the optimal solution when a network has more than a few switches and a dozen channels. Second, although many other topology design algorithms exist, it is not straightforward to implement any of them and measure its performance via simulation. Third, even if we could implement such an algorithm, the network environment it considers may not be the same as the one we consider, switch-based LANs, which could make the comparison unfair.

Therefore, how to comprehensively evaluate our network planning and tuning algorithms remains an open issue in this dissertation; we will try to find a good approach to this problem in the future. On the other hand, we have done some preliminary work, presented in this section, that compares our topology design with all the other possible topologies for two small network configurations. Although these comparisons are incomplete in terms of traffic patterns, routing algorithms, and network size, we hope that the preliminary results demonstrate the network behavior and performance to some extent. In the following, we consider all non-isomorphic switch-based networks in two cases:

- Case 1: a network with 6 switches and 12 channels;
- Case 2: a network with 6 switches and 11 channels.

We assume that each switch has 16 ports and that there are 70 nodes (i.e., host computers) in each network. As shown in Figure 5.11 and Figure 5.12, there are 5 non-isomorphic topologies for case 1 and 9 non-isomorphic topologies for case 2. As marked in the figures, Topology 1 in each figure is the topology generated by our network planning algorithm. Note that no network tuning has been applied here, because tuning depends strongly on the traffic pattern.

Figure 5.11: The five non-isomorphic switch-based networks with 6 switches and 12 channels (Topology 1 is generated by the proposed network planning method; Topologies 1-3 are Eulerian).

For simplicity, we consider two traffic patterns in our comparisons, and the traffic generation rate follows a Poisson distribution.

Figure 5.12: The nine non-isomorphic switch-based networks with 6 switches and 11 channels (Topology 1 is generated by the proposed network planning algorithm).

- Uniform traffic: message destinations are uniformly distributed among all network nodes.
- Client/server traffic: there are 2 servers and 68 clients among the 70 nodes. For a client, a certain ratio of its traffic (66.67% in this section) is spread evenly over the two servers, and the rest is spread evenly over the other clients. For a server, traffic is assumed to be uniform over all servers and clients.

It is known that the routing algorithm is critical to network performance for a given topology.
In Chapter 4, we studied three different routing algorithms. In this section we choose the ETR and SR routing algorithms for the comparison. Since the ETR algorithm cannot directly handle non-Eulerian graphs, we compare only the networks with Eulerian trails under that routing algorithm. We have run our simulation for each of the network topologies and measured the network performance under uniform traffic and client/server traffic. The simulation results are summarized below.

Uniform Traffic

As shown in Figure 5.13, Topology 1 of Figure 5.11, the topology generated by our network planning algorithm, has the best performance of all the topologies under uniform traffic with both the ETR and SR routing algorithms. Similarly, Figure 5.14 shows that our topology design, Topology 1, has the best performance among the nine networks of Figure 5.12. SR routing cannot generate a routing table for Topology 6, which is why there is no performance curve for Topology 6.

Figure 5.13: Performance comparison of the networks with 6 switches and 12 channels under uniform traffic (average latency versus normalized network throughput, for ETR and SR).

Figure 5.14: Performance comparison of the networks with 6 switches and 11 channels under uniform traffic (average latency versus normalized network throughput, for ETR and SR).

Client/Server Traffic

Under the client/server traffic, if a server port has the same bandwidth as a client port, the server becomes a traffic bottleneck because it cannot consume the incoming traffic as fast as demanded, and the performance of the whole network is significantly degraded. This phenomenon is observed in Figure 5.15, where the server ports have regular bandwidth and none of the topologies is superior to the others. To relieve the server bottleneck, we should provide a higher bandwidth port for each server. Figure 5.16 shows that the network performance is then significantly improved and that our generated topology performs very well among all the topologies.

Figure 5.15: Performance comparison of the networks with 6 switches and 12 channels under client/server traffic with regular bandwidth server ports.

Remarks

As mentioned before, our comparisons are incomplete. In terms of traffic patterns, we consider only uniform traffic and one specific client/server traffic.
To make the comparison more practical, more realistic traffic, such as bursty traffic or traffic traces from real LANs, should be considered. Obviously, the routing algorithm also plays an important role in network performance; since not many routing algorithms have been proposed specifically for switch-based networks, we choose ETR and SR in this section. Because there are countless possible switch-based networks, a few example networks cannot represent all of them. However, our preliminary work at least shows the performance trend of the considered networks and helps us investigate a better evaluation method.

Figure 5.16: Performance comparison of the networks using high bandwidth server ports under client/server traffic (6 switches with 12 channels and 6 switches with 11 channels).

Chapter 6

A Flexible Simulation Model for Performance Evaluation

Due to the dynamic nature of networking environments, analytic modeling is unlikely to give a practical insight into network behavior and performance. Moreover, analytical models often require simplifying assumptions that degrade the accuracy of the evaluation. To overcome these limitations, simulation experiments have been used by many researchers to evaluate the performance of multicomputer networks. These simulators either evaluate routing algorithms and switching schemes on multicomputer network architectures [81, 82, 83] or simulate the instruction-level operations of parallel processing applications [84].

To address network behavior under different traffic patterns, routing algorithms, topologies, and switching techniques, this chapter presents a flexible simulation model for performance evaluation in switch-based networks. The simulator, SimNet, is implemented in C++. Major network components are defined as classes representing the network topology, routing algorithm, switches, channels, message sending, message receiving, network workload, and so on. It is easy to run simulation experiments under different network environments by selecting topologies, routing algorithms, workloads, switching techniques, buffer architectures, and many other network parameters in user input files. New modules can be added to the simulator with only minor changes to existing modules.

6.1 Overall Structure

SimNet's overall structure is shown in Figure 6.1. The main components of SimNet are a set of C++ classes supporting the network topology (NetTop), routing algorithm (RouteAlg), switches (Switch), message generation (MsgSend), message reception (MsgRecv), channels (InChannel and OutChannel), workload (Workload), and messages (Flit). The task library, a multi-threaded library in the SUNWspro software package, is used in the simulator to control the concurrent operation of switches and nodes. A task is an object with a control thread running its associated program. More specifically, classes like Switch, MsgSend, and MsgRecv are derived from the class task, and their scheduling is controlled by the implementation of the task library.
The constructors of these classes, which are designed and implemented in SimNet, serve as the main programs of the tasks. No task survives after the completion of its constructor.

[Figure 6.1: The structure and modular relationship in SimNet; user input files (topology information and network parameters) feed NetTop and Workload, which drive RouteAlg, the Switch and node tasks (MsgSend and MsgRecv), PerfMeasure, and NetTune.]

After reading the user input files, NetTop sets up its member data, which consist of the switch adjacency matrix, port bandwidths, the number of nodes connected to each switch, and other information needed for complete knowledge of the network topology. Given NetTop, routing tables are created based on the selected routing algorithm, and each switch carries one of the tables. Each switch has a number of input ports and output ports, which are connected to other switches or nodes via OutChannels and InChannels. Each unidirectional channel is represented by one InChannel and one OutChannel, corresponding to its two end switches. Since a node both generates and receives messages, two classes are defined to simulate message generation and message reception: MsgSend encapsulates the details of traffic generation and MsgRecv handles all functions related to packet reception. NetTune balances traffic based on the current network configuration, resulting in a changed node placement, changed channels, and/or changed routing tables.

To evaluate network performance under a wide range of workloads, SimNet defines a workload as the combination of message generation rate, message size distribution, and traffic pattern. To cover various network situations, the simulator provides users with a parameter file, and users can select a specific situation by setting the proper parameters. These parameters are listed in Chapter 6.2.6. Whenever a message is received by MsgRecv, important performance data such as transmission latency, message size, and the number of traversed channels are collected. The simulator calculates end-to-end transmission latency statistics, such as the mean, variance, and confidence intervals. We do not distinguish messages and packets in SimNet, which means that each message is considered a packet.

6.2 Design and Implementation

As mentioned before, SimNet is written in C++ and uses the task library. Major modules are implemented by C++ classes, which define the related data structures and member functions. Switch, MsgSend, and MsgRecv are derived from the task class in task.h, so they are independent threads scheduled by the library implementation. The switching schemes, buffer management, arbitration policy, message generation, and message reception are implemented in SimNet, independent of the task library. At the beginning, SimNet reads two input files: one for topology information and one for network parameters. Then all classes are set up, routing tables are created, and switches and channels are properly interconnected. After the initial setup, switches (Switch) and nodes (MsgSend and MsgRecv) keep running as individual threads; a minimal sketch of this setup sequence is given below.
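The following is a highly simplified sketch of the order in which such a setup could be coded. The class interfaces shown here (readTopology, buildTables, and the constructor arguments) are hypothetical illustrations rather than SimNet's actual API; in SimNet itself the constructors of the task-derived classes also act as the task bodies.

// Hypothetical sketch of a SimNet-style setup sequence; the interfaces below
// (readTopology, buildTables, constructor arguments) are illustrative only.
#include <string>
#include <vector>

class NetTop   { public: void readTopology(const std::string &file) { /* parse switches, channels, nodes */ } };
class RouteAlg { public: void buildTables(const NetTop &topo)       { /* fill per-switch routing tables  */ } };
class Switch   { public: Switch(int id, const RouteAlg &alg)        { /* task body: forward flits        */ } };
class MsgSend  { public: explicit MsgSend(int node)                 { /* task body: generate messages    */ } };
class MsgRecv  { public: explicit MsgRecv(int node)                 { /* task body: consume messages     */ } };

int main(int argc, char **argv) {
    NetTop topo;
    RouteAlg route;
    if (argc > 2) {
        topo.readTopology(argv[1]);      // first input file: topology information
        // argv[2] would supply the network parameters listed in Table 6.1
    }
    route.buildTables(topo);

    std::vector<Switch*>  switches;
    std::vector<MsgSend*> senders;
    std::vector<MsgRecv*> receivers;
    for (int s = 0; s < 6; ++s)          // example: a 6-switch network
        switches.push_back(new Switch(s, route));
    for (int n = 0; n < 12; ++n) {       // example: 12 attached nodes
        senders.push_back(new MsgSend(n));
        receivers.push_back(new MsgRecv(n));
    }
    // Under the task library, constructing each object starts its thread;
    // the run then continues until enough messages have been received.
    return 0;
}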
Each object of MsgSend generates messages based on the workload information, each object of Switch forwards messages flit by flit from input ports to output ports, and each object of MsgRecv receives arriving messages and records the performance data.

6.2.1 Routing Algorithm

Three routing algorithms have been implemented in SimNet: ETR, ATR, and UDR. The simulator also accepts the results from the SR simulator and creates the corresponding routing tables from those results. A new routing algorithm can either be accommodated in the simulator or implemented outside of it. In the latter case, we need to make the simulator accept the output format of the new routing algorithm. Each routing algorithm is represented by routing tables in the switches. Each routing table keeps the routing information for packets in the associated switch. In SimNet, each entry in a routing table includes the input port, the destination switch, and the output port(s).

The main part of the data structure RoutEntry, the basic element of routing tables, is shown in Figure 6.2.

typedef struct RoutEntry_t {
    int dest;                    // destination switch id
    int inCh;                    // input channel id (input port)
    int options;                 // number of routing options for this entry
    OutCh *outCh[MAXOPTION];     // all possible output ports, sorted by routing distance
} RoutEntry;

Figure 6.2: Data structure RoutEntry

6.2.2 Switch Architecture

Figure 6.3 shows a generic switch with k ports. The switch is assumed to be nonblocking and to have a logical crossbar inside. As we mentioned earlier, each port is associated with a pair of input and output channels. Each port may connect to a node, which generates and consumes messages, connect to a port of another switch, or be left open for future use.

[Figure 6.3: A generic model of a switch; k input channels with FIFO input buffers feed a nonblocking internal interconnect, which drives k output buffers and output channels.]

The sketch of the Switch definition is shown in Figure 6.4. In each input port, a buffer called 'inbuf' is used to hold incoming packets. Each output port also has a small 'outbuf' to temporarily store the packet coming from an inport. The use of a small output buffer simplifies switch processing because there is no need to check whether the downstream switch has enough buffer space. For virtual cut-through and wormhole switches, there are only FIFO input buffers in the inports. For store-and-forward, a shared buffer pool is supported as well as FIFO input buffers. Note that the store-and-forward and virtual cut-through schemes need a buffer large enough to hold the maximum-sized packet, while the wormhole scheme only needs a buffer that holds the packet header (one or a few flits). Arbitration schemes that assign an outport among competing incoming packets include round-robin and FCFS; a minimal round-robin sketch is given below.
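As one concrete illustration of the round-robin policy, the following small function sketches how an outport could be granted to one of several competing inports; it is an illustrative fragment, not SimNet's actual member function.

// Minimal round-robin arbiter sketch (illustrative, not SimNet's actual code).
// requests[i] is true if inport i has a ready packet wanting this outport;
// 'last' remembers the inport granted in the previous arbitration round.
int roundRobinArbitrate(const bool requests[], int numInports, int &last) {
    for (int offset = 1; offset <= numInports; ++offset) {
        int candidate = (last + offset) % numInports;   // start just after the last winner
        if (requests[candidate]) {
            last = candidate;                           // remember the winner for next time
            return candidate;                           // grant the outport to this inport
        }
    }
    return -1;                                          // no inport is requesting this outport
}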
class Switch : public task {
private:
    int id;                        // identifier of the switch
    int numinport;                 // number of input ports
    int numoutport;                // number of output ports
    qhead *inbuf[MAXPORT];         // buffer for each input port, from InChannel
    qtail *outbuf[MAXPORT];        // buffer for each output port, to OutChannel
    Flit *sharebuf[NUMBUF];        // used for shared buffer only
    int busyflag[MAXPORT];         // status of each output port
    int bandwidth[MAXPORT];        // bandwidth of each output port
    int matchPort[MAXPORT];        // match inport to outport
    int ready[MAXPORT];            // status of each input port
    // private and public member functions
};

Figure 6.4: Definition of class Switch

The switch processing in each cycle is highlighted in Figure 6.5. Each inport can have one of the following six statuses.

- Idle: no packet is waiting for transmission.
- Not-ready: the packet in the inport is not yet completely buffered (for store-and-forward switching, or for virtual cut-through switching if the packet has been blocked before), or the header has not completely arrived (for wormhole switching).
- Ready: the packet in the inport is ready to be forwarded, depending on the forwarding scheme.
- In-transit: part of the packet has been forwarded and the rest of the packet will continue to be forwarded to the corresponding outport.
- Go: the packet has been assigned an available outport and its header flit will be forwarded in the current cycle.
- Blocked: the packet cannot be forwarded because there is no available outport for it.

The statuses 'idle', 'not-ready', and 'ready' are set in step 1 and used in step 2. The statuses 'go' and 'blocked' are set in step 2 and used in step 3. The status 'in-transit' is set in step 3 and used in all steps. An outport has two statuses, 'idle' and 'occupied', which are self-explanatory.

step 1: check the status of each inport
    store-and-forward: whether the whole packet has been buffered;
    virtual cut-through: whether the port is in cut-through or store-and-forward mode;
    wormhole: whether the header has been received;
    set 'idle', 'not-ready', or 'ready' and keep 'in-transit';
step 2: search the routing table for an output port
    if (an inport is 'idle', 'not-ready', or 'in-transit') do nothing;
    if (an inport is 'ready') find an available outport for it;
        if (there is more than one available outport)
            select the one with the shortest path length; mark the inport 'go';
            keep the mapping from inport to outport in the switch;
        if (there is one available outport)
            select that outport; mark the inport 'go';
            keep the mapping from inport to outport in the switch;
        if (more than one inport competes for the same outport)
            use the arbitration principle to assign the outport to one of the inports;
            mark the selected inport 'go' and the other inports 'blocked';
            keep the mapping from inport to outport in the switch;
        if (no output port is available for an inport) mark the inport 'blocked';
step 3: forward from inports marked 'in-transit' and 'go'
    if (an inport is marked 'idle' or 'not-ready') do nothing;
    if (an inport is marked 'in-transit')
        forward its flit to the corresponding outport;
        mark the inport 'idle' if it is the last flit;
        mark the outport 'idle' if it is the last flit;
    if (an inport is marked 'go')
        forward its flit to the corresponding outport;
        mark the inport 'in-transit'; mark the outport 'occupied';
    if (an inport is marked 'blocked')
        block the packet; change it to store-and-forward mode if virtual cut-through;

Figure 6.5: Algorithm to simulate switch forwarding in each cycle

At each simulation cycle, a switch checks the status of its input ports, with no dependence on the network topology, traffic patterns, or internal switching techniques used in other switches. A compressed sketch of this per-cycle loop is given below.
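To connect Figure 6.5 with the class members of Figure 6.4, the per-cycle loop can be compressed into roughly the following member-function sketch. The status constants (IDLE, READY, GO, and so on) and the helpers updateInportStatus(), selectOutport(), and forwardFlit() are hypothetical stand-ins for details that are not shown here.

// Compressed per-cycle sketch (illustrative only; the status constants and the
// helpers updateInportStatus, selectOutport, and forwardFlit are hypothetical).
void Switch::cycle() {
    // Step 1: classify each inport as idle / not-ready / ready; in-transit is kept.
    for (int p = 0; p < numinport; ++p)
        ready[p] = updateInportStatus(p);

    // Step 2: for each 'ready' inport, do a routing-table lookup, prefer the
    // shortest option, and resolve competing inports with the arbitration policy.
    for (int p = 0; p < numinport; ++p) {
        if (ready[p] != READY) continue;
        int out = selectOutport(p);                     // table lookup + arbitration
        if (out >= 0) { matchPort[p] = out; ready[p] = GO; busyflag[out] = OCCUPIED; }
        else          { ready[p] = BLOCKED; }
    }

    // Step 3: move a flit from every 'go' or 'in-transit' inport to its outport.
    for (int p = 0; p < numinport; ++p) {
        if (ready[p] == GO || ready[p] == IN_TRANSIT) {
            bool lastFlit = forwardFlit(p, matchPort[p]);
            ready[p] = lastFlit ? IDLE : IN_TRANSIT;
            if (lastFlit) busyflag[matchPort[p]] = IDLE;
        }
    }
}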
If an input port has an in-transit packet, whose header has already been forwarded to an output port, the switch continues forwarding its flits one by one to that outport. If an input port has a new 'ready' packet, whose header flit has not yet been forwarded, the switch tries to find an available output port based on the header information and a table lookup. If more than one input packet competes for the same output port, the channel arbitration principle is applied to coordinate access to the output ports. If no output port is available, the packet is blocked.

By default, a switch forwards one flit from an input port to an output port (if available) within one cycle, so a k-port switch can forward at most k flits from k inports to k outports in one cycle if there is no channel contention. However, if a port has higher bandwidth, more than one flit can be forwarded per cycle, depending on the bandwidth. At the next cycle, the input port of the corresponding downstream switch will have the packet in its buffer.

The simulator is implemented to handle the different forwarding schemes: store-and-forward, virtual cut-through, and wormhole. For store-and-forward, the whole packet is buffered before it is forwarded, resulting in longer transmission latency. For virtual cut-through, a packet is forwarded as soon as the header information has been received; if the packet is blocked at an inport because no output port is available, the whole packet has to be buffered before it is ready to be transmitted from that inport. For wormhole switching, the input buffer does not have to be big enough to hold a whole packet, and flits can be forwarded at any time once the header information has been completely received.

6.2.3 Incoming and Outgoing Channels

The class definitions of InChannel and OutChannel are shown in Figure 6.6. They use qhead and qtail defined in the task library, respectively. For qhead, a FIFO queue, we can 'get' a flit from its head. For qtail, which is also a FIFO queue, we can 'put' a new flit at its tail.

class InChannel {
    friend class Switch;
private:
    int multiple;                  // number of channels connecting a switch pair
    int bandwidth[MULTIPLE];       // channel bandwidth
    qhead *InBuf[MULTIPLE];        // buffers for multiple channels
    // member functions
};

class OutChannel {
    friend class Switch;
private:
    int multiple;                  // number of channels connecting a switch pair
    int bandwidth[MULTIPLE];       // channel bandwidth
    qtail *OutBuf[MULTIPLE];       // buffers for multiple channels
    // member functions
};

Figure 6.6: Class definitions of InChannel and OutChannel

All channels in the network are organized as two-dimensional arrays. Inchannel[i][j] denotes the incoming channel at switch i coming from switch j, and Outchannel[i][j] denotes the outgoing channel at switch i going to switch j. If there is no channel connecting two switches, the buffer space in InChannel and OutChannel is empty. As we mentioned earlier, each unidirectional channel is represented by both an InChannel and an OutChannel: Outchannel[i][j]'s OutBuf points to the tail of Inchannel[j][i]'s InBuf, and together they logically represent the same unidirectional channel from switch i to switch j. The separation of InChannels and OutChannels simulates the input buffer and output buffer in a switch so that a switch does not need to check the channel status of any downstream switch. A minimal sketch of this wiring step is given below.
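The sketch below shows this channel wiring in simplified form. FlitQueue is a stand-in for the task library's qhead/qtail pair, and the structure and function names are hypothetical; only the idea that both channel objects reference one shared queue is taken from the text above.

// Illustrative channel wiring (not SimNet's actual code).  FlitQueue stands in
// for the qhead/qtail pair of the task library; both channel ends reference the
// same underlying queue, so they represent one unidirectional channel.
#include <queue>

struct Flit { int data; };                         // placeholder flit
struct FlitQueue { std::queue<Flit> q; };          // simplified shared FIFO

struct InChannelSketch  { FlitQueue *inBuf;  };    // read end, owned by the receiving switch
struct OutChannelSketch { FlitQueue *outBuf; };    // write end, owned by the sending switch

// Link the outgoing channel of switch i toward switch j with the incoming
// channel of switch j coming from switch i.
void linkChannel(OutChannelSketch &out_i_to_j, InChannelSketch &in_j_from_i) {
    FlitQueue *shared = new FlitQueue;             // one queue per unidirectional channel
    out_i_to_j.outBuf  = shared;                   // switch i 'puts' flits at the tail
    in_j_from_i.inBuf  = shared;                   // switch j 'gets' flits from the head
}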
6.2.4 Message Generation and Reception

Each node sends and receives messages, and these two processes are simulated by the two classes MsgSend and MsgRecv. Recall that we do not distinguish messages and packets in this chapter. MsgSend generates the next message after a period of time determined by the message generation rate. The message size follows the message size distribution and the destination follows the destination distribution, both of which are described in Chapter 6.3. Since we use a finite source model holding one outstanding message, the next message cannot be generated until the previous message has completely left the source buffer.

In the simulation, we assume a flit is 1 byte. Each message has a number of flits (based on its message size), including one or more header flits (depending on the header delay in the user parameter file) and one tail flit. The header flit carries the following information: source, destination, message size, generation time, and a header flit tag. The tail flit has a special tag to indicate the end of a message.

When a message (packet) arrives at the destination node, MsgRecv removes it from its buffer and records important performance data such as latency and message size. Some calculation is needed to compute the average latency over a batch.

6.2.5 Flow Control

In SimNet, we assume all switches use link-level back pressure: if a buffer is full in the downstream switch, the upstream switch receives a 'stop' signal and stops sending more packets to the downstream switch until it receives a 'go' signal later on. This flow control scheme is similar to the one in Myrinet [7]. Therefore, we do not need to worry about packet loss due to buffer overflow in the simulation.

6.2.6 Basic Parameters

An input file including all needed parameters is required by the simulator. These parameters include important information for network topology and routing. The major parameters are listed in Table 6.1.

    BufSize       buffer size for each input port
    BurstRate     bursty rate for bursty messages
    DistType      destination distribution
    MaxMsgSize    maximum packet size
    MinMsgSize    minimum packet size
    MeanMsgSize   mean packet size
    MsgSizeType   message size distribution: fixed, bimodal, or bursty
    MsgRate       message generation rate
    RouteType     routing algorithm: ETR, ATR, UPDOWN, SR
    SwitchType    switching technique: store-and-forward, cut-through
    BufType       FIFO or shared buffer (for store-and-forward)
    Tuning        how much network tuning will be done

Table 6.1: List of major parameters in SimNet

6.2.7 Performance Metrics

When the number of received messages exceeds a given threshold and the statistical average latency is within a confidence interval, the simulator calculates and reports the important performance data and terminates itself. As discussed in Chapter 2.6, we consider the following three important performance metrics: average message latency, sustained network throughput, and fairness ratio. In most plots in this thesis, performance is represented by a latency-throughput curve. We also mark unfair situations in some figures. The fairness ratio is calculated by considering node types (i.e., server or client). A sketch of the latency-based stopping rule is given below.
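To make the stopping rule concrete, the following fragment computes a 95% confidence interval from batch means of the measured latencies and stops when the interval is tight relative to the mean. The batch count threshold, the z-value, and the 5% tolerance are illustrative choices and not the exact settings used in SimNet.

// Illustrative batch-means stopping test (not SimNet's exact code).  The run is
// considered converged when the half-width of the 95% confidence interval of
// the average latency falls within 5% of the mean.
#include <cmath>
#include <vector>

bool latencyConverged(const std::vector<double> &batchMeans) {
    const double z95 = 1.96;                              // normal-approximation z-value
    const std::size_t n = batchMeans.size();
    if (n < 10) return false;                             // require a minimum number of batches
    double sum = 0.0, sumSq = 0.0;
    for (double m : batchMeans) { sum += m; sumSq += m * m; }
    const double mean = sum / n;
    const double var  = (sumSq - n * mean * mean) / (n - 1);  // variance of the batch means
    const double halfWidth = z95 * std::sqrt(var / n);        // confidence-interval half-width
    return halfWidth <= 0.05 * mean;
}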
To facilitate a wide range of experiments, SimNet supports various workloads. In terms of destination distribution (i.e., traffic pattern), We have the following three major types of destination distribution. 1. Uniform traffic: any node will uniformly communicate with any other node. 2. Global client/server traffic: k nodes out of n nodes are servers and all the other nodes are clients. For a client, a certain rate of traffic is sent uniformly among those servers and the rest of traffic is sent uniformly among those clients. A server sends 174 messages to other clients and servers with uniform distribution. This traffic pattern is a simplified version of client/server traffic. Because in a real network, a client may not uniformly communicate with all servers; instead, it may communicate with a specific server more frequently. Therefore, we come up with the following more realistic and general traffic pattern. 3. Workgroup and client/server traffic: k out of n nodes are servers and the rest 72 - k nodes are partitioned into g workgroups. A traffic distribution matrix is used to describe how many percentage of traffic will be transmitted among 9 workgroups and k servers. Traffic patterns for all nodes in the same workgroup are same. The example expression of traffic distribution has been shown in Chapter 5.5. 1. The following four message size distributions are used in our simulation: 0 Fixed length: all message size is same. 0 Exponential: message size is an exponential distribution with a given mean. 0 Bimodal: 75% of messages are short message with 100 bytes, and 25% of messages are long message with 1,500 bytes. 0 Bursty: burst traffic happens at a specific rate and burst size is 10,000 bytes. Other messages are 100 bytes long. In actual network applications, the bimodal and bursty distributions are typical and the hot—spot destination pattern is commonly observed. The fixed message size and the uni- form traffic distribution are combined to demonstrate the maximum sustained throughput or worst case latency. 175 We use finite input source to model the message generation in each node because it is more realistic than the infinite source model typically used by other researchers. Moreover, due to the limitation of memory space, the over-saturated behavior is extremely difficult to measure using the infinite source model. A finite source model may have up to K out- standing messages. In high performance computing, most parallel programs use blocking send to deliver messages. Thus, the next send will not be initiated until the previous one is completed. In a client/server environment, a server may receive several requests and have several messages waiting on the queue to send. For simplicity, we consider the case of K = 1, which means next message cannot be generated unless the previous message has completely left the source. Let a: be the time duration between the time when the current message has completely left the source buffer and the time when the next message is generated in each node. Thus, a: is a random variable based on an exponential distribution. The smaller the :r is, the heavier the workload is. Note that the actual effective network workload, which depends on both :1: and how fast a message leaves its source in each switch, may not be exactly same for same a: under different network situations. However, this should be enough to fairly and practically evaluate the performance. 
The routing algorithms implemented in the simulator can handle all topologies, except that ETR and ATR cannot deal with a network that has no Eulerian trail. A connected network has a closed Eulerian trail if and only if every switch has an even number of channels. This limitation can be partially relaxed by removing a few channels and adding them back onto the Eulerian trail. In most simulation experiments, we only consider networks with Eulerian trails. The simulator needs an input file, which includes information on the switch interconnects, the number of ports and their bandwidths in each switch, the number of nodes attached to each switch, and the server locations if applicable. After reading these input data, NetTop is initialized and keeps all information related to the network topology. InChannels and OutChannels are linked together based on the switch interconnects.

6.4 Flexibility and Expandability

The simulation environment for switch-based networks (SimNet) provides an extensible, object-oriented framework for performance evaluation. The simulator's structure facilitates experiments with flexible routing algorithms, topologies, switching schemes, buffer management, and arbitration policies. All experiments can be run independently based on various design parameters. At each run time, the simulator can instantiate unique network workloads, routing algorithms, switching schemes, and buffer management. The major components in SimNet are defined as different classes, which creates clean and well-defined boundaries between the components and allows easy extension of the simulator.

For example, NetTune was not included in the simulator originally. Later on, we needed to add this module for network tuning in order to reconfigure the network, balance channel utilization, and improve performance. Since this module only communicates with RouteAlg and NetTop, there was no need to change any other modules (classes). The simulator can easily incorporate new topologies, routing algorithms, switching techniques, traffic patterns, and data collection routines.

6.5 Simulation Experiments

SimNet has been run many times under different network environments. A large amount of performance data has been collected, and the simulation results are plotted in Chapters 3, 4, and 5 for the purpose of evaluation and comparison. In each run, a specific parameter file and a topology file are given as input arguments. The simulator runs until enough packets have been received and the average latency is within a confidence interval. The major experiments are summarized here.

- In Chapter 3, simulation experiments are used to verify the accuracy of our analytical models and to evaluate performance under various network environments.
- In Chapter 4, we evaluate the network performance of three different routing algorithms in switch-based networks. Wormhole switching and FIFO input queuing are assumed. Uniform and client/server traffic are considered, as well as bursty traffic.
- In Chapter 5, we demonstrate the performance improvement obtained by applying the network tuning algorithm, which reduces inter-switch traffic and balances channel utilization.

Chapter 7

Conclusions

In previous chapters, we have reviewed fundamental issues in switch design and switch interconnects, described analytical models for delay analysis, proposed two simple and efficient routing algorithms, presented practical solutions for network planning and tuning, and evaluated network performance under various network environments.
This chapter summarizes the research contributions of this thesis and indicates future work.

7.1 Contributions of the Thesis

In this thesis, we have presented our research results on switching techniques, routing algorithms, network planning and tuning, and performance evaluation in switch-based networks. The major research contributions are summarized in the following.

7.1.1 Performance Evaluation of Switching Techniques

Performance evaluation is very important for the design of interconnection networks. In switch-based networks, three switching schemes (store-and-forward, virtual cut-through, and wormhole) have been proposed and applied in commercial products. To understand their performance tradeoffs, analytical models are presented for average packet delay analysis in both store-and-forward switching and cut-through switching. The analytical models, which rely on some simplifying assumptions about switch-based networks, are inspired by previous work based on M/M/1 queuing theory. Meanwhile, simulation results are used to verify the accuracy of the models and to evaluate performance under various network environments.

7.1.2 Routing Algorithm

We have proposed two simple and efficient deadlock-free routing algorithms for arbitrary switch-based networks. Both algorithms use an Eulerian trail as the 'base line' to control channel dependency and provide reasonable routes. The ETR algorithm is very simple: it concatenates the two unidirectional Eulerian trails to provide reasonable routes. The ATR algorithm is more complex: it adds different types of shortcuts to shorten routing paths and balance channel utilization. The ATR thus has the potential to allow shorter routing paths and to utilize channels evenly. Two related algorithms are reviewed and their performance is compared with ATR via simulation experiments under various topologies and traffic workloads. In most cases, our proposed algorithm ATR and the SR provide better performance than up*/down*. While the ATR has performance close to the SR, the ATR is much simpler than the SR. The up*/down* may concentrate traffic near the root switch, and therefore its performance depends heavily on the topology and the selection of the root switch.

7.1.3 Network Planning and Tuning

We have studied issues of network planning and routing in switch-based LANs. Important design considerations have been discussed in terms of network characteristics, routing path length, traffic locality, traffic balance, and performance metrics. We have proposed a method for initial topology design based on block design. This method can efficiently minimize the maximum path length and the average path length by appropriately selecting differences and rearranging the order of objects to construct block B0. A systematic method has been presented for network tuning and incremental reconfiguration. Our method tries to maximize local traffic, balance inter-switch traffic, reduce remote traffic, and balance channel utilization. Our algorithm for network tuning has been implemented based on our block design and Eulerian trail routing algorithm. Simulation results show that network tuning can significantly improve network performance, especially when we have better knowledge of the traffic distribution after a network is operational.

7.1.4 Simulations

A flexible simulator has been implemented in C++ using a multi-threaded task library.
The simulator can handle different routing algorithms, topologies, traffic workloads, switching schemes, and buffer management. Users can initiate a specific network configuration by setting the proper parameters in the input parameter file. New features can be easily incorporated into the simulator. Comprehensive simulation experiments provide insight into network behavior under various network conditions. The simulation results have been used to evaluate performance for different switching techniques, routing algorithms, topologies, workloads, and network tuning methods.

7.2 Future Work

There are many directions worth further exploration. The following issues indicate some possible directions for extending the work that has been done in this thesis.

- More realistic traffic patterns: We have considered uniform traffic and client/server traffic in this dissertation. Bursty traffic has also been used to measure the performance of different routing algorithms. However, to make design and evaluation practical, we should consider more realistic traffic, such as bursty traffic and traffic traces from real LAN networks, in network routing, planning, and tuning.

- Evaluation of network planning and tuning methods: Although we have presented preliminary evaluation results in Chapter 5.6, a good approach is still needed to evaluate our proposed algorithms and compare them with other existing algorithms. Due to its difficulty, this issue is not addressed very well in the dissertation and should be studied in the future.

- Stochastic analysis: Stochastic analysis of the network is another direction for future study, since the stochastic behavior of host-to-host interaction is very relevant in LANs. This interaction can significantly influence network performance.

- Multicast: Although our routing algorithms focus on unicast routing, we are also interested in multicast communication. It is clear that efficient multicast communication support is critical to the performance of many applications. Some hardware multicast methods have been proposed in [85] for a gigabit ATM switch and in [24] for cut-through switches. In [86], a reliable multicast scheme has been proposed for Myrinet using Fast Messages [87]. A trip-based multicast routing algorithm has also been proposed in [88] for any wormhole-routed network using two virtual channels. As multicast communication becomes increasingly important in practice, more research is needed to allow efficient multicast communication in switched networks. Since the performance of a multicast algorithm varies across platforms, a more realistic and practical method is to consider some critical machine-specific parameters, as proposed in [51], when designing the multicast algorithm.

Bibliography

[1] L. M. Ni, "Issues in designing truly scalable interconnection networks," Proc. of the 1996 ICPP Workshop on Challenges for Parallel Processing, pp. 74-83, Aug. 1996.

[2] P. Kermani and L. Kleinrock, "Virtual cut-through: A new computer communication switching technique," Computer Networks, vol. 3, no. 4, 1979.

[3] W. J. Dally and C. L. Seitz, "The torus routing chip," Journal of Distributed Computing, vol. 1, no. 3, pp. 187-196, 1986.

[4] W. J. Dally, "Virtual channel flow control," IEEE Transactions on Parallel and Distributed Systems, vol. 3, pp. 194-205, Mar. 1992.

[5] R. J. Souza et al., "The GIGAswitch system: A high-performance packet switching platform," Digital Technical Journal, vol. 6, Jan. 1994.

[6] Ancor Communications Inc., http://www.ancor.com/cxt.html.
[7] N. J. Boden et al., "Myrinet: A gigabit-per-second local area network," IEEE Micro, vol. 15, pp. 29-36, Feb. 1995.

[8] T. W. Giorgis, "29 switching hubs save the bandwidth," BYTE, pp. 162-169, July 1995.

[9] A. G. Nowatzyk, M. C. Browne, E. J. Kelly, and M. Parkin, "S-Connect: From networks of workstations to supercomputer performance," Proceedings of the 22nd International Symposium on Computer Architecture, June 1995.

[10] R. Horst, "ServerNet deadlock avoidance and fractahedral topologies," Proceedings of the International Parallel Processing Symposium, pp. 274-280, Apr. 1996.

[11] C. B. Stunkel et al., "The SP1 high-performance switch," Proceedings of the 1994 Scalable High Performance Computing Conference (SHPCC '94), pp. 150-157, May 1994.

[12] "Cray T3D system architecture overview," tech. rep., Cray Research Inc., Sept. 1993.

[13] M. Noakes, D. A. Wallach, and W. J. Dally, "The J-Machine multicomputer: An architecture evaluation," International Symposium on Computer Architecture, pp. 224-236, 1993.

[14] B. Duxett and R. Buck, "An overview of the nCUBE-3 supercomputer," Proceedings of the Frontiers of Massively Parallel Computation, pp. 458-464, 1992.

[15] Intel Corporation, "Paragon XP/S product overview," 1991.

[16] L. M. Ni and P. K. McKinley, "A survey of wormhole routing techniques in direct networks," IEEE Computer, vol. 26, pp. 62-76, Feb. 1993.

[17] M. Gerla and L. Kleinrock, "On the topological design of distributed computer networks," IEEE Transactions on Communications, vol. COM-25, pp. 48-60, Jan. 1977.

[18] B. Gavish, "A general model for the topological design of computer networks," GLOBECOM '86, pp. 1584-1588, 1986.

[19] C. Ersoy and S. S. Panwar, "Topological design of interconnected LAN/MAN networks," IEEE Journal on Selected Areas in Communications, vol. 11, pp. 1172-1182, Oct. 1993.

[20] R. Elbaum and M. Sidi, "Topological design of local area networks using genetic algorithms," INFOCOM '95, pp. 64-71, Apr. 1995.

[21] W. Qiao and L. M. Ni, "Adaptive routing in irregular networks using cut-through switches," Proceedings of the 1996 International Conference on Parallel Processing, vol. I, pp. 52-60, Aug. 1996.

[22] L. M. Ni, W. Qiao, and M. Yang, "Switches and switch interconnects," MPPOI '97, pp. 52-60, June 1997.

[23] 3Com Corporation, "Switches and routers," http://www.3com.com/solutions, 1997.

[24] M. Yang and L. M. Ni, "Design of scalable and multicast capable cut-through switches for high-speed LANs," accepted to appear in the Proceedings of the 1997 International Conference on Parallel Processing, Aug. 1997.

[25] M. Yang and L. M. Ni, "Incremental design of scalable interconnection networks using basic building blocks," Proceedings of the Seventh IEEE Symposium on Parallel and Distributed Processing, pp. 252-259, Oct. 1995.

[26] J. Rexford, "Tailoring router architectures to performance requirements in cut-through networks," Ph.D. dissertation, University of Michigan, 1996.

[27] S. Daniel, J. Rexford, J. Dolter, and K. Shin, "A programmable routing controller for flexible communications in point-to-point networks," Proceedings of the IEEE International Conference on Computer Design, pp. 320-325, Oct. 1995.

[28] M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W. Felten, and J. Sandberg, "Virtual memory mapped network interface for the SHRIMP multicomputer," Proceedings of the International Symposium on Computer Architecture, pp. 142-153, Apr. 1994.
[29] D. Cohen and G. Finn, "ATOMIC: A low-cost, very-high-speed, local communication architecture," Proceedings of the 1993 International Conference on Parallel Processing, vol. I, pp. 39-46, Aug. 1993.

[30] Y. Oie, T. Suda, M. Murata, and H. Miyahara, "Survey of the performance of nonblocking switches with FIFO input buffers," Proceedings of ICC '90, Apr. 1990.

[31] M. J. Karol, M. G. Hluchyj, and S. P. Morgan, "Input versus output queuing on a space-division packet switch," IEEE Transactions on Communications, vol. 35, pp. 1347-1356, Dec. 1987.

[32] M. G. Hluchyj and M. J. Karol, "Queuing in high-performance packet switching," IEEE Journal on Selected Areas in Communications, vol. 6, pp. 1587-1597, Dec. 1988.

[33] T. Anderson, S. Owicki, J. Saxe, and C. Thacker, "High-speed switch scheduling for local area networks," ACM Transactions on Computer Systems, vol. 11, pp. 319-352, Nov. 1993.

[34] R. O. Onvural, Asynchronous Transfer Mode Networks: Performance Issues. Artech House, 1994.

[35] M. Katevenis, P. Vatsolaki, and A. Efthymiou, "Pipelined memory shared buffer for VLSI switches," Proceedings of ACM SIGCOMM '95, pp. 39-48, Aug. 1995.

[36] J. P. Li and M. W. Mutka, "Priority based real-time communication for large scale wormhole networks," Proceedings of the International Parallel Processing Symposium, pp. 433-438, Apr. 1994.

[37] Myricom, "Myrinet link specification," http://www.myri.com/products/documentation/linkl.

[38] C. Seitz et al., "The design of the Caltech Mosaic C multicomputer," Proceedings of the University of Washington Symposium on Integrated Systems, 1993.

[39] M. Schroeder et al., "Autonet: A high-speed, self-configuring, local area network using point-to-point links," IEEE Journal on Selected Areas in Communications, vol. 9, Oct. 1991.

[40] R. Horst, "TNet: A reliable system area network," IEEE Micro, pp. 36-44, Feb. 1995.

[41] S. S. Owicki and A. R. Karlin, "Factors in the performance of the AN1 computer network," Performance Evaluation Review, vol. 20, pp. 167-180, June 1992.

[42] J. Duato, "Deadlock-free message routing in multiprocessor interconnection networks," IEEE Transactions on Parallel and Distributed Systems, vol. 4, Nov. 1993.

[43] X. Lin, P. McKinley, and L. Ni, "The message flow model for routing in wormhole-routed networks," IEEE Transactions on Parallel and Distributed Systems, pp. 755-760, July 1995.

[44] A. K. Venkatramani and T. M. Pinkston, "An efficient fully adaptive deadlock recovery scheme: DISHA," Proceedings of the 22nd International Symposium on Computer Architecture, pp. 201-210, June 1995.

[45] A. K. Venkatramani, T. M. Pinkston, and J. Duato, "Generalized theory for deadlock-free adaptive wormhole routing and its application to Disha Concurrent," Proceedings of the International Parallel Processing Symposium, pp. 815-821, Apr. 1996.

[46] L. Cherkasova, V. Kotov, and T. Rokicki, "Fibre Channel fabrics: Evaluation and design," 29th Hawaii International Conference on System Sciences, Feb. 1995.

[47] B. Abali, "A deadlock avoidance method for computer networks," Proceedings of the First International Workshop on Communication and Architectural Support for Network-Based Parallel Computing, pp. 61-72, Feb. 1997.

[48] F. Silla, M. Malumbres, A. Robles, P. Lopez, and J. Duato, "Efficient adaptive routing in networks of workstations with irregular topology," Proceedings of the First International Workshop on Communication and Architectural Support for Network-Based Parallel Computing, pp. 46-60, Feb. 1997.
[49] K. M. Khalil and P. A. Spencer, "A systematic approach for planning, tuning and upgrading local area networks," GLOBECOM '91, pp. 658-663, 1991.

[50] R. Kesavan, K. Bondalapati, and D. Panda, "Multicast on irregular switch-based networks with wormhole routing," Third International Symposium on High Performance Computer Architecture (HPCA-3), pp. 48-57, Feb. 1997.

[51] J. L. Park, H. Choi, N. Nupairoj, and L. M. Ni, "Construction of optimal multicast trees based on the parameterized communication model," Proceedings of the International Conference on Parallel Processing, vol. I, pp. 180-187, Aug. 1996.

[52] K. M. Khalil, K. Q. Luc, and D. V. Wilson, "LAN traffic analysis and workload characterization," Proceedings of the 15th IEEE Conference on Local Computer Networks, Oct. 1990.

[53] J. Shoch and J. Hupp, "Measured performance of an Ethernet local network," Communications of the ACM, pp. 711-721, Dec. 1980.

[54] M. Rose and K. McCloghrie, How to Manage Your Network Using SNMP: The Network Management Practicum. Prentice Hall, 1995.

[55] Hewlett Packard Company, http://www.hp.com/rnd/technol/switch/impperf/lans.htm.

[56] 3Com Corporation, http://www.3com.com/nsc.html, 1996.

[57] W. Qiao and L. M. Ni, "Evaluate routing algorithms in cut-through switched networks," Tech. Rep. MSU-CPS-ACS-96-07, Michigan State Univ., Dept. of Computer Science, June 1996.

[58] A. Kumar and L. Bhuyan, "Evaluating virtual channels for cache-coherent shared-memory multiprocessors," International Conference on Supercomputing, May 1996.

[59] L. Kleinrock, Communication Nets: Stochastic Message Flow and Delay. McGraw-Hill, New York, 1964.

[60] L. Kleinrock, Queuing Systems, vol. II: Computer Applications. Wiley-Interscience, New York: John Wiley & Sons, 1976.

[61] J. W. Dolter, P. Ramanathan, and K. G. Shin, "Performance analysis of virtual cut-through switching in HARTS: A hexagonal mesh multicomputer," IEEE Transactions on Computers, vol. 40, pp. 669-680, June 1991.

[62] Y. S. Youn and C. K. Un, "Performance analysis of an integrated voice/data cut-through switching network," Journal of Distributed Computing, vol. 21, no. 1, pp. 41-51, 1991.

[63] T. D. Morris and H. G. Perros, "Approximate analysis of a discrete-time tandem network of cut-through queues with blocking and bursty traffic," Performance Evaluation, vol. 17, no. 3, pp. 207-223, 1993.

[64] C. P. Kruskal and M. Snir, "The performance of multistage interconnection networks for multiprocessors," IEEE Transactions on Computers, vol. C-32, pp. 1091-1098, Dec. 1983.

[65] W. J. Dally, "Performance analysis of k-ary n-cube interconnection networks," IEEE Transactions on Computers, vol. 39, pp. 775-785, June 1990.

[66] A. Agarwal, "Limits on interconnection network performance," IEEE Transactions on Parallel and Distributed Systems, vol. 2, pp. 398-412, Oct. 1991.

[67] L. Kleinrock, Queuing Systems, vol. I: Theory. Wiley-Interscience, New York: John Wiley & Sons, 1975.

[68] J. R. Jackson, "Networks of waiting lines," Operations Research, vol. 5, pp. 518-521, Aug. 1957.

[69] G. L. Fultz and L. Kleinrock, "Adaptive routing techniques for store-and-forward computer communication networks," Proceedings of the International Conference on Communication, pp. 14-16, June 1971.

[70] W. Oed, The Cray Research Massively Parallel Processor System: Cray T3D, Nov. 1993.

[71] C. Stunkel et al., "The SP2 high-performance switch," IBM Systems Journal, vol. 34, no. 2, pp. 185-204, 1995.
[72] W. Dally and C. Seitz, "Deadlock-free message routing in multiprocessor interconnection networks," IEEE Transactions on Computers, vol. C-36, pp. 547-553, May 1987.

[73] C. J. Glass and L. M. Ni, "The turn model for adaptive routing," Journal of the Association for Computing Machinery, vol. 41, pp. 874-902, Sept. 1994.

[74] H. Park and D. P. Agrawal, "Generic methodologies for deadlock-free routing," Proceedings of the International Parallel Processing Symposium, pp. 638-643, Apr. 1996.

[75] W. Qiao and L. M. Ni, "Adaptive routing in irregular networks using cut-through switches," Tech. Rep. MSU-CPS-ACS-108, Michigan State Univ., Dept. of Computer Science, Dec. 1995.

[76] F. Harary, Graph Theory. Reading, Massachusetts: Addison-Wesley, 1972.

[77] W. Qiao and L. M. Ni, "Network planning and tuning in switch-based LANs," submitted for publication, July 1997.

[78] J. Opatrny, N. Srinivasan, and V. S. Alagar, "Highly fault-tolerant communication network models," IEEE Transactions on Circuits and Systems, vol. 36, pp. 23-29, Jan. 1989.

[79] B. Yener, Y. Ofek, and M. Yung, "Topological design of loss-free switch-based LANs," INFOCOM '95, pp. 88-96, Apr. 1995.

[80] A. A. Chien and M. Konstantinidou, "Workload and performance metrics for evaluating parallel interconnects," IEEE Computer Architecture Technical Committee Newsletter, pp. 23-27, Summer-Fall 1994.

[81] K. Bolding, S. E. Choi, and M. Fulgham, "The chaos router simulator," Parallel Computer Routing and Communication Workshop, pp. 115-124, May 1994.

[82] P. K. McKinley and C. Trefftz, "MultiSim: A simulation tool for the study of large-scale multiprocessors," Proceedings of the International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 57-62, Jan. 1993.

[83] J. R. Jump and S. Lakshmanamurthy, "NETSIM: A general-purpose interconnection network simulator," Proceedings of the International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 121-125, Jan. 1993.

[84] E. Olk, "PARSE: Simulation of message passing communication networks," Proceedings of the Annual Simulation Symposium, pp. 115-124, Apr. 1994.

[85] J. S. Turner, "An optimal nonblocking multicast virtual circuit switch," Proceedings of 1994 INFOCOM, pp. 298-305, June 1994.

[86] K. Verstoep, K. Langendoen, and H. Bal, "Efficient reliable multicast on Myrinet," Proceedings of the 1996 International Conference on Parallel Processing, vol. III, pp. 156-165, Aug. 1996.

[87] S. Pakin, M. Lauria, and A. Chien, "High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet," Supercomputing '95, 1995.

[88] Y.-C. Tseng, D. K. Panda, and T. Lai, "A trip-based multicasting model in wormhole-routed networks with virtual channels," IEEE Transactions on Parallel and Distributed Systems, vol. 7, pp. 138-150, Feb. 1996.