INTERPROCESSOR COMMUNICATION IN DISTRIBUTED MEMORY MULTIPROCESSORS

By

Youran Lan

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1988

ABSTRACT

INTERPROCESSOR COMMUNICATION IN DISTRIBUTED MEMORY MULTIPROCESSORS

By Youran Lan

Distributed memory multiprocessors (DMMPs) have gained much attention recently because their architectural characteristics make a massively parallel processing environment possible. In such systems, the interprocessor communication mechanism has been identified as the major system bottleneck; an efficient interprocessor communication mechanism is therefore the key to the future success of DMMPs. This motivates the study of a fast, versatile, and fault-tolerant interprocessor communication mechanism for DMMPs.

Both store-and-forward and virtual cut-through communication techniques are discussed, and formal models are developed to facilitate the performance comparison of these two techniques. From the application point of view, three types of interprocessor communication patterns are required: unicast (one-to-one), multicast (one-to-many), and broadcast (one-to-all).
A graph theoretical model, namely the optimal multicast tree, is proposed to characterize these communication patterns and to define the performance evaluation criteria, time and traffic.

Multicast communication, in particular, is highly demanded but not directly supported by any existing DMMP. A distributed multicast algorithm based on a heuristic greedy method is proposed. In addition to guaranteeing a shortest path for message delivery between the source and each destination, the total traffic created is very close to the optimal solution. More importantly, the algorithm can be efficiently implemented in hardware using the virtual cut-through technique. The architecture of the hardware router is presented. A prototype router design for a 3-cube has been fabricated by MOSIS using 3 micron CMOS technology.

An enhancement to the proposed algorithm and its hardware implementation is studied, which allows the communication mechanism to handle interprocessor communication in a faulty hypercube in which each fault-free node has at most one faulty neighboring node.

In summary, centered around the interprocessor communication issue, this dissertation focuses on modeling, algorithm development, and hardware implementation of a versatile and efficient communication mechanism. The proposed communication mechanism is novel in the sense that it is the first communication hardware for DMMPs which directly supports all three types of communication, and the first which has fault-tolerant capability. It can be readily applied to future generations of DMMPs to significantly increase overall system performance.

© Copyright by Youran Lan 1988

To my parents: Zhen-lu Lou and Zhu-yu Zhou

ACKNOWLEDGEMENTS

I would like to thank Professor Lionel M. Ni, my thesis adviser, for his invaluable inspiration and guidance throughout my graduate study. Without the knowledge and time he gave me, this work would have been impossible. I would like to thank Professor Abdol-Hossein Esfahanian, my thesis co-advisor, for directing me to pursue a rigorous research approach and for his continual encouragement. I would like to thank Professors Edwin Kashy, George Stockman, and Anil Jain for their encouragement and many excellent comments during the course of my dissertation research.

I would like to acknowledge all the faculty members and students who gave me help and assistance during my studies at Michigan State University. Although they are too numerous to list by name, I would like in particular to express my appreciation to Bruce McMillen, Chung-Ta King, Xiao-la Lin, and Ning Liao for their valuable help and insightful comments; to Chong-wei Xu, Eric Wu, and Taieb Znati for helpful discussions; and to M. Driscoll, T. Chen, W. Chou, J. Miller, and P. Prins for their excellent work on the detailed layout of the prototype design of the router chip.

I am very grateful to my parents and parents-in-law for their years of concern and support; to my wife Ben-1n for her constant encouragement and direct assistance in the preparation of this manuscript; and to my wonderful daughter Lana for being so understanding.

This work was supported in part by the DARPA ACMP project and in part by the State of Michigan RE/ED project.

TABLE OF CONTENTS

List of Tables
List of Figures
Chapter 1 Introduction
  1.1 Demand for massively parallel multiprocessors
  1.2 Distributed memory multiprocessors (DMMPs)
  1.3 Interprocessor communication
  1.4 Motivation and problem statement
  1.5 Thesis organization
Chapter 2 Distributed Memory Multiprocessors
  2.1 DMMPs under consideration
  2.2 Interconnection topologies for DMMPs
  2.3 Topological properties of hypercube
  2.4 Review of hypercube multiprocessors
Chapter 3 Issues on Interprocessor Communication
  3.1 A model for interprocessor communication
  3.2 Three types of interprocessor communications
  3.3 Centralized vs. distributed routing
  3.4 Packet vs. circuit switching
  3.5 Store-and-forward vs. virtual cut-through forwarding
  3.6 Adaptive vs. non-adaptive routing
  3.7 Software vs. hardware implementation
Chapter 4 Optimal Multicast Tree (OMT) — A Model for Interprocessor Communication
  4.1 Graph theoretical notation and definitions
  4.2 The Optimal Multicast Tree model
  4.3 Unicast and broadcast in hypercube multiprocessors
  4.4 Multicast in hypercube environment
Chapter 5 A Distributed Multicast Algorithm for Hypercube Multiprocessors
  5.1 Underlying rationale of the multicast algorithm
  5.2 The Greedy multicast algorithm
  5.3 An illustration example
  5.4 Performance study on the greedy algorithm
Chapter 6 A Hardware Router Design for Hypercubes
  6.1 Hardware design considerations
  6.2 Multi-destination message format
  6.3 An overview of the router
  6.4 Message Handling Unit (MHU)
  6.5 The processing unit
  6.6 The multicaster design
  6.7 The prototype MHU chip
Chapter 7 Routing in Faulty Hypercubes
  7.1 Fault-tolerant systems
  7.2 Design considerations for fault-tolerant routing
  7.3 Fault-tolerant routing algorithms
  7.4 A fault-tolerant router design
  7.5 General fault-tolerant routing problems
Chapter 8 Summary and Directions for Future Research
  8.1 Summary of major contributions
  8.2 Directions for future research
Bibliography

LIST OF TABLES

Table 1.1 Comparison of communication and computation of three DMMPs
Table 2.1 Hypercube system characteristics
Table 3.1 One hop communication comparison
Table 5.1 The actual addresses of source (00110) and destinations
Table 5.2 The reference array at node 00110
Table 5.3 The reference array after the first run of the algorithm
Table 5.4 The reference array after the second run of the algorithm
Table 7.1 The reference array at node 00110 in faulty case

LIST OF FIGURES

Figure 2.1 The generic structure of the DMMP under consideration
Figure 2.2 The generic structure of individual nodes in the DMMP
Figure 2.3 An 8x8 Omega network
Figure 2.4 A 16 node BBN Butterfly interconnection network
Figure 2.5 Examples of point-to-point interconnection networks
Figure 3.1 Communication between two neighboring nodes
Figure 3.2 One-to-one message-passing model (h hops)
Figure 3.3 Message-passing time in a 64-node NCUBE
Figure 4.1 A multicast example in 3-cube
Figure 4.2 An example of multicast in a 2-D mesh
Figure 4.3 The "standard" unicast algorithm
Figure 4.4 A broadcast algorithm using a weight
Figure 4.5 A broadcast algorithm without using weight
Figure 4.6 A broadcast algorithm using a Control vector
Figure 4.7 A broadcast tree generated by BROADCAST.3
Figure 4.8 One-to-two multicast trees in hypercube
Figure 4.9 Multicast algorithm for two destination case
Figure 4.10 One-to-three multicast tree patterns
Figure 4.11 Multicast algorithm for three destination case
Figure 4.12 One-to-four general multicast tree patterns
Figure 5.1 Greedy multicast algorithm
Figure 5.2 A multicast tree in a 5-cube
Figure 5.3 Comparison of three communication methods in a Q_6
Figure 5.4 Performance comparison of three multicast algorithms (under uniform distribution)
Figure 5.5 Performance comparison of Greedy algorithm with optimal solution (under uniform distribution)
Figure 5.6 Performance comparison of 3 multicast methods (under decreasing probability distribution)
Figure 5.7 Performance comparison of Greedy algorithm with optimal solution (under decreasing probability distribution)
Figure 6.1 The block diagram of a router in a Q_3
Figure 6.2 The block diagram of an MHU
Figure 6.3 The Processing Module
Figure 6.4 The Input port and the Store
Figure 6.5 The Broadcaster for a fault-free Q_3
Figure 6.6 The Multicaster and the Store
Figure 6.7 The Decoder (DECR) for the j-th column
Figure 6.8 The Maximum column Checker (MAXC)
Figure 6.9 The Column Selector (CLMS)
Figure 6.10 The prototype MHU chip
Figure 7.1 An example of routing in a faulty Q_4
Figure 7.2 Simulation results of the faulty model
Figure 7.3 A unicast algorithm for faulty hypercubes
Figure 7.4 An example of unicast in faulty hypercubes
Figure 7.5 A broadcast algorithm for faulty hypercubes
Figure 7.6 A broadcast tree in a faulty Q_4 generated by BROADCAST.F2
Figure 7.7 Greedy multicast algorithm for faulty hypercubes
Figure 7.8 Comparison of multicast trees in a fault-free and a faulty Q_5
Figure 7.9 Decoder (DECR) for j-th column for faulty Q_3
Figure 7.10 Broadcaster for faulty hypercubes
CHAPTER 1

INTRODUCTION

A distributed-memory multiprocessor (DMMP) is a computer system consisting of many processors in which each processor is physically associated with its own local memory, and all processors work independently and communicate through an interconnection network connecting the processors. The hypercube multiprocessor is the best known example of a DMMP and has attracted a great deal of attention in the past few years. Many hypercube multiprocessors, such as the NCUBE, iPSC, Ametek, and FPS T-Series, are commercially available. Massively parallel DMMPs possess the potential of achieving computation power beyond the megaflops range [HMSC86, Wile87]. Interprocessor communication is the fundamental means by which processors in such systems exchange data. However, it is known that the interprocessor communication mechanism is the major bottleneck of DMMPs, especially in the first generation systems.

Centered around the interprocessor communication problem, this dissertation research focuses on modeling, algorithm development, and hardware implementation of a versatile and efficient interprocessor communication mechanism. The proposed interprocessor communication mechanism not only significantly speeds up communication, but also has fault-tolerant capabilities. It can be readily applied to future generations of DMMPs to significantly increase the overall system performance.

1.1 DEMAND FOR MASSIVELY PARALLEL MULTIPROCESSORS

Powerful computer systems which can handle computationally intensive problems are increasingly in demand in many areas. Weather forecasting, simulations, modeling of complex physical phenomena, advanced database management, artificial intelligence, image processing and pattern recognition, mechanical engineering analysis, and medical diagnosis are some examples among many other large-scale scientific and engineering applications [HwBr84, FoOt84, BaPa86].

To cope with these challenging computational requirements, many advanced computer architectures have been developed. Great efforts have been made along two trends in order to increase computational power by orders of magnitude. One trend is to build vector machines with one or a few very powerful central processors, depending mainly on very fast circuit technology. For example, the Cyber 205 and Cray X-MP depend heavily on function unit pipelining and interleaved memory modules to gain high performance. The computational power of a single processor, however, is limited by both physical and architectural bounds, and achieving further performance improvement in such systems has become more and more difficult. Thus, a natural trend is to construct systems consisting of multiple processors, which has been shown in recent years to be the most straightforward and cost-effective approach to achieving high performance. With continuous technological advances in integrated circuits, more powerful and compact, but less expensive, microprocessor, memory, and communication chips are available. This makes the construction of massively parallel multiprocessor systems feasible. In such systems, by adding more processors and using a better interconnection network, the performance can be improved to a level far beyond that of existing supercomputers.
1.2 DISTRIBUTED MEMORY MULTIPROCESSORS (DMMPs)

Existing multiple-processor computer systems can be characterized into two structural classes: the SIMD (single instruction stream / multiple data stream) architecture and the MIMD (multiple instruction stream / multiple data stream) architecture. All these architectures are centered around the concept of parallelism [HwBr84, Patt85].

An SIMD computer consists of an array of processors, a network to interconnect these processors, and a central control unit to synchronize the operation of the processors. In an SIMD computer, all processors execute the same instruction at any given time, but multiple sets of operands are fetched and operated on in multiple processors in a synchronous fashion. All the processors are controlled by the central controller.

Typical examples of SIMD machines include the Illiac IV, which has a central control unit and 64 mesh-connected processing elements, each of which has direct connections to four other processing elements [BBKK68]; the MPP, a system which contains 16K bit-serial processing elements connected in a two-dimensional mesh structure [Batc80]; and the Connection Machine, consisting of 64K bit-serial processors with the structure of cube-connected toroidal lattices [Hill85, SASL85]. Because of the synchronous feature, these machines are good for solving very structured problems, such as matrix operations and image processing, which deal mainly with array data types. For example, the primary purpose of the MPP is to process satellite imagery.

An MIMD computer, or multiprocessor system, consists of many processors communicating through an interconnection mechanism [HwBr84]. In an MIMD computer, all processors work independently in an asynchronous fashion. They execute different instruction streams on different sets of operands. Usually, these processors are homogeneous and the communication delay between processors is relatively small but non-negligible. Because of the multiple instruction stream feature, MIMD computers provide a more general computation model. They are more flexible and versatile than SIMD computers.

Depending on the structure of the memories in the system, multiprocessor systems can be classified as centralized-memory multiprocessors and distributed-memory multiprocessors [NiKP87]. In a centralized-memory multiprocessor, all memory modules are equally accessible to all processors. A processor/memory interconnection network, therefore, is needed to allow all processors in the system to access the memory modules. There are several hardware bottlenecks in such a system, including the number of processors, the memory bandwidth, and the bandwidth of the interconnection network. For example, many commercial centralized-memory systems, including Sequent's Balance 8000 and 21000, Encore's Multimax, and the CRAY X-MP [DoDu85, HwBr84, Olso85], can have no more than 30 processors.

In a distributed-memory multiprocessor, however, each memory module is physically associated with a processor. No memory is globally accessible. An interprocessor communication network is needed to allow the processors to communicate with each other. Because each memory module is attached to a corresponding individual processor, the performance of an algorithm depends on how well the application problem is partitioned and mapped onto the processors.
A multiprocessor system may also have a mixed memory structure that provides a local memory to each processor as well as global memory modules shared by all processors.

From the viewpoint of processes, there are two basic process synchronization and communication models. One is the shared-memory model, in which the system has a global memory accessible by all processors. The processes communicate through shared variables. In such a system, the access time to a unit of data is the same for all processors. A hardware device or a software protocol is required in such systems for arbitrating access to the memory among the processes which share it. The shared memory may cause a software bottleneck.

The other process synchronization and communication model is message-passing, in which processors communicate by explicit message-passing through the interprocessor communication network. In a message-passing system, the performance of an algorithm depends on how well the application problem is partitioned and mapped onto the processors, and on how efficient the communication mechanism in the system is.

Centralized-memory multiprocessors usually adopt the shared-memory model, whereas distributed-memory multiprocessors usually prefer the message-passing model [NiKP87]. To construct a massively parallel system consisting of hundreds or even thousands of processors, the distributed-memory structure is a promising approach. In order to reduce the software bottleneck in such a complex system, message-passing is naturally the better choice as the process synchronization and communication mechanism. Therefore, in this study, we consider message-passing distributed-memory multiprocessors. From here on, we use the term distributed-memory multiprocessor (DMMP) with the implication of the message-passing model.

1.3 INTERPROCESSOR COMMUNICATION

As mentioned in the previous section, a DMMP system does not have the software bottleneck found in a shared-memory system. However, DMMPs do have an interprocessor communication problem.

There are three basic communication types: unicast (one-to-one), broadcast (one-to-all), and multicast (one-to-many). In the first generation DMMPs, only unicast is directly supported. Message routing for broadcast is done by subroutine calls at the source and each intermediate processor. Multicast communication is usually implemented by issuing multiple unicasts.

The interprocessor communication mechanism and the computational power of each individual processor in a DMMP system are two of the major factors which affect the performance of the system. In order to fully exploit the computational power of a DMMP, computation and communication must be balanced; that is, a proper communication/computation ratio (about 1) should be achieved [ShFi87]. However, the communication/computation ratios in the first generation DMMPs are very high. Table 1.1 compares the communication time (to transfer 64 bits of data over a link) to the computation time (to perform a multiplication on two double-precision floating-point numbers) for three first generation DMMPs [Duni87].

Table 1.1 Comparison of communication and computation of three DMMPs

    Item                          AMETEK S-14    Intel iPSC    NCUBE
    8-byte transfer time (us)         640           1120        470
    8-byte multiply time (us)          33.9            43.0      14.7
    Comm./Comp.                        19              26        32

As can be seen from the table, the communication/computation ratios are all greater than 10.
This indicates that the communication mechanism provided in the first generation machines does not match the speed of their powerful processors and thus becomes the major bottleneck of system performance.

Therefore, providing efficient interprocessor communication is the key to the successful exploitation of parallelism in DMMPs. With the advent of VLSI technology, processors are becoming faster and more powerful. This makes interprocessor communication more and more important; it will dominate computing cost in both hardware and software [HLSM82, Fox83].

In order to improve the overall system performance, the communication overhead must be significantly reduced. As will be discussed shortly, hardware implementation of efficient communication algorithms is a necessity for the success of future generation DMMPs. Therefore, we need to develop algorithms which can efficiently handle all three types of communication, and to consider hardware implementation of these algorithms.

Interprocessor communication in faulty DMMPs is another important issue. Since we are studying highly parallel systems consisting of up to one thousand or even more processors, the reliability problem becomes a natural concern. A good interconnection topology, such as the hypercube, may provide inherent fault-tolerant capability by having multiple routing paths between each pair of processors. However, when a single processor fails, the routing methods for a fault-free hypercube can no longer be applied. We need to provide a communication mechanism which has certain fault-tolerant capability. Fault tolerance is especially important when a large job is to be carried out which requires computation time close to or longer than the average fault-free run time of the system.

1.4 MOTIVATION AND PROBLEM STATEMENT

As indicated in [ReFu87], the emergence of multiprocessor systems in the past years has posed several important and challenging problems in (1) network topology selection, (2) communication hardware design, (3) operating system design, (4) fault tolerance considerations, and (5) algorithm design.

In recent years, the hypercube has stood out as a dominating interconnection topology for message-passing distributed memory multiprocessors. Some experts believe that hypercube multiprocessors are the most promising highly parallel multiprocessor systems [Myer86, Seit85]. However, the first generation hypercubes do not perform as well as people once expected, mainly because of the software implementation of communication algorithms. In order to fully exploit the computational power of DMMPs, all the above problems have to be solved. Design of efficient parallel algorithms has been a major research issue in the application domain in the past few years. Development of operating systems better suited to DMMPs is another important issue in the system software domain [MuBA87, FoKo86]. Some research has been done to develop communication hardware implementing unicast for second generation hypercube multiprocessors [DaSe86, iPSC88]. However, hardware design for multicast communication has received little attention, and no fault-tolerance considerations have been made in the existing communication hardware designs.

Motivated by the above observations, our major concern is communication hardware design and the related fault tolerance issues.
The objective of this research is to study a fast, versatile, and fault-tolerant message-passing mechanism for DMMPs, in particular for hypercube multiprocessors. The major goal is to develop a hardware router which can efficiently handle all three types of interprocessor communication, including multicast communication, which is highly demanded but not directly supported by any existing DMMP. The router works not only in fault-free hypercubes, but also in faulty hypercubes in which each fault-free node has at most one faulty neighboring node. The router can be readily applied to future generation DMMPs in order to significantly speed up interprocessor communication and to greatly improve the overall system performance. Our ultimate goal is to make future DMMPs true MIMD supercomputers.

1.5 THESIS ORGANIZATION

This introduction has discussed the need for massively parallel computing systems, especially DMMPs, and the problems existing in current DMMPs. The motivation of this dissertation research has also been addressed.

The next chapter gives a brief review of multiprocessor systems and shows why we select distributed-memory multiprocessors. Some background about interconnection topologies, in particular the hypercube topology, is given, followed by a comparison of existing hypercube multiprocessors. Chapter 3 addresses some fundamental issues in interprocessor communication, which is the key to the performance of such systems.

In Chapter 4, following the presentation of some graph theoretical notation and terminology, multicast communication is formally modeled as an Optimal Multicast Tree (OMT), a graph theoretical problem. The OMT is a model for all three types of communication in a broad sense. Following the OMT model, algorithms for unicast and broadcast are studied first. Then the multicast problem in some simple cases is examined. Based on observations about the complexity of the problem and on previous research in some closely related areas, the OMT problem is conjectured to be NP-hard.

After presenting optimal solutions to some special cases, Chapter 5 presents a heuristic multicast algorithm for the hypercube environment. The proposed multicast algorithm is distributed in the sense that the overall routing is not calculated solely by the source; instead, each involved node decides its own routing. Simulation results of the algorithm, with comparisons to alternative multi-destination routing methods, are presented and attest to the efficiency of the proposed algorithm.

From the system designer's point of view, simply proposing a polynomial time complexity algorithm is still not of much interest; software approaches are well known to be too slow. Thus, besides the theoretical study and performance analysis of the proposed algorithm, the hardware design of a dedicated router which efficiently implements the algorithm is presented in Chapter 6.

The problem of routing in faulty hypercubes is studied in Chapter 7. No hardware routing devices developed so far have fault-tolerant capability. Chapter 7 first presents a model for faulty hypercube multiprocessors, then proposes fault-tolerant routing algorithms and demonstrates how the hardware design presented in Chapter 6 can be modified to work properly in faulty situations.

The last chapter summarizes the work of this dissertation research and presents suggestions for related future research.
CHAPTER 2

DISTRIBUTED MEMORY MULTIPROCESSORS

In order to maximize the performance of a multiprocessor system, the components of the system should be balanced so that there is no bottleneck in the system. As VLSI technology advances rapidly, each individual processor becomes faster and more powerful. This situation puts great pressure on system designers to provide an efficient communication mechanism which can match the speed of the processors. Data communication is the key to the successful exploitation of parallelism, while the interconnection network plays a fundamental role in determining communication efficiency. This chapter discusses why message-passing distributed memory multiprocessor systems have become popular in recent years. Various interconnection topologies are investigated; in particular, the topological properties of the hypercube are studied in detail. A brief comparison of some commercially available hypercube multiprocessors is given.

2.1 DMMPs UNDER CONSIDERATION

In this study, a message-passing distributed-memory multiprocessor (DMMP) is considered. Figure 2.1 shows the generic structure of such a system.

There are N processors in the system, with N ranging from hundreds to thousands. The processors are interconnected by a certain interconnection topology. The processors are also connected to a host processor or I/O processors for communicating with external devices. All processors work independently in an asynchronous fashion and communicate by message-passing through the interconnection network.

Figure 2.1 The generic structure of the DMMP under consideration (host and I/O processors connected to nodes 0 through N-1 by a point-to-point interprocessor network)

Figure 2.2 shows the generic structure of a node in the system. Each node has its own processor and local memory. The processor/memory pair is associated with a router that handles the interprocessor communication task. Each router has n incoming links and n outgoing links connected to its n neighboring nodes, and another pair of links connected to the local processor. Through the input/output links, the router communicates with the routers of other nodes.

Basically, the processor/memory pair performs computation, while the router handles interprocessor communication. The local processor is involved in the communication task only when the node is a source or a destination.
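The node organization just described can be read as a simple data structure. The following C sketch is added purely for illustration; the type and field names and the fixed dimension are assumptions of the sketch, not part of the dissertation's design.

    /* Minimal sketch of one DMMP node: a processor/memory pair plus a router
     * with n input and n output channels to neighbors and an internal channel
     * pair to the local processor.  Illustrative only. */
    #include <stdio.h>

    #define N_DIM 10                       /* assumed cube dimension (e.g., 1024 nodes) */

    typedef struct channel {
        unsigned char *buffer;             /* message buffer associated with this link */
        int            busy;               /* is the link currently transmitting?      */
    } channel_t;

    typedef struct router {
        channel_t in[N_DIM], out[N_DIM];   /* links to the n neighboring routers       */
        channel_t from_proc, to_proc;      /* internal link pair to the local processor */
    } router_t;

    typedef struct node {
        unsigned  address;                 /* binary node address, 0 .. 2^n - 1        */
        router_t  router;                  /* handles all message forwarding           */
        /* local processor state and local memory would be modeled here */
    } node_t;

    int main(void) {
        node_t node = { 0 };
        node.address = 5;                  /* node 0000000101 in a 10-cube */
        printf("node %u has %d neighbor links\n", node.address, N_DIM);
        return 0;
    }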
Since the bus-based architecture is not good for 13 massively parallel DMMPs, it is mainly used in shared memory multiprocessors such as Encore’s Multimax and Sequent’s Balance. A multistage network consists of many stages of interconnected switches. Only the first stage and the last stage are connected to nodes. Switches in all intermediate stages are connected to other switches. A typical multistage interconnection network in a mul- tiprocessor system with N=k" processors consists of n stages with N /k 100: switch boxes at each stage. A notable feature of the multistage network is that the distance between any pair of nodes is the same. The most popular multistage network is the Omega (shufi‘ie-exchange) network and its variations. A generalized b"xb" shuffle- exchange network consists of n stages of bxb crossbar switches linked by perfect shuffle interconnections. Usually b=2, and n=log2N. An omega network with N=8 and b=2 is shown in Figure 2.3. A typical example of DMMP system using multistage switching network is the BBN Butterfly machine [BBN87]. Its interconnection tepology resembles that of a Fast Fourier Transform Butterfly. A 16 node BBN Butterfly interconnection network is shown in Figure 2.4. The interconnection network is constructed by two stages (log416=2) of switching units. Each stage consists of 4 (16/4=4) 4x4 switches. The major disadvantage of the multistage network is that it is very difficult to scale up the system to contain a very large number of processors, since the cost and the com- plexity of the switching hardware will grow up rapidly. Thus, multistage networks are good for medium size systems. In order to reduce the software bottleneck, message-passing seems to be a reason- able choice for process synchronization and communication mechanism. For a distributed-memory multiprocessor system with up to thousands of processors, point-to- point interconnection networks provide a promising interconnection structure. 14 H MAUJN \IO‘ 0 0 t) O __,o 2 l I 1 1 1 2 _. 2 5 ,_. 3 1 4 l _. 6 1 V 'Z 7 7 . 7 Figure 2.3 An 8x8 Omega network 0 0 4 4 8 8 12 12 1 1 5 5 9 9 13 13 2 2 6 6 10 10 14 14 3 3 7 7 11 11 15 15 Figure 2.4 A 16 node BBN Butterfly interconnection network 15 The following issues have to be considered when designing or selecting a point-to- point interconnection topology: 1. Diameter of the network. This is defined to be the maximum distance between all pairs of nodes in the system. (The distance between two nodes is the length in number of edges of a shortest path between two nodes. See formal definition in Section 4.1.) A network with a smaller diameter has smaller worst case delay caused by the message handling at intermediate nodes. The average communica- tion distance also depends mainly on the network diameter. 2. Degree of the network. This is defined to be the number of connection links per node. A large degree of network can provide a number of alternative paths between pairs of nodes. However, a large degree also implies more expensive hardware cost. 3. Regularity. A regular network is usually easy to construct and expand, and may provide an easy routing mechanism. 4. Routing. The ease of message routing is essential to the system performance and is greatly affected by the network topology. 5. Expansion. This is measured as the ease of adding new nodes to the network. A network should be incremently expandable. That is, the size of the network can be increased by easily adding more nodes to it. 6. Robustness. 
A reliable operation of the interconnection network is very important to the overall system performance. When the number of processors and communi- cation links increases, the probability that some components fail increases propor- tionally. Thus, it is desirable to design a network which can keep full connection capacity with graceful degradation when existing certain faults. Some of the above issues are related; however, some may conflict. Thus, a trade- off must be made between them for each individual design. (i) 3-ary 3—cube (k) 3-D cube (1) 3-D cube connected cycle Figure 2.5 Examples of point-to-point interconnection networks 17 Various point-to—point topologies have been studied. Figure 2.5 shows some popu- lar point-to—point topologies. Typical examples are ring, tree, star, mesh (2-D and higher dimension), lattice, completely connected, hypercube, etc. Among those topologies, the hypercube has attracted most attention in the recent years. The topological properties of hypercube will be discussed in detail in Section 2.3. Another topology, the binary tree (and its augmented variations) has also been exten- sively studied. Tree structure is especially good for AI application. Tree topology has a small degree, thus, low hardware cost. However, it has poor robustness and may have a congestion problem toward the root. With some addition of links, the augmented binary trees can somewhat alleviate the above problems. A typical example is the fidl ring binary tree [DePa78], as shown in Figure 2.5(a), which is a regular binary tree with all nodes in the same level connected as a ring. A rather general topology is the k-ary n~cube (torus) consisting of It" nodes. It is a topology with n dimensions, and k nodes at each dimension. The k nodes in a dimension are connected as a ring. Figure 2.5(g) is a 3-ary 2—cube, and Figure 2.50) is a 3-ary 3- cube where some nodes and edges are not shown. The topology of a k-ary n-cube is equivalent to a corresponding multistage network whose switches are replaced by processor/memory pairs [GKL883]. Binary cube is a special case of k-ary n—cube (k =2). A 2-D mesh can also be considered as a k—ary 2-cube. Two more cube-related topologies are the cube-connected cycle (CCC) [PrVu81] and hypernet [HwGh87]. Figure 2.5(1) is an example of a cube-connected cycle. The major motivation of CCC is to keep the node degree constant, so that it is easier to expand a CCC than a hypercube. Hypernet is a class of hierarchical networks for modu- lar construction of very large size parallel systems by providing structured building blocks for nodes and links. While numerous interconnection topologies have been proposed, the binary cube is still the most popular one. 18 2.3 TOPOLOGICAL PROPERTIES OF HYPERCUBE Hypercube structure has been the subject of many research projects and has been studied from different perspectives. As a result, many topological properties of hyper- cube have been discovered [Fold77, BaPa86, BrSc86, HMSC86, KrVC86, SaSc85a, SuBa77, TPPL85]. The properties of an n dimensional hypercube graph (Qn) can be briefly summar- ized as follows: (1) (2) (3) (4) (5) (6) (7) A Q, has N=2" nodes, with addresses from 0 to 2"—1 in binary form (b,,_1b,,_2 . . - b0), and n2""1 edges. There is an edge (or link) between two nodes if and only if the binary addresses of the two nodes differ at exactly one bit position. If the bit position is position i, then the edge is said to be at i-th dimension, or be the i-th dimensional edge. 
Thus, a Q,, is n-regular (each node has exactly n edges connecting to n neighboring nodes). A Q,, can be recursively constructed by combining two Q,,-1’s. Let (b,,_2 - - - b0) be an address in Q,,_1, then there is a link between two corresponding nodes (Obn—z ' ' ' b0) and (lbn-Z ' ‘ ° b0). A Q,, can be split into two Q,,_1’s in n different ways. Namely, all nodes having value 1 at bit position i and all edges incident to these nodes are put into one sub- cube; all other nodes and their incident edges are put into another subcube, for OSiSn—l. Moreover, it can be split into 2" QH's (lSkSn—l). The distance between any two nodes is the number of bit positions by which the two binary addresses differ. That is, the Hamming distance of the two binary numbers. The diameter of a Q, (the maximum distance between all pairs of nodes in a Q") is n. There are d! distinct paths of length d between two nodes of distance d in a Q,,. 19 (8) A set of n node—dis joint paths between any two nodes in a Q,, can be constructed. If the distance between the two nodes is d and all paths been selected are as short as possible, then d of the n paths are of length d, and the remaining n—d paths are of length d+2. The construction of the set of paths is not unique if d >2. (9) Q, is a bipartite graph. Thus, there are no cycles of odd lengths in it. (10) A ring of length p=Zq can be mapped into Q,. when 254s?-l . (1 l) A binary tree of height n-2 (i.e. n-l levels) can be mapped into Q,, (a single node is a 1 level tree of height 0). (12) A d—dimensional mesh (mlxmzx-med) can be mapped into Q" if d 210g; [mg-I S n. i=1 One of the most important advantages of hypercube topology is that it is a superset of many other topologies, such as ring, two or higher dimensional mesh, tree, etc. As we know, different application problems are best executed in systems with different archi- tectures. Many numerical algorithms can be naturally decomposed such that the com- munication pattern required by the job tasks matches the hypercube topology. Some popular topologies, such as mesh, tree, FFI‘(Fast Fourier Transform), are used in a great many scientific applications. For example, matrix operations (matrix-vector multiplica- tion, convolution, etc.) are suited to be solved in mesh structured multiprocessors, while search algorithms, linear recurrences are best executed in tree su'uctured multiproces- sors. The FFT, which is one of the most common used computational algorithms in almost all areas of scientific computation [BaPa86], can be perfectly mapped into hyper- cube topology. Because of the wide range of embeddability of hypercube topology, problems of these structures can be partitioned and mapped into a hypercube very easily and be solved efficiently. The diameter of a hypercube is relatively small, compared with other tepologies such as tree and 2-D mesh. It grows logarithmically with the number of nodes. This 20 makes it possible to construct a system consisting of thousands of processors. Each node in a hypercube is topologically identical. There are no corner-versus- edge, or root-versus-leaf nodes, as are found in regular grids and trees. Because of the symmetry, there is no special congestion point as would happen in the tree topology. This symmetry also makes message routing in the hypercube relatively easy, which will be discussed in detail later. The symmetry is also a useful feature for dynamic reconfiguration of the system. Hypercube t0pology provides multiple paths between any pair of nodes in the network. 
This makes the hypercube topology inherently fault- tolerant. All these properties make hypercube multiprocessors very versatile and suitable for a wide range of computational applications [FoOt84, BaPa86]. As an impressive exam- ple, in an 10-dimensional N cube/ten hypercube, for some specially selected problems with well tuned algorithms, a speedup of 1009 to 1020 over uniprocessor has been achieved [GuMB 88]. 2.4 REVIEW OF HYPERCUBE MULTIPROCESSORS Since the first hypercube multiprocessor, the COSMIC Cube, was demonstrated in Caltech [Seit85]. many other hypercube multiprocessors have been made commercially available. Examples of hypercube multiprocessors include Intel’s iPSC (up to 128 pro- cessors), Ncube’s hypercube (up to 1024 processors), Ametek’s S/14 (up to 256 proces- sors), FPS’s T series (up to 214 processors), and JPL’s MARK III [Amet86, GrRe86, HMSC86, SASL85, P'I'LP85]. In the rest of this section, we briefly outline a typical DMMP, the N CUBE. Then we give a comparison of the above mentioned systems. 2.4.1 Characteristics of a typical hypercube - the NCUBE As an example of DMMPs, we briefly review the characteristics of the NCUBE hypercube multiprocessor [HMSC86]. 21 System architecture: Modularity and scalability are two architectural features of DMMPs. An NCUBE hypercube multiprocessor can be configured to have various size by combining different number of processor boards and I/O boards to form a system. A maximum size NCUBE/ten has 1024 nodes, which consists of 16 processor boards and 8 I/O boards. A processor board contains 64 nodes. The U0 boards include host boards, graphics boards and open systems boards that can be configured for custom design. An I/O board contains 16 1/0 processors, each having connections to 8 nodes. Thus, an I/O board has connections to a 128-node subcube. At least one of the I/O boards must be a host board. Host: An NCUBE has one or more (up to eight) host boards. A host board has an Intel 80286 to run the Axis operating system. The board has 4Mbyte memory shared by the host processor and other processors on the board. It supports peripherals such as ter- minals, disk drives, tape drives, and network controllers, etc. The host(s) provide the primary user interface and perform functions such as editing, debugging, program and data downloading, performance monitoring, file maintenance, etc. Node: A unique feature of the NCUBE is its compactness. Each node requires only one processor chip and six memory chips. The processor chip is custom designed, which contains a vax-like 32-bit processor, built-in floating-point support, memory con- troller support, and eleven bidirectional communication channels. Ten of them are con- nected to 10 neighboring nodes, the other one is connected to a U0 processor. Each node has 128K or 512Kbyte memory. All eleven channels can be active simultaneously. At 10 MHz clock, a node executes non-arithmetic instructions at 2MIPS or single preci- sion floating point operations at 0.5MFLOPS. Communication: Communication between nodes is done by means of asynchro- nous DMA operations over the full duplex bidirectional links between neighboring nodes. For a 101MHz clock, the data transfer rate is about leyte/second on each direc- tion (700Kbyte/second for a 7MHz clock). Because of the DMA, the processor needs 22 only to initiate a send or a receive operation for each communication request. Commun- ication between the hypercube nodes and external devices are handled by the I/O proces- sors through the eleventh channel. 
Each I/O processor has a 128K (or 512K) RAM which occupies a fixed slot of memory space in the host's 4 Mbyte memory. To perform an input operation, the data are first sent to the host's memory and then transferred to the target nodes through DMA channels. Output operations are performed in a similar way.

System software: The host processor (80286) in each host board runs the Axis operating system, a variant of Unix which is compatible with both Unix System V and 4.3 BSD. Since an NCUBE may have more than one host board, each having its own file system, potential data inconsistencies and conflicts may exist. Axis provides the capability to manage the multiple file systems as one unified, distributed file system. It also supports hypercube partitioning, so that a user can allocate a subcube of a proper size for a particular application. Both FORTRAN-77 and C are supported by Axis. Each node runs a small operating system called Vertex, whose main function is to support message-passing between neighboring nodes.

A system of 1024 nodes may achieve 2000 MIPS, and 500 MFLOPS on single precision or 300 MFLOPS on double precision operations.

2.4.2 Comparison of hypercube multiprocessors

The characteristics of several hypercube multiprocessors are compared in Table 2.1 [ShFi87, ReFu87, Duni87, PTLP85]. Both the iPSC and the Ametek S/14 use standard microprocessors, while the FPS T-Series uses the Inmos Transputer [GuHS86]. NCUBE developed its own custom-designed VLSI chips. All systems have a host processor which is used for program downloading/uploading and for data transfer between the nodes and external devices. In the table, rows 5 to 7 give the memory size at each node, the number of communication channels connected to each node, and the data transmission rate of each channel (in bits per second), respectively. Rows 11 and 12 give the performance of the entire system, with the maximum number of nodes, for non-arithmetic operations and for double precision floating-point operations, respectively.

Table 2.1 Hypercube system characteristics

    No.  Item             Ametek S/14   FPS T-Series   Intel iPSC   Mark-III    NCUBE
    1    # of PEs         16-256        8-4096         32-128       32-1024     64-1024
         (clock)          8 MHz         20 MHz         8 MHz        16 MHz      7 or 10 MHz
    2    Processor        80286         Transputer     80286        68020       Custom
         (clock)          8 MHz                        6 MHz        16 MHz
    3    Floating point   80287         included       80287        68881       included
    4    I/O processor    80186         included       80186        68020       DMA controller
    5    Memory (bytes)   1M            1M             0.5-4.5M     4M          128-512K
    6    Channels/node    8             4x4            7            8           11
    7    Bandwidth (bps)  3M            20M            10M          13.5M       10M
    8    Host             VAX           MicroVAX       286/310      System 19   80286
    9    Host OS          Unix/Ultrix   VMS            Xenix        CrOS        Axis
    10   Node OS          Xon           Occam          iPSC/OS      Mercury     Vertex
    11   MIPS             200           30000          100          2000        2000
    12   MFLOPS           12            65000          8            2000        300

CHAPTER 3

ISSUES ON INTERPROCESSOR COMMUNICATION

In a distributed memory message-passing multiprocessor system, processors do not have shared memory. Message passing is the only means of interprocessor communication. If a node wants to send a message to a neighboring node, the message delivery is relatively simple. However, if a node wants to send a message to a distant node, the message has to traverse some intermediate nodes. To send a message from a node to a number of other nodes, the situation becomes more complicated.

The major problem in interprocessor communication is message routing, that is, determining which path(s) should be used to deliver a message from the source node to some destination node(s). In this chapter, we first introduce a model for analyzing the message passing time between two nodes.
Then, we discuss some fundamental issues in interprocessor communication, which are the keys to the performance of DMMPs.

3.1 A MODEL FOR INTERPROCESSOR COMMUNICATION

Let us consider passing a message from a source node u_s to a destination node u_d at distance d(u_s, u_d) = h in a DMMP. Some benchmarking has been done for several hypercubes [ShFi87, GrRe86]. For communication between neighboring nodes (h = 1), as depicted in Figure 3.1, it is assumed that the time t needed to pass a message of size S bytes can be expressed as:

Figure 3.1 Communication between two neighboring nodes

    t = t_l + t_c S        (3.1)

where t_l is the communication latency, that is, the overhead time needed by processor u_s to initiate the communication and by u_d to terminate the communication, and t_c is the time needed to transmit one byte of data. After measuring the communication time for messages of different lengths, the parameters t_l and t_c can be determined by a least-squares fit. Table 3.1 shows the results for four first generation hypercube multiprocessors [GrRe86, ShFi87].

Table 3.1 One hop communication comparison

                         Mark-III   Intel iPSC   Ametek S/14   NCUBE/ten
    Latency t_l (us)       95         1700          550           384
    t_c (us/byte)           0.563        2.83          9.53          2.6

For a more general communication, that is, communication between any two nodes (h ≥ 1) with a single destination or multiple destinations, we propose the following model. Figure 3.2 depicts a path selected for sending a message from u_s to u_d, which could be a one-destination communication, or a path from the source to one of several destinations in a multiple-destination communication. As discussed in Section 2.1, we assume that each node is associated with a router to handle communication tasks.

Figure 3.2 One-to-one message-passing model (h hops)

Let τ_st = τ_ss + τ_sd be the node-to-node communication start-terminate time, where τ_ss is the start-up time spent at the source node, that is, the time interval from the moment when the source processor p_s issues a send command until the moment when the router begins to receive the message; and τ_sd is the time spent at the destination node to terminate the communication, that is, the time interval from the moment when the router begins to send out the message until the moment when the destination nodal processor p_d receives the entire message. Let τ_d be the delay time from the moment when a sending router starts transmitting the message until the receiving router (at a neighboring node) starts to make its routing decision. In order to simplify the discussion, the delay time τ_d at the source node (the message transfer time from the source processor to its router) is treated the same as that between two routers. The parameter τ_d depends on the size of the message, the bandwidth of the link, and the message forwarding scheme, which will be discussed in detail in Section 3.5. Finally, τ_p is the processing time required to make a routing decision at a router. As shown in Figure 3.2, for an h-hop message passing, h+1 routers are involved. Thus, the time t required to send a message to a node h hops away can be expressed as in Eq. (3.2):

    t = τ_st + (h+1) τ_p + (h+1) τ_d        (3.2)

In the remaining sections, we will discuss different communication mechanisms that affect the communication time.
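To make the two expressions concrete, the short C fragment below evaluates Eq. (3.1) with the NCUBE/ten parameters of Table 3.1, and Eq. (3.2) with placeholder values for τ_st, τ_p, and τ_d, since no numbers for those parameters are given here. It is an added illustration, not part of the model itself.

    /* Worked sketch of Eqs. (3.1) and (3.2).  The only measured numbers used
     * are the NCUBE/ten figures from Table 3.1; all other values are made-up
     * placeholders for illustration. */
    #include <stdio.h>

    /* Eq. (3.1): one-hop time for an S-byte message */
    double one_hop_us(double t_l, double t_c, double S) {
        return t_l + t_c * S;
    }

    /* Eq. (3.2): h-hop time; h+1 routers each add a routing-decision time tau_p
     * and a link/startup delay tau_d on top of the start-terminate time tau_st */
    double h_hop_us(double tau_st, double tau_p, double tau_d, int h) {
        return tau_st + (h + 1) * tau_p + (h + 1) * tau_d;
    }

    int main(void) {
        /* NCUBE/ten, 1 Kbyte message, one hop (t_l = 384 us, t_c = 2.6 us/byte) */
        printf("one hop, 1 KB: %.0f us\n", one_hop_us(384.0, 2.6, 1024.0));
        /* hypothetical parameters for a 4-hop transfer under Eq. (3.2) */
        printf("four hops:     %.0f us\n", h_hop_us(400.0, 5.0, 50.0, 4));
        return 0;
    }

Note that under Eq. (3.2) the per-hop terms grow linearly with h; how large τ_d is for each hop depends on the forwarding scheme, which is the subject of Section 3.5.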
Unicast (One-to-one) communication is the sending of a message from a source node to one destination node. It is directly supported by all DMMPs. Unicast in a hypercube multiprocessor can be easily implemented based on the Hamming code. If the distance between the source and the destination is d (i.e., their binary addresses differ at exactly d bit positions), a total number of d! distinct (shortest) paths, all having d-l intermediate nodes, can be constructed between the two nodes by changing, one at a time, the values at the d bit positions in different orders. Using any one of the (1! paths, the message will traverse the same number of links. Broadcast (one-to-all) is a type of information exchange in which a source node wishes to send a message to all other nodes in the system as quickly as possible. A fre- quently used approach can be described as follows. In the first step, the source sends out n copies of the message to all its n neighboring nodes. Subsequently, these nodes dupli- cate and send out the message to those neighboring nodes which have not yet received the message. The process continues until all nodes receive the message. Broadcast in hypercube has been studied in [SuBa77, SaSc85b, H01086, BrSc86]. It has been shown that broadcast in an n-cube can be done in n time steps even if each node sends a mes- sage to no more than one neighbor during each time step. In Multicast (one-to-many) communication, a node wants to send the same mes- sage to k other nodes (1. ' A broadcast algorithm has also been presented in [SuBa77] and followed by other researchers [BrSc86, HoJ086, SaSc85b, Kat588]. Sullivan and Bashkow’s algorithm works by sending a weight along with the data part of the message. The weight indi— cates how the receiving node should continue sending the message. Eventually, each node in the system will receive the broadcast message exactly once and in no later than n time steps. The algorithm is listed in Figure 4.4 as BROADCASTJ. Algorithm BROADCAST.1: (* The links of each node is assigned dimension 1 to n *) (* the algorithm is executed at every node *) begin if uo=u, then weightz=n +1; (* otherwise uo receives a weight from another node *) for i :=1 to weight-l do begin new.weight :=i; send message to dimension i with new.weight end; end. Figure 4.4 A broadcast algorithm using a weight The broadcast algorithms presented in [BrSc86, HoJo86] and [SaSc86] do not use explicit weight. Instead, the relative address of local node with respect to the source node is used to direct the broadcasting. The basic idea can be described as follows [BrSc86]. The neighbors of any node n in an n-cube are of the form a(u)$2i for i=0, 1, - - - ,n --1. To implement a broadcast, each node sends broadcast message only to 47 those neighbors a(u)$2} such that 2} >a (u). The algorithm is shown in Figure 4.5 as BROADCASTZ. Algorithm BROADCAST.2: begin (* Input: actual local address a (u o) and source address a (u,) *) If a (u o)=a(u,) then Send message to all neighbors else begin Send the message to local processor; (* Calculate relative address of local node with respect to to the source *) r(u0)5b(n—1)b(n—2) ' ‘ ° 17} ' ' ' b0=a (u0)$a(us); Find the largest i such that b,=1; Send out message at the j-th dimension for all l j then Control,- [b ]:=1 9. else Control ,- [b ]:=0; 10. send message with Control ,- to dimension j 1 1. end; 12. end; 13. end. Figure 4.6 A broadcast algorithm using a Control vector to dimension i. 
Control,- is the control vector to be sent to dimension i from current node. Different dimensions will receive different new Control vectors calculated based on the the dimension id and the content in the received Control* vector. At the source node, the vector Control“ is initialized to all 1’s. The algorithm is to be executed in the router or I/O processor of every node. Comparing algorithm BROADCAST.2 with algorithm BROADCAST.3, we can see that the two algorithms are essentially the same. However, BROADCAST.3 is more flexi- ble and versatile than BROADCAST.2. As will be discussed later in Chapter 7, since the Control vectors are calculated at each node, algorithm BROADCAST.3 can be modified to route messages in a hypercube with certain faulty nodes. However, BROADCAST.3 is not as simple as BROADCAST2. Also it requires each message to carry a distinct Control VOCIOI'. 49 1111 Figure 4.7 A broadcast tree generated by BROADCAST.3 50 The broadcast tree resulting from executing either algorithm BROADCAST.2 or BROADCAST.3, in a 4-cube, is shown in Figure 4.7, where a circle represents a node with its address inside the circle. The binary numbers outside a circle is the binary Control vector received by that node. The source address is assumed to be 0000. The arrows form the message routing pathes, while the dashed lines are the links not involved in this particular broadcasting. 4.4. MULTICAST IN HYPERCUBEENVIRONMENT We have discussed unicast and broadcast in hypercube environment. In this section, we take a look at multicast communication. We first investigate special characteristics of multicast and examine some simple examples. Then we will go to general case in next chapter. Let Q,(V,E) be the topology of n-cube as defined above, and a node uo be the source node in a multicast. The node set V of graph Qn(V, E) can be partitioned into n+1 disjoint subsets (n+1 levels), V0,V1, - ' ° ,Vn, with respect to node no, where V,-={ve V IdQ.(uo,v)=i}. The level of a node u, denoted by Am), is equal to dQ'(uo,u). Of course, A (uo)=0. From here on, we assume the source node has address zero, i.e., a (uo)=0. If, however, a (uo)¢0, then we can relabel the nodes by XORing all node addresses with the source address. Thus, if as Vj, then ||a(u)||=dQ.(uo,u)=}t (u)=j ( note that a(uo)=0). Also, D={u1,u2, ° ° - ,uk} will denote the set of destination nodes in a multi- cast. And set M ={uo }+D will be called a multicast set. Furthermore, for simplicity of discussion, we assume that 2.(u1)51(u2)s - - ~ Sunk). Definition: A node u is an ancestor of a node v (or v is a descendant of u) if and only if u is contained in a shortest path joining uo and v. Thus, 7t(u)9. (v) and 1(v) - 3t (u)=dQ.(u,v) . 51 Definition: let X :V-{uo} be a non-empty set. Then, the common ancestors of X, the set CA (X), is defined as follows: CA (X)=[ue V In is an ancestor of every node veX}. Note that CA (X) is not empty, as for any X, no 6 CA (X). The nearest common ances- tor of X, denoted by NCA (X), is defined as NCA (X)={ue V Iue CA (X), and for every ve CA (X), A. (u)23\. (v)}. Lemma 1: Let X cV—{uol and IX I=2. Then NCA (X) contains only one element. That is, the nearest common ancestor of any two nodes is unique. Furthermore, the address of this node can be calculated by bitwise ANDing the binary addresses of the two nodes. Proof: Let X ={u1,u2 1, and "&" denote the bitwise AND operation on two or more binary numbers. For simplicity, denote Mug) by 2.3. 
Furthermore, let u, be the node whose binary address a (u,) is given by a(ux)=a(u1)&a(u2) (4.1) This implies that (A). a (u,) has value 1 at exactly 1, bit positions and value 0 at the remaining n—Xx bit positions, and (B). a (u 1) and a (u 2) have value 1 at the same 2., bit positions as a (14,). Furthermore, at each of the n-k, bit positions, at least one of the two binary addresses a(u1) and a (uz) has value 0. We first prove uxe NCA (X). It is obvious from Eq. (4.1) that uxe CA (X). In order to prove the nearest requirement, suppose there exists another node aye CA (X) with 2,»... With no loss of generality, assume the only difference between a (u,) and a(ux) is that a(u,) has value 1 at one more bit position, say bk, among those n-X, bit positions. That is, l,=k,+1. By condition (B), a (14,) differs from at least one of the two addresses a(u1) and a(u2), say a(u1), at one more bit position than a(u,) does. We then have dQ.(u,,u1)=?tl—X,+l. However, 1(u1)—k(uy)=kl -(7t,+1)=}tl 4,-1. Thus, 52 1(u1)-Mu,)¢dQ_(u,,u1), u, is not an ancestor of ul. Therefore, u, 4 CA (X), 2., is maximum. We conclude uxeNCA (X). Now let us prove the uniqueness, that is, NCA (X )=qu }. Suppose there is another node uzatux, such that uzeNCA (X). Then, we must have 11:2,, which implies I Ia(u,)l I=I Ia (u,)l I. With no loss of generality, suppose a (u,) has a 0 at one of the I», bit positions, say b8, and a l at one of the n-k, bit positions, say bk. Again, from condi- tion (B), a(u,) differs from at least one of the two addresses a(u1) and a(u2), say a(u1). in two more bit positions than a (u,) does, i.e., dQ.(u,,u 1)=11-X,+2. However, 1(u1)-7L(uz)=?~(u 1)-Mux)=11-?»x. Thus. k(u1)-l(uz)¢dg, (142.141). u: is not an ancestor of ul, and uzéNCA (X). Therefore, NCA (X )={ux}. I The argument presented in Lemma 1 can be easily generalized to the case of IX I>2. Thus we have the following theorem. Theorem 4.1: Let X cV—{uo} and IX I=k, (2SkSZ'L1). Then NCA (X) contains only one element. That is, the nearest common ancestor of any k nodes is unique. Furth- ermore, the address of this node can be calculated by bitwise ANDing the binary addresses of all the k nodes. Let us start to investigate the multicast for a small number of destinations, and then we will get into general case. We first consider the situation that a source uo wants to send a message to only two other destinations, ul and “2. We first investigate all possi- ble topological patterns in the solution space of OMT. All the possible patterns of OMTs for IM I=3 is illustrated in Figure 4.8. In the figure, circles denote nodes and lines denote shortest pathes joining pairs of nodes. Observe that if T2.1 is the desired subtree, then node x=NCA (u1,u2). Similarly, if 12.2 is the desired subtree, we have u0=NCA (u1,u2). Finally, when T23 is the solution, u1=NCA (u 1 ,uz). In fact, we may refer to T2.1 as the general pattern of the OMT when IM I=3. This is because when x=uo, we obtain subtree T22, and when x=u1, we obtain subtree T23. 53 o o ‘9 on General Pattern x=u 0 x=u1 T 2.1 T 2.2 T 2.3 Figure 4.8 One-to-two multicast trees in hypercube The above observations lead us to an algorithm for generating an OMT when M={uo,u1,u2}. The algorithm is listed in Figure 4.9 as algorithm One_to_Two_Multicast. Algorithm One_to_Two_Multicast input: 0 (ac). a(u1). a (142); a(X) :=a(u1) &a(u2); form a path between no and x; form a path between 1: and ul; form a path between x and u;. 
Figure 4.9 Multicast algorithm for two destination case In the One_to_Two_Multicast algorithm, to form a path between node u and node v implies that any shortest path between u and v is acceptable. If u=v, then no path has to be established. The validity of the above algorithm can be seen as follows. Denote the multicast tree found by the algorithm by T(V,E). Let UB =I Ia (u1)I I+I Ia (142)] I (which is a upper bound of the traffic). In all three subtrees, IE (T) I=UB—d (uo,x). By Lemma 54 l, x is the unique NCA of u; and u2, which implies d(uo,x)=7t(x) is maximum. Thus, IE (T)I is minimum. We now consider multicast with three destination nodes, i.e., M={uo,u1,u2,u3 }. Again, we investigate all possible subtree patterns, find a general pattern, and then give an optimal algorithm. The nine possible subtree patterns are shown in Figure 4.10. Under the assumption that 7t(u1)Sl(u2)9t(u3), in Figure 4.10, a pattern having nodes labeled u 1, u; and u3 represents a unique tree configuration. Each of the other patterns actually represents three non-isomorphic subtrees obtained by permuting i, j, and k, namely, (i, j,k)e {(1,2,3),(1,3,2),(2,3,1)}. Observing the nine patterns, we can see that T3.1 is a general pattern. When y=x12=x23=x13, T3.1 becomes 13.2, and when xgje [u1,u2}, T3.1 becomes T33. When y=uo and y=ul, the tree patterns in first row correspond to patterns in second and third rows, respectively. Clearly, node y is NCA (u1,u2,u3). Also, node xi,- must satisfy Mxij )=max[}t(NCA (u 1 ,u2)),}t(NCA (u2,u3)),A(NCA (u 1 ,u3))}. Since there is only one general pattern, we can again write a simple algorithm to solve the problem. which is listed in Figure 4.11. Denote the subtree found by the above algorithm by T (V,E), and let UB=| la(u1)| |+| |a(u2)||+| la(u3)| I. It is not difficult to see that the following equation holds for every subtree pattern in Figure 4.10. |E(T) I=UB-2d(uo.y>-d(y,xij)=UB-2| Ia 0’)| HI I0 (Xy)| I-I la 0’)| I) =UB-l la(y)| I-I la(x.-,-)||. Since U8 is fixed, a (y) is unique by Theorem 4.1. The maximum level number of xm means IIa(x,-,~)|I is maximum. Therefore, IE (T)| is minimum. This establishes the validity of the above algorithm. 55 general pattern y=x12=x23=x13 xi,- =u,- T 3.1 T 3.2 T 3.3 y=uo uo=y=x 12=x23=x 13 y=uo; xij=ui T 3.4 T 3.5 T 3.6 y=u1 y=u1=x23=x12=x13 y=u1; x23=u2 T 3.7 T 3.8 T 3.9 Figure 4.10 One-to-three multicast tree patterns 56 Algorithm One_to_Three_Mu lticast input: a (u o), a (u 1), a (u 2). a (“3); 00):=a(u1)&a(u2) & MM): 0 (x12):=a (u 1 )&a (“2). ; a(x23):=a (u2)&a (u3), a(x13):=a (u1)&a (“3); xm := xii, such that ngj ):= max {k(x12),}t(x23),k(x13)]; form a path between x max and u;; form a path between xmax and u}; form a path between xm and y; form a path between y and uk; (* k¢i, j *) form a path between y and uo. Figure 4.11 Multicast algorithm for three destination case Finally, let us have an overview on four or more destination multicast. As we have seen, the possible configurations of the multicast trees grow dramatically when the number of destination nodes increases even from two to three. As the number of desti- nations increases from three to four, the problem becomes more complicated. The most important factor is that, in the case of four destination there is no unique general pattern; instead, there are two general patterns as shown in Figure 4.12. If we want to write an optimal algorithm, we have to deal with the two general patterns separately, then select the one having less intermediate nodes. 
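Before turning to the general case, the address arithmetic behind these small-case constructions can be summarized in a short sketch. It only illustrates Theorem 4.1 and the junction selection step of One_to_Three_Multicast; the helper names are ours, and relative addresses (with the source relabeled to zero) are represented as integers.

# A sketch of the nearest-common-ancestor computation (Theorem 4.1) and of the
# junction selection used by One_to_Three_Multicast.  Addresses are relative to
# the source node, so a(u0) = 0; helper names are illustrative only.

def level(a):
    """lambda(u): the number of 1-bits in the relative address of u."""
    return bin(a).count("1")

def nca(addresses):
    """Nearest common ancestor: the bitwise AND of all the given addresses."""
    result = addresses[0]
    for a in addresses[1:]:
        result &= a
    return result

def one_to_three_junctions(a1, a2, a3):
    """Return (y, x_max) for the general pattern T3.1: y = NCA(u1, u2, u3), and
    x_max = the pairwise nearest common ancestor of greatest level."""
    y = nca([a1, a2, a3])
    x_max = max([nca([a1, a2]), nca([a2, a3]), nca([a1, a3])], key=level)
    return y, x_max

# Example in a 4-cube with destinations 0011, 1100 and 0111 (relative addresses):
# y = 0000 (the source itself) and x_max = 0011, the ancestor shared by 0011 and 0111.
print(one_to_three_junctions(0b0011, 0b1100, 0b0111))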
In general, if we have a k-destination multicast problem, one way of finding an OMT is to generate all the general patterns, consider each of them separately, and then select an optimal one. However, the number of general patterns grows very rapidly as the number of destinations increases. We have counted 3, 6, 11, and 23 general patterns for 5, 6, 7, and 8 destinations, respectively. We can see the trend of how rapidly the number of tree patterns and general patterns grows as the number of destinations increases. Also, the general patterns become more and more complicated. 57 General Pattern 4.1 General Pattern 4.2 Figure 4.12 One-to—four general multicast tree patterns Can we find an efficient algorithm to solve the OMT problem? For a general graph, the answer is no. In fact, it has been shown in [ChEN87] that the problem of finding an OMT for a general It is NP-hard even if G (V, E) is bipartite. A problem which is very similar to the above problem is known as the Steiner Tree 1 (ST) problem. It is the problem of finding the smallest subtree of a given graph, which > contains a given subset of nodes [GaJo79]. It can be observed that if condition (b) in the definition of OMT is removed, OMT problem becomes a ST problem. It has been shown that ST problem is NP-complete when G(V,E) is the hypercube graph [GrFo82]. The OMT model can be applied to any interconnection topology. From here on, we only consider multicast in the hypercube topology. Now the question is “Is the OMT problem still NP hard for the hypercube topology”? Based on the above observations, we conjecture that the OMT problem remains NP-hard even for hypercube topology. CHAPTER 5 A DISTRIBUTED MULTICAST ALGORITHM FOR HYPERCUBE MULTIPROCESSORS As discussed in the last chapter, it is impractical to find an OMT when the number of destinations is large. Also, the optimal algorithms discussed so far are all centralized in the sense that they require the entire routing be handled solely by the source node. Subsequently, this approach requires the information of the enfltirepathsbe carried Lbythe _ message'header, and thus increases the interprocessor communication _ov_erhead._ In this chapter, we propose a distributed heuristic algorithm which has the following properties. First, the distance from the source node to each destination node in the multicast tree generated by the proposed algorithm is the same as that in an OMT. Second, the algo- /____________.._ “-wuu— MM rithm is simple and _M4easily—implemented~1§__h§r_d_v_v_am Third, it allows distributed , routing. Finally, simulation results indicate the traffic generated by the algorithm is very close to the optimal solution and is better than existing multi-destination message delivery mechanisms. 5.1 UNDERLYING RATIONALE OF THE MULTICAST MULTICAST ALGO- RITHM Observe that in a multicast tree only forward nodes are involved in passing the multicast message to some other nodes in the multicast tree. Initially, a node, called the source node, decides to send its message to some number of other nodes, i.e., it issues a 58 59 multicast message. By ruining _t_he#al_gorithm,_ the source node will decide which of its 1,.r..-r~-'-- ‘- —___ uuh-- " 11mg nodes SEWssage. A message received by a node v includes the data unit, the address of the source node, and a list of destination nodes (referred to as destination list) which are descendants of node v in the multicast tree. Detailed description of the message format will be given in Chapter 6. 
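For concreteness, such a message can be modeled as a simple record. This is only an illustrative stand-in, with field names of our choosing; the actual header layout (the k field, address fields, and data field) is specified in Chapter 6.

# An illustrative model of a multicast message: the data unit, the source
# address, and the destination list carried in the header.

from dataclasses import dataclass
from typing import List

@dataclass
class MulticastMessage:
    source: int              # address of the node that issued the multicast
    destinations: List[int]  # destination list: destinations still to be reached
    data: bytes              # the data unit

msg = MulticastMessage(source=0b0000, destinations=[0b0011, 0b0101, 0b1110], data=b"...")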
Each node, upon receiving a multicast message, will perform the following func- tions. First, it will compare its own address against the addresses in the destination list of the received message. If there is a match, that matched address will be removed from the destination list and a copy of the data field will be sent to the local processor. Then, if the destination list is empty, the node is a leaf node in the multicast tree and no mes- sage will be further forwarded. However, if the destination list is not empty (which implies the node is a forward node), the node will execute the algorithm to determine its descending neighbors in the multicast tree. Depending on the number of descending neighbors, say m, the forward node will split its destination list into m disjoint destina- tion sublists, each consisting of a set of destination nodes which are descendants of a particular descending neighbor. Each such destination sublist is put into a message header and sent to its corresponding descending neighbor. Now the question is how each forward node decides which of its neighboring nodes to pass the received message. Consider a two-destination multicast, as shown in T2.1 (Figure 4.8). In a hypercube environment, suppose the relative addresses of two destina- tion nodes ul and u2 have 1’s at p common bit positions (p9.(u1)9(u2) ), which define the intermediate node 1:. For a message to go from node ac to x, it may traverse through these dimensions in any order (by changing the values at these p bit positions from 0 to 1, one at a time, in any order). As discussed before, there are pl different paths between nodes uo and x. From the point of view of this message delivery, any of these paths has the same effect since they all require the same number of time steps and create the same amount of traffic. 60 Now, suppose that node uo wants to include an additional node, u3, in its destina- tion list. In this case, we have the situation of T3.1 (Figure 4.10). Let 1(y)=lla(u1)&a(u2)&a(u3)lI=q, and q< For the case of n=6, dm=3.05. To send a message to a destination at h hops away will create h units of traffic. Thus, the average traffic generated for k destinations using mul- tiple one-to-one communication is 3.05k for uniform distribution. It follows that when the number of destination nodes is greater than 20, even broadcast approach performs better than the multiple one-to-one approach. Note that multiple one-to-one approach is actually used in the first generation hypercube multiprocessors. 72 In order to better evaluate the performance of the Greedy algorithm, we also run simulation programs for an optimal algorithm and a multi-destination routing algorithm presented by Moler and Scott [M08086] with the name of Spare global send. The optimal solution is obtained by exhaustive searching and comparing of all poSSible paths for each given number of destinations (an exponential time complexity algorithm). The idea of the Spare global send can be briefly described as follows. Given a local (for- ward) node and a destination list, the algorithm first searches the destination list and selects one which is closest to the local node, a message is then sent to that node through a shortest path between the two nodes. Those destination nodes which are descendants of the selected node will receive the message through that node. The algorithm is repeated for the remaining nodes in the destination list until all destinations are dealt with. The algorithm is also distributed. 
The results for the three algorithms are generated in a way similar to the curves in Figure 5.3, except that the simulation is for odd number of destinations, and for each given number of destinations the value is averaged over 100 runs instead of 1000 runs (since the optimal algorithm takes a long time). Figure 5.4 shows the three curves, where the x-axis is the number of destinations and the y-axis is the traffic created by the message delivery; the upper dashed curve, solid curve and lower dashed curve represent the result from spare global send, the Greedy multicast algorithm and the optimal solu— tion, respectively. Also shown in Figure 5.4 is the dotted line, which is a lower bound of the multicast. For k destinations, the theoretical lower bound of traffic generated is obvi- ously k, i.e., no non-destination forward nodes are needed. From the curves, we can see that the performance of the greedy algorithm is better than the spare global send, and is very close to the optimal solution, especially when k is very small or very large. Note that it is proven in Section 5.2 that when k<3, the perfor- mance of the greedy algorithm is the same as that of the optimal algorithm. This can also be observed from the curves in Figures 5.4 and 5.5. 73 64 56 - 1. dashed curve: Spare global send 48 —1 2. solid curve: Greedy / 3. dashed curve: Optimal 4o .. Traffic 32 - 24 .— 16 .1 4. dotted line: lower bound 8 —I O I j I I I I I 0 8 16 24 32 40 48 56 Number of Destinations Figure 5.4 Performance comparison of three multicast algorithms ( under uniform distribution ) 74 Traffic Difference Worst case difference Mean difference fiy" h’“~— v ‘ Standard deviation I I I I I I 16 24 32 40 48 56 Number of Destinations Figure 5.5 Performance comparison of Greedy algorithm with optimal solution ( under uniform distribution ) 75 In order to more closely compare the performance of the Greedy algorithm with that of the optimal algorithm, we calculate the worst case difference between the traffics created by the Greedy algorithm and the optimal algorithm, the mean of the difference (11), and the standard deviation of the difference (a) during the 100 runs of the simula- tion programs for each given number of destinations. The two parameters It and o are calculated by the following formula: p.=— 100.- 12(greedy1- -optimal1- ) 2 100.- —21 [(greedy1--optimal1- )-u] where greedy; and optimal; are the traffics generated by the two algorithms, respec- tively, for i-th run of the algorithms. Figure 5.5 shows the results, where the dotted, solid, and dashed line pieces represent the worst case differences, the mean, and the standard deviation of the differ- ences, respectively, for the 100 runs. 5.4.3 Decreasing probability routing A good scheduling algorithm in a hypercube multiprocessor should partition and map the data set to processors in a way that interprocessor communication is minimized. In other words, if the evaluation of one data partition requires information from another data partition, these two partitions should be allocated to adjacent processors [NiKP87]. In this case, those destination nodes are usually close to each other. Decreasing probabil- ity routing is a good assumption for this situation [ReFu87]. Under this assumption, the probability that a node sending a message to a destination of distance d decreases as the value d increases. Let p(l1-) be the probability that a node of distance i happens to be a destination node in a multicast. 
Then, we have p(l1-+1)=kp (11), where km, the routing will be implemented by broadcast algorithm. From Figure 5.3, we can see that when the number of destinations is large, the traffic generated by broadcast is close to that generated by multicast. Thus, the 83 choice of m is a tradeoff between the space (hardware cost) and the time (system traffic). In order to simplify the illustration, we show the diagrams of a hardware router for a 3-cube. In the following discussions, n denotes the dimension of the hypercube (n =3, in our example); m denotes the maximum number of destinations in a multicast the mul- ticaster can handle (m =4, in our example); and k denotes the actual number of destina- tions in a particular multicast. Each router in a 3-cube has three input channels from its three neighbors, three out- put channels to the three neighbors, and one input connection and one output connection to the local nodal processor, as depicted in Figure 2.2. Channel i is the directly con- nected link between the local node and the neighboring node at dimension i, that is, their binary addresses differ at bit position i only. Each input channel is associated with a message handling unit (MHU), through an input gateway (11- for dimension i, OSiSn). After processing is done in the router, the incoming message is sent out through one or more of the four output gateways (O,- for dimension j, OSan) to the corresponding neighbor(s) or/and the local nodal processor. E ...... ____.,. [0 MHUo ___, From To Neighbors TI 1‘ MHUI Neighbors E ...... 12 MHU2 _’ FromLocal To Local P ssor —a 13 MHU3 P ssor Figure 6.1. The block diagram of a router in a Q3 84 In this study, what we are concerned with is the structure and activities in a router attached to the nodal processor. In the following discussions, from the point of view of a router in current node, the term sending node is used to represent the router in a neigh- boring (parent) node which sends a message to the current node or the nodal processor at the local node which originates a message. The term source node represents the node which originates a message. Thus, a sending node is not necessarily a source node. Similarly, a receiving node is the immediate receiver of a message — the router in a neighboring child node, which is not necessarily a destination node. Each input gateway provides necessary interface between the channel and the MHU. Each output gateway acts as a multiplexer. An output gateway may receive mes- sages from any of the four MHUs and send them out to the the receiving nodes specified by the MHUs. We will concentrate our discussion on the MHU design since it is the major component in a router. 6.4. MESSAGE HANDING UNIT (MHU) Figure 6.2 shows the block diagram of a Message Handling Unit (MHU) which consists of the following modules: (1). A Controller Module, which interacts with neighboring and local processors, con- trols the current state of the MHU, and synchronizes the activities of other modules in the MHU. (2) A Processing Module, which is the essential component —- actually executing all communication algorithms. (3) A Data Bufi’er Module, which provides temporary buffering space for the data field of a message. (4) An Input Module, which acts as a distributor. It sends k field, address fields, and source field to the processing module and sends the source field and data field to the data buffer module. 
85 processing module b---d input _ _ _ _ - output module I- - - - q controller module module IIII r I I I l K/t£ta buffer module Figure 6.2. The block diagram of an MHU (5) An Output Module, which generates proper output to the output gateways. The output module receives new message header(s) prepared by the processing module and the data field from the data buffer, and reassembles them to form new message(s). The message(s) are then sent to the output gateways according to the relative address(es) of receiving node(s), which are also provided by the processing module. In the following sub-sections, we will briefly discuss the design of the controller module and the data buffer module with emphasis on the processing module. 6.4.1 The Controller Module The function of the controller module is to interact with all sending and receiving nodes and coordinate all modules in the MHU by issuing proper control signals. Before a sending node sends out a message, it first sends a request-to-send signal to its intended receiving node. When the controller at the receiving node receives the request, it issues 86 an acknowledgement signal. Upon receiving the acknowledgement, the sender can then send out a message. The MHU works in the following four states: WAIT: it is in idle status and is waiting for the next request-to-send signal from the sender. INPUT: The k field, address fields, and source field of an input message from the sender are received. The k field of the message header is checked. In case of k<2"-1, the k address fields are checked to see if the local address is in the destination list. PROCESSING: Two things are to be done at this state: (1). Address processing is done at this time. The processing module entails finding proper output port(s) to forward the message. (2). At the same time, the input module sends the upcoming data field to data buffer module. OUTPUT:During this stage, the new message header(s) is formed and sent out through selected output gateway(s) followed immediately by the data field pumped out from the data buffer module. 6.4.2 The Data Buffer Module Once a sending node starts sending a message, it will not stop till the entire mes- sage has been sent out. When the processing module is making routing decision for the message after the message header is received, the data field of the message will keep coming in. Thus, a temporary buffering space is needed for the data field received dur- ing the address processing time. However, it may happen that two or more MHU’s in a router want to send messages to the same neighbor at the same time, or a MHU can not receive an 1 acknowledgement from a receiving node. In this case, the entire message may have to be buffered temporarily until the link is available. Thus, for flow control purpose, the capacity of the data buffer should be equal to the size of a maximum length 87 message, even though only a small part of the buffer is used in usual case. 6.5 THE PROCESSING MODULE matching multicaster checker l from . I to u 111p :1: - store output > input p0 port output module I module type broadcaster ——I checker Figure 6.3 The Processing Module The processing module is the main part in the MHU. As shown in Figure 6.3, it consists of following components: an Input Port, a Store, a Broadcaster, a Multicaster, a Type Checker, a Matching Checker, and an Output Port. All the connections between the processing module and the control module for signals are omitted in the figure. 
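Summarizing the controller behavior of Section 6.4.1 before examining the processing module further, the four states can be written as a simple transition table. This is a behavioral illustration only, not the hardware design, and the identifiers are ours.

# A behavioral sketch of the four MHU controller states of Section 6.4.1.  It
# records only the order of the phases; in the actual MHU the incoming data
# field is buffered in parallel with the PROCESSING state.

from enum import Enum, auto

class MHUState(Enum):
    WAIT = auto()        # idle; waiting for a request-to-send from a sending node
    INPUT = auto()       # receive the k field, address fields and source field; check for a local match
    PROCESSING = auto()  # the processing module selects the output port(s)
    OUTPUT = auto()      # new header(s) and the buffered data go out through the output gateway(s)

NEXT_STATE = {
    MHUState.WAIT:       MHUState.INPUT,       # request-to-send received and acknowledged
    MHUState.INPUT:      MHUState.PROCESSING,  # message header received
    MHUState.PROCESSING: MHUState.OUTPUT,      # routing decision made
    MHUState.OUTPUT:     MHUState.WAIT,        # message(s) forwarded; back to idle
}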
The task of the processing module is to determine routing dimensions (the relative addresses of neighboring nodes) based on the algorithms MULTICAST.1 (described in Section 5.2) and BROADCAST.2 (described in Section 4.3). The relative addresses are directly sent to the output module and are then used to determine which output gateway to send the message. The multicaster unit will be discussed in Section 6.6, while all the other units are briefly described in this section. 88 6.5.1 Input Port b2 b1 b0 A[L‘] 012 011 01o A[2,*] 022 021 020 A13.*] 032 031 030 A[4, *] 042 041 040 Store as2 as 1 “so I .»——>to matching checker XOR a(uo): local address RCgiStCI' a("l)r C(UZ), . . ° 9 G(Uk), 00‘!) Figure 6.4. The Input port and the Store The basic function of the Input Port is to convert the actual addresses of the source and all destinations in the message header into relative addresses with respect to the local address and then save them in the Store. As shown in Figure 6.4, the input destina- tion addresses a (u1-) (lSiSk) and source address a(u_1) are first XORed with the local address a (uo) to form the relative addresses: ro(u1-)=-=a1-(,1-1)a1-(11-2)..a1j ..a1-o=a (u1-)$a (uo), for lsiSk; Note that the relative source address is also calculated for the purpose of broadcast com- munication. 6.5.2 The Store The Store in the processing unit has a (m +1)xn binary array ( 5x3 in our example as shown in Figure 6.4), which provides temporary storage space for the relative addresses of the source and up to m destination nodes. Initially, all entries in the array 89 are set to zeros. A subset of the Store, A [l..m,0..n—1], which initially holds the m rela- tive destination addresses, is referred to as reference array. After each cycle (to be defined laster) of the operation in the MHU, a particular forwarding dimension, and thus a receiving node, is selected. The addresses of a subset of destinations are put into a destination sublist in a forwarding message and sent out along that dimension. The rows corresponding to those destination addresses are then reset to all zeros. 6.5.3 The Type Checker The first field in an incoming message is the k field, which indicates the communi- cation type. The Type Checker checks the k field in the message header and sends a type signal to the Controller. If k>m, then type=0 and the Broadcaster will be invoked. Otherwise, type=1 and the Multicaster will be activated by the Controller. The type check also informs Matching Checker the number of address fields, k, to be checked if k<2”-1. 6.5.4 The Matching Checker When the relative addresses are sent from Input Port to the Store, they are also sent to the Matching Checker. The function of the Matching Checker is to examine each destination address. Whenever a zero relative address is found, which indicates that the local node is one of the destination nodes, the Matching Checker sends a signal to the controller, so that a copy of the source field and the data field in the incoming message will be sent to the local processor. 6.5.5 The Output Port The Output Port basically performs the following two functions: (1) It passes the relative addresses of the receiving node(s) to the output module. Notice that there is exactly one bit position with value 1 in each of the relative addresses. The addresses actually indicate the routing dimensions, or channel id numbers. Thus, the output module can easily determine the forwarding dimension 90 accordingly. 
Furthermore, the k field of the message header provided by the Multicaster or Broadcaster will also be directly passed through the Output Port immediately follow- ing each dimension information. (2) Similar to the Input Port, the Output Port performs XOR operation on the rela- tive addresses in the destination sublist with the local address to form the actual addresses. In broadcast case, the Output Port sends out a relative neighboring address followed by a k field with value zero. In multicast case, the relative address of a neigh- boring node is followed by a non-zero k field and k actual destination addresses. Note that the relative address of neighboring node is eventually consumed at output modules, while the k filed and the k following destination addresses are assembled into a new message header with the data field from the data buffer to form a new message. 6.5.6 The Broadcaster a..__{>c 4.. N ; g asl—‘_‘l >0 J 1 as0—_'>C —:> > 80 Figure 6.5 The Broadcaster for a fault-free Q3 The Broadcaster is invoked for a broadcast communication (k=2"-1) or a multicast communication with a large number of destinations (m l The above logic ensures that there is one and only one of the m+1 output lines which has value 1. Furthermore, s1=l means that the maximum number of 1’s in the columns is exactly I. 95 A special case is that so=1, which means that all columns have zero number of 1’s. That is, the current reference array is all zeroes. In other words, all destination addresses have been resolved and the routing of the current message has been done. Thus, the output line of so is used as a termination signal sent to the Controller. 6.6.3 The Column Selector After finding the value of maximum number of 1’s in the columns, the next step is to find which column has that maximum number. As pointed out in the discussion of greedy algorithm, more than one column may have the same maximum number of 1’s. The greedy algorithm requires only one column be selected. To simplify our design, we select the column with the smallest column index among those columns. The Column Selecter (CLMS) is designed for this purpose. As shown in Figure 6.9, it has (m +1)x2xn inputs and n outputs. The function of the CLMS can be expressed in the following equations: 0 m . yj = U (6,131) for OSan-I (=0 y 1 = 1 means column j has the maximum number of 1’s. To make the selected column unique, we let r t y} ifj=0 ”=4 if lSan-l . t yl n U yr ‘ allp2"’1 , condition 4 will not be satisfied since there must exist a node as V(Q11)-F (Q11) such that u is adjacent with at least two nodes in F(Q,1). Thus, Max( IF (Q11) |)s2"’1, and the equality occurs when F (Q11) induces an (n -l)-cube. However, if in addition to condition 4, it is required that no two faulty nodes be adjacent in Q1, then the problem of finding the upper bound of Max( IF (Q11) I) is reduced to the following classical question in the theory of error-correcting code: ‘ ‘What is the largest group of n-bit binary vectors (codewords) such that the Hamming distance of any pair of codewords in the group is at least d?”. If the size of the largest such group is denoted by IC I, then, IC Is Ig]+['11]in..+ If] 1 (7.1) where t= [(d /2 — 1)] . The above bound is known as the Hamming bound [Hamm50]. Clearly, with the above modification of condition 4, the Hamming bound gives an upper bound for Max( IF (Q11) I) with d=3. This implies 2!! Max( IF (Qn) 05 "+1 . (7.2) 105 For n=3,4,5,6,7.8.9.and10, the floors of the right hand side of Eq. 
(7.2) are calcu- lated to be 2.3,5,9,16,28,51, and 93, respectively. The above modification of condition 4 is material by the fact that when selecting a pair of nodes in random from V(Q11). the probability that the two nodes are adjacent is very small. For this reason, we consider the Hamming bound as a representative of the upper bound for Max( IF (Q11) I). In order to have a quantitative idea about the fault model, we have run a simulation program to estimate the probability that condition 4 above is valid under different number of fault nodes. Figure 7.2 shows the results for 6—dimensional to 10-dimensional hypercubes, where the x-axis is the number of faulty nodes in the system and y-axis is the probability that the fault model is valid for the number of faulty nodes. In the rest of this chapter, we first review some existing algorithms for unicast and broadcast and then provide the fault-tolerant revisions of the algorithms which can be easily implemented in hardware. Also, modification to the multicast algorithm presented in Chapter 5 will be presented to allow certain fault tolerance. The hardware implementation problem is always the major concern of this study. In Section 7.4, the discussion is focused on how the hardware router design presented in Chapter 6 can be modified so that it can handle all three types of interprocessor communication in faulty hypercubes. If the above assumption is violated, that is, a fault-free node has more than one faulty neighbor. then the routing problem becomes more complicated. Section 7.5 briefly discusses this situation to see how we can deal with it. 106 ‘§¢\‘ \\ “'8‘ \ it \ . - ‘1 1‘, 1 Dimensmn: 0.8 a ‘ \‘ 1“- '5'. A n=10 .0 \ - \‘ x [:1 n=9 I X. \\ \‘ '.. \D x "=8 \ '. r - Probability 0'6 7 ‘. ‘~ It ".7 fault I _ "=6 model “ x is valid “ XE m‘ 0.4 _ * °' ‘ \ .'. \\ \ ‘ ‘\ ’.° \\ \ .'. a \\ )1... \\ 0.2 _I ‘1 '0, \s \ e. \ \ ..°. 8‘ c it ‘ ~ .1 511 \ S \ ' ‘ ‘ 0 ~ ~ ‘ - it ...... x. s a 1 ~ I I T ‘ ‘ - ......... ‘ Number of faulty nodes Figure 7.2 Simulation results of the faulty model 107 7.3 FAULT-TOLERANT ROUTING ALGORITHMS Based on the model of faulty hypercube stated in Section 7.2. in this section, we first investigate how one-to-one message routing can be achieved. Then, we discuss how broadcast and multicast communications, which are more complicated in faulty hypercube, are handled. 7.3.1 Unicast in Faulty Hypercube We start our discussion with unicast communication in a faulty hypercube under the above assumptions. Algorithm UNICAST.F1 (shown in Fig 7.3), a simple variant of algorithm UNICAST.1 of Section 4.3, will find a shortest path between any two fault-free nodes in a Q11. The basic idea is that. similar to the case of fault-free hypercube, each node. upon receiving a message, first tries to send a message to the dimension corresponding to the right most bit position at which the addresses of current node and the destination node differ. If that dimension happens to be faulty, the node then sends the message to the second right most one. The algorithm is to be executed at every for- ward node along the path. Theorem 7.1: Given two fault-free nodes u, and ad with distance d(u,.ud)=d in a Q11, algorithm UNICAST.F1 always finds a shortest path of length d from u, to u, pro- vided that every fault-free node in the Q11 has at most one faulty neighboring node. Proof: Since the distance between u, and ud is d, their addresses a(u1.) and a(u11) differ at exactly d bit positions. 
and d disjoint paths between the two nodes can be con- structed. As discussed in Section 4.3, if the system is fault-free, starting from u,, by traversing through each of the d dimensions exactly once, the message will reach ad in d steps. The order of which dimension is traversed first is immaterial. In a faulty hyper- cube, at any forward node uj, suppose the distance d(u,-,u11)=g. If g22, the algorithm tries to find a fault-free dimension among the g dimensions at which a(u,-) and a(ud) differ, to pass the message. Since each fault-free node has at most one faulty neighbor, 108 Algorithm UNICAST.F1: (* h=1; |F1(u)ISl for all us V—F *) begin (* find the relative destination address with respect to the local address *) route) 2=0(uo)$a(ud); If (r 0(u11)=0) then send message to local processor else i :=0; repeat if dimension W1-(ro(u11)) is fault-free then send message to dimension W1-(ro(ud)) else i :=i-I-1; until message is sent; end. Figure 7.3 A unicast algorithm for faulty hypercubes such a fault-free dimension always exists. If g=1. then ud is a neighbor of uj, and ad is supposed to be fault-free, uj will pass the message to ud through the direct link between them. Therefore, a shortest path from u111 to ad can always be found. I Figures 7.4 shows examples of unicast in a faulty 2-cube and a faulty S-cube. where double circled nodes represent faulty nodes. The dotted lines form the path in fault-free hypercube found by algorithm UNICAST.1. while the solid lines form the path in faulty hypercube found by UNICAST.F1. In Figure 7.4 (b), 0000 is the source node, and 1111 is the destination node. The source first sends a message to node 0001. At node 0001, the algorithm first tries to pass message to node 0011 at dimension 1 (W1(0001$llll)=l). which, however. is found to be faulty. Thus, dimension 2 (W2(0001 $1111)=2) is selected, and the message is then forwarded to node 0101. At 0101, the algorithm tries dimension 1 again ( W1(0101$llll)=1 ), which, unfor- tunately, is found to be faulty again. Having had to select dimension 3 109 \ ‘\ ‘. I. l” I (b) In a 4-cube Double circled nodes represent faulty nodes Dotted lines: path in fault-free hypercube Solid lines: path in faulty hypercube Figure 7.4 An example of unicast in faulty hypercubes 110 (W2(0101$1111)=3), the message reaches node 1101. Finally, 1101 passes the mes- sage to the destination 1111 through dimension 1. 7.3.2 Broadcast in Faulty Hypercubes In this subsection, we consider broadcast in faulty hypercubes. Let us compare the two broadcast algorithms: BROADCAST.2 and BROADCAST.3, presented in Section 4.3. Algorithm BROADCAST2 decides routing based on the local and source addresses only. It does not seem modifiable to work in a faulty hypercube. Algorithm BROADCAST.3, however, calculates and sends different Control vectors to different dimensions at each forward node. Based on this approach, we develop a fault-tolerant algorithm BROADCASTFS, which can correctly broadcast a message to all fault-free nodes in a fault hypercube. The algorithm is listed in Figure 7.5, where u, refers to source node of the broadcast. uo refers to the local node which is currently executing the algorithm, Control* is the control vector received by uo. and Contron is the control vector sent out from no to dimension j. Note that the major difference between Algorithm BROADCAST.F3 and BROADCASTB is in line 8. We prove the correctness of the algo- rithm in Lemma 7.1 and Lemma 7.2. 
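Before the correctness arguments, the routing rule of UNICAST.F1, which BROADCAST.F3 will be shown to follow, can be sketched as follows. The representation (node addresses as integers, faulty_dims for the fault flags f_b) and the helper names are ours; it is an illustration of the listing in Figure 7.3, not a transcription of the router hardware.

# A sketch of the routing rule of UNICAST.F1 (Figure 7.3).  faulty_dims is the
# set of dimensions whose neighboring node is faulty at the local node.

def set_bits(x):
    """Bit positions with value 1, lowest first (the W1, W2, ... of the text)."""
    return [i for i in range(x.bit_length()) if (x >> i) & 1]

def unicast_f1_next_dim(local, dest, faulty_dims):
    """Return the dimension on which to forward, or None if the message has arrived."""
    r = local ^ dest                     # relative destination address
    if r == 0:
        return None                      # deliver to the local processor
    for dim in set_bits(r):              # try the lowest differing dimension first
        if dim not in faulty_dims:
            return dim
    # Unreachable under the fault model of Section 7.2 (at most one faulty neighbor).
    raise RuntimeError("no fault-free shortest-path dimension available")

# Example from Figure 7.4(b): at node 0001 with destination 1111 and the
# dimension-1 neighbor (0011) faulty, dimension 2 is selected.
print(unicast_f1_next_dim(0b0001, 0b1111, faulty_dims={1}))   # prints 2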
Lemma 7.1: Given nodes u1,ude V(Q,1)-F (Q11), algorithm BROADCAST.F3 finds a shortest path from 1411 to ad in exactly the same way as UNICAST.F1 does, which has been proven to be a shortest path between us and ud. Proof: Let the path from u, to ad constructed by Algorithm UNICAST.F1 be p=(u,,u1. - - - ,u1-, ° - - .u11_1,u11). We note the following two facts about algorithm BROADCAST.F3. (1) Message is sent only to the dimension whose corresponding bit position has value 1 in Control* and is fault-free; (2) For any bit position I, if Control*[l ]=0. then Control 1 [1 ]=0 for any outgoing link j. lll Algorithm BROADCAST.F3: (* h=1; |F1(u) I51 for all ue V-F *) 1. begin (* if uo=u,. the router get a Control* vector with all bits set to 1 *) ("' otherwise, no receives a Control* vector from a parent node *) ("' fb=1, if dimension b is faulty, otherwise fb=0 031,91 -1 at) 2. for j:=0 to n-l do 3. begin 4. send message to local processor; 5. if Control* [1' ]=1 and fj=0 then 6. begin (* form Control ,- vector *) 7. for b:=0 to n -1 do 8. if Control* [b]=1 and (b > j or fb=1 ) then C ontrol ,- [b ]:=1 9. else Control ,- [b]:=0; 10. send message with Control j to dimension j 1 1. end; 12. end; 13. end. Figure 7 .5 A broadcast algorithm for faulty hypercubes We prove by induction on i that algorithm BROADCAST. F3 reaches u1- along path (u,,u1. - ° - u1--1) and that for each bit position x at which a(u1-) and a(ud) differ, Control* [1:] at node u1- has value 1. For i =1, the basis step of the induction, obviously u1 receives messages from u... At u,, all bit positions of Control“ are set to value I initially. Now assuming the hypothesis is true for i =t. We show that it is true for i =t+l. At node u1, let l=W1(a(u1)$a(u11)), if dimension 1 is fault-free. algorithm UNICAST.F1 routes the message from u, to u1+1 through dimension I. That means u1+1 is the l-th dimensional neighbor of u1. Algorithm BROADCAST.Fa will also route the message through dimension 1. since Control*[l]=1 by the hypothesis. Also by the hypothesis, Control“ has value 1 at all bit positions at which a(u1) and a(ud) differ. By Algorithm 112 BROADCAST.FS, Control, will also have value 1 at all those bit positions, except bit position I, which is reset to 0. Since a (u1) and a(u1+1) differ at bit position I only, Control, is thus also properly set. If, however, dimension 1 is faulty, which implies that d(u1,ud)22 (otherwise, ud itself is faulty, it should not receive any message). then algorithm UNICAST.FI routes the message from u, to u1+1 through dimension m=W2(a(u1)$a(ud)), which always exists and is not faulty since each fault-free node has at most one faulty neighbor. In that case. u1+1 is the m-th dimensional neighbor of u1. Algorithm BROADCAST.FS will also route the message through dimension m, since Control* [m ]=1 by the hypothesis. Control,” sending from u, to 14111 will have value 1 at bit position I and value 0 at bit position m. Other bit positions at which a(u1-+1) and a(u11) differ, Control.,1 will have value 1. The positions in vector Controlm, at which a(u1-+1) and a(u11) agree, will be set to appropriate values depending on if u.) is a leaf node or a forward node in the broadcast tree. By the induction step, it is proved that a message from u, to ud follows the same path as constructed by algorithm UNICAST.F1. which is shown to be a shortest path from u, to ud. I Lemma 7.2: Given nodes u,,ude V(Qn)—F (Q11), exactly one shortest path from u11 to ud will be constructed by executing algorithm BROADCAST .F3. 
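The per-node step of BROADCAST.F3 can likewise be sketched directly from the listing in Figure 7.5. Here control_star and faulty stand for the Control* vector and the fault flags f_b, indexed by dimension; the delivery of a copy to the local processor (line 4 of the listing) is omitted, and the function name is ours.

# A sketch of one node's step of BROADCAST.F3 (Figure 7.5).  For every fault-free
# dimension j with Control*[j] = 1, the Control_j vector sent along dimension j has
# Control_j[b] = 1 exactly when Control*[b] = 1 and (b > j or f_b = 1), as in line 8.

def broadcast_f3_step(control_star, faulty):
    n = len(control_star)
    out = {}
    for j in range(n):
        if control_star[j] == 1 and not faulty[j]:
            out[j] = [1 if control_star[b] == 1 and (b > j or faulty[b]) else 0
                      for b in range(n)]
    return out

# At the source of a fault-free 3-cube, Control* = [1, 1, 1]; the step reproduces
# BROADCAST.2: dimension 0 receives [0, 1, 1], dimension 1 receives [0, 0, 1],
# and dimension 2 receives [0, 0, 0].
print(broadcast_f3_step([1, 1, 1], [False, False, False]))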
Proof: We prove it by contradiction. Suppose by executing Algorithm BROADCAST.FS. another path p’ from u, to ud, besides path p, is formed. such that p’iep, where p is the path discussed in the proof of Lemma 7.1. and p’=(u1,,u’1, . ° - ,u’1-, ° - - ,u’11_1,ud). Notice that, by the way Algorithm BROADCAST.FS works, along any path from u11 to a destination, the message always traverses from a parent node to a child node. It never goes along the reverse direction. Thus, p’ and p should have the same length d. Let i be the smallest integer such that u1¢u’1. Let l=W1(a (us—1)$a(u.-)) and l’=W1(a(u1-_1)ea(u’1-)). Thus. a(u1--1) differ form a(u11) at both bit positions I and l’, and both dimensions are fault-free. Assuming l>l’, then. by 113 Algorithm BROADCAST.Fa, Controly, received by u’1-. will be set properly as detailed in the proof of Lemmas 7.1, in particular, Control([l’]=0 and Control; [1 ]=1. However, by the execution of lines 7 to 9 in Algorithm BROADCAST.F3. Control1, received by u1-, will have value zero at both bit positions I and 1’. Since u1-_1 and ad differ at both bit posi- tions 1 and l’. and u1- differs from u1--1 at bit position I only. u1- and ad must also differ at bit position 1’. However, Control1[l ’]=0, and, thus, the message will never go to ud. Therefore, the existence of path p’ is impossible. If we assume l " ch (111' c- ’ a . Wallace 1 11 2’ Decoder I Tree 1%}. CjZ a3,- I 1.— F013 a4,- I t»— > cj4 (* fj=1, if dimension j is faulty *) Figure 7.9 Decoder (DECR) for the j-th column for faulty Q3 We first discuss the modification to the Multicaster. In the fault-free case, the column-sums for all n dimensions are directly sent to the Maximum Checker and then to Column Selector to select a dimension for message forwarding in each operation cycle. In a faulty hypercube, if dimension j is found faulty (0:1), it should never be selected. It can be realized by having the outputs of the Decoder ANDed with the complement of the fault indication flip-flop (f7) before being sent for further processing. Thus, if dimen- sion j is faulty. the c}.- lines (for OSiSm) will all be zerosT. This ensures that column j will never be selected. Figure 7.9 illustrates the fault-tolerant version of the Decoder. 121 As discussed previously, the Broadcaster unit design, an implementation of BROADCAST.2 as shown in Figure 6.5, is fairly simple for fault-free hypercubes. The dimension(s) which should receive messages can be easily determined based on the rela- tive address of the local node with respect to the source node. For a faulty hypercube, we design a broadcaster to be an implementation of BROADCAST.FS. A binary vector Con- trol has to be calculated for each selected dimension at a forward node. Algorithm BROADCAST. F3 does not look very straightforward. However, the actual hardware implementation, by using a few gates, looks simpler than the appearance of the algo- rithm. The diagram is shown in Figure 7.10. Accordingly. a minor modification has to be made to the message format proposed in Section 6.2. As mentioned before, when k=2"—1, there will be no destination fields, only a k field, a source address, and a data field will be contained in the message. To make the message format good for both fault-free and faulty cases, the source address will no longer be included in the message header. Instead, an n—bit Control field, immediately following the the k address fields, will be sent, replacing the position of source address. 
The general message format presented in Section 6.2 can be modified to the following form. n-bit n-bit n-bit n-bit n-bit k DI Dz ° ° ' Dk COMO! DATA ‘I'Theaetically, cjo’ should be set to l to indicate the column-sum being zero. However, since we assume each fault-free node has at most one faulty neighbor. some column-sum must be non-zero, if the multicmt has not finished. Therefore, this configuration will not affect the function of the Multicaster. 122 = C 22 - w , to. : dimensron 2 c21 : Control 2 1 I I —' €20 ' J ........ I 2:» V KI: §~ 82" I 1 n '5 L--- : €02 ' 'I . to . 1 d1mens10n 0 : 601 I Control 0 : GND - . : Cm - -' ________ J Input Control“ =c§c1c3 Ifdimension i is faulty, then f1-=1 else f1-=0, 0SiSn-l If send1-=l, "Control1-” will be sent to dimension i with the message. Figure 7.10 Broadcaster for faulty hypercube 123 For multicast message, the Control will be set to all zeros. In the case of broadcast, the message format will look like: k field (n-bit) n-bit 2"-1 Control DATA In the case of m