GROUP COMMUNICATION UNDER LINK-STATE ROUTING

By

Yih Huang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

March 18, 1998

ABSTRACT

GROUP COMMUNICATION UNDER LINK-STATE ROUTING

By Yih Huang

Multiparty communication, also termed group communication, is a generalization of traditional point-to-point communication in which more than two parties can participate in a “conversation.” Many current and emerging communication applications, such as teleconferencing, computer-supported cooperative work, and distributed interactive simulation, typically involve several, or a large number of, participants, and require efficient network support for multiparty communication. Link-state routing (LSR) is a type of network routing method that makes complete network status information available throughout the network. Because LSR has been adopted by both the Internet, the de facto standard for data communications, and asynchronous transfer mode (ATM), an international standard for telecommunications, its importance in communication cannot be overstated. In this research, we investigate and exploit the relationship between LSR and group communication. Specifically, we develop a collection of novel and efficient protocols that (1) use group communication methods to improve the performance of LSR operation, (2) take advantage of LSR to provide new network services to group communication applications, and (3) benefit both LSR and group-based applications. Our contributions can be summarized in the following four areas.
First, we identify an important aspect of LSR operation that can benefit from group communication methods: the broadcast of network status information, also known as the flooding operation. We propose a novel flooding approach for use in ATM networks, termed switch-aided flooding (SAF), that takes advantage of underlying ATM hardware functionality. The SAF method is shown, through both theoretical analysis and simulation study, to be much more efficient than previous methods. Second, we address a requirement raised by the diversity of multiparty communication applications: the need to support different types of multipoint connections (MCs), the network entities that define the routing of traffic streams among the participants in multiparty conversations. We develop a generic MC (GMC) protocol that is able to accommodate multiple topology types and computation algorithms as plug-in components. We show that such a “chassis” for MC protocols can operate efficiently under LSR. Third, we investigate an issue involved in both LSR and group communication: the leader election problem. We define the problem of “network-level” leader election, where the participants of an election are network switching elements rather than hosts, and we develop an LSR-based solution to the problem, called the Network-level Leader Election (NLE) protocol. The NLE protocol is formally proven to be robust; it handles not only leader failures, but also much more disastrous situations, such as network partitioning. We apply the NLE protocol to the problem of managing traffic transit centers, or core nodes, for multicast groups.
Our proposed solution, called the LSR-based Core Management (LCM) protocol, automatically selects the core node for a multicast group when the group is created, supports core migration to improve multicast performance during the lifetime of the group, handles the failures of both multicast cores and the core management server itself, and survives network partitioning scenarios. Lastly, we turn again to the operation and performance of LSR itself. Traditionally, LSR uses two costly techniques to achieve its robustness and responsiveness: message forwarding on every communication link in the flooding of network status updates, and the periodic flooding of local status by each router. We conclude this research by combining two techniques developed earlier, namely the election of a leader and the construction of multipoint connections, to develop a fundamentally different approach to LSR. The resulting Tree-based LSR (T-LSR) protocol imposes only a small fraction of the overhead of previous LSR methods, while guaranteeing to maintain consistent routing decisions throughout the network under any combination of network component failures, partitioning scenarios, and undetected communication transmission errors. Unlike the ATM-oriented SAF protocols, the T-LSR protocol is designed for use in general-purpose, LSR-based networking environments and requires no special hardware support. In summary, this research reveals a mutually beneficial relationship between group communication and LSR: many aspects of group communication (such as the construction of communication channels, the management of membership, and the consensus on leadership) can take advantage of the internal operation of LSR, while the performance of LSR itself can be improved by incorporating various group communication mechanisms.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
1 Introduction
2 Background
2.1 Multiparty Communication Applications
2.1.1 Human-to-Human Interaction
2.1.2 Distributed Interactive Simulation
2.1.3 Distributed Information Management
2.1.4 Information Distribution
2.2 Multicast Communication
2.2.1 Multicast Routing Topologies
2.2.2 Local Membership Management
2.2.3 Multicast in the Internet
2.2.4 Multicast in ATM Networks
2.2.5 Discussion
2.3 Overview of Link State Routing
2.3.1 Basic Operation
2.3.2 Fault Tolerance Issues
2.3.3 Hierarchical LSR
2.4 Discussion
3 Switch-Aided Flooding
3.1 Motivation
3.2 The Spanning MC Protocol
3.3 The SAF Protocols
3.3.1 Basic SAF Protocol
3.3.2 Bandwidth-Efficient SAF Protocol
3.4 Performance Evaluation
3.5 Summary
4 Optimal SAF Operations
4.1 Motivation
4.2 ER SAF Protocol Design
4.2.1 Basic Concept
4.2.2 Operation Modes
4.3 Algorithms
4.4 The Virtual Ring
4.5 Performance Evaluation
4.6 Summary
5 A Generic Method of MC Construction
5.1 Motivation
5.2 LSR-Based Multipoint Connections
5.3 The GMC Protocol
5.3.1 Design Issues
5.3.2 Protocol Overview
5.3.3 GMC LSA Format
5.3.4 Data Structures and Protocol States
5.3.5 Protocol Algorithms
5.3.6 MC Creation and Destruction
5.4 Proof of Correctness
5.4.1 Correctness without Memory Overflows
5.4.2 The Handling of Memory Overflows
5.5 Performance Evaluation
5.5.1 Simulation Methodology
5.5.2 Group Creation Periods
5.5.3 Normal Operations
5.5.4 Comparison with the MOSPF Protocol
5.6 Summary
6 Group Leader Election under Link-State Routing
6.1 Introduction
6.2 The NLE Protocol
6.2.1 Overview
6.2.2 State Machines and Events
6.2.3 The Operation of LCM
6.2.4 The Operation of MSM
6.3 Proof of Correctness
6.4 Performance Evaluation
6.5 Other Potential Uses of the NLE Protocol
6.5.1 Multicast Address Resolution
6.5.2 Multicast Core Management
6.5.3 Performance of Multicast Group Creation
6.6 Summary
7 Multicast Core Management
7.1 Introduction
7.2 The LCM Protocol
7.3 Performance Evaluation
7.4 Summary
8 Tree-Based Link State Routing
8.1 Motivation
8.2 Overview
8.3 Algorithms
8.4 Proof of Correctness
8.5 Performance Evaluation
8.6 Summary
9 Conclusions and Future Work

LIST OF FIGURES

2.1 Three types of MC topologies.
2.2 The operation of the DVMRP.
2.3 An example of member join operation in the CBT protocol.
2.4 Comparison of multicast forwarding in the CBT protocol and SSTs.
2.5 The operation of the MOSPF protocol.
2.6 Shared trees constructed by the PIM protocol.
2.7 The result of topology transition for the sender s3.
2.8 VC operation in ATM networks.
2.9 Operation of the ACBT protocol.
2.10 Problem in correctly identifying node failure.
2.11 An example of the flooding operation.
2.12 The handling of network partitioning in LSR.
2.13 A network topology.
2.14 Breaking up the network into routing domains.
2.15 The image of the domain A.4.
2.16 The simplified/high-level network image.
3.1 Examples of multipoint connections.
3.2 An example MC built by the CBT protocol.
3.3 An example of the SMC protocol.
3.4 The handling of event LSAs.
3.5 The ReachCore module.
3.6 The processing of the reach-core request message.
3.7 An example of the Basic SAF protocol.
3.8 The Basic SAF protocol with a broken SMC.
3.9 An example of the BE SAF protocol.
3.10 The BE SAF protocol with a broken SMC.
3.11 The sender algorithm of the BE SAF protocol.
3.12 The receive-LSA routine in the BE SAF protocol.
3.13 The receive-dummy routine in the BE SAF protocol.
3.14 The timeout handler in the BE SAF protocol.
3.15 Comparisons of flooding alternatives with a correctly functioning SMC.
3.16 Comparisons of flooding alternatives with partitioned SMC.
3.17 Comparisons of flooding alternatives when SMC does not exist.
3.18 Performance of the SMC protocol.
4.1 ER SAF flooding in normal cases.
4.2 A hypothetical scenario where LSA retransmissions over R degenerate into a bidirectional store-and-forward process.
4.3 The sender algorithm of the ER SAF protocol.
4.4 The ReceiveLSA routine in the ER SAF protocol.
4.5 The ReceiveACK routine in the ER SAF protocol.
4.6 The timeout handler in the ER SAF protocol.
4.7 Comparisons of flooding alternatives with an operational SMC and virtual ring.
4.8 Comparisons of flooding alternatives in the performance of flooding link-down events.
4.9 The average/worst case time to build a virtual ring.
5.1 Example MC showing member switches and attached hosts.
5.2 Problem created by inconsistent topology proposals.
5.3 The topology ordering problem.
5.4 A network/MC configuration.
5.5 Events and advertisements in the GMC protocol.
5.6 The state-transition diagram of the GMC protocol.
5.7 The algorithm for EventHandler.
5.8 The algorithm for AcceptTopology.
5.9 The algorithm for ReceiveLSA.
5.10 The algorithm for TCTimerHandler.
5.11 Performance of the GMC protocol under 1 second arrival interval.
5.12 Performance of the GMC protocol under 10 seconds arrival interval.
5.13 Performance of the GMC protocol under 30 seconds arrival intervals.
5.14 Performance of the GMC protocol under the 10 minutes arrival interval.
5.15 Performance of the GMC protocol in normal operations.
5.16 Topology computations per event of the MOSPF protocol.
6.1 The finite state machines in NLE.
6.2 The leadership consensus machine at a switch x for a group g (LCM(x,g)).
6.3 The membership status machine at a switch x for a group g (MSM(x,g)).
6.4 Performance of the NLE protocol.
6.5 Bandwidth usage of alternative election protocols.
6.6 Number of bindings generated for group creation.
7.1 Core migration in LCM.
7.2 Queue length at the CBS.
7.3 Core-to-member distances produced by various core selection methods.
8.1 An example of tree-based flooding.
8.2 The flooding of two LSAs in different modes.
8.3 The completion of the T-mode flooding in mode G.
8.4 An example of the incorrect leadership problem.
8.5 The routine that floods router local status.
8.6 Processing incoming LSAs.
8.7 Setting preferred leader.
8.8 The BroadcastCTA routine.
8.9 The processing of incoming CTAs.
8.10 The processing of ballot messages.
8.11 The processing of LEAs.
8.12 Comparison of periodic-flooding overhead.
8.13 Efficiency of CTA broadcast.
8.14 Comparison of event-driven flooding performance.
8.15 Overhead of flooding mode switching.

LIST OF TABLES

3.1 Characteristics of randomly generated graphs.
4.1 Complexities of various flooding protocols.
4.2 Characteristics of randomly generated graphs.
5.1 Characteristics of randomly generated graphs.
6.1 Characteristics of randomly generated graphs.
8.1 Control messages in the T-LSR protocol.
8.2 T-LSR data structures at a router x.

Chapter 1

Introduction

Many modern distributed applications involve multiparty communication, in which two or more participants are involved in a group “conversation.” A distinguishing characteristic of multiparty communication is the requirement for a source party (for example, a person who is currently speaking in a teleconference) to be heard by more than one receiving party (for example, the other participants in the conference). Applications that involve multiparty communication include teleconferencing, computer-supported cooperative work, distributed virtual reality, remote teaching, tele-gaming, replicated file servers, parallel database search, and distributed parallel processing. This thesis concerns efficient network support for various aspects of multiparty communication, or, interchangeably, group communication. Previous prominent works in this direction exist in the form of multicast protocols, especially those proposed for the Internet [1]. A multicast protocol routes communication traffic streams from their sources to multiple destinations, as opposed to exactly one destination, as in conventional point-to-point routing.
Multicast methods supported within the network are generally favored over host-level multicast methods, in which a source typically sends a copy of the message individually to each recipient. The problem with the latter approach is that, when the paths from the source to the destinations share a common link, the message traverses that link multiple times. Network-supported multicast methods avoid this redundancy by having the network replicate the message after its traversal of the common link. Representative IP multicast protocols include PIM [2], CBT [3, 4], DVMRP [5], and MOSPF [6]. An important concept supported and used by such protocols is group addressing, whereby more than one communication party can be referred to as a single entity. For example, IP multicast addresses [7], which are perhaps the most well-known group addressing mechanism, allow a data packet that is tagged with a single destination address to be delivered to all the systems that are “listening” to that address. Most group communication implementations must deal with two issues: the collection and management of group membership information, and the routing of traffic streams to reach group members. Alternative approaches to the former issue range from no network support (that is, no membership management in the network) to maintaining a member list at every network node for every active group. The latter issue concerns the topology computation of multipoint connections (MCs), that is, sets of communication channels that connect group members. Various methods of MC topology computation have been devised by researchers to meet different performance criteria, such as the transmission delay experienced by group members and the total bandwidth consumed by the group [8, 9, 10]. Many multicast protocols can be considered as distributed implementations of one, or a small set of, MC topology computation algorithms.
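To make this trade-off concrete, the sketch below scores two hypothetical MC topologies on the two criteria just mentioned: total bandwidth (the sum of link costs in the tree) and worst-case source-to-member delay. The topologies, node names, and link costs are illustrative only, not drawn from any protocol discussed in this dissertation.

```python
def evaluate_mc(tree, source, members):
    """Score a candidate MC topology on two common criteria.

    `tree` maps each non-root node to (parent, link_cost); `source` is the
    tree root.  Returns (total link cost, worst source-to-member delay).
    """
    total_cost = sum(cost for _, cost in tree.values())

    def delay(node):
        # Walk from a member up to the root, accumulating link costs.
        d = 0
        while node != source:
            parent, cost = tree[node]
            d += cost
            node = parent
        return d

    return total_cost, max(delay(m) for m in members)

# Two hypothetical topologies connecting source S to members M1 and M2.
spt = {"M1": ("S", 2), "M2": ("S", 2)}     # direct shortest paths
chain = {"M1": ("S", 2), "M2": ("M1", 1)}  # M2 reached through M1

print(evaluate_mc(spt, "S", ["M1", "M2"]))    # (4, 2): lower delay
print(evaluate_mc(chain, "S", ["M1", "M2"]))  # (3, 3): lower bandwidth
```

Neither topology dominates the other, which is one reason a single MC computation algorithm cannot serve all applications equally well.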
Some group communication implementations must deal with a third issue, namely group leadership, which arises when a multicast protocol assigns special duties to one group member. The leader of a group may serve as the center for membership management, or as a transit point through which all traffic streams destined to the group must be forwarded. Group leaders can be configured manually or can be selected automatically by the network. Important multicast protocols that introduce such “distinguished” members include PIM [2] and CBT [3, 4]. It should be noted that the use of group communication is not restricted to applications; many aspects of the operation of the network itself involve group communication. An important example is the underlying (unicast) routing protocol, which compiles knowledge of the network for the purpose of making routing decisions. A communication network consists of three major components: hosts, switches (or, synonymously, routers), and communication links. The hosts are computers or other devices that allow users to access the network, while switches relay traffic streams through the network over communication links. When requested to relay a traffic stream toward a given destination, a switch must determine on which of its incident links to send the traffic. To ensure the correctness and quality of this decision, the switch requires knowledge about the rest of the network. One approach to achieving this goal is to disseminate the status and configurations of switches and links throughout the network so that a global picture of the network can be compiled at every switch. As such, the routing protocol uses broadcast operations, a special case of multicast operations in which all network nodes are recipients. In this scenario, the entire network can be considered as a group to which switch status information is sent. Routing in communication networks has been extensively studied in computer science.
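The broadcast-based dissemination just described can be illustrated with a minimal sketch: the originating switch sends its status advertisement to every neighbor, and each switch that sees the advertisement for the first time records it and forwards it on its other incident links. The four-switch topology and node names below are hypothetical, and the sketch is an illustration of the idea rather than any particular protocol's flooding rules.

```python
from collections import deque

def flood(adjacency, origin, lsa):
    """Deliver `lsa` from `origin` to every switch by store-and-forward
    flooding; duplicate copies are recognized and dropped, not re-forwarded."""
    received = {origin: lsa}
    queue = deque((origin, nbr) for nbr in adjacency[origin])
    messages = 0
    while queue:
        sender, node = queue.popleft()
        messages += 1                       # one link traversal
        if node in received:
            continue                        # duplicate: drop
        received[node] = lsa
        # Forward on all incident links except the one the LSA arrived on.
        queue.extend((node, nbr) for nbr in adjacency[node] if nbr != sender)
    return received, messages

# Hypothetical ring of four switches: s1 - s2 - s3 - s4 - s1.
adjacency = {"s1": ["s2", "s4"], "s2": ["s1", "s3"],
             "s3": ["s2", "s4"], "s4": ["s1", "s3"]}
image, msgs = flood(adjacency, "s1", "s1-status-v1")
print(sorted(image), msgs)  # all four switches hold the LSA; 5 messages
```

Note that even in this tiny cycle the LSA crosses some links redundantly, which foreshadows the flooding-overhead concerns addressed later in this dissertation.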
Although not all routing methods use broadcast operations in this manner, a very important one does. Link-state routing (LSR) [11, 12] is an increasingly popular type of unicast routing. An LSR protocol makes complete knowledge of the network available to all switches in the manner described above. The local status of each switch, including the bandwidth available at incident links, buffer capacity, and the workload, is learned by the network via the broadcast, or flooding, of link-state advertisements (LSAs). Based on received advertisements, each switch locally maintains a complete image of the network, which it uses to make routing decisions. The Open Shortest Path First (OSPF) protocol [11], introduced by the Internet community, is one of the most well-known LSR unicast protocols. LSR has also been adopted as the routing method for Asynchronous Transfer Mode (ATM), a telecommunications standard that bases all communication on connection-oriented hardware switching of small, fixed-size cells [13]. This dissertation addresses the interaction between group communication and LSR. (We do not distinguish between the terms switch and router here, although one may be preferred over the other in certain contexts of discussion.) Our interest in this problem stems from the following three observations. First, since LSR involves the maintenance of a complete image of the network at every switch, LSR-based networks might use this information to support a wide range of group communication algorithms. Locally available network images at switches may also help reduce the overhead incurred by distributed implementations of these algorithms. The second major advantage of LSR is its fault tolerance. Because every link is monitored by its incident switches, and every switch is monitored by neighboring switches, malfunctioning components and congested areas are made known to all functioning switches promptly.
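A minimal sketch of this monitoring behavior, assuming a hello-style heartbeat between neighbors: a switch declares a neighbor (and the shared link) down when no hello message has arrived within a dead interval, after which it would flood an advertisement announcing the change. The function and its 40-second default are illustrative; the value echoes OSPF's default RouterDeadInterval but is not specific to the protocols in this dissertation.

```python
def detect_failures(now, last_hello, dead_interval=40.0):
    """Return the neighbors whose most recent hello is older than
    dead_interval seconds; each entry would trigger a link-down LSA flood."""
    return [nbr for nbr, t in last_hello.items() if now - t > dead_interval]

# s2 was heard 5 s ago (alive); s3 has been silent for 50 s (declared down).
print(detect_failures(100.0, {"s2": 95.0, "s3": 50.0}))  # ['s3']
```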
Even the earliest LSR protocols were able to survive disastrous situations, such as network partitioning [12]. Building group communication facilities upon such a solid foundation has clear implications with respect to robustness. Third, because some important parts of LSR operations exhibit characteristics of group communication, methods targeted at general-purpose group communication may help, or be tailored to help, the LSR protocol itself. As we will demonstrate in later chapters, an efficient group communication method can be used to accelerate the flooding of switch status information, and leader election plays an important role in large-scale, LSR-based networks that are organized in a hierarchical manner. Considering that LSR is used in the infrastructure of many modern networks, improving its performance will benefit not only multiparty communication applications, but all applications that use such networks. In this dissertation, we model various aspects of group communication as the consensus problem under LSR, which is defined as follows. Due to delays in receiving an event advertisement, switches in an LSR-based network can have different views of the network for a short period of time. The situation is exacerbated when multiple events are advertised simultaneously. Furthermore, a network can be temporarily partitioned due to malfunctioning components, and the resulting subnetworks may evolve independently. The consensus problem under LSR is to guarantee that, given any combination of status changes, component failures, and transmission errors in advertisements, all switches will eventually produce identical images of the network, provided that the network is not permanently partitioned. This definition can be generalized to incorporate group management information, if network images are extended to include such information.
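The convergence requirement can be illustrated with the “newest advertisement wins” rule that link-state protocols apply to sequence-numbered LSAs. The sketch below is a simplification (real protocols also age out advertisements and handle sequence-number wrap-around), but it shows the key property: because the acceptance rule is deterministic, switches that eventually receive the same advertisements build identical network images regardless of arrival order or duplication.

```python
import random

class LinkStateDB:
    """Per-switch network image: for each originating switch, keep only
    the advertisement carrying the highest sequence number."""

    def __init__(self):
        self.image = {}  # origin -> (sequence number, advertised status)

    def apply(self, origin, seq, status):
        current = self.image.get(origin)
        if current is None or seq > current[0]:
            self.image[origin] = (seq, status)
            return True   # new information: a real switch would re-flood it
        return False      # stale or duplicate: drop

lsas = [("s1", 1, "link up"), ("s1", 2, "link down"), ("s2", 7, "link up")]
a, b = LinkStateDB(), LinkStateDB()
for lsa in lsas:                            # one switch: in-order delivery
    a.apply(*lsa)
for lsa in random.sample(lsas, len(lsas)):  # another: shuffled, same LSAs
    b.apply(*lsa)
print(a.image == b.image)  # True: identical images either way
```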
Thesis Statement: By modeling various aspects of group communication (such as leadership, membership maintenance, and communication channel construction) as consensus problems under LSR, we develop novel and efficient solutions for many important issues of network group communication, including fault-tolerant leadership-consensus management, the support of multiple types of multicast communication channels, the handling of disastrous situations, such as network partitioning, and the improvement of LSR itself. The major contributions of this work can be summarized as follows. 1. Switch-aided flooding (SAF). This flooding method takes advantage of ATM hardware cell relaying and duplication to improve the performance of flooding operations in ATM networks. We first develop two SAF protocols, called the Basic SAF and Bandwidth-Efficient (BE) SAF protocols, that construct a hardware-based data-distribution tree to accelerate the dissemination of (network-status) information. To further improve efficiency, we develop a third SAF protocol that uses a ring topology to handle acknowledgments efficiently. The complexity of this Efficient and Reliable (ER) SAF protocol is shown to be optimal in terms of bandwidth consumption, workload at switches, and flooding delay. Improving the performance and efficiency of flooding operations can be very important to the responsiveness of the network in meeting diverse application needs. 2. Generic multipoint connection (GMC) protocol. The GMC protocol is based on LSR and can be considered as an MC protocol “chassis,” that is, a framework that is able to accommodate multiple existing, and future, MC topology algorithms. Such an MC protocol is expected to benefit a wide variety of multiparty communication applications that favor different performance criteria.
For example, a live multimedia broadcast could use an MC topology that minimizes the transmission delays from a single source to a large number of destinations, while a distributed interactive simulation application may prefer an MC topology that can efficiently accommodate a large number of participants, each of which is both a sender and a receiver. 3. Network-level Leader Election (NLE) protocol. The NLE protocol establishes consistent group-leader bindings at network switches, maintains up-to-date member lists at leaders, and handles network partitioning properly. Specifically, given a group g and a set of network segments S1, S2, ..., Sk, k ≥ 1, within each segment Si there will be consensus on a leader for g, and that leader will be an operational switch in Si that maintains a member list of g containing those and only those members in Si. The NLE protocol, which is based on LSR, can be used to select traffic transit centers, or core nodes, for individual multicast groups and to support hierarchical routing and address mapping. In addition, we apply the NLE in the design of the LSR-based Core Management (LCM) protocol. Rather than conducting leader election on a per-group basis, the LCM protocol uses the NLE protocol to select a switch as the core management server, which in turn manages core nodes for all the active groups in the network. Specifically, the LCM protocol automatically selects the core node for a multicast group when the group is created, supports core migration to improve multicast performance during the lifetime of the group, handles the failures of both multicast cores and the core management server itself, and survives network partitioning scenarios. 4. Tree-based LSR (T-LSR). Traditionally, LSR uses two costly techniques to achieve its robustness and responsiveness: message forwarding on every communication link in the flooding of network status updates, and the periodic flooding of local status by each router.
We conclude this research by combining two techniques developed earlier, namely the election of a leader and the construction of MCs, to develop a fundamentally different approach to LSR. The resultant T-LSR protocol imposes only a small fraction of the overhead of previous LSR methods, and guarantees to maintain consistent routing decisions throughout the network under any combination of network component failures, partitioning scenarios, and undetected communication transmission errors. Unlike the SAF work, the T-LSR protocol is designed for use in general-purpose, LSR-based networking platforms, assuming no hardware-based capabilities at switches. The remainder of this dissertation is organized as follows. In Chapter 2, we present background material relevant to this work, including a discussion of the semantics of group communication as perceived by different types of applications, a survey of important multicast protocols, and a survey of link-state routing. We present the SAF protocols in Chapters 3 and 4. Subsequently, we shift our attention to the support of group communication by LSR. The GMC protocol is described in Chapter 5. Chapters 6 and 7, respectively, describe the NLE protocol and its use in the LCM protocol. The T-LSR protocol is presented in Chapter 8. Conclusions and possible future directions are discussed in Chapter 9.

Chapter 2

Background

Advances in communication technology have been dramatic in the last two decades. The Internet, which started out as an experimental project connecting a small number of military sites and universities, has reached all the continents of the Earth. The Internet is no longer a playground for a small group of researchers and academicians, but has become a part of everyday life for millions of people in all kinds of professions.
In the meantime, long-established communication infrastructures, such as telephone and cable television networks, are being transformed into modern information superhighways, and are expected to provide a wide spectrum of new services (such as video on demand, multimedia telephony, data communication, tele-gaming, information retrieval, and so forth) directly to individual homes. Moreover, advances in communication technology are not limited to higher bit rates and lower loss rates; they also include unconventional ways of using communication channels. One possibility, which is actively being investigated by many researchers and developers, is to support multiparty communication, whereby more than two communication parties can conduct “conversations.” In this chapter, we discuss important multiparty communication applications, existing multicast protocols that support those applications, and link-state routing, the type of network routing upon which the proposed methods are based.

2.1 Multiparty Communication Applications

The term multiparty communication, or interchangeably group communication, refers to a wide spectrum of communication applications, including human-to-human interaction, distributed interactive simulation, distributed information management, and efficient information distribution. Naturally, such diverse applications have different needs and expectations regarding services provided by the underlying network. Although this dissertation largely concentrates on core network support for group communication, including multicast operations, membership management, and leadership consensus, in this section we examine the applications and services that may be implemented atop such network services. Our objective is to assess and classify the requirements of such applications.

2.1.1 Human-to-Human Interaction

This class of applications brings together individuals for whom it is either difficult or costly to meet face to face (for example, due to their locations), but who must work cooperatively. An example is videoconferencing, which allows participants to visually and verbally communicate with others over a network [14, 15]. A special type of teleconferencing, called computer telephony, uses computers and data communication networks, rather than public telephone networks, for transmitting audio in real time [16]. Teleconferencing does not necessarily use multimedia; text-based teleconferencing sessions, sometimes called chat rooms, have become popular on the Internet [17]. In addition, Computer-Supported Cooperative Workspace (CSCW) applications enable workers who possess different areas of expertise, and who are geographically separated, to remotely and cooperatively conduct difficult operations or manipulate sophisticated equipment [18, 19]. An interesting characteristic of many human interaction applications is their relatively loose requirements on multicast reliability. Typically, these applications can tolerate occasional loss of multicast data at some destinations, since it is human beings, rather than machines, that receive and interpret incoming messages. Occasional losses of characters in a text-based teleconference, for example, may be perceived as typos rather than transmission errors. When multimedia is used, some loss of image pixels or audio/video frames may produce flares or jumps in playback, but the conversation can continue as long as the degradation is not too severe. On the other hand, delays and jitter in message delivery may be annoying; imagine trying to conduct a conversation if one’s voice is not heard by others until 30 seconds later.
Therefore, many applications in this category use best-effort multicast, a type of multicast that does not enforce the successful delivery of multicast data at all destinations. When possible, such applications might reserve network resources in advance in order to improve the Quality of Service (QoS) provided by the network.

2.1.2 Distributed Interactive Simulation

In a DIS application, a virtual environment (VE) is simulated collectively by a set of hosts over a network [20]; examples include a virtual battlefield, a virtual shopping center, and so forth. The interest in DIS originated in the military; a military training session conducted in a virtual battlefield is much less expensive and, more importantly, much safer than a real exercise. Civilian uses of such technology include simulation of police and fire department exercises, as well as the playing of multiparty games across the Internet. In such VEs, some objects are static, such as trees and lakes in a virtual park, whereas other objects are active: they move voluntarily or react to stimuli (people in the virtual park). Some objects may be computer simulated (for example, enemy tanks in a virtual battlefield), while others are controlled by users (for example, tanks controlled by trainees). In general, VE objects must sense and interact with each other in real time. For this purpose, information regarding the current positions, movements, and actions of objects must be disseminated to all participating hosts in a timely manner. Network-supported multicast operations and other group communication facilities can be used to improve performance.

DIS applications are often characterized by their scale; the number of participants in a VE can range from a few to thousands, and the underlying network can range from LANs to WANs. The size and the geographic distribution of the participant population raise scalability concerns regarding the underlying group communication support.
Moreover, DIS applications call for a special type of reliability, called selective reliable multicast [20, 21]. Consider a situation where user X is engaged in a virtual battlefield and unfortunately loses track of his opponent, user Y, due to the loss of a sequence of three messages that broadcast the positions of Y. While the conventional semantics of reliability would force the host of X to request retransmissions of all three messages, X is interested only in the most recent position of Y. A selective multicast protocol ensures the “freshness” of object states maintained at participating hosts, and does not insist on the successful delivery of all state update messages [21].

2.1.3 Distributed Information Management

Single-server solutions have traditionally dominated the area of information management, including the management of file systems and databases. However, for reasons of scalability and fault tolerance, distributed solutions have been proposed and are gaining momentum. For example, the Coda file system [22] allows a file system to be replicated at more than one file server. A client of such a file system can retrieve files from the nearest server, but must submit file updates to all servers. Further, servers may fail, and backup systems may join the service. If the client-server communication in such circumstances is modeled as a group communication problem, clients perceive the servers as a single network entity, the server group, and need not be concerned with server membership dynamics. Similar methods can be applied to database services, using replicated database servers either for fault tolerance or to improve the performance of query processing through parallel searching.

Many applications in this category demand atomic multicast operations, whereby either all destinations of a multicast message receive the message, or none of them receives it.
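The all-or-nothing guarantee can be sketched as a two-phase exchange over in-memory replicas. This is only an illustrative simplification of atomic multicast semantics; the class and function names are hypothetical, not part of any particular protocol.

```python
# Sketch of all-or-nothing (atomic) multicast to a group of replicas,
# simulated in memory. A two-phase exchange is one common realization.

class Replica:
    def __init__(self, name, fail_prepare=False):
        self.name = name
        self.fail_prepare = fail_prepare  # simulate a replica that cannot accept
        self.staged = None
        self.committed = []

    def prepare(self, msg):
        if self.fail_prepare:
            return False          # vote "no"
        self.staged = msg         # stage the update, do not apply it yet
        return True               # vote "yes"

    def commit(self):
        if self.staged is not None:
            self.committed.append(self.staged)
            self.staged = None

    def abort(self):
        self.staged = None        # discard the staged update


def atomic_multicast(msg, replicas):
    """Deliver msg at all replicas or at none of them."""
    votes = [r.prepare(msg) for r in replicas]
    if all(votes):
        for r in replicas:
            r.commit()
        return True
    for r in replicas:
        r.abort()                 # a single "no" vote aborts everywhere
    return False


servers = [Replica("A"), Replica("B"), Replica("C", fail_prepare=True)]
ok = atomic_multicast("update-1", servers)
# Because C could not accept, the update is applied nowhere,
# leaving all servers consistent.
```

A real protocol must also cope with coordinator failure and lost messages, which this sketch deliberately omits.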
Consider a scenario where a file update request is sent to a group of replicated file servers; an atomic multicast protocol guarantees that either the file is updated at all servers, or at none of them. Although the latter case could be considered a failed multicast operation, at least it leaves the servers in a consistent state.

2.1.4 Information Distribution

This category refers to applications that disseminate information to a large population. A defining characteristic of such applications is the existence of a single, or a small set of, information sources and a potentially unlimited audience size. For example, in a remote teaching application, the lecturer in a virtual classroom can reach a large number of pupils at remote locations. Some existing information distribution processes can also be re-examined in light of new network technology. For example, the traditional process of distributing public-domain software works as follows: the distributor sets up an FTP (File Transfer Protocol) [23] site, and interested users individually connect to the site to download a copy. Download requests for popular software may put a heavy load on the FTP server, which repeatedly performs identical tasks: retrieving the software from a local storage medium and shipping it. (It is not uncommon for servers to be brought down by these workloads.) Recently, the HTTP (Hyper-Text Transfer Protocol) [24] and the World Wide Web [25] have largely replaced the FTP protocol in this distribution process, but the problem remains. In fact, the situation has become worse due to the more user-friendly interfaces and, hence, a larger number of interested users. A much more efficient approach is to have the distributor (also known as the publisher) set up a communication group such that group members, or subscribers, simultaneously receive a copy via multicast.
File distribution protocols, a type of multicast protocol designed for this purpose, have been proposed for use in the Internet [26]. File distribution protocols must use reliable multicast to ensure the receipt of all multicast data at all destinations. Examples of reliable multicast transport protocols can be found in [27, 28, 29].

2.2 Multicast Communication

Multicast operations, which deliver messages to more than one destination, are central to the support of multiparty communication applications. The voice and image of a teleconference member must reach all other members. The movements of objects and the status changes of terrain in a VE must be disseminated to all hosts participating in a DIS session. File update requests must be submitted to all servers. And so on. Indeed, one may argue that the use of multicast is the defining characteristic of group communication. A multicast protocol is a network protocol that defines a set of rules and conventions by which multicast traffic streams are routed from sources to a set of destinations. This section reviews existing multicast solutions developed for two important types of networks, the Internet and ATM networks. We start with a review of routing topologies and membership management techniques.

2.2.1 Multicast Routing Topologies

While many multicast protocols concern simply the construction of individual multicast trees (a set of communication links from a source to a set of destinations), we consider a more general form of multicast routing structure, called a multipoint connection (MC), whereby one or more sources can reach one or more destinations. Three major types of MC topologies have been studied:

1. Source-rooted trees (SRT). The MC topology typically comprises a forest of trees, each individually constructed for a different traffic source. An example in which two trees reach a set of four receivers is shown in Figure 2.1(a).
This type of topology is well suited to applications with a small number of senders and a possibly large number of receivers, such as remote teaching and file distribution applications. SRTs are relatively straightforward to construct and are supported by almost all existing multicast protocols. SRT-based MCs are, however, costly to maintain: a new tree must be constructed for each source, and every existing tree must be extended to reach a new receiver. Similar overheads are incurred for departing senders and receivers. SRTs are supported in the DVMRP protocol [5], the MOSPF protocol [6], and the PIM protocol [2], all designed for use in the Internet. The ATM multicast virtual circuit (multicast VC) [30] also supports SRTs.

2. Symmetric shared trees (SST). A single tree is constructed to span the members of an MC; every member is both a sender and a receiver (as in the case of teleconferencing). Figure 2.1(b) shows an SST spanning five members. The tree in the figure also uses an intermediate node to reach members. An SST-based MC tends to use fewer network resources (in terms of the number of links) than does an SRT-based forest. The problem of determining an optimal shared tree is the well-known minimum Steiner tree problem [31].

3. Receiver-only shared trees (ROST). A single tree spans the receiver members of an MC, while senders use one-to-one unidirectional paths to reach any node on the tree. An example of a ROST with two senders and five receivers is depicted in Figure 2.1(c). The five receivers are connected by a shared tree, depicted with solid lines, and the sender-to-tree paths are represented by dashed lines. This distinction between senders and receivers facilitates membership management on both sides.
For example, a group of replicated file servers can be connected by a ROST such that clients of the server group see a single entity, the server MC; individual servers join and leave the server group without disrupting client-to-server communication. ROSTs are supported by the core-based tree (CBT) multicast protocol [32] and the PIM protocol [2, 33].

(a) two SRTs. (b) an SST. (c) a ROST. Figure 2.1: Three types of MC topologies.

Besides the type of topology, another issue associated with an MC is the topology computation algorithm. Even with a given topology type, different topology computation algorithms can be used, depending on the relative importance of various performance criteria. Such criteria include bounds on transmission delays, network resource consumption, multicast packet loss rate, and so forth. The issue of choosing the right topology algorithm is particularly important to multimedia applications. Such applications typically require quality of service from the network in order to ensure the quality of media playback. Thus their performance relies on good MC topology decisions, so that network components involved in an MC have the capacity and resources to sustain the traffic flowing through the MC. For instance, Zhu [9] presented an algorithm that optimizes cost (for example, bandwidth consumption) in the presence of delay constraints. Bauer [10] examined the multicast tree problem under degree constraints, which may be imposed by hardware switching devices. Waxman [34, 35] addressed the problem of dynamic multicast trees, in which a sequence of membership updates must be carried out one by one.
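One common way to compute an SRT under link-state routing is to run Dijkstra's algorithm over the link-state topology and keep only the links on shortest paths from the source to the receivers. The sketch below assumes a simple adjacency-map graph representation; function names are illustrative, not taken from any protocol.

```python
# A minimal sketch of source-rooted tree (SRT) computation: run Dijkstra
# from the source over the link-state topology, then keep only the links
# on shortest paths from the source to the receivers.

import heapq

def shortest_path_tree(graph, source):
    """graph: {node: {neighbor: cost}}. Returns parent pointers."""
    dist = {source: 0}
    parent = {source: None}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                parent[v] = u
                heapq.heappush(heap, (d + w, v))
    return parent

def srt_links(graph, source, receivers):
    """Links of the SRT that reaches the given receivers."""
    parent = shortest_path_tree(graph, source)
    links = set()
    for r in receivers:
        node = r
        while parent.get(node) is not None:   # walk back toward the source
            links.add((parent[node], node))
            node = parent[node]
    return links

graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 1, "D": 3},
    "C": {"A": 4, "B": 1, "D": 1},
    "D": {"B": 3, "C": 1},
}
links = srt_links(graph, "A", {"D"})
# The tree follows the cheapest route A-B-C-D (total cost 3) rather than
# the heavier direct links A-C or B-D.
```

Delay- or degree-constrained algorithms such as those cited above would replace the plain Dijkstra step with a constrained search, but the tree-extraction step is the same.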
Although this dissertation does not directly address the issue of MC topology computation algorithms, it advocates generic MC protocols that are capable of accommodating a wide range of MC topology types and computation algorithms.

2.2.2 Local Membership Management

In this dissertation, our primary concern is switch/router-level multicast. However, from the viewpoint of applications, communication groups are host groups; members of such groups are computers or other customer devices that allow users to access networks. Typically, a host accesses the network via a router/switch, called the ingress switch of the host, and uses a local membership management protocol to inform its ingress switch/router of a list of groups in which the host wishes to participate. The ingress switch maintains a list of groups, where a group is on the list if one or more attached hosts of the switch are members of this group. A switch that has at least one attached host that is a member of group G will be referred to as a switch member of G; all the switch members of G form a network group. With every switch knowing its membership identity with respect to a group, a multicast protocol, when given a multicast message destined to the group, is responsible for the delivery of the message from the source switch, the ingress switch of the source of the message, to the switch members of the group.

Perhaps the best-known and most widely used local membership management protocol is the Internet Group Management Protocol (IGMP), which is designed for use in broadcast-based LANs [7]. In IGMP, the router of a LAN sends Host-Membership-Query messages destined to a reserved multicast address that includes all hosts in a LAN as members. In response, a host returns a Host-Membership-Report message, which includes a list of multicast addresses in which that host is interested.
Via received membership reports, a router compiles a list of multicast addresses in which the network (LAN) is interested. This process is repeated periodically to accommodate membership dynamics. IGMP uses several optimization techniques to reduce the traffic produced by Host-Membership-Report messages, which must be generated by all hosts in a LAN. Further details of IGMP can be found in [7]. The group communication solutions developed in this dissertation assume the use of an existing local membership protocol, such as IGMP, by hosts to communicate with their respective ingress switches regarding membership identities.

2.2.3 Multicast in the Internet

The Internet is a connectionless network, meaning that, when a sender S wishes to send a datagram to a destination D, the sender is not required to contact D prior to transmission. When S and D share a common communication medium (for example, the two are the endpoints of a point-to-point link, or they both have access to a broadcast medium, such as Ethernet), D receives the datagram directly from S. Otherwise, Internet routers collectively deliver the datagram to D as follows: any router R that receives the datagram forwards it via a communication link that constitutes the first hop of an R-to-D shortest path. This forwarding process starts at the ingress router of S, and is repeated until the datagram arrives at D. In this manner, the routing of a given IP datagram is dynamic and independent of other datagrams. The Internet extends this basic point-to-point datagram delivery model with multicast addresses. A datagram that contains a multicast address as its destination is called a multicast datagram, and must be forwarded to all hosts that are interested in the address. For the discussion of IP multicast, we review four protocols that have been proposed: DVMRP [5], CBT [3, 4], MOSPF [6], and PIM [2]. In this discussion, the term router is preferred over the term switch.
Also, the term multicast group refers to a set of hosts that are listening to an IP multicast address. Following these semantics, multicast groups in the Internet are receiver groups.

Distance Vector Multicast Routing Protocol (DVMRP)

Given a multicast address M, DVMRP builds an SRT individually for each source of M by means of a broadcast-and-pruning process. A multicast stream is initially broadcast throughout the network. The broadcast method, called reverse path forwarding, works as follows. A router R, upon receiving a multicast packet P that originates from S and is destined to M, determines whether P arrived on a link that constitutes the first hop of an R-to-S shortest path. If so, R forwards P to all neighboring routers except the one from which P arrived. Otherwise, the packet is silently discarded by R. In the meantime, routers that are not interested in M send prune messages “upstream,” that is, one hop toward the source S. An upstream router may further discover that all its downstream routers have been pruned from the forwarding tree, and also send a prune message upstream, unless it is itself a member of M. This pruning process is repeated until all the routers involved in the S-to-M forwarding are either members of M or have downstream members of M, producing an SRT that is rooted at S and reaches the members of M.

We use the example shown in Figure 2.2 to illustrate. In Figure 2.2(a), a multicast source is using a broadcast tree to reach five receivers. In Figure 2.2(b), five non-member leaves of the tree send prune messages, which are depicted with dashed lines. In Figure 2.2(c), an intermediate node in the broadcast tree receives prune messages from all its children, and sends a prune message upstream.

(c) the second step in pruning. (d) the resultant multicast tree. Figure 2.2: The operation of the DVMRP.
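The reverse-path-forwarding check described above can be sketched in a few lines. Here `first_hop` stands in for the unicast routing table (the neighbor on a router's shortest path back toward a given node); all names are illustrative.

```python
# Sketch of the reverse-path-forwarding (RPF) check used by DVMRP's
# broadcast phase. A packet is forwarded only if it arrived on the link
# that lies on this router's shortest path back to the source.

def rpf_forward(router, source, arrival_neighbor, neighbors, first_hop):
    """Return the list of neighbors to which the packet should be forwarded."""
    if arrival_neighbor != first_hop[(router, source)]:
        return []     # arrived off the shortest path back to the source: drop
    # Forward on all links except the one the packet arrived on.
    return [n for n in neighbors if n != arrival_neighbor]

# Toy topology: router R's shortest path back to source S goes through X.
first_hop = {("R", "S"): "X"}
rpf_forward("R", "S", "X", ["X", "Y", "Z"], first_hop)  # ['Y', 'Z']
rpf_forward("R", "S", "Y", ["X", "Y", "Z"], first_hop)  # []  (duplicate, dropped)
```

The drop rule is what keeps the broadcast loop-free: each packet copy is accepted over exactly one incoming link per router.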
The multicast tree resulting from this pruning process is depicted in Figure 2.2(d). An interesting aspect of DVMRP is that group membership information is not disseminated, but discovered during tree construction by means of “negative” membership reports, namely the prune messages. However, for this very reason, later membership changes cannot be incorporated into established SRTs. To remedy this problem, existing SRTs must be periodically torn down and reconstructed [5]. This approach causes delays in the handling of membership or network changes. For example, a new member will not receive multicast packets until the next phase of tree reconstruction. Periodic tree construction also imposes unnecessary overhead during “quiet” periods, that is, when no changes are taking place. Moreover, shared-tree topologies are not supported by DVMRP. Additional details of DVMRP can be found in [1, 5]. A hierarchical generalization of DVMRP, called Hierarchical DVMRP (HDVMRP), is described in [36].

Core-Based Tree (CBT) Multicast Protocol

Unlike DVMRP, the CBT protocol [4, 37] builds a shared multicast tree for each group. In the CBT protocol, each multicast group is assigned a distinguished router, called the core node of the group. A member joins the group by sending a JOIN-REQUEST message “toward” the core node; the request stops at the first node that is already on the tree. A branch to the new member is set up by a JOIN-ACK message, which follows the reverse of the path traversed by the JOIN-REQUEST message. A member leaves the group (that is, detaches itself from the tree) by sending a QUIT-REQUEST message to its parent node in the tree, which will also quit if it is not itself a group member and has no other children. An example of the member join operation in the CBT protocol is given in Figure 2.3. Figure 2.3(a) shows the shortest path P from a joining member X to the core node.
It is switch Y, the first on-tree switch along P, that grants the JOIN-REQUEST and returns a JOIN-ACK message, as depicted in Figure 2.3(b). The result of this join operation is shown in Figure 2.3(c).

(a) the shortest path P from a joining member X to the core. (b) the delivery of CBT messages. (c) the resultant tree. Figure 2.3: An example of the member join operation in the CBT protocol.

The CBT protocol handles adverse network events, including router and link failures, by periodically sending CBT-ECHO-REQUEST messages upstream. If a corresponding CBT-ECHO-REPLY is not heard, a member must rejoin the group by finding another path to reach the core. Compared to the DVMRP protocol, the CBT protocol handles membership changes in an event-driven manner, but still uses a periodic method to incorporate network status changes, causing delays in the handling of such changes. This hybrid approach to handling changes may serve some applications well, but could be inappropriate for critical applications that must operate seamlessly in the presence of network changes. Another concern with the CBT protocol is its inflexibility in MC topology: the protocol does not support the SRT MC topology. Further, the restriction that a multicast packet must be forwarded to the core node before being forwarded along tree branches imposes unnecessary steps in multicast forwarding. To illustrate the cost of this restriction, let us consider a scenario where the group members shown in Figure 2.3(c) are also sources to the group (for example, they are conducting a teleconference). Figure 2.4(a) shows the forwarding of a multicast packet originated from node X, when the session is supported by the CBT protocol. For comparison, Figure 2.4(b) shows the forwarding of the same packet when an SST of the same topology is used.
As we can see, the CBT protocol incurs extra forwarding steps, depicted by dashed lines in Figure 2.4, due to its restriction on the starting point of tree distribution.

(a) using the CBT protocol. (b) using an SST. Figure 2.4: Comparison of multicast forwarding in the CBT protocol and SSTs.

Besides the CBT protocol, the concept of core-based multicast has also been adopted in other IP multicast protocols. Specifically, the Ordered CBT (OCBT) protocol [38] addresses the concern of core failures in the CBT protocol, and the Border Gateway Multicast Protocol (BGMP) [39] constructs core-based multicast trees that span the boundaries of autonomous systems (that is, routing domains in the Internet).

Multicast Extension to OSPF (MOSPF)

The MOSPF protocol [6] is an extension of the Internet LSR protocol, OSPF [11]. In the MOSPF protocol, the identities of group members are broadcast via group-membership LSAs, such that all routers maintain complete member lists for all active multicast addresses. The distribution channel for a multicast group is constructed when the first datagram destined for the multicast address is sent. Upon receiving the first datagram that originates from a source S and is destined for a multicast address M, a router consults its local database for the member list of M and computes a shortest-path tree T that is rooted at the source switch of the datagram and reaches the switch members of M. Subsequently, the router saves a multicast routing entry such that datagrams from S to M will be forwarded via a set of outgoing links determined by T, and forwards the datagram accordingly. This forwarding will trigger further topology computations at downstream routers. An example of MOSPF operation is given in Figure 2.5, where a host that is attached to router A sends a datagram to a multicast group with members attached to routers C and D.
As shown in Figure 2.5(a), router A computes a shortest-path tree that is rooted at A and reaches C and D. This computation is possible because the topology of the network is compiled by the underlying LSR protocol, OSPF, while the member list of the destination group ({C, D}, in this example) is made available by MOSPF. The resultant tree shows that A must forward the datagram to F, which upon receipt will perform the tree computation again and learn of its downstream routers C and D; see Figure 2.5(b). When C and D receive the datagram, they will also carry out the identical tree computation, only to notice that they are leaf routers and should forward the datagram to their attached hosts; see Figure 2.5(c).

As illustrated, the MOSPF protocol imposes redundancy in topology computation: identical computations are performed at all routers involved in a multicast tree. This problem is exacerbated by the restriction that the MOSPF protocol supports only SRTs; hence this computational redundancy is incurred on a per-source, per-group basis rather than a per-group basis. Furthermore, to adapt to membership and network topology changes after a tree construction process, multicast routing entries created for the tree must be cleared upon the arrival of LSAs that advertise membership or network changes, resulting in the re-construction (and re-computation) of the tree when new multicast datagrams arrive.

Protocol Independent Multicast (PIM)

With the MOSPF and DVMRP protocols, every router in a routing domain (or possibly the entire Internet) may be involved in a multicast session. In the case of the MOSPF protocol, every router receives membership-change LSAs and maintains member lists for all active multicast groups. With the DVMRP protocol, a multicast stream is periodically broadcast throughout the network.
The overhead of network-wide involvement may be justified when a large fraction of the hosts in the network is interested in the multicast; such multicast sessions are sometimes termed dense-mode multicasts [2]. In contrast, sparse-mode multicast refers to cases where the participants represent only a small fraction of the hosts in the network and, therefore, network-wide involvement is considered too costly. The PIM protocol supports both dense-mode and sparse-mode multicast.

(b) the tree computation and forwarding at F. (c) the tree computation and forwarding at C and D. Figure 2.5: The operation of the MOSPF protocol.

Like PIM, the CBT protocol, which is a representative approach to supporting (receiver-only) shared-tree MCs, also does not incur network-wide involvement. However, the PIM protocol further emphasizes the need to support other MC topology types, specifically the SRT topology. In addition, the designers of the PIM protocol sought universal applicability of the protocol, and therefore designed the protocol so as not to rely on any specific routing protocol; hence the name Protocol Independent Multicast. The PIM approach to supporting both dense-mode and sparse-mode multicast is straightforward; it actually comprises two multicast protocols, one for each mode. In the dense mode, the PIM protocol uses the DVMRP protocol (the MOSPF protocol was not chosen because of its dependence on LSR). For sparse-mode multicast, the PIM protocol “initially” builds receiver-only shared trees; the construction of SRTs is performed selectively for some sources during the multicast session. A network region, whether it is a LAN, a routing area, or an autonomous system, that wishes to participate in a sparse-mode multicast is assigned a rendezvous point (RP), which must be a PIM-capable router in that region.
The RP of a region plays a role similar to that of the core node in the CBT protocol. Members in that region issue RP-JOIN requests, which serve the same function as the JOIN-REQUEST messages of the CBT protocol, producing within the region a ROST rooted at the RP. If N regions are interested in a multicast address, N different RPs will be associated with the address, and N shared trees will be constructed. The source of a datagram with a given multicast address must forward the datagram to all RPs associated with that address. Each RP will forward the datagram along shared-tree branches to reach group members. These concepts are illustrated in Figure 2.6, where two shared trees are constructed for a multicast address that has three sources. Detailed information about the sparse-mode PIM protocol, called PIM-SM, can be found in [33].

Figure 2.6: Shared trees constructed by the PIM protocol.

The PIM-SM protocol constructs SRTs by means of a topology transition process, which operates in a data-driven manner. When router members of a multicast address observe heavy traffic from a source S, they may determine that the source could be better served by a private distribution channel, and issue SOURCE-JOIN requests to S, resulting in a multicast tree that is rooted at S. Continuing the previous example, Figure 2.7 shows that an SRT has been built for the source S3.

Figure 2.7: The result of topology transition for the sender S3.

PIM’s approach to supporting multiple MC topology types is elegant and efficient; we expect wide acceptance of the protocol in the Internet. However, its topology transition process, which builds SRTs, is data-driven and, hence, cannot be applied to connection-oriented networks, such as ATM networks, where routing must be established and maintained in a manner that is independent of traffic streams.
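The join-toward-core construction shared by CBT and sparse-mode PIM can be sketched as follows. Here `next_hop` plays the role of the unicast routing table, and all names are illustrative rather than taken from either specification.

```python
# Sketch of grafting a member onto a core/RP-rooted shared tree: a JOIN
# travels hop by hop toward the core and stops at the first router already
# on the tree; the acknowledgment retraces the path, installing the branch.

def join_group(member, core, next_hop, on_tree, tree_links):
    """Graft `member` onto the shared tree rooted at `core`."""
    path = [member]
    node = member
    while node not in on_tree:            # the JOIN travels toward the core
        node = next_hop[(node, core)]
        path.append(node)
    # The JOIN-ACK retraces the path, installing parent-to-child branch links.
    for child, parent in zip(path, path[1:]):
        tree_links.add((parent, child))
        on_tree.add(child)
    return path

# Toy state: the tree so far is core -> Y; X's route toward the core goes via Y.
on_tree = {"core", "Y"}
tree_links = {("core", "Y")}
next_hop = {("X", "core"): "Y"}
path = join_group("X", "core", next_hop, on_tree, tree_links)
# The JOIN from X stops at Y, the first on-tree router; (Y, X) becomes a branch.
```

Because the JOIN stops at the first on-tree router, each join touches only the routers on the new branch, which is what keeps the sparse-mode approach free of network-wide involvement.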
The previous methods and current challenges of supporting group communication in such networks will be reviewed in the next section. An important open issue regarding the PIM protocol is the selection of RPs and the dissemination of their identities. According to the Internet multicast model described in [7], a host should be able to listen to a multicast address simply by informing its ingress router of the address. Since hosts are not obligated to provide RP identities, routers must obtain RP identities via an independent mechanism, which has not yet been determined at the time of this writing [2]. As we will show in Chapters 6 and 7, modeling this RP management problem as a leader election problem within the network constitutes an important part of our research.

2.2.4 Multicast in ATM Networks

ATM networks are connection-oriented networks that relay small fixed-size cells in hardware. An ATM cell is 53 bytes long, comprising 48 bytes of payload and 5 bytes of control information. Before transmission, a traffic source must set up a virtual circuit (VC) that defines a path between the source and a destination. All cells belonging to the VC will follow this path to reach the destination. Switching fabrics at ATM switches along the path use a virtual circuit identifier (VCI), contained in the control bytes of each cell, to determine the outgoing link for the cell.

These concepts are perhaps best explained using an example. Figure 2.8(a) depicts a VC between a source host S and a destination host D. In the example, we assume that every switch has four input ports, numbered 0 to 3, and four output ports, again numbered 0 to 3. Before transmission, the source host S issues a VC setup request to its ingress switch X; the conventions and procedures that a host follows to communicate with its ingress switch are termed the User-Network Interface (UNI) [30].
Included in the request message is an input-VCI field, which indicates the VCI value chosen by the requesting host to identify cells belonging to the VC. In the example, the source host S chooses the value 5. The ingress switch X determines an output port that leads to the next-step switch defined by a shortest S-to-D path (port 2, in this example, which leads to the switch Y), selects an unused output VCI value for the VC (9, in the example), replaces the value of the input-VCI field with the new value, and forwards the request to the next-step switch (namely, Y). The set of conventions and procedures that network switches use to communicate with each other is called the Private Network-to-Network Interface (PNNI) [13]. Continuing the example, the same task is repeated at switches Y and Z. Switch Y selects the output VCI 2, which becomes the input VCI for Z, and forwards the request to Z via port 2. Switch Z selects the output VCI 6, which becomes the VCI value that the destination host D uses to recognize cells pertaining to the VC. An S-to-D connection has been established. When traffic flows through the connection, an involved switching fabric uses a switching table to determine the forwarding of cells. The switching table at port 0 of switch X is shown in Figure 2.8(b). As we can see, the input VCI 5 is indexed into an entry that instructs the switching fabric of X to forward cells with that VCI value to output port 2, and to tag those cells with the new VCI value 9. Further details of the VC setup procedure can be found in the UNI 3.1 [30] and PNNI 1.0 [13] standards, which have been produced by the ATM Forum, an international non-profit organization that comprises industrial and academic members.

[Figure 2.8(a): the use of VCI values along a path, from the ingress switch of S (switch X) through switches Y and Z to the ingress switch of D. Figure 2.8(b): the switching table at port 0 of switch X, which maps input VCI 5 to output port 2 and output VCI 9.]
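The hop-by-hop VCI translation in this example can be sketched as follows. Each switch keeps, per input port, a table mapping an input VCI to an (output port, output VCI) pair; the VCI values mirror the example above (S chooses 5; X, Y, and Z assign 9, 2, and 6), while the downstream input-port numbers are assumptions made only for this sketch:

```python
# Per-switch switching tables: (input port, input VCI) -> (output port, output VCI).
# VCI values follow the text; input/output port numbers at Y and Z are assumed.
tables = {
    "X": {(0, 5): (2, 9)},   # port 0, VCI 5 -> out port 2, tagged VCI 9
    "Y": {(0, 9): (2, 2)},
    "Z": {(0, 2): (1, 6)},
}

def relay(path, in_port, in_vci):
    """Follow one cell through the switches on `path`, rewriting its VCI
    at every hop, as an ATM switching fabric would."""
    trace = []
    for switch in path:
        out_port, out_vci = tables[switch][(in_port, in_vci)]
        trace.append((switch, in_vci, out_vci))
        in_port, in_vci = 0, out_vci   # assume arrival on port 0 downstream
    return trace, in_vci

trace, final_vci = relay(["X", "Y", "Z"], 0, 5)
```

The final VCI value, 6, is the one the destination host D uses to recognize cells belonging to this VC.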
Figure 2.8: VC operation in ATM networks.

The connection-oriented nature of ATM requires that the topology of an MC be determined and constructed before the presence of associated traffic streams. Further, the maintenance of the topology must be performed in a signaling-driven manner, that is, in response to network control messages, rather than the receipt of multicast data itself. For these reasons, many IP multicast solutions are not applicable to ATM networks. In this section, we discuss the ATM protocol used to establish one-to-many VCs, or multicast VCs, which are the only MC type presently supported by ATM standards. At the time of this writing, it is not clear which protocol(s) will be used in ATM to support other MC types. However, we will survey two proposals that have been discussed in the ATM Forum.

Multicast VCs

The concept of one-to-one VCs can be generalized to one-to-many VCs, or multicast VCs. This generalization requires an optional hardware feature, called cell replication, in order to forward multiple copies of an incoming cell via different output ports. This feature is supported in many commercial ATM switches, for example, those provided by Fore Systems [40]. In UNI 3.1, a multicast VC has exactly one source party, called the root, and can be routed to one or more receiving parties, called leaves, following a tree topology. A multicast VC is set up by its root, which uses a procedure similar to the one-to-one VC setup procedure to connect to the first receiver. The result of this first step is a multicast VC with exactly one leaf node. Subsequently, the root can issue as many ADD-PARTY messages as necessary to attach additional leaves to the multicast VC. However, current ATM standards do not support group addresses, meaning that the source must learn the identities of receivers via a host-level protocol.
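The root-initiated setup just described can be modeled with a small sketch. The class and method names here are ours, chosen for illustration; they are not ATM signaling message formats:

```python
# Toy model of a UNI 3.1 root-initiated multicast VC: the root connects to a
# first leaf at setup time, then attaches further leaves via ADD-PARTY.
class MulticastVC:
    def __init__(self, root, first_leaf):
        self.root = root
        self.leaves = {first_leaf}      # setup always yields exactly one leaf

    def add_party(self, leaf):
        """Root-issued ADD-PARTY: attach one more leaf to the tree."""
        self.leaves.add(leaf)

    def send(self, cell):
        """Cell replication: one incoming cell, one copy per leaf."""
        return [(leaf, cell) for leaf in sorted(self.leaves)]

vc = MulticastVC("S", "D1")     # root S connects to the first receiver
vc.add_party("D2")
vc.add_party("D3")
copies = vc.send("cell0")
```

Note that the root must already know the identities D1, D2, and D3; as the text observes, current ATM standards give it no group address from which to learn them.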
In the most recent version of the ATM UNI (namely, UNI 4.0), receiver-initiated actions are supported so that receivers can join and leave a multicast VC without involving the source party. Again, receivers must learn via a host-level protocol the identities of the source party, or parties, in a multiparty communication application.

Proposals for Supporting Group Addressing in ATM

The lack of a group addressing mechanism in present ATM standards leaves the users/hosts to deal with the membership issue in group communication. The ATM Forum intends to add group addressing support in a future release of the PNNI standard [13]. Here, we review two proposals that have emerged within the ATM Forum.

1. A central-server approach for group membership management is promoted in [41]. In this proposal, a switch in an ATM network is configured as the group management center of the network, where the member lists of all active groups are maintained. Changes in membership must be sent to this switch in order to update member lists. A host that wishes to construct a multicast VC to a group G contacts the management center to obtain the member list of G, and follows the UNI 3.1 standard to set up the multicast VC. This approach is designed for membership management, and facilitates the construction of multicast VCs, which are SRTs. Other MC topology types, such as receiver-only and symmetric shared trees, are still not supported. Further, the issue of a single point of failure at the management center is considered "not critical," and is not addressed [41].

2. A variation of the CBT protocol for use in ATM networks, called the ACBT (ATM CBT) protocol, is described in [42]. This protocol is similar to the CBT protocol in that each group is assigned a core node, which is the root of a tree that reaches group members. This tree, however, is not an ATM multicast VC.
Rather, the signaling modules of switches involved in the tree maintain the parent/child relations defined by the tree. In the ACBT protocol, a source party S can connect to all the members of a group via a single connection request, resulting in a multicast VC whose topology is the concatenation of an S-to-core path and the shared tree rooted at the core. To illustrate, let us consider the three-member group shown in Figure 2.9(a), where the shared tree of the group is depicted by dashed lines. Figures 2.9(b) and (c) show the multicast VCs for two different sources. As shown in Figure 2.9(c), a link may be used by a multicast VC in two directions. This sometimes happens because the source must reach the core, the only contact point in the CBT and ACBT protocols, before the shared tree can be used. We also emphasize that the two multicast VCs shown in the figure operate independently, despite the fact that they use identical sets of communication links (as defined by the shared tree) after a packet has reached the core node; the shared tree of a group exists in the form of signaling states, and is merely used to define the topology of multicast VCs destined to the group. Since multicast VCs destined to a group must be set up individually (although they share the same tree topology), it is difficult to support some ATM features on a "per-group" basis. For example, given a group G, network resources must be reserved for each individual multicast VC destined to G, rather than for the group alone.

[Figure 2.9: Operation of the ACBT protocol. (a) a three-member group and its shared tree; (b) an example of resultant multicast VCs; (c) another example of resultant multicast VCs. Legend: member, core, source, MC link, shared-tree link.]

In summary, the ACBT protocol supports group addressing and multicast VCs, which are source-rooted but not necessarily shortest-path trees. Interestingly, the protocol, albeit a CBT variation, does not support shared-tree MCs.
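The ACBT topology rule, concatenating the source-to-core path with the shared tree rooted at the core, can be sketched as follows. The node names are illustrative; note how the link between `a` and the core is used in both directions, the situation depicted in Figure 2.9(c):

```python
# Directed shared-tree branches, oriented away from a hypothetical core.
shared_tree = [("core", "a"), ("a", "m1"), ("a", "m2")]

def acbt_vc_links(source_to_core_path, tree):
    """An ACBT multicast VC's topology is the source-to-core path followed
    by the shared tree; a physical link may thus appear in both directions."""
    return source_to_core_path + tree

# A source S whose shortest path to the core passes through tree node `a`.
vc = acbt_vc_links([("S", "a"), ("a", "core")], shared_tree)
```

The resulting link list traverses the (a, core) link toward the core and then back out again, because the core is the only contact point at which the shared tree may be entered.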
Another respect in which the ACBT protocol differs from the CBT protocol is the management of the core. The ACBT protocol handles the selection of the core node when a group is created, rather than leaving the task to users/hosts, as in the case of the CBT protocol. When the first member of a group joins, the ACBT protocol randomly picks a switch as the core, and advertises this core-group binding via LSAs. This binding is recorded as part of the network image at every switch. Subsequent joining members follow a CBT-like procedure to connect themselves to the core, whose identity should now be available throughout the network. When different cores are suggested by several initial members that join the group at approximately the same time, the core candidate with the smallest ID wins.

2.2.5 Discussion

In summary, the designers of multicast protocols face the following challenges. First, multiparty communication applications demand a variety of MC topology types to meet different performance criteria. While multiple protocols could be used to achieve this goal, a single "generic" solution promises to avoid unnecessary overheads and redundancy. Second, it is desirable that host members of a group be aware only of the address of the group, and not the details of the underlying MC protocol. The fact that the group is associated with a core node, or a set of rendezvous points, should be hidden from users and hosts. As a result, any distinguished members needed in the protocol should be selected by the network, rather than by users or hosts. Third, when such distinguished members are required, the concern of a single point of failure arises. The network, rather than users and hosts, must handle such failures. Presently, neither the IP multicast protocols nor the ATM solutions meet all these requirements.
A main theme of this thesis is to show that these difficult issues in the Internet and in ATM networks can be appropriately addressed when the network uses a specific type of routing, namely, link-state routing. Specifically, an LSR-based generic MC protocol will be presented in Chapter 5, and alternative approaches to modeling the RP/core management as a leader election problem in LSR-based networks will be discussed in Chapters 6 and 7.

2.3 Overview of Link-State Routing

LSR was initially designed for use in the ARPANET [12]; fault tolerance issues associated with the original protocol are addressed in [43]. The ISO (International Standards Organization) version of LSR, the IS-IS (Intermediate System to Intermediate System) protocol [44], improves the efficiency of LSR when used in networks interconnected by broadcast-based LANs, such as Ethernet and token ring. These improvements have been incorporated in a new Internet routing protocol, called OSPF (Open Shortest Path First) [11]. Another recent application of LSR is the ATM PNNI standard [13], whose contributions include, among others, a method for hierarchically constructing large-scale, LSR-based networks, and an LSR-based group leader election protocol. In this section, we provide background on LSR that will be needed later in the dissertation. For purposes of discussion, the terms router, switch, and node will be used interchangeably.

2.3.1 Basic Operation

The essence of LSR is to maintain complete network images at all switches. For this purpose, every switch broadcasts throughout the network its local states, including nodal states and link states. Nodal states concern the working condition of a switch, for example, the workload at the switch. Link states describe communication links that are incident to the switch. Typically, link states include queueing delay, data loss rate, bandwidth, the capacity of associated buffers, monetary cost (for using the link), and so on.
For historical reasons, control messages containing either state type are referred to as link-state advertisements (LSAs). After compiling an image of the network incrementally via received LSAs, a switch X routes traffic to a destination D according to a shortest X-to-D path computed locally. In general, the universal availability of complete network knowledge at every switch creates a robust infrastructure to support various network services, including group communication. In order to update network images to reflect network status dynamics, every switch constantly monitors its local states and advertises changes in these states immediately. For example, when a link fails, the value of its working state is changed from ON to OFF, producing a link-down LSA from each of its endpoints. Similarly, link-up LSAs are flooded when the link later returns to an operational state. The working state of a link, which has only two values, is discrete; changes in such states are always advertised. For continuously valued states (such as queueing delay, which is a positive real number), a change in state is advertised only if the change exceeds a predetermined threshold. The topology of a network is defined by the set of operational switches and communication links. Although it may be tempting to consider the working states of switches (as is the case for links), such states are not defined in LSR. That is to say, there are no "switch-up/switch-down" LSAs. This is because an LSR protocol cannot distinguish failed nodes from nodes that become unreachable due to failed links. To illustrate, let us consider the example in Figure 2.10(a), where the node X crashes. The five neighboring switches of X (A, B, C, D, and Y) detect the lack of responsiveness of the five links incident to X, and flood five respective link-down LSAs.
In this example, switch A can learn of only four link-down events, because switch Y, which advertises the failure of the (X, Y) link, has been isolated by the failure of X. Figure 2.10(b) shows the network as perceived by switch A (and any switch other than X and Y) at this moment in time.

[Figure 2.10: Problem in correctly identifying node failure. (a) node X crashes; (b) the perception of nodes other than X and Y.]

This observation suggests that an LSR protocol, which is not able to determine whether a switch has failed, should instead be concerned with "reachability" to the switch. For example, once X and Y become unreachable, they cease to exist with respect to the operation of A. The concept of reachability is important not only to the handling of node failures, but also to the handling of much more disastrous circumstances, such as network partitioning. We will return to this issue in the next section. A flooding protocol, used for the broadcast of network status information, is a highly robust protocol that guarantees that eventually all network nodes reachable from the source of an LSA will receive the LSA. The "conventional" flooding protocol works as follows. In order to send an LSA, the source switch sends the LSA to all its neighboring switches. For identification, LSAs typically contain the source address and a sequence number. When an LSA is received by another switch for the first time, it is forwarded on all incident links, except the one on which it arrived. Copies of LSAs that have already been seen by a switch are silently ignored. In this manner, every LSA is forwarded by every switch exactly once. An example of this flooding protocol is depicted in Figure 2.11; the flooding operation requires four steps to complete.

[Figure 2.11: An example of the flooding operation, steps 1 through 4. Legend: node that has received the LSA, LSA transmission, node that has finished the flooding.]
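The conventional flooding protocol can be simulated with a short round-based sketch: a switch receiving an LSA for the first time forwards it on all incident links except the arrival link, and duplicate copies are ignored. The star topology below is illustrative, not the network of Figure 2.11:

```python
def flood(adj, source):
    """Simulate conventional flooding of one LSA originating at `source`.
    Returns (synchronous steps until no copy is in flight, switches reached)."""
    seen = {source}
    holders = [(source, None)]        # (switch holding a fresh LSA, arrival link)
    steps = 0
    while holders:
        forwarded = []
        for node, came_from in holders:
            for nbr in adj[node]:
                if nbr == came_from:  # never sent back on the arrival link
                    continue
                if nbr not in seen:   # duplicate copies are silently ignored
                    seen.add(nbr)
                    forwarded.append((nbr, node))
        holders = forwarded
        steps += 1
    return steps, seen

star = {"s": ["a", "b", "c"], "a": ["s"], "b": ["s"], "c": ["s"]}
steps, covered = flood(star, "s")
```

On the star, the LSA reaches every switch in one step, and one further step is consumed by the leaves forwarding (and the hub discarding) their copies, illustrating that every switch forwards the LSA exactly once.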
The conventional flooding method has been adopted for use in both connectionless networks, such as the Internet, and connection-oriented networks, such as ATM networks. In the case of ATM, the hardware-based multicast method, namely the use of multicast VCs, has previously been considered unsuitable for the flooding/broadcast of LSAs, because it cannot guarantee the delivery of LSAs to all reachable nodes, as guaranteed by the conventional flooding method. Hence, the LSR operation in ATM proceeds in a less efficient, hop-by-hop manner. In Chapters 3 and 4, we will demonstrate how to take advantage of multicast VCs in flooding operations, while providing guaranteed delivery.

2.3.2 Fault Tolerance Issues

Networks are often expected to operate for long periods of time, in the presence of adverse conditions or even catastrophic scenarios. While many distributed applications ignore very rare adverse events, the networks themselves, and their underlying routing protocols, are expected to survive. Two types of such events, or faults, are of particular interest to LSR researchers: transmission errors not caught by the error detection mechanism (for example, CRC checksums) and the partitioning of the network. In LSR-based networks, the fault tolerance issue is closely related to the consensus problem. Recall that the consensus problem under LSR is to ensure the convergence of network images under the most adverse situations. Fault tolerance mechanisms in LSR either try to eliminate deterrents to achieving consensus or try to achieve consensus as soon as a consensus-prohibiting situation is cleared. A number of methods have been proposed to achieve this highly challenging goal [45]. Following is a summary of the widely accepted OSPF solution [11]; a similar solution is adopted in the ATM PNNI standard [13].

• Switches not only advertise status changes immediately, but also broadcast their status periodically.
This practice enables temporarily isolated segments of the network to exchange information with each other after re-unification (one segment learns of the existence of other segments in the next flooding cycle). Periodic flooding also controls the lifetimes of corrupted parts of network images that may occur due to undetected transmission errors, for the corrupted information will be overwritten in the next cycle of flooding.

• An aging mechanism is used to identify obsolete information. Specifically, every entry in a network image has an associated aging timer, and the entry is discarded when its timer goes off. Nullified parts in a network image can later be filled by relevant LSAs with any sequence number value. The aging mechanism is needed to correct errors that the re-flooding mechanism alone may take too long to correct. An example is undetected transmission errors in the sequence number field of LSAs. Let us consider an LSA with sequence number n that is incorrectly received as n + k at some switch. Further assume that the source of the LSA re-floods every minute. If the value of k is 2^28, it would take more than 500 years for the source switch to catch up (that is, to use sequence numbers larger than n + k) and override the corrupted information. An aging mechanism solves this problem.

To further illustrate the use of these concepts, let us continue the example of Figure 2.10. Figure 2.12(a) depicts the local image at switch Y after the crash of X. We point out that the local image at switch Y (incorrectly) still contains the links (X, A), (X, B), (X, C), and (X, D), because Y cannot receive the corresponding link-down LSAs. Using the aging mechanism, any node other than X and Y will remove the link (X, Y) and the nodes X and Y from its local network image, after not hearing periodic flooding from X and Y for a predetermined period of time.
Put another way, the {X, Y} induced subgraph "ages out" in other parts of the network because it is no longer periodically reinforced by the two nodes. Figure 2.12(b) depicts the network image at any non-(X, Y) node after the aging mechanism takes effect. The network image at Y after aging consists of only one node, Y itself, since all the other nodes will age out at Y. This image is omitted in Figure 2.12. To finish the story with a happy ending, we assume that node X later becomes operational. After the revival of X, all switches learn of the existence of links incident to X via link-up LSAs. Switches other than X and Y learn of the existence of these two nodes via the periodic status broadcasts from them. Similarly, the nodes X and Y become aware of the other parts of the network via periodic status broadcasts from other nodes. Eventually, all the switches will learn the network topology shown in Figure 2.12(c), achieving consensus on the network images throughout the network.

[Figure 2.12: The handling of network partitioning in LSR. (a) the local image at Y after the crash of X; (b) network images at nodes other than X and Y, after aging; (c) the consenting network image, after the revival of X.]

The robustness of LSR is a major reason for its wide acceptance in many modern networks. However, the operation of LSR may raise concerns about scalability. First, the size of network images grows with the size of the network, which is consequently limited by the switch with the least memory space. Second, for a network with an average degree (the average number of links incident to a node) d, every LSA will be received on average d times by every switch. Further, if the network has N switches that periodically flood their status every T seconds, every switch needs to handle dN/T LSAs per second. When N is sufficiently large, the workload of LSA processing alone will exceed the computation capacity of switches, or the flooding of these LSAs may use up the bandwidth of the network.
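The back-of-the-envelope load estimate above is easy to reproduce; the numbers plugged in below are illustrative, not measurements:

```python
def lsa_load(n_switches, avg_degree, period_seconds):
    """Approximate LSAs handled per switch per second when all N switches
    flood periodically: each LSA arrives about d times, giving d*N/T."""
    return avg_degree * n_switches / period_seconds

# E.g., 1000 switches of average degree 4 flooding every 30 seconds.
load = lsa_load(n_switches=1000, avg_degree=4, period_seconds=30)
```

Even with these modest parameters, each switch must process on the order of 130 LSAs per second, which is why large flat LSR networks quickly become impractical and hierarchical routing is needed.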
Of course, there are ways to address the scalability issue. In the case of the Internet, LSR is intended for use in a set of networks under one administrative authority (in Internet terminology, an Autonomous System), which typically contains a few hundred switches and possibly several thousand hosts. In some other cases, such as the case of ATM, LSR is intended to support nation-wide, or even global, networks. In such cases, scalability can be achieved only by means of hierarchical routing.

2.3.3 Hierarchical LSR

Hierarchical routing reduces the burden on individual switches by hiding the complexity of the entire network. Different ways of supporting a routing hierarchy with LSR have been developed and deployed [11, 13]. In the Internet, the OSPF protocol defines a two-level LSR hierarchy such that a router sees only the subnetwork to which it belongs and the subnetwork's border routers, that is, routers that connect to the backbone subnetwork [11]. While intra-subnetwork traffic is routed as described in the previous section, cross-subnetwork traffic is routed in three stages: first through the home subnetwork to a border router, from there across the backbone network to reach a border node of the destination subnetwork, and finally through the destination subnetwork. A more general method of hierarchical LSR is described in the ATM PNNI 1.0 standard [13], which allows for arbitrary hierarchy depth. In this method, a physical network is divided into several peer sub-networks, called routing domains. For example, the network shown in Figure 2.13 can be divided as shown in Figure 2.14. This division is performed manually by configuring every switch with a domain ID. After division, each domain runs a separate instantiation of LSR; that is, switches within a domain exchange status information so that each of them maintains a "domain image." Continuing the previous example, the image of domain A.4 is depicted in Figure 2.15.
As shown, a domain image contains not only intra-domain links, but also outgoing ones. An outgoing link, or inter-domain link, is advertised in the domains containing its endpoints. Hence, the link (A.4.1, A.2.3) in Figure 2.13 will be advertised in domain A.4 by switch A.4.1 and in domain A.2 by switch A.2.3.

[Figure 2.13: A network topology. Figure 2.14: Breaking up the network into routing domains. Figure 2.15: The image of the domain A.4. Legend: in-domain switch, intra-domain link, inter-domain link.]

The presence of inter-domain links in the image of a domain enables the domain to see neighboring domains. For a domain to see all the other domains in the network, one must run a copy of inter-domain LSR. To perform LSR among domains, a leader switch is elected within each domain. In ATM PNNI, the nodal states of a switch include two election-related states: leader priority and preferred leader. The former is manually configured by network managers to determine the rank of the switch. The latter is determined as follows: every switch independently searches in its domain image for a reachable switch that has the highest leader priority, and calls the result of the search its preferred leader. As with other LSR states, any change in the preferred-leader state must be flooded immediately. If the preferred leader at a switch is the switch itself, this switch shall, after waiting for a period of time, inspect its local domain image for the preferred leaders of other switches. Only if unanimity is obtained will the candidate switch proclaim victory. For illustration, consider a network where the administrator configures a default leader switch X with leader priority 3 and a backup leader Y with priority 2. The remaining switches are all configured with priority 1.
We assume that initially switch X is the preferred leader of all other switches. Now consider what happens when the established leader X crashes. As described earlier, neighboring switches of X will advertise link-down LSAs for the incident links of X. Using these LSAs, every network switch finds the current leader unreachable, and searches through its local image for a switch with the next highest priority. In this case, the result would be Y, with priority 2. Since every switch changes the value of its preferred-leader state to Y, every switch advertises this change immediately. These advertisements can be considered "ballots," which the switch Y must collect before claiming itself the new leader. Once elected, a leader learns the identities of neighboring leaders, namely the leader switches in neighboring domains, via the LSAs regarding inter-domain links. (Preferred leaders of endpoints are included in such LSAs.) The leader then sets up a VC to connect to each neighboring leader. The inter-domain LSR is performed collectively by domain leaders as follows: each leader uses inter-leader VCs to flood to all the other leaders nodal states that present a simplified representation of its home domain and link states that describe its connectivity to neighboring domains. As such, each leader compiles a simplified view of the entire network. In this view, a node represents a routing domain and a link represents the adjacency of its endpoint domains. For the example of Figure 2.13, the corresponding simplified network image is depicted in Figure 2.16. In ATM PNNI, the division-and-simplification process just described can be applied recursively to build a routing hierarchy of any depth. For example, when the network of Figure 2.13 is connected to an internet, the simplified network view shown in Figure 2.16 constitutes a domain in the internet, and a leader is elected among the domain leaders to represent the entire network in the next routing level.
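The preferred-leader computation and the unanimity check in this example can be sketched as follows. The priorities mirror the example (X has 3, Y has 2, the rest 1); the rule for breaking ties between equal priorities is an assumption of this sketch, not taken from the PNNI standard:

```python
# Manually configured leader priorities, as in the example above.
priorities = {"X": 3, "Y": 2, "A": 1, "B": 1}

def preferred_leader(reachable, prio=priorities):
    """Each switch picks the reachable switch with the highest leader
    priority; the smallest ID breaks ties (our assumption)."""
    return sorted(reachable, key=lambda s: (-prio[s], s))[0]

def elect(images):
    """images: {switch: set of switches reachable in its domain image}.
    Every switch floods its preference (a 'ballot'); a candidate proclaims
    victory only if the ballots are unanimous."""
    ballots = {s: preferred_leader(reach) for s, reach in images.items()}
    winners = set(ballots.values())
    return winners.pop() if len(winners) == 1 else None

everyone = {"X", "Y", "A", "B"}
leader_before = elect({s: everyone for s in everyone})
survivors = {"Y", "A", "B"}            # link-down LSAs remove X from all images
leader_after = elect({s: survivors for s in survivors})
```

Before the crash the unanimous choice is X; once the link-down LSAs make X unreachable in every image, the ballots converge on the backup Y.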
[Figure 2.16: The simplified/high-level network image.]

2.4 Discussion

The main theme of this thesis is to demonstrate and exploit the mutually beneficial relationship between group communication and LSR. Three facets of group communication will be examined in the context of LSR support: use of multicast VCs in LSR flooding, MC construction and maintenance, and leadership consensus. Let us now briefly introduce each of these problems, given the background information that has been presented in this chapter.

First, LSR itself can benefit from group communication techniques, because many aspects of LSR operation exhibit characteristics of group communication. In LSR, switches in a routing domain form a communication group: they broadcast to the group, receive broadcast messages (that is, LSAs) from the group, maintain member lists of the group (which are implicitly included in local domain images), and elect a leader to represent the group in the next routing level. Moreover, such group communication characteristics in LSR are even more obvious in hierarchical LSR networks: at higher routing levels, the LSR tasks of flooding, membership management, and leader election are performed collectively by domain leaders. Since leaders are not necessarily physically adjacent to each other, a flooding operation among leaders forms a true multicast operation in the entire network. In this thesis, we identify an important aspect of LSR that can benefit from group communication: the flooding operation. We note that, while present ATM standards use hardware switching and cell replication to speed up host-level multicast, flooding operations still proceed in a store-and-forward manner as described earlier.
Our first main contribution is to show that flooding operations can make use of the hardware capability of ATM switching fabrics to improve performance, while at the same time guaranteeing delivery to all nodes reachable from an originating node, as in the case of the conventional flooding protocol. In Chapters 3 and 4, we describe a family of switch-aided flooding (SAF) protocols that work in this manner. Second, the construction of MCs can benefit from the complete network information made available by LSR. We have discussed one multicast protocol, the MOSPF protocol, that takes advantage of LSR; it uses LSR to disseminate membership information so that every router has a member list for every active MC. However, the MOSPF protocol is restrictive in supporting different MC topology types, and incurs computational redundancy. As we noted in previous sections, multiparty communication applications need different MC topology types. Further, the rising importance of QoS service is leading to new, sophisticated MC topology computation algorithms, many of which are not supported by existing MC/multicast protocols. This thesis will show that the availability of complete network and MC membership information at switches/routers in LSR-based networks makes it possible to design a "chassis" for MC protocols to accommodate existing and future MC topology computation algorithms. The resultant generic MC (GMC) protocol will be presented in Chapter 5. Third, we consider the problem of leader election. Although leader election is not directly required by all group communication applications, some prominent multicast protocols, such as CBT and PIM, assign a network node as the multicast traffic transit center, or the core node, for the group.
Arguably, the core node of a group must be selected by the network; if the identity of the core is provided by host members, then the host-network interface for multicast depends on the choice of multicast protocol within the network (some multicast protocols require core identities from the interface, while others do not). Further, the introduction of a traffic transit center raises the concern of a single point of failure. The problem of assigning core nodes to groups can be modeled as a leader election problem (the leader of a group undertakes the responsibility of the core node). The fault tolerance of LSR enables the design of robust election protocols, such as the ATM leader election protocol, that handle not only leader failures but also disastrous scenarios, for example, network partitioning. However, the overhead of the current ATM leader election protocol (every group member uses flooding to report its preferred leader) may be prohibitively expensive if used to support multicast groups, because a large number of such groups may exist simultaneously in a network. The design of efficient LSR-based support for the election problem constitutes the third part of this research. Our NLE protocol, presented in Chapter 6, accommodates a membership management mechanism that achieves the following consensus property: a set of mutually reachable group members reach consensus on a leader, which maintains a member list containing exactly those members. The LCM protocol, presented in Chapter 7, uses the NLE protocol to elect a leader switch as the centralized core management server, which manages the core nodes for all active groups within the network. Finally, we come full circle. By combining two group communication techniques developed earlier, namely the election of a leader and the construction of multipoint connections, we develop a totally different approach to LSR.
The resulting Tree-based LSR (T-LSR) protocol is lightweight, imposing only a small fraction of the overhead of previous LSR methods, and robust, guaranteed to survive not only network component failures and partitioning scenarios, but also undetected communication transmission errors. As we discussed earlier, properly handling the latter type of fault is a vital requirement for an LSR protocol. Unlike the ATM-oriented SAF protocols, the T-LSR protocol is designed for use in general-purpose, LSR-based networking environments and requires no special hardware support.

At first glance, the advocacy of group-communication-supported LSR operations and LSR-based group communication introduces a “chicken and egg” dilemma: which one should exist first so as to support the other? Our results show that, with careful design, this circular dependence can be avoided. The SAF and T-LSR protocols demonstrate how a multiparty communication channel can be constructed and used to improve the performance of flooding operations, which advertise the routing information (namely, LSAs) necessary for the construction and maintenance of the channel. On the other hand, the GMC protocol can take advantage of the LSR performance improvements of the T-LSR and SAF methods to enable the use of any topology computation algorithm, and hence provide support for any MC topology type. Moreover, the NLE protocol, which itself is LSR-based, finds applications in both the internal operations of LSR (such as hierarchical routing) and the support of multiparty communication applications (for instance, the management of multicast cores used by such applications). These results demonstrate the mutually beneficial relationship between LSR and group communication.

Chapter 3

Switch-Aided Flooding

In this chapter, we demonstrate an example to support the claim that some aspects of LSR operation can benefit from group communication.
Specifically, we propose a flooding method, called Switch-Aided Flooding (SAF), for use in ATM networks. SAF-based protocols take advantage of hardware-supported cell relay and cell duplication, characteristic of such networks, in order to reduce the time needed to disseminate changes in network topology and resource availability. SAF protocols use a spanning multipoint connection (SMC), which is a hardware-switched network spanning tree, but revert to conventional link-by-link flooding when the spanning MC is unavailable or under construction. Two flooding protocols based on this methodology, as well as an accompanying protocol to construct and maintain the SMC, are described in this chapter; a third SAF protocol is described in Chapter 4. The results of a simulation study reveal that the proposed flooding protocols deliver network updates several times faster than conventional approaches. Further, the bandwidth consumed by a flooding operation is also significantly reduced.

3.1 Motivation

As described in Chapter 2, ATM is a connection-oriented communication technology that relays small fixed-size cells in hardware. Many ATM switching fabrics support hardware cell duplication, whereby an incoming cell can be forwarded via multiple output ports. Although current ATM standards use this feature to support only multicast (or one-to-many) VCs, such switch functionality enables the construction of a more generic form of group communication channel, namely, many-to-many VCs, or multipoint connections (MCs). An example of an MC is depicted in Figure 3.1(a), where a set of eight switches is interconnected with a tree topology. The responsibility of each member switch is to forward cells arriving on one link of the tree to all the other tree links that are incident to that switch. As illustrated in Figure 3.1(b), cells arriving on any of the four links incident to the switch x are forwarded on the remaining three incident links.
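The per-switch forwarding rule can be condensed into a few lines of Python. This is an illustrative model only, not an ATM API; the function name and data layout are ours:

```python
# Sketch of MC cell forwarding at a member switch: a cell arriving on
# one incident tree link is duplicated onto every other incident tree
# link (cf. Figure 3.1(b)). Names here are hypothetical.

def forward_cell(tree_links, arrival_link, cell):
    """Return the list of (link, cell) transmissions for one incoming cell."""
    return [(link, cell) for link in tree_links if link != arrival_link]

# Switch x in Figure 3.1(b) has four incident tree links; a cell
# arriving on link 0 is duplicated onto the remaining three.
out = forward_cell([0, 1, 2, 3], 0, "cell")
# out == [(1, "cell"), (2, "cell"), (3, "cell")]
```

In hardware, this duplication is performed by the switching fabric itself, which is precisely what the SAF protocols exploit.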
Hardware-supported MCs facilitate multiparty communication applications, such as multimedia teleconferencing, distributed virtual reality, tele-gaming, and computer-supported cooperative work. MCs used in such applications typically involve only a small subset of the network switches. A special type of MC is the spanning MC (SMC), which includes as its members all switches in a network. A spanning MC of the network of Figure 3.1(a) is depicted in Figure 3.1(c). Since every message transmitted on an SMC is received by all switches, the SMC can be considered a virtual broadcast medium of the network.

Although hardware switching and cell duplication may greatly improve the communication performance observed by end hosts and their applications, the signaling activities within ATM networks, as defined in the UNI 3.1 [30] and PNNI [13] standards, proceed largely in a connectionless manner. Since signaling must take place prior to the existence of the corresponding VCs [30], VC-setup request messages are forwarded and processed in a hop-by-hop manner. Switches along the route of the VC under construction invoke signaling modules to perform functions related to the requested VC, such as routing and call admission control. In addition, the ATM PNNI standard specifies the use of the flooding protocol described in Chapter 2, which was originally designed for the ARPANET, a connectionless point-to-point network. Not surprisingly, the protocol proceeds in a hop-by-hop manner, and does not take advantage of the hardware capabilities of ATM switching fabrics.

We model the ATM flooding operation as a group communication problem, where an LSA is considered a multicast message delivered to a group comprising all
switches in the network.

Figure 3.1: Examples of multipoint connections: (a) an 8-node MC; (b) cell forwarding at switch x; (c) a spanning MC of the network.

The proposed SAF method uses a common group communication topology, the tree topology, to facilitate the dissemination of LSAs. Specifically, the SAF method constructs a spanning MC, which is used as a “broadcast medium” for distributing LSAs. The use of an SMC improves the performance of flooding operations by taking advantage of both hardware cell relaying and cell replication. However, such an approach must address the challenge of retaining the robustness of the conventional flooding method; that is, an LSA must reach all switches reachable from the source of the LSA. The main contribution of this chapter is to develop and evaluate two SAF-based flooding protocols, called the Basic SAF and bandwidth-efficient (BE) SAF protocols, that satisfy these criteria. In addition, an efficient protocol for the construction and maintenance of spanning MCs is presented. The results of a simulation study reveal that these two SAF-based flooding protocols can distribute messages to network switches several times faster than the conventional flooding algorithm. In the next chapter, we will develop an even more efficient SAF protocol by using a second group communication topology, the ring topology, to implement reliability.

A robust and efficient flooding protocol can lead to better routing decisions by reducing the reaction time to faulty network components and congested areas. This in turn reduces the probability of call blocking.
Furthermore, general-purpose, LSR-based MC protocols, such as the MOSPF protocol [6] and the GMC protocol (discussed in Chapter 5), must disseminate group membership and/or MC topology advertisements, and therefore can also benefit from efficient flooding protocols.

The remainder of this chapter is organized as follows. A protocol that constructs and maintains a network-wide spanning MC is presented in Section 3.2. In Section 3.3, two SAF protocols are presented. The Basic SAF protocol extends the conventional flooding algorithm to incorporate the use of an SMC. The BE SAF protocol further addresses the issue of bandwidth consumption in flooding operations. The performance of these two protocols is investigated through a simulation study, the results of which are presented in Section 3.4. A summary of this work is presented in Section 3.5.

3.2 The Spanning MC Protocol

The SMC protocol constructs and maintains an SMC for use in the SAF protocols. The protocol is a variation of the CBT protocol [3, 4], a general MC protocol in which the topology of the MC is the union of the shortest paths from the members to a specific node, called the core (see Figure 3.2). The SMC protocol differs from the CBT protocol in the way that the core node of the MC is determined. In the CBT protocol, the core node is static and is determined by an “outside” mechanism (for example, by network management procedures). In the SMC protocol, the core node is dynamic for reasons of robustness, since the SMC protocol must survive extensive network changes, including failure of the core node itself.

In the SMC protocol, the core node selection problem is modeled as a leader election problem under LSR.

Figure 3.2: An example MC built by the CBT protocol: (a) the member-to-core shortest paths; (b) the resultant MC topology.

In this approach, every switch x uses the same core node
selection algorithm to independently identify a new core of the SMC. The choice of switch x will be referred to as c_x, and the computation will be denoted by C(G, x), where G is the network image at x. For now, we use a function C(G, x) that simply sets c_x to the preferred leader at switch x; that is, we use the domain leader switch elected by the ATM PNNI as the core node of the SMC. The generalization that allows the use of any core selection algorithm C(G, x) can be achieved by using our Network-level Leader Election (NLE) protocol, which is discussed in Chapter 6. Discussion and evaluation of a variety of core selection heuristics can be found in [46, 47].

After selecting the core node locally, each switch tries to establish a connection to its choice of core node. For a switch x to reach its core selection c_x, the switch sends a reach_core request one hop towards the core, according to an x-to-c_x shortest path computed locally. The receiving switch grants the request after it has successfully reached the core itself. Using the network shown in Figure 3.1 as an example, the process of SMC construction is illustrated in Figure 3.3. Let us assume that all nodes initially select, as the core, the darkened node in Figure 3.3(a); this figure also shows the direction in which reach_core requests are sent. The core node immediately grants the reach_core requests from its neighboring switches, which subsequently approve reach_core requests from downstream switches. In this way, SMC links are granted and established in a “radiating” manner; see Figures 3.3(b) to 3.3(f).

Figure 3.3: An example of the SMC protocol.

Under the SMC protocol, each switch x in the network G executes a set of constituent protocol modules and maintains the following data structures: a local network image G_x, a core selection c_x, and an x-to-c_x path P_x.
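The “radiating” grant process described above can be modeled in a few lines. This is a toy sketch under simplifying assumptions (every switch has already agreed on the core and computed its next hop toward it); the function and variable names are hypothetical, not part of the protocol specification:

```python
# Toy model of radiating SMC construction: a reach_core request is
# granted only by a neighbor that has already reached the core itself,
# so links are established outward from the core, round by round.

def build_smc(next_hop, core):
    """next_hop[x] = first hop on x's shortest path to the core.
    Returns the set of established SMC links (as frozenset edges)."""
    reached = {core}
    links = set()
    pending = set(next_hop) - reached
    while pending:
        # Requests granted this round: requesters whose next hop
        # has itself already reached the core.
        granted = {x for x in pending if next_hop[x] in reached}
        if not granted:
            break  # remaining switches cannot currently reach the core
        for x in granted:
            links.add(frozenset((x, next_hop[x])))
        reached |= granted
        pending -= granted
    return links

# Chain 0-1-2-3 with core 0: links are granted in three radiating rounds.
links = build_smc({1: 0, 2: 1, 3: 2}, core=0)
# links == {frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3})}
```

The round-by-round loop mirrors Figures 3.3(b) through 3.3(f); in the actual protocol the rounds emerge from asynchronous request/reply exchanges rather than a central loop.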
(In the following, we may omit the subscripts when they are clear from context.) Whenever a data structure must be accessed by concurrent protocol modules, access to the data is assumed to be atomic, in order to avoid race conditions among protocol entities. Critical regions and semaphores are well-known techniques for achieving atomic access.

SMC protocol operation is triggered by the receipt of an event LSA (link-down, link-up, and so on). Periodic LSAs are ignored by the SMC protocol so that the protocol, and hence reconstruction/reorganization of the SMC, will not run unnecessarily. As shown in Figure 3.4, upon receiving an event LSA, the SMC protocol at switch x updates the local image G_x of the network. The protocol then decides whether it has to re-connect to the core node because 1) its core selection changes, or 2) the LSA ℓ reports a failed link that is used in P_x. When it is necessary to re-connect to the core, the switch x tears down the present MC link that leads to the core node and initiates an attempt to reach the core node by signaling another protocol entity, the ReachCore module. We emphasize that the maintenance of the network image G_x is included in the SMC algorithms for the purpose of self-contained discussion; in real-world contexts, G_x is most likely maintained by the underlying LSR protocol.

Algorithm: Process-Event-LSA.
Input: switch ID x, received LSA ℓ.
  Update G according to ℓ.
  IF (c_x ≠ C(G, x)) or (LinkDown(ℓ) = TRUE and Link(ℓ) in P_x)
    Let y be the next hop to c_x in P_x.
    Disconnect the tree link (x, y).
    c_x = C(G, x).
    Wake up the ReachCore module if it is sleeping.
  ENDIF

Figure 3.4: The handling of event LSAs.

The ReachCore module at switch x is started after the initialization of x, and loops indefinitely. This module is responsible for setting up an SMC link that will lead to c_x.
For this purpose, the module sends a reach_core(c_x) message one step towards the core, and continues to do so until a positive reply is received from the appropriate neighbor, indicating that the request has been granted and the desired link established. We note that the value of c_x may change during this period, because the Process-Event-LSA module may update the value upon receiving new event LSAs. After obtaining a positive reply, the ReachCore module records the new to-core path, P_x, and suspends itself.

Algorithm: ReachCore.
Input: switch ID x.
  LOOP forever
    IF (c_x ≠ x)
      LOOP /* note: c_x may have been changed by Process-Event-LSA */
        Let y be the next stop to reach c_x.
        Send a reach_core(c_x) message to y.
        Wait for a reply.
      UNTIL (a reply reached_via(P) is received).
      P_x = P.
    ENDIF
    Sleep.
  ENDLOOP

Figure 3.5: The ReachCore module.

The routine that processes a reach_core message is shown in Figure 3.6. The receipt of such a request from switch y by switch x indicates that the switch x is the first intermediate node on the path from y to c_y. The switch x grants the request if 1) it agrees with y upon the choice of core node, and 2) it has itself reached the core (this can be determined by whether the ReachCore module at x is suspended). When the request is granted, the switch x establishes the (x, y) MC link and returns a positive reply to y, which includes the y-to-c_y path used by the SMC. (The establishment of an MC link involves the setup/modification of hardware switching table entries to implement the type of cell forwarding depicted in Figure 3.1(b).) Otherwise, a negative reply is returned.

Algorithm: Process-Reach_Core.
Input: switch ID x and a reach_core(c) request from switch y.
  IF (c = c_x) and (x has reached the core node c_x)
    Set up the (x, y) MC link.
    Return a positive reply, reached_via(P_x + (x, y)), to y.
  ELSE
    Return a negative reply to y.
  ENDIF

Figure 3.6: The processing of the reach_core request message.

Cell Demultiplexing.
Because a spanning MC is effectively a broadcast medium that allows the interleaving of messages, every switch in the network can broadcast messages to, and receive messages from, all other switches. However, cells belonging to simultaneous broadcast messages can be interleaved with one another at intermediate switches. Receiving switches must be able to demultiplex these messages according to their sources. Various methods can be used to solve this problem. For example, part of the cell payload can be used to label the sources of cells. Alternatively, spanning MCs can be switched by the virtual path identifier (VPI). In ATM networks, every cell is tagged with a pair of identifiers, VPI and VCI. When the VPI of a VC is used in switching, the VCI of the cells belonging to the VC is ignored (but remains intact during transmission). In this approach, the SMC used by the SAF protocol must be constructed in such a way that the VPI is in effect throughout the MC, and as such, the VCI field can be used to identify the source switch of cells. We emphasize that the SMC protocol, and the SAF protocols as well, work with any demultiplexing scheme.

3.3 The SAF Protocols

An SAF protocol is an extension of the conventional flooding protocol. In addition to the set of point-to-point links in a network, SAF protocols presume the existence of an SMC to which all the switches in the network have access. In this section, we present two protocols designed in this manner; they differ in their implementations of reliability.

3.3.1 Basic SAF Protocol

This protocol works as follows. The source of an LSA first broadcasts the LSA on the SMC and subsequently sends the LSA via all its incident links. If a switch receives the LSA for the first time via the SMC, then it forwards the LSA on all its incident links. On the other hand, if the switch receives the LSA via a point-to-point link, then it forwards the LSA on all incident links except the one on which the LSA arrived.
As in the case of conventional flooding, switches silently drop LSAs that have been seen previously. To illustrate, the flooding example of Figure 2.11 is repeated in Figure 3.7, but this time using the Basic SAF protocol. As we can see in the figure, the operation now requires only two communication steps. In the first step, the source switch broadcasts the LSA, which is switched and duplicated in hardware on the SMC. In this manner, the constituent cells are pipelined throughout the network. Provided that the other switches receive the LSA in the first step, they exchange this LSA via point-to-point links in the second step; since every node has already seen the LSA via the SMC broadcast, all the point-to-point copies are dropped.

Figure 3.7: An example of the Basic SAF protocol: (a) step 1: hardware-switched broadcast; (b) step 2: point-to-point forwarding.

The Basic SAF protocol uses the SMC as a shortcut for LSA dissemination, but does not rely on this shortcut. In normal cases, such as the one in Figure 3.7, switches receive LSAs immediately via the SMC. However, in situations where one or more links used in the SMC are malfunctioning, or the SMC itself is under construction, or cell losses occur on the SMC, the link-by-link forwarding guarantees that the LSA reaches all nodes. Shown in Figure 3.8 is an example of how the Basic SAF protocol operates when the SMC is faulty. In this example, a link that is used in the SMC fails during a flooding operation, and the broadcast of the LSA cannot reach all switches (Figures 3.8(a) and 3.8(b)). As shown in Figures 3.8(c) and 3.8(d), the remaining switches are reached via link-by-link forwarding. In extreme cases where the SMC does not exist at all (for example, when the network is re-initialized), the Basic SAF protocol degenerates to the conventional flooding protocol.

Figure 3.8: The Basic SAF protocol with a broken SMC: (a) the broken SMC; (b) step 1: (partially failed) broadcast; (c) step 2: link-by-link forwarding; (d) step 3: link-by-link forwarding.

The Basic SAF achieves its efficiency at the price of additional bandwidth consumption.
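The forwarding decision at a receiving switch can be sketched as follows. This is an illustrative model, not the protocol's actual data structures; the function name and the LSA-identifier representation are ours:

```python
# Sketch of the Basic SAF receive rule: duplicates are silently dropped;
# an LSA first seen via the SMC is forwarded on all incident links,
# while one first seen via a point-to-point link is forwarded on all
# incident links except the arrival link.

def basic_saf_receive(seen, lsa_id, via_smc, arrival_port, ports):
    """Return the point-to-point ports on which to forward the LSA.
    `seen` is the set of LSA identifiers already processed (updated in place)."""
    if lsa_id in seen:
        return []                      # silently drop duplicates
    seen.add(lsa_id)
    if via_smc:
        return list(ports)             # first seen via the SMC
    return [p for p in ports if p != arrival_port]

# An LSA (source "s", sequence 7) arrives first via the SMC, then again
# via point-to-point link 2; the second copy is dropped.
seen = set()
basic_saf_receive(seen, ("s", 7), True, None, [1, 2, 3])    # -> [1, 2, 3]
basic_saf_receive(seen, ("s", 7), False, 2, [1, 2, 3])      # -> []
```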
Here we compare the bandwidth used by the conventional flooding protocol against that of the Basic SAF protocol. In the conventional flooding protocol, the source of an LSA sends the LSA on all its incident links, and every other node forwards the LSA on all but one of its incident links. Consider a network G = (V, E), where V is the set of switches and E the set of point-to-point links. The number of links traversed by conventional flooding is

    B_c = 1 + Σ_{v ∈ V} (Deg(v) − 1) = 1 + Deg(G) − N,

where N = |V| and Deg(G) is the sum of node degrees in G. On the other hand, the Basic SAF protocol requires

    B_basic = (N − 1) + Σ_{v ∈ V} Deg(v)

link traversals, where the first term (N − 1) is the number of links used by the broadcast on the SMC, and the second represents the forwarding of the LSA on point-to-point links.

To further clarify the relationship between B_c and B_basic, let AvgDeg denote the average node degree of G. The bandwidth consumptions of the two flooding protocols can be rewritten as

    B_c = AvgDeg × N − N + 1 ≈ (AvgDeg × N) − N, and
    B_basic = AvgDeg × N + N − 1 ≈ (AvgDeg × N) + N.

Therefore, the B_basic to B_c ratio can be approximated by

    B_basic / B_c ≈ (AvgDeg + 1) / (AvgDeg − 1).

If the network has a small average node degree, then the Basic SAF protocol may consume significantly more bandwidth than does the conventional flooding protocol; for example, with AvgDeg = 3 the ratio is approximately 2. However, due to the simplicity of this protocol and its advantage in flooding time, the Basic SAF protocol may be attractive under a variety of conditions (see Section 3.4).

3.3.2 Bandwidth-Efficient SAF Protocol

The Basic SAF protocol can be modified to reduce bandwidth consumption by introducing the concept of dummy forwarding.
In this approach, switches receiving an LSA via the SMC forward a “dummy” of the LSA, containing only the source address and sequence number, to neighboring switches. Switches that have finished the task of dummy forwarding also expect to see responses (either the real LSA or its dummy) from all neighboring switches. After waiting for a predetermined period of time, such switches forward the real LSA to neighboring switches that failed to respond. Switches receiving the LSA via point-to-point links forward the real LSA on all incident links except that of arrival, and expect nothing from neighbors.

Again we use examples to illustrate. The operation of the BE SAF protocol with a fully operational SMC is depicted in Figure 3.9. As with the Basic SAF protocol, the BE SAF protocol in this setting requires only two communication steps. In the first step, the source switch broadcasts the LSA, which is switched and duplicated in hardware on the SMC. In the second step, however, switches exchange with neighboring switches dummies of the LSA, rather than the real one.

Figure 3.9: An example of the BE SAF protocol: (a) step 1: hardware-switched broadcast; (b) step 2: point-to-point “dummy” forwarding.

Figure 3.10 illustrates the operation of the BE SAF protocol with a broken SMC. After the BE SAF protocol uses the SMC to reach as many nodes as possible (Figure 3.10(b)), switches that received the LSA via the SMC forward dummies, and in the meantime expect their neighboring switches to do the same. In this example, three nodes, namely X, Y, and Z, do not see all the expected dummies from neighboring switches; see the unidirectional dummy forwardings in Figure 3.10(c). The three nodes, after a predetermined timeout period, start forwarding the real LSA to their “silent” neighbors, as depicted in Figure 3.10(d).
Switches that receive the LSA via link-by-link forwarding further forward the LSA on all incident links except the incoming one, as shown in Figure 3.10(e).

Figure 3.10: The BE SAF protocol with a broken SMC: (a) the broken SMC; (b) step 1: (partially failed) broadcast; (c) step 2: dummy forwarding; (d) step 3: real forwarding after timeout; (e) step 4: link-by-link forwarding triggered by real forwarding.

In the discussion of the BE SAF algorithms, we denote by K_x the number of switches that are neighbors of switch x. Let us assume that the neighbors of the switch x can be reached via ports numbered 1 to K_x, and that the SMC is attached to port 0. (If necessary, LSAs received on the SMC can be identified by the VPI value of the MC.) A switch maintains three data structures: Seq[i], the sequence number of the current LSA from switch i, 1 ≤ i ≤ N; Received[i], a boolean flag indicating whether the Seq[i]-th LSA from switch i has been received; and F[i][p], a boolean flag indicating whether the switch has received via port p either the Seq[i]-th LSA from switch i or the corresponding dummy, for 1 ≤ i ≤ N and 1 ≤ p ≤ K_x. Let us denote the sequence number of an LSA ℓ by Seq(ℓ) and the address of its source switch by Source(ℓ).

The source of an LSA invokes the routine BE_SAF_Source, which is shown in Figure 3.11. Parameters to the routine include the ID x of the invoking switch and an LSA ℓ to be flooded. The switch x updates the sequence number of its current LSA to that of ℓ and clears the relevant F flags to indicate that it has not received anything about ℓ from its neighbors. The switch then broadcasts ℓ over the SMC, forwards the dummy of ℓ to all neighboring switches, and sets up a timer to await responses (for ℓ or its dummies) from its neighbors.

Algorithm: BE_SAF_Source.
Input: the switch ID x, and an LSA ℓ.
  Seq[x] = Seq(ℓ).
  Received[x] = TRUE.
  F[x][p] = FALSE, for all 1 ≤ p ≤ K.
  Transmit ℓ over the SMC.
  Forward a dummy of ℓ to all neighboring switches.
  Set up a timer(ℓ).

Figure 3.11: The sender algorithm of the BE SAF protocol.
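The essence of dummy forwarding — a dummy carries only the LSA's identity, and after the timeout the real LSA goes only to neighbors that stayed silent — can be condensed into a small sketch. The names below are hypothetical; the pseudocode in Figures 3.11 through 3.14 remains the authoritative description:

```python
# Sketch of the two key BE SAF ideas (illustrative names only).

def make_dummy(lsa):
    """A dummy carries only the source address and sequence number,
    so it is far smaller than the real LSA."""
    return (lsa["source"], lsa["seq"])

def timeout_targets(ports, responded):
    """Ports on which neither the real LSA nor its dummy arrived before
    the timer fired; the real LSA is forwarded on exactly these ports
    (cf. the F flags checked by the timeout handler)."""
    return [p for p in ports if not responded.get(p, False)]

# A switch with four neighbor ports heard responses on ports 1 and 3
# only; after the timeout it forwards the real LSA on ports 2 and 4.
make_dummy({"source": "x", "seq": 7})          # -> ("x", 7)
timeout_targets([1, 2, 3, 4], {1: True, 3: True})   # -> [2, 4]
```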
Switches that receive an LSA ℓ invoke the routine BE_SAF_Receive, which is shown in Figure 3.12. The routine first decides whether it is dealing with a new LSA by checking ℓ's sequence number against the current sequence number recorded locally. If a new LSA is observed, the corresponding F flags are cleared to indicate that nothing has yet been learned about this LSA from neighbors. The switch then checks whether the LSA arrived via the SMC or a point-to-point link. If the LSA arrived on the SMC, then its dummy is forwarded to all neighboring switches. Otherwise, the LSA itself is forwarded on all point-to-point links except the one on which it arrived. In the cases where dummies are forwarded, a timer is set up to make sure that the switch x hears responses from its neighboring switches; the timeout handler is discussed later.

When a switch x receives the dummy of an LSA ℓ, it invokes the BE_SAF_Receive_Dummy routine, shown in Figure 3.13. The receipt of a dummy from a neighboring switch y assures x that y has already received the LSA and, therefore, that LSA forwarding to y is unnecessary. This situation is recorded in the corresponding F flags. As in the case of the BE_SAF_Receive routine, a check is made to determine whether this is (the dummy of) a new LSA. If so, then the corresponding Seq entry is updated and the relevant F flags are reset, as in the previous routine.

A switch x that receives an LSA ℓ from the SMC forwards the dummy, rather than the real LSA, to its neighboring switches. It also expects responses (ℓ itself or its dummy) from its neighbors. Lack of a response from a neighboring switch results in the forwarding of the real ℓ to that switch. The timeout handler shown

Algorithm: BE_SAF_Receive.
Input: the switch ID x, and an LSA ℓ received from port p.
  y = Source(ℓ).
  IF (Seq(ℓ) > Seq[y])
    Seq[y] = Seq(ℓ), and Received[y] = FALSE.
    F[y][q] = FALSE, for all 1 ≤ q ≤ K.
  ENDIF
  IF (Received[y] = TRUE)
    Do nothing.
    /* drop LSAs that have been seen before */
  ELSE /* this is the first time this LSA has been received */
    Received[y] = TRUE.
    IF (p = 0) /* received from the SMC */
      Forward a dummy of ℓ to all neighboring switches.
      Set up a timer(ℓ).
    ELSE /* received from a point-to-point link */
      Forward ℓ on all point-to-point ports, except p.
    ENDIF
  ENDIF

Figure 3.12: The receive-LSA routine in the BE SAF protocol.

Algorithm: BE_SAF_Receive_Dummy.
Input: the switch ID x, and a dummy d received via port p.
  y = Source(d).
  IF (Seq(d) > Seq[y])
    Seq[y] = Seq(d), and Received[y] = FALSE.
    F[y][q] = FALSE, for all 1 ≤ q ≤ K.
  ENDIF
  F[y][p] = TRUE.

Figure 3.13: The receive-dummy routine in the BE SAF protocol.

in Figure 3.14 checks the F flags to decide, for each neighboring switch individually, whether the forwarding of the real ℓ is necessary.

Algorithm: BE_SAF_Timeout_Handler.
Input: the switch ID x and a timer(ℓ).
  IF (Seq(ℓ) = Seq[Source(ℓ)])
    /* This timer is for the current LSA of the source of ℓ.
       Out-of-date timers are ignored. */
    FOR (port number p = 1 to K) DO
      IF (F[Source(ℓ)][p] = FALSE) forward ℓ on port p. ENDIF
    ENDFOR
  ENDIF

Figure 3.14: The timeout handler in the BE SAF protocol.

3.4 Performance Evaluation

The performance of the three alternative flooding methods (conventional flooding, Basic SAF, and BE SAF) is studied through simulation. The simulator is based on the CSIM package [48]. We are interested in both temporal and bandwidth metrics. Given a flooding operation and a switch x, the temporal metrics of the flooding operation include the time for x to receive an LSA, called the receipt time of x, and the time for a flooding operation to complete at x, called the completion time of x. (Completion time includes the handling of duplicate LSAs and dummies.) In this study, we measured average receipt/completion times over all network switches. Confidence intervals were computed, but for most cases they are very small and, for clarity, are not shown in the plots.
The bandwidth consumption of a flooding operation can be measured by the number of links traversed by the LSA and, in the case of the BE SAF protocol, by its dummy. The former number is denoted by B_f^l and the latter by B_f^d. Given the length l_a of an LSA and the length l_d of its dummy, the bandwidth consumed by a flooding operation is B_f^l × l_a + B_f^d × l_d, where f ∈ {conventional, Basic SAF, BE SAF} is the flooding method. In this study, we obtained the B_f^l and B_f^d values through simulation runs, and we used the Fore SPANS NNI specification, in which an n link-description LSA comprises 4 + 28 × n bytes [49], to determine the l_a and l_d values.

Networks comprising up to 256 switches were simulated; 20 graphs were generated randomly for each network size. Table 3.1 shows the characteristics of the graphs generated. In the table, a parenthesized entry represents the (minimum, average, maximum) triple of the corresponding metric. For example, in the case of the 20 4-node graphs, the minimum degree of a node, across all the graphs, was 1.0; the average minimum degree among the graphs was 1.35; and the largest minimum degree among the graphs was 2.0.

Size | min degree   | max degree      | avg degree | diameter
   4 | (1, 1.35, 2) | (2, 2.70, 3)    | 1.98       | (2, 2.200, 3)
   8 | (1, 1.10, 2) | (3, 3.95, 5)    | 2.46       | (3, 3.850, 6)
  16 | (1, 1.00, 1) | (4, 5.25, 7)    | 2.83       | (4, 5.550, 9)
  32 | (1, 1.00, 1) | (5, 7.05, 9)    | 3.34       | (5, 6.700, 12)
  64 | (1, 1.10, 2) | (6, 9.40, 12)   | 4.42       | (4, 6.150, 11)
 128 | (1, 1.50, 3) | (9, 13.95, 19)  | 6.75       | (4, 5.400, 9)
 256 | (1, 3.60, 7) | (11, 21.30, 29) | 11.14      | (4, 4.800, 8)

Table 3.1: Characteristics of the randomly generated graphs.

Each communication operation, such as message forwarding, incurs ATM protocol overhead. We measured these overheads on the ATM testbed in our laboratory. The testbed comprises Sun SPARC-10 workstations equipped with Fore SBA-200 adapters and connected by three Fore ASX-100 switches. From these measurements, we obtained the figure 600 µsec, which includes the overhead at both the sending and receiving switches.
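The bandwidth accounting above can be sketched numerically. Note that the 48-byte cell payload and the 12-byte dummy size below are our assumptions for illustration (the thesis derives l_a and l_d from the SPANS NNI specification, and AAL framing overhead is ignored); only the 4 + 28n LSA size is taken from the text:

```python
import math

CELL_PAYLOAD = 48   # assumed usable bytes per ATM cell; AAL framing ignored

def lsa_bytes(n_links):
    """Fore SPANS NNI: an LSA with n link descriptions is 4 + 28*n bytes [49]."""
    return 4 + 28 * n_links

def cells(nbytes):
    """Number of cells needed to carry nbytes of payload."""
    return math.ceil(nbytes / CELL_PAYLOAD)

def flooding_cost(b_lsa, b_dummy, n_links=10, dummy_bytes=12):
    """Cells sent by one flooding operation: B^l * l_a + B^d * l_d,
    with both lengths expressed in cells. dummy_bytes is an assumed
    size for a (source, sequence number) dummy."""
    return b_lsa * cells(lsa_bytes(n_links)) + b_dummy * cells(dummy_bytes)

# With 10 link descriptions, an LSA is 284 bytes, i.e., 6 cells, while a
# dummy fits in a single cell -- the source of the BE SAF savings.
lsa_bytes(10)            # -> 284
cells(lsa_bytes(10))     # -> 6
```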
The final simulation parameter is the duration of the timer used by the BE SAF protocol in awaiting responses from neighboring switches. In the simulations, we set the timer according to the average degree of the given network graph G. Specifically, we set the timer length to AvgDeg(G) × a, where a is the time needed to forward an LSA via a point-to-point link.

Experiment 1: Ideal cases. By ideal, we refer to situations in which the SMC is completely operational during flooding operations. Such is the case for periodic flooding during “normal” periods, when all network components function properly, and during the flooding of event LSAs when none of the faulty network components affect the operation of the SMC. The simulation results pertaining to this setting are plotted in Figure 3.15. As shown in Figures 3.15(a) and (b), the two SAF protocols offer a significant advantage in both LSA receipt time and flooding completion time, due to their use of the SMC as a “short cut” broadcast medium.

Figure 3.15: Comparisons of flooding alternatives with a correctly functioning SMC: (a) average receipt time (µsec); (b) average completion time (µsec); (c) number of links traversed; (d) bandwidth consumption.
Figure 3.15(c) plots the number of links traversed by a real LSA or its dummy for the three protocols. As predicted by earlier analysis, the Basic SAF protocol consumes more bandwidth than does the conventional flooding algorithm; neither method uses dummies, but the latter has smaller B_a values. With the use of dummy forwarding, the B_a values of the BE SAF protocol are significantly less than those of the conventional protocol, especially when the network is large. The actual bandwidth savings of the BE SAF protocol depend on the ℓ_a to ℓ_d ratio. Figure 3.15(d) plots the number of cells per link incurred by the three alternatives, assuming the use of Fore SPANS NNI and the inclusion of 10 link-state descriptions in the LSA.² Somewhat surprisingly, the BE SAF protocol consumes more bandwidth than do the other alternatives when the network is small (for example, containing fewer than 8 switches). A closer examination of simulation runs reveals that this phenomenon is due to the small average degrees in small networks, leading to premature timer firings, followed by unnecessary LSA forwardings. This problem can be fixed by introducing longer timeout periods for small networks. For larger networks, the bandwidth savings of the BE SAF protocol are substantial.

Experiment 2: Partitioned spanning MC. Next, we investigate the performance of the flooding algorithms when a (random) link used in the SMC fails, partitioning the MC into two segments. In this case, the two SAF protocols still use the MC to reach as many nodes as possible, and resort to link-by-link forwarding to reach the remaining nodes. The results are presented in Figure 3.16. As expected, the average receipt times for the two SAF protocols are larger than those of the previous experiment. However, they are still significantly smaller than those of the conventional flooding protocol.
Provided that the component failure rate of the network is low, the occurrences of simultaneous failures should be rare. We suspect that the results of this experiment, which represent single-failure situations, combined with those of the previous experiment, which represent zero-failure situations, cover a very large fraction of network flooding situations. Interestingly, the bandwidth consumption problem with the Basic SAF protocol diminishes with a partitioned SMC, as shown in Figure 3.16(d). This is because LSA broadcasts do not traverse all links of the partitioned SMC. On the other hand, the BE SAF protocol in this situation must forward real LSAs, rather than dummies, in order to reach switches that are not covered by the SMC. Hence, the protocol consumes more bandwidth than it does in the previous experiment.

²An LSA must account for the links to neighboring switches as well as the links to hosts. Considering that most popular ATM switches can accommodate at least 16 ports, and some even 96 ports, we believe that 10 incident links per switch may be a relatively conservative representative figure.

[Figure 3.16: Comparisons of flooding alternatives with partitioned SMC. Panels: (a) average receipt time (μsec); (b) average completion time (μsec); (c) number of links traversed, distinguishing real and dummy LSAs for BE SAF; (d) bandwidth consumption.]
However, its bandwidth consumption is still lower than that of the other alternatives.

Experiment 3: Non-existent spanning MC. Let us now consider the worst-case setting for the SAF protocols: when the SMC does not exist at all. This situation may happen after network re-initialization and prior to reconstruction of the SMC, and can also be considered as a worst-case situation with respect to multiple link failures that partition the SMC. As we can see in Figure 3.17, the conventional flooding protocol outperforms the two SAF protocols in receipt time and completion time. The time differences between the conventional flooding protocol and the Basic SAF protocol are marginal, but those between the conventional flooding protocol and the BE SAF protocol are much more significant. The BE SAF protocol suffers in this experiment because each switch has to perform two rounds of forwarding to neighbors, one for dummies and one for real LSAs. In this experiment, the three flooding alternatives consume essentially the same amount of network bandwidth, because the two SAF protocols, like the conventional flooding protocol, use only link-by-link forwarding when the SMC does not exist.
[Figure 3.17: Comparisons of flooding alternatives when SMC does not exist. Panels: (a) average receipt time (μsec); (b) average completion time (μsec); (c) number of links traversed, distinguishing real and dummy LSAs for BE SAF; (d) bandwidth consumption.]

Experiment 4: Performance of the SMC protocol. We also studied the performance of the SMC protocol under two scenarios: reorganization of an existing SMC when an SMC link fails, and construction of a new SMC. Corresponding simulation results, along with confidence intervals, are plotted in Figure 3.18. The results show that a partitioned SMC requires less than 2.5 milliseconds to reorganize, while constructing an SMC from scratch (a relatively rare event) requires less than 12 ms.

[Figure 3.18: Performance of the SMC protocol. Panels: (a) time to reorganize a partitioned SMC (μsec); (b) time to construct a new SMC (μsec); each panel plots the mean with the upper and lower bounds of the 95% confidence interval.]

Interpretation of results. Among the three flooding alternatives, the BE SAF protocol experiences the most variation in performance across the three experiments. The ideal setting for the protocol occurs when there are no event LSAs and network switches simply flood status information periodically. The LSAs in this setting tend to be long because periodic flooding must include descriptions for all incident links, including those connected to hosts. In Experiment 1, the BE SAF protocol is fast and consumes significantly less bandwidth than do the other two alternatives. Long LSAs also favor the protocol, as they increase the effectiveness of dummy forwarding.
Besides the bandwidth benefit, the use of dummy forwarding might also reduce ATM protocol overhead during link-by-link forwarding; researchers have reported that the ATM protocol overhead of one-cell packets (such as LSA dummies) can be dramatically reduced if these packets are treated as a special case [50]. We conclude that the BE SAF protocol is the best choice for periodic flooding during normal network operation. On the other hand, the worst-case behavior of the BE SAF protocol is the worst among the three flooding alternatives. However, the adverse scenarios considered in Experiments 2 and 3 are likely to stem from emergency events, such as component failures, whose LSAs are typically short. Given these results, we may conclude that a good heuristic would be to use the BE SAF protocol for periodic flooding and for advertising fluctuations in resource availability (for example, changes in the residual bandwidth of a link), but to invoke the Basic SAF protocol for the dissemination of network component failures.

3.5 Summary

In this chapter, we have proposed two switch-aided flooding protocols and an accompanying protocol to construct spanning MCs. The protocols are designed to exploit ATM hardware cell switching and cell duplication. SAF protocols use the SMC as a broadcast medium to reduce flooding time. However, the protocols do not rely entirely on the SMC, but rather revert to point-to-point message forwarding if the SMC is damaged or under construction. Two SAF protocols were described: the Basic SAF protocol and the BE SAF protocol. Under normal operating conditions, both protocols deliver network updates several times faster than the conventional flooding algorithm. The Basic SAF protocol is a relatively simple extension of the conventional flooding protocol and should be straightforward to implement.
Our simulation study shows that the difference in the bandwidth consumed by the Basic SAF protocol and the conventional flooding is significant for small networks, but is only marginal for large networks. The advantage of the Basic SAF protocol over the BE SAF protocol is its stability in performance under adverse circumstances, for example, when the SMC is partitioned or under construction. We also note that the bandwidth consumption of this protocol may be even smaller when flooding event LSAs, due to their short lengths; under the Fore SPANS implementation, event LSAs are one-cell packets.

The BE SAF protocol addresses the bandwidth consumption issue by introducing dummy LSA forwarding. The bandwidth savings of this method are particularly significant when the network size is large or when the LSAs are long. The performance of the BE SAF protocol is more sensitive to adverse network circumstances, however. As a simple heuristic, an "adaptive" network management system could use the BE SAF protocol for periodic flooding operations (whose corresponding LSAs are typically sufficiently long to benefit from the use of dummy forwarding), but switch to the Basic SAF protocol in the presence of emergency events, such as link failures.

The results in this chapter support the theme of this dissertation: the mutually beneficial relationship between LSR and group communication. We have demonstrated that group communication techniques help improve the performance of LSR. Specifically, we have used a spanning tree to improve the performance of flooding operations. In the next chapter, we will push farther in this direction, introducing another type of topology that has been used in host-level group communication, the ring topology, to further improve flooding performance. We will show that the combination of a spanning tree and a ring produces an optimal flooding method for use by ATM networks.
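The "adaptive" heuristic suggested in the summary can be phrased as a small dispatch function (a sketch of our own; the kind labels and function name are hypothetical):

```python
def choose_saf_protocol(lsa_kind):
    """Pick a flooding protocol by LSA kind: BE SAF for long
    periodic/utilization LSAs (dummy forwarding pays off), Basic SAF
    for short emergency LSAs and as a conservative default."""
    if lsa_kind in ("periodic", "utilization"):
        return "BE SAF"
    if lsa_kind in ("link-down", "switch-down"):
        return "Basic SAF"
    return "Basic SAF"
```

The rationale follows the experiments above: emergency LSAs are short and occur precisely when the SMC may be damaged, which is where the Basic SAF protocol is the more stable choice.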
Chapter 4

Optimal SAF Operations

In the previous chapter, we improved the performance of flooding operations in ATM networks by constructing a tree topology, a common technique for supporting group communication. Another type of topology that has been used in group communication is the ring topology, which connects the members of a group in a circular manner. Host-level applications of a "group ring" include barrier synchronization [51], leader election [52], reliable multicast [53], and maintaining consistent message-receipt orderings among group members [53]. In this chapter, we construct a ring that uses ATM VCs to connect all switches in an ATM network for use as the acknowledgment topology in flooding operations. Switches, after receiving an LSA from the SMC, exchange acknowledgments or dummy LSAs only with the neighboring switches defined by the ring, as opposed to all the neighboring switches defined by the physical topology. The resultant flooding protocol, called Efficient Reliable (ER) SAF, is optimal in terms of complexity, for it requires only O(1) time in both LSA receipt and flooding completion, and incurs only O(|V|) bandwidth for both LSA delivery and reliability implementation.

4.1 Motivation

In the previous chapter, we developed two SAF methods, namely the Basic SAF and BE SAF protocols. These two SAF protocols outperform the conventional flooding algorithm by using a hardware-based spanning tree, the SMC, to speed up the dissemination of LSAs. We note that the (remaining) overheads of the two SAF protocols stem from the requirement to guarantee the delivery of any LSA to all network nodes that are reachable from the originating node. In general, both the conventional flooding method and previous SAF protocols achieve reliability by means of a "neighbor watching" principle: every node, after receiving an LSA, makes sure that all its neighboring nodes have also received the LSA.
In the conventional flooding protocol, the principle is implemented by reliably forwarding an incoming LSA to all neighbors, except the one from which the LSA arrives. In the Basic SAF and BE SAF protocols, the principle is implemented by exchanging acknowledgments or dummy LSAs with all neighboring switches. The communication of every switch with all neighboring switches inevitably consumes O(|E|) bandwidth and requires O(D_G) time to complete a flooding operation, where D_G is the maximum node degree in the given network topology G.

To avoid the overheads that are associated with reliability, one could use the SMC for best-effort flooding and ignore the reliability issue altogether. In this method, which we refer to as the Unreliable SAF protocol, the source node of an LSA broadcasts the LSA on the SMC, but makes no effort to ensure receipt of the LSA by other switches. The speed and bandwidth complexities of the four flooding protocols discussed so far (the conventional, Basic SAF, BE SAF, and Unreliable SAF protocols) are compared in Table 4.1, where dia(G) denotes the diameter of network G. In the table, we distinguish two bandwidth metrics: delivery bandwidth refers to the number of links that an LSA has to traverse, and reliability bandwidth refers to the number of acknowledgments/dummies produced. As we can see in the table, the three SAF protocols are more efficient than the conventional flooding protocol. The Unreliable SAF protocol is the most efficient, of course, since it does not include acknowledgments: it exhibits constant complexities in both time metrics and consumes O(|V|) bandwidth.

Of course, we would like to use the most efficient flooding protocol available. One method to use the Unreliable SAF protocol is to distinguish two types of network status: topology status and utilization status. As discussed earlier, the topology status

Table 4.1: Complexities of various flooding protocols.
Flooding Method        | Receipt   | Completion        | Delivery | Reliability | Total
Conventional           | O(dia(G)) | O(dia(G) + deg_G) | O(|E|)   | O(|E|)      | O(|E|)
Basic SAF              | O(1)      | O(deg_G)          | O(|E|)   | O(|E|)      | O(|E|)
BE SAF                 | O(1)      | O(deg_G)          | O(|V|)   | O(|E|)      | O(|E|)
Unreliable SAF         | O(1)      | O(1)              | O(|V|)   | 0           | O(|V|)
ER SAF (this chapter)  | O(1)      | O(1)              | O(|V|)   | O(|V|)      | O(|V|)

(The Receipt and Completion columns are time complexities; the Delivery, Reliability, and Total columns are bandwidth complexities.)

of a network component (a switch or a communication link) refers to the operational state of the component; the present topology of the network is determined by the set of currently operational switches and links. The topology of a network can be expected to be relatively static, assuming that reliable components are used to construct the network. The utilization status reflects the availability of network resources. For example, the utilization status of a link includes the bandwidth in use, the delay over the link experienced by recent cells, the cell loss rate, and so forth. In ATM networks, utilization status can be very dynamic, as network resources are allocated and released when VCs are set up and torn down. As such, utilization status LSAs are expected to constitute the majority of flooding operations.

It has been argued [54, 55] that, while changes in topology status (such as the failures of network components) must be flooded reliably, dynamics in utilization status could use unreliable, or best-effort, flooding methods. This is because inaccurate resource utilization information would not lead to disastrous situations, but merely result in sub-optimal routing decisions. Moreover, since the utilization of network resources may change at a high rate, one should be concerned with the efficiency of disseminating such changes. It follows that the Unreliable SAF protocol best fits this purpose. We agree that efficiency is a major concern in the flooding of utilization status LSAs. However, in this chapter we will demonstrate that a reliable SAF protocol can be complexity-wise as efficient as the unreliable SAF protocol.
Furthermore, we contend that there are cases where the reliability of resource utilization flooding is important. Let us consider a switch x that has been overloaded by heavy traffic. According to ATM PNNI, at least one LSA indicating the utilization change will be flooded throughout the network so that other switches can avoid using switch x in future VCs. Should switch x advertise the congestion situation unreliably, some switches may not receive the corresponding LSA and thus will continue using the switch in new VCs, further exacerbating the congestion situation. Moreover, it is exactly when a switch is congested that it will most likely drop cells, including the ones pertaining to the utilization status LSAs that disseminate the congestion situation. The information about the congestion at x may not leave x at all, and the problem feeds on itself as new VCs make the congestion situation worse.

In this chapter, we continue the SAF work by developing a reliable SAF protocol that is more efficient than the Basic and BE SAF protocols. The Efficient Reliable (ER) SAF protocol constructs a second topology, a virtual ring, to provide reliability. As shown in Table 4.1, the new protocol exhibits speed and bandwidth complexities identical to those of the unreliable SAF protocol. Further, it retains the reliability of the conventional flooding protocol; that is, an LSA will be delivered to all switches that are reachable from the originating switch. Since a flooding protocol must deliver a given LSA at least once to every such switch, both the O(1) time complexities and the O(|V|) bandwidth complexities of the ER SAF protocol are optimal.

The remainder of this chapter is organized as follows. In Section 4.2, we describe the ER SAF protocol, including the use of the virtual ring for reliability and the issues that arise when decoupling construction/maintenance of the ring from on-going flooding operations.
Details of the ER SAF algorithms are provided in Section 4.3. In Section 4.4, we discuss the methods used to construct and maintain the virtual ring. While the ER SAF protocol achieves optimal complexities, its expected performance under real network conditions is of interest. In Section 4.5, we investigate through simulation the behavior of the ER SAF protocol both in "normal" situations and under adverse circumstances, where network component failures affect the operation of the SMC and/or the virtual ring. The results of our simulation reveal that the ER SAF protocol delivers network updates several times faster than conventional approaches in normal situations, and twice as fast in the presence of component failures. A summary of our SAF work is given in Section 4.6.

4.2 ER SAF Protocol Design

In this section, we describe the design issues and basic concepts of the ER SAF protocol. In the discussion, we assume that the network topology G = (V, E) is a connected graph, since our concern here is to efficiently flood LSAs to "reachable" nodes. To generalize our discussion to partitioned networks, we can simply apply the argument to each segment.

4.2.1 Basic Concept

The ER SAF protocol uses the hardware-based SMC to achieve constant LSA delivery time. However, it adopts a different approach to reliability than previous SAF protocols. Instead of implementing the neighbor watching principle over the physical network topology G, the ER SAF protocol constructs a virtual topology R = (V, E_R) to implement reliability. The topology R is a ring that visits all nodes in G exactly once. The topology is virtual because neighboring nodes in R are not necessarily adjacent in the physical network topology G. Rather, they are connected by ATM VCs that may traverse one or more intermediate nodes.
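To make the notion of a virtual ring concrete, the following sketch builds one with a nearest-neighbor heuristic (this construction is our own illustration, not the method of Section 4.4): hop distances come from BFS over the physical topology, so the total hop length of all ring VCs, |R|, can be compared against C × |V|.

```python
from collections import deque

def hop_dist(adj, src):
    """BFS hop counts from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def build_ring(adj, start):
    """Order all nodes into a ring, always hopping to the nearest
    unvisited node (a simple traveling-salesman-style heuristic)."""
    ring, seen = [start], {start}
    while len(ring) < len(adj):
        d = hop_dist(adj, ring[-1])
        nxt = min((v for v in adj if v not in seen), key=lambda v: d[v])
        ring.append(nxt)
        seen.add(nxt)
    return ring

def ring_length(adj, ring):
    """|R|: total hops of all ring VCs, including the closing VC."""
    return sum(hop_dist(adj, u)[ring[(i + 1) % len(ring)]]
               for i, u in enumerate(ring))
```

On a 4-node path graph, for example, the heuristic yields the ring 1-2-3-4-1 with |R| = 6 hops, i.e., C = 1.5.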
Specifically, each node x in G is connected to its predecessor in R, denoted as Pred(x), and to its successor in R, denoted as Succ(x), by VCs RVC_pred(x) and RVC_succ(x), respectively. (RVC stands for Ring VC.) We defer to Section 4.4 the discussion of the construction and maintenance of ring R. At this point, we merely emphasize that the ER SAF protocol must be able to work properly when the ring R is under construction or involved in maintenance operations.

In the ER SAF protocol, the neighbor watching principle is implemented as follows. Any node x, after receiving an LSA from the SMC, exchanges acknowledgments of the LSA with Pred(x) and Succ(x), rather than with all its neighboring nodes defined by the physical topology G; we will refer to acknowledgments sent via ring VCs as r_acks. If every node x ∈ G receives r_acks for a given LSA from Pred(x) and Succ(x), then the flooding operation is completed. Let us use an example to illustrate the operation of the ER SAF protocol in "normal" cases, where the topology of the network is stable and both the SMC and virtual ring R are fully operational. This example assumes the network and SMC topologies shown in Figure 3.1(c). Figure 4.1(a) depicts a virtual ring R connecting switches in the (alphabetic) order A, B, C, ..., M. We point out that some ring VCs, such as the H-I VC, traverse one or more intermediate nodes. Assuming that the SMC broadcast of an LSA ℓ successfully reaches all switches, as shown in Figure 4.1(b), the ensuing neighbor watching activities are depicted in Figure 4.1(c), where each node exchanges r_acks of ℓ with its succeeding and preceding nodes in R. The flooding operation is completed when every node receives two acknowledgments of ℓ.
[Figure 4.1: ER SAF flooding in normal cases. Panels: (a) a virtual ring R for the network; (b) a successful SMC broadcast; (c) exchange of acknowledgments in R.]

In ER SAF operations under normal conditions, nodes require O(1) time to receive the LSA, and must process O(1) r_acks. Hence, the per-switch workload (that is, the completion time metric) is of constant complexity. O(|V|) acknowledgments will be produced; the total number of links traversed by r_acks depends on the total length of the ring VCs, denoted as |R|. Various existing heuristics for the traveling salesman problem produce cycles where |R| < C × |V| and C is a constant [56]. Using such a heuristic in the construction of the ring, the bandwidth consumed by reliability activities is of complexity O(|V|) (our simulation results presented in Section 4.5 show that C is typically less than 1.5). Because the number of links that an LSA traverses in normal cases is exactly the number of SMC links, the bandwidth consumed by LSA delivery also exhibits complexity O(|V|). Thus, the total bandwidth consumption exhibits complexity O(|V|).

4.2.2 Operation Modes

In addition to the normal situations described above, the ER SAF protocol must handle more difficult scenarios where the SMC broadcast of the LSA does not reach all nodes, where cells pertaining to r_acks are lost, where ring VCs are damaged by network component failures, or where arbitrary combinations of these events occur. If an LSA is being flooded under such adverse circumstances, then there may exist a node x that possesses the LSA after the SMC broadcast but does not receive the r_ack of the LSA from a node y ∈ {Pred(x), Succ(x)}. (If no such node x exists, then the flooding is completed.) In this case, node x can retransmit the LSA to y using the corresponding ring VC, and repeat such retransmissions until y returns an r_ack.
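The retransmission behavior just described can be illustrated with a toy model (entirely our own simplification; real r_acks and retransmissions are asynchronous per-VC messages, not synchronized rounds):

```python
def r_mode_round(ring, has_lsa):
    """One round of ring retransmission: every node holding the LSA
    pushes it to both ring neighbors. Returns the new holder set and
    whether the flooding is now complete."""
    n = len(ring)
    new_has = set(has_lsa)
    for i, node in enumerate(ring):
        if node in has_lsa:
            new_has.add(ring[(i - 1) % n])
            new_has.add(ring[(i + 1) % n])
    return new_has, len(new_has) == n
```

When only one isolated node misses the SMC broadcast, a single round completes the flooding; but when many consecutive nodes miss it, coverage grows by only one node per direction per round, which is exactly the sequential degeneration addressed later by the two-three rule.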
For a given LSA, when the ER SAF protocol uses the virtual ring R for acknowledgments/retransmissions, we say that it is operating in the R mode.

Adverse network status changes and the R-mode operation create a cyclic dependency: in mode R, adverse network changes that damage the ring R can impede their own advertisements, while the repair of the ring R requires up-to-date network topology information contained in such advertisements. To avoid this dilemma, the ER SAF protocol has a second operation mode, called the G mode, that is used when the ring R is damaged or under construction (the letter G indicates the use of the physical topology G for reliability). When operating in the G mode, the ER SAF protocol is identical to the Basic SAF protocol: a node receiving the LSA on the SMC subsequently exchanges copies of the LSA (and acknowledgments) with each of its physical neighbors.

The ER SAF protocol needs a method to decide which mode to use for a given LSA. In general, the R mode, due to its efficiency, should be used whenever it can ensure reliability, that is, when the ring R is operational; otherwise, the G mode should be used. The source node a of an LSA uses its "local" status of R, that is, the operational status of RVC_succ(a) and RVC_pred(a), to determine the mode to use for the LSA. If both RVCs are operational, then the source node initiates the flooding operation in mode R. Otherwise, it starts the flooding in mode G. Of course, it is possible that the source of an LSA initiates a flooding operation in the R mode while there are link-down events that damage ring R and that have not yet been learned of by the source. In such circumstances, some node(s) other than the source must change the operation mode during the course of the flooding. In ER SAF, the flooding of a given LSA can change from the R mode to the G mode, but the reverse is not allowed.
Consider a scenario where a switch a is flooding a utilization status LSA ℓ using mode R, while in the meantime a link used by RVC(x, y) but not by the SMC has failed. Let us assume that the SMC broadcast of ℓ successfully arrives at all nodes in G. Although both x and y receive ℓ from the SMC, the two nodes cannot receive r_acks of ℓ from one another. Both nodes will try to retransmit ℓ to each other, but such retransmissions have no chance to succeed either. The R-mode flooding operation is bound to fail in this situation. Instead, node x, after realizing the problem with RVC(x, y), must switch to mode G, initiating a Basic SAF operation of ℓ on behalf of switch a. (Node x could learn of the problem via the link-down LSA produced by the endpoints of the faulty link, or when retransmissions fail a predetermined number of times.) In this manner, we are assured that ℓ will reach all network nodes while the ring R is under repair.

Even when the ring R is fully operational, there are cases where the R mode is unacceptably inefficient. Consider the example shown in Figure 4.2, where switch G, which is a leaf in the SMC, is advertising the failure of the (G, I) link, which is used by the SMC, but not by the virtual ring R (we assume the ring topology depicted in Figure 4.1(a)). In this situation, the SMC cannot deliver this link-down LSA to any node at all. After failing to receive the corresponding r_acks from Pred(G)=F and Succ(G)=H, switch G retransmits the LSA to the two nodes over ring VCs. Nodes F and H will also notice the lack of r_acks from E and I, respectively, and attempt to retransmit. The result is that the LSA traverses the ring R in a sequential, store-and-forward manner, as depicted in Figure 4.2(b). In general, retransmissions over the ring R degenerate into a sequential procedure whenever multiple nodes, consecutive in R, fail to receive the SMC-switched copy of an LSA.
To avoid this performance problem, we introduce a two-three rule as a mode-switching heuristic: whenever any two consecutive nodes in R do not receive a given LSA from the SMC, the ER SAF operation, with respect to that LSA, will switch to mode G. This rule can be formally stated as follows.

Two-Three Rule. With respect to a given LSA ℓ, the two-three rule is satisfied at a node x if any two of the three nodes x, Succ(x), and Pred(x) do not receive ℓ from the spanning MC. Precisely, any node x that is currently in mode R with respect to ℓ switches to mode G if either one of the following conditions is satisfied.

C1. Node x does not receive an r_ack of ℓ from either Succ(x) or Pred(x) after waiting for a predetermined length of time since the receipt of the SMC-relayed copy of ℓ.

C2. The first time x receives the LSA is from one of its ring neighbors (indicating that x itself has missed the SMC copy), but x has not received the r_ack from the other ring neighbor y (indicating that y may not have received the LSA either).

4.3 Algorithms

In this section, we present the algorithms used by the ER SAF protocol. We use the notation N_R(x) to denote the set {Succ(x), Pred(x)} and the notation N_G(x) to denote the set of neighboring nodes defined by the network topology G. We assume that the nodes in N_G(x) can be reached via ports numbered 1 to K_x, where K_x = |N_G(x)|. We further assume that LSAs are tagged with a sequence number and source switch ID: an LSA from node i with sequence number j will be denoted as LSA(j, i), and its corresponding acknowledgments will be denoted as either r_ack(j, i) or g_ack(j, i), depending on the operation mode of the LSA.

[Figure 4.2: A hypothetical scenario where LSA retransmissions over R degenerate into a bidirectional store-and-forward process. Panels: (a) the link (H, J) fails; (b) sequential store-and-forward in R.]
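The Two-Three Rule above can be sketched as a boolean predicate evaluated at a node x for one LSA (the parameter names are ours, not the dissertation's; first_copy_from is "succ" or "pred" when the first copy arrived over a ring VC, or None when it arrived on the SMC):

```python
def must_switch_to_G(timer_expired, ack_from_succ, ack_from_pred,
                     first_copy_from=None):
    if first_copy_from is None:
        # C1: x holds the SMC copy, but after the timeout at least one
        # ring neighbor has not returned an r_ack.
        return timer_expired and not (ack_from_succ and ack_from_pred)
    # C2: x missed the SMC copy (first copy came over a ring VC), and
    # the *other* ring neighbor has not acknowledged either -- so two
    # consecutive nodes appear to lack the SMC copy.
    other_acked = ack_from_pred if first_copy_from == "succ" else ack_from_succ
    return not other_acked
```

In both branches the predicate fires exactly when two of the three nodes x, Succ(x), and Pred(x) appear not to have received the SMC copy, matching the informal statement of the rule.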
Each switch maintains the following data structures: Seq[i], the sequence number of the current LSA from switch i, 1 ≤ i ≤ |V|; Mode[i] ∈ {R, G}, the operation mode for the flooding of LSA(Seq[i], i); Fsucc[i] and Fpred[i], two boolean flags indicating whether the switch x has received r_ack(Seq[i], i)/LSA(Seq[i], i) from nodes Succ(x) and Pred(x), respectively; and F[i][p], a boolean flag indicating whether the switch has received via port p the g_ack(Seq[i], i)/LSA(Seq[i], i), for 1 ≤ i ≤ |V| and 1 ≤ p ≤ K_x. Let us denote the sequence number of an LSA ℓ by Seq(ℓ) and the address of its source switch by Source(ℓ). Moreover, every LSA ℓ has a mode bit, denoted by Mode(ℓ), whose value can be either R (for mode R) or G (for mode G). The mode bit of an LSA may change value during the course of the corresponding flooding operation, and copies of the same LSA may be in different modes. We also assume that, when a switch x receives a copy of ℓ, it can discover the sender of this copy, denoted as Sender(ℓ). The sender information can be determined by the port/RVC/SMC on which ℓ arrives.

The source of an LSA invokes the routine FloodLSA, which is shown in Figure 4.3. Parameters to the routine include the ID x of the invoking switch and an LSA ℓ to be flooded. The switch x updates the sequence number of its current LSA to that of ℓ and clears the relevant Fsucc, Fpred, and F flags to indicate that it has not received anything about ℓ from its neighbors. Having finished these bookkeeping tasks, switch x next decides the operation mode of ℓ. The R mode is used if the incident ring VCs, RVC_succ(x) and RVC_pred(x), are operational and if at least one incident SMC link is operational. (The use of mode R even when some incident SMC links are malfunctioning encourages the use of fragments of the SMC to disseminate an LSA to as many nodes as possible.) In this case, switch x sends the LSA on the SMC with the mode bit set to R, and sends r_acks of the LSA to the nodes in N_R(x) via ring VCs.
Otherwise, the G mode is used; that is, switch x sends the LSA on the SMC with the mode bit set to G, and forwards the LSA to the nodes in NG(x) via physical links. The SMC broadcast may be skipped in mode G, however, if all SMC links incident to x have failed. In either mode, a timer is set up to wait for responses from the set of neighboring nodes determined by the chosen operation mode.

Algorithm: FloodLSA. Input: the switch ID x, and an LSA ℓ.
  Seq[x] = Seq(ℓ).
  F[x][p] = FALSE, for all 1 ≤ p ≤ K_x.
  Fsucc[x] = Fpred[x] = FALSE.
  IF (either RVCsucc(x) or RVCpred(x) is damaged) or (all SMC links incident to x are malfunctioning)
    Mode(ℓ) = Mode[x] = G.
  ENDIF
  IF (Mode[x] = R)
    Broadcast ℓ on the SMC.
    Send r_ack(Seq(ℓ), x) to the two nodes in NR(x) via ring VCs.
    Set up an r_timer(ℓ).
  ELSE /* Mode[x] = G */
    IF (at least one incident SMC link is operational)
      Broadcast ℓ on the SMC.
    ENDIF
    Forward ℓ to all nodes in NG(x) via physical links.
    Set up a g_timer(ℓ).
  ENDIF

Figure 4.3: The sender algorithm of the ER SAF protocol.

Switches that receive an LSA invoke the routine ReceiveLSA, which is shown in Figure 4.4. Parameter x indicates the ID of the invoking switch, and parameter ℓ is the LSA received. The routine first decides whether it is dealing with a new LSA by checking ℓ's sequence number against the current sequence number recorded locally. If a new LSA is observed, the corresponding Fsucc, Fpred, and F flags are cleared to indicate that nothing has yet been learned about this LSA from neighbors, and the network image at x is updated according to ℓ. Subsequent processing depends on the mode of ℓ. The processing of ℓ when Mode(ℓ) = G follows the Basic SAF protocol: if ℓ arrives on the SMC, then it is forwarded on all ports; otherwise, ℓ arrives on a port p and is forwarded on all the other ports. The processing of ℓ when Mode(ℓ) = R is somewhat more complicated. First, the switch needs to decide whether to change to the G mode.
If ℓ arrives on the SMC, its mode is changed when any incident RVC of x is damaged. If ℓ arrives on a ring VC, its mode is changed when condition C2 of the Two-Three Rule is satisfied. If mode switching does occur, ℓ is forwarded on all ports p, 1 ≤ p ≤ K_x. Otherwise, the r_ack of ℓ is sent to Pred(x) and Succ(x) if ℓ arrived on the SMC, or to Sender(ℓ) if ℓ arrived on a ring VC. This concludes the processing of ℓ when it is received for the first time. When switch x receives subsequent copies of ℓ, it discards such copies, unless the current mode at x with respect to ℓ is the R mode and the arriving copy is in mode G, forcing x to switch to the G mode and to forward ℓ along all incident links. We point out, again, that sending an acknowledgment is necessary even for duplicate copies of ℓ. Lastly, if ℓ did not arrive on the SMC, switch x must remember that it has received ℓ from its neighboring node Sender(ℓ).

Algorithm: ReceiveLSA. Input: the switch ID x, and an LSA ℓ.
  a = Source(ℓ).
  IF (Seq(ℓ) > Seq[a]) /* This is the first copy. */
    Seq[a] = Seq(ℓ), Fsucc[a] = Fpred[a] = FALSE, and F[a][q] = FALSE, for all 1 ≤ q ≤ K_x.
    Update the local network image at x according to ℓ.
    IF (Mode(ℓ) = R)
      IF (ℓ is received from the SMC)
        IF (both RVCsucc(x) and RVCpred(x) are operational)
          Send r_ack(Seq(ℓ), a) via the two RVCs, and set up an r_timer(ℓ).
        ELSE
          Change Mode(ℓ) to G, forward ℓ on port p, 1 ≤ p ≤ K_x, and set up a g_timer(ℓ).
        ENDIF
      ELSE /* ℓ is received from a ring VC v; check condition C2 of the Two-Three Rule */
        IF (Sender(ℓ) = Pred(x) and Fsucc[a] = FALSE) or (Sender(ℓ) = Succ(x) and Fpred[a] = FALSE)
          Change Mode(ℓ) to G, forward ℓ on port p, 1 ≤ p ≤ K_x, and set up a g_timer(ℓ).
        ELSE
          Send r_ack(Seq(ℓ), a) to Sender(ℓ) via RVC v.
        ENDIF
      ENDIF
    ELSE /* Mode(ℓ) = G */
      IF (ℓ is received from the SMC)
        Forward ℓ on port p, 1 ≤ p ≤ K_x, and set up a g_timer(ℓ).
      ELSE /* ℓ is received from port p */
        Forward ℓ on all ports, except p, and set up a g_timer(ℓ).
        Send g_ack(Seq(ℓ), a) to Sender(ℓ) via port p.
      ENDIF
    ENDIF
    Mode[a] = Mode(ℓ).
  ELSE /* This is an extra copy. */
    IF (Mode[a] = R) but (Mode(ℓ) = G)
      Change Mode[a] to G, forward ℓ on port p, 1 ≤ p ≤ K_x, and set up a g_timer(ℓ).
    ELSE
      Return an r_ack or g_ack to Sender(ℓ), depending on Mode(ℓ).
    ENDIF
  ENDIF
  IF (ℓ is received from a ring VC)
    Set Fpred[a] or Fsucc[a] to TRUE, depending on Sender(ℓ).
  ELSE IF (ℓ is received from port p)
    F[a][p] = TRUE.
  ENDIF /* No flag to set if ℓ is received from the SMC. */

Figure 4.4: The ReceiveLSA routine in the ER SAF protocol.

When a switch x receives the acknowledgment of an LSA ℓ, it invokes the ReceiveACK routine, shown in Figure 4.5. The purpose of the routine is straightforward: the receipt of an acknowledgment from a switch y assures x that y has already received the LSA. This situation is recorded in the corresponding Fsucc, Fpred, or F flags, according to the port/RVC on which the acknowledgment arrives.

Algorithm: ReceiveACK. Input: the switch ID x, and an acknowledgment d.
  a = Source(d).
  IF (d is received from Pred(x))
    Fpred[a] = TRUE.
  ELSE IF (d is received from Succ(x))
    Fsucc[a] = TRUE.
  ELSE /* d is received from port p */
    F[a][p] = TRUE.
  ENDIF

Figure 4.5: The ReceiveACK routine in the ER SAF protocol.

When the timer associated with an LSA ℓ fires at switch x, the switch invokes the TimeoutHandler routine, shown in Figure 4.6. The timer, however, may be ignored for two reasons: it was set up for an LSA with an obsolete sequence number, or the timer is an r_timer for an LSA whose operation mode at x has been changed to mode G since the setup of the timer. Subsequent processing, if required, depends on the type of the timer. For a g_timer, the routine forwards the associated LSA ℓ to the ports whose corresponding F flags have not been set to TRUE.
The processing of an r_timer is more complicated, however, as we have to decide whether to switch to mode G. The mode is changed when local RVCs are found to be damaged, or when condition C1 of the Two-Three Rule is satisfied. If the mode needs to be changed, then the LSA is forwarded on all ports p, 1 ≤ p ≤ K_x. Otherwise, the LSA is forwarded (or, more precisely in this case, retransmitted) to a node in NR(x) whose corresponding flag is FALSE (at most one such retransmission will be performed; otherwise the Two-Three Rule would have been satisfied). Lastly, a timer in the appropriate mode is set up to wait for responses from the neighboring nodes to which this LSA has been forwarded/retransmitted.

Algorithm: TimeoutHandler. Input: the switch ID x and a timer(ℓ).
  a = Source(ℓ).
  IF (Seq(ℓ) < Seq[a]) or (timer(ℓ) is an r_timer but Mode[a] = G)
    Return.
  ENDIF
  IF (timer(ℓ) is a g_timer)
    For 1 ≤ p ≤ K_x, forward ℓ on port p if (F[a][p] = FALSE).
    Set up a new g_timer(ℓ).
  ELSE /* timer(ℓ) is an r_timer */
    IF (either RVCsucc(x) or RVCpred(x) is damaged) or
       /* The next condition is C1 of the two-three rule. */
       (Fsucc[a] = FALSE and Fpred[a] = FALSE)
      /* Switch to the G mode. */
      Mode[a] = Mode(ℓ) = G.
      Forward ℓ via all incident ports.
      Set up a new g_timer(ℓ).
    ELSE /* Retransmit ℓ via a ring VC. */
      IF (Fsucc[a] = FALSE) THEN forward ℓ via RVCsucc(x).
      IF (Fpred[a] = FALSE) THEN forward ℓ via RVCpred(x).
      Set up a new r_timer(ℓ).
    ENDIF
  ENDIF

Figure 4.6: The timeout handler in the ER SAF protocol.

4.4 The Virtual Ring

In this section, we discuss the construction and maintenance of the virtual ring R. For this topic we must explicitly consider the handling of network partitioning, and hence we drop the assumption that the network G is connected. While a flooding operation is concerned only with delivering the LSA to all nodes reachable from the source node, the ring construction procedures of the ER SAF protocol must construct
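The r_timer branch of the handler can be sketched as follows; the function returns the action the switch takes for one pending LSA, and all names are illustrative rather than taken from the protocol.

```python
def on_r_timer(acked_succ: bool, acked_pred: bool,
               rvc_succ_ok: bool = True, rvc_pred_ok: bool = True) -> str:
    """Sketch of r_timer expiry: switch to mode G when an incident RVC
    is damaged or condition C1 holds (no r_ack from either ring
    neighbor); otherwise retransmit to the at most one neighbor that
    has not acknowledged."""
    if not (rvc_succ_ok and rvc_pred_ok) or (not acked_succ and not acked_pred):
        return "switch-to-mode-G"       # forward on all ports, start g_timer
    if not acked_succ:
        return "retransmit-to-Succ(x)"  # restart r_timer afterwards
    if not acked_pred:
        return "retransmit-to-Pred(x)"
    return "done"                       # both neighbors have acknowledged

print(on_r_timer(acked_succ=True, acked_pred=False))  # -> retransmit-to-Pred(x)
```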
a ring within each network segment during partitioning periods, and re-construct a new ring when two or more segments merge into one.

Within each segment, the construction of the virtual ring R comprises three steps. First, the leader switch in the segment, elected by the ATM leader election protocol, computes an ordering of the switches that are reachable from the leader according to the local network image of the leader. Second, the leader switch advertises the ordering using the Basic SAF protocol; such an LSA is called a switch ordering LSA. Third, every switch establishes a ring VC to its successor as defined by the ordering.

If we define the criterion of the ordering computation to be the minimum total length of RVCs, then the switch ordering problem becomes the well-known traveling salesman problem [56]. Since the problem is NP-complete, we use the following heuristic [56]: the leader computes a depth-first search tree that spans all reachable switches in its local network image, and uses the ordering determined by the pre-order traversal of the tree.

When a switch x receives a switch ordering LSA, it sets up a VC to the succeeding switch defined by the ordering, following the procedures described in the ATM UNI 3.1 standard [30]. It also accepts a VC-setup request from switch y, where y is the predecessor of x in that ordering. The paths of the two ring VCs are recorded so that, if subsequent link-down LSAs are received, switch x can detect damage to incident ring VCs.

The maintenance phase of the virtual ring is divided into two levels: repair and reconstruction. When a switch x learns of damage to its RVCsucc(x) from a link-down LSA, it shall try to establish a new VC to Succ(x). If this task succeeds, the ring is repaired and no further action is needed. Otherwise, the network has been partitioned.
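The ordering heuristic can be sketched as a pre-order depth-first traversal; the adjacency-list representation below is a hypothetical stand-in for the leader's local network image.

```python
def ring_ordering(adj, leader):
    """Compute a switch ordering by pre-order traversal of a DFS tree
    rooted at the leader (the traveling-salesman heuristic described
    above). `adj` maps each switch ID to its neighbor list."""
    order, seen = [], set()

    def dfs(u):
        seen.add(u)
        order.append(u)          # pre-order: record u before its subtree
        for v in adj[u]:
            if v not in seen:
                dfs(v)

    dfs(leader)
    return order                 # each switch sets up an RVC to its successor;
                                 # the last switch wraps around to the leader

# Example: a small graph; the ring is the pre-order sequence.
adj = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
print(ring_ordering(adj, "A"))   # -> ['A', 'B', 'D', 'C']
```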
Under such circumstances, a leader will be elected within each segment, and will subsequently compute and flood an ordering of the switches within the segment. Switches within each segment then follow the new ordering to establish new ring VCs, resulting in a new ring within the segment. Another situation requiring ring reconstruction occurs when network segments re-unify with each other (possibly because malfunctioning network components have recovered). As in the previous case, a leader election will take place, and the new leader will compute and flood a switch ordering of the merged segment. In general, the leader of a network/segment monitors the set of reachable switches defined by its local network image. When a membership change occurs to this "reachable set," or when the leader is newly elected, a (new) ordering of the switches in the set is computed and flooded, resulting in the (re)construction of the virtual ring.

4.5 Performance Evaluation

We studied the performance of the ER SAF protocol through simulation. The simulator is based on the CSIM package [48]. Confidence intervals were computed, but in most cases they are very small and, for clarity, are not shown in the plots.

Networks comprising up to 400 switches were simulated. Since each switch is likely to be attached to several hosts, such networks may include thousands of hosts. For each network size, 40 graphs were generated randomly, and 100 simulation runs were performed on each graph. Each run used a randomly selected core node for SMC construction and a randomly selected flooding source. The core node selected for a simulation session is also used as the root of the depth-first search tree, which determines the ring topology. Table 4.2 shows the characteristics of the graphs generated. These random graphs exhibit average node degrees conforming to those observed in some subnetworks of the Internet [57].
            degree                 diameter
  size   min.   avg.   max.    min.   avg.   max.
    10   1.95   3.87   5.80     2     2.950   4
    20   1.25   3.86   7.28     3     4.475   6
    40   1.05   4.02   7.95     5     5.700   8
    60   1.02   3.92   8.65     5     6.475   9
    80   1.00   4.00   9.38     6     7.000   8
   100   1.00   4.10   9.45     6     6.975   8
   120   1.00   4.18  10.22     6     7.450   9
   140   1.00   4.23  10.60     6     7.475  10
   160   1.00   4.30  10.20     7     7.525   9
   180   1.00   4.39  10.68     6     7.625   9
   200   1.00   4.43  11.18     7     7.750   9
   250   1.00   4.57  11.38     7     7.700   9
   300   1.00   4.72  12.05     7     7.950  10
   350   1.00   4.88  12.12     7     7.775   9
   400   1.00   5.08  12.65     7     7.625   9

Table 4.2: Characteristics of randomly generated graphs.

Each message transmission in a flooding operation incurs ATM protocol overhead. As in the simulation studies in the previous chapter, we measured these overheads on the ATM testbed in our laboratory. The testbed comprises Sun SPARC-10 workstations equipped with Fore SBA-200 adapters and connected by three Fore ASX-200 switches. From these measurements, we obtained the figure of 600 μsec, which includes the overhead at both the sending and receiving switches. The per-hop hardware switching delay was found to be 12 μsec.

Experiment 1: Normal cases. By normal, we refer to situations where the network topology is stable, and the SMC and the virtual ring R are operational. Such circumstances typically occur for the flooding of utilization status information and for periodic flooding. The simulation results pertaining to this setting are plotted in Figure 4.7. For the two time metrics, receipt time and completion time, we show both the average and worst-case results. As we can see in Figure 4.7(a), the ER SAF protocol delivers LSAs several times faster than does the conventional flooding protocol. This is especially true for large networks. When the network size is larger than or equal to 100, the average receipt time of the ER SAF protocol is less than one-fifth of that of the conventional flooding, and the worst-case time is less than one-eighth of that of the conventional flooding.
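For intuition about these results, a back-of-the-envelope model built from the two measured figures above (600 μsec of software protocol overhead per store-and-forward hop, 12 μsec of hardware switching per hop) already predicts a large gap. This simple model is our own illustration, not the analysis performed in the dissertation.

```python
SW_OVERHEAD_US = 600  # measured software overhead (send + receive) per hop
HW_SWITCH_US = 12     # measured per-hop hardware switching delay

def conventional_receipt_us(hops: int) -> int:
    """Conventional flooding: every hop pays the software overhead."""
    return hops * SW_OVERHEAD_US

def smc_receipt_us(hops: int) -> int:
    """SMC broadcast: software overhead paid once (at the source and
    the receiver), hardware switching at every hop along the way."""
    return SW_OVERHEAD_US + hops * HW_SWITCH_US

# For a node 7 hops from the source (a typical diameter in Table 4.2):
print(conventional_receipt_us(7), smc_receipt_us(7))  # -> 4200 684
```

Even this crude model reproduces the roughly fivefold-or-better advantage of hardware broadcast over store-and-forward flooding seen in the plots.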
Similarly, the completion time and LSA bandwidth consumption of the ER SAF protocol are only a small fraction of their counterparts in the conventional flooding (see Figures 4.7(b) and (c), respectively). Furthermore, the constant time complexities of the ER SAF protocol are clearly demonstrated by the flat curves.

In Figure 4.7(d), we plot the results regarding the reliability mechanisms of the two flooding protocols. For the conventional flooding protocol, we computed the average number of acknowledgments produced by the 4000 flooding operations for each graph size. For the ER SAF protocol, we computed both the average number of r_acks and the total number of links traversed by these r_acks. The latter metric is of interest in the ER SAF protocol because some ring VCs may involve more than one physical link. As shown in the figure, the average number of r_acks is less than the number of acknowledgments produced by the conventional flooding protocol, although the difference is not as dramatic as in the previous figures. We also note that the number of links traversed by r_acks is typically within 150% of the number of r_acks, suggesting that the average length of ring VCs is less than 1.5.

Figure 4.7: Comparisons of flooding alternatives with an operational SMC and virtual ring. (a) receipt time (μsec); (b) completion time (μsec); (c) LSA delivery bandwidth; (d) reliability bandwidth.

Experiment 2: Flooding of link-down events. Next, we studied the performance of the ER SAF protocol when used to disseminate network component failures, namely link-down events. For each graph, we randomly select and remove a link whose removal will not disconnect the graph, and select one of the endpoints of the link to advertise the event. The remaining parts of the network are assumed to be stable. Under the assumption that network components have a long MTBF (mean time between failures), it is unlikely that a second component will fail during the flooding of the first failure (though, of course, the ER SAF protocol can handle this situation). We expect flooding performance in single-failure scenarios to be representative of the flooding of topology status LSAs.

The corresponding simulation results are plotted in Figure 4.8. As we can see in Figures 4.8(a) and (b), the ER SAF protocol is still much faster than conventional flooding. In most cases of the receipt times, the ER SAF protocol is more than twice as fast as its competitor. We note that a link-down event does not necessarily force the ER SAF protocol to use the G mode. In our simulation, approximately 60% of the link-down events result in the use of the G mode. The performance of the remaining 40% should resemble that of the normal cases.
To investigate the performance of the ER SAF protocol in the G mode, we extracted those samples that use mode G and plotted the respective results in Figure 4.8(c). As shown, the average receipt times of LSAs flooded in mode G are only slightly higher than the overall average. This is because the ER SAF protocol also uses the (possibly fragmented) SMC to disseminate LSAs in mode G. After the SMC broadcast, all nodes that receive the LSA start forwarding the LSA via point-to-point links at nearly the same time, resulting in speedier point-to-point forwarding when compared to the conventional flooding protocol. With the presence of G-mode operations, the bandwidth consumption of the two flooding alternatives is approximately the same, as shown in Figure 4.8(d).

Figure 4.8: Comparisons of flooding alternatives in the performance of flooding link-down events. (a) receipt time (μsec); (b) completion time (μsec); (c) ER SAF receipt time in different modes (μsec); (d) bandwidth (the number of links traversed).

Experiment 3: Ring construction. Lastly, we investigated the time to construct a ring. As discussed earlier, the construction process comprises three phases: First, the leader switch computes the switch ordering using the depth-first search tree heuristic. Second, the leader switch broadcasts the ordering using the conventional flooding protocol. Third, every switch establishes a ring VC to its successor defined in the ordering. Since the first phase is simply a linear-time algorithm performed locally at the leader, we ignored this phase in the simulation. The average and worst-case ring construction times under this assumption are plotted in Figure 4.9. As we can see, the virtual ring R can be constructed within 22 milliseconds even for relatively large networks.

4.6 Summary

We have described an efficient reliable (ER) SAF protocol. The ER SAF protocol constructs an SMC to broadcast network status updates in hardware and uses a virtual ring topology to minimize reliability overhead. In normal cases, where the network topology is stable and the SMC and the virtual ring are operational, this protocol is optimal in terms of flooding time and bandwidth consumption. When network component failures affect the operation of the SMC or the ring, the ER SAF protocol resorts to a more conservative flooding method, namely the basic SAF protocol. Our simulation results reveal that the ER SAF protocol is several times faster than the conventional flooding protocol in normal cases, and is still twice as fast as the conventional flooding under adverse circumstances. The use of a ring topology in the ER SAF protocol serves as yet another example of how traditional
group communication techniques can be used to improve the performance of LSR.

Figure 4.9: The average/worst-case time to build a virtual ring (time in ms versus network size).

Chapter 5

A Generic Method of MC Construction

In the previous two chapters, we used group communication techniques, namely the construction of a multiparty communication channel, to improve the performance of LSR in ATM networks. In this chapter, we shift our attention to the other direction of the mutually beneficial relationship between group communication and LSR, that is, how the robustness of LSR and the complete topology information made available by LSR can help develop novel and efficient group communication solutions for use by multiparty communication applications. Specifically, we propose a protocol for the construction and maintenance of multipoint connections (MCs). A distinguishing feature of the protocol is its generality: the proposed solution can incorporate any MC topology computation algorithm, and hence can be used with MCs of different topology types or performance criteria, a requirement stemming from the diversity of multiparty communication applications. The protocol is based on LSR: information regarding multipoint connections is broadcast to network switches, which perform all MC topology computations locally. The protocol is free from routing loops, even transient ones, and tolerates any combination of link/node failures, including those that partition the network for a period of time. The correctness of the protocol, which is modeled as a consensus problem in a distributed system, is established by formal proofs. Results of a simulation study show that the generality of the protocol can be achieved with negligible to moderate signaling overhead.
5.1 Motivation

As described in Chapter 2, the applications that use multiparty communication vary widely and include teleconferencing, computer-supported cooperative work, distributed interactive simulation, remote teaching, tele-gaming, replicated file servers, parallel database search, and distributed parallel processing. Such applications have widely disparate needs with respect to network services. Among such services is the MC protocol itself, which defines the set of rules and conventions by which MCs are constructed and maintained, and which is executed among processing entities within a communications network.

We have discussed in Chapter 2 three major MC topology types, namely SSTs, SRTs, and ROSTs. Even for a fixed topology type, different topology computation algorithms could be used, depending on the relative importance of various performance criteria, such as bounds on transmission delays, network resource consumption, multicast packet loss rate, and so forth. Many existing multicast protocols can be considered distributed implementations of one, or a small set of, MC topology computation algorithms. For example, the MOSPF and DVMRP protocols implement distributed source-rooted tree algorithms that minimize transmission delays, whereas the CBT protocol implements a particular shared-tree algorithm first described in [8]. However, the emerging demand for routing based on quality-of-service (QoS) [58] has stimulated the development of many other MC topology computation algorithms that may be better suited for certain classes of applications [10, 34, 35, 59, 60]. Many such algorithms have not been incorporated into current multicast protocols.
The diversity of MC topology types demanded by multiparty communication applications, and the wide variety of MC topology computation algorithms designed for different performance criteria, give rise to the following question: Is it possible to develop an MC protocol "chassis," that is, a framework that is able to accommodate multiple existing, and future, MC topology algorithms? The ongoing development of new service models (available bit rate, controlled load, quality-of-service, and so on) further emphasizes the need for such a generic MC protocol. The main contribution of this chapter is to demonstrate that such a challenging goal is achievable in networks based on LSR.

In this chapter, we propose an LSR-based MC protocol, called the generic MC (GMC) protocol, which can be used as a distributed implementation of "any" MC topology algorithm. The GMC protocol extends LSR by including information about MCs in the network images maintained at switches. MC topology and membership information are broadcast throughout the network by means of extended LSAs. We emphasize that the GMC protocol is intended to construct and maintain MCs among switches, rather than hosts. As discussed in Section 2.2.2, a host in a network is attached to one or more switches, called the ingress switches of the host, and uses a local membership management protocol, such as IGMP [7], to inform its respective ingress switch of MC membership status. When one or more attached hosts of a switch are interested in an MC, the switch is said to be a member switch of the MC. Figure 5.1 shows the same MC as depicted in Figure 2.1(b), complete with member hosts.

Figure 5.1: Example MC showing member switches and attached hosts. (Legend: connection link; connection member switch; intermediate switch; connection member host; other host.)

The primary task of the GMC protocol is to keep MC images consistent and up-to-date, while incurring minimum protocol overhead.
Since both network topology information and MC image information are available at all switches, any method of computing MC topologies can be used. Indeed, the topology computation algorithm is a "plug-in" component of GMC, rather than an inherent part of the protocol. In addition to its ability to support multiple MC types and topology algorithms, the GMC protocol exhibits the following properties:

1. The protocol is free from routing loops. Due to delays in the dissemination of changes in network status, the participating switches in an MC protocol may have inconsistent knowledge of the network for short periods of time. Many multicast protocols produce transient routing loops under such circumstances [5, 6, 4, 2]. Routing loops, even temporary ones, may introduce network congestion under conditions of heavy traffic. As we will demonstrate later, the GMC protocol avoids routing loops entirely.

2. The protocol is robust. Being a link-state routing protocol, the GMC protocol has the intrinsic advantage of fault tolerance. The protocol handles faulty components in the network through topology computations that are triggered by link/nodal events. In fact, the protocol survives network partitioning, and is able to construct correct MC topologies after re-unification. Further, we will show that the GMC protocol can survive memory overflow problems at switches: given an MC, the protocol will be able to construct the MC as long as at least one member of the MC does not purge the image of the MC indefinitely.

3. The protocol exhibits a low level of computational redundancy compared to existing LSR-based MC solutions. As we will discuss in Section 5.2, LSR-based MC solutions, such as the MOSPF protocol, may perform identical topology computations at multiple switches, incurring the problem of computational redundancy.
Since an MC topology computation is typically a non-trivial task (for example, many of the Steiner tree heuristics are of O(N^2) or O(N^3) complexity), the GMC protocol is designed to minimize the number of topology computations. We will show via a simulation study that the performance of GMC in terms of computational overhead compares favorably with that of MOSPF.

Given the availability of network status information at all network switches, LSR provides a solid foundation for developing an MC protocol chassis such as GMC. While the idea of LSR-based MC protocols is not new, previous solutions have not achieved the versatility that is dictated by the diversity of multiparty communication applications. This chapter is a "proof of concept" that, at least in LSR-based networks, such a generic MC protocol can be constructed and can operate efficiently.

The remainder of this chapter is organized as follows. Section 5.2 discusses various background subjects as well as related work. Section 5.3 describes the design and operation of the GMC protocol. We prove in Section 5.4 the correctness of the GMC protocol. Section 5.5 presents the results of a simulation study, in which the behavior of the GMC protocol is evaluated under various workloads. A summary of this chapter is given in Section 5.6.

5.2 LSR-Based Multipoint Connections

As described in Chapter 2, switching elements in LSR-based networks use LSAs to advertise local status information. This approach can be extended to support multiparty communication by distributing MC membership information in LSAs [1]. That is, whenever a switch wants to join or leave an MC, a membership-event advertisement for the connection is flooded through the network. Such an advertisement should contain at least the ID of the connection and the address of the source switch. Switches in the network collect these advertisements and maintain member lists and MC topology information for all active MCs.
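The member-list maintenance just described can be sketched as follows; the data structures and names are illustrative, not taken from any particular protocol specification.

```python
def apply_membership_event(member_lists: dict, mc_id: str,
                           switch_id: str, joining: bool) -> None:
    """Sketch of how a switch processes a flooded membership-event
    advertisement, which carries at least the connection ID and the
    address of the source switch; every switch keeps one member list
    per active MC."""
    members = member_lists.setdefault(mc_id, set())
    if joining:
        members.add(switch_id)
    else:
        members.discard(switch_id)
        if not members:
            del member_lists[mc_id]  # drop local state for an empty MC

lists = {}
apply_membership_event(lists, "MC-1", "A", joining=True)
apply_membership_event(lists, "MC-1", "B", joining=True)
apply_membership_event(lists, "MC-1", "A", joining=False)
print(lists)  # -> {'MC-1': {'B'}}
```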
The differences among LSR-based MC protocols lie in how topology computations are triggered. The MOSPF protocol is an extension of the unicast OSPF protocol [11]. In the MOSPF protocol, the addresses of the hosts listening to a multicast address are broadcast in group-membership LSAs, and routers maintain complete member lists for all active multicast addresses. However, a router does not compute the topology of a multicast connection until it actually receives a datagram destined for the corresponding multicast address. Upon receiving a datagram for multicast address M, the router consults its local database for the member list of M and computes a shortest-path tree, rooted at the source of the datagram, that reaches all hosts listening to M. The router then saves this topology information in a routing cache and forwards the datagram along the appropriate outgoing links. This forwarding triggers further topology computations at downstream routers. Moreover, multicast routing entries created in this process must be cleared upon the arrival of LSAs that advertise membership or network changes, resulting in the repetition of the process when subsequent multicast datagrams arrive.

The MOSPF approach (on-demand, data-driven topology computations) is well suited to the construction of source-rooted trees. However, this method has limitations in other contexts. First, if the MC is an ROST, it is independent of the nodes sending to that MC; its topology computations cannot be triggered by packets from senders, but rather depend on the actions of receivers. Second, MOSPF performs identical topology computations at all members and intermediate nodes of an MC. This computational redundancy could produce heavy workloads at switches, given the high cost of topology computations. (MOSPF uses Dijkstra's shortest-path algorithm, which exhibits complexity O(N^2), where N is the number of switches in a network.)
Third, the MOSPF protocol requires the availability of MC membership information at a router to compute the topology of an MC. Losses of such information (for example, due to memory overflows) could lead to improper protocol operation. To summarize, while the MOSPF protocol serves some multicast scenarios well, it may not possess the efficiency and flexibility to accommodate many current and future distributed applications.

5.3 The GMC Protocol

The GMC protocol extends LSR by incorporating MC images at network switches. The essence of the protocol is to maintain consistency of MC images throughout the network. In the presence of group membership dynamics or changes to network topology, an MC protocol must update the MC topologies that are affected by those events. The GMC protocol uses an event-driven approach to this problem: the switches that detect events are required to compute and advertise new MC topologies. For example, a switch that detects a "link-down" event suggests alternative topologies for any MCs that were using the malfunctioning link. Similarly, a switch that changes its membership status with respect to an MC implicitly "detects" a membership change event, and suggests a new topology for that MC. Other switches in the network receive the topology proposal and, if it is accepted, modify MC links and/or routing entries accordingly.

5.3.1 Design Issues

A major problem that the GMC protocol must solve is the proposal of multiple, inconsistent MC topologies by switches that detect different events at nearly the same time. An example of this problem is illustrated in Figure 5.2. The example begins with the network and MC configuration depicted in Figure 5.2(a), where nodes A, B, and C are members of an MC. Let us assume that switches D and E request to join the MC at approximately the same time.
Without knowledge of each other's intentions, switch D sees member list (A, B, C, D) and proposes a topology spanning those four nodes, while switch E sees member list (A, B, C, E) and proposes a different topology (see Figure 5.2(b)). If updates to routing table entries are not handled properly, the two inconsistent proposals could result in the routing loop shown in Figure 5.2(c).

In the GMC protocol, the inconsistent-proposal problem can be detected by embedding membership information in MC topology proposals. In the above scenario, for example, if node D notices that it is absent from E's topology proposal, and node E notices a similar flaw in the proposal from D, then both of them will subsequently compute new (and correct) proposals. We will show in the formal presentation of GMC that, while this example concerns membership inconsistency problems, the same method is used to cope with inconsistency problems created by simultaneous network topology status changes.

During a busy period when multiple events take place concurrently, multiple proposals may be suggested and flooded through the network. Although some of these proposals are more up-to-date than others, the underlying flooding mechanism has no such knowledge and may deliver proposals in any order.

Figure 5.2: Problem created by inconsistent topology proposals. (b) Switches D and E request to join, and propose inconsistent topologies; (c) potential routing loop.

Consider the example shown in Figure 5.3, which continues the scenario in Figure 5.2. Here, we assume that switch F also requests to join the MC, after receiving the proposals from D and E. Figure 5.3(a) depicts the MC topology suggested by F. As shown, this proposal contains up-to-date membership information and should override proposals from other switches.
It is possible, however, for switch A to receive F's proposal before receiving the earlier ones (perhaps because A ignored the proposal advertisements from D and E due to a lack of buffer space, but later recovers and is able to accept the proposal from F). The proposals from D and E will eventually arrive at A by means of retransmission, incorrectly overriding the up-to-date MC image already established at A (see Figure 5.3(b)). The GMC protocol uses the well-known timestamp technique [61] to resolve this proposal ordering issue.

Figure 5.3: The topology ordering problem. (a) An up-to-date proposal from F; (b) a hypothetical configuration at A. If F's proposal is received before those of D and E, these obsolete proposals will override the up-to-date MC image at A.

Another desirable property of MC protocols is freedom from temporary routing loops. In the previous example, even if inconsistent proposals are detected and eventually resolved, any routing loop, however transient, can quickly lead to traffic congestion if heavy traffic loads are placed on the MC during that period. In the GMC protocol, a topology proposal is uniquely identified by its source switch ID and its timestamp value. This stamp-source pair serves as the ID of an MC topology. To prevent loops, the two switches at the ends of an MC link exchange the IDs of their local MC images before establishing the link as part of the MC. Only if the two IDs are identical will the MC link be established. Using this check, MC links, such as those in the loop of Figure 5.2(c), cannot all be permitted, because somewhere in the loop two adjacent nodes must have different MC topologies, and hence, different topology IDs.

5.3.2 Protocol Overview

With the above design issues in mind, the operation of the GMC protocol can be summarized as follows.

- Every switch x maintains a timestamp R_{x,m} for every active MC m.
The value of this timestamp is set to the largest timestamp value among the received LSAs relating to m. The switch will ignore topology proposals about the MC m with stamp values less than or equal to R_{x,m}.

- Every switch x maintains a mailbox for every active MC. The mailbox stores received, but not yet processed, LSAs that are relevant to the MC.

- When the switch detects a local event that affects the MC m (for example, the switch changes its membership status with respect to m or detects failure of an incident link that is used by m), the switch creates and floods an event LSA, which describes the event. The LSA may also contain a new topology proposal if the mailbox for m is empty. (There is no reason to compute a new topology for m if information regarding m from other switches is yet to be processed.)

- When the switch receives a topology proposal P for m, it checks the proposal for consistency problems. The receiving switch checks only "local" inconsistencies; that is, it checks whether its own membership status in P conforms with its current membership status and whether P includes any malfunctioning incident links of the switch. If a local inconsistency is detected, the switch objects: it computes and advertises a new topology proposal. LSAs that carry proposals produced in this manner are called triggered LSAs.

- A topology proposal P is accepted at a switch x if the switch finds no local inconsistency for P and if the timestamp of P is greater than R_{x,m}.

- To prevent the GMC protocol from being overly reactive to bursts of events, topology computations are subject to a hold-down period. The hold-down period guarantees that successive topology computations must be at least Δt seconds apart. Assuming that the current image for the MC m at switch x was received or computed at time a, and that the switch is ready to compute a new topology at time b, where b − a < Δt, the switch sets up a timer, called TC-TIMER (TC stands for Topology Computation), with length Δt − b + a.
The postponed topology computation is resumed if no locally consistent topology proposals are received before the timer fires. The proper choice of the Δt value is a subject of our performance study in Section 5.5.

5.3.3 GMC LSA Format

Before we present the details of the GMC protocol, we must define the format of LSAs. We use the term non-GMC LSA to refer to an advertisement produced and processed by the underlying unicast LSR protocol, and the term GMC LSA to refer to an advertisement produced by the GMC protocol. In a network comprising n switches, a non-GMC LSA is a tuple (S, seq, F, D), where S is the source of the LSA, seq is the sequence number of the LSA, F indicates that the LSA is used for the advertisement of a link/nodal event, and D encodes a description of the event. For example, a description of a link-down event must include at least the two end switches of the link. The exact format of link/nodal event descriptions is defined by the underlying unicast LSR protocol, and is not discussed further.

A GMC LSA is a tuple (S, seq, F, V, G, P, T), where S ∈ {0, 1, ..., n − 1} is the source address of the LSA, seq is the sequence number of the LSA, F with value gmc identifies this LSA as a GMC LSA, V ∈ {join, leave, link, none} specifies an event from the source switch S, G identifies the MC to which this LSA is relevant, P is either a topology proposal for G or the member list of G, and T is a timestamp.

An event of type "link" in a GMC LSA indicates that a link/nodal event affects the topology of an MC. Specifically, a link/nodal event will cause the unicast LSR protocol to produce exactly one non-GMC LSA and will cause the GMC protocol to produce k GMC LSAs, where k is the number of MCs whose topologies are affected by the event. We use the configuration shown in Figure 5.4 to illustrate.
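The GMC LSA tuple (S, seq, F, V, G, P, T) defined above can be modeled concretely. The following Python dataclass is an illustrative sketch; the class name GmcLSA and the field types are our own assumptions, not part of the protocol definition.

```python
# Hedged sketch of the GMC LSA tuple (S, seq, F, V, G, P, T).
from dataclasses import dataclass

@dataclass
class GmcLSA:
    S: int      # source switch address, in {0, 1, ..., n-1}
    seq: int    # sequence number
    F: str      # "gmc" identifies this as a GMC LSA
    V: str      # event: "join", "leave", "link", or "none"
    G: int      # ID of the MC this LSA concerns
    P: object   # topology proposal for G, or the member list of G
    T: int      # timestamp

# A hypothetical join advertisement from switch 4 for connection 7.
lsa = GmcLSA(S=4, seq=12, F="gmc", V="join", G=7, P=[4, 1, 2], T=9)
print(lsa.V, lsa.T)  # join 9
```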
Let us assume that the following events occur: switch X intends to join connection C1, switch E wishes to leave connection C2, and switch F detects the failure of the (F, B) link. As shown in Figure 5.5, the three events trigger five advertisements: one for the join event, one for the leave event, and three for the link event.

Given a link/nodal event that occurs at switch F, we assume that switch F floods the single corresponding non-GMC LSA before flooding the corresponding GMC LSAs. We further assume that the sequence number of a non-GMC LSA will be smaller than those of the corresponding GMC LSAs. A switch that receives an LSA out of order will not process it until the switch has received the preceding LSAs. Therefore, at any receiving switch, the processing of the non-GMC LSA (by the unicast LSR protocol) will precede the processing of the k GMC LSAs (by the GMC protocol). Hence, the event advertised in the non-GMC LSA will have been incorporated in the local network image at the switch before the ensuing GMC LSAs are used to update MC images.

Figure 5.4: A network/MC configuration. (Legend: C1 member; C2 member; C1 link; C2 link; link used by C1 and C2.)

As demonstrated in the previous example, the GMC protocol produces a set of GMC LSAs that disseminate all events relevant to an MC. (For example, in Figure 5.5, the protocol produces two GMC LSAs for connection C1: one for the join of X and another for the failure of the (F, B) link.) Therefore, the algorithms of the GMC protocol can be presented in a per-MC manner without loss of generality.

5.3.4 Data Structures and Protocol States

Besides the aforementioned timestamp R_{x,m}, every switch x in the network maintains a variable last_tc_time_{x,m} (last topology computation time) for each MC m, and uses a mailbox_{x,m} to store incoming GMC LSAs regarding m. Every switch x in the network also maintains a local image for each MC m, denoted by Image[x, m].
An MC image includes the topology of the MC (denoted by P(Image[x, m])), the ID of the switch that proposed the topology (denoted by S(Image[x, m])), and the timestamp of the topology (denoted by T(Image[x, m])). The switch x further maintains a list of MC members, denoted by Members[x, m], and a real-time clock, clock[x]. In the following discussion, subscripts and indices in this notation may be omitted, if they are clear from the context.

Figure 5.5: Events and advertisements in the GMC protocol.

With respect to an MC m, the GMC protocol at a switch can be in one of the four states shown in Figure 5.6: EVENT-HANDLING, RECEIVING-LSA, DELAYED-TC, or IDLE. Initially, the GMC protocol is in the IDLE state. Whenever an event relating to m occurs at a switch, the switch moves into the EVENT-HANDLING state and invokes its EventHandler routine. This routine creates and floods an event LSA, which describes the event and may also contain a new topology proposal. Whenever GMC LSAs are present in the mailbox of m at a switch, the switch enters the RECEIVING-LSA state, invokes the ReceiveLSA routine to process the incoming LSAs, and checks for inconsistency problems before accepting the topology proposal, if present, in the LSA. After the completion of either the EventHandler or the ReceiveLSA routine, the switch returns to the IDLE state.

Figure 5.6: The state-transition diagram of the GMC protocol. (IDLE is the initial state; transitions are labeled "TC-TIMER goes off," "EventHandler() done," "ReceiveLSA() done and no local events," and "TCTimerHandler() done.")
When a hold-down timer fires, the switch enters the DELAYED-TC (Delayed Topology Computation) state and invokes the TCTimerHandler routine. When local events, LSA arrivals, and timer firings occur simultaneously, the EVENT-HANDLING state has priority, followed by the RECEIVING-LSA state.

5.3.5 Protocol Algorithms

We are now ready to describe the algorithms for EventHandler, ReceiveLSA, and TCTimerHandler. In the following, we assume that the floodings of LSAs are reliable and that LSAs from the same switch are ordered by sequence number. Reliable flooding can be implemented using either a reliable hop-by-hop protocol, or by periodic re-flooding [12]. Different reliability mechanisms and flooding algorithms affect the timing behavior of the GMC protocol, but do not affect its correctness.

At a switch x and given an MC m, the GMC algorithms share the data structures described in the previous section. (Additional variables shared by these algorithms, such as the make_proposal_flag variable, will be introduced later.) We point out that simultaneous accesses to shared data structures and variables cannot occur, because at any moment in time the GMC protocol is in exactly one of the four states shown in Figure 5.6, and will leave the current state only if the corresponding routine is completed.

The EventHandler algorithm is given in Figure 5.7. The algorithm is presented in a per-MC manner; that is, when an event occurs, this routine is invoked for every connection affected by the event. This protocol entity is responsible for the generation of GMC LSAs only; the non-GMC LSA resulting from a link/nodal event is generated and flooded by the underlying unicast protocol. In Figure 5.7, the local switch is identified by parameter x, the event is given in parameter ev, and the affected connection is given by parameter m.
The EventHandler may be invoked because of membership change events (that is, when switch x joins or leaves the MC m), or link state events that affect the MC (for example, an incident link that is used by the MC fails). In both cases, the routine advances the timestamp R of m (line 1), updates the MC member list when necessary (lines 2-4), and computes a new MC topology P for m (line 6), if such an action is not prohibited by a hold-down period. If a new topology is not computed due to the hold-down period, then the TC-TIMER is set up at line 9 to defer the computation to TCTimerHandler (if the timer is already in use, line 9 restarts the timer). Even if the computation at line 6 is performed, the result P may be obsolete after the completion of the computation, due to the arrivals of new GMC LSAs regarding m. If P remains up-to-date after computation, it is flooded throughout the network (line 14) and accepted at x itself (by calling an auxiliary routine, AcceptTopology, at line 15). When the proposing of a topology is postponed due to either the hold-down period or obsolescence, the EventHandler at line 17 floods the event ev with a member list of m, rather than an MC topology, and defers to the ReceiveLSA routine to make sure that a correct MC image is eventually established. This information is passed to ReceiveLSA by setting a shared variable, make_proposal_flag, equal to TRUE (line 18). As with other GMC variables, at each switch x there is one make_proposal_flag variable for each MC m.

The AcceptTopology algorithm, shown in Figure 5.8, registers an MC topology P into the local database of the invoking switch x, and attempts to establish incident MC links according to the new topology. The local MC image Image, including the topology, the source switch ID, and the timestamp, is updated at lines 1 to 3. The routine then tries to establish MC links that are defined in P and incident to x.
Algorithm: EventHandler
Input: switch ID x, event ev, and connection m
1: R = R + 1.
2: IF (ev is for membership status change of x)
3:   Update Members(m) accordingly.
4: ENDIF
5: IF (clock − last_tc_time > tc_holddown),
6:   Compute a new topology proposal P for the connection m.
7:   last_tc_time = clock.
8: ELSE
9:   Set the TC-TIMER to value tc_holddown − clock + last_tc_time.
10: ENDIF
11: IF (a new topology P is computed) and
12:    (no LSAs for m received during the computation),
13:   /* proposal is still valid */
14:   Flood the GMC LSA (x, gmc, ev, m, P, R).
15:   AcceptTopology(x, m, P, R, x).
16: ELSE /* flood event but defer to ReceiveLSA to make proposal */
17:   Flood the GMC LSA (x, gmc, ev, m, Members(m), R).
18:   make_proposal_flag = TRUE.
19: ENDIF

Figure 5.7: The algorithm for EventHandler.

As described earlier, an MC link (x, y) is established only if the MC image at the neighboring switch y has a source switch ID and a timestamp identical to those at x (lines 6-9). Before completion, the routine sets the make_proposal_flag variable to FALSE, and records the current time in last_tc_time.

The algorithm for the ReceiveLSA routine is given in Figure 5.9. Parameter x identifies the local switch, and parameter m specifies the MC. The routine is invoked when the switch enters the RECEIVING-LSA state, that is, when there is at least one LSA in the mailbox for connection m. For every such LSA ℓ, ReceiveLSA updates the local member list of connection m if the event in ℓ is about a membership change (line 3). Next, the routine checks for inconsistency problems in the LSA and records the result in a variable, my_status_consistent (lines 4-9). As mentioned earlier, the switch x is only interested in local consistency; that is, the received LSA must contain correct membership information with respect to x (line 5) and any topology proposal in the LSA must not use any malfunctioning links that are incident to x (line 4).
Algorithm: AcceptTopology
Input: switch ID x, connection m, topology P, stamp T, and source S.
1: P(Image[x][m]) = P.
2: S(Image[x][m]) = S.
3: T(Image[x][m]) = T.
4: FOR (every link t in P that is incident to x) DO
5:   Let t be an (x, y) link.
6:   Exchange messages with y to learn S(Image[y][m]) and T(Image[y][m]).
7:   IF (S(Image[x][m]) = S(Image[y][m])) and (T(Image[x][m]) = T(Image[y][m]))
8:     Establish the (x, y) link for connection m.
9:   ENDIF
10: ENDDO
11: make_proposal_flag = FALSE.
12: last_tc_time = clock.

Figure 5.8: The algorithm for AcceptTopology.

Next, the routine decides if the LSA can be accepted (lines 10-13). For an LSA ℓ to be accepted, it must include a topology proposal that is more recent than the local one and that is locally consistent. The LSA ℓ is more up-to-date than the local MC image at x if it is tagged with a larger timestamp value; a tie in the timestamp comparison is resolved by the values of the source switch IDs (line 12). If the LSA is accepted, then the AcceptTopology routine is invoked to update the local MC image (line 14), and the make_proposal_flag for connection m is set to FALSE (line 15), since an up-to-date topology for the connection has been accepted. Otherwise, the switch checks whether its local status is consistent with the received topology proposal (line 17). If not, then the switch plans to construct a new topology proposal by setting its make_proposal_flag variable to TRUE (although it may need to process additional LSAs first). To conclude the processing of the current LSA ℓ, the ReceiveLSA routine advances the R timestamp for MC m to be at least as large as that of ℓ (line 19). Since the R timestamp will be advanced again before the switch x proposes and floods any topology in the future (line 1 of EventHandler, line 31 of ReceiveLSA, and line 8 of TCTimerHandler), the advancement at line 19 ensures that subsequent topology proposals will be tagged with timestamps larger than anything x has received.
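The acceptance test at lines 10-13 of ReceiveLSA can be condensed into a single predicate. The sketch below is illustrative only; the function name and argument layout are our own, not part of the protocol specification.

```python
# Hedged sketch of the ReceiveLSA acceptance test: a proposal is accepted
# only if it is locally consistent, its timestamp is at least R, and its
# (timestamp, source) pair exceeds that of the local MC image.
def accept_proposal(lsa_T, lsa_S, image_T, image_S, R, locally_consistent):
    if not locally_consistent or lsa_T < R:
        return False
    # A tie in the timestamp comparison is resolved by the source switch ID.
    return (lsa_T, lsa_S) > (image_T, image_S)

print(accept_proposal(9, 4, 9, 2, 8, True))    # True: same stamp, larger source
print(accept_proposal(9, 1, 9, 2, 8, True))    # False: loses the tie-break
print(accept_proposal(10, 1, 9, 2, 11, True))  # False: stamp below R
```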
Algorithm: ReceiveLSA
Input: switch ID x, connection ID m.
1: WHILE (there are LSAs for connection m in mailbox)
2:   Get next LSA ℓ = (S, gmc, V, m, P, T).
3:   Update member list of m accordingly, if V is for membership update.
4:   IF (for every link t used in P that is incident to x, t is ON) and
5:      (the membership of x in P(ℓ) is consistent with that in Members(m)),
6:     my_status_consistent = TRUE.
7:   ELSE
8:     my_status_consistent = FALSE.
9:   ENDIF
10:  IF (P(ℓ) is a topology proposal) and
11:     (T(ℓ) ≥ R) and
12:     (T(ℓ) > T(Image), or (T(ℓ) = T(Image) and S(ℓ) > S(Image))) and
13:     (my_status_consistent = TRUE),
14:    AcceptTopology(x, m, P(ℓ), T(ℓ), S(ℓ)).
15:    make_proposal_flag = FALSE.
16:  ELSE
17:    make_proposal_flag = TRUE, if (my_status_consistent = FALSE).
18:  ENDIF
19:  R = max{R, T(ℓ)}.
20: ENDWHILE
21: IF (make_proposal_flag = TRUE)
22:   IF (clock − last_tc_time > tc_holddown),
23:     Compute a new topology P for the connection m.
24:     last_tc_time = clock.
25:   ELSE
26:     Set up the TC-TIMER with length tc_holddown − clock + last_tc_time.
27:   ENDIF
28:   IF (a new topology P is computed) and
29:      (there are no LSAs in mailbox for connection m) and
30:      (no local events for connection m queued at x),
31:     R = R + 1.
32:     Flood (x, gmc, none, m, P, R).
33:     AcceptTopology(x, m, P, R, x).
34:   ENDIF
35: ENDIF

Figure 5.9: The algorithm for ReceiveLSA.

After consuming all the LSAs in the mailbox, the ReceiveLSA routine decides whether a new proposal should be computed, depending on the value of the make_proposal_flag variable (line 21) and the hold-down mechanism (line 22). If a topology is computed at line 23, two conditions must be satisfied before the proposal is actually flooded: 1) no new GMC LSAs arrive during the computation period (line 29), and 2) no local events take place during the period (line 30). If the proposal is still up-to-date at the end of the computation, then it is flooded to the other switches and accepted locally (lines 32-33).
Otherwise, it is withdrawn and the make_proposal_flag remains TRUE, indicating the lack of an up-to-date MC image for m at switch x. In the case where the topology computation is held down, the TC-TIMER is set up (or restarted, if it is already in use) at line 26 to defer the computation to TCTimerHandler.

The algorithm for the TCTimerHandler routine is given in Figure 5.10. Again, parameter x identifies the local switch, and parameter m specifies the involved MC. Before resuming a postponed topology computation, the routine first checks if this computation is still needed. The computation may no longer be necessary because, during the hold-down period, topology proposal(s) may have been received and accepted (hence setting the make_proposal_flag to FALSE), or there may be pending "news" about the MC (GMC LSAs in the mailbox or events in the event queue). Similar to the previous routines, the new topology P is actually flooded only if no further news about the MC is observed during the computation period.

Algorithm: TCTimerHandler
Input: switch ID x and connection m.
1: IF (make_proposal_flag = TRUE) and
2:    (there are no LSAs in mailbox for connection m) and
3:    (no queued events for connection m),
4:   Compute a new topology P for the connection m.
5:   last_tc_time = clock.
6:   IF (there are no LSAs in mailbox for connection m) and
7:      (no queued events for connection m),
8:     R = R + 1.
9:     Flood (x, gmc, none, m, P, R).
10:    AcceptTopology(x, m, P, R, x).
11:    make_proposal_flag = FALSE.
12:  ENDIF
13: ENDIF

Figure 5.10: The algorithm for TCTimerHandler.

5.3.6 MC Creation and Destruction

The creation and destruction of an MC require no special mechanisms. When the first member of an MC advertises its presence, the other switches allocate the necessary data structures for the MC and accept the topology proposal contained in the advertisement. When a switch detects an empty member list for an MC, the local data structures corresponding to the MC are deleted.
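The hold-down rule shared by EventHandler (lines 5-9), ReceiveLSA (lines 22-26), and TCTimerHandler can be captured in a small helper. This Python sketch is illustrative; the function name holddown_decision is our own.

```python
# Hedged sketch of the hold-down rule: successive topology computations must
# be at least tc_holddown seconds apart; otherwise a TC-TIMER defers the work.
def holddown_decision(clock, last_tc_time, tc_holddown):
    """Return ('compute', None) or ('defer', timer_length)."""
    if clock - last_tc_time > tc_holddown:
        return ("compute", None)
    # TC-TIMER length tc_holddown - clock + last_tc_time, as in line 9 of
    # EventHandler and line 26 of ReceiveLSA.
    return ("defer", tc_holddown - clock + last_tc_time)

print(holddown_decision(clock=20.0, last_tc_time=5.0, tc_holddown=10.0))
# ('compute', None)
print(holddown_decision(clock=12.0, last_tc_time=5.0, tc_holddown=10.0))
# ('defer', 3.0)
```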
5.4 Proof of Correctness

In this section, we formally show the correctness of the GMC protocol in two steps. In the first step, we consider the correctness of the protocol without memory shortage problems (that is, operational switches will not lose MC images). Under this assumption, we will show that, given a finite set of events, the algorithm will reach consensus about MC images among network switches by producing a finite number of LSA broadcasts. (Our simulation results, presented in Section 5.5, show that in practice the number of LSAs per event is likely to be small.) In the second step, we describe (minor) extensions to the GMC protocol to handle losses of GMC data structures, and establish a sufficient condition for the GMC protocol to work correctly in the presence of such switch memory overflows. For clarity, the discussion in this section is in terms of a single MC.

As illustrated in Figure 5.5, the GMC protocol, when given a set of events Π (link-state and/or MC membership changes), produces a set of GMC LSAs, L_m, exclusively for every MC m. Thus, the protocol activities associated with different MCs proceed independently; herein lies the generality of a proof regarding a single MC.

5.4.1 Correctness without Memory Overflows

Under the assumption that switches, unless crashed, will not lose MC images, the GMC protocol proceeds as described in Section 5.3. In the following discussion, we assume a finite set of events, denoted Π, that does not leave the network permanently partitioned. Our goal is to show that such an event set will not lead to infinitely looping GMC activities. We point out that temporary partitioning could be produced by such an event set Π; all but permanent partitions are handled by GMC.

Lemma 1 Given an MC m and a set of events Π as defined above, the GMC protocol produces a finite set of LSAs, L_m.
Proof: Since flooding operations are assumed to be reliable (see Section 5.3), there exists a time τ by which all the events in Π are learned by all switches. (If Π incurs temporary partitioning, reliability of flooding can be enforced by periodic re-flooding.) Any GMC LSA produced after time τ will incorporate the changes in Π. Such an LSA will not be objected to by any other switch (that is, the corresponding my_status_consistent values will be TRUE at all switches), and hence will not trigger any additional LSAs. Since an LSA must require a minimum time Δt to construct, we see that the GMC protocol is able to produce only a finite number of LSAs by time τ, and hence the set L_m must be finite. □

In the following discussion, a GMC LSA is said to be (locally) consistent at a switch y if the ensuing my_status_consistent value is TRUE at y. Also, we will drop the subscript m in the notation L_m, since all the discussions are about an MC m that has at least one member join event in Π (otherwise, the MC is inactive and is not relevant).

Definition 1 We denote by ℓ_max the LSA in L that has the maximum timestamp-source pair (T(ℓ_max), S(ℓ_max)).

The concept of the maximum element in L is well defined because the set L cannot be empty; at least one GMC LSA is generated for the member join assumed above. We will see that all network switches will accept ℓ_max, and the topology contained in ℓ_max will be the consensus among all network switches, due to the two properties stated in the subsequent lemmas. Recall that S(ℓ) and T(ℓ) are the source and timestamp of an LSA ℓ.

Lemma 2 The LSA ℓ_max includes a topology proposal.

Proof: Let us assume the opposite. A GMC LSA that does not contain a topology proposal must be produced by the EventHandler routine at line 17, a scenario that occurs when there are incoming GMC LSAs during the topology computation of ℓ_max.
If this happens to ℓ_max, the value of R at switch S(ℓ_max) at this moment is T(ℓ_max), and the make_proposal_flag variable is set to TRUE. Consider the GMC protocol activities at S(ℓ_max) after the production of ℓ_max (a GMC activity is an invocation of EventHandler, ReceiveLSA, or TCTimerHandler). The ReceiveLSA routine must be invoked at least once, to process the LSA(s) that arrived during the processing of ℓ_max. We exclude the possibility of post-ℓ_max events occurring at S(ℓ_max); otherwise, further invocations of EventHandler would advance the timestamp R, and flood LSAs with timestamps greater than that of ℓ_max, a contradiction to the choice of ℓ_max. Therefore, during the post-ℓ_max activities at S(ℓ_max), one of the following must happen: at least one LSA with a timestamp-source pair greater than (T(ℓ_max), S(ℓ_max)) is received (and accepted), or the TRUE value in the make_proposal_flag variable forces ReceiveLSA (or TCTimerHandler, if required by a hold-down period) to compute and flood at least one topology with the timestamp value R advanced. Since both cases imply the existence of timestamp-source pairs larger than (T(ℓ_max), S(ℓ_max)), they lead to contradictions to the choice of ℓ_max, completing the proof. □

Lemma 3 The topology P(ℓ_max) is consistent at all network switches.

Proof: Suppose to the contrary that ℓ_max is detected to be inconsistent at some switch y. In response to this situation, switch y sets its make_proposal_flag to TRUE (at line 17 of ReceiveLSA). In the meantime, the R variable at y is advanced to T(ℓ_max) (line 19 of ReceiveLSA), the maximum timestamp value in L, prohibiting subsequent LSAs from being accepted at y (line 11). After y's receipt of ℓ_max, there can be no local events at y; otherwise, EventHandler would produce GMC LSAs with timestamps larger than or equal to T(ℓ_max) + 1, a contradiction to the choice of ℓ_max. The TRUE value of make_proposal_flag will cause new topology computations at line 23 of ReceiveLSA or line 4 of TCTimerHandler.
The results of these computations, in the absence of further local events, could be dropped in response to incoming LSAs, which spawn additional GMC activities. However, since the number of LSAs in L is finite, eventually the result of a post-ℓ_max topology computation will be flooded with timestamp R + 1 = T(ℓ_max) + 1, a contradiction to the choice of ℓ_max. □

Since the GMC LSA ℓ_max has the maximum topology ID and includes a proposal that is consistent at all operational network switches, it shall be accepted by these switches. This observation leads us to the next theorem.

Theorem 1 Given an MC m and a finite, non-partitioning set of events Π, all operational network switches will reach consensus on the MC topology with a finite number of MC LSAs.

5.4.2 The Handling of Memory Overflows

Next, we consider scenarios where one or more network switches run out of memory space and must purge some entries in their local network images, including MC-related entries. Since a switch can always compute an entirely new MC topology if it has the member list of the MC, the loss of MC topology images (that is, the P(Image) data structures) will not cause problems. Hence, we are concerned only with the loss of MC member lists. Further, we assume that switches will not purge data structures other than member lists and MC topology images. The assumption is reasonable because those two are the most space-consuming data structures used by the GMC protocol.

We use an example to illustrate the minor extension to the GMC protocol needed to handle losses of member lists. Consider a scenario where a switch x runs out of storage space and decides to purge the member list of an MC m. The lost member list can be re-constructed when a new topology proposal arrives and is accepted. Should this be the case, the temporary loss of the data structure does no harm to
The more interesting case is when switch x must propose a topology after its member list has been purged. One solution is to have x create a member list that incorporates only its own membership status (that is, a member list {x} if x is a member, or else an empty list), and propose an MC topology according to this list. The topology will be found to be inconsistent at all other switches that are members of the MC, triggering topology proposals computed at these switches.

Actually, the GMC protocol with the above revision can survive even more adverse scenarios than isolated, temporary losses of member lists. To investigate the tolerance limit of the protocol on this issue, we establish a sufficient condition for the GMC protocol to converge in the presence of member list losses.

Lemma 4 Given a set of events Π and an MC m, let M be the set of members of m after Π. If there exists at least one switch y ∈ M that does not purge the Members[m] data structure indefinitely, then the set L_m is finite.

Proof: If L_m is infinite, there must exist switches that indefinitely flood GMC LSAs pertaining to m. Let X be the set of indefinitely flooding switches, and let τ be a moment in time after y has stopped purging Members[m] and after all events in Π have been learned by all switches. Define time τ′ ≥ τ to be a moment in time after switch y has constructed the membership status about switches in X (via the infinite number of LSAs from these switches) and after all switches not in X have stopped flooding. If switch y proposes a topology after time τ′, then switches in X shall no longer produce triggered LSAs, a contradiction to the selection of X. It can be concluded that y must not be in X, and hence remains silent after time τ′. (The possibility for y to flood an LSA without a topology proposal is excluded, because LSAs without proposals must be event LSAs, which do not exist after time τ.)
If y is to remain silent, all the LSAs produced by switches in X after time τ′ must be consistent at y. To remember the fact that y is a member of m, switches in X cannot purge Members[m] after time τ′. Let time τ″ ≥ τ′ be a time after all switches in X have stopped purging Members[m], and let us consider any switch x ∈ X. When other switches in X learn the status of x at some time after τ″ (recall that x floods indefinitely, so these switches have ample opportunity to learn this information), they shall not subsequently purge it. Subsequent LSAs will then be consistent at x, so x will become silent. Therefore, x cannot be in X, a contradiction. We are done. □

The next lemma is a counterpart of Lemmas 2 and 3 combined, in the presence of member list losses.

Lemma 5 Given a set of events Π and an MC m, let M be the set of members of m after Π. If there exists at least one switch y ∈ M that does not purge the Members[m] data structure indefinitely, then the ℓ_max LSA includes a topology proposal that is consistent at all switches.

Proof: With a finite set L_m, the maximum LSA, ℓ_max, in L_m is well defined. Lemma 2 showed that the ℓ_max LSA must contain a topology proposal; otherwise, the switch S(ℓ_max) would have suggested another LSA with a timestamp larger than T(ℓ_max). That argument is independent of the issue of member list losses and therefore still holds. However, we need to consider the possibility that the topology P(ℓ_max) is based on a newly created member list, which could be incomplete when P(ℓ_max) is computed. If the member list M′ that S(ℓ_max) uses to compute P(ℓ_max) is not equal to M, it will trigger LSAs from the switches in M − M′ and M′ − M. These LSAs will be tagged with timestamps larger than T(ℓ_max), a contradiction to the selection of ℓ_max. Hence, ℓ_max must be based on a complete member list. Lemma 3 then guarantees the correctness and network-wide acceptance of its topology proposal, concluding the proof. □
Hence, no LSAs will be able to override the maximum LSA, ℓ_max, which shall be the consensus on the MC m, even in the presence of member list losses. This leads us to the following theorem.

Theorem 2 Given a set of events Π and an MC m, let M be the set of members of m after Π. If there exists at least one switch y ∈ M that does not purge the Members[m] data structure indefinitely, then the GMC protocol will achieve consensus on the topology of the MC m using a finite number of GMC LSAs.

5.5 Performance Evaluation

A major objective of the GMC protocol is to reduce the redundancy in topology computation incurred by previous LSR-based solutions, while retaining the advantages of LSR (responsiveness, fault tolerance, and so on) and supporting a variety of MC types. In situations where events are relatively sparse, when a switch detects an event, the GMC protocol suggests a new topology and advertises it in an LSA, which will be accepted by all other switches. In this case, there is only one topology computation and one flooding operation per event. This compares very favorably with the MOSPF protocol, which requires a topology computation at every switch involved in the MC. However, it is also important to study the behavior of the GMC protocol when several events occur within a short period of time, during which switches detect inconsistencies in topology proposals and are triggered to prepare and advertise their own proposals. Such situations raise the concern of cascading reactions among switches, which could erode the advantage of GMC over other approaches. A simulation study was conducted to investigate the behavior of the GMC protocol under such circumstances. The simulator is based on the CSIM simulation package [62].

5.5.1 Simulation Methodology

Each simulation session is defined by a set of parameters, including topology computation time, LSA transmission time, event generation distributions, network size, and so forth.
In this section, we discuss the selection of parameter values. We use the symbol T_c to represent the time to compute a topology, and T_f to denote the flooding diameter of the network, that is, the time to complete a flooding operation in the worst case. We define the time T_f + T_c to be a round, which, as mentioned above, is the amount of time needed to handle sparse events in the GMC protocol.

In the GMC protocol, the value of the parameter T_c may vary from MC to MC, depending on the choice of the topology computation algorithm for that MC. In this study, we assume the use of Dijkstra's shortest path algorithm, and we measured the execution times of the algorithm (using our random graphs as input) on Sun SPARC-20 workstations. The rationale behind the use of Dijkstra's algorithm is its widespread use in computing source-rooted trees [6] and its applicability to several heuristics for computing shared-tree topologies (for example, the core-based tree heuristic [8] and the KMB algorithm [31]). Further, this assumption allows us to directly compare the GMC protocol with the MOSPF protocol, which also uses Dijkstra's algorithm.

To determine T_f values, the following flooding protocol is assumed: LSAs arriving at a switch for the first time are forwarded along all incident links, except the incoming one; LSAs arriving at a switch for the second time are dropped silently; LSAs are forwarded to neighboring switches one by one. For each LSA forwarding, we used software overheads measured on the ATM testbed in our laboratory. The testbed comprises Sun SPARC-10 workstations equipped with Fore SBA-200 adapters and connected by three Fore ASX-100 switches. From these measurements, we obtained the figure of 600 μsec, which includes the overhead at both the sending and receiving switches. Networks comprising up to 400 switches were simulated. For each network size, 40 graphs were generated randomly, and two simulation sessions were conducted on each graph.
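The assumed flooding rule lends itself to a simple worst-case model. The sketch below is our own illustration (the function name and the serialized-forwarding model are assumptions, not the simulator's actual code): it estimates T_f for a given graph by charging the measured 600 μsec overhead to each one-by-one LSA forwarding.

```python
import heapq

FORWARD_COST = 600e-6  # measured overhead per LSA forwarding, in seconds


def flooding_time(adj, origin):
    """Estimate the worst-case completion time of one flooding operation.

    adj maps each switch to the list of its neighbors.  An LSA arriving at
    a switch for the first time is forwarded on all incident links except
    the incoming one; a second copy is dropped silently.  Copies are
    forwarded to neighbors one by one, so the i-th copy departs
    i * FORWARD_COST after the LSA is accepted.
    """
    accepted = {origin: 0.0}          # switch -> first-arrival time
    heap = [(0.0, origin, None)]      # (time, switch, incoming neighbor)
    while heap:
        t, sw, came_from = heapq.heappop(heap)
        if t > accepted[sw]:
            continue                  # stale queue entry for a late copy
        slot = 0
        for nbr in adj[sw]:
            if nbr == came_from:
                continue              # never echo back on the incoming link
            slot += 1
            arrival = t + slot * FORWARD_COST
            if arrival < accepted.get(nbr, float("inf")):
                accepted[nbr] = arrival
                heapq.heappush(heap, (arrival, nbr, sw))
    return max(accepted.values())     # time the last switch hears the LSA
```

On a three-switch line A-B-C, flooding from A completes after two forwarding steps, i.e., about 1.2 milliseconds under this model.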
Table 5.1 shows the characteristics of the graphs generated. In the table, maximum, minimum, and mean values are averages over the 40 graphs of that size.

  Network |        degree         |    diameter     |   T_f   |  round
  size    |  min    mean    max   | min  mean  max  | (in ms) | (in ms)
  --------+-----------------------+-----------------+---------+--------
    10    | 1.675   3.6     5.675 |  2   3.25    5  |  3.555  |  3.621
    20    | 1.25    3.573   6.725 |  4   4.75    7  |  5.123  |  5.225
    40    | 1.025   3.733   8.025 |  5   6.175   9  |  6.705  |  6.892
    60    | 1.025   3.875   8.65  |  5   6.8    11  |  7.5    |  7.776
    80    | 1       3.914   8.925 |  6   7.075   9  |  7.867  |  8.235
   100    | 1       4.12    9.675 |  6   7.15    9  |  8.1    |  8.558
   120    | 1       4.103   9.725 |  6   7.525   8  |  8.423  |  8.999
   140    | 1       4.201  10.225 |  7   7.725   9  |  8.64   |  9.308
   160    | 1       4.220  10.375 |  7   7.575  10  |  8.798  |  9.576
   180    | 1       4.307  10.5   |  7   7.75   10  |  8.9623 |  9.851
   200    | 1       4.289  10.825 |  7   7.95   10  |  9.075  | 10.078
   250    | 1       4.503  11.275 |  7   7.95   10  |  9.338  | 10.666
   300    | 1       4.704  11.725 |  7   7.975  10  |  9.465  | 11.123
   350    | 1       4.873  12.325 |  7   7.85    9  |  9.533  | 11.422
   400    | 1       5.065  12.65  |  7   7.85    9  |  9.623  | 11.829

  Table 5.1: Characteristics of randomly generated graphs.

The durations of hold-down intervals are uniformly distributed, and are selected randomly each time a hold-down timer is set up by a switch. We investigated the performance of the GMC protocol with no hold-down timers (that is, hold-down intervals of length zero), a short hold-down interval distributed from 2 to 10 rounds, a medium hold-down interval from 20 to 100 rounds, and a long hold-down interval from 200 to 1000 rounds. For example, when a round is 10 milliseconds, these interval lengths translate into 0.02 to 0.1 seconds, 0.2 to 1 second, and 2 to 10 seconds, respectively.

We are interested in two performance metrics: topology computations per event and flooding operations per event. The first metric reveals the computational overhead incurred by an MC protocol, and the second measures the communication overhead. In the GMC protocol, the two metrics are not necessarily directly proportional to one another, since computed topologies might not be flooded due to the arrival of new LSAs.
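The batching role of hold-down timers can be illustrated with a small model. The following sketch is ours, not the simulator's code, and it simplifies the protocol by using a fixed hold-down length (the simulations draw each interval uniformly at random): events that occur during a hold-down interval are deferred and covered by a single batched topology computation when the timer expires.

```python
def computations_with_holddown(event_times, holddown):
    """Count topology computations under a hold-down discipline.

    An event occurring outside any hold-down interval triggers an
    immediate computation; events occurring inside the interval are
    deferred and covered by one batched computation when the timer
    expires.  Each computation starts a new hold-down interval.
    """
    events = sorted(event_times)
    computations = 0
    timer_expiry = 0.0
    i = 0
    while i < len(events):
        # run the computation now, or when the current timer expires
        run_at = max(events[i], timer_expiry)
        computations += 1
        timer_expiry = run_at + holddown
        # every event that occurred before this computation ran is covered
        while i < len(events) and events[i] <= run_at:
            i += 1
    return computations
```

For a burst of four events at times 0, 1, 2, and 3 with a hold-down of 10, the model performs only two computations: the immediate one for the first event, and one batched computation covering the remaining three.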
5.5.2 Group Creation Periods

In the first set of experiments, we study the behavior of the GMC protocol during group creation periods. That is, we assume that a group has a predetermined start time and that a potentially large number of group members join the group at or about that time. (Such a scenario could occur when, for example, a large number of users join a live broadcast at the beginning of the broadcast.) Specifically, we assume that member arrival times are normally distributed with mean 0, the start time. We chose standard deviation values such that 99% of members arrive within a chosen interval length; specifically, we used standard deviations such that 99% of members arrive within 1 second, 10 seconds, 30 seconds, and 10 minutes, respectively. The extremely short creation periods, such as the 1-second and 10-second ones, are designed to stress the GMC protocol during very busy periods. To make the group creation periods as busy as possible, we assume that all switches are group members.

Short arrival intervals. The performance of the GMC protocol with the 1-second member arrival interval is plotted in Figure 5.11. Figure 5.11(a) plots the number of topology computations per event, and Figure 5.11(b) plots the number of floodings per event. When a large number of group members arrive within such a short period of time, cascading interactions among switches could occur if the GMC protocol reacted to events too quickly. This behavior is illustrated by the curves corresponding to no use of hold-down timers. These plots start in the vicinity of one (that is, approximately one computation and one flooding per event), because, when the number of group members is small, member join events are still relatively sparse and do not interfere with one another.
As the number of switches/members grows and join events collide with each other, these plots reach approximately 5.8 topology computations and 2.9 flooding operations per event, indicating the presence of cascading reactions among switches. However, the curves pertaining to the use of hold-down timers, even the short timer, show that the over-reaction of the GMC protocol can be curbed. With the medium hold-down interval, the number of topology proposals per event approaches zero, and the number of flooding operations per event approaches one. (The latter metric must be greater than or equal to one because the GMC protocol always advertises events immediately, with or without topology proposals.)

We note that the number of flooding operations per event when no hold-down is used is not always increasing; see Figure 5.11(b). In general, the number of flooding operations does not necessarily grow with "event density," which is determined by the size of the group when given a fixed arrival interval. This issue will be further addressed later. Results for the 10-second and 30-second creation periods are shown in Figures 5.12 and 5.13, respectively. These results are similar to those for the 1-second periods, although the values are much lower.

Figure 5.11: Performance of the GMC protocol under the 1-second arrival interval. (a) Computations per event. (b) Floodings per event.
Figure 5.12: Performance of the GMC protocol under the 10-second arrival interval. (a) Computations per event. (b) Floodings per event.

Figure 5.13: Performance of the GMC protocol under the 30-second arrival interval. (a) Computations per event. (b) Floodings per event.

10-minute arrival intervals. The performance of the GMC protocol with the 10-minute member arrival interval is plotted in Figure 5.14. With this relatively long arrival interval, the interaction between the event density and the lengths of hold-down intervals becomes clear. When the GMC protocol uses an average hold-down interval of Δt seconds, two events must be Δt apart so that their processing does not interfere with each other (otherwise, the second event will fall within the hold-down interval created by the first, forcing the GMC protocol to postpone its topology computation).
For the short hold-down interval, even the largest networks create "isolated" join events, resulting in the normal operation of the GMC protocol (that is, one topology computation and one flooding operation per event), as seen in Figures 5.14(a) and (b).

With longer hold-down intervals and larger networks, interference among the processing of events can be observed. However, this inter-event interference affects the two performance metrics differently. Considering the number of topology computations per event: the more inter-event interference, the more topology computations are suppressed, and hence the fewer topology proposals per event; see the results for the medium and long hold-down intervals in Figure 5.14(a). For flooding operations per event, however, suppressing the topology computation when an event takes place can introduce later floodings of the delayed topology proposals, resulting in more flooding operations per event. Since the number of topology computations per event decreases as the network size increases, these extra flooding operations (the ones for the delayed topology proposals) become increasingly rare. This behavior is illustrated in Figure 5.14(b) by the curve pertaining to the long hold-down interval. A similar phenomenon can be observed in Figure 5.11(b) for the no-hold-down case.

Figure 5.14: Performance of the GMC protocol under the 10-minute arrival interval. (a) Computations per event. (b) Floodings per event.

In summary, the high rates of join events during group creation periods can be very demanding of MC protocols.
Our simulation results show that, even in extremely busy periods, the GMC protocol is able to avoid cascading protocol activities by using relatively short hold-down intervals (for example, ones shorter than 0.1 seconds). With longer hold-down intervals, the GMC protocol processes bursty events effectively in batch mode, so that only one topology computation is incurred in response to multiple events. Although this simulation study targets group creation periods, the results apply to any period with high event density.

5.5.3 Normal Operations

During "normal" operation periods of multiparty communication applications, participants join and leave MCs occasionally, and MC protocols may behave differently than during busy periods, such as group creation periods. In this section, we investigate the behavior of GMC under such circumstances. We assume that inter-arrival times of events are exponentially distributed, and we set the event inter-arrival rates such that an N-switch network has approximately 0.3 × N events during a period of 3600 rounds, equivalent to one hour if a round is 10 milliseconds.

Figure 5.15: Performance of the GMC protocol in normal operations. (a) Computations per event. (b) Floodings per event.

Under these conditions, it is unlikely (but not impossible) that the processing of one event will interfere with that of the preceding or succeeding events.
For such interference to occur, very long hold-down intervals must be used. This behavior is illustrated by the results presented in Figure 5.15, in which only the long hold-down intervals produce slightly fewer than one topology computation per event and more than one flooding operation per event.

5.5.4 Comparison with the MOSPF Protocol

The efficiency of the MOSPF protocol depends on three factors: the event rate, the datagram arrival rate, and the size of the MC. In this protocol, the topology of an MC is cleared whenever an event LSA arrives, and is recomputed when the next datagram for that MC arrives. Once triggered, this computation is performed at all the switches currently involved in the connection. Consider an MC that involves k switches (members and intermediate nodes), and let p be the probability that no datagram arrives between two consecutive events; then the MOSPF protocol incurs k(1 − p) topology computations per event. The probability p is determined by the ratio of the event arrival rate to the datagram arrival rate.

Figure 5.16: Topology computations per event of the MOSPF protocol. (a) 30-second creation interval. (b) Normal operation.
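The k(1 − p) estimate can be made concrete. If we model events and datagrams as independent Poisson arrivals (our assumption, for illustration only), the probability that no datagram arrives between two consecutive events is the event rate divided by the sum of the two rates:

```python
def mospf_computations_per_event(k, event_rate, datagram_rate):
    """Expected MOSPF topology computations per event for an MC
    spanning k switches (members plus intermediate nodes).

    p is the probability that no datagram arrives between two
    consecutive events; with independent Poisson arrivals, the next
    event precedes the next datagram with probability
    event_rate / (event_rate + datagram_rate).
    """
    p = event_rate / (event_rate + datagram_rate)
    return k * (1.0 - p)


# e.g., an MC involving 100 switches, 1 event/s, 5 datagrams/s:
# p = 1/6, giving roughly 83 topology computations per event, whereas
# the GMC protocol would incur about one for such sparse events.
```

Note how the overhead rises with the datagram rate: the more often datagrams arrive between events, the more often every involved switch must recompute the topology.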
We investigated the performance of the MOSPF protocol during 30-second group creation periods and during normal operation periods, using the same event distributions as in the GMC simulations and using three datagram arrival rates: 0.2 datagrams per second, 1 datagram per second, and 5 datagrams per second. The results are presented in Figure 5.16. As we can see, with a relatively moderate data rate of 5 datagrams per second, the number of topology computations per event can grow as high as 70 during 30-second creation periods and 120 during normal operation periods. Even with the low data rate of 0.2 datagrams per second, the typical number of topology computations per event is somewhere between 4 and 5 during 30-second creation periods, and can be as high as 40 during normal operation periods. We conclude that the MOSPF protocol incurs significantly higher computational overhead than does the GMC protocol. Regarding the number of flooding operations per event, the MOSPF protocol incurs one flooding operation per event in all circumstances. Although GMC can produce a larger number of floodings, we have seen that a hold-down timer can effectively curb this number.

In summary, the GMC protocol incurs far less computational workload at switches than does the MOSPF protocol, during both group creation periods and normal operation periods. This would allow the GMC protocol to sustain a larger number of simultaneous multicast groups. The use of hold-down timers in GMC offers a tradeoff between responsiveness to events and protocol overhead. GMC's ability to survive memory shortages is another significant advantage, especially when the number of groups is large and the demands on router memory are heavy. Combining all these factors, we conclude that the GMC protocol is more efficient in supporting individual multicast groups, and scales better in the number of groups, than does the MOSPF protocol.
Further, we emphasize again that the GMC protocol can accommodate topology algorithms other than Dijkstra's shortest path algorithm.

5.6 Summary

We have developed an LSR-based generic MC protocol that can be considered a distributed implementation of any MC topology computation algorithm. Its generality stems from the availability of two pieces of information at every switch: network topology information and MC images. Moreover, this generality enables the use of a single protocol for the construction of MCs of different types, optimized for different performance criteria. The correctness of the GMC protocol is established by formal proofs, and the behavior of the protocol is studied through simulation. The results of these simulations show that the protocol is able to efficiently handle bursts of membership changes in "batch mode," dramatically reducing protocol overheads during busy periods, while retaining its event-driven nature during normal operation periods. The GMC work shows that LSR provides a solid foundation for supporting one important aspect of group communication, namely the construction of multiparty communication channels. In the next chapter, we develop LSR-based solutions for two other facets of group communication, specifically, membership management and leadership consensus.

Chapter 6

Group Leader Election under Link-State Routing

In this chapter, we investigate an issue involved in both LSR and group communication: the leader election problem. To argue for including leader election as a core network service, we identify applications that can benefit from a network-level leader election protocol, including hierarchical LSR, address mapping, and multicast. A solution to the problem, called the Network Leader Election (NLE) protocol, is proposed for use in LSR-based networks. The protocol is robust, for it achieves leadership consensus in the presence of adverse events, such as leader failures and network partitioning.
The correctness of the protocol is proved formally. A simulation study reveals that the NLE protocol incurs low overhead in handling leader failures and in group creation, and compares favorably with a previous LSR-based election protocol, the ATM domain leader election protocol.

6.1 Introduction

The problem of leader election concerns the selection of a distinguished member from a set of computing systems that are interconnected by a network. This problem has been extensively studied in the context of distributed computing systems, for example, in coordinating access to shared resources [63] and in implementing fault-tolerant objects [64]. Generally speaking, solutions to the problem are distributed "host-level" algorithms that make use of various services provided by the network, such as reliable delivery of messages, in order to monitor the working status of the established leader or to cast ballots for a new leader. Well-known contributions in this area include the Bully algorithm [65] and the Ring algorithm [52]; more recent developments are described in [66].

In this chapter, we address the leader election problem as it occurs "inside" the network. The participants in the election process are assumed to be switches (or, interchangeably, routers), rather than hosts or application processes. Solutions to this problem are intended to support underlying network functions, as opposed to being directly invoked by user applications. Whereas a host-level election protocol typically treats the underlying network as a "black box," a network-level election protocol can see and take advantage of the internal operation of the network, in particular, the underlying routing protocol. Several network functions can make use of an efficient leader election protocol.
First, in Asynchronous Transfer Mode (ATM) networks and other hierarchical networks, the switches in a low-level subnetwork (called a routing domain) select a switch to represent the domain at the next routing level [13]; a solution to this domain leader election problem supports routing operations within the network. Second, many address-mapping services, such as the mapping between group addresses and member addresses [67] and the mapping between network addresses and link-layer addresses [68], use a central-server approach; a solution to this server assignment problem selects a leader to undertake the server responsibilities. Third, some IP multicast protocols, such as CBT [4] and PIM [2], identify a network node, called a core node, as the traffic transit center for each multicast group; a solution to this multicast core management problem supports multicast services provided by the network. A common requirement of solutions to the above problems is fault tolerance: since network functions and services are expected to survive not only single-point failures, but also component failures that may partition the network, the solutions to these problems must also survive such adverse scenarios.

Our proposed NLE protocol is based on LSR. Specifically, the NLE protocol extends LSR to include group-leader binding LSAs, which are used by group members to advertise their choice of leader to the rest of the group. Upon receiving such an LSA, other switches in the network either accept this selection, or choose and advertise an alternative leader. The objective of the NLE protocol is to achieve network consensus on leader bindings, even in the presence of adverse conditions. The efficiency of the protocol stems from the use of timestamps to identify obsolete advertisements.
We argue that previous solutions to the network-level group leader election problem either do not meet the stringent fault tolerance criteria discussed above, or are more costly (in terms of bandwidth consumption and switch workload) when compared to the NLE protocol. As an extension to LSR, the NLE protocol achieves the following fault tolerance properties.

1. [Leadership Consensus Property] Given a group G and a network that has been partitioned into a set of segments S_1, S_2, ..., S_k, k ≥ 1, there will be consensus on the leader within each segment S_i, and that leader will be an operational switch within the segment.

2. [Mutual Consensus Property] By requiring group members to report to the established leader, the NLE protocol ensures that, within each network segment S_i, the established leader maintains a member list for the group that includes those, and only those, group members in S_i.

It is to be noted that, when the network is not partitioned, the above consensus properties hold throughout the network. Simply put, the NLE protocol can handle leader failures and work properly under catastrophic scenarios such as network partitioning. Results of a simulation study show that these features can be achieved with minimal protocol overhead.

The remainder of this chapter is organized as follows. The design of the NLE protocol is presented in Section 6.2, and the correctness of the protocol, which is modeled as a consensus problem under LSR, is formally proved in Section 6.3. The performance of the NLE protocol and the ATM domain leader election protocol are compared via simulation in Section 6.4. In Section 6.5, we discuss the application of the NLE protocol to the address resolution problem and to the multicast core management problem; included are simulation results regarding the performance of NLE in creating multicast groups. Finally, a summary of this chapter is given in Section 6.6.

6.2 The NLE Protocol
6.2.1 Overview

Since some decision-making processes of the NLE protocol, such as the leader selection policy, are application dependent, we discuss the protocol operation in the context of the domain leader election problem. As described in Chapter 2, ATM's domain leader election protocol uses a rank-based scheme to select the leader (the switch with the highest leader priority becomes the leader). Adaptation of the NLE protocol to other problems is discussed in Section 6.5. The operation of the NLE protocol is summarized as follows.

1. For every group g, each switch x in the network maintains a leader binding, denoted as Binding_x(g), whose value is a triple (Leader_x(g), Source_x(g), Stamp_x(g)), where Leader_x(g) is the leader of the group g as perceived by x, Source_x(g) is the switch that suggested this binding, and Stamp_x(g) is the timestamp associated with the binding. The goal of the NLE protocol is to maintain consensus on Binding_x(g) values across the network.

2. When a switch x joins a group g, it searches for the Leader_x(g) entry in its local database. If the entry is not found, group g is said to be unbound at x. In this case, switch x selects a switch c as the leader of the group according to a leader selection policy, sets Leader_x(g) to c, and broadcasts this binding. For the domain leader election problem, the leader selection policy selects a reachable switch with the highest leader priority.

3. Once the switch x has a Leader_x(g) entry, it sends a JOIN-REQUEST message to switch Leader_x(g). The join operation is not considered successful until a JOIN-ACK returns from Leader_x(g). Further, the switch x must re-join g (that is, repeat the join process) each time the Leader_x(g) value changes.

4. When a switch x leaves a group g, it sends a QUIT-REQUEST to switch Leader_x(g). Again, the quit process does not finish until the corresponding QUIT-ACK returns from Leader_x(g).

5. When Leader_x(g) = x, switch x acts as the leader of the group g: it processes JOIN-REQUEST and QUIT-REQUEST messages, and returns the appropriate acknowledgments. Further, via the join and quit requests from members, the leader maintains a member list for g, denoted as ML_x(g). A member of g will be dropped from ML_x(g) if it sends a QUIT-REQUEST message or if it becomes unreachable from the leader x. We point out that, since members are required to re-join the group each time a new leader is elected, a new member list will be compiled at the new leader. Member lists are not required at switches other than the leader.

6. When a switch x that is a member of a group g finds the switch Leader_x(g) unreachable, switch x selects and broadcasts a new leader binding for g. To avoid a rush of new leader bindings from all members of g, a delay timer of random length is used to postpone the re-selection task. Typically, one member wakes up before the others and advertises a new binding; the remaining members simply accept the binding and re-join the group.

7. Even when switch Leader_x(g) is still reachable from x, the switch x may decide, according to application-specific leader performance criteria, to select and advertise a new leader for group g. Given a group g, an objection policy determines when a switch objects to the current leader binding and selects a new leader. For the domain leader election problem, a switch objects to the current domain leader when it discovers a reachable switch that has a higher leader priority
At a switch y != x, the binding Binding_y(g) for such a group g will be aged out if this periodic flooding is not received for a predetermined length of time.

6.2.2 State Machines and Events

At a switch x, the NLE protocol defines two finite state machines (FSMs) for each active group g: a Membership Status Machine, denoted as MSM(x, g), and a Leadership Consensus Machine, denoted as LCM(x, g). Both the LCM(x, g) and MSM(x, g) machines access the Binding_x(g) entry; such accesses are assumed to be atomic to avoid race conditions. Figure 6.1 shows the events processed by the two machines. The LCM(x, g) processes incoming leader bindings for the group g, and reacts to events that indicate problems with the current leader, such as leader-unreachable events and objection events defined by the objection policy. The MSM(x, g) handles join and quit events and is responsible for ensuring that the current leader, Leader_x(g), holds correct information regarding the membership status of the switch x. The MSM(x, g) also processes leader-change events, which are raised whenever the LCM(x, g) accepts a new binding for group g.

Figure 6.1: The finite state machines in NLE. (The MSM handles join and quit events and leader-change notifications from the LCM; the LCM handles binding LSA arrivals, leader-unreachable events, and other objection cases.)

6.2.3 The Operation of LCM

The state transition diagram for the LCM is depicted in Figure 6.2. As shown, an LCM comprises four states: EMPTY, PENDING, REMOTE, and LOCAL. The EMPTY state is the initial state of LCMs. When there is no binding regarding g at x, the LCM(x, g) is in the EMPTY state; the values of Leader_x(g) and Source_x(g) are undefined, and the value of Stamp_x(g) is defined to be zero. An LCM(x, g) is in the LOCAL state when Leader_x(g) = x, and in the REMOTE state when Leader_x(g) != x. The LCM sometimes uses a timer to postpone the task of leader selection.
When this happens, the machine enters the PENDING state, waiting for time-out.

Figure 6.2: The leadership consensus machine at a switch x for a group g (LCM(x, g)).

A binding LSA is a pair (g, (c, s, t)), where the first element g specifies the group and the second element (c, s, t) is the value of this binding. The LCM processes binding LSAs according to the rules below:

A1 An incoming binding LSA l = (g, (c, s, t)) will be accepted at a switch x if (t, s) > (Stamp_x(g), Source_x(g)); otherwise it is rejected at x. (The comparison is in lexicographical order.) This rule guarantees that more recent bindings override old ones but that the reverse will not happen. When l is accepted, its value (c, s, t) becomes the value of Binding_x(g). Subsequently, the LCM(x, g) enters either the LOCAL or REMOTE state, depending on whether the new Leader_x(g) is x or not.

A2 When a switch x proposes and advertises a leader c for a group g, it 1) increases Stamp_x(g) by one, 2) sets Source_x(g) to x and Leader_x(g) to c, and 3) floods a binding LSA (g, Binding_x(g)). The LCM(x, g) then enters either the LOCAL or REMOTE state, depending on whether the new Leader_x(g) is x or not.

There are two situations where the LCM(x, g) may use Rule A2 to propose and advertise new leader bindings for the group g: when Leader_x(g) becomes unreachable, and when an objection event is raised according to the objection policy. In the latter case, the LCM proposes a new leader only if the machine is in the REMOTE or LOCAL state.
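Rules A1 and A2 can be sketched as follows (a minimal sketch, assuming string switch identifiers and a caller-supplied flood callback; the class and method names are illustrative, not part of the protocol specification):

```python
# Sketch of LCM rules A1 and A2; all names are illustrative.
class LCM:
    def __init__(self, x, g, flood):
        self.x, self.g = x, g            # this switch and the group
        self.leader = None               # Leader_x(g); undefined in EMPTY
        self.source = None               # Source_x(g); undefined in EMPTY
        self.stamp = 0                   # Stamp_x(g); zero in EMPTY
        self.flood = flood               # callback that broadcasts an LSA

    def on_binding_lsa(self, c, s, t):
        """Rule A1: accept (c, s, t) iff (t, s) beats the stored (stamp, source)."""
        if (t, s) > (self.stamp, self.source or ""):
            self.leader, self.source, self.stamp = c, s, t
            return "LOCAL" if c == self.x else "REMOTE"
        return "rejected"

    def propose(self, c):
        """Rule A2: bump the stamp, record x as source, and flood the binding."""
        self.stamp += 1
        self.source, self.leader = self.x, c
        self.flood((self.g, (self.leader, self.source, self.stamp)))
        return "LOCAL" if c == self.x else "REMOTE"
```

Because the comparison is lexicographic on (t, s), a freshly proposed binding (with an incremented stamp) always overrides older ones, while ties on the stamp are broken deterministically by the source identifier.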
When the current leader of g becomes unreachable from a switch x, the switch is triggered to select and advertise a new leader. To avoid a rush of simultaneous leader binding LSAs from group members, the LCM(x, g) sets up a delay timer and enters the PENDING state. There are two ways for the LCM to leave the PENDING state: 1) the timer fires, and the machine selects/advertises a new leader according to Rule A2, or 2) an "acceptable" binding LSA arrives before time-out. In case 2, the delay timer is canceled, and the LCM processes the LSA according to Rule A1. We will discuss the effects of various timer values in Section 6.4.

When the LCM(x, g) enters the LOCAL state, switch x must create a member list for group g and process JOIN-REQUEST/QUIT-REQUEST messages from members of g. The member list of g is created every time LCM(x, g) enters the LOCAL state, and is destroyed every time LCM(x, g) leaves that state. JOIN-REQUEST/QUIT-REQUEST messages will be acknowledged and used to update the member list when LCM(x, g) is in the LOCAL state, but are discarded silently when the machine is in any other state.

When an unreachability event concerns a switch y that is not the leader of the group g, the action of LCM(x, g) depends on whether x considers itself to be the leader. If so (that is, x = Leader_x(g)), x removes y from the member list of g; otherwise, it discards the event.

6.2.4 The Operation of MSM

The MSM at a switch x for a group g, denoted as MSM(x, g), reacts to join(g) and quit(g) events. An MSM has four states: MEMBER, JOINING, NON-MEMBER, and LEAVING, among which the NON-MEMBER state is the initial state. With respect to a group g, the MSM at a switch x is in the JOINING state if it wishes to join the group but has not completed the "registration" procedure, namely, the exchange of JOIN-REQUEST and JOIN-ACK messages with the leader, Leader_x(g). After the JOIN-ACK message is received, the joining member enters the MEMBER state.
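The join side of these transitions can be sketched as follows (a minimal sketch, assuming a caller-supplied send callback; retransmission on message loss is omitted, and the names are illustrative):

```python
# Join-side sketch of the MSM; quit handling is analogous.
class MSM:
    def __init__(self, x, g, send):
        self.x, self.g, self.send = x, g, send
        self.state = "NON-MEMBER"        # initial MSM state

    def join(self, leader):
        """join(g) event: enter JOINING and register with the current leader."""
        self.state = "JOINING"
        self.send(leader, ("JOIN-REQUEST", self.g, self.x))

    def on_join_ack(self):
        """A JOIN-ACK from the leader completes the registration procedure."""
        if self.state == "JOINING":
            self.state = "MEMBER"

    def on_leader_change(self, new_leader):
        """A leader-change event forces a member to re-join under the new leader."""
        if self.state in ("MEMBER", "JOINING"):
            self.join(new_leader)
```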
Defined similarly, a member of g is in the LEAVING state during the exchange of QUIT-REQUEST and QUIT-ACK messages with the leader, and will enter the NON-MEMBER state after completion. Retransmissions of REQUEST messages may be necessary to ensure successful delivery.

Figure 6.3: The membership status machine at a switch x for a group g (MSM(x, g)).

The MSM does not deal directly with leader-unreachable events. However, when the LCM(x, g) changes the leader binding due to leader-unreachable events or other objection events, it generates a leader-change event to be handled by the MSM(x, g). If the switch is a member of the group g, the switch must re-join the group; that is, the MSM(x, g) machine enters the JOINING state so that JOIN-REQUEST and JOIN-ACK messages are exchanged with the new leader.

If a switch x joins a group g when the LCM(x, g) is in the EMPTY state, the switch must select and advertise a leader for the group, following the procedure defined in Rule A2. This advertisement will be received by LCM(v, g) for every switch v in the network, including x itself.

6.3 Proof of Correctness

We prove in this section that the NLE protocol achieves consensus on group-leader bindings throughout the network. However, we must be careful when defining what can be proved and what cannot be proved. For example, consider a hypothetical scenario where, whenever a switch x is suggested as the leader of a group g, that switch crashes immediately. The other network switches will detect the unreachability of x and (some of them) will propose new leaders.
Meanwhile, switch x resumes execution shortly after new leader binding proposals are made. If the scenario repeats indefinitely and every newly suggested leader immediately crashes, then it is impossible for any leader-management algorithm to maintain stable and consistent leader bindings for the group. We conclude that a more reasonable goal is to study the behavior of the NLE protocol in response to a finite set of events. (Similar assumptions have been used in the "classic" LSR consensus problem, where the switches must reach consensus on network images [45].) Precisely, let us be given a group g and a finite set of events, E, which may include network status dynamics, group membership changes, unreachability events, and other objection events. In addition, E is assumed to contain at least one join event regarding g (otherwise the group is inactive and does not participate in the protocol). We will show that consensus is eventually reached throughout the network on the Binding(g) entries.

We begin with the leadership consensus property. We first prove the property for the case where the network remains connected after the event set E. Following that, we consider the case where the network is partitioned.

Theorem 3 [Weak Leadership Consensus Property] Let E be a set of network events, including group membership change and network status change events, and let g be a group with at least one join event in E. Assuming that the network is not partitioned after all events in E have taken place, all network switches will eventually agree upon the same leader binding for group g.

Proof: Considering the set of occurrence times of the events in E, we are interested in the maximum such element, t_last, the time of the last event in E. It is to be noted that the assumption of a connected network after time t_last does allow for some non-operational switches, as long as the "survivors" remain reachable from one another.
Let M_E be the set of switches that are operational and are members of g after E. If M_E is empty, the theorem is true vacuously. Let us consider the more interesting cases where at least one member switch survives E. Let A be the non-empty set of switches that are operational after t_last, and let B be the set of leader binding LSAs for the group g produced in response to the events in E. Let B_A be the subset of B such that (g, (c, s, t)) is in B_A if and only if c is in A. In other words, B_A is the set of leader binding LSAs for which the designated leader c is operational after t_last.

We claim two properties regarding the set B_A. First, the set B_A cannot be empty, since members of g will select new reachable leaders if there are no valid bindings for g. (A binding is invalid at a switch x if the designated leader is unreachable from x.) Second, the set B_A is finite. In fact, we claim that the set B is finite. This property comes from the fact that the NLE protocol produces a finite number of bindings in response to a single event of any type. The worst case is M binding LSAs per event, where M is the number of members in the group; this happens when the current leader fails and all M members select/advertise new leaders. Hence, a very loose upper bound on the cardinality of B (and B_A) is N x |E|, where N is the number of switches in the network and an upper bound of M.

Recall that, for two bindings (c1, s1, t1) and (c2, s2, t2), (c1, s1, t1) > (c2, s2, t2) if and only if (t1, s1) > (t2, s2). Let b_max = (c, s, t) be the maximum binding in B_A. This maximum element is well-defined because the set B_A is finite and non-empty. Since the switch c is connected to the switches in A after t_last, the periodic advertisements of b_max from c will eventually be received by all switches in A, which must accept the binding and ignore any others, due to the maximality of b_max. The binding b_max becomes the final consensus binding among all operational switches, and the theorem is proved.
The proof of the theorem also suggests that the final leader binding is "correct" in the sense that b_max is in B_A (that is, the final winner is an operational switch after E). Somewhat surprisingly, showing consensus when the assumption of eventual network connectivity is removed is not difficult at all, as shown in the following theorem.

Theorem 4 [Leadership Consensus Property] Let E be a finite event set that partitions the network into k segments, S1, S2, ..., Sk, where k >= 1. Let g be a group with at least one join event in E. The NLE protocol will achieve consensus on leader bindings for g within each segment S_i, for 1 <= i <= k.

Proof: To see the correctness of the theorem, we apply the argument regarding the set A in the previous proof to each segment S_i. That is, we simply consider switches in S_i to be operational and all other switches to be non-operational.

Next, we consider the mutual consensus property of the NLE protocol.

Theorem 5 [Mutual Consensus Property] Given a group g and a set of events E that partitions the network into k segments, S1, S2, ..., Sk, where k >= 1, the consensus leader of g in S_i produces a member list that includes those members, and only those members, in S_i.

Proof: In the following discussion, the consensus leader of g in S_i is denoted as Leader_i(g), and the member list maintained by the leader is denoted as ML_i(g). It is not difficult to see that members not in S_i after E will eventually be removed from ML_i(g), due to unreachability events about these members. It remains to be shown that all members of g in S_i will be added to the list ML_i(g).

A property of the MSM, shown in Figure 6.3, is that the MSM insists on having the current leader hear about the current membership status. However, some previous membership changes may not be learned by the leader.
For example, if a switch x decides to leave a group g while it is in the JOINING state with respect to g, the MSM simply enters the LEAVING state and issues a QUIT-REQUEST; the previous JOIN-REQUEST and JOIN-ACK exchange process is aborted. As a result of this design, given a sequence of interleaved join/quit events, the MSM does not guarantee the success of all respective REQUEST-ACK exchanges, but will enforce the successful exchange with respect to the last event in the sequence.

Let us assume that there is a switch y in S_i that is a member of g after the events in E, but y is not in ML_i(g). By the previous observation, we are concerned only with the REQUEST-ACK exchange process of the last membership change event, which must be a join event. The assumption that y is not in ML_i(g) implies that the leader in S_i does not receive a JOIN-REQUEST message from y, and hence will not return a JOIN-ACK message. Consequently, the switch y remains in the JOINING state, where the JOIN-REQUEST message will be issued repeatedly until the corresponding acknowledgment is heard. Since y and Leader_i(g) are connected, this process will eventually complete, putting y on the ML_i(g). This is a contradiction to the assumption about y, concluding the proof.

6.4 Performance Evaluation

In this section, we investigate the performance of the NLE protocol in handling leader failures. Specifically, the NLE protocol is compared against the ATM domain leader election protocol [13]. In our simulations, networks comprising up to 400 switches were used. For each network size, 40 graphs were generated randomly, and two simulation sessions were conducted on each graph. Table 6.1 shows the characteristics of the graphs generated. In the table, the symbol T_f denotes the worst-case time to perform a flooding operation in a given network. As in the simulations described in previous chapters, we used software overheads of 600 usec in each LSA forwarding.

Network size   Avg. degree   Avg. diameter   T_f (in ms)
 10            3.6           3.25            3.56
 20            3.57          4.75            5.12
 40            3.73          6.18            6.71
 60            3.88          6.8             7.5
 80            3.91          7.08            7.87
100            4.12          7.15            8.1
120            4.10          7.53            8.42
140            4.20          7.73            8.64
160            4.22          7.58            8.8
180            4.31          7.75            8.96
200            4.29          7.95            9.08
250            4.50          7.95            9.34
300            4.70          7.98            9.47
350            4.87          7.85            9.53
400            5.07          7.85            9.62

Table 6.1: Characteristics of randomly generated graphs.

We consider two metrics for the performance of leader election: the leader-binding convergence time and the number of leader binding LSAs produced for an election. The former refers to the length of the period from the moment the election begins to the moment that all network switches agree on the same leader node. (When an election is held due to the failure of the current leader, the election begins at the moment the leader fails.) The latter measures the number of leader-binding LSAs that are sent before consensus on the leader node is reached. In addition, we measured the bandwidth consumption of the two approaches. This is motivated by the fact that switches use point-to-point messages to cast ballots in the NLE protocol, but must use flooding operations in the ATM election protocol.

When a leader fails under the NLE protocol, group members select a new leader and send join requests to that switch. Since all members are informed (by corresponding LSAs) almost simultaneously, they all could potentially rush to suggest new leaders, resulting in a large number of conflicting leader binding LSAs. The NLE protocol avoids this problem by deferring member rejoins with a random timer. We assume that the current leader crashes at time 0, and that delay timers are uniformly distributed between 0 and a simulation parameter max_delay. We used max_delay values of 0.1 seconds, 1 second, and 10 seconds.

The results regarding the metric of the number of bindings are plotted in Figure 6.4(a). Even the very short maximum delay value (0.1 seconds) introduces fewer than 16 bindings in 400-switch networks.
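The intuition behind these numbers can be illustrated with a small Monte Carlo sketch (illustrative only, not the simulator used in this study): roughly, the members that generate conflicting bindings are those whose random timers fire before the earliest proposal, which takes about T_f to flood, can reach them.

```python
import random

def avg_conflicting_bindings(members, max_delay, t_flood, trials=2000):
    """Estimate the number of conflicting binding LSAs after a leader failure:
    a member proposes only if its uniform random timer fires before the
    earliest proposal (flooded in roughly t_flood) can reach it."""
    total = 0
    for _ in range(trials):
        timers = [random.uniform(0.0, max_delay) for _ in range(members)]
        first = min(timers)
        # Members whose timers fire within t_flood of the earliest timer
        # have not yet heard the first binding, so they also propose.
        total += sum(1 for t in timers if t < first + t_flood)
    return total / trials

# Widening the deferral window sharply reduces concurrent proposals; e.g.,
# with 400 members and t_flood of about 10 ms, a 0.1 s window yields tens
# of proposals while a 10 s window yields close to one.
```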
When the maximum delay is set to 1 second, fewer than 3 bindings are generated in large networks. When the maximum delay value of 10 seconds is used, only one binding is created in almost all simulation sessions. Although not shown in the figure, the current ATM election protocol produces N preferred-leader LSAs, the equivalent of binding LSAs, in an N-switch network for every leader failure event.

The results for convergence time are plotted in Figure 6.4(b). For this performance metric, the shorter the maximum delay value, the faster the convergence, since a short maximum delay value produces early time-out of the delay timers, and hence switches take less time to flood new leader bindings. Also, the larger the network size, the faster the bindings converge; not surprisingly, a large number of switches that set up random delay timers tends to produce one that times out quickly.

The results for bandwidth consumption are plotted in Figures 6.5(a) and (b). Bandwidth consumption is measured by counting the total number of links traversed by every LSA and JOIN-REQUEST/ACK message associated with the election. Figure 6.5(a) shows the results of the NLE protocol, which uses flooding operations to broadcast leader bindings and point-to-point messages to cast ballots (that is, to send JOIN-REQUEST messages). The curves in Figure 6.5(a) conform with those in Figure 6.4(a); that is, the more concurrent bindings produced, the more bandwidth consumed. The bandwidth consumption of the ATM election protocol is significantly larger than that of the NLE protocol, as shown in Figure 6.5(b).

Figure 6.4: Performance of the NLE protocol: (a) number of bindings, (b) convergence time.

Figure 6.5: Bandwidth usage of alternative election protocols: (a) NLE bandwidth, (b) ATM bandwidth.

In summary, compared to the ATM leader election protocol, the NLE protocol incurs far fewer flooding operations and consumes a small fraction of the bandwidth. We further emphasize that the ATM leader election protocol requires every switch to periodically advertise its preferred leader, while the NLE protocol requires only the leader to periodically broadcast its leader status. We conclude that the NLE protocol is more efficient than the ATM leader election protocol, while being equally robust.

6.5 Other Potential Uses of The NLE Protocol

We have discussed the use of the NLE protocol for the domain leader election problem. In this section, we briefly discuss the application of the protocol to two other important network services, namely, multicast address resolution and multicast core management. In addition, we evaluate the performance of the NLE protocol in group creation.

6.5.1 Multicast Address Resolution

In the last several years, a great deal of research has addressed the issue of implementing IP over new link layer protocols, such as ATM/AAL5.
One of the difficult tasks in implementing IP over ATM networks is how to handle multicast addressing. Whereas IP allows a source node to send a datagram to an abstract multicast group address, the current ATM standard does not support such an abstraction. Rather, ATM supports multicasting through point-to-multipoint unidirectional virtual channels, which require the sender to explicitly establish a connection to each destination.

One approach to this problem is to use a Multicast Address Resolution Server (MARS) [67], a central server that acts as a registry, associating IP multicast group identifiers with the ATM interfaces representing the members of the groups. The MARS is queried when an IP multicast address needs to be resolved, and hosts and routers must update the MARS when they join and leave groups. As a centralized solution, however, the potential for MARS failure is an important issue. The approach described in [67] is to manually configure nodes with the addresses of one or more backup MARS nodes that they can contact in descending order of preference.

An alternative method is to use an election protocol, such as NLE, to "automatically" handle MARS failures and, just as important, accommodate network partitions. Such an implementation might work as follows. A specific group identifier (call it MARS-GID) is reserved for the election of the MARS; every switch is assumed to be a member of this group. The selection and objection policies of the MARS follow a ranking scheme similar to those for domain leader election. If the current MARS crashes, the NLE protocol is used to establish consensus on a new Leader(MARS-GID) binding. For the new MARS to operate properly, member lists must be re-collected. To this end, every switch x in the network maintains an interested multicast addresses (IMA) list, M_x = {m1, m2, ..., mk}, where each element m_i is a multicast address that one or more of the attached hosts is interested in.
This list is usually maintained by a local membership management protocol, such as IGMP [7]. Since every switch must send a JOIN-REQUEST to a newly elected MARS, reconstruction of member lists at the MARS can be implemented by augmenting each JOIN-REQUEST message from switch x to include a copy of M_x. The consensus properties of the NLE protocol guarantee that, should the network be partitioned, there will be a MARS within each segment that maintains multicast group member lists for those, and only those, switches in the segment.

6.5.2 Multicast Core Management

As discussed in Chapter 2, some prominent IP multicast protocols, such as CBT [4] and PIM [2], associate a multicast traffic transit center, or core node, with each multicast group. In such approaches, datagrams destined to a multicast group are first forwarded to the core node, from which they are distributed along a multicast tree to reach group members. The association of the core node with a multicast group can be modeled as a leader election problem, and the NLE protocol can be applied.

One approach is as follows. We assume that a core election is held whenever a multicast group is created (that is, when the first member joins), and that a new core is elected if the current core fails. Regarding the core/leader selection policy, we can assume that the default is the random member policy [46]: whenever a member is required to select and advertise a core node (including at group creation time), the member simply recommends itself. A number of other core selection policies are discussed in [46] and could be incorporated into the NLE protocol. As with the applications discussed earlier, the mutual consensus property of the NLE protocol enables arbitrary multicast groups to handle network partitions and re-unifications.

The maintenance of the leader binding of every active group at every switch in a network may raise the concern of scalability.
In Chapter 7, we describe another core-management method that addresses this issue by using the NLE protocol to select a central server to maintain the leader bindings of all active groups. The method presented above, however, has an advantage in group-join time, because joining switches do not have to query a server to resolve leader-group bindings. This feature is important in situations where members of a group join and exit at a high rate.

6.5.3 Performance of Multicast Group Creation

When the NLE protocol is used for domain leader election and MARS election, all switches in the network are members of the (single) group, and the group is assumed to be created at network initialization, a relatively rare event. In the case of multicast core management, on the other hand, there are many multicast groups directly tied to applications, and group creation time may be important to the performance of those applications. Therefore, we conducted a study to evaluate the performance of the NLE protocol when a multicast group is created. As in the previous performance study, we are interested in two performance metrics: convergence time and the number of binding LSAs. It turns out that the convergence time in this case is quite predictable, as shown in the following theorem.

Theorem 6 Given the flooding diameter T_f of a network (the worst-case time to finish a flooding operation), the convergence time for group creation under NLE is less than 2T_f, assuming that no network component failures occur during group creation.

Proof: Assume that the first member joins a group at time 0. This member finds the group unbound and advertises a leader binding, which will reach all network switches by T_f. Assuming no component failures, any other switch must join the group by time T_f if it is to find the group unbound and propose its own binding. Flooding of any such additional bindings will require another T_f time to finish in the worst case.
Therefore, after time 2T_f, all network switches will have received all the bindings that have been flooded, and will agree upon the one with the largest value. Hence, the worst-case convergence time for leader binding is 2T_f.

To investigate the number of binding LSAs produced by group creation, we simulated the creation periods of multicast sessions with M participants. The arrival time of each participant is normally distributed with mean zero, the predetermined startup time of the group. The standard deviation value is set in such a way that 99% of the participants arrive within a predetermined time interval; this interval will simply be called an arrival interval. We used arrival intervals of lengths 1 second and 0.1 seconds. A switch joins the multicast group when its first attached participating host arrives. In a simulation session, the size of the participant population, M, is controlled by the participant-to-switch ratio; we used the values of 1 and 10 in this investigation. As such, our simulation study covers a wide range of participant population sizes, from 10 (obtained by 10-switch networks with 1 participant per switch) to 4000 (obtained by 400-switch networks with 10 participants per switch). Simulation sessions involving a small number of participants could represent teleconferencing applications, whereas those involving very large population sizes may represent Distributed Interactive Simulation (DIS) applications. The combination of very short arrival intervals with very large population sizes produces extremely busy group creation periods, in order to stress the NLE protocol.

Figure 6.6 shows the results of this study. Figure 6.6(a) plots the results when using the 0.1 second arrival interval. The worst case in the figure is only 3.0, meaning that even when 4000 participants join a multicast group within 0.1 seconds, the NLE protocol produces only three leader binding LSAs.
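As an aside, the standard deviation implied by the arrival model above can be computed directly; a short sketch using Python's standard library (the central 99% of a normal distribution lies within about +/-2.576 standard deviations of the mean, so sigma = interval / (2 x 2.576)):

```python
from statistics import NormalDist

def arrival_sigma(interval):
    """Standard deviation such that 99% of N(0, sigma) arrival times fall
    within [-interval/2, +interval/2]."""
    z = NormalDist().inv_cdf(0.995)      # ~2.576, the 99.5th percentile
    return (interval / 2) / z

# For the two arrival intervals used in the study:
# arrival_sigma(0.1) is about 0.0194 s, and arrival_sigma(1.0) about 0.194 s.
```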
Figure 6.6(b) plots the results for the 1 second arrival interval. As shown, this relatively longer (but still very short) arrival interval produces virtually no redundant bindings; that is, there is one leader binding produced per group creation event. We believe that the results in Figure 6.6 demonstrate that the NLE protocol is a viable method for handling many real-world situations.

Figure 6.6: Number of bindings generated for group creation: (a) 0.1 second arrival intervals, (b) 1 second arrival intervals.

6.6 Summary

We have addressed two facets of group communication in LSR-based networks. Specifically, the leader election problem and the membership management problem have been studied in a context where the participants of the election process are switching elements in LSR-based networks. The proposed solution, called the Network-level Leader Election protocol, models the group-leader binding problem as a consensus problem under link-state routing. In this model, the local network images at switches are extended with leader binding entries, whose network-wide consistency is guaranteed by the protocol. We have formally proved the correctness of the NLE protocol, including its leadership consensus property and mutual consensus property, under any combination of group member and network status changes. Our simulation studies reveal that the NLE protocol incurs minimal overheads for multicast group creation and moderate overheads to handle leader failures.
The performance of the NLE protocol compares favorably with a previous network group leader election protocol for ATM networks. The efficiency of the NLE protocol enables its use by both the internal operations of LSR (such as hierarchical routing and address mapping) and multiparty communication applications (for example, those that use core-based multicast). In the next chapter, we propose a second multicast core management method that uses the NLE protocol to select a central server, which manages the core nodes for active multicast groups.

Chapter 7

Multicast Core Management

The problem of multicast core management concerns assigning a network switching element to each multicast group for use as the root of the multicast tree of the group. In the previous chapter, we applied the NLE protocol to this problem in a per-group manner; that is, each group individually holds an election to select a respective core node. In this chapter, we pursue an alternative approach to the problem. The proposed method, called the LSR-based Core Management (LCM) protocol, uses the NLE protocol to elect a central server, called the core binding server (CBS), to manage core-group bindings for all active groups within the network. The LCM protocol selects core nodes for groups automatically, handles the failures of both core nodes and the CBS itself, supports core migration whereby multicast groups can adapt to membership and network status changes, and survives network partitioning. The LCM protocol is based on LSR: it uses the network status information provided by LSR to monitor the operational status of current core nodes and takes advantage of the shortest-path trees computed by LSR to support core migration. Our simulation results reveal that the central server can sustain extremely high workloads, and demonstrate the effectiveness of our core selection and core migration methods.
7.1 Introduction

As discussed in Chapter 2, a common technique to support multicast, found in the CBT [4, 3] and PIM [2] protocols, is core-based forwarding (CBF). A CBF multicast protocol associates a core node with each multicast group; the multicast tree of the group is defined to be the union of core-to-member shortest paths. Messages destined for the group are first sent to the core node, which forwards the message along branches of the tree. An advantage of CBF multicast protocols is that they enable simple methods for nodes to join and leave the group. We illustrated in Figure 2.3 the member join operation of the CBT protocol. That example assumes that the joining member has learned a priori the identity of the core node of the target group. Indeed, many CBF multicast protocols do not concern themselves with core management issues, such as who selects the core node (for example, an administrative authority, users/hosts, or the network), how a core node is selected (that is, which core selection algorithm to use), when a core node is selected (for instance, at the moment a group is created and/or some other time(s) during the life span of the group), how the identity of the core is disseminated to interested parties, and where the identities of the cores of active groups are stored. Before addressing these questions, we identify three basic requirements for core management.

1. Network-level core selection. If the task of core selection is performed by hosts, then the multicast interface between hosts and the network depends on the type of the multicast protocol used by the network. (In networks that use a CBF multicast protocol, for example, a join-group request from a host must include the core address of the group, whereas in networks that use other types of multicast protocols such information is not required.) Hence, automatic core selection by the network is preferred over host-level approaches, such as [69].

2. Core failure handling.
A potential weakness of CBF multicast is the single point of failure at the core. Methods are needed to assign new cores to multicast groups whose current cores have failed.

3. Core migration. During the lifetime of a multicast application, the members of a group may change, and the resource availability in the network may fluctuate. The purpose of core migration is to identify a new core node for the group whose corresponding multicast tree, determined by the current set of group members and present network status, will likely result in significantly better multicast performance than the tree based on the current core.

In Chapter 6, we discussed how the NLE protocol can be applied to the core management problem. In that approach, the NLE protocol is applied on a per-group basis to elect core nodes for multicast groups and handle core failures. The practice of storing core-to-group bindings (that is, leader bindings in NLE's terminology) for all the active groups at every router in the network has advantages and disadvantages. On the positive side, the approach adds no additional delays and overheads to group join operations in CBF multicast, because joining members can resolve core-group mappings locally. This merit may be important for multiparty communication applications whose participants join and leave at a high rate. On the negative side, however, the approach raises the concern of scalability when used to support a very large number of simultaneous multicast groups. Alternatively, one could use a bootstrap mechanism, as proposed by the PIM community [33]. In this method, when a multicast group is created or the core node of an existing group has failed, a hash function is used to map the address/ID of the group to a router in the network as its core node. As such, core bindings need not be stored at all, for all members of a group will map the ID of the group to the same core node.
Complexities of the bootstrap mechanism, however, stem from the tasks of discovering and disseminating the identities and operational status of routers in the network. Further, core migration is not supported. To remedy this problem, an independent core migration protocol can be used; at least one such protocol has been proposed by Donahoo et al. [70]. In Donahoo's protocol, the core node of a multicast group periodically sends probing messages to discover a subset of group members and a set of nodes which, if designated as the new core, may improve multicast performance. The core node then sends the list of "representative" members to the selected core candidates, which use sophisticated heuristics to evaluate their performance as the core. Evaluation results are sent back to the current core node, which selects the new core.

In this chapter, we propose a network-level core management method for use in LSR-based networks. The resulting LCM protocol uses the NLE protocol to select a core binding server (CBS), which manages the core-to-group bindings for all active multicast groups within a network. The LCM protocol works closely with LSR, using such information as the identities and operational status of network routers and the topology of the network, to support all three core management issues listed above. A contribution of this work is to demonstrate that a single, and yet relatively simple, core management solution can be developed under LSR. We emphasize that LSR-based protocols, such as LCM, are not intended for direct implementation in very large networks or internets, due to the scalability issues of LSR discussed in Chapter 2. For some CBF multicast protocols, such as the PIM protocol, a core node for a multicast address/group m is assigned within each routing domain that contains at least one member of m. Core management issues under such circumstances are by definition "local"; LCM could be used directly by such protocols.
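The bootstrap hashing idea discussed above can be illustrated with a short sketch; the hash scheme and names here are hypothetical, intended only to show why no core bindings need be stored when every member applies the same hash to the same router list.

```python
import hashlib

def bootstrap_core(group_id, routers):
    """Deterministically map a multicast group ID to a core router
    (bootstrap-style sketch): every member hashing the same ID over
    the same router list picks the same core, so core bindings need
    not be stored anywhere."""
    digest = hashlib.sha1(group_id.encode()).digest()
    return sorted(routers)[int.from_bytes(digest[:4], "big") % len(routers)]

routers = ["r1", "r2", "r3", "r4"]
core = bootstrap_core("224.0.1.9", routers)
assert core == bootstrap_core("224.0.1.9", routers)  # same answer at every member
```

As the text notes, the difficulty lies not in the hash itself but in keeping every member's view of the live router list consistent.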
In other CBF multicast protocols, such as the CBT protocol, there is only one core node for a given group m throughout the entire Internet. If the members of m are not restricted to a routing domain, hierarchical core management must be used. In this chapter, we present the "basic" LCM protocol; its extension to hierarchical networks is part of our ongoing research. Hereafter, we use the term "network" to refer to a set of routers that are governed by a single administrative authority and which collectively execute LSR.

The remainder of this chapter is organized as follows. We present the LCM protocol in Section 7.2. Various performance issues, including the workload at the CBS and multicast performance, are investigated through a simulation study, whose results are presented in Section 7.3. These results justify the use of a central server for core management, and show that the performance of multicast can be improved significantly by the simple core migration heuristic supported by LCM. A summary of this work is given in Section 7.4.

7.2 The LCM Protocol

As discussed, the LCM protocol uses a central server, the CBS, to manage core-to-group bindings. Precisely, the CBS of a network maintains a list of core bindings C = {Core(m) | m is an active multicast address}. When a host wishes to join a multicast group m, its local router x sends a CORE-MAPPING(m) message to the CBS. If the binding Core(m) is contained in the list C, then the CBS places this binding in a CORE-ADDRESS message that is returned to x. Otherwise, the CBS selects a core for m according to an initial core selection heuristic, and adds this binding to C before returning the CORE-ADDRESS message. After obtaining the binding Core(m), router x attaches itself to the multicast tree of m using the procedure defined by the underlying CBF multicast protocol.
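The CBS lookup path just described can be summarized in a few lines. This is an illustrative sketch (class and method names are ours), folding in the first-member heuristic that LCM adopts for initial core selection:

```python
class CoreBindingServer:
    """Illustrative CBS sketch: resolves Core(m) from the binding list C,
    creating a binding on first contact via the first-member heuristic
    (the requesting router itself becomes the core)."""

    def __init__(self):
        self.bindings = {}  # the list C: multicast address m -> core router

    def core_mapping(self, m, requesting_router):
        """Handle a CORE-MAPPING(m) request; the return value models
        the CORE-ADDRESS reply."""
        if m not in self.bindings:
            self.bindings[m] = requesting_router  # first-member heuristic
        return self.bindings[m]

cbs = CoreBindingServer()
print(cbs.core_mapping("g1", "x"))  # first member "x" becomes the core
print(cbs.core_mapping("g1", "y"))  # later joiners receive the same binding
```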
When all attached hosts of router x have departed from group m, router x follows the procedure of the given multicast protocol to exit the multicast tree, without the involvement of the CBS.

Initial core selection. When the first router member of a multicast group m asks the CBS for the core identity of m, group m becomes active and the CBS must select a core node for m. Since at this moment no further membership information regarding m is available, solutions to this initial core selection problem are limited. Previously proposed methods include random selection (randomly pick a router in the network) and random member (randomly pick a router member) [46]. The LCM protocol adopts a variation of the random member heuristic, called the first-member heuristic, which operates as follows: when the CBS receives a CORE-MAPPING(m) request from router x for a group m whose Core(m) does not exist in C, it sets Core(m) to x.

CBS election. The identity of the CBS is not statically configured, but rather is dynamically chosen by the NLE protocol. A specific group identifier (call it CBS-GID) is reserved for the election of the CBS. Every router x in the network is assumed to be a member of this group, and maintains a leader binding Leader_x(CBS-GID), to which CORE-MAPPING messages are sent. The selection and objection policies of the CBS follow a ranking scheme similar to those for domain leader election.

CBS failure handling. To handle CBS failures properly, not only must a new CBS be elected, but also the core binding list C must be re-collected at the new CBS. For this purpose, each router x maintains a list of core bindings that designate itself as the core; that is, each router x maintains C_x = {Core(m) | m is an active multicast address and Core(m) = x}. The list C_x is included in the ballot(s) sent from x during election.
Since the CBS will receive ballots from all the routers within the network/segment, it can collect all bindings in C, except those that designated the old CBS as the core of a group, which are discussed below. Let us consider a network that comprises 4 routers: W (the current CBS), X, Y, and Z. Let C = {(1,W), (2,X), (3,X), (4,X), (5,Y), (6,Z), (7,Z)}, where the pair (m, x) denotes Core(m) = x. The entire list C is maintained at W, and the partial binding lists at individual routers are C_W = {(1,W)}, C_X = {(2,X), (3,X), (4,X)}, C_Y = {(5,Y)}, and C_Z = {(6,Z), (7,Z)}. Let us assume that router W has failed and that router Y is elected as the new CBS. Since Y will receive partial binding lists, which are contained in respective ballots, from routers X and Z, it reconstructs a new binding list C = {(2,X), (3,X), (4,X), (5,Y), (6,Z), (7,Z)}. However, the core bindings relating to W are missing. To remedy this problem, bindings relating to the old CBS must be treated as a special case. In LCM, any router x that is a member of a group m whose Core(m) = CBS(x) must clear binding Core(m) whenever the value of CBS(x) changes, and must consult the new CBS for a new core binding for m. In the previous example, if group 1 has two members X and Z, then both routers must clear their local Core(1) entries and ask the new CBS Y to provide a new binding for group 1. Of course, such a binding does not exist in the (re-created) binding list C, and consequently the new CBS Y considers group 1 as a newly created group, and uses the initial core selection method to choose a new core node for group 1.

Core failure handling. The CBS uses the network topology information provided by the underlying LSR protocol to monitor all the core nodes listed in C.
Specifically, whenever the CBS loses connectivity to the core node of a group m, it randomly selects a router as the new core of m and advertises this new core binding throughout the network, using the flooding algorithm supported by the underlying LSR protocol. Both the router connectivity information required in core failure detection and the router identities required by the random selection heuristic are made available to the CBS by the underlying LSR protocol. Although our simulation results, presented in the subsequent section, reveal that randomly selecting the core node from among all routers typically does not result in good multicast trees when compared to many other core selection methods, the new core node of the "victim" group can invoke LCM's core migration method, discussed below, to regain (or obtain even better) performance.

Core migration. The core migration method used in LCM assumes that the core node of a group m maintains a (router) member list of m. This list can be compiled and updated if the JOIN-REQUEST and QUIT-REQUEST messages defined by the underlying CBF multicast protocol are delivered to the core node, in addition to the first router on the tree. (The PIM protocol satisfies this requirement. However, minor changes are required for other CBF protocols to meet this requirement.) Periodically, the core node computes a shortest-path tree to reach the members of m, and finds the center of the resulting tree. If the center is not the core itself, the core node voluntarily steps down by sending a CHANGE-CORE message to the CBS, which updates the binding list C accordingly and floods the new Core(m) value throughout the network. Subsequently, router members of m send JOIN-REQUEST messages to the new core to construct a new multicast tree.
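The center-finding step in core migration can be implemented with the classic leaf-pruning technique; the sketch below (a hypothetical helper, not LCM's actual code) removes leaves layer by layer until at most two central nodes remain, touching each node and edge once.

```python
def tree_center(tree):
    """Find a center of a tree by repeated leaf pruning; cost is linear
    in the tree size.  tree: node -> set of neighbouring tree nodes."""
    adj = {v: set(nbrs) for v, nbrs in tree.items()}
    nodes = set(adj)
    leaves = {v for v in nodes if len(adj[v]) <= 1}
    while len(nodes) > 2:
        nodes -= leaves                 # prune the current layer of leaves
        nxt = set()
        for leaf in leaves:
            for nbr in adj[leaf]:
                adj[nbr].discard(leaf)
                if nbr in nodes and len(adj[nbr]) == 1:
                    nxt.add(nbr)        # neighbour has become a leaf
        leaves = nxt
    return min(nodes)  # one of the (at most two) central nodes

# A five-router path a-b-c-d-e: the center is "c".
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
print(tree_center(path))  # c
```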
We point out that the above shortest-path-tree computation is performed by the underlying LSR routing protocol as part of its normal duties, and that the task of finding the center of a tree can be performed in O(N) time, where N is the number of routers on the tree. As an example, let us consider the network shown in Figure 7.1, where the above core migration method is applied to a group comprising three members A, B, and C. In Figure 7.1(a), we assume that router A is the first node to join the group, and hence is the core of the initial multicast tree of the group. Also shown in the figure is the tree center, router D. In LCM, router A will (eventually) transfer the responsibility of the core node to D, resulting in the multicast tree depicted in Figure 7.1(b). Regarding the performance of the two multicast trees, the tree in Figure 7.1(a) imposes a maximum member-to-core distance of 4 and an average distance of (0 + 2 + 4)/3 = 2, while the tree in Figure 7.1(b) imposes a maximum distance of 2 and an average distance of (2 + 2 + 1)/3 = 5/3.

Figure 7.1: Core migration in LCM. (a) the initial multicast tree; (b) the tree after core migration.

Core binding destruction. When the core node of a group m detects an empty group, it sends a DELETE-BINDING message to the CBS, removing the Core(m) entry from the binding list C.

7.3 Performance Evaluation

In this section, we investigate various aspects of the performance of the LCM protocol, including the workload at the CBS and the effectiveness of LCM's core selection/migration policies. Since LCM uses the NLE protocol to select the CBS, simulation results regarding NLE's performance, presented in Chapter 6, apply to CBS election and will be omitted here.

CBS workload. The use of a centralized server for the management of core-group bindings raises the concern of the workload at the server. We investigated this issue via simulation. Our experiments were designed to stress the CBS as much as possible.
To this end, we assume that K multicast groups of size S are created simultaneously at time 0. The values of K range from 10 to 200, and those of S range from 20 to 200. Given a multicast group, member arrival times (that is, the times members join the group) are normally distributed with mean 0. We chose the standard deviation value such that 99% of arrival times are within a 1-minute interval centered at time 0 (that is, from -30 seconds to +30 seconds). In the busiest cases, 200 groups of 200 members each are created within 1 minute, producing 40000 CORE-MAPPING requests within that interval. We assumed the service time of such a request to be 700 µsec, which is a typical IP/UDP software overhead observed on many platforms [71]. We used this figure because the look-up of the core binding list can be implemented efficiently, requiring O(log S) time using a tree-based data structure or O(1) time using a hash function. The overhead of this task should be negligible when compared to the software overhead of receiving and returning messages. Results of this study are presented in Figure 7.2. As we can see in Figure 7.2(a), the average queue length at the CBS is less than 2, even for the highest event rates. We point out that the queue length is averaged only over the periods where the CBS is busy. Hence, the smallest possible value of the average queue length metric is one. The maximum queue length at the CBS is plotted in Figure 7.2(b). Although the maximum queue length was between 10 and 20 in some experiments, we point out that a queue containing 20 requests can be served within 14 milliseconds. We conclude that the CBS can accommodate even the busiest scenarios in our simulation.

Figure 7.2: Queue length at the CBS. (a) average; (b) maximum. [Plots of queue length versus number of groups (0 to 200) for group sizes 20, 50, 100, 150, and 200.]

Core selection/migration. In addition to the operational overhead of the LCM protocol, we also investigated the characteristics of multicast trees that result from the LCM core selection and core migration methods. Specifically, we studied the core-to-member distances of such multicast trees. We randomly generated 100 graphs of 144 nodes (that is, routers) with average node degree 4. The average diameter of these graphs is approximately 10. We randomly generated 1000 groups of size S, where values of S range from 2 to 50. For each group, two multicast trees were generated on each graph. First, a member of the group is randomly selected as the "first member" and is used as the core to construct a multicast tree T. Next, we compute the center of T, which is used as the core to construct a second multicast tree for the group. Furthermore, for each group-graph combination, we tried every node in the graph as the core node and recorded the average performance of the resulting trees, in order to obtain the performance of the random (core) selection heuristic. The results for average core-to-member distances are plotted in Figure 7.3(a). As we can see, the performance of the first-member heuristic is significantly better than that of the random selection method when group size is small, and approaches that of random selection as group size increases. (The flat curve for the random selection method results because the average distance from a group of nodes to a randomly selected node is approximately half the diameter of the network.)
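The distance metrics in this study can be reproduced with a single breadth-first search from the candidate core; a minimal sketch (function and variable names are ours, not the simulator's):

```python
from collections import deque

def core_to_member_distances(adj, core, members):
    """Breadth-first search from the core; returns (average, maximum)
    hop distance to the members.  adj: node -> iterable of neighbours."""
    dist = {core: 0}
    queue = deque([core])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    hops = [dist[m] for m in members]
    return sum(hops) / len(hops), max(hops)

# Toy path a-b-c-d with core "b" and members {a, c, d}:
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(core_to_member_distances(adj, "b", ["a", "c", "d"]))  # average 4/3, maximum 2
```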
Results for the maximum core-to-member distances (that is, the depths of trees) are plotted in Figure 7.3(b). With respect to this metric, the first-member and random selection heuristics exhibit approximately identical behaviors. In both Figures 7.3(a) and 7.3(b), the results of the tree-center core selection method clearly demonstrate the benefits of the LCM core migration method, when compared to protocols that do not support migration. In summary, the results presented here support the core management policies of the LCM protocol, which simply assigns the first member of a multicast group as the initial core node, and changes the core node of the group to the center of the current multicast tree after membership information has been revealed and remained stable for a predetermined length of time.

Figure 7.3: Core-to-member distances produced by various core selection methods. (a) average; (b) maximum. [Plots of distance versus group size (2 to 50) for the random, first-member, and tree-center heuristics.]

7.4 Summary

We have proposed a central-server-based core management protocol, the LCM protocol, for use by CBF multicast protocols under LSR.
Based on the information provided by LSR, the protocol addresses three aspects of multicast core management, namely, automatic core selection, core failure handling, and core migration, and can survive any combination of network component failures, including those that partition the network. Our simulation study has shown that the CBS can handle extremely heavy workloads, and has demonstrated the improvements in multicast performance achieved by LCM's core migration method. This work once again illustrates the strength of LSR in supporting group communication.

Chapter 8

Tree-Based Link State Routing

In this chapter, we come full circle, combining group communication techniques discussed earlier to develop a novel link-state routing protocol, called the Tree-based LSR (T-LSR) protocol, for use in general-purpose LSR-based networks, such as the Internet. In the T-LSR protocol, a leader router is elected to perform periodic network status broadcasts on behalf of all the other routers, reducing the overhead associated with periodic flooding, and a spanning tree is constructed for use in the broadcast of network status updates. We prove the correctness of the T-LSR protocol, that is, its ability to maintain consistent routing information and leader preferences throughout the network under any combination of network component failures, partitioning scenarios, and undetected transmission errors. The results of a simulation study reveal that the T-LSR protocol imposes a small fraction of the overhead of the conventional LSR method during normal operation periods, and incurs moderate overhead during adverse periods when an election is in progress or the spanning tree is under repair/construction.

8.1 Motivation

In this chapter, we return to the topic of reducing the operational overhead of LSR. Before presenting our approach, let us take another look at important performance issues of previous LSR protocols.
As discussed earlier, many LSR protocols use the conventional flooding algorithm, which forwards every LSA on every communication link. Thus, each router must process, on average, D copies of a given LSA in a network with average node degree D. Second, all routers are required to flood local status periodically. If the flooding period is T seconds, then each router has to process approximately (N × D)/T LSAs per second in an N-router network. Hereafter, we use the term conventional LSR, or C-LSR for abbreviation, to refer to any LSR protocol that uses the conventional flooding algorithm and that requires every router to perform periodic flooding. Both the OSPF protocol [11] and the LSR method described in ATM standards [13] fall into this category.

Previous efforts to reduce the overhead of LSR have focused largely on flooding operations. Specifically, Gopal [72] described several hardware implementations of the conventional flooding algorithm. In these implementations, however, a broadcast message still has to traverse all communication links. A software-based, spanning-tree flooding method was discussed in [73]. The main concern of that work was to seamlessly integrate routers that use conventional flooding with those that use tree-based flooding. It is not clear if that method could survive routing information/transmission corruption problems. Rajagopalan [74] described a flooding method whereby every router builds a source-rooted tree to advertise its local status. By contrast, the T-LSR protocol constructs a single spanning tree shared by every router. Using only one tree reduces the number of protocol states that the underlying flooding algorithm must maintain. Our previous efforts to reduce LSR overhead, namely, the SAF protocols, construct a spanning MC to broadcast LSAs; the idea of hardware-based, spanning-tree broadcast of routing information has also been exploited by other researchers [55, 75].
The T-LSR protocol does not assume any capability in hardware and hence can be applied to a wider range of networking platforms. Furthermore, while the above flooding methods improve the performance of individual flooding operations, none of them is concerned with the bigger picture of the entire flooding cycle. In this chapter, we propose a novel LSR protocol, Tree-based LSR (T-LSR), which constructs a single spanning tree that is used by all routers for the dissemination of status information. Moreover, the T-LSR protocol elects a leader router to undertake the duty of periodic flooding on behalf of other routers. In Figure 8.1, we give an example to illustrate the concept of tree-based flooding. Using the spanning tree topology shown in Figure 8.1(a), the flooding operation in this example requires four steps. A tree-based flooding operation performs only O(|V|) LSA message forwardings, as opposed to O(|E|) LSA forwardings in the C-LSR protocol. Using the T-LSR protocol, each router in an N-node network that uses T-second flooding cycles processes, on average, only 1/T advertisements per second produced by periodic flooding, and O(1) copies of any LSA.

Figure 8.1: An example of tree-based flooding. [Steps 1 through 4 of an LSA broadcast along tree links, marking nodes that have received the LSA and nodes that have finished flooding.]

Of course, the major challenge in designing such a "lightweight" LSR protocol is to provide the same level of robustness as the C-LSR protocol. As we discussed in Chapter 2, one of the critical fault-tolerance requirements of an LSR protocol is to survive undetected transmission errors. (The entire ARPANET was brought down by such errors in 1980 [76].) While the problems of leader election and spanning tree construction have been studied extensively [13, 55, 52, 77], previous solutions deal mainly with component failures (such as leader or tree link failures) and partitioning of the network.
Solutions to these problems that also survive message corruption events are relatively unexplored. A class of problems, collectively referred to as the incorrect leadership problem, arises when corrupted network topology information is used in the computation of the spanning tree topology, or when undetected transmission errors occur during the establishment of leadership and the construction of the spanning tree. We will formally prove the correctness of the T-LSR protocol, that is, its ability to maintain consistent routing information, construct a correct spanning tree, and achieve leadership consensus under any combination of network component failures, partitioning scenarios, and corruption problems.

The remainder of this chapter is organized as follows. An overview of the T-LSR protocol is first given in Section 8.2. Algorithm details of the T-LSR protocol are presented in Section 8.3, followed by the proof of correctness in Section 8.4. The performance of the T-LSR protocol is investigated through simulation. The results of this study, presented in Section 8.5, reveal that the T-LSR protocol imposes a very small fraction of the overhead of the C-LSR protocol during normal operation periods, and incurs only moderate overheads during adverse periods when the spanning tree is under repair/construction and leader election is in progress. Finally, a summary of this work is given in Section 8.6.

8.2 Overview

In this section, we present the operation of the T-LSR protocol. In the discussion, we assume a connected network G = (V, E), where V is the set of routers and E the set of communication links that connect routers. To generalize our discussion to partitioned networks, we simply consider segments individually. For the purpose of cross-reference in the subsequent discussion, important rules/conditions are labeled.
(For example, the statement "when a router receives the first copy of a given LSA, the LSA is forwarded along all the links incident to the router except the one on which the LSA arrives" could be labeled as Forward-LSA-Rule-1.) Before the discussion, we give the control message formats and data structures of the T-LSR protocol in Tables 8.1 and 8.2, respectively.

LSA(x, s, m): a link-state advertisement with sequence number s and flooding mode m that contains the local status of router x.
CTA(a, G'_a, T, c): a complete-topology advertisement that contains the reachable network image G'_a of the leader router a and a spanning tree topology T with epoch number c.
Ballot(z, a, c): a ballot message from a child z that specifies a as the leader and is used to establish the spanning tree of epoch number c.
LEA(a, c): a leadership establishment advertisement that broadcasts the establishment of the leadership of router a and the completion of the construction of the spanning tree with epoch number c suggested by a.

Table 8.1: Control messages in the T-LSR protocol.

Rank(x): the rank (leader priority) of router x.
Leader(x): the preferred leader of x.
Mode(x): the operation mode (either T or G) of x.
Epoch(x): the epoch number of the current spanning tree.
Flag_x[z]: a boolean flag that indicates if x has received the ballot for Leader(x) from its child z in the current spanning tree.

Table 8.2: T-LSR data structures at a router x.

LSA model. We assume that every LSA originated by a router contains the complete status of the router. If a router x has five incident links, for example, then every LSA from x contains descriptions of all five links. When x wishes to advertise the failure of one of its incident links, it floods an LSA that describes the working status of four links and the non-operational status of the fifth. In this way, an LSA can be uniquely identified by its source router ID and a sequence number.
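The full-status LSA model just described can be sketched as a small data type; the fields mirror Table 8.1, and the acceptance test shows how the (source, sequence number) pair identifies an LSA. This is an illustrative sketch, not the protocol's wire format.

```python
from dataclasses import dataclass, field

@dataclass
class LSA:
    """Full-status LSA sketch: uniquely identified by (source, seq);
    `links` carries the state of every incident link of the source."""
    source: str
    seq: int
    mode: str = "T"                            # flooding mode flag, T or G
    links: dict = field(default_factory=dict)  # neighbour -> "up" / "down"

def accept(lsa, image):
    """Install the LSA in the local network image only if it is fresher
    than the copy already held for its source router."""
    held = image.get(lsa.source)
    if held is None or lsa.seq > held.seq:
        image[lsa.source] = lsa
        return True
    return False

image = {}
print(accept(LSA("x", 1, links={"y": "up"}), image))  # True: first copy
print(accept(LSA("x", 1, links={"y": "up"}), image))  # False: duplicate
```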
This LSA model is similar to that of the OSPF protocol [11]. Other LSR protocols use a more refined model, where each component of the local status of a router (for example, a specific link) is assigned an LSA ID [13], and an LSA must be identified by a (router ID, LSA ID, sequence number) triple. This allows an LSA to contain only a part of the local status of a router and is economical in terms of bandwidth consumption if the router frequently advertises changes in individual state components. The T-LSR protocol could be generalized to handle such LSA models.

Network image. The network image at a router x, denoted G_x, is defined to be the set of LSAs maintained at x. We note that G_x could include unreachable routers because, for example, when x loses connectivity to another router y, the LSA regarding y is still maintained by x until it is aged out. We denote by G'_x the set of LSAs maintained at x that regard routers reachable from x in the topology defined by G_x. If the network is connected, then G'_x = G_x. When the network is partitioned, G'_x is a proper subset of G_x, and the LSAs in G'_x describe the topology of the network segment in which x resides.

Operation modes. The T-LSR protocol elects a leader router to perform periodic flooding on behalf of all the other routers and uses only tree links in the dissemination of network status updates; details are given later. However, there are periods of time when the election is in progress and/or the spanning tree is under construction. During such adverse periods, the T-LSR protocol reverts to the C-LSR protocol to ensure uninterrupted routing operation. To distinguish adverse periods from normal operation periods, each router operates in one of two modes: mode T and mode G. We denote by Mode(x) the operation mode at router x.

• During periods when leadership consensus has been achieved and the spanning tree is operational, all the routers in the network operate in mode T.
When a router is in mode T, it floods only changes in local status and uses only spanning tree links in the flooding of LSAs; it does not perform periodic flooding. Every LSA flooded by a T-mode router is tagged with a mode flag of value T; such an LSA is termed a T-mode LSA, and its respective flooding is termed T-mode flooding.

• When a router is in mode G, it effectively executes the C-LSR protocol: it performs both periodic and event-driven flooding, which in turn use all communication links. Every LSA flooded by a G-mode router is tagged with a mode flag of value G; such an LSA is termed a G-mode LSA, and its respective flooding is termed G-mode flooding.

The arrival of a G-mode LSA at a T-mode router forces the router to switch to mode G. The existence of any router in the network that is in mode G indicates a lack of leadership consensus within the network.

Leader election and spanning tree construction. Every router x is configured with a leader priority, denoted by Rank(x), which constitutes a part of the local status of the router and which therefore is included in LSAs flooded by x. Further, router x searches V(G'_x), the set of routers known by x to be reachable, for the router with the highest rank, and calls the result of this search its preferred leader, denoted Leader(x). Subsequent actions taken by router x depend on whether or not the value of Leader(x) is x itself.

If Leader(x) is set to x, then router x immediately undertakes the responsibilities of the leader router (although at this point not all routers necessarily agree on its leadership). Leader responsibilities include the computation of a spanning tree topology T and the periodic broadcast of complete topology advertisements (CTAs). A CTA from x contains all the LSAs in G'_x as well as the spanning tree topology T. To broadcast a CTA, x forwards the CTA along the branches of T.
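The reachable image G'_x, on which both leader election and CTA contents depend, can be sketched as a breadth-first search over the stored LSAs; the dictionary encoding of G_x below is an assumption made for illustration.

```python
from collections import deque

def reachable_image(G_x, x):
    """Compute G'_x: the subset of LSAs in G_x whose routers are reachable
    from x in the topology that G_x itself describes.  G_x is encoded as
    router ID -> set of advertised neighbors (an illustrative stand-in
    for full LSA contents)."""
    seen, frontier = {x}, deque([x])
    while frontier:
        u = frontier.popleft()
        for v in G_x.get(u, set()):
            if v in G_x and v not in seen:  # follow only routers we hold LSAs for
                seen.add(v)
                frontier.append(v)
    return {r: nbrs for r, nbrs in G_x.items() if r in seen}

# Router 'd' is still stored (not yet aged out) but unreachable from 'a'.
G_x = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}, 'd': {'e'}, 'e': {'d'}}
print(sorted(reachable_image(G_x, 'a')))     # ['a', 'b', 'c']
```

In a connected network the result equals G_x itself; under partitioning it covers only the segment containing x, matching the G'_x ⊂ G_x distinction in the text.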
On the other hand, if router x has some other preferred leader, that is, Leader(x) = α and x ≠ α, then x must await a CTA from α; CTAs from other routers will be silently discarded (Discard-CTA-Condition-1). Upon receiving a CTA from its preferred leader, router x processes the LSAs contained in the CTA, extracts the spanning tree T, and forwards the CTA to its children in T. The second task of router x is to receive ballot messages for α from all its children. After the completion of this task, x sends its own ballot to its parent y in T. This ballot also serves to establish the x-y tree link. After router α collects all the ballots from its own children, it claims victory by broadcasting a leadership establishment advertisement (LEA), again using only T links. Receipt of the LEA changes the operation mode of every router to T, and the network enters the normal operation of the T-LSR protocol.

Re-election. In the T-LSR protocol, leader re-election is triggered by changes in the set of reachable routers. Specifically, when a router x observes a change in the set V(G'_x), it must re-compute its preferred leader (Compute-Leader-Condition-1). To enable router ranks to be changed during protocol operation, x also re-computes its leader preference when it detects any change in router ranks (Compute-Leader-Condition-2). In either case, router x switches to mode G (Enter-Mode-G-Condition-1) and participates in a new election.

For illustration, let us consider a network where the administrator has configured a default leader α with rank 3 and a backup leader β with rank 2. All the other routers are configured with rank 1. Consider a scenario where the current leader α has just failed. First, the neighboring routers of α notice the failure of the links incident to α, and flood LSAs that describe the malfunctioning status of such links.
Via these LSAs, every router x detects a change in V(G'_x) (specifically, that α has been removed from V(G'_x)), switches to mode G, and sets Leader(x) to the router with the next highest rank, namely β. Router β also discovers that it is itself of highest rank, so it broadcasts CTAs and collects ballots to establish its leadership and construct a new spanning tree.

Maintenance of the spanning tree. When a link used by the spanning tree fails, the routers incident to the link switch to mode G (Enter-Mode-G-Condition-2) and flood G-mode LSAs that contain the new state of the link. Upon receipt of such an LSA, every router in the network switches to G-mode operation. Routers remain in this mode until a new spanning tree (contained in the next CTA from the leader) has been constructed and the leader has broadcast an LEA.

As illustrated above, the spanning tree topologies contained in the periodic broadcasts of CTAs from a given leader may change over time in response to network topology changes. The sequence of tree topologies proposed by a leader router is divided into one or more epochs. Consecutive, identical tree topologies are tagged with the same epoch number; a change in the tree topology is reflected by an increment in the epoch number. During each epoch, routers remain in mode T. When a change in epoch number is detected, routers switch to mode G (Enter-Mode-G-Condition-3) until the construction of a new tree is completed. Each router x records the current epoch number in the data structure Epoch(x). Any CTA that contains a spanning tree with an epoch number smaller than Epoch(x) will be discarded by x (Discard-CTA-Condition-2). We emphasize that routers must cast ballots in every round of tree topology broadcast (that is, every CTA broadcast), regardless of the presence or absence of epoch number changes.
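The rank-based election illustrated above can be sketched as a one-line selection over V(G'_x); the tie-break on router ID is an assumption, since the text does not say how equal ranks are resolved.

```python
def preferred_leader(ranks):
    """Return Leader(x) given a mapping of reachable router ID -> Rank,
    i.e. the routers in V(G'_x).  Highest rank wins; ties are broken by
    the larger router ID (an assumed tie-break)."""
    return max(ranks, key=lambda r: (ranks[r], r))

# Default leader 'alpha' (rank 3), backup 'beta' (rank 2), others rank 1.
ranks = {'alpha': 3, 'beta': 2, 'x': 1, 'y': 1}
print(preferred_leader(ranks))      # alpha
del ranks['alpha']                  # the configured leader fails
print(preferred_leader(ranks))      # beta takes over, as in the example
```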
Before the broadcast of a CTA, the leader computes a new spanning tree topology if it is currently in mode G (Compute-Tree-Condition-1), and increments the epoch number. After receiving all the ballots pertaining to the CTA, if the leader is currently in mode G (Issue-LEA-Condition-1), it broadcasts an LEA(α, c), where c = Epoch(α). Failure to collect any necessary ballot switches the leader to mode G (Enter-Mode-G-Condition-4). Upon receiving the LEA, any router x for which Leader(x) = α and Epoch(x) = c switches to mode T (Enter-Mode-T-Condition-1) and forwards the LEA to its children in the current spanning tree. Otherwise, the LEA is discarded by x.

Flooding algorithm. In the T-LSR protocol, a router can operate in mode T or mode G, and an LSA can also be flooded in either of the two modes. When an LSA arrives at a router, there are therefore four (flooding mode, operation mode) combinations. Before formally presenting the LSA-forwarding rules under these combinations, let us use the example shown in Figure 8.2 to discuss important scenarios. In the example, router X detects a significant change in the queueing delay over the (X, A) link and disseminates this information by flooding an LSA ℓ_X in mode T. Simultaneously, another router Y floods an LSA ℓ_Y (in mode G) to advertise the failure of the (Y, B) link, which is used in the spanning tree depicted in Figure 8.2(a). Let us assume that all routers except Y are initially in mode T. Figures 8.2(b) and (c) depict the first and second forwarding steps of the two flooding operations. As shown in Figure 8.2(c), the T-mode LSA ℓ_X encounters two routers, W and Z, whose modes have been changed to G by ℓ_Y. As shown in Figure 8.3(a), when routers W and Z receive ℓ_X, they change the mode of ℓ_X to G and forward ℓ_X along all their respective incident links, including the ones on which the T-mode ℓ_X arrived, specifically, the (W, H) and (Z, H) links.
The G-mode copy of ℓ_X will be considered more recent than its T-mode counterpart. When the G-mode ℓ_X arrives at a router that has already received the T-mode ℓ_X, it will be treated as being seen for the first time; in Figure 8.3(b), router X forwards the G-mode copy of ℓ_X to its neighbor A, as if it were receiving ℓ_X for the first time.

[Figure 8.2: The flooding of two LSAs in different modes. Legend: G-mode node; T-mode node; tree link; G-mode flooding; T-mode flooding. (a) initial configuration; (b) first forwarding step; (c) second forwarding step.]

[Figure 8.3: The completion of the T-mode flooding in mode G. (a) first forwarding step; (b) second forwarding step.]

However, the above situation, in which the G-mode copy of ℓ_X returns to X itself, raises the concern that ℓ_X may have been corrupted before being processed by X the second time. If X blindly accepted the corrupted G-mode copy of ℓ_X, then router X would have incorrect knowledge about its own status. To cope with this problem, when any (G-mode) LSA that contains the local status of a router x arrives at x itself, router x compares the LSA against its local status and discards the LSA if any inconsistency is detected (Discard-LSA-Condition-1).

We now present the flooding rules of the T-LSR protocol. Let ℓ' = LSA(y, s', m') ∈ G_x, where y is the ID of the source router, s' is the sequence number, and m' is the mode of the LSA, be the LSA regarding y that is maintained at x. When an LSA ℓ = LSA(y, s, m) arrives at x, it is ignored by x if (s, m) ≤ (s', m') (Discard-LSA-Condition-2), where the comparison is in lexicographic order and mode G is defined to be greater than mode T. (Thus, given two LSAs regarding the same router and with identical sequence numbers, the one in mode G overrides the one in mode T.) If ℓ is not discarded, it replaces ℓ' in G_x and is forwarded according to the three cases below.
In the discussion, we denote by E(x, T) the set of tree links that are incident to x, by E(x, G) the set of all incident links of x, and by p the link on which ℓ arrives.

LSA-Forwarding-Case-1: m = Mode(x). Forward ℓ along links in E(x, m) − {p}.

LSA-Forwarding-Case-2: m = G and Mode(x) = T. Set Mode(x) to G (Enter-Mode-G-Condition-5), and forward ℓ along links in E(x, G) − {p}.

LSA-Forwarding-Case-3: m = T and Mode(x) = G. Forward LSA(y, s, G) along links in E(x, G).

The first case occurs when LSA ℓ and router x are in the same mode. The last two cases take place when the network is in mode transition. In Case 2, the arrival of a G-mode LSA at a T-mode router switches the router to mode G; the LSA itself is forwarded according to the conventional flooding algorithm. In Case 3, when a G-mode router receives a T-mode LSA, the router changes the flooding mode of the LSA to G and forwards it along all its incident links.

Aging. Like the C-LSR protocol, the T-LSR protocol uses an aging mechanism to curb the lifespan of corrupted LSAs. At a non-leader router x, the LSA in G_x regarding a router y is removed from G_x t_aging seconds after its arrival at x. This rule, of course, cannot be applied at the leader router itself, because other routers do not perform periodic flooding and hence the leader may not receive new LSAs from other routers for long periods of time. At the leader router, once leadership has been established, LSAs regarding reachable routers are immune to aging. Specifically, let us consider a given router x and an LSA ℓ ∈ G_x that regards another router y. When its associated aging timer fires, ℓ is removed from G_x only if any of the following three conditions is satisfied: Leader(x) ≠ x (Aging-Condition-1), Mode(x) = G (Aging-Condition-2), or y is unreachable from x in G_x (Aging-Condition-3). If ℓ is not removed, then a new associated aging timer is created.
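The discard rule (Discard-LSA-Condition-2) and the three forwarding cases above can be combined into a single sketch; the (seq, mode) tuples and plain-set link encoding are illustrative choices, not the dissertation's data layout.

```python
def process_incoming(mode_x, stored, lsa, tree_links, all_links, p):
    """Sketch of Discard-LSA-Condition-2 plus the three forwarding cases.
    `stored` and `lsa` are (seq, mode) pairs for the same source router;
    returns (new router mode, mode stamped on the LSA, links to forward on).
    p is the arrival link."""
    order = {'T': 0, 'G': 1}                 # mode G outranks mode T
    (s, m), (s2, m2) = lsa, stored
    if (s, order[m]) <= (s2, order[m2]):     # lexicographic comparison
        return mode_x, None, set()           # ignored: obsolete copy
    if m == mode_x:                          # Case 1: same mode
        links = tree_links if m == 'T' else all_links
        return mode_x, m, links - {p}
    if m == 'G':                             # Case 2: T-mode router joins G
        return 'G', 'G', all_links - {p}     # (Enter-Mode-G-Condition-5)
    # Case 3: T-mode LSA at a G-mode router: re-flood as mode G on all
    # incident links, including the one it arrived on.
    return 'G', 'G', set(all_links)

tree, links = {'p1'}, {'p1', 'p2', 'p3'}
new_mode, out_mode, out = process_incoming('T', (4, 'T'), (5, 'G'), tree, links, 'p1')
print(new_mode, out_mode, sorted(out))       # G G ['p2', 'p3']
```

Note that with equal sequence numbers, a G-mode copy supersedes a stored T-mode copy but not the other way around, matching the lexicographic rule in the text.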
However, because a corrupted LSA maintained by the leader regarding a reachable router is not subject to aging, such corruption may persist for prolonged periods of time if not handled properly. Further, the corruption could propagate throughout the network as the leader includes the LSA in CTAs. This problem is detected and corrected as follows. Let ℓ be the LSA in G_α regarding a router x, where α is the leader router. Let ℓ' be the LSA in G_x regarding x itself. When router x receives a CTA from α, which includes ℓ, x checks ℓ against ℓ'. If any inconsistency is found, x switches to mode G (Enter-Mode-G-Condition-6), discards the CTA (Discard-CTA-Condition-3), and hence will not vote for α, forcing α also to switch to mode G and thus allowing the corrupted information to be aged out. Meanwhile, router x, now in mode G, floods periodically to provide α with its correct local status information. In order to avoid premature mode switching due to delays in LSAs reaching the leader, the above consistency check is performed only when the CTA is received more than t_objection_delay seconds after the creation of ℓ'.

In the T-LSR protocol, the leadership of the established leader is also subject to aging. Even during periods when there are no network topology changes, or when such changes do not affect its leadership, an established leader must periodically flood CTAs that contain the current spanning tree topology and epoch number. If a router does not receive such CTAs for a predetermined length of time, it must revert to mode G operation (Enter-Mode-G-Condition-7). Leadership aging addresses the concern that corruption problems in the epoch number of a previous CTA could prohibit the acceptance of subsequent CTAs for prolonged periods of time.

The handling of network partitioning.
As in the case of handling network component failures, the T-LSR protocol copes with network partitioning scenarios by having every router x monitor the set of routers reachable from x, V(G'_x). Let us consider a scenario where router α is the current leader and a component failure partitions the network into two segments, S1 and S2. Let us assume that α ∈ S1. Routers in S2 will notice the loss of connectivity to the current leader, switch to mode G (Enter-Mode-G-Condition-1), and select a new leader, call it β. Router β will also select itself as the new leader, and, since it is in mode G, will compute and construct a spanning tree within S2 (Compute-Tree-Condition-1). In the meantime, router α will switch to mode G due to changes in the set V(G'_α) and hence will compute a new spanning tree for use in S1. Should the segments S1 and S2 be merged later, routers in S2, including β, will change their preferred leaders to α. Simultaneously, router α will switch to mode G due to changes in V(G'_α) and hence compute a new spanning tree to cover the entire network.

Handling incorrect leadership problems. As defined earlier, the term incorrect leadership problem refers to any corruption problem involved in leader election and spanning tree construction. Let us consider the example shown in Figure 8.4. In the example, the network image at the leader router α is corrupted in such a way that router X, which is reachable from the leader in the real network topology depicted in Figure 8.4(a), is considered unreachable by the leader (see Figure 8.4(b)). Consequently, leader α constructs the incorrect spanning tree T depicted in Figure 8.4(c), which of course does not cover router X. Presuming that every router selects α as the preferred leader, router α will obtain the votes from all the routers covered by T and broadcast an LEA. Subsequently, all routers except X operate in mode T, and any event-driven flooding from a non-X router will use T links and will not reach router X.
[Figure 8.4: An example of the incorrect leadership problem. (a) real network topology; (c) resultant incorrect spanning tree T.]

It may be argued that, since X does not receive the above LEA and will remain in mode G, the periodic flooding of LSAs from X, which are in mode G, will switch the operation modes of the other routers back to G, at least curbing the lifespan of the above situation to within a flooding cycle. To see that this mechanism does not necessarily work, let us further assume that ℓ = LSA(X, s) is the LSA in G_X that regards X itself and that ℓ' = LSA(X, s') is the corrupted copy of ℓ in G_α. Through the CTA broadcasts from α, LSA ℓ' is propagated throughout the network and incorporated into the network images at all routers but X. If s' = s + 2^28 and X floods on average every 60 seconds, then it will take X more than 500 years to use sequence numbers larger than s'. Before then, all the LSAs from X are ignored by other routers. The incorrect spanning tree T and the false leadership of router α in Figure 8.4 can last for a prolonged period of time.

To cope with this problem, every router, upon receiving a CTA containing a tree topology T, checks whether all its neighboring nodes are present in T. If T fails this test at any router, that router will discard the CTA (Discard-CTA-Condition-4) and revert to G-mode operation (Enter-Mode-G-Condition-8). In the example of Figure 8.4, router Y will notice the absence of X from the spanning tree in Figure 8.4(c) and refuse to vote for the leader. This keeps the leader router (and all other routers as well) in mode G, enabling the corrupted information regarding X to be aged out.
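The neighbor-presence test (Discard-CTA-Condition-4) amounts to a purely local set check; the (u, v) edge-set encoding of T below is an assumption made for illustration.

```python
def cta_tree_acceptable(my_neighbors, tree_links):
    """Local test behind Discard-CTA-Condition-4: every physical neighbor
    of this router must appear somewhere in the advertised tree T.
    `tree_links` is a set of (u, v) tree edges; a sketch only."""
    covered = {r for link in tree_links for r in link}
    return my_neighbors <= covered

# Router Y is physically adjacent to X, but the advertised tree omits X,
# as in Figure 8.4(c); Y therefore refuses to vote for the leader.
tree = {('alpha', 'Y'), ('alpha', 'W')}
print(cta_tree_acceptable({'alpha', 'X'}, tree))   # False: discard the CTA
print(cta_tree_acceptable({'alpha'}, tree))        # True
```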
It is proved in Section 8.4 that, even when the current, incorrect leadership and spanning tree hinder the dissemination of subsequent network status updates, this simple test is sufficient for correct leadership and a correct spanning tree to eventually be constructed, provided that corruption does not indefinitely afflict the transmission of T and the ensuing ballots.

8.3 Algorithms

In this section, we present the algorithms of the T-LSR protocol. In the discussion, for a given router x, we denote by Children(x) the set of children of x in the current spanning tree, relative to Leader(x), and by Parent(x) the parent of x in the tree.

When a router x needs to flood its local status, either for periodic flooding or to broadcast changes in its local status, it invokes the FloodLocalStatus routine shown in Figure 8.5. Parameter x in the routine indicates the ID of the caller router. The routine first updates the content of LSA ℓ, the LSA regarding x in its own network image G_x, to reflect the current local status. Next, x switches to mode G and searches for a new preferred leader if there is any change in the set V(G'_x) after the update of ℓ (Compute-Leader-Condition-1) or if the rank of x itself has changed (Compute-Leader-Condition-2). Router x must also switch to mode G if any incident tree link is found malfunctioning (Enter-Mode-G-Condition-2). Finally, router x increments the sequence number of ℓ and forwards ℓ along the set of links defined by its current operation mode.

Algorithm: FloodLocalStatus.
Input: router ID x.
  U = V(G'_x).
  Let ℓ = LSA(x, s, m) be the LSA regarding router x in G_x.
  Update the content of ℓ (and, hence, G_x and G'_x) to reflect the
    current local status of x.
  IF (Compute-Leader-Condition-1: U ≠ V(G'_x)) OR
     (Compute-Leader-Condition-2: Rank(x) has changed) THEN
    Mode(x) = G.  (Enter-Mode-G-Condition-1)
    SetPreferredLeader().
  ELSE IF (Enter-Mode-G-Condition-2: ∃e ∈ E(x, T) that has failed) THEN
    Mode(x) = G.  Epoch(x) = −1.
  ENDIF
  s = s + 1.
  Forward LSA(x, s, Mode(x)) along links in E(x, Mode(x)).

Figure 8.5: The routine that floods router local status.

Shown in Figure 8.6 is the routine that processes incoming LSAs. In the routine, parameter x indicates the ID of the caller router, and ℓ is an incoming LSA that regards router y, with sequence number s and mode m, arriving on link p. The first task of the routine is to check whether x should discard ℓ according to Discard-LSA-Condition-1 and Discard-LSA-Condition-2. Should ℓ pass these tests, it is accepted by x and incorporated into G_x. Subsequently, the ProcessLSA routine checks for changes in the reachable set, V(G'_x), and in the rank of router y. Whenever such a change is detected, router x switches to mode G and recomputes its preferred leader. Lastly, the routine forwards ℓ according to LSA-Forwarding-Case-1, LSA-Forwarding-Case-2, and LSA-Forwarding-Case-3.

Algorithm: ProcessLSA.
Input: router ID x and ℓ = LSA(y, s, m) that arrives on link p.
  U = V(G'_x).
  IF (x = y) THEN  /* Check for corruption in LSAs regarding myself */
    Let ℓ' = LSA(y, s', m') be the LSA regarding router y in G_x.
    IF /* Discard-LSA-Condition-1 */
       (s > s') OR ((s = s') but (ℓ ≠ ℓ')) OR
       ((s < s') and (ℓ' has existed for more than t_objection_delay seconds)) THEN
      Mode(x) = G.  Exit.  /* ℓ is discarded */
    ENDIF
  ELSE IF (Discard-LSA-Condition-2: (s, m) ≤ (s', m')) THEN
    Exit.
  ENDIF
  Replace ℓ' with ℓ in G_x, and set up an aging timer for ℓ.
  /* Changes in leadership rank or the set of reachable routers? */
  IF (Compute-Leader-Condition-1: U ≠ V(G'_x)) OR
     (Compute-Leader-Condition-2: the ranks of y differ in ℓ and ℓ') THEN
    Mode(x) = G.  SetPreferredLeader().
  ENDIF
  IF (LSA-Forwarding-Case-1: m = Mode(x)) THEN
    Forward ℓ along links in E(x, m) − {p}.
  ELSE IF (LSA-Forwarding-Case-2: m = G and Mode(x) = T) THEN
    Mode(x) = G (Enter-Mode-G-Condition-5), and forward ℓ along
      links in E(x, G) − {p}.
  ELSE IF (LSA-Forwarding-Case-3: m = T and Mode(x) = G) THEN
    Forward LSA(y, s, G) along links in E(x, G).
  ENDIF

Figure 8.6: Processing incoming LSAs.

The routine that a router x uses to set its preferred leader, Leader(x), is presented in Figure 8.7. As stated, the preferred leader of x is set to the reachable router w with the highest rank, according to the local network image of x. If x changes its preferred leader, then the current epoch number is set to −1, and, as such, router x can accept a tree topology from the new leader with any epoch number. If the new value of Leader(x) is x itself, the routine invokes the BroadcastCTA routine, discussed next.

Algorithm: SetPreferredLeader.
Input: router ID x.
  old_leader = Leader(x).
  Let w be the router with the highest rank in G'_x.
  Leader(x) = w.
  IF (old_leader ≠ Leader(x)) THEN Epoch(x) = −1. ENDIF
  IF (Leader(x) = x) THEN BroadcastCTA(). ENDIF

Figure 8.7: Setting the preferred leader.

A router x for which Leader(x) = x periodically invokes the BroadcastCTA routine, shown in Figure 8.8. The routine first checks the ballots corresponding to the previous CTA broadcast and reverts to mode G if any ballot is missing (Enter-Mode-G-Condition-4). (If this is the first CTA broadcast by x, then the check is bound to fail, a result consistent with the fact that x has not yet established its leadership and must be in mode G.) If x is in mode G, meaning that it is still establishing its leadership, then it must compute a new spanning tree topology T. The routine then broadcasts a CTA that contains the tree topology T and the network image G'_x. Lastly, the routine clears the Flag data structures, and router x awaits ballots corresponding to this round of CTA broadcast.

When a CTA(α, G'_α, T, c) arrives at a router x via link p, the router invokes the ProcessCTA routine.
Algorithm: BroadcastCTA.
Input: router ID x.
  /* Check the ballots of the previous round of votes */
  IF (Enter-Mode-G-Condition-4: ∃z ∈ Children(x) such that Flag_x[z] = FALSE) THEN
    Mode(x) = G.
  ENDIF
  IF (Compute-Tree-Condition-1: Mode(x) = G) THEN
    Compute a tree T that spans V(G'_x).
    Tree(x) = T, and Epoch(x) = Epoch(x) + 1.
  ENDIF
  Forward CTA(x, G'_x, T, Epoch(x)) to Children(x).
  /* To track ballots for this round of votes: */
  Flag_x[z] = FALSE, ∀z ∈ Children(x).

Figure 8.8: The BroadcastCTA routine.

The ProcessCTA routine, shown in Figure 8.9, discards an arriving CTA if it is not from the preferred leader of x (Discard-CTA-Condition-1), if it contains an obsolete spanning tree topology (Discard-CTA-Condition-2), if the LSA regarding x in the CTA is inconsistent with the local status of x (Discard-CTA-Condition-3), or if some neighboring routers of x are absent from the spanning tree T contained in the CTA (Discard-CTA-Condition-4). If the CTA is accepted by x, the LSAs contained in the CTA are incorporated into the network image of x. Lastly, if the CTA contains a more recent tree topology than the one stored locally, x updates its Epoch(x) data structure accordingly, and switches to mode G to avoid the use of tree-based flooding (Enter-Mode-G-Condition-3). Since routers must cast ballots in every round of CTA broadcast, router x, before ending the routine, sets Flag_x[z] to FALSE for all z ∈ Children(x) to await ballots from its children in the tree T.

Algorithm: ProcessCTA.
Input: router ID x and an arriving CTA(α, G'_α, T, c).
  IF (Discard-CTA-Condition-1: Leader(x) ≠ α) OR
     (Discard-CTA-Condition-2: c < Epoch(x)) THEN
    Exit.
  ENDIF
  Let ℓ = LSA(x, s, m) be the LSA in the CTA regarding x.
  Let ℓ' = LSA(x, s', m') be the LSA in G_x regarding x.
  IF (m ≠ m') OR (s > s') OR ((s = s') but (ℓ ≠ ℓ')) OR
     ((s < s') and (ℓ' has existed for more than t_objection_delay seconds)) THEN
    Mode(x) = G.  /* (Enter-Mode-G-Condition-6) */
    Exit.
         /* (Discard-CTA-Condition-3) */
  ENDIF
  /* Check for corruption in T */
  IF (Discard-CTA-Condition-4: ∃ a neighbor of x ∉ V(T)) THEN
    Mode(x) = G.  Exit.
  ENDIF
  FOR (each LSA ℓ = LSA(y, s, m), y ≠ x, contained in the CTA) DO
    Let ℓ' = LSA(y, s', m') be the LSA regarding router y in G_x.
    IF ((s, m) ≥ (s', m')) THEN
      Replace ℓ' with ℓ in G_x, and reset the aging timer for ℓ.
    ENDIF
  ENDFOR
  Forward this CTA to routers in Children(x).
  IF (c > Epoch(x)) THEN
    Mode(x) = G.  (Enter-Mode-G-Condition-3)
    Tree(x) = T, and Epoch(x) = c.
  ENDIF
  Flag_x[z] = FALSE, for each z ∈ Children(x).

Figure 8.9: The processing of incoming CTAs.

When a Ballot(z, α, c) message arrives at a router x, the router calls the ProcessBallot routine shown in Figure 8.10. A ballot is processed only if it is for the preferred leader α of x, if it belongs to epoch Epoch(x), and if it comes from a child of x in Tree(x). To process such a ballot, x sets the flag corresponding to the child z and establishes the x-z tree link. When the Flag data structures indicate the receipt of legitimate ballots from all the children of x, the remaining actions of the routine depend on whether x is the leader. If Leader(x) = x, then router x issues an LEA(x, Epoch(x)) message to broadcast the successful establishment of its leadership. Otherwise, router x casts its own ballot by sending a Ballot(x, α, c) message to its parent.

Algorithm: ProcessBallot.
Input: router ID x and a message Ballot(z, α, c).
  IF (α = Leader(x)) AND (c = Epoch(x)) AND (z ∈ Children(x)) THEN
    Flag_x[z] = TRUE.  Establish the x-z tree link.
    IF (∀z' ∈ Children(x), Flag_x[z'] = TRUE) THEN
      IF (Leader(x) = x) THEN
        IF (Issue-LEA-Condition-1: Mode(x) = G) THEN
          Forward an LEA(x, c) along all incident links of Tree(x).
        ENDIF
      ELSE
        Send message Ballot(x, α, c) to Parent(x).
      ENDIF
    ENDIF
  ENDIF

Figure 8.10: The processing of ballot messages.
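The Flag_x bookkeeping shared by Figures 8.8 and 8.10 can be sketched as a small tracker; class and method names here are illustrative.

```python
class BallotTracker:
    """Sketch of the Flag_x bookkeeping: a router waits for a ballot from
    every child in the current tree before acting (casting its own ballot
    upward, or, at the leader, issuing the LEA)."""
    def __init__(self, children):
        self.flags = {z: False for z in children}

    def record(self, child, leader, epoch, my_leader, my_epoch):
        # Accept only ballots for my preferred leader, in the current
        # epoch, and from an actual child in the tree; return True once
        # all children have voted.
        if leader == my_leader and epoch == my_epoch and child in self.flags:
            self.flags[child] = True
        return all(self.flags.values())

t = BallotTracker(['y', 'z'])
print(t.record('y', 'a', 3, 'a', 3))   # False: still waiting on z
print(t.record('z', 'a', 3, 'a', 3))   # True: all ballots collected
```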
When an LEA(α, c) arrives at a router x, the router invokes the ProcessLEA routine shown in Figure 8.11. The LEA is first checked for the choice of the leader and the current epoch number. If both conditions are satisfied, then router x switches to mode T and forwards the LEA to its children in the current spanning tree. Hereafter, router x enters the normal operation period of the T-LSR protocol: it stops periodic flooding and uses only tree links to advertise local status updates.

Algorithm: ProcessLEA.
Input: router ID x and an LEA(α, c).
  IF (Enter-Mode-T-Condition-1: α = Leader(x) and c = Epoch(x)) THEN
    Mode(x) = T.
    Forward the LEA to all z ∈ Children(x).
  ENDIF

Figure 8.11: The processing of LEAs.

8.4 Proof of Correctness

In this section we prove the correctness of the T-LSR protocol. As with any LSR protocol, we must be careful when defining what can and cannot be proved. For example, consider the problem of establishing leadership consensus in a hypothetical scenario where, whenever a router α is elected as the leader, that router immediately crashes. The other routers will detect the loss of connectivity to α, prompting a new election. Further, let us assume that router α resumes execution shortly after a new leader is elected. If this scenario repeats itself indefinitely, and every newly suggested leader immediately crashes, then it is impossible for any leader-management protocol to maintain stable consensus.

We conclude that a more reasonable goal is to study the behavior of the T-LSR protocol in response to a finite set of events. This model reflects real-world circumstances in which bursts of adverse events are followed by quiet periods, which allow an LSR protocol to return to normal operation. We denote by ℰ a finite set of network status change and transmission corruption events, and by t_0 a time after the last event in ℰ. Let G be the network topology after ℰ, and let α be the router with the highest rank, Rank(α), in G.
Let us assume here that the network topology G is connected. To accommodate disconnected networks, one can simply apply the following argument to individual network segments. It is further assumed that the events in ℰ leave the T-LSR protocol in a chaotic state. Specifically, we assume the following at time t_0.

• The elements of network images are assumed to be random. Specifically, at any router x, V(G) − V(G_x) may not be empty (that is, some routers may be absent from G_x) and V(G_x) − V(G) may not be empty ("ghost routers" could exist in the network image of x). Further, the content of the LSA regarding a router y ≠ x in G_x is also assumed to be random. For example, Rank(y) may be corrupted at x, some incident links of y may be missing in G_x (and hence G_x may not be connected), and ghost incident links of y may exist in G_x.

• At any router x, the values of the Mode(x) and Epoch(x) data structures are assumed to be random.

The goal of this section is to show that the T-LSR protocol will establish correct leadership, construct a correct spanning tree, and achieve consistent network images in the presence of such chaotic states. We do assume, however, that every router x possesses correct knowledge about its own status and local surroundings; specifically, the LSA regarding x itself is not corrupted in G_x. We emphasize that the chaotic states described above can only be created by corruption problems, a very rare type of event. Network status changes, a type of event that happens much more frequently than corruption, always leave the T-LSR protocol in consistent states. The behavior of the T-LSR protocol in the handling of network status changes is investigated by a simulation study; results of the study are presented in Section 8.5.

First, we deal with a type of corruption problem, involving ranks, that could hinder the establishment of leadership consensus.
Let Φ_x(t) be the set of routers whose (corrupted) ranks in G_x at time t are higher than Rank(α). There are two possible causes for a router y to be in Φ_x(t₀): the Rank(y) information is corrupted at x, or y itself is a ghost router. The latter case might happen due to the arrival of an LSA(z, s) whose router ID is corrupted and transformed into a non-existent router ID y and whose rank is corrupted and is larger than Rank(α). Of course, a non-empty Φ_x(t) will prohibit x from selecting α as its preferred leader. In our first lemma, we show that the set Φ_x will become empty t_aging seconds after t₀, where t_aging is the length of aging timers.

Lemma 6: At any router x, Φ_x(t) is empty at any time t > t₀ + t_aging.

Proof: Let y be any router in Φ_x(t₀). Regarding what may happen to y during the [t₀, t₀ + t_aging] period, there are two cases. First, an LSA or CTA originated at router y might arrive and be accepted during the period, fixing the incorrect rank information regarding y at x and consequently removing y from Φ_x(t). Second, if neither such LSAs nor CTAs from y are accepted by x during the period, then we claim that the aging mechanism of the T-LSR protocol will remove y from Φ_x(t). Let ℓ_y be the LSA in G_x that is regarding y, and let t′, t₀ < t′ ≤ t₀ + t_aging, be the time when the aging timer of ℓ_y fires. Depending on the value of Leader(x) at t′, we further consider two subcases. In the discussion, we recall that in the T-LSR protocol all the LSAs maintained by a non-leader router x, one whose Leader(x) ≠ x, are subject to aging, whereas at an established leader only LSAs regarding unreachable routers are subject to aging.

1. Leader(x) ∈ Φ_x(t′). In this case, Leader(x) ≠ x because, due to the assumption that x possesses correct knowledge of its local status, including its rank information, x cannot be in Φ_x(t′). By Aging-Condition-1, ℓ_y ages out, and y is removed from Φ_x.

2. Leader(x) ∉ Φ_x(t′). In this case, Leader(x) may be x.
However, routers in Φ_x(t′), including y, must be disconnected from x in G_x at time t′; otherwise Leader(x) would be set to a reachable router in Φ_x(t′). By Aging-Condition-3, ℓ_y ages out, and y is removed from Φ_x.

In either of the above cases, every element y ∈ Φ_x(t₀) will be removed from the set by time t₀ + t_aging, concluding the proof. □

The next lemma shows how the T-LSR protocol correctly handles incomplete spanning tree topologies.

Lemma 7: Given a spanning tree topology T that is broadcast by a router x, if T does not include every router in the network, then T cannot win all the votes for x from the routers covered by T.

Proof: Let V_T = V(T) ∩ V(G) be the set of routers covered by T, and V̄_T = V(G) − V(T) be the set of routers not covered by T. Consider any router y ∈ V̄_T and any x-to-y path P in the connected, physical topology G. Since x ∈ V_T and y ∈ V̄_T, there must exist two consecutive nodes w and z in P such that w ∈ V_T and z ∈ V̄_T. If T is discarded by any router in V(T) before arriving at w, then of course T cannot win all the votes for x. Otherwise, when T arrives at w, router w will detect the absence of its neighbor z in T. By Discard-CTA-Condition-4, router w will not vote for x. We are done. □

Next, we investigate what happens to a leader candidate when some other router does not prefer the candidate.

Lemma 8: Given any two routers x and y such that Leader(x) = x and Leader(y) ≠ x at a time t ≥ t₀, if Leader(x) is not changed after time t and if Leader(y) is never set to x after time t, then there exists a time t′ ≥ t such that router x will remain permanently in mode G after t′.

Proof: Since Leader(x) is set to x permanently, router x broadcasts an infinite sequence of spanning trees (T₁, T₂, T₃, ...) after time t. Let us consider the topology T₁. If T₁ covers y, then, of course, router x cannot obtain the vote of router y. If T₁ does not cover y, then, by Lemma 7, T₁ cannot win the votes from all the routers in V(T₁).
In either case, router x must set its mode to G at a time t′ ≥ t. Because the above argument also applies to every subsequent tree topology T_i, i > 1, router x will stay in mode G indefinitely after time t′. We are done. □

With the above properties established, we are ready for the first major result. In the proof, we use the expression "by Lemma 8 (w, z, t)" to cite that lemma with router w in the position of x and z in the position of y, using the time reference point t. Recall that α is assumed to be the router in G with the highest rank.

Theorem 7 (Leadership Consensus Property): There exists a time t₁ ≥ t₀ + t_aging such that at any time t ≥ t₁ and for every x ∈ V(G), Leader(x) = α.

Proof: Given a router x, we denote by t_x the earliest time when Φ_x is empty. At router α, Leader(α) at time t_α must be α itself. Moreover, if any router x sets Leader(x) to α at time t_x, then Leader(x) will not be changed after t_x, because no (rank) corruption will occur after time t₀. It follows that Leader(α) will not be changed after t_α.

Next, we consider a router x ≠ α whose Leader(x) is never set to α after t_x. Although Lemma 6 assures us that Φ_x will be empty by time t₀ + t_aging, Leader(x) is not guaranteed to be set to α; the rank of α in G_x itself, denoted Rank_x(α), may be corrupted. A different preferred leader, whose rank is higher than Rank_x(α), may be selected by x at t_x. Under such circumstances, by Lemma 8 (α, x, t′), where t′ = max{t_x, t_α}, by our earlier argument that Leader(α) will not change value after time t_α, and by the selection of x, router α will operate in mode G indefinitely after some time t″ ≥ t′. Next, we must show that the corrupted Rank_x(α) information at x is subject to aging, allowing the G-mode periodic flooding from α to correct the corruption. Let ℓ_α be the LSA in G_x that is regarding router α and that contains the corrupted rank information Rank_x(α). Let Γ = (t₁, t₂, t₃, ...)
be the sequence of times when the aging timer associated with ℓ_α fires. If there exists any t_i ∈ Γ such that any one of the three aging conditions holds at time t_i, then ℓ_α ages out at time t_i (in this case, t_i is the last, largest element in Γ). Assuming otherwise (that is, that none of the three aging conditions holds at any time t_i ∈ Γ) would result in an infinite Γ. Under such circumstances, let t_a be the smallest element in Γ that is larger than t_α, the time when Leader(α) is permanently set to α. By Aging-Condition-1 and the selection of the elements of Γ, Leader(x) = x at time t_a. It follows that, by Lemma 8 (x, α, t_a) and the fact that Leader(α) will not change value after time t_α, router x must change permanently to mode G at some time t′_x ≥ t_a. Let t_b be any element in Γ that is larger than t′_x. At time t_b, Mode(x) = G, a contradiction to the assumption that none of the three aging conditions, including the condition Mode(x) = G, holds at time t_b. Hence, ℓ_α will be aged out, enabling router x to accept the G-mode flooding from α and learn the correct rank of α. Consequently, Leader(x) will be set to α, a contradiction to our assumption that Leader(x) is never set to α. We are done. □

Next, we turn our attention to the problem of achieving consistent routing information. Specifically, we show that all routers will possess network images identical to G. The next lemma establishes this property at the leader router α. In the proof, we use the notation G_x(t) to denote the network image of router x at time t.

Lemma 9: The network image at the leader α will converge to G.

Proof: We denote by Ω(t) the set of routers whose respective information is incorrect in G_α at time t. There are three causes for a router x to be included in Ω(t₀): x is a ghost router, x is a real router that is absent from G_α, or x is a real router that is present in G_α but whose LSA ℓ_x in G_α is corrupted.
We note that new elements (that is, routers) cannot be added to the "corruption set" Ω after time t₀ because corruption problems cannot happen after that time. If there exists a time t ≥ t₀ such that Ω(t) = ∅, then G_α converges to G at that time and will remain so thereafter. Let us assume the opposite, that is, that Ω(∞) is not empty. As argued earlier, there exists a time t_α when Leader(α) is set to α permanently, and hence router α will broadcast, regardless of the presence or absence of corruption problems in G_α, an infinite sequence of tree topologies after time t_α. Let us denote by t_Ω the time when the corruption set Ω has stabilized to Ω(∞). Since the incorrect parts of G_α stabilize at time t_Ω, the network image G_α itself also stabilizes at that time. This stabilized image will be denoted by G_α(∞). We further denote by 𝒯 = (T₁, T₂, T₃, ...) the infinite sequence of tree topologies broadcast by α after time max{t_Ω, t_α}.

First, we claim that there are no ghost routers left in Ω(∞). To see this property, we assume that x is a ghost router that remains in V(G_α) indefinitely. If x is disconnected from α in G_α(∞), then by Aging-Condition-3 it will be aged out by time t_Ω + t_aging, a contradiction. If x is connected to α in G_α(∞), then we claim that x is covered by every T ∈ 𝒯. Since x cannot vote, router α will have to remain in mode G indefinitely and age out x, a contradiction. To see why a ghost router x that is connected to α in G_α(∞) must be in V(T) for any T ∈ 𝒯, let us first consider the tree T₁ in 𝒯. We point out that the fact that T₁ is used after t_Ω does not imply that it was computed after that time. Therefore, one cannot infer the coverage of x by T₁ directly from the presence of x in G_α(∞). Let us assume that x is disconnected from α in G_α when T₁ is computed.
It follows that, from the point of the computation of T₁ to time t_Ω, router α must detect at least one change in the set V(G_α) (specifically, the addition of x) and must compute a spanning tree to cover x, rendering T₁ obsolete by time t_Ω, a contradiction to the definition of T₁. Hence, T₁ must cover x. Moreover, if router α never re-computes the spanning tree after T₁, then it uses T₁ indefinitely after t_Ω (that is, T_i = T₁ for any i > 1). If router α does perform this re-computation after T₁, then subsequent tree topologies in 𝒯 are based on G_α(∞) and must contain x. In both cases, every tree T ∈ 𝒯 covers router x.

Next, let us deal with any real router x in Ω(∞) whose corrupted LSA ℓ_x remains in G_α indefinitely (that is, ℓ_x is never aged out). Let us consider any tree topology T ∈ 𝒯. If T contains x, then router α needs the vote of x. When T arrives at x, router x will detect the corruption in ℓ_x and refuse to vote for α (Discard-CTA-Condition-3), forcing α to switch to mode G. If T does not contain x, then, by Lemma 7, T cannot win all the necessary votes for α and router α must also switch to mode G. Since the above argument applies to all the tree topologies in 𝒯, router α will remain in mode G indefinitely. It follows that ℓ_x will age out, a contradiction to the selection of x.

As such, there can be only one type of corruption problem for elements in Ω(∞): they must be real routers that are absent from G_α after time t₀. Let x be any such router in the set Ω(∞). By Lemma 7, router α cannot establish its leadership and will be permanently in mode G after some time t′. By Theorem 7, Leader(x) will be permanently set to α at some time t_x, and thus the LSA regarding α at x must be subject to aging after t_x. This ensures that the G-mode flooding from α will be accepted by x, turning the operation mode of x to G.
Consequently, x periodically floods its local status in mode G, and this flooding is guaranteed to be accepted by router α, which does not have x in its network image at all. However, this contradicts the assumption that x is absent from G_α indefinitely. We have excluded all the possible causes of a non-empty Ω(∞), and hence have shown that G_α will converge to G. □

After corruption problems in G_α have been "cleaned up," the spanning tree topologies computed by α will be correct and accepted by all the routers in the network, as we show below. Recall that t_α denotes the time when Leader(α) is permanently set to α. Further, since we have shown that Ω(∞) = ∅, t_Ω denotes the time when all the corruption problems in the network image of leader α have been removed.

Theorem 8 (Tree-Topology Consensus Property): There exists a time after which all the routers in the network agree on the same spanning tree topology, which is a correct spanning tree topology proposed by α.

Proof: Let us consider the first spanning tree T₁ in 𝒯, the sequence of spanning tree topologies broadcast by α after time max{t_Ω, t_α}. Since T₁ may be computed before t_Ω, that is, before all corruption problems in G_α are resolved, it could contain the three types of flaws listed below.

1. Tree T₁ contains a ghost router x. In this case, of course, x will not vote for α, which must in response switch to mode G.

2. Tree T₁ does not cover a (real) router x. (This case happens because, when T₁ is computed, router x is absent from G_α.) By Lemma 7, T₁ cannot win all the votes from routers in V(T₁), and router α must switch to mode G.

3. Tree T₁ uses a non-existent link x-y, where both x and y are real routers. (This case happens when LSAs regarding x and/or y are corrupted in G_α when T₁ is computed.) Without loss of generality, we assume that x is the parent of y in T₁.
In this case, since x cannot deliver the CTA that contains T₁ to y, router y will not vote for α, again forcing α to switch to mode G.

If T₁ suffers from any one of the above flaws, then α switches to mode G and must compute a new spanning tree for the next CTA broadcast. The new tree, computed after t_Ω, will be a correct spanning tree for G and will be included in CTA broadcasts thereafter. If, on the other hand, T₁ is free from the above problems, then T₁ is a correct spanning tree topology (it covers every router and contains neither non-existent routers nor non-existent links), and it will be used as the tree topology after t_Ω. In the case where T₁ is the permanent spanning tree topology after t_Ω, it is worth pointing out that, when T₁ is computed, G_α may still contain corruption problems that do not affect the correctness of spanning tree computation, such as disconnected ghost routers. Further, the above argument does not exclude the possibility that T₁ is computed before t₀; as such, the argument also applies to an empty event set E.

Let T be the final, correct spanning tree topology computed by α, and let c be the epoch number of T. After all larger-than-c values of the Epoch(x) data structures throughout the network age out, and after all routers select α as their preferred leader, T will be accepted by all the routers in the network, concluding the proof. □

Finally, we are ready to establish the most important property of any LSR protocol: the capability to maintain correct, consistent network images throughout the network.

Theorem 9 (Network Image Consensus Property): The network image G_x of every router x ∈ V(G) will converge to G.

Proof: After a correct spanning tree is constructed, router α will broadcast an LEA, and all routers will enter mode-T operation. Consequently, all non-leader routers stop performing periodic flooding. Since there are no events after t₀, non-leader routers will not flood event-driven LSAs either.
Let ℓ = LSA(y, s, m) be the LSA regarding y in G_α, and let ℓ′ = LSA(y, s′, m′) be the LSA regarding y in G_x, where x is any non-leader router. Upon receiving a CTA from α, which contains ℓ, if (s′, m′) < (s, m), then router x will accept ℓ, learning the correct status of y. If (s′, m′) ≥ (s, m), then, since x will not receive periodic flooding from y, ℓ′ will be aged out in t_aging seconds, again allowing ℓ to be accepted by x. We are done. □

8.5 Performance Evaluation

We studied the performance of the T-LSR protocol through simulation. The simulator is based on the CSIM package [48]. Confidence intervals were computed, but in most cases they are very small and, for clarity, are not shown in the plots. Networks comprising up to 400 routers were simulated. Such network sizes conform with those supported by existing LSR standards. (For example, the OSPF protocol supports networks with up to 200 routers.) For each network size, 100 graphs were generated randomly. To conform to network topology characteristics observed in the Internet [57], the average node degrees of these graphs are typically small, ranging from 2.25 (for 10-node graphs) to 4 (for 400-node graphs).

Each message transmission incurs software overheads, including message copying, error checking, processor interrupts, and so forth. Of course, such overheads vary from platform to platform. In this study, we measured these overheads on the ATM testbed in our laboratory. The testbed comprises Sun SPARC-10 workstations equipped with Fore SBA-200 adapters and connected with three Fore ASX-200 switches. From these measurements, we obtained the figure 600 μsec, which includes the overhead at both the sending and receiving switches. Since this figure also conforms with typical raw IP overheads that researchers have observed on a variety of workstations [71], the results reported here may be applicable to other LSR-based networking platforms.

Comparison of periodic flooding overhead.
First, we compare the T-LSR protocol with the C-LSR protocol when performing periodic flooding. In one cycle of periodic flooding, called a re-flooding cycle, each router floods exactly once under the C-LSR protocol. In the case of the T-LSR protocol, the leader router broadcasts a CTA, and the other routers cast ballots. For either LSR protocol, we measured the number of messages processed, including acknowledgments, per router per re-flooding cycle. The results of this study, presented in Figure 8.12(a), illustrate a major advantage of the T-LSR protocol, namely, reducing the number of message interrupts at each router.

To account for the differences in message size (the T-LSR protocol uses relatively large messages, namely CTAs, in periodic flooding), we also measured the total message processing time at each router, using the per-byte software overhead of 1.09 as reported in [71]. The resulting metric can be considered the average workload at each router. To compute the lengths of CTAs, we used the LSA format of the OSPF protocol, where an LSA of a router with d incident links is 24 + 12 × d bytes long (excluding the IP header). Thus, in a network with N routers and with average node degree D, a CTA comprises N × (24 + 12 × D) bytes. As we can see in Figure 8.12(b), the T-LSR protocol imposes only a small fraction of the workload of the C-LSR protocol.

[Figure 8.12: Comparison of periodic-flooding overhead: (a) messages per router and (b) average processing time per router, versus network size (routers).]

To further understand the behavior of the periodic flooding mechanism of the T-LSR protocol, we plot in Figure 8.13 the times used by CTA broadcasts in networks of different sizes.
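The CTA length formula quoted above can be checked with a quick computation; the function names below are illustrative, and the degree value is taken from the largest simulated graphs.

```python
# Worked example of the CTA length formula (OSPF LSA format): an LSA for a
# router with d incident links occupies 24 + 12*d bytes (excluding the IP
# header), so a CTA carrying one LSA per router in a network of N routers
# with average node degree D comprises roughly N * (24 + 12*D) bytes.
# Function names are assumptions made for this example.

def lsa_bytes(d):
    """Length of one OSPF-format LSA for a router with d incident links."""
    return 24 + 12 * d

def cta_bytes(n_routers, avg_degree):
    """Approximate length of a CTA carrying one LSA per router."""
    return n_routers * lsa_bytes(avg_degree)

# The largest simulated graphs have 400 routers with average degree about 4:
assert lsa_bytes(4) == 72
assert cta_bytes(400, 4) == 28800   # roughly 28.8 KB per CTA broadcast
```

This back-of-envelope figure makes it plausible that, as noted above, the overhead of CTA broadcasts stems primarily from the large sizes of the CTAs themselves rather than from the number of messages.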
A CTA broadcast begins at the moment when the leader starts sending the corresponding CTA and ends at the moment when the leader receives all necessary ballots. Since, when the leader router is in mode G, a CTA broadcast is also used for leader (re)election and spanning tree construction, the results in Figure 8.13 can also be interpreted as the leader election and spanning tree construction times of the T-LSR protocol. As we can see in the figure, a CTA broadcast can typically be completed within 350 milliseconds. The overhead of CTA broadcasts primarily stems from the large sizes of CTAs.

[Figure 8.13: Efficiency of CTA broadcast: completion time (ms) versus network size (routers).]

Performance of individual flooding operations. In addition to periodic flooding, both LSR protocols use event-driven flooding to disseminate changes in network status. For such flooding operations, we are interested in three performance metrics: the LSA receipt time, the flooding completion time, and bandwidth consumption. The LSA receipt time of a given router is the time when the first copy of the LSA arrives at the router, whereas the flooding completion time at the router is the time when the router finishes processing the last acknowledgment pertaining to this flooding operation. The bandwidth metric refers to the total number of LSA forwardings incurred by a flooding operation. The averaged results regarding these metrics are presented in Figure 8.14.

As seen, the C-LSR protocol outperforms the T-LSR protocol in both time metrics. This is because, under the conventional flooding algorithm, a router acts aggressively, forwarding an LSA to all its neighboring nodes (rather than only the neighbors defined by a spanning tree) and thus causing its neighboring nodes to receive the LSA earlier.
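The gap between the two protocols in the bandwidth metric can be approximated with a back-of-envelope model. This is a simplification of my own, not the simulator's exact accounting: assume that in conventional flooding the source sends the LSA to all of its neighbors and every other router relays it to all neighbors except the one it came from, giving 2|E| − (N − 1) forwardings, while tree-based flooding sends the LSA once over each of the N − 1 tree links.

```python
# Back-of-envelope model of the bandwidth metric (number of LSA forwardings).
# Simplifying assumptions (not the simulator's accounting): in conventional
# flooding the source forwards to all neighbors and every other router
# forwards to all neighbors but one, i.e., 2|E| - (N - 1) forwardings in
# total; tree-based flooding uses each of the N - 1 tree links exactly once.

def conventional_forwardings(n, avg_degree):
    """Estimated forwardings for conventional flooding in a graph with
    n routers and the given average node degree."""
    edges = n * avg_degree // 2          # |E| = N * D / 2
    return 2 * edges - (n - 1)

def tree_forwardings(n):
    """Forwardings for tree-based flooding: one per tree link."""
    return n - 1

# 400 routers with average node degree 4, as in the largest simulated graphs:
assert conventional_forwardings(400, 4) == 1201
assert tree_forwardings(400) == 399
```

For the 400-router graphs, the model predicts roughly 1200 forwardings for conventional flooding versus about 400 for tree-based flooding, which is consistent with the roughly three-to-one bandwidth advantage of T-LSR discussed below.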
However, this aggressiveness also implies that the router has to perform larger numbers of LSA forwardings and process more acknowledgments, as clearly shown in the results regarding the bandwidth metric plotted in Figure 8.14(b). In this metric, the T-LSR protocol enjoys a comfortable lead, of course, because it uses only tree links to forward LSAs.

[Figure 8.14: Comparison of event-driven flooding performance: (a) LSA receipt and completion times for C-LSR and T-LSR, and (b) number of LSA forwardings, versus network size (routers).]

In summary, during normal operation periods of the T-LSR protocol, a flooding operation is somewhat slower to deliver the respective LSA, but much more economical in terms of operational overhead than its conventional counterpart. Since the completion times of T-LSR are typically less than 12 milliseconds, the T-LSR protocol still retains the responsiveness of C-LSR. Moreover, in the C-LSR protocol, while an LSA may be received earlier, the ensuing processing of the LSA, such as the updating of routing tables, will be slowed down by the remaining tasks of the flooding operation. It is our belief that T-LSR's large advantage in processing overhead outweighs its slightly slower LSA receipt time.

Evaluation of flooding mode switching. In the T-LSR protocol, a T-mode flooding operation has to switch to mode G if it cannot complete the operation in mode T, for example, due to the failure of a tree link. In this part of our study, we evaluate the overhead imposed by flooding mode switching.
Specifically, we assume that two events e₁, advertised by router x, and e₂, advertised by router y, occur simultaneously, where e₁ affects the spanning tree but e₂ does not. (Prior to the events, all the routers are in mode T.) In the C-LSR protocol, the advertisements of both e₁ and e₂ use the conventional flooding algorithm. In the T-LSR protocol, router x will advertise e₁ in mode G, but router y, without knowing the malfunctioning status of the spanning tree, will initiate the advertisement of e₂ in mode T. The flooding of e₂ will switch to mode G later in order to reach all routers. We simulated these two flooding operations and measured the completion time and bandwidth consumption of the entire "scenario," that is, the flooding of both e₁ and e₂, under the different LSR protocols. The respective results are plotted in Figure 8.15. As shown, the T-LSR protocol performs slightly less efficiently in both performance metrics. This should not be a surprise, because the advertisement of e₂ in the T-LSR protocol incurs an unfinished T-mode flooding and a complete G-mode flooding.

[Figure 8.15: Overhead of flooding mode switching: (a) completion time and (b) number of LSA forwardings, versus network size (routers).]

We emphasize, however, that flooding mode switching is not expected to occur frequently. For a T-mode flooding to change to mode G, there must be a simultaneous G-mode flooding that has not reached the source of the T-mode flooding. Since the T-LSR protocol advertises in mode G the failure of a network component that damages the spanning tree, according to Figure 6.4(a) such an event can be advertised throughout the network within 10 milliseconds.
Thus, only if a T-mode flooding were initiated in the 10-millisecond window after such a failure would a mode switching occur. As a result, the fraction of T-mode flooding operations that must change mode is likely to be small.

8.6 Summary

We have proposed a novel LSR architecture, called the T-LSR protocol, that elects a leader to perform periodic flooding on behalf of other routers and constructs a spanning tree to reduce the overhead of advertising network status updates. Three consensus properties of the T-LSR protocol, namely, the Leadership Consensus Property, the Tree-Topology Consensus Property, and the Network Image Consensus Property, have been proved formally under any combination of network component failures, network partitioning, and message corruption events. Our simulation results show that the T-LSR protocol incurs a small fraction of the overhead of the C-LSR protocol during its normal operation periods, and only moderate overhead in adverse circumstances where the spanning tree is under repair/construction and the leader is being elected. The development of such a lightweight and robust LSR protocol is especially beneficial to communication applications, such as multimedia applications, that demand frequent updates of network status and resource availability information to ensure smooth transit of traffic streams.

Chapter 9

Conclusions and Future Work

If I were to conclude this work in one sentence, I would say that it "re-visits conventional group communication/distributed computing problems under an unusual assumption that complete information about the entire communication network is universally available." Of course, this assumption is not true in general cases; it holds only in a special computing environment where distributed algorithms are executed by routers/switches to implement networking protocols in LSR-based networks.
It should not be difficult to see that this computing environment provides powerful facilities that can greatly reduce the complexities of group communication problems. As an example, using the network images maintained by LSR, every participant of an election can learn of the loss of connectivity to the current leader "for free," without using any probing or monitoring mechanism. On the other hand, the low-level nature of this computing environment presents unique challenges. The biggest challenge is that, since networks are expected to continue providing communication services even in the presence of exceedingly rare but catastrophic adverse events, such as network partitioning and undetected transmission errors, the algorithms executed within the network to implement these services must also survive such events. A fundamental contribution of this dissertation is to show that developing distributed algorithms specifically for this computing environment results in better group communication solutions and, moreover, improvements to LSR itself.

Using the network images maintained by LSR, we have developed the GMC protocol, which can be considered a generic distributed implementation of MC/multicast routing algorithms. The ability to support different MC topology types and computation algorithms is important when a wide spectrum of multiparty communication applications, each with unique characteristics and expectations of the network, are deployed.

Using router/switch connectivity information provided by LSR, we have developed a network-level leader election protocol, the NLE protocol. We have discussed important network services that could benefit from the NLE protocol, including hierarchical routing, address mapping services, and multicast. Based on the NLE protocol, we have designed a centralized solution to the problem of multicast core management, namely, the LCM protocol.
In addition to using network images, the LCM protocol further makes use of the shortest-path routing trees computed by LSR to support certain tasks of multicast core management, such as core migration.

Finally, one of the most important group communication problems is network routing itself. Since every network switching element can observe only its local surroundings, the task of finding paths to relay communication traffic across the network must be performed by all routers/switches collectively. In this dissertation, we have advocated the use of group communication techniques to improve the performance of LSR. For ATM networks, we have developed a family of efficient flooding algorithms, the SAF protocols, that take advantage of the hardware switching capabilities of such networks. These protocols construct a spanning tree and a ring in a given ATM network to improve the performance of flooding operations in the network. For other LSR-based networks, such as many autonomous systems in the Internet, we have developed the T-LSR protocol to reduce the overhead associated with both periodic and event-driven flooding, using two group-communication-based techniques, spanning tree construction and leader election. Considering all these results, we have clearly demonstrated the mutually beneficial relationship between group communication and LSR.

The research of this dissertation can be extended in several directions, described as follows.

As pointed out earlier, LSR is not intended for direct implementation in large networks. This restriction inevitably raises the question of how our group communication protocols, which are all LSR-based, can be applied in such networks. In the case of ATM networks, the entire network is recursively divided into smaller routing domains, and the same routing method, namely LSR, is applied at all routing levels.
In such circumstances, our group communication solutions can be executed recursively in the routing hierarchy. For example, to construct a receiver-only MC in a large ATM network, a top-level MC can be constructed at the top routing level; the members of this MC are representative group members elected in those second-highest-level routing domains that contain at least one member of the group. Subsequently, each such domain constructs a second-level MC within that domain. The low-level MCs and the top-level MC are connected together using the representative group members in low-level domains as contact points. This process is repeated until the lowest routing level is reached. In fact, a well-defined routing hierarchy may enable the use of different MC topology types and computation algorithms at different levels for the same network group. We point out that the PIM protocol already supports such "hybrid MCs," to a limited extent, by constructing source-rooted trees at the inter-AS level and shared trees within ASs. The generalization of the GMC protocol to allow any MC type at any routing level and the potential applications of hybrid MCs constitute an interesting area of future research.

However, in other networks, most prominently the Internet, LSR is restricted to individual routing "islands" (that is, routing domains or autonomous systems), and another routing method is used to perform routing among these islands. In such cases, an LSR-based group communication protocol must cooperate seamlessly with a high-level protocol, which may not be LSR-based. Such integration issues require further investigation. For this integration problem, the technique that uses a leader election to reduce an LSR-based domain to a single node could play an important role. Considering again the example of the construction of a network-wide MC, an inter-AS MC protocol can treat an LSR-based AS as a single node by electing a representative member in that domain.
199 Another important area of future research is the support of QOS routing, which finds paths to carry resource-demanding, multimedia traffic. Many methods devel- oped in this dissertation address the Operational aspects of network routing and could be used in the design of mechanisms that timely disseminate the information required by QOS routing. For example, our methods could be used to elect rout- ing server/center, and/or reduce the workload of individual routers/switches. One promising possibility is to elect a leader router to periodically collect and broadcast the resource utilization status of the entire network. This collect-and-broadcast pro- cess could be a variation of the CTA broadcast and ballot collection process used in the T-LSR protocol (specifically, each router includes its up—to-date local status in its ballots). When the resource utilization status of the network fluctuates at a high rate, for example, due to a long burst of VC establishment and destruction requests, using an orderly process of information collection and dissemination might produce much more eflicient routing Operations, when compared to having all routers/ switches flood their status changes individually. Furthermore, an adaptive LSR protocol could be developed to adjust the period of the above collect-and-broadcast process so that the process is executed more frequently when the status of the network changes rapidly, and less frequently when the network is stable. Bibliography [1] S. E. Deering and D. R. Cheriton, “Multicast routing in datagram internetworks and extended LANS,” ACM Transactions on Computer Systems, vol. 8, pp. 85—110, May 1990. [2] S. Deering, D. L. Estrin, D. Farinacci, V. Jacobson, C.-G. Liu, and L. Wei, “The PIM architecture for wide-area multicast routing,” IEEE/A CM Trans. on Networking, vol. 4, pp. 153-162, April 1996. [3] A. Ballardie, P. Francis, and J. 
Crowcroft, “Core based trees,” in Proceedings of the ACM SIG COMM ’93, (San Francisco, CA), September 1993. [4] A. Ballardie, “Core based trees (CBT version 2) multicast routing.” Internet RFC 2189, September 1997. [5] D. Waitzman, C. Partridge, and S. Deering, “Distance vector multicast routing proto- col.” Internet RFC 1075, November 1988. [6] J. Moy, “Multicast extensions to OSPF.” Internet RFC 1584, March 1994. [7] S. Deering, “Host extensions for IP multicasting.” Internet RFC 1112, August 1989. [8] D. W. Wall, Mechanisms for Broadcast and Selective Broadcast. PhD thesis, Stanford University, June 1980. [9] Q. Zhu, M. Parsa, and J. J. Garcia-Luna—Aceves, “A source-based algorithm for delay- constrained minimum-cost multicasting,” in Proceedings of the IEEE INFOCOM ’95, pp. 377-385, 1995. [10] F. Bauer and A. Varma, “Degree-constrained multicasting in point-tO-point networks,” in Proceedings of the IEEE INFOCOM ’95, pp. 369—376, 1995. [11] J. Moy, “OSPF version 2.” Internet RFC 1583, March 1994. [12] J. M. McQuillan, I. Richer, and E. C. Rosen, “The new routing algorithm for the ARPANET,” IEEE Transactions on Communications, pp. 711—719, May 1980. [13] ATM Forum, “Private network-network interface specification version 1.0.” ATM FO- rum technical specification af-pnni-0055.0000, March 1996. [14] W. J. Clark, “Multipoint multimedia conferencing,” IEEE Communications Magazine, May 1992. 200 [15] [10] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [301 201 S. R. Ahuja and J. R. Esnor, “Co-ordination and control of multimedia conferencing,” IEEE Communications Magazine, May 1992. J. Udell, “Computer telephony,” Byte, vol. 19, no. 07, pp. 80—99, 1994. J. Oikarinen and D. Reed, “Internet relay chat protocol.” Internet RFC 1459, May 1993. W. Reinhard, J. Schweitzer, G. Vlksen, and M. Weber, “CSCW tools: Concepts and architectures,” IEEE- Computer, May 1994. M. Harrick, P. V. Rangan, and M. 
Chen, “System support for computer mediated multimedia collaborations,” in Proceedings of the 1992 ACM Conference on Computer Supported Cooperative Work ( 050 W ’92), pp. 203—209, November 1992. J. M. Pullen, M. Myjak, and C. Bouwens, “Limitations of Internet protocol suite for distributed simulation in the large multicast environment.” Internet draft draft-pullen— lame-00.txt, September 1996. H. W. Holbrook, S. K. Singhal, and D. R. Cheriton, “Log-based receiver-reliable mul- ticast for distributed interactive simulation,” in Proceedings of SI G COMM ’95, (Cam- bridge, MA USA), pp. 328—341, 1995. M. Satyanarayanan, J. J. Kistler, P. Kumar, M. E. Okasaki, E. H. Siegel, and D. C. Steere, “Coda: A highly available file system for a distributed workstation environ- ment,” IEEE Transactions on Computers, vol. 39, April 1990. J. Postel and J. Reynolds, “File transfer protocol (FTP).” Internet RFC 959, October 1985. T. Berners-Lee, “Hypertext transfer protocol (HTTP).” available at ftp: / / info.cern.ch/ pub/ www/ doc/ http-spec.txt.Z, November 1993. T. Berners-Lee and D. Connolly, “Hypertext markup language 2.0.” Internet RFC 1866, November 1995. J. R. Cooperstock and S. Kotsopoulos, “Why use a fishing line when you have a net? an adaptive multicast data distribution protocol,” in Proceedings of USENIX Technical Conference ’96, 1996. S. Floyd, V. Jacobson, S. McCanne, C.-G. Liu, and L. Zhang, “A reliable multicast framework for light-weight sessions and applications level framing,” in Proceedings of SIGCOMM ’95, (Cambridge, MA USA), pp. 342-356, 1995. M. Hofmann, T. Braun, and G. Carle, “Multicast communication in large scale net- works,” in Proceedings of Third IEEE Workshop on High Performance Communication Subsystems (HPCS), (Mystic, Connecticut USA), August 1995. J. C. Lin and S. Paul, “RMTP: A reliable multicast transport protocol,” in Proceedings of IEEE INFOCOM ’96, March 1996. ATM Forum, ATM User-Network Interface (UNI) Specification Version 3.1. 
Prentice Hall, September 1994. 202 [31] P. Winter, “Steiner problem in networks: a survey,” Networks, pp. 129—167, 1987. [32] A. J. Ballardie, A New Approach to Multicast Communication in a Data- gram Internetwork. Ph.D. thesis, Department of Computer Science, Uni- versity College London, May 1995. Available via anonymous ftp from cs.uc1.ac.uk:darpa/IDMR/ballardie-thesis.ps.Z. [33] D. Estrin, D. Farinacci, A. Helmy, D. Thaler, S. Deering, M. Handley, V. Jacobson, C. Liu, P. Sharma, and L. Wei, “Protocol independent multicast sparse mode (PIM- SM): Protocol specifications.” Internet RFC 2117, June 1997. [34] M. Imase and B. M. Waxman, “Dynamic Steiner tree problem,” SIAM Journal on Discrete Mathematics, vol. 4, pp. 369—384, August 1991. [35] B. M. Waxman, “Performance evaluation of multipoint routing algorithms,” in Pro- ceedings of INFOCOM’ 93, 1993. [36] A. Thyagarajan and S. Deering, “Hierarchical distancewector multicast routing for the Mbone,” in Proceedings of ACM SIG COMM, (Cambridge, Massachusetts), August 1995. [37] A. Ballardie, “Core based trees (CBT) multicast routing architecture.” Internet RFC 2189, September 1997. [38] C. Shields and J. J. Garcia-Luna-Aceves, “The ordered core based tree protocol,” in Proceedings of IEEE INF OCOMM, (Kobe, Japan), April 1997. [39] S. Kumar, P. Radoslavov, D. Thaler, C. Alaettinoglu, D. Estrin, and M. Handley, “The MASC/BGMP architecture for inter-domain multicast routing,” to appear in Proceedings of ACM SIGCOMM, (Vancouver, Canada) August, 1998. [40] Fore Systems, Inc., ForeRunner SBA-200 ATM SBus Adapter User Manual, 1993. [41] D. Dykeman, H. L. Truong, and H. J. Sandick, “Alternatives for the support of the ATM group services.” ATM Forum internal contribution 95-0438, April 1995. [42] F. Liaw, “A straw man proposal for ATM group multicast routing and signaling pro- tocol: Architecture overview.” ATM Forum internal contribution 94-0995, November 1994. [43] R. 
Perlman, “Fault-tolerant broadcast of routing information,” in Proceedings of IEEE Infocom ’83, (San Diego), 1983. [44] D. E. Corporation, “Information processing systems — data communications — interme- diate system to intermediate system intra- domain routing protocol,” October 1987. Also available as Internet RFC 1142. [45] D. Bertsekas and R. Gallager, Data Networks. Prentice—Hall, 1987. [46] K. L. Calvert, E. W. Zegura, and M. J. Donahoo, “Core selection methods for multicast routing,” in Proceedings of IEEE I CCCN ’95, (Las Vegas, Nevada), 1995. 203 [47] E. Fleury, Y. Huang, and P. K. McKinley, “On the performance and feasibility of mul- ticast core selection heuristics,” Tech. Rep. MSU-CPS-97—42, Department of Computer Science, Michigan State University, East Lansing, Michigan, October 1997. [48] H. D. Schwetman, “CSIM: A C-based, process-oriented simulation language,” Tech. Rep. PP-080-85, Microelectronics and Computer Technology Corporation, 1985. [49] FORE Systems, Inc., SPANS NNI: Simple Protocol for ATM Network Signaling ( N etwork- to-N etwork Interface) Release 3. 0, 1993. available at ftp: / / ftp.fore.com / pub / docs / spans / spans3nni.ps. [50] T. von Eicken, A. Basu, V. Buch, and W. Vogels, “U-Net: A user-level network in- terface for parallel and distributed computing,” in Proc. of the 15th ACM Symposium on Operating Systems Principles, (Copper Mountain, Colorado), pp. 40—53, December 1995. [51] D. Johnson, D. Lilja, and J. Riedl, “A circulating active barrier synchronization mech- anism,” in Proceedings of the 1995 International Conference on Parallel Processing, vol. I, pp. 202—209, August 1995. [52] N. Fredrickson and N. Lynch, “Electing a leader in a synchronous ring,” Journal of the ACM, vol. 34, pp. 98—115, January 1987. [53] L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, R. K. Budhia, and C. A. Lingley- Papadopoulos, “Totem: A fault-tolerant multicast group communication system,” Communications of the ACM, vol. 39, no. 4, 1996. [54] I. 
Cidon, T. Hsiao, A. Khamisy, A. Parekh, R. Rom, and M. Sidi, “The OpeNet architecture,” Tech. Rep. 95-37, Sun Microsystems, December 1995. [55] I. Cidon, A. Gupta, T. Hsiao, A. Khamisy, A. Parekh, R. Rom, and M. Sidi, “OPENET: An Open and efficient control platform for ATM networks,” in Proc. INFOCOM ’98, (San Francisco, CA), pp. 824-831, March 1998. [56] M. R. Garey and D. S. Johnson, Computers and Intractability, A Guide to the Theory of NP-Completeness. 41 Madison Avenue, New York, NY. 10010: W. H. Freeman and Company, 1979. [57] E. W. Zegura, K. L. Calvert, and S. Bhattacharjee, “How to model an internetwork,” in Proceedings of IEEE INFOCOM ‘96, (San Francisco, California), March 1996. [58] E. Crawley, R. Nair, B. Rajagopalan, and H. Sandick, “A framework for QoS-based routing in the Internet.” Internet draft draft-ietf-qosr-framework-OO.txt, March 1996. [59] F. Bauer and A. Varma, “ARIES: A rearrangeable inexpensive edge-based on-line Steiner algorithm,” in Proceedings of IEEE Infocom ’96, (San Francisco, California), pp. 361-368, March 1996. [60] B. M. Waxman, “Routing of multipoint connections,” IEEE Journal of Selected Areas in Communications, vol. 6, no. 9, pp. 1617-1622, 1988. [61] L. Lamport, “Time, clocks, and the ordering of events in a distributed system,” Com- munications of the ACM, vol. 21, pp. 558—565, July 1978. [62] [63] [64] [05] [66] [07] [08] [09] [70] [71] [72] [73] [74] [75] [70] 204 H. D. Schwetman, “CSIM: A C-based, process-oriented simulation language,” Tech. Rep. PP-080-85, Microelectronics and Computer Technology Corporation, 1985. D. Menasce, R. Muntz, and J. Popek, “A locking protocol for resource coordination in distributed databases,” ACM TODS, pp. 103—138, 1980. K. Birman, “Implementing fault tolerant distributed objects,” IEEE Transaction on Software Engineering, pp. 502—508, 1985. H. Garcia-Molina, “Elections in a distributed computing system,” IEEE Trans. on Computers, vol. 31, pp. 48—59, January 1982. S. Singh and J. 
Kurose, “Electing ‘good’ leaders,” Journal of Parallel and Distributed Computing, vol. 21, pp. 184—201, May 1994. G. Armitage, “Support for multicast over UNI 3.0/3.1 based ATM networks.” Internet RFC 2022, November 1996. M. Laubach, “Classical IP and ARP over ATM.” Internet RFC 1577, January 1994. M. Handley and V. Jacobson, “SDP: Session description protocol.” Internet draft draft- ietf-mmusic-sdp-03.txt, March 1997. M. J. Donahoo and E. W. Zegura, “Core migration for dynamic multicast routing,” in Proceedings of IEEE ICCCN ’96, (Rockville, Maryland), October 1996. D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Shauser, E. Santos, R. Subramonian, and T. von Eicken, “LogP: Towards a realistic model of parallel computation,” in Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), (San Diego, California), pp. 1—12, Association for Computing Machinery, May 1993. A. Gopal, I. Gopal, and S. Kutten, “Broadcast in fast networks,” in Proc. INFO- COM ’90, (San Francisco, CA), June 1990. E. Basturk and P. Stirpe, “A hybrid spanning tree for efficient topology distribution in PNNI,” Tech. Rep. Research Report RC 20922, IBM Research Division, July 1997. B. Rajagopalan, “Efficient link state routing.” NEC Technical Report, 1997. I. Cidon, I. Gopal, M. Kaplan, and S. Kutten, “A distributed control architecture of high-speed networks,” IEEE Trans. Commun., vol. 43, no. 5, pp. 1950—1960, 1995. E. C. Rosen, “Vulnerabilities of network control protocols: An example,” in SI G COMM Computer Communications Review, pp. 10—16, July 1981. (also published as RFC 789). [77] Y. Huang and P. K. McKinley, “Group leader election under link-state routing,” in Pro- ceedings of International Conference on Network Protocols, 1997., (Atlanta, Geogia), October 1997.