GROUP COMMUNICATION UNDER LINK-STATE ROUTING

By

Yih Huang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

March 18, 1998

ABSTRACT

GROUP COMMUNICATION UNDER LINK-STATE ROUTING

By Yih Huang

Multiparty communication, also termed group communication, is a generalization of traditional point-to-point communication in which more than two parties can participate in a “conversation.” Many current and emerging communication applications, such as teleconferencing, computer-supported cooperative work, and distributed interactive simulation, typically involve several, or a large number of, participants, and require efficient network support for multiparty communication. Link-state routing (LSR) is a type of network routing method that makes complete network status information available throughout the network. Because LSR has been adopted by both the Internet, the de facto standard for data communications, and asynchronous transfer mode (ATM), an international standard for telecommunications, its importance in communication cannot be overstated. In this research, we investigate and exploit the relationship between LSR and group communication. Specifically, we develop a collection of novel and efficient protocols that (1) use group communication methods to improve the performance of LSR operation, (2) take advantage of LSR to provide new network services to group communication applications, and (3) benefit both LSR and group-based applications. Our contributions can be summarized in the following four areas.
First, we identify an important aspect of LSR operation that can benefit from group communication methods: the broadcast of network status information, also known as the flooding operation. We propose a novel flooding approach for use in ATM networks, termed switch-aided flooding (SAF), that takes advantage of underlying ATM hardware functionality. The SAF method is shown, through both theoretical analysis and simulation study, to be much more efficient than previous methods. Second, we address a requirement raised by the diversity of multiparty communication applications: the need to support different types of multipoint connections (MCs), the network entities that define the routing of traffic streams among the participants in multiparty conversations. We develop a generic MC (GMC) protocol that is able to accommodate multiple topology types and computation algorithms as plug-in components. We show that such a “chassis” for MC protocols can operate efficiently under LSR. Third, we investigate an issue involved in both LSR and group communication: the leader election problem. We define the problem of “network-level” leader election, where the participants of an election are network switching elements rather than hosts, and we develop an LSR-based solution to the problem, called the Network-level Leader Election (NLE) protocol. The NLE protocol is formally proven to be robust; it handles not only leader failures, but also much more disastrous situations, such as network partitioning. We apply the NLE protocol to the problem of managing traffic transit centers, or core nodes, for multicast groups.
Our proposed solution, called the LSR-based Core Management (LCM) protocol, automatically selects the core node for a multicast group when the group is created, supports core migration to improve multicast performance during the lifetime of the group, handles the failures of both multicast cores and the core management server itself, and survives network partitioning scenarios. Lastly, we turn again to the operation and performance of LSR itself. Traditionally, LSR uses two costly techniques to achieve its robustness and responsiveness: message forwarding on every communication link in the flooding of network status updates, and the periodic flooding of local status by each router. We conclude this research by combining two techniques developed earlier, namely the election of a leader and the construction of multipoint connections, to develop a fundamentally different approach to LSR. The resulting Tree-based LSR (T-LSR) protocol imposes only a small fraction of the overhead of previous LSR methods, while guaranteeing to maintain consistent routing decisions throughout the network under any combination of network component failures, partitioning scenarios, and undetected communication transmission errors. Unlike the ATM-oriented SAF protocols, the T-LSR protocol is designed for use in general-purpose, LSR-based networking environments and requires no special hardware support. In summary, this research reveals a mutually beneficial relationship between group communication and LSR: many aspects of group communication (such as the construction of communication channels, the management of membership, and the consensus on leadership) can take advantage of the internal operation of LSR, while the performance of LSR itself can be improved by incorporating various group communication mechanisms.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
1 Introduction
2 Background
2.1 Multiparty Communication Applications
2.1.1 Human-to-Human Interaction
2.1.2 Distributed Interactive Simulation
2.1.3 Distributed Information Management
2.1.4 Information Distribution
2.2 Multicast Communication
2.2.1 Multicast Routing Topologies
2.2.2 Local Membership Management
2.2.3 Multicast in the Internet
2.2.4 Multicast in ATM Networks
2.2.5 Discussion
2.3 Overview of Link State Routing
2.3.1 Basic Operation
2.3.2 Fault Tolerance Issues
2.3.3 Hierarchical LSR
2.4 Discussion
3 Switch-Aided Flooding
3.1 Motivation
3.2 The Spanning MC Protocol
3.3 The SAF Protocols
3.3.1 Basic SAF Protocol
3.3.2 Bandwidth-Efficient SAF Protocol
3.4 Performance Evaluation
3.5 Summary
4 Optimal SAF Operations
4.1 Motivation
4.2 ER SAF Protocol Design
4.2.1 Basic Concept
4.2.2 Operation Modes
4.3 Algorithms
4.4 The Virtual Ring
4.5 Performance Evaluation
4.6 Summary
5 A Generic Method of MC Construction
5.1 Motivation
5.2 LSR-Based Multipoint Connections
5.3 The GMC Protocol
5.3.1 Design Issues
5.3.2 Protocol Overview
5.3.3 GMC LSA Format
5.3.4 Data Structures and Protocol States
5.3.5 Protocol Algorithms
5.3.6 MC Creation and Destruction
5.4 Proof of Correctness
5.4.1 Correctness without Memory Overflows
5.4.2 The Handling of Memory Overflows
5.5 Performance Evaluation
5.5.1 Simulation Methodology
5.5.2 Group Creation Periods
5.5.3 Normal Operations
5.5.4 Comparison with the MOSPF Protocol
5.6 Summary
6 Group Leader Election under Link-State Routing
6.1 Introduction
6.2 The NLE Protocol
6.2.1 Overview
6.2.2 State Machines and Events
6.2.3 The Operation of LCM
6.2.4 The Operation of MSM
6.3 Proof of Correctness
6.4 Performance Evaluation
6.5 Other Potential Uses of the NLE Protocol
6.5.1 Multicast Address Resolution
6.5.2 Multicast Core Management
6.5.3 Performance of Multicast Group Creation
6.6 Summary
7 Multicast Core Management
7.1 Introduction
7.2 The LCM Protocol
7.3 Performance Evaluation
7.4 Summary
8 Tree-Based Link State Routing
8.1 Motivation
8.2 Overview
8.3 Algorithms
8.4 Proof of Correctness
8.5 Performance Evaluation
8.6 Summary
9 Conclusions and Future Work

LIST OF FIGURES

2.1 Three types of MC topologies.
2.2 The operation of the DVMRP.
2.3 An example of member join operation in the CBT protocol.
2.4 Comparison of multicast forwarding in the CBT protocol and SSTs.
2.5 The operation of the MOSPF protocol.
2.6 Shared trees constructed by the PIM protocol.
2.7 The result of topology transition for the sender s3.
2.8 VC operation in ATM networks.
2.9 Operation of the ACBT protocol.
2.10 Problem in correctly identifying node failure.
2.11 An example of the flooding operation.
2.12 The handling of network partitioning in LSR.
2.13 A network topology.
2.14 Breaking up the network into routing domains.
2.15 The image of the domain A.4.
2.16 The simplified/high-level network image.
3.1 Examples of multipoint connections.
3.2 An example MC built by the CBT protocol.
3.3 An example of the SMC protocol.
3.4 The handling of event LSAs.
3.5 The ReachCore module.
3.6 The processing of the reach-core request message.
3.7 An example of the Basic SAF protocol.
3.8 The Basic SAF protocol with a broken SMC.
3.9 An example of the BE SAF protocol.
3.10 The BE SAF protocol with a broken SMC.
3.11 The sender algorithm of the BE SAF protocol.
3.12 The receive-LSA routine in the BE SAF protocol.
3.13 The receive-dummy routine in the BE SAF protocol.
3.14 The timeout handler in the BE SAF protocol.
3.15 Comparisons of flooding alternatives with a correctly functioning SMC.
3.16 Comparisons of flooding alternatives with partitioned SMC.
3.17 Comparisons of flooding alternatives when SMC does not exist.
3.18 Performance of the SMC protocol.
4.1 ER SAF flooding in normal cases.
4.2 A hypothetical scenario where LSA retransmissions over R degenerate into a bidirectional store-and-forward process.
4.3 The sender algorithm of the ER SAF protocol.
4.4 The ReceiveLSA routine in the ER SAF protocol.
4.5 The ReceiveACK routine in the ER SAF protocol.
4.6 The timeout handler in the ER SAF protocol.
4.7 Comparisons of flooding alternatives with an operational SMC and virtual ring.
4.8 Comparisons of flooding alternatives in the performance of flooding link-down events.
4.9 The average/worst case time to build a virtual ring.
5.1 Example MC showing member switches and attached hosts.
5.2 Problem created by inconsistent topology proposals.
5.3 The topology ordering problem.
5.4 A network/MC configuration.
5.5 Events and advertisements in the GMC protocol.
5.6 The state-transition diagram of the GMC protocol.
5.7 The algorithm for EventHandler.
5.8 The algorithm for AcceptTopology.
5.9 The algorithm for ReceiveLSA.
5.10 The algorithm for TCTimerHandler.
5.11 Performance of the GMC protocol under 1 second arrival interval.
5.12 Performance of the GMC protocol under 10 seconds arrival interval.
5.13 Performance of the GMC protocol under 30 seconds arrival intervals.
5.14 Performance of the GMC protocol under the 10 minutes arrival interval.
5.15 Performance of the GMC protocol in normal operations.
5.16 Topology computations per event of the MOSPF protocol.
6.1 The finite state machines in NLE.
6.2 The leadership consensus machine at a switch x for a group g (LCM(x,g)).
6.3 The membership status machine at a switch x for a group g (MSM(x,g)).
6.4 Performance of the NLE protocol.
6.5 Bandwidth usage of alternative election protocols.
6.6 Number of bindings generated for group creation.
7.1 Core migration in LCM.
7.2 Queue length at the CBS.
7.3 Core-to-member distances produced by various core selection methods.
8.1 An example of tree-based flooding.
8.2 The flooding of two LSAs in different modes.
8.3 The completion of the T-mode flooding in mode G.
8.4 An example of the incorrect leadership problem.
8.5 The routine that floods router local status.
8.6 Processing incoming LSAs.
8.7 Setting preferred leader.
8.8 The BroadcastCTA routine.
8.9 The processing of incoming CTAs.
8.10 The processing of ballot messages.
8.11 The processing of LEAs.
8.12 Comparison of periodic-flooding overhead.
8.13 Efficiency of CTA broadcast.
8.14 Comparison of event-driven flooding performance.
8.15 Overhead of flooding mode switching.

LIST OF TABLES

3.1 Characteristics of randomly generated graphs.
4.1 Complexities of various flooding protocols.
4.2 Characteristics of randomly generated graphs.
5.1 Characteristics of randomly generated graphs.
6.1 Characteristics of randomly generated graphs.
8.1 Control messages in the T-LSR protocol.
8.2 T-LSR data structures at a router x.

Chapter 1

Introduction

Many modern distributed applications involve multiparty communication, in which two or more participants are involved in a group “conversation.” A distinguishing characteristic of multiparty communication is the requirement for a source party (for example, a person who is currently speaking in a teleconference) to be heard by more than one receiving party (for example, the other participants in the conference). Applications that involve multiparty communication include teleconferencing, computer-supported cooperative work, distributed virtual reality, remote teaching, tele-gaming, replicated file servers, parallel database search, and distributed parallel processing. This thesis concerns efficient network support for various aspects of multiparty communication, or, interchangeably, group communication. Previous prominent works in this direction exist in the form of multicast protocols, especially those proposed for the Internet [1]. A multicast protocol routes communication traffic streams from their sources to multiple destinations, as opposed to exactly one destination, as in conventional point-to-point routing.
Multicast methods supported within the network are generally favored over host-level multicast methods, in which a source typically sends a copy of the message individually to each recipient. The problem with the latter approach is that, when the paths from the source to the destinations share a common link, the message traverses that link multiple times. Network-supported multicast methods avoid this redundancy by having the network replicate the message after its traversal of the common link. Representative IP multicast protocols include PIM [2], CBT [3, 4], DVMRP [5], and MOSPF [6]. An important concept supported and used by such protocols is group addressing, whereby more than one communication party can be referred to as a single entity. For example, IP multicast addresses [7], which are perhaps the most well-known group addressing mechanism, allow a data packet that is tagged with a single destination address to be delivered to all the systems that are “listening” to that address. Most group communication implementations must deal with two issues: the collection and management of group membership information, and the routing of traffic streams to reach group members. Alternative approaches to the former issue range from no network support (that is, no membership management in the network) to maintaining a member list at every network node for every active group. The latter issue concerns the topology computation of multipoint connections (MCs), that is, sets of communication channels that connect group members. Various methods of MC topology computation have been devised by researchers to meet different performance criteria, such as the transmission delay experienced by group members and the total bandwidth consumed by the group [8, 9, 10]. Many multicast protocols can be considered as distributed implementations of one, or a small set of, MC topology computation algorithms.
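To make this trade-off concrete, the sketch below scores two hypothetical MC topologies on the two criteria just mentioned: total bandwidth (the sum of link costs in the tree) and worst-case source-to-member delay. The topologies, node names, and link costs are illustrative only, not drawn from any protocol discussed in this dissertation.

```python
def evaluate_mc(tree, source, members):
    """Score a candidate MC topology on two common criteria.

    `tree` maps each non-root node to (parent, link_cost); `source` is the
    tree root.  Returns (total link cost, worst source-to-member delay).
    """
    total_cost = sum(cost for _, cost in tree.values())

    def delay(node):
        # Walk from a member up to the root, accumulating link costs.
        d = 0
        while node != source:
            parent, cost = tree[node]
            d += cost
            node = parent
        return d

    return total_cost, max(delay(m) for m in members)

# Two hypothetical topologies connecting source S to members M1 and M2.
spt = {"M1": ("S", 2), "M2": ("S", 2)}     # direct shortest paths
chain = {"M1": ("S", 2), "M2": ("M1", 1)}  # M2 reached through M1

print(evaluate_mc(spt, "S", ["M1", "M2"]))    # (4, 2): lower delay
print(evaluate_mc(chain, "S", ["M1", "M2"]))  # (3, 3): lower bandwidth
```

Neither topology dominates the other, which is one reason a single MC computation algorithm cannot serve all applications equally well.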
Some group communication implementations must deal with a third issue, namely group leadership, which arises when a multicast protocol assigns special duties to one group member. The leader of a group may serve as the center for membership management, or as a transit point through which all traffic streams destined to the group must be forwarded. Group leaders can be configured manually or can be selected automatically by the network. Important multicast protocols that introduce such “distinguished” members include PIM [2] and CBT [3, 4]. It should be noted that the use of group communication is not restricted to applications; many aspects of the operation of the network itself involve group communication. An important example is the underlying (unicast) routing protocol, which compiles knowledge of the network for the purpose of making routing decisions. A communication network consists of three major components: hosts, switches (or, synonymously, routers), and communication links. The hosts are computers or other devices that allow users to access the network, while switches relay traffic streams through the network over communication links. When requested to relay a traffic stream toward a given destination, a switch must determine on which of its incident links to send the traffic. To ensure the correctness and quality of this decision, the switch requires knowledge about the rest of the network. One approach to achieving this goal is to disseminate the status and configurations of switches and links throughout the network so that a global picture of the network can be compiled at every switch. As such, the routing protocol uses broadcast operations, a special case of multicast operations in which all network nodes are recipients. In this scenario, the entire network can be considered as a group to which switch status information is sent. Routing in communication networks has been extensively studied in computer science.
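The broadcast-based dissemination just described can be illustrated with a minimal sketch: the originating switch sends its status advertisement to every neighbor, and each switch that sees the advertisement for the first time records it and forwards it on its other incident links. The four-switch topology and node names below are hypothetical, and the sketch is an illustration of the idea rather than any particular protocol's flooding rules.

```python
from collections import deque

def flood(adjacency, origin, lsa):
    """Deliver `lsa` from `origin` to every switch by store-and-forward
    flooding; duplicate copies are recognized and dropped, not re-forwarded."""
    received = {origin: lsa}
    queue = deque((origin, nbr) for nbr in adjacency[origin])
    messages = 0
    while queue:
        sender, node = queue.popleft()
        messages += 1                       # one link traversal
        if node in received:
            continue                        # duplicate: drop
        received[node] = lsa
        # Forward on all incident links except the one the LSA arrived on.
        queue.extend((node, nbr) for nbr in adjacency[node] if nbr != sender)
    return received, messages

# Hypothetical ring of four switches: s1 - s2 - s3 - s4 - s1.
adjacency = {"s1": ["s2", "s4"], "s2": ["s1", "s3"],
             "s3": ["s2", "s4"], "s4": ["s1", "s3"]}
image, msgs = flood(adjacency, "s1", "s1-status-v1")
print(sorted(image), msgs)  # all four switches hold the LSA; 5 messages
```

Note that even in this tiny cycle the LSA crosses some links redundantly, which foreshadows the flooding-overhead concerns addressed later in this dissertation.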
Although not all routing methods use broadcast operations in this manner, a very important one does. Link-state routing (LSR) [11, 12] is an increasingly popular type of unicast routing. An LSR protocol makes complete knowledge of the network available to all switches in the manner described above. The local status of each switch, including the bandwidth available at incident links, buffer capacity, and the workload, is learned by the network via the broadcast, or flooding, of link-state advertisements (LSAs). Based on received advertisements, each switch locally maintains a complete image of the network, which it uses to make routing decisions. The Open Shortest Path First (OSPF) protocol [11], introduced by the Internet community, is one of the most well-known LSR unicast protocols. LSR has also been adopted as the routing method for Asynchronous Transfer Mode (ATM), a telecommunications standard that bases all communication on connection-oriented hardware switching of small, fixed-size cells [13]. This dissertation addresses the interaction between group communication and LSR. (We do not distinguish between the terms switch and router here, although one may be preferred over the other in certain contexts of discussion.) Our interest in this problem stems from the following three observations. First, since LSR involves the maintenance of a complete image of the network at every switch, LSR-based networks might use this information to support a wide range of group communication algorithms. Locally available network images at switches may also help reduce the overhead incurred by distributed implementations of these algorithms. The second major advantage of LSR is its fault tolerance. Because every link is monitored by its incident switches, and every switch is monitored by neighboring switches, malfunctioning components and congested areas are made known to all functioning switches promptly.
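A minimal sketch of this monitoring behavior, assuming a hello-style heartbeat between neighbors: a switch declares a neighbor (and the shared link) down when no hello message has arrived within a dead interval, after which it would flood an advertisement announcing the change. The function and its 40-second default are illustrative; the value echoes OSPF's default RouterDeadInterval but is not specific to the protocols in this dissertation.

```python
def detect_failures(now, last_hello, dead_interval=40.0):
    """Return the neighbors whose most recent hello is older than
    dead_interval seconds; each entry would trigger a link-down LSA flood."""
    return [nbr for nbr, t in last_hello.items() if now - t > dead_interval]

# s2 was heard 5 s ago (alive); s3 has been silent for 50 s (declared down).
print(detect_failures(100.0, {"s2": 95.0, "s3": 50.0}))  # ['s3']
```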
Even the earliest LSR protocols were able to survive disastrous situations, such as network partitioning [12]. Building group communication facilities upon such a solid foundation has clear implications with respect to robustness. Third, because some important parts of LSR operations exhibit characteristics of group communication, methods targeted at general-purpose group communication may help, or be tailored to help, the LSR protocol itself. As we will demonstrate in later chapters, an efficient group communication method can be used to accelerate the flooding of switch status information, and leader election plays an important role in large-scale, LSR-based networks that are organized in a hierarchical manner. Considering that LSR is used in the infrastructure of many modern networks, improving its performance will benefit not only multiparty communication applications, but all applications that use such networks. In this dissertation, we model various aspects of group communication as the consensus problem under LSR, which is defined as follows. Due to delays in receiving an event advertisement, switches in an LSR-based network can have different views of the network for a short period of time. The situation is exacerbated when multiple events are advertised simultaneously. Furthermore, a network can be temporarily partitioned due to malfunctioning components, and the resulting subnetworks may evolve independently. The consensus problem under LSR is to guarantee that, given any combination of status changes, component failures, and transmission errors in advertisements, all switches will eventually produce identical images of the network, provided that the network is not permanently partitioned. This definition can be generalized to incorporate group management information, if network images are extended to include such information.
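The convergence requirement can be illustrated with the “newest advertisement wins” rule that link-state protocols apply to sequence-numbered LSAs. The sketch below is a simplification (real protocols also age out advertisements and handle sequence-number wrap-around), but it shows the key property: because the acceptance rule is deterministic, switches that eventually receive the same advertisements build identical network images regardless of arrival order or duplication.

```python
import random

class LinkStateDB:
    """Per-switch network image: for each originating switch, keep only
    the advertisement carrying the highest sequence number."""

    def __init__(self):
        self.image = {}  # origin -> (sequence number, advertised status)

    def apply(self, origin, seq, status):
        current = self.image.get(origin)
        if current is None or seq > current[0]:
            self.image[origin] = (seq, status)
            return True   # new information: a real switch would re-flood it
        return False      # stale or duplicate: drop

lsas = [("s1", 1, "link up"), ("s1", 2, "link down"), ("s2", 7, "link up")]
a, b = LinkStateDB(), LinkStateDB()
for lsa in lsas:                            # one switch: in-order delivery
    a.apply(*lsa)
for lsa in random.sample(lsas, len(lsas)):  # another: shuffled, same LSAs
    b.apply(*lsa)
print(a.image == b.image)  # True: identical images either way
```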
Thesis Statement: By modeling various aspects of group communication (such as leadership, membership maintenance, and communication channel construction) as consensus problems under LSR, we develop novel and efficient solutions for many important issues of network group communication, including fault-tolerant leadership-consensus management, the support of multiple types of multicast communication channels, the handling of disastrous situations, such as network partitioning, and the improvement of LSR itself. The major contributions of this work can be summarized as follows. 1. Switch-aided flooding (SAF). This flooding method takes advantage of ATM hardware cell relaying and duplication to improve the performance of flooding operations in ATM networks. We first develop two SAF protocols, called the Basic SAF and Bandwidth-Efficient (BE) SAF protocols, that construct a hardware-based data-distribution tree to accelerate the dissemination of (network-status) information. To further improve efficiency, we develop a third SAF protocol that uses a ring topology to handle acknowledgments efficiently. The complexity of this Efficient and Reliable (ER) SAF protocol is shown to be optimal in terms of bandwidth consumption, workload at switches, and flooding delay. Improving the performance and efficiency of flooding operations can be very important to the responsiveness of the network in meeting diverse application needs. 2. Generic multipoint connection (GMC) protocol. The GMC protocol is based on LSR and can be considered as an MC protocol “chassis,” that is, a framework that is able to accommodate multiple existing, and future, MC topology algorithms. Such an MC protocol is expected to benefit a wide variety of multiparty communication applications that favor different performance criteria.
For example, a live multimedia broadcast could use an MC topology that minimizes the transmission delays from a single source to a large number of destinations, while a distributed interactive simulation application may prefer an MC topology that can efficiently accommodate a large number of participants, each of which is both a sender and a receiver. 3. Network-level Leader Election (NLE) protocol. The NLE protocol establishes consistent group-leader bindings at network switches, maintains up-to-date member lists at leaders, and handles network partitioning properly. Specifically, given a group g and a set of network segments S1, S2, ..., Sk, k ≥ 1, within each segment Si there will be consensus on a leader for g, and that leader will be an operational switch in Si that maintains a member list of g containing those and only those members in Si. The NLE protocol, which is based on LSR, can be used to select traffic transit centers, or core nodes, for individual multicast groups and to support hierarchical routing and address mapping. In addition, we apply the NLE in the design of the LSR-based Core Management (LCM) protocol. Rather than conducting leader election on a per-group basis, the LCM protocol uses the NLE protocol to select a switch as the core management server, which in turn manages core nodes for all the active groups in the network. Specifically, the LCM protocol automatically selects the core node for a multicast group when the group is created, supports core migration to improve multicast performance during the lifetime of the group, handles the failures of both multicast cores and the core management server itself, and survives network partitioning scenarios. 4. Tree-based LSR (T-LSR). Traditionally, LSR uses two costly techniques to achieve its robustness and responsiveness: message forwarding on every communication link in the flooding of network status updates, and the periodic flooding of local status by each router.
We conclude this research by combining two techniques developed earlier, namely the election of a leader and the construction of MCs, to develop a fundamentally different approach to LSR. The resultant T-LSR protocol imposes only a small fraction of the overhead of previous LSR methods, and guarantees to maintain consistent routing decisions throughout the network under any combination of network component failures, partitioning scenarios, and undetected communication transmission errors. Unlike the SAF work, the T-LSR protocol is designed for use in general-purpose, LSR-based networking platforms, assuming no hardware-based capabilities at switches. The remainder of this dissertation is organized as follows. In Chapter 2, we present background material relevant to this work, including a discussion of the semantics of group communication as perceived by different types of applications, a survey of important multicast protocols, and a survey of link-state routing. We present the SAF protocols in Chapters 3 and 4. Subsequently, we shift our attention to the support of group communication by LSR. The GMC protocol is described in Chapter 5. Chapters 6 and 7, respectively, describe the NLE protocol and its use in the LCM protocol. The T-LSR protocol is presented in Chapter 8. Conclusions and possible future directions are discussed in Chapter 9.

Chapter 2

Background

Advances in communication technology have been dramatic in the last two decades. The Internet, which started out as an experimental project connecting a small number of military sites and universities, has reached all the continents of the Earth. The Internet is no longer a playground for a small group of researchers and academicians, but has become a part of everyday life for millions of people in all kinds of professions.
In the meantime, long-established communication infrastructures, such as telephone and cable television networks, are being transformed into modern information superhighways, and are expected to provide a wide spectrum of new services (such as video on demand, multimedia telephony, data communication, tele-gaming, information retrieval, and so forth) directly to individual homes. Moreover, advances in communication technology are not limited to higher bit rates and lower loss rates; they also include unconventional ways of using communication channels. One possibility, which is actively being investigated by many researchers and developers, is to support multiparty communication, whereby more than two communication parties can conduct “conversations.” In this chapter, we discuss important multiparty communication applications, existing multicast protocols that support those applications, and link-state routing, the type of network routing upon which the proposed methods are based.

2.1 Multiparty Communication Applications

The term multiparty communication, or interchangeably group communication, refers to a wide spectrum of communication applications, including human-to-human interaction, distributed interactive simulation, distributed information management, and efficient information distribution. Naturally, such diverse applications have different needs and expectations regarding services provided by the underlying network. Although this dissertation largely concentrates on core network support for group communication, including multicast operations, membership management, and leadership consensus, in this section we examine the applications and services that may be implemented atop such network services. Our objective is to assess and classify the requirements of such applications.

2.1.1 Human-to-Human Interaction

This class of applications brings together individuals for whom it is either difficult or costly to meet face to face (for example, due to their locations), but who must work cooperatively. An example is videoconferencing, which allows participants to visually and verbally communicate with others over a network [14, 15]. A special type of teleconferencing, called computer telephony, uses computers and data communication networks, rather than public telephone networks, for transmitting audio in real time [16]. Teleconferencing does not necessarily use multimedia; text-based teleconferencing sessions, sometimes called chat rooms, have become popular on the Internet [17]. In addition, Computer-Supported Cooperative Workspace (CSCW) applications enable workers who possess different areas of expertise, and who are geographically separated, to remotely and cooperatively conduct difficult operations or manipulate sophisticated equipment [18, 19]. An interesting characteristic of many human interaction applications is their relatively loose requirements on multicast reliability. Typically, these applications can tolerate occasional loss of multicast data at some destinations, since it is human beings, rather than machines, that receive and interpret incoming messages. Occasional losses of characters in a text-based teleconference, for example, may be perceived as typos rather than transmission errors. When multimedia is used, some loss of image pixels or audio/video frames may produce flares or jumps in playback, but the conversation can continue as long as the degradation is not too severe. On the other hand, delays and jitter in message delivery may be annoying; imagine trying to conduct a conversation if one’s voice is not heard by others until 30 seconds later.
Therefore, many applications in this category use best-effort multicast, a type of multicast that does not enforce the successful delivery of multicast data at all destinations. When possible, such applications might reserve network resources in advance in order to improve the Quality of Service (QoS) provided by the network.

2.1.2 Distributed Interactive Simulation

In a DIS application, a virtual environment (VE) is simulated collectively by a set of hosts over a network [20]; examples include a virtual battlefield, a virtual shopping center, and so forth. The interest in DIS originated in the military; a military training session conducted in a virtual battlefield is much less expensive and, more importantly, much safer than a real exercise. Civilian uses of such technology include simulation of police and fire department exercises, as well as the playing of multiparty games across the Internet. In such VEs, some objects are static, such as trees and lakes in a virtual park, whereas other objects are active: they move voluntarily or react to stimuli (people in the virtual park). Some objects may be computer simulated (for example, enemy tanks in a virtual battlefield), while others are controlled by users (for example, tanks controlled by trainees). In general, VE objects must sense and interact with each other in real time. For this purpose, information regarding the current positions, movements, and actions of objects must be disseminated to all participating hosts in a timely manner. Network-supported multicast operations and other group communication facilities can be used to improve performance.

DIS applications are often characterized by their scale; the number of participants in a VE can range from a few to thousands, and the underlying network can range from LANs to WANs. The size and the geographic distribution of the participant population raise scalability concerns regarding the underlying group communication support.
Moreover, DIS applications call for a special type of reliability, called selective reliable multicast [20, 21]. Consider a situation where user X is engaged in a virtual battlefield and unfortunately loses track of his opponent, user Y, due to the loss of a sequence of three messages that broadcast the positions of Y. While the conventional semantics of reliability would force the host of X to request retransmissions of all three messages, X is interested only in the most recent position of Y. A selective multicast protocol ensures the “freshness” of object states maintained at participating hosts, and does not insist on the successful delivery of all state update messages [21].

2.1.3 Distributed Information Management

Single-server solutions have traditionally dominated the area of information management, including the management of file systems and databases. However, for reasons of scalability and fault tolerance, distributed solutions have been proposed and are gaining momentum. For example, the Coda file system [22] allows a file system to be replicated at more than one file server. A client of such a file system can retrieve files from the nearest server, but must submit file updates to all servers. Further, servers may fail, and backup systems may join the service. If the client-server communication in such circumstances is modeled as a group communication problem, clients perceive the servers as a single network entity, the server group, and need not be concerned with server membership dynamics. Similar methods can be applied to database services, using replicated database servers either for fault tolerance or to improve the performance of query processing through parallel searching.

Many applications in this category demand atomic multicast operations, whereby either all destinations of a multicast message receive the message, or none of them receives it.
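The all-or-nothing guarantee can be sketched as a two-phase exchange over in-memory replicas. This is only an illustrative simplification of atomic multicast semantics; the class and function names are hypothetical, not part of any particular protocol.

```python
# Sketch of all-or-nothing (atomic) multicast to a group of replicas,
# simulated in memory. A two-phase exchange is one common realization.

class Replica:
    def __init__(self, name, fail_prepare=False):
        self.name = name
        self.fail_prepare = fail_prepare  # simulate a replica that cannot accept
        self.staged = None
        self.committed = []

    def prepare(self, msg):
        if self.fail_prepare:
            return False          # vote "no"
        self.staged = msg         # stage the update, do not apply it yet
        return True               # vote "yes"

    def commit(self):
        if self.staged is not None:
            self.committed.append(self.staged)
            self.staged = None

    def abort(self):
        self.staged = None        # discard the staged update


def atomic_multicast(msg, replicas):
    """Deliver msg at all replicas or at none of them."""
    votes = [r.prepare(msg) for r in replicas]
    if all(votes):
        for r in replicas:
            r.commit()
        return True
    for r in replicas:
        r.abort()                 # a single "no" vote aborts everywhere
    return False


servers = [Replica("A"), Replica("B"), Replica("C", fail_prepare=True)]
ok = atomic_multicast("update-1", servers)
# Because C could not accept, the update is applied nowhere,
# leaving all servers consistent.
```

A real protocol must also cope with coordinator failure and lost messages, which this sketch deliberately omits.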
Consider a scenario where a file update request is sent to a group of replicated file servers; an atomic multicast protocol guarantees that either the file is updated at all servers, or at none of them. Although the latter case could be considered a failed multicast operation, at least it leaves the servers in a consistent state.

2.1.4 Information Distribution

This category refers to applications that disseminate information to a large population. A defining characteristic of such applications is the existence of a single, or a small set of, information sources and a potentially unlimited audience size. For example, in a remote teaching application, the lecturer in a virtual classroom can reach a large number of pupils at remote locations. Some existing information distribution processes can also be re-examined in light of new network technology. For example, the traditional process of distributing public-domain software works as follows: the distributor sets up an FTP (File Transfer Protocol) [23] site, and interested users individually connect to the site to download a copy. Download requests for popular software may put a heavy load on the FTP server, which repeatedly performs identical tasks: retrieving the software from a local storage medium and shipping it. (It is not uncommon for servers to be brought down by these workloads.) Recently, the HTTP (Hyper-Text Transfer Protocol) [24] and the World Wide Web [25] have largely replaced the FTP protocol in this distribution process, but the problem remains. In fact, the situation has become worse due to the more user-friendly interfaces and, hence, a larger number of interested users. A much more efficient approach is to have the distributor (also known as the publisher) set up a communication group such that group members, or subscribers, simultaneously receive a copy via multicast.
File distribution protocols, a type of multicast protocol designed for this purpose, have been proposed for use in the Internet [26]. File distribution protocols must use reliable multicast to ensure the receipt of all multicast data at all destinations. Examples of reliable multicast transport protocols can be found in [27, 28, 29].

2.2 Multicast Communication

Multicast operations, which deliver messages to more than one destination, are central to the support of multiparty communication applications. The voice and image of a teleconference member must reach all other members. The movements of objects and the status changes of terrain in a VE must be disseminated to all hosts participating in a DIS session. File update requests must be submitted to all servers. And so on. Indeed, one may argue that the use of multicast is the defining characteristic of group communication. A multicast protocol is a network protocol that defines a set of rules and conventions by which multicast traffic streams are routed from sources to a set of destinations. This section reviews existing multicast solutions developed for two important types of networks, the Internet and ATM networks. We start with a review of routing topologies and membership management techniques.

2.2.1 Multicast Routing Topologies

While many multicast protocols concern simply the construction of individual multicast trees (a set of communication links from a source to a set of destinations), we consider a more general form of multicast routing structure, called a multipoint connection (MC), whereby one or more sources can reach one or more destinations. Three major types of MC topologies have been studied:

1. Source-rooted trees (SRT). The MC topology typically comprises a forest of trees, each individually constructed for a different traffic source. An example in which two trees reach a set of four receivers is shown in Figure 2.1(a).
This type of topology is well suited to applications with a small number of senders and a possibly large number of receivers, such as remote teaching and file distribution applications. SRTs are relatively straightforward to construct and are supported by almost all existing multicast protocols. SRT-based MCs are, however, costly to maintain: a new tree must be constructed for each source, and every existing tree must be extended to reach a new receiver. Similar overheads are incurred for departing senders and receivers. SRTs are supported in the DVMRP protocol [5], the MOSPF protocol [6], and the PIM protocol [2], all designed for use in the Internet. The ATM multicast virtual circuit (multicast VC) [30] also supports SRTs.

2. Symmetric shared trees (SST). A single tree is constructed to span the members of an MC; every member is both a sender and a receiver (as in the case of teleconferencing). Figure 2.1(b) shows an SST spanning five members. The tree in the figure also uses an intermediate node to reach members. An SST-based MC tends to use fewer network resources (in terms of the number of links) than does an SRT-based forest. The problem of determining an optimal shared tree is the well-known minimum Steiner tree problem [31].

3. Receiver-only shared trees (ROST). A single tree spans the receiver members of an MC, while senders use one-to-one unidirectional paths to reach any node on the tree. An example of a ROST with two senders and five receivers is depicted in Figure 2.1(c). The five receivers are connected by a shared tree, depicted with solid lines, and the sender-to-tree paths are represented by dashed lines. This distinction between senders and receivers facilitates membership management on both sides.
For example, a group of replicated file servers can be connected by a ROST such that clients of the server group see a single entity, the server MC; individual servers join and leave the server group without disrupting client-to-server communication. ROSTs are supported by the core-based tree (CBT) multicast protocol [32] and the PIM protocol [2, 33].

(a) two SRTs. (b) an SST. (c) a ROST. Figure 2.1: Three types of MC topologies.

Besides the type of topology, another issue associated with an MC is the topology computation algorithm. Even with a given topology type, different topology computation algorithms can be used, depending on the relative importance of various performance criteria. Such criteria include bounds on transmission delays, network resource consumption, multicast packet loss rate, and so forth. The issue of choosing the right topology algorithm is particularly important to multimedia applications. Such applications typically require quality of service from the network in order to ensure the quality of media playback. Thus their performance relies on good MC topology decisions, so that network components involved in an MC have the capacity and resources to sustain the traffic flowing through the MC. For instance, Zhu [9] presented an algorithm that optimizes cost (for example, bandwidth consumption) in the presence of delay constraints. Bauer [10] examined the multicast tree problem under degree constraints, which may be imposed by hardware switching devices. Waxman [34, 35] addressed the problem of dynamic multicast trees, in which a sequence of membership updates must be carried out one by one.
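One common way to compute an SRT under link-state routing is to run Dijkstra's algorithm over the link-state topology and keep only the links on shortest paths from the source to the receivers. The sketch below assumes a simple adjacency-map graph representation; function names are illustrative, not taken from any protocol.

```python
# A minimal sketch of source-rooted tree (SRT) computation: run Dijkstra
# from the source over the link-state topology, then keep only the links
# on shortest paths from the source to the receivers.

import heapq

def shortest_path_tree(graph, source):
    """graph: {node: {neighbor: cost}}. Returns parent pointers."""
    dist = {source: 0}
    parent = {source: None}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                parent[v] = u
                heapq.heappush(heap, (d + w, v))
    return parent

def srt_links(graph, source, receivers):
    """Links of the SRT that reaches the given receivers."""
    parent = shortest_path_tree(graph, source)
    links = set()
    for r in receivers:
        node = r
        while parent.get(node) is not None:   # walk back toward the source
            links.add((parent[node], node))
            node = parent[node]
    return links

graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 1, "D": 3},
    "C": {"A": 4, "B": 1, "D": 1},
    "D": {"B": 3, "C": 1},
}
links = srt_links(graph, "A", {"D"})
# The tree follows the cheapest route A-B-C-D (total cost 3) rather than
# the heavier direct links A-C or B-D.
```

Delay- or degree-constrained algorithms such as those cited above would replace the plain Dijkstra step with a constrained search, but the tree-extraction step is the same.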
Although this dissertation does not directly address the issue of MC topology computation algorithms, it advocates generic MC protocols that are capable of accommodating a wide range of MC topology types and computation algorithms.

2.2.2 Local Membership Management

In this dissertation, our primary concern is switch/router-level multicast. However, from the viewpoint of applications, communication groups are host groups; members of such groups are computers or other customer devices that allow users to access networks. Typically, a host accesses the network via a router/switch, called the ingress switch of the host, and uses a local membership management protocol to inform its ingress switch/router of a list of groups in which the host wishes to participate. The ingress switch maintains a list of groups, where a group is on the list if one or more attached hosts of the switch are members of this group. A switch that has at least one attached host that is a member of group G will be referred to as a switch member of G; all the switch members of G form a network group. With every switch knowing its membership identity with respect to a group, a multicast protocol, when given a multicast message destined to the group, is responsible for the delivery of the message from the source switch, the ingress switch of the source of the message, to the switch members of the group.

Perhaps the best-known and most widely used local membership management protocol is the Internet Group Management Protocol (IGMP), which is designed for use in broadcast-based LANs [7]. In IGMP, the router of a LAN sends Host-Membership-Query messages destined to a reserved multicast address that includes all hosts in a LAN as members. In response, a host returns a Host-Membership-Report message, which includes a list of multicast addresses in which that host is interested.
Via received membership reports, a router compiles a list of multicast addresses in which the network (LAN) is interested. This process is repeated periodically to accommodate membership dynamics. IGMP uses several optimization techniques to reduce the traffic produced by Host-Membership-Report messages, which must be generated by all hosts in a LAN. Further details of IGMP can be found in [7]. The group communication solutions developed in this dissertation assume the use of an existing local membership protocol, such as IGMP, by hosts to communicate with their respective ingress switches regarding membership identities.

2.2.3 Multicast in the Internet

The Internet is a connectionless network, meaning that, when a sender S wishes to send a datagram to a destination D, the sender is not required to contact D prior to transmission. When S and D share a common communication medium (for example, the two are the endpoints of a point-to-point link, or they both have access to a broadcast medium, such as Ethernet), D receives the datagram directly from S. Otherwise, Internet routers collectively deliver the datagram to D as follows: any router R that receives the datagram forwards it via a communication link that constitutes the first hop of an R-to-D shortest path. This forwarding process starts at the ingress router of S, and is repeated until the datagram arrives at D. In this manner, the routing of a given IP datagram is dynamic and independent of other datagrams. The Internet extends this basic point-to-point datagram delivery model with multicast addresses. A datagram that contains a multicast address as its destination is called a multicast datagram, and must be forwarded to all hosts that are interested in the address. For the discussion of IP multicast, we review four protocols that have been proposed: DVMRP [5], CBT [3, 4], MOSPF [6], and PIM [2]. In this discussion, the term router is preferred over the term switch.
Also, the term multicast group refers to a set of hosts that are listening to an IP multicast address. Following these semantics, multicast groups in the Internet are receiver groups.

Distance Vector Multicast Routing Protocol (DVMRP)

Given a multicast address M, DVMRP builds an SRT individually for each source of M by means of a broadcast-and-pruning process. A multicast stream is initially broadcast throughout the network. The broadcast method, called reverse path forwarding, works as follows. A router R, upon receiving a multicast packet P that originates from S and is destined to M, determines whether P arrived on a link that constitutes the first hop of an R-to-S shortest path. If so, R forwards P to all neighboring routers except the one from which P arrived. Otherwise, the packet is silently discarded by R. In the meantime, routers that are not interested in M send prune messages “upstream,” that is, one hop toward the source S. An upstream router may further discover that all its downstream routers have been pruned from the forwarding tree, and also send a prune message upstream, unless it is itself a member of M. This pruning process is repeated until all the routers involved in the S-to-M forwarding are either members of M or have downstream members of M, producing an SRT that is rooted at S and reaches the members of M.

We use the example shown in Figure 2.2 to illustrate. In Figure 2.2(a), a multicast source is using a broadcast tree to reach five receivers. In Figure 2.2(b), five non-member leaves of the tree send prune messages, which are depicted with dashed lines. In Figure 2.2(c), an intermediate node in the broadcast tree receives prune messages from all its children, and sends a prune message upstream.

(c) the second step in pruning. (d) the resultant multicast tree. Figure 2.2: The operation of the DVMRP.
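The reverse-path-forwarding check described above can be sketched in a few lines. Here `first_hop` stands in for the unicast routing table (the neighbor on a router's shortest path back toward a given node); all names are illustrative.

```python
# Sketch of the reverse-path-forwarding (RPF) check used by DVMRP's
# broadcast phase. A packet is forwarded only if it arrived on the link
# that lies on this router's shortest path back to the source.

def rpf_forward(router, source, arrival_neighbor, neighbors, first_hop):
    """Return the list of neighbors to which the packet should be forwarded."""
    if arrival_neighbor != first_hop[(router, source)]:
        return []     # arrived off the shortest path back to the source: drop
    # Forward on all links except the one the packet arrived on.
    return [n for n in neighbors if n != arrival_neighbor]

# Toy topology: router R's shortest path back to source S goes through X.
first_hop = {("R", "S"): "X"}
rpf_forward("R", "S", "X", ["X", "Y", "Z"], first_hop)  # ['Y', 'Z']
rpf_forward("R", "S", "Y", ["X", "Y", "Z"], first_hop)  # []  (duplicate, dropped)
```

The drop rule is what keeps the broadcast loop-free: each packet copy is accepted over exactly one incoming link per router.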
The multicast tree resulting from this pruning process is depicted in Figure 2.2(d). An interesting aspect of DVMRP is that group membership information is not disseminated, but discovered during tree construction by means of “negative” membership reports, namely the prune messages. However, for this very reason, later membership changes cannot be incorporated into established SRTs. To remedy this problem, existing SRTs must be periodically torn down and reconstructed [5]. This approach causes delays in the handling of membership or network changes. For example, a new member will not receive multicast packets until the next phase of tree reconstruction. Periodic tree construction also imposes unnecessary overhead during “quiet” periods, that is, when no changes are taking place. Moreover, shared-tree topologies are not supported by DVMRP. Additional details of DVMRP can be found in [1, 5]. A hierarchical generalization of DVMRP, called Hierarchical DVMRP (HDVMRP), is described in [36].

Core-Based Tree (CBT) Multicast Protocol

Unlike DVMRP, the CBT protocol [4, 37] builds a shared multicast tree for each group. In the CBT protocol, each multicast group is assigned a distinguished router, called the core node of the group. A member joins the group by sending a JOIN-REQUEST message “toward” the core node; the request stops at the first node that is already on the tree. A branch to the new member is set up by a JOIN-ACK message, which follows the reverse of the path traversed by the JOIN-REQUEST message. A member leaves the group (that is, detaches itself from the tree) by sending a QUIT-REQUEST message to its parent node in the tree, which will also quit if it is not itself a group member and has no other children. An example of the member join operation in the CBT protocol is given in Figure 2.3. Figure 2.3(a) shows the shortest path P from a joining member X to the core node.
It is switch Y, the first on-tree switch along P, that grants the JOIN-REQUEST and returns a JOIN-ACK message, as depicted in Figure 2.3(b). The result of this join operation is shown in Figure 2.3(c).

(a) the shortest path P from a joining member X to the core. (b) the delivery of CBT messages. (c) the resultant tree. Figure 2.3: An example of the member join operation in the CBT protocol.

The CBT protocol handles adverse network events, including router and link failures, by periodically sending CBT-ECHO-REQUEST messages upstream. If a corresponding CBT-ECHO-REPLY is not heard, a member must rejoin the group by finding another path to reach the core. Compared to the DVMRP protocol, the CBT protocol handles membership changes in an event-driven manner, but still uses a periodic method to incorporate network status changes, causing delays in the handling of such changes. This hybrid approach to handling changes may serve some applications well, but could be inappropriate for critical applications that must operate seamlessly in the presence of network changes. Another concern with the CBT protocol is its inflexibility in MC topology: the protocol does not support the SRT MC topology. Further, the restriction that a multicast packet must be forwarded to the core node before being forwarded along tree branches imposes unnecessary steps in multicast forwarding. To illustrate the cost of this restriction, let us consider a scenario where the group members shown in Figure 2.3(c) are also sources to the group (for example, they are conducting a teleconference). Figure 2.4(a) shows the forwarding of a multicast packet originated from node X, when the session is supported by the CBT protocol. For comparison, Figure 2.4(b) shows the forwarding of the same packet when an SST of the same topology is used.
As we can see, the CBT protocol incurs extra forwarding steps, depicted by dashed lines in Figure 2.4, due to its restriction on the starting point of tree distribution.

(a) using the CBT protocol. (b) using an SST. Figure 2.4: Comparison of multicast forwarding in the CBT protocol and SSTs.

Besides the CBT protocol, the concept of core-based multicast has also been adopted in other IP multicast protocols. Specifically, the Ordered CBT (OCBT) protocol [38] addresses the concern of core failures in the CBT protocol, and the Border Gateway Multicast Protocol (BGMP) [39] constructs core-based multicast trees that span the boundaries of autonomous systems (that is, routing domains in the Internet).

Multicast Extension to OSPF (MOSPF)

The MOSPF protocol [6] is an extension of the Internet LSR protocol, OSPF [11]. In the MOSPF protocol, the identities of group members are broadcast via group-membership LSAs, such that all routers maintain complete member lists for all active multicast addresses. The distribution channel for a multicast group is constructed when the first datagram destined for the multicast address is sent. Upon receiving the first datagram that originates from a source S and is destined for a multicast address M, a router consults its local database for the member list of M and computes a shortest-path tree T that is rooted at the source switch of the datagram and reaches the switch members of M. Subsequently, the router saves a multicast routing entry such that datagrams from S to M will be forwarded via a set of outgoing links determined by T, and forwards the datagram accordingly. This forwarding will trigger further topology computations at downstream routers. An example of MOSPF operation is given in Figure 2.5, where a host that is attached to router A sends a datagram to a multicast group with members attached to routers C and D.
As shown in Figure 2.5(a), router A computes a shortest-path tree that is rooted at A and reaches C and D. This computation is possible because the topology of the network is compiled by the underlying LSR protocol, OSPF, while the member list of the destination group ({C, D}, in this example) is made available by MOSPF. The resultant tree shows that A must forward the datagram to F, which upon receipt will perform the tree computation again and learn of its downstream routers C and D; see Figure 2.5(b). When C and D receive the datagram, they will also carry out the identical tree computation, only to notice that they are leaf routers and should forward the datagram to their attached hosts; see Figure 2.5(c).

As illustrated, the MOSPF protocol imposes redundancy in topology computation: identical computations are performed at all routers involved in a multicast tree. This problem is exacerbated by the restriction that the MOSPF protocol supports only SRTs; hence this computational redundancy is incurred on a per-source, per-group basis rather than a per-group basis. Furthermore, to adapt to membership and network topology changes after a tree construction process, multicast routing entries created for the tree must be cleared upon the arrival of LSAs that advertise membership or network changes, resulting in the re-construction (and re-computation) of the tree when new multicast datagrams arrive.

Protocol Independent Multicast (PIM)

With the MOSPF and DVMRP protocols, every router in a routing domain (or possibly the entire Internet) may be involved in a multicast session. In the case of the MOSPF protocol, every router receives membership-change LSAs and maintains member lists for all active multicast groups. With the DVMRP protocol, a multicast stream is periodically broadcast throughout the network.
The overhead of network-wide involvement may be justified when a large fraction of the hosts in the network is interested in the multicast; such multicast sessions are sometimes termed dense-mode multicasts [2]. In contrast, sparse-mode multicast refers to cases where the participants represent only a small fraction of the hosts in the network and, therefore, network-wide involvement is considered too costly. The PIM protocol supports both dense-mode and sparse-mode multicast.

(b) the tree computation and forwarding at F. (c) the tree computation and forwarding at C and D. Figure 2.5: The operation of the MOSPF protocol.

Like PIM, the CBT protocol, which is a representative approach to supporting (receiver-only) shared-tree MCs, also does not incur network-wide involvement. However, the PIM protocol further emphasizes the need to support other MC topology types, specifically the SRT topology. In addition, the designers of the PIM protocol sought universal applicability of the protocol, and therefore designed the protocol so as not to rely on any specific routing protocol; hence the name Protocol Independent Multicast. The PIM approach to supporting both dense-mode and sparse-mode multicast is straightforward; it actually comprises two multicast protocols, one for each mode. In the dense mode, the PIM protocol uses the DVMRP protocol (the MOSPF protocol was not chosen because of its dependence on LSR). For sparse-mode multicast, the PIM protocol “initially” builds receiver-only shared trees; the construction of SRTs is performed selectively for some sources during the multicast session. A network region, whether it is a LAN, a routing area, or an autonomous system, that wishes to participate in a sparse-mode multicast is assigned a rendezvous point (RP), which must be a PIM-capable router in that region.
The RP of a region plays a role similar to that of the core node in the CBT protocol. Members in that region issue RP-JOIN requests, which serve the same function as the JOIN-REQUEST messages of the CBT protocol, producing within the region a ROST rooted at the RP. If N regions are interested in a multicast address, N different RPs will be associated with the address, and N shared trees will be constructed. The source of a datagram with a given multicast address must forward the datagram to all RPs associated with that address. Each RP will forward the datagram along shared-tree branches to reach group members. These concepts are illustrated in Figure 2.6, where two shared trees are constructed for a multicast address that has three sources. Detailed information about the sparse-mode PIM protocol, called PIM-SM, can be found in [33].

Figure 2.6: Shared trees constructed by the PIM protocol.

The PIM-SM protocol constructs SRTs by means of a topology transition process, which operates in a data-driven manner. When router members of a multicast address observe heavy traffic from a source S, they may determine that the source could be better served by a private distribution channel, and issue SOURCE-JOIN requests to S, resulting in a multicast tree that is rooted at S. Continuing the previous example, Figure 2.7 shows that an SRT has been built for the source S3.

Figure 2.7: The result of topology transition for the sender S3.

PIM’s approach to supporting multiple MC topology types is elegant and efficient; we expect wide acceptance of the protocol in the Internet. However, its topology transition process, which builds SRTs, is data-driven and, hence, cannot be applied to connection-oriented networks, such as ATM networks, where routing must be established and maintained in a manner that is independent of traffic streams.
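The join-toward-core construction shared by CBT and sparse-mode PIM can be sketched as follows. Here `next_hop` plays the role of the unicast routing table, and all names are illustrative rather than taken from either specification.

```python
# Sketch of grafting a member onto a core/RP-rooted shared tree: a JOIN
# travels hop by hop toward the core and stops at the first router already
# on the tree; the acknowledgment retraces the path, installing the branch.

def join_group(member, core, next_hop, on_tree, tree_links):
    """Graft `member` onto the shared tree rooted at `core`."""
    path = [member]
    node = member
    while node not in on_tree:            # the JOIN travels toward the core
        node = next_hop[(node, core)]
        path.append(node)
    # The JOIN-ACK retraces the path, installing parent-to-child branch links.
    for child, parent in zip(path, path[1:]):
        tree_links.add((parent, child))
        on_tree.add(child)
    return path

# Toy state: the tree so far is core -> Y; X's route toward the core goes via Y.
on_tree = {"core", "Y"}
tree_links = {("core", "Y")}
next_hop = {("X", "core"): "Y"}
path = join_group("X", "core", next_hop, on_tree, tree_links)
# The JOIN from X stops at Y, the first on-tree router; (Y, X) becomes a branch.
```

Because the JOIN stops at the first on-tree router, each join touches only the routers on the new branch, which is what keeps the sparse-mode approach free of network-wide involvement.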
The previous methods and current challenges of supporting group communication in such networks will be reviewed in the next section. An important open issue regarding the PIM protocol is the selection of RPs and the dissemination of their identities. According to the Internet multicast model described in [7], a host should be able to listen to a multicast address simply by informing its ingress router of the address. Since hosts are not obligated to provide RP identities, routers must obtain RP identities via an independent mechanism, which has not yet been determined at the time of this writing [2]. As we will show in Chapters 6 and 7, modeling this RP management problem as a leader election problem within the network constitutes an important part of our research.

2.2.4 Multicast in ATM Networks

ATM networks are connection-oriented networks that relay small fixed-size cells in hardware. An ATM cell is 53 bytes long, comprising 48 bytes of payload and 5 bytes of control information. Before transmission, a traffic source must set up a virtual circuit (VC) that defines a path between the source and a destination. All cells belonging to the VC will follow this path to reach the destination. Switching fabrics at ATM switches along the path use a virtual circuit identifier (VCI), contained in the control bytes of each cell, to determine the outgoing link for the cell.

These concepts are perhaps best explained using an example. Figure 2.8(a) depicts a VC between a source host S and a destination host D. In the example, we assume that every switch has four input ports, numbered 0 to 3, and four output ports, again numbered 0 to 3. Before transmission, the source host S issues a VC setup request to its ingress switch X; the conventions and procedures that a host follows to communicate with its ingress switch are termed the User-Network Interface (UNI) [30].
Included in the request message is an input-VCI field, which indicates the VCI value chosen by the requesting host to identify cells belonging to the VC. In the example, the source host S chooses the value 5. The ingress switch X determines an output port that leads to the next-step switch defined by a shortest S-to-D path (port 2, in this example, which leads to the switch Y), selects an unused output VCI value for the VC (9, in the example), replaces the value of the input-VCI field with the new value, and forwards the request to the next-step switch (namely, Y). The set of conventions and procedures that network switches use to communicate with each other is called the Private Network-to-Network Interface (PNNI) [13]. Continuing the example, the same task is repeated at switches Y and Z. Switch Y selects the output VCI 2, which becomes the input VCI for Z, and forwards the request to Z via port 2. Switch Z selects the output VCI 6, which becomes the VCI value that the destination host D uses to recognize cells pertaining to the VC. An S-to-D connection has been established. When traffic flows through the connection, an involved switching fabric uses a switching table to determine the forwarding of cells. The switching table at port 0 of switch X is shown in Figure 2.8(b). As we can see, the input VCI 5 is indexed into an entry that instructs the switching fabric of X to forward cells with that VCI value to output port 2, and to tag those cells with the new VCI value 9. Further details of the VC setup procedure can be found in the UNI 3.1 [30] and PNNI 1.0 [13] standards, which have been produced by the ATM Forum, an international non-profit organization that comprises industrial and academic members.

[Figure 2.8(a): the use of VCI values along a path, from the ingress switch of S (switch X) through switches Y and Z to the ingress switch of D. Figure 2.8(b): the switching table at port 0 of switch X, which maps input VCI 5 to output port 2 and output VCI 9.]
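The hop-by-hop VCI translation in this example can be sketched as follows. Each switch keeps, per input port, a table mapping an input VCI to an (output port, output VCI) pair; the VCI values mirror the example above (S chooses 5; X, Y, and Z assign 9, 2, and 6), while the downstream input-port numbers are assumptions made only for this sketch:

```python
# Per-switch switching tables: (input port, input VCI) -> (output port, output VCI).
# VCI values follow the text; input/output port numbers at Y and Z are assumed.
tables = {
    "X": {(0, 5): (2, 9)},   # port 0, VCI 5 -> out port 2, tagged VCI 9
    "Y": {(0, 9): (2, 2)},
    "Z": {(0, 2): (1, 6)},
}

def relay(path, in_port, in_vci):
    """Follow one cell through the switches on `path`, rewriting its VCI
    at every hop, as an ATM switching fabric would."""
    trace = []
    for switch in path:
        out_port, out_vci = tables[switch][(in_port, in_vci)]
        trace.append((switch, in_vci, out_vci))
        in_port, in_vci = 0, out_vci   # assume arrival on port 0 downstream
    return trace, in_vci

trace, final_vci = relay(["X", "Y", "Z"], 0, 5)
```

The final VCI value, 6, is the one the destination host D uses to recognize cells belonging to this VC.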
Figure 2.8: VC operation in ATM networks.

The connection-oriented nature of ATM requires that the topology of an MC be determined and constructed before the presence of associated traffic streams. Further, the maintenance of the topology must be performed in a signaling-driven manner, that is, in response to network control messages, rather than the receipt of multicast data itself. For these reasons, many IP multicast solutions are not applicable to ATM networks. In this section, we discuss the ATM protocol used to establish one-to-many VCs, or multicast VCs, which are the only MC type presently supported by ATM standards. At the time of this writing, it is not clear which protocol(s) will be used in ATM to support other MC types. However, we will survey two proposals that have been discussed in the ATM Forum.

Multicast VCs

The concept of one-to-one VCs can be generalized to one-to-many VCs, or multicast VCs. This generalization requires an optional hardware feature, called cell replication, in order to forward multiple copies of an incoming cell via different output ports. This feature is supported in many commercial ATM switches, for example, those provided by Fore Systems [40]. In UNI 3.1, a multicast VC has exactly one source party, called the root, and can be routed to one or more receiving parties, called leaves, following a tree topology. A multicast VC is set up by its root, which uses a procedure similar to the one-to-one VC setup procedure to connect to the first receiver. The result of this first step is a multicast VC with exactly one leaf node. Subsequently, the root can issue as many ADD-PARTY messages as necessary to attach additional leaves to the multicast VC. However, current ATM standards do not support group addresses, meaning that the source must learn the identities of receivers via a host-level protocol.
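The root-initiated setup just described can be modeled with a small sketch. The class and method names here are ours, chosen for illustration; they are not ATM signaling message formats:

```python
# Toy model of a UNI 3.1 root-initiated multicast VC: the root connects to a
# first leaf at setup time, then attaches further leaves via ADD-PARTY.
class MulticastVC:
    def __init__(self, root, first_leaf):
        self.root = root
        self.leaves = {first_leaf}      # setup always yields exactly one leaf

    def add_party(self, leaf):
        """Root-issued ADD-PARTY: attach one more leaf to the tree."""
        self.leaves.add(leaf)

    def send(self, cell):
        """Cell replication: one incoming cell, one copy per leaf."""
        return [(leaf, cell) for leaf in sorted(self.leaves)]

vc = MulticastVC("S", "D1")     # root S connects to the first receiver
vc.add_party("D2")
vc.add_party("D3")
copies = vc.send("cell0")
```

Note that the root must already know the identities D1, D2, and D3; as the text observes, current ATM standards give it no group address from which to learn them.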
In the most recent version of the ATM UNI (namely, UNI 4.0), receiver-initiated actions are supported so that receivers can join and leave a multicast VC without involving the source party. Again, receivers must learn via a host-level protocol the identities of the source party, or parties, in a multiparty communication application.

Proposals for Supporting Group Addressing in ATM

The lack of a group addressing mechanism in present ATM standards leaves the users/hosts to deal with the membership issue in group communication. The ATM Forum intends to add group addressing support in a future release of the PNNI standard [13]. Here, we review two proposals that have emerged within the ATM Forum.

1. A central-server approach for group membership management is promoted in [41]. In this proposal, a switch in an ATM network is configured as the group management center of the network, where the member lists of all active groups are maintained. Changes in membership must be sent to this switch in order to update member lists. A host that wishes to construct a multicast VC to a group G contacts the management center to obtain the member list of G, and follows the UNI 3.1 standard to set up the multicast VC. This approach is designed for membership management, and facilitates the construction of multicast VCs, which are SRTs. Other MC topology types, such as receiver-only and symmetric shared trees, are still not supported. Further, the issue of a single point of failure at the management center is considered "not critical," and is not addressed [41].

2. A variation of the CBT protocol for use in ATM networks, called the ACBT (ATM CBT) protocol, is described in [42]. This protocol is similar to the CBT protocol in that each group is assigned a core node, which is the root of a tree that reaches group members. This tree, however, is not an ATM multicast VC.
Rather, the signaling modules of switches involved in the tree maintain the parent/child relations defined by the tree. In the ACBT protocol, a source party S can connect to all the members of a group via a single connection request, resulting in a multicast VC whose topology is the concatenation of an S-to-core path and the shared tree rooted at the core. To illustrate, let us consider the three-member group shown in Figure 2.9(a), where the shared tree of the group is depicted by dashed lines. Figures 2.9(b) and (c) show the multicast VCs for two different sources. As shown in Figure 2.9(c), a link may be used by a multicast VC in two directions. This sometimes happens because the source must reach the core, the only contact point in the CBT and ACBT protocols, before the shared tree can be used. We also emphasize that the two multicast VCs shown in the figure operate independently, despite the fact that they use identical sets of communication links (as defined by the shared tree) after a packet has reached the core node; the shared tree of a group exists in the form of signaling states, and is merely used to define the topology of multicast VCs destined to the group. Since multicast VCs destined to a group must be set up individually (although they share the same tree topology), it is difficult to support some ATM features on a "per-group" basis. For example, given a group G, network resources must be reserved for each individual multicast VC destined to G, rather than for the group alone.

[Figure 2.9: Operation of the ACBT protocol. (a) a three-member group and its shared tree; (b) an example of resultant multicast VCs; (c) another example of resultant multicast VCs. Legend: member, core, source, MC link, shared-tree link.]

In summary, the ACBT protocol supports group addressing and multicast VCs, which are source-rooted but not necessarily shortest-path trees. Interestingly, the protocol, albeit a CBT variation, does not support shared-tree MCs.
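The ACBT topology rule, concatenating the source-to-core path with the shared tree rooted at the core, can be sketched as follows. The node names are illustrative; note how the link between `a` and the core is used in both directions, the situation depicted in Figure 2.9(c):

```python
# Directed shared-tree branches, oriented away from a hypothetical core.
shared_tree = [("core", "a"), ("a", "m1"), ("a", "m2")]

def acbt_vc_links(source_to_core_path, tree):
    """An ACBT multicast VC's topology is the source-to-core path followed
    by the shared tree; a physical link may thus appear in both directions."""
    return source_to_core_path + tree

# A source S whose shortest path to the core passes through tree node `a`.
vc = acbt_vc_links([("S", "a"), ("a", "core")], shared_tree)
```

The resulting link list traverses the (a, core) link toward the core and then back out again, because the core is the only contact point at which the shared tree may be entered.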
Another respect in which the ACBT protocol differs from the CBT protocol is the management of the core. The ACBT protocol handles the selection of the core node when a group is created, rather than leaving the task to users/hosts, as in the case of the CBT protocol. When the first member of a group joins, the ACBT protocol randomly picks a switch as the core, and advertises this core-group binding via LSAs. This binding is recorded as part of the network image at every switch. Subsequent joining members follow a CBT-like procedure to connect themselves to the core, whose identity should now be available throughout the network. When different cores are suggested by several initial members that join the group at approximately the same time, the core candidate with the smallest ID wins.

2.2.5 Discussion

In summary, the designers of multicast protocols face the following challenges. First, multiparty communication applications demand a variety of MC topology types to meet different performance criteria. While multiple protocols could be used to achieve this goal, a single "generic" solution promises to avoid unnecessary overheads and redundancy. Second, it is desirable that host members of a group be aware only of the address of the group, and not the details of the underlying MC protocol. The fact that the group is associated with a core node, or a set of rendezvous points, should be hidden from users and hosts. As a result, any distinguished members needed in the protocol should be selected by the network, rather than by users or hosts. Third, when such distinguished members are required, the concern of a single point of failure arises. The network, rather than users and hosts, must handle such failures. Presently, neither the IP multicast protocols nor the ATM solutions meet all these requirements.
A main theme of this thesis is to show that these difficult issues in the Internet and in ATM networks can be appropriately addressed when the network uses a specific type of routing, namely, link-state routing. Specifically, an LSR-based generic MC protocol will be presented in Chapter 5, and alternative approaches to modeling the RP/core management as a leader election problem in LSR-based networks will be discussed in Chapters 6 and 7.

2.3 Overview of Link-State Routing

LSR was initially designed for use in the ARPANET [12]; fault tolerance issues associated with the original protocol are addressed in [43]. The ISO (International Standards Organization) version of LSR, the IS-IS (Intermediate System to Intermediate System) protocol [44], improves the efficiency of LSR when used in networks interconnected by broadcast-based LANs, such as Ethernet and token ring. These improvements have been incorporated in a new Internet routing protocol, called OSPF (Open Shortest Path First) [11]. Another recent application of LSR is the ATM PNNI standard [13], whose contributions include, among others, a method for hierarchically constructing large-scale, LSR-based networks, and an LSR-based group leader election protocol. In this section, we provide background on LSR that will be needed later in the dissertation. For purposes of discussion, the terms router, switch, and node will be used interchangeably.

2.3.1 Basic Operation

The essence of LSR is to maintain complete network images at all switches. For this purpose, every switch broadcasts throughout the network its local states, including nodal states and link states. Nodal states concern the working condition of a switch, for example, the workload at the switch. Link states describe communication links that are incident to the switch. Typically, link states include queueing delay, data loss rate, bandwidth, the capacity of associated buffers, monetary cost (for using the link), and so on.
For historical reasons, control messages containing either state type are referred to as link-state advertisements (LSAs). After compiling an image of the network incrementally via received LSAs, a switch X routes traffic to a destination D according to a shortest X-to-D path computed locally. In general, the universal availability of complete network knowledge at every switch creates a robust infrastructure to support various network services, including group communication. In order to update network images to reflect network status dynamics, every switch constantly monitors its local states and advertises changes in these states immediately. For example, when a link fails, the value of its working state is changed from ON to OFF, producing a link-down LSA from each of its endpoints. Similarly, link-up LSAs are flooded when the link later returns to an operational state. The working state of a link, which has only two values, is discrete; changes in such states are always advertised. For continuously valued states (such as queueing delay, which is a positive real number), a change in state is advertised only if the change exceeds a predetermined threshold. The topology of a network is defined by the set of operational switches and communication links. Although it may be tempting to consider the working states of switches (as is the case for links), such states are not defined in LSR. That is to say, there are no "switch-up/switch-down" LSAs. This is because an LSR protocol cannot distinguish failed nodes from nodes that become unreachable due to failed links. To illustrate, let us consider the example in Figure 2.10(a), where the node X crashes. The five neighboring switches of X (A, B, C, D, and Y) detect the lack of responsiveness of the five links incident to X, and flood five respective link-down LSAs.
In this example, switch A can learn of only four link-down events, because switch Y, which advertises the failure of the (X, Y) link, has been isolated by the failure of X. Figure 2.10(b) shows the network as perceived by switch A (and any switch other than X and Y) at this moment in time.

[Figure 2.10: Problem in correctly identifying node failure. (a) node X crashes; (b) the perception of nodes other than X and Y.]

This observation suggests that an LSR protocol, which is not able to determine whether a switch has failed, should instead be concerned with "reachability" to the switch. For example, once X and Y become unreachable, they cease to exist with respect to the operation of A. The concept of reachability is important not only to the handling of node failures, but also to the handling of much more disastrous circumstances, such as network partitioning. We will return to this issue in the next section. A flooding protocol, used for the broadcast of network status information, is a highly robust protocol that guarantees that eventually all network nodes reachable from the source of an LSA will receive the LSA. The "conventional" flooding protocol works as follows. In order to send an LSA, the source switch sends the LSA to all its neighboring switches. For identification, LSAs typically contain the source address and a sequence number. When an LSA is received by another switch for the first time, it is forwarded on all incident links, except the one on which it arrived. Copies of LSAs that have already been seen by a switch are silently ignored. In this manner, every LSA is forwarded by every switch exactly once. An example of this flooding protocol is depicted in Figure 2.11; the flooding operation requires four steps to complete.

[Figure 2.11: An example of the flooding operation, steps 1 through 4. Legend: node that has received the LSA, LSA transmission, node that has finished the flooding.]
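The conventional flooding protocol can be simulated with a short round-based sketch: a switch receiving an LSA for the first time forwards it on all incident links except the arrival link, and duplicate copies are ignored. The star topology below is illustrative, not the network of Figure 2.11:

```python
def flood(adj, source):
    """Simulate conventional flooding of one LSA originating at `source`.
    Returns (synchronous steps until no copy is in flight, switches reached)."""
    seen = {source}
    holders = [(source, None)]        # (switch holding a fresh LSA, arrival link)
    steps = 0
    while holders:
        forwarded = []
        for node, came_from in holders:
            for nbr in adj[node]:
                if nbr == came_from:  # never sent back on the arrival link
                    continue
                if nbr not in seen:   # duplicate copies are silently ignored
                    seen.add(nbr)
                    forwarded.append((nbr, node))
        holders = forwarded
        steps += 1
    return steps, seen

star = {"s": ["a", "b", "c"], "a": ["s"], "b": ["s"], "c": ["s"]}
steps, covered = flood(star, "s")
```

On the star, the LSA reaches every switch in one step, and one further step is consumed by the leaves forwarding (and the hub discarding) their copies, illustrating that every switch forwards the LSA exactly once.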
The conventional flooding method has been adopted for use in both connectionless networks, such as the Internet, and connection-oriented networks, such as ATM networks. In the case of ATM, the hardware-based multicast method, namely the use of multicast VCs, has previously been considered unsuitable for the flooding/broadcast of LSAs, because it cannot guarantee the delivery of LSAs to all reachable nodes, as guaranteed by the conventional flooding method. Hence, the LSR operation in ATM proceeds in a less efficient, hop-by-hop manner. In Chapters 3 and 4, we will demonstrate how to take advantage of multicast VCs in flooding operations, while providing guaranteed delivery.

2.3.2 Fault Tolerance Issues

Networks are often expected to operate for long periods of time, in the presence of adverse conditions or even catastrophic scenarios. While many distributed applications ignore very rare adverse events, the networks themselves, and their underlying routing protocols, are expected to survive. Two types of such events, or faults, are of particular interest to LSR researchers: transmission errors not caught by the error detection mechanism (for example, CRC checksums) and the partitioning of the network. In LSR-based networks, the fault tolerance issue is closely related to the consensus problem. Recall that the consensus problem under LSR is to ensure the convergence of network images under the most adverse situations. Fault tolerance mechanisms in LSR either try to eliminate deterrents to achieving consensus or try to achieve consensus as soon as a consensus-prohibiting situation is cleared. A number of methods have been proposed to achieve this highly challenging goal [45]. Following is a summary of the widely accepted OSPF solution [11]; a similar solution is adopted in the ATM PNNI standard [13].

• Switches not only advertise status changes immediately, but also broadcast their status periodically.
This practice enables temporarily isolated segments of the network to exchange information with each other after re-unification (one segment learns of the existence of other segments in the next flooding cycle). Periodic flooding also controls the lifetimes of corrupted parts of network images that may occur due to undetected transmission errors, for the corrupted information will be overwritten in the next cycle of flooding.

• An aging mechanism is used to identify obsolete information. Specifically, every entry in a network image has an associated aging timer, and the entry is discarded when its timer goes off. Nullified parts in a network image can later be filled by relevant LSAs with any sequence number value. The aging mechanism is needed to correct errors that the re-flooding mechanism alone may take too long to correct. An example is undetected transmission errors in the sequence number field of LSAs. Let us consider an LSA with sequence number n that is incorrectly received as n + k at some switch. Further assume that the source of the LSA re-floods every minute. If the value of k is 2^28, it would take more than 500 years for the source switch to catch up (that is, to use sequence numbers larger than n + k) and override the corrupted information. An aging mechanism solves this problem.

To further illustrate the use of these concepts, let us continue the example of Figure 2.10. Figure 2.12(a) depicts the local image at switch Y after the crash of X. We point out that the local image at switch Y (incorrectly) still contains the links (X, A), (X, B), (X, C), and (X, D), because Y cannot receive the corresponding link-down LSAs. Using the aging mechanism, any node other than X and Y will remove the link (X, Y) and the nodes X and Y from its local network image, after not hearing periodic flooding from X and Y for a predetermined period of time.
Put another way, the {X, Y} induced subgraph "ages out" in other parts of the network because it is no longer periodically reinforced by the two nodes. Figure 2.12(b) depicts the network image at any non-(X, Y) node after the aging mechanism takes effect. The network image at Y after aging consists of only one node, Y itself, since all the other nodes will age out at Y. This image is omitted in Figure 2.12. To finish the story with a happy ending, we assume that node X later becomes operational. After the revival of X, all switches learn of the existence of links incident to X via link-up LSAs. Switches other than X and Y learn of the existence of these two nodes via the periodic status broadcasts from them. Similarly, the nodes X and Y become aware of the other parts of the network via periodic status broadcasts from other nodes. Eventually, all the switches will learn the network topology shown in Figure 2.12(c), achieving consensus on the network images throughout the network.

[Figure 2.12: The handling of network partitioning in LSR. (a) the local image at Y after the crash of X; (b) network images at nodes other than X and Y, after aging; (c) the consenting network image, after the revival of X.]

The robustness of LSR is a major reason for its wide acceptance in many modern networks. However, the operation of LSR may raise concerns about scalability. First, the size of network images grows with the size of the network, which is consequently limited by the switch with the least memory space. Second, for a network with an average degree (the average number of links incident to a node) d, every LSA will be received on average d times by every switch. Further, if the network has N switches that periodically flood their status every T seconds, every switch needs to handle dN/T LSAs per second. When N is sufficiently large, the workload of LSA processing alone will exceed the computation capacity of switches, or the flooding of these LSAs may use up the bandwidth of the network.
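The back-of-the-envelope load estimate above is easy to reproduce; the numbers plugged in below are illustrative, not measurements:

```python
def lsa_load(n_switches, avg_degree, period_seconds):
    """Approximate LSAs handled per switch per second when all N switches
    flood periodically: each LSA arrives about d times, giving d*N/T."""
    return avg_degree * n_switches / period_seconds

# E.g., 1000 switches of average degree 4 flooding every 30 seconds.
load = lsa_load(n_switches=1000, avg_degree=4, period_seconds=30)
```

Even with these modest parameters, each switch must process on the order of 130 LSAs per second, which is why large flat LSR networks quickly become impractical and hierarchical routing is needed.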
Of course, there are ways to address the scalability issue. In the case of the Internet, LSR is intended for use in a set of networks under one administrative authority (in Internet terminology, an Autonomous System), which typically contains a few hundred switches and possibly several thousand hosts. In some other cases, such as the case of ATM, LSR is intended to support nation-wide, or even global, networks. In such cases, scalability can be achieved only by means of hierarchical routing.

2.3.3 Hierarchical LSR

Hierarchical routing reduces the burden on individual switches by hiding the complexity of the entire network. Different ways of supporting a routing hierarchy with LSR have been developed and deployed [11, 13]. In the Internet, the OSPF protocol defines a two-level LSR hierarchy such that a router sees only the subnetwork to which it belongs and the subnetwork's border routers, that is, routers that connect to the backbone subnetwork [11]. While intra-subnetwork traffic is routed as described in the previous section, cross-subnetwork traffic is routed in three stages: first through the home subnetwork to a border router, from there across the backbone network to reach a border node of the destination subnetwork, and finally through the destination subnetwork. A more general method of hierarchical LSR is described in the ATM PNNI 1.0 standard [13], which allows for arbitrary hierarchy depth. In this method, a physical network is divided into several peer sub-networks, called routing domains. For example, the network shown in Figure 2.13 can be divided as shown in Figure 2.14. This division is performed manually by configuring every switch with a domain ID. After division, each domain runs a separate instantiation of LSR; that is, switches within a domain exchange status information so that each of them maintains a "domain image." Continuing the previous example, the image of domain A.4 is depicted in Figure 2.15.
As shown, a domain image contains not only intra-domain links, but also outgoing ones. An outgoing link, or inter-domain link, is advertised in the domains containing its endpoints. Hence, the link (A.4.1, A.2.3) in Figure 2.13 will be advertised in domain A.4 by switch A.4.1 and in domain A.2 by switch A.2.3.

[Figure 2.13: A network topology. Figure 2.14: Breaking up the network into routing domains. Figure 2.15: The image of the domain A.4. Legend: in-domain switch, intra-domain link, inter-domain link.]

The presence of inter-domain links in the image of a domain enables the domain to see neighboring domains. For a domain to see all the other domains in the network, one must run a copy of inter-domain LSR. To perform LSR among domains, a leader switch is elected within each domain. In ATM PNNI, the nodal states of a switch include two election-related states: leader priority and preferred leader. The former is manually configured by network managers to determine the rank of the switch. The latter is determined as follows: every switch independently searches in its domain image for a reachable switch that has the highest leader priority, and calls the result of the search its preferred leader. As with other LSR states, any change in the preferred-leader state must be flooded immediately. If the preferred leader at a switch is the switch itself, this switch shall, after waiting for a period of time, inspect its local domain image for the preferred leaders of other switches. Only if unanimity is obtained will the candidate switch proclaim victory. For illustration, consider a network where the administrator configures a default leader switch X with leader priority 3 and a backup leader Y with priority 2. The remaining switches are all configured with priority 1.
We assume that initially switch X is the preferred leader of all other switches. Now consider what happens when the established leader X crashes. As described earlier, neighboring switches of X will advertise link-down LSAs for the incident links of X. Using these LSAs, every network switch finds the current leader unreachable, and searches through its local image for a switch with the next highest priority. In this case, the result would be Y, with priority 2. Since every switch changes the value of its preferred-leader state to Y, every switch advertises this change immediately. These advertisements can be considered "ballots," which the switch Y must collect before claiming itself the new leader. Once elected, a leader learns the identities of neighboring leaders, namely the leader switches in neighboring domains, via the LSAs regarding inter-domain links. (Preferred leaders of endpoints are included in such LSAs.) The leader then sets up a VC to connect to each neighboring leader. The inter-domain LSR is performed collectively by domain leaders as follows: each leader uses inter-leader VCs to flood to all the other leaders nodal states that present a simplified representation of its home domain and link states that describe its connectivity to neighboring domains. As such, each leader compiles a simplified view of the entire network. In this view, a node represents a routing domain and a link represents the adjacency of its endpoint domains. For the example of Figure 2.13, the corresponding simplified network image is depicted in Figure 2.16. In ATM PNNI, the division-and-simplification process just described can be applied recursively to build a routing hierarchy of any depth. For example, when the network of Figure 2.13 is connected to an internet, the simplified network view shown in Figure 2.16 constitutes a domain in the internet, and a leader is elected among the domain leaders to represent the entire network in the next routing level.
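The preferred-leader computation and the unanimity check in this example can be sketched as follows. The priorities mirror the example (X has 3, Y has 2, the rest 1); the rule for breaking ties between equal priorities is an assumption of this sketch, not taken from the PNNI standard:

```python
# Manually configured leader priorities, as in the example above.
priorities = {"X": 3, "Y": 2, "A": 1, "B": 1}

def preferred_leader(reachable, prio=priorities):
    """Each switch picks the reachable switch with the highest leader
    priority; the smallest ID breaks ties (our assumption)."""
    return sorted(reachable, key=lambda s: (-prio[s], s))[0]

def elect(images):
    """images: {switch: set of switches reachable in its domain image}.
    Every switch floods its preference (a 'ballot'); a candidate proclaims
    victory only if the ballots are unanimous."""
    ballots = {s: preferred_leader(reach) for s, reach in images.items()}
    winners = set(ballots.values())
    return winners.pop() if len(winners) == 1 else None

everyone = {"X", "Y", "A", "B"}
leader_before = elect({s: everyone for s in everyone})
survivors = {"Y", "A", "B"}            # link-down LSAs remove X from all images
leader_after = elect({s: survivors for s in survivors})
```

Before the crash the unanimous choice is X; once the link-down LSAs make X unreachable in every image, the ballots converge on the backup Y.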
[Figure 2.16: The simplified/high-level network image.]

2.4 Discussion

The main theme of this thesis is to demonstrate and exploit the mutually beneficial relationship between group communication and LSR. Three facets of group communication will be examined in the context of LSR support: use of multicast VCs in LSR flooding, MC construction and maintenance, and leadership consensus. Let us now briefly introduce each of these problems, given the background information that has been presented in this chapter.

First, LSR itself can benefit from group communication techniques, because many aspects of LSR operation exhibit characteristics of group communication. In LSR, switches in a routing domain form a communication group: they broadcast to the group, receive broadcast messages (that is, LSAs) from the group, maintain member lists of the group (which are implicitly included in local domain images), and elect a leader to represent the group in the next routing level. Moreover, such group communication characteristics in LSR are even more obvious in hierarchical LSR networks: at higher routing levels, the LSR tasks of flooding, membership management, and leader election are performed collectively by domain leaders. Since leaders are not necessarily physically adjacent to each other, a flooding operation among leaders forms a true multicast operation in the entire network. In this thesis, we identify an important aspect of LSR that can benefit from group communication: the flooding operation. We note that, while present ATM standards use hardware switching and cell replication to speed up host-level multicast, flooding operations still proceed in a store-and-forward manner as described earlier.
Our first main contribution is to show that flooding operations can make use of the hardware capability of ATM switching fabrics to improve performance, while at the same time guaranteeing delivery to all nodes reachable from an originating node, as in the case of the conventional flooding protocol. In Chapters 3 and 4, we describe a family of switch-aided flooding (SAF) protocols that work in this manner. Second, the construction of MCs can benefit from the complete network information made available by LSR. We have discussed one multicast protocol, the MOSPF protocol, that takes advantage of LSR; it uses LSR to disseminate membership information so that every router has a member list for every active MC. However, the MOSPF protocol is restrictive in supporting different MC topology types, and incurs computational redundancy. As we noted in previous sections, multiparty communication applications need different MC topology types. Further, the rising importance of QoS service is leading to new, sophisticated MC topology computation algorithms, many of which are not supported by existing MC/multicast protocols. This thesis will show that the availability of complete network and MC membership information at switches/routers in LSR-based networks makes it possible to design a "chassis" for MC protocols to accommodate existing and future MC topology computation algorithms. The resultant generic MC (GMC) protocol will be presented in Chapter 5. Third, we consider the problem of leader election. Although leader election is not directly required by all group communication applications, some prominent multicast protocols, such as CBT and PIM, assign a network node as the multicast traffic transit center, or the core node, for the group.
Arguably, the core node of a group must be selected by the network; if the identity of the core is provided by host members, then the host-network interface for multicast depends on the choice of multicast protocol within the network (some multicast protocols require core identities from the interface, while others do not). Further, the introduction of a traffic transit center raises the concern of a single point of failure. The problem of assigning core nodes to groups can be modeled as a leader election problem (the leader of a group undertakes the responsibility of the core node). The fault tolerance of LSR enables the design of robust election protocols, such as the ATM leader election protocol, that handle not only leader failures but also disastrous scenarios, for example, network partitioning. However, the overhead of the current ATM leader election protocol (every group member uses flooding to report its preferred leader) may be prohibitively expensive if used to support multicast groups, because a large number of such groups may exist simultaneously in a network. The design of efficient LSR-based support for the election problem constitutes the third part of this research. Our NLE protocol, presented in Chapter 6, accommodates a membership management mechanism that achieves the following consensus property: a set of mutually reachable group members reach consensus on a leader, which maintains a member list containing exactly those members. The LCM protocol, presented in Chapter 7, uses the NLE protocol to elect a leader switch as the centralized core management server, which manages the core nodes for all active groups within the network. Finally, we come full circle. By combining two group communication techniques developed earlier, namely the election of a leader and the construction of multipoint connections, we develop a totally different approach to LSR.
The resulting Tree-based LSR (T-LSR) protocol is lightweight, imposing only a small fraction of the overhead of previous LSR methods, and robust, guaranteed to survive not only network component failures and partitioning scenarios, but also undetected communication transmission errors. As we discussed earlier, properly handling the latter type of fault is a vital requirement for an LSR protocol. Unlike the ATM-oriented SAF protocols, the T-LSR protocol is designed for use in general-purpose, LSR-based networking environments and requires no special hardware support.

At first glance, the advocacy of group-communication-supported LSR operations and LSR-based group communication introduces a “chicken and egg” dilemma: which one should exist first so as to support the other? Our results show that, with careful design, this circular dependence can be avoided. The SAF and T-LSR protocols demonstrate how a multiparty communication channel can be constructed and used to improve the performance of flooding operations, which advertise the routing information (namely, LSAs) necessary for the construction and maintenance of the channel. On the other hand, the GMC protocol can take advantage of the LSR performance improvements of the T-LSR and SAF methods to enable the use of any topology computation algorithm, and hence provide support for any MC topology type. Moreover, the NLE protocol, which itself is LSR-based, finds applications in both the internal operations of LSR (such as hierarchical routing) and the support of multiparty communication applications (for instance, the management of multicast cores used by such applications). These results demonstrate the mutually beneficial relationship between LSR and group communication.

Chapter 3

Switch-Aided Flooding

In this chapter, we demonstrate an example to support the claim that some aspects of LSR operation can benefit from group communication.
Specifically, we propose a flooding method, called Switch-Aided Flooding (SAF), for use in ATM networks. SAF-based protocols take advantage of hardware-supported cell relay and cell duplication, characteristic of such networks, in order to reduce the time needed to disseminate changes in network topology and resource availability. SAF protocols use a spanning multipoint connection (SMC), which is a hardware-switched network spanning tree, but revert to conventional link-by-link flooding when the spanning MC is unavailable or under construction. Two flooding protocols based on this methodology, as well as an accompanying protocol to construct and maintain the SMC, are described in this chapter; a third SAF protocol is described in Chapter 4. The results of a simulation study reveal that the proposed flooding protocols deliver network updates several times faster than conventional approaches. Further, the bandwidth consumed by a flooding operation is also significantly reduced.

3.1 Motivation

As described in Chapter 2, ATM is a connection-oriented communication technology that relays small fixed-size cells in hardware. Many ATM switching fabrics support hardware cell duplication, whereby an incoming cell can be forwarded via multiple output ports. Although current ATM standards use this feature to support only multicast (or one-to-many) VCs, such switch functionality enables the construction of a more generic form of group communication channel, namely, many-to-many VCs, or multipoint connections (MCs). An example of an MC is depicted in Figure 3.1(a), where a set of eight switches is interconnected with a tree topology. The responsibility of each member switch is to forward cells arriving on one link of the tree to all the other tree links that are incident to that switch. As illustrated in Figure 3.1(b), cells arriving on any of the four links incident to the switch x are forwarded on the remaining three incident links.
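The per-switch forwarding rule can be condensed into a few lines of Python. This is an illustrative model only, not an ATM API; the function name and data layout are ours:

```python
# Sketch of MC cell forwarding at a member switch: a cell arriving on
# one incident tree link is duplicated onto every other incident tree
# link (cf. Figure 3.1(b)). Names here are hypothetical.

def forward_cell(tree_links, arrival_link, cell):
    """Return the list of (link, cell) transmissions for one incoming cell."""
    return [(link, cell) for link in tree_links if link != arrival_link]

# Switch x in Figure 3.1(b) has four incident tree links; a cell
# arriving on link 0 is duplicated onto the remaining three.
out = forward_cell([0, 1, 2, 3], 0, "cell")
# out == [(1, "cell"), (2, "cell"), (3, "cell")]
```

In hardware, this duplication is performed by the switching fabric itself, which is precisely what the SAF protocols exploit.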
Hardware-supported MCs facilitate multiparty communication applications, such as multimedia teleconferencing, distributed virtual reality, tele-gaming, and computer-supported cooperative work. MCs used in such applications typically involve only a small subset of the network switches. A special type of MC is the spanning MC (SMC), which includes as its members all switches in a network. A spanning MC of the network of Figure 3.1(a) is depicted in Figure 3.1(c). Since every message transmitted on an SMC is received by all switches, the SMC can be considered a virtual broadcast medium of the network.

Although hardware switching and cell duplication may greatly improve the communication performance observed by end hosts and their applications, the signaling activities within ATM networks, as defined in the UNI 3.1 [30] and PNNI [13] standards, proceed largely in a connectionless manner. Since signaling must take place prior to the existence of the corresponding VCs [30], VC-setup request messages are forwarded and processed in a hop-by-hop manner. Switches along the route of the VC under construction invoke signaling modules to perform functions related to the requested VC, such as routing and call admission control. In addition, the ATM PNNI standard specifies the use of the flooding protocol described in Chapter 2, which was originally designed for the ARPANET, a connectionless point-to-point network. Not surprisingly, the protocol proceeds in a hop-by-hop manner, and does not take advantage of the hardware capabilities of ATM switching fabrics.

We model the ATM flooding operation as a group communication problem, where an LSA is considered a multicast message delivered to a group comprising all
switches in the network.

Figure 3.1: Examples of multipoint connections: (a) an 8-node MC; (b) cell forwarding at switch x; (c) a spanning MC of the network.

The proposed SAF method uses a common group communication topology, the tree topology, to facilitate the dissemination of LSAs. Specifically, the SAF method constructs a spanning MC, which is used as a “broadcast medium” for distributing LSAs. The use of an SMC improves the performance of flooding operations by taking advantage of both hardware cell relaying and cell replication. However, such an approach must address the challenge of retaining the robustness of the conventional flooding method; that is, an LSA must reach all switches reachable from the source of the LSA. The main contribution of this chapter is to develop and evaluate two SAF-based flooding protocols, called the Basic SAF and bandwidth-efficient (BE) SAF protocols, that satisfy these criteria. In addition, an efficient protocol for the construction and maintenance of spanning MCs is presented. The results of a simulation study reveal that these two SAF-based flooding protocols can distribute messages to network switches several times faster than the conventional flooding algorithm. In the next chapter, we will develop an even more efficient SAF protocol by using a second group communication topology, the ring topology, to implement reliability.

A robust and efficient flooding protocol can lead to better routing decisions by reducing the reaction time to faulty network components and congested areas. This in turn reduces the probability of call blocking.
Furthermore, general-purpose, LSR-based MC protocols, such as the MOSPF protocol [6] and the GMC protocol (discussed in Chapter 5), must disseminate group membership and/or MC topology advertisements, and therefore can also benefit from efficient flooding protocols.

The remainder of this chapter is organized as follows. A protocol that constructs and maintains a network-wide spanning MC is presented in Section 3.2. In Section 3.3, two SAF protocols are presented. The Basic SAF protocol extends the conventional flooding algorithm to incorporate the use of an SMC. The BE SAF protocol further addresses the issue of bandwidth consumption in flooding operations. The performance of these two protocols is investigated through a simulation study, the results of which are presented in Section 3.4. A summary of this work is presented in Section 3.5.

3.2 The Spanning MC Protocol

The SMC protocol constructs and maintains an SMC for use in the SAF protocols. The protocol is a variation of the CBT protocol [3, 4], a general MC protocol in which the topology of the MC is the union of the shortest paths from the members to a specific node, called the core (see Figure 3.2). The SMC protocol differs from the CBT protocol in the way that the core node of the MC is determined. In the CBT protocol, the core node is static and is determined by an “outside” mechanism (for example, by network management procedures). In the SMC protocol, the core node is dynamic for reasons of robustness, since the SMC protocol must survive extensive network changes, including failure of the core node itself.

In the SMC protocol, the core node selection problem is modeled as a leader election problem under LSR.

Figure 3.2: An example MC built by the CBT protocol: (a) the member-to-core shortest paths; (b) the resultant MC topology.

In this approach, every switch x uses the same core node
selection algorithm to independently identify a new core of the SMC. The choice of switch x will be referred to as c_x, and the computation will be denoted by C(G, x), where G is the network image at x. For now, we use a function C(G, x) that simply sets c_x to the preferred leader at switch x; that is, we use the domain leader switch elected by the ATM PNNI as the core node of the SMC. The generalization that allows the use of any core selection algorithm C(G, x) can be achieved by using our Network-level Leader Election (NLE) protocol, which is discussed in Chapter 6. Discussion and evaluation of a variety of core selection heuristics can be found in [46, 47].

After selecting the core node locally, each switch tries to establish a connection to its choice of core node. For a switch x to reach its core selection c_x, the switch sends a reach_core request one hop towards the core, according to an x-to-c_x shortest path computed locally. The receiving switch grants the request after it has successfully reached the core itself. Using the network shown in Figure 3.1 as an example, the process of SMC construction is illustrated in Figure 3.3. Let us assume that all nodes initially select, as the core, the darkened node in Figure 3.3(a); this figure also shows the direction in which reach_core requests are sent. The core node immediately grants the reach_core requests from its neighboring switches, which subsequently approve reach_core requests from downstream switches. In this way, SMC links are granted and established in a “radiating” manner; see Figures 3.3(b) to 3.3(f).

Figure 3.3: An example of the SMC protocol.

Under the SMC protocol, each switch x in the network G executes a set of constituent protocol modules and maintains the following data structures: a local network image G_x, a core selection c_x, and an x-to-c_x path P_x.
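The “radiating” grant process described above can be modeled in a few lines. This is a toy sketch under simplifying assumptions (every switch has already agreed on the core and computed its next hop toward it); the function and variable names are hypothetical, not part of the protocol specification:

```python
# Toy model of radiating SMC construction: a reach_core request is
# granted only by a neighbor that has already reached the core itself,
# so links are established outward from the core, round by round.

def build_smc(next_hop, core):
    """next_hop[x] = first hop on x's shortest path to the core.
    Returns the set of established SMC links (as frozenset edges)."""
    reached = {core}
    links = set()
    pending = set(next_hop) - reached
    while pending:
        # Requests granted this round: requesters whose next hop
        # has itself already reached the core.
        granted = {x for x in pending if next_hop[x] in reached}
        if not granted:
            break  # remaining switches cannot currently reach the core
        for x in granted:
            links.add(frozenset((x, next_hop[x])))
        reached |= granted
        pending -= granted
    return links

# Chain 0-1-2-3 with core 0: links are granted in three radiating rounds.
links = build_smc({1: 0, 2: 1, 3: 2}, core=0)
# links == {frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3})}
```

The round-by-round loop mirrors Figures 3.3(b) through 3.3(f); in the actual protocol the rounds emerge from asynchronous request/reply exchanges rather than a central loop.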
(In the following, we may omit the subscripts when they are clear from context.) Whenever a data structure must be accessed by concurrent protocol modules, access to the data is assumed to be atomic, in order to avoid race conditions among protocol entities. Critical regions and semaphores are well-known techniques for achieving atomic access.

SMC protocol operation is triggered by the receipt of an event LSA (link-down, link-up, and so on). Periodic LSAs are ignored by the SMC protocol so that the protocol, and hence reconstruction/reorganization of the SMC, will not run unnecessarily. As shown in Figure 3.4, upon receiving an event LSA, the SMC protocol at switch x updates the local image G_x of the network. The protocol then decides whether it has to re-connect to the core node because 1) its core selection changes, or 2) the LSA ℓ reports a failed link that is used in P_x. When it is necessary to re-connect to the core, the switch x tears down the present MC link that leads to the core node and initiates an attempt to reach the core node by signaling another protocol entity, the ReachCore module. We emphasize that the maintenance of the network image G_x is included in the SMC algorithms for the purpose of self-contained discussion; in real-world contexts, G_x is most likely maintained by the underlying LSR protocol.

Algorithm: Process-Event-LSA.
Input: switch ID x, received LSA ℓ.
  Update G according to ℓ.
  IF (c_x ≠ C(G, x)) or (LinkDown(ℓ) = TRUE and Link(ℓ) in P_x)
    Let y be the next hop to c_x in P_x.
    Disconnect the tree link (x, y).
    c_x = C(G, x).
    Wake up the ReachCore module if it is sleeping.
  ENDIF

Figure 3.4: The handling of event LSAs.

The ReachCore module at switch x is started after the initialization of x, and loops indefinitely. This module is responsible for setting up an SMC link that will lead to c_x.
For this purpose, the module sends a reach_core(c_x) message one step towards the core, and continues to do so until a positive reply is received from the appropriate neighbor, indicating that the request has been granted and the desired link established. We note that the value of c_x may change during this period, because the Process-Event-LSA module may update the value upon receiving new event LSAs. After obtaining a positive reply, the ReachCore module records the new to-core path, P_x, and suspends itself.

Algorithm: ReachCore.
Input: switch ID x.
  LOOP forever
    IF (c_x ≠ x)
      LOOP /* note: c_x may have been changed by Process-Event-LSA */
        Let y be the next stop to reach c_x.
        Send a reach_core(c_x) message to y.
        Wait for a reply.
      UNTIL (a reply reached_via(P) is received).
      P_x = P.
    ENDIF
    Sleep.
  ENDLOOP

Figure 3.5: The ReachCore module.

The routine that processes a reach_core message is shown in Figure 3.6. The receipt of such a request from switch y by switch x indicates that the switch x is the first intermediate node on the path from y to c_y. The switch x grants the request if 1) it agrees with y upon the choice of core node, and 2) it has itself reached the core (this can be determined by whether the ReachCore module at x is suspended). When the request is granted, the switch x establishes the (x, y) MC link and returns a positive reply to y, which includes the y-to-c_y path used by the SMC. (The establishment of an MC link involves the setup/modification of hardware switching table entries to implement the type of cell forwarding depicted in Figure 3.1(b).) Otherwise, a negative reply is returned.

Algorithm: Process-Reach_Core.
Input: switch ID x and a reach_core(c) request from switch y.
  IF (c = c_x) and (x has reached the core node c_x)
    Set up the (x, y) MC link.
    Return a positive reply, reached_via(P_x + (x, y)), to y.
  ELSE
    Return a negative reply to y.
  ENDIF

Figure 3.6: The processing of the reach_core request message.

Cell Demultiplexing.
Because a spanning MC is effectively a broadcast medium that allows the interleaving of messages, every switch in the network can broadcast messages to, and receive messages from, all other switches. However, cells belonging to simultaneous broadcast messages can be interleaved with one another at intermediate switches. Receiving switches must be able to demultiplex these messages according to their sources. Various methods can be used to solve this problem. For example, part of the cell payload can be used to label the sources of cells. Alternatively, spanning MCs can be switched by the virtual path identifier (VPI). In ATM networks, every cell is tagged with a pair of identifiers, VPI and VCI. When the VPI of a VC is used in switching, the VCI of the cells belonging to the VC is ignored (but remains intact during transmission). In this approach, the SMC used by the SAF protocol must be constructed in such a way that the VPI is in effect throughout the MC, and as such, the VCI field can be used to identify the source switch of cells. We emphasize that the SMC protocol, and the SAF protocols as well, work with any demultiplexing scheme.

3.3 The SAF Protocols

An SAF protocol is an extension of the conventional flooding protocol. In addition to the set of point-to-point links in a network, SAF protocols presume the existence of an SMC to which all the switches in the network have access. In this section, we present two protocols designed in this manner; they differ in their implementations of reliability.

3.3.1 Basic SAF Protocol

This protocol works as follows. The source of an LSA first broadcasts the LSA on the SMC and subsequently sends the LSA via all its incident links. If a switch receives the LSA for the first time via the SMC, then it forwards the LSA on all its incident links. On the other hand, if the switch receives the LSA via a point-to-point link, then it forwards the LSA on all incident links except the one on which the LSA arrived.
As in the case of conventional flooding, switches silently drop LSAs that have been seen previously. To illustrate, the flooding example of Figure 2.11 is repeated in Figure 3.7, but this time using the Basic SAF protocol. As we can see in the figure, the operation now requires only two communication steps. In the first step, the source switch broadcasts the LSA, which is switched and duplicated in hardware on the SMC. In this manner, the constituent cells are pipelined throughout the network. Provided that the other switches receive the LSA in the first step, they exchange this LSA via point-to-point links in the second step; since every node has already seen the LSA via the SMC broadcast, all the point-to-point copies are dropped.

Figure 3.7: An example of the Basic SAF protocol: (a) step 1: hardware-switched broadcast; (b) step 2: point-to-point forwarding.

The Basic SAF protocol uses the SMC as a shortcut for LSA dissemination, but does not rely on this shortcut. In normal cases, such as the one in Figure 3.7, switches receive LSAs immediately via the SMC. However, in situations where one or more links used in the SMC are malfunctioning, or the SMC itself is under construction, or cell losses occur on the SMC, the link-by-link forwarding guarantees that the LSA reaches all nodes. Shown in Figure 3.8 is an example of how the Basic SAF protocol operates when the SMC is faulty. In this example, a link that is used in the SMC fails during a flooding operation, and the broadcast of the LSA cannot reach all switches (Figures 3.8(a) and 3.8(b)). As shown in Figures 3.8(c) and 3.8(d), the remaining switches are reached via link-by-link forwarding. In extreme cases where the SMC does not exist at all (for example, when the network is re-initialized), the Basic SAF protocol degenerates to the conventional flooding protocol.

Figure 3.8: The Basic SAF protocol with a broken SMC: (a) the broken SMC; (b) step 1: (partially failed) broadcast; (c) step 2: link-by-link forwarding; (d) step 3: link-by-link forwarding.

The Basic SAF achieves its efficiency at the price of additional bandwidth consumption.
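The forwarding decision at a receiving switch can be sketched as follows. This is an illustrative model, not the protocol's actual data structures; the function name and the LSA-identifier representation are ours:

```python
# Sketch of the Basic SAF receive rule: duplicates are silently dropped;
# an LSA first seen via the SMC is forwarded on all incident links,
# while one first seen via a point-to-point link is forwarded on all
# incident links except the arrival link.

def basic_saf_receive(seen, lsa_id, via_smc, arrival_port, ports):
    """Return the point-to-point ports on which to forward the LSA.
    `seen` is the set of LSA identifiers already processed (updated in place)."""
    if lsa_id in seen:
        return []                      # silently drop duplicates
    seen.add(lsa_id)
    if via_smc:
        return list(ports)             # first seen via the SMC
    return [p for p in ports if p != arrival_port]

# An LSA (source "s", sequence 7) arrives first via the SMC, then again
# via point-to-point link 2; the second copy is dropped.
seen = set()
basic_saf_receive(seen, ("s", 7), True, None, [1, 2, 3])    # -> [1, 2, 3]
basic_saf_receive(seen, ("s", 7), False, 2, [1, 2, 3])      # -> []
```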
Here we compare the bandwidth used by the conventional flooding protocol against that of the Basic SAF protocol. In the conventional flooding protocol, the source of an LSA sends the LSA on all its incident links, and every other node forwards the LSA on all but one of its incident links. Consider a network G = (V, E), where V is the set of switches and E the set of point-to-point links. The number of links traversed by conventional flooding is

    B_c = 1 + Σ_{v ∈ V} (Deg(v) − 1) = 1 + Deg(G) − N,

where N = |V| and Deg(G) is the sum of node degrees in G. On the other hand, the Basic SAF protocol requires

    B_basic = (N − 1) + Σ_{v ∈ V} Deg(v)

link traversals, where the first term (N − 1) is the number of links used by the broadcast on the SMC, and the second represents the forwarding of the LSA on point-to-point links.

To further clarify the relationship between B_c and B_basic, let AvgDeg denote the average node degree of G. The bandwidth consumptions of the two flooding protocols can be rewritten as

    B_c = AvgDeg × N − N + 1 ≈ (AvgDeg × N) − N, and
    B_basic = AvgDeg × N + N − 1 ≈ (AvgDeg × N) + N.

Therefore, the B_basic to B_c ratio can be approximated by

    B_basic / B_c ≈ (AvgDeg + 1) / (AvgDeg − 1).

If the network has a small average node degree, then the Basic SAF protocol may consume significantly more bandwidth than does the conventional flooding protocol; for example, with AvgDeg = 3 the ratio is approximately 2. However, due to the simplicity of this protocol and its advantage in flooding time, the Basic SAF protocol may be attractive under a variety of conditions (see Section 3.4).

3.3.2 Bandwidth-Efficient SAF Protocol

The Basic SAF protocol can be modified to reduce bandwidth consumption by introducing the concept of dummy forwarding.
In this approach, switches receiving an LSA via the SMC forward a “dummy” of the LSA, containing only the source address and sequence number, to neighboring switches. Switches that have finished the task of dummy forwarding also expect to see responses (either the real LSA or its dummy) from all neighboring switches. After waiting for a predetermined period of time, such switches forward the real LSA to neighboring switches that failed to respond. Switches receiving the LSA via point-to-point links forward the real LSA on all incident links except that of arrival, and expect nothing from neighbors.

Again we use examples to illustrate. The operation of the BE SAF protocol with a fully operational SMC is depicted in Figure 3.9. As with the Basic SAF protocol, the BE SAF protocol in this setting requires only two communication steps. In the first step, the source switch broadcasts the LSA, which is switched and duplicated in hardware on the SMC. In the second step, however, switches exchange with neighboring switches dummies of the LSA, rather than the real one.

Figure 3.9: An example of the BE SAF protocol: (a) step 1: hardware-switched broadcast; (b) step 2: point-to-point “dummy” forwarding.

Figure 3.10 illustrates the operation of the BE SAF protocol with a broken SMC. After the BE SAF protocol uses the SMC to reach as many nodes as possible (Figure 3.10(b)), switches that received the LSA via the SMC forward dummies, and in the meantime expect their neighboring switches to do the same. In this example, three nodes, namely X, Y, and Z, do not see all the expected dummies from neighboring switches; see the unidirectional dummy forwardings in Figure 3.10(c). The three nodes, after a predetermined timeout period, start forwarding the real LSA to their “silent” neighbors, as depicted in Figure 3.10(d).
Switches that receive the LSA via link-by-link forwarding further forward the LSA on all incident links except the incoming one, as shown in Figure 3.10(e).

Figure 3.10: The BE SAF protocol with a broken SMC: (a) the broken SMC; (b) step 1: (partially failed) broadcast; (c) step 2: dummy forwarding; (d) step 3: real forwarding after timeout; (e) step 4: link-by-link forwarding triggered by real forwarding.

In the discussion of the BE SAF algorithms, we denote by K_x the number of switches that are neighbors of switch x. Let us assume that the neighbors of the switch x can be reached via ports numbered 1 to K_x, and that the SMC is attached to port 0. (If necessary, LSAs received on the SMC can be identified by the VPI value of the MC.) A switch maintains three data structures: Seq[i], the sequence number of the current LSA from switch i, 1 ≤ i ≤ N; Received[i], a boolean flag indicating whether the Seq[i]-th LSA from switch i has been received; and F[i][p], a boolean flag indicating whether the switch has received via port p either the Seq[i]-th LSA from switch i or the corresponding dummy, for 1 ≤ i ≤ N and 1 ≤ p ≤ K_x. Let us denote the sequence number of an LSA ℓ by Seq(ℓ) and the address of its source switch by Source(ℓ).

The source of an LSA invokes the routine BE_SAF_Source, which is shown in Figure 3.11. Parameters to the routine include the ID x of the invoking switch and an LSA ℓ to be flooded. The switch x updates the sequence number of its current LSA to that of ℓ and clears the relevant F flags to indicate that it has not received anything about ℓ from its neighbors. The switch then broadcasts ℓ over the SMC, forwards the dummy of ℓ to all neighboring switches, and sets up a timer to await responses (for ℓ or its dummies) from its neighbors.

Algorithm: BE_SAF_Source.
Input: the switch ID x, and an LSA ℓ.
  Seq[x] = Seq(ℓ).
  Received[x] = TRUE.
  F[x][p] = FALSE, for all 1 ≤ p ≤ K.
  Transmit ℓ over the SMC.
  Forward a dummy of ℓ to all neighboring switches.
  Set up a timer(ℓ).

Figure 3.11: The sender algorithm of the BE SAF protocol.
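The essence of dummy forwarding — a dummy carries only the LSA's identity, and after the timeout the real LSA goes only to neighbors that stayed silent — can be condensed into a small sketch. The names below are hypothetical; the pseudocode in Figures 3.11 through 3.14 remains the authoritative description:

```python
# Sketch of the two key BE SAF ideas (illustrative names only).

def make_dummy(lsa):
    """A dummy carries only the source address and sequence number,
    so it is far smaller than the real LSA."""
    return (lsa["source"], lsa["seq"])

def timeout_targets(ports, responded):
    """Ports on which neither the real LSA nor its dummy arrived before
    the timer fired; the real LSA is forwarded on exactly these ports
    (cf. the F flags checked by the timeout handler)."""
    return [p for p in ports if not responded.get(p, False)]

# A switch with four neighbor ports heard responses on ports 1 and 3
# only; after the timeout it forwards the real LSA on ports 2 and 4.
make_dummy({"source": "x", "seq": 7})          # -> ("x", 7)
timeout_targets([1, 2, 3, 4], {1: True, 3: True})   # -> [2, 4]
```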
Switches that receive an LSA ℓ invoke the routine BE_SAF_Receive, which is shown in Figure 3.12. The routine first decides whether it is dealing with a new LSA by checking ℓ's sequence number against the current sequence number recorded locally. If a new LSA is observed, the corresponding F flags are cleared to indicate that nothing has yet been learned about this LSA from neighbors. The switch then checks whether the LSA arrived via the SMC or a point-to-point link. If the LSA arrived on the SMC, then its dummy is forwarded to all neighboring switches. Otherwise, the LSA itself is forwarded on all point-to-point links except the one on which it arrived. In the cases where dummies are forwarded, a timer is set up to make sure that the switch x hears responses from its neighboring switches; the timeout handler is discussed later.

When a switch x receives the dummy of an LSA ℓ, it invokes the BE_SAF_Receive_Dummy routine, shown in Figure 3.13. The receipt of a dummy from a neighboring switch y assures x that y has already received the LSA and, therefore, that LSA forwarding to y is unnecessary. This situation is recorded in the corresponding F flags. As in the case of the BE_SAF_Receive routine, a check is made to determine whether this is (the dummy of) a new LSA. If so, then the corresponding Seq entry is updated and the relevant F flags are reset, as in the previous routine.

A switch x that receives an LSA ℓ from the SMC forwards the dummy, rather than the real LSA, to its neighboring switches. It also expects responses (ℓ itself or its dummy) from its neighbors. Lack of a response from a neighboring switch results in the forwarding of the real ℓ to that switch. The timeout handler shown

Algorithm: BE_SAF_Receive.
Input: the switch ID x, and an LSA ℓ received from port p.
  y = Source(ℓ).
  IF (Seq(ℓ) > Seq[y])
    Seq[y] = Seq(ℓ), and Received[y] = FALSE.
    F[y][q] = FALSE, for all 1 ≤ q ≤ K.
  ENDIF
  IF (Received[y] = TRUE)
    Do nothing.
    /* drop LSAs that have been seen before */
  ELSE /* this is the first time this LSA has been received */
    Received[y] = TRUE.
    IF (p = 0) /* received from the SMC */
      Forward a dummy of ℓ to all neighboring switches.
      Set up a timer(ℓ).
    ELSE /* received from a point-to-point link */
      Forward ℓ on all point-to-point ports, except p.
    ENDIF
  ENDIF

Figure 3.12: The receive-LSA routine in the BE SAF protocol.

Algorithm: BE_SAF_Receive_Dummy.
Input: the switch ID x, and a dummy d received via port p.
  y = Source(d).
  IF (Seq(d) > Seq[y])
    Seq[y] = Seq(d), and Received[y] = FALSE.
    F[y][q] = FALSE, for all 1 ≤ q ≤ K.
  ENDIF
  F[y][p] = TRUE.

Figure 3.13: The receive-dummy routine in the BE SAF protocol.

in Figure 3.14 checks the F flags to decide, for each neighboring switch individually, whether the forwarding of the real ℓ is necessary.

Algorithm: BE_SAF_Timeout_Handler.
Input: the switch ID x and a timer(ℓ).
  IF (Seq(ℓ) = Seq[Source(ℓ)])
    /* This timer is for the current LSA of the source of ℓ.
       Out-of-date timers are ignored. */
    FOR (port number p = 1 to K) DO
      IF (F[Source(ℓ)][p] = FALSE) forward ℓ on port p. ENDIF
    ENDFOR
  ENDIF

Figure 3.14: The timeout handler in the BE SAF protocol.

3.4 Performance Evaluation

The performance of the three alternative flooding methods (conventional flooding, Basic SAF, and BE SAF) is studied through simulation. The simulator is based on the CSIM package [48]. We are interested in both temporal and bandwidth metrics. Given a flooding operation and a switch x, the temporal metrics of the flooding operation include the time for x to receive an LSA, called the receipt time of x, and the time for a flooding operation to complete at x, called the completion time of x. (Completion time includes the handling of duplicate LSAs and dummies.) In this study, we measured average receipt/completion times over all network switches. Confidence intervals were computed, but for most cases they are very small and, for clarity, are not shown in the plots.
The bandwidth consumption of a flooding operation can be measured by the number of links traversed by the LSA and, in the case of the BE SAF protocol, by its dummy. The former number is denoted by B_f^l and the latter by B_f^d. Given the length l_a of an LSA and the length l_d of its dummy, the bandwidth consumed by a flooding operation is B_f^l × l_a + B_f^d × l_d, where f ∈ {conventional, Basic SAF, BE SAF} is the flooding method. In this study, we obtained the B_f^l and B_f^d values through simulation runs, and we used the Fore SPANS NNI specification, in which an n link-description LSA comprises 4 + 28 × n bytes [49], to determine the l_a and l_d values.

Networks comprising up to 256 switches were simulated; 20 graphs were generated randomly for each network size. Table 3.1 shows the characteristics of the graphs generated. In the table, a parenthesized entry represents the (minimum, average, maximum) triple of the corresponding metric. For example, in the case of the 20 4-node graphs, the minimum degree of a node, across all the graphs, was 1.0; the average minimum degree among the graphs was 1.35; and the largest minimum degree among the graphs was 2.0.

Size | min degree   | max degree      | avg degree | diameter
   4 | (1, 1.35, 2) | (2, 2.70, 3)    | 1.98       | (2, 2.200, 3)
   8 | (1, 1.10, 2) | (3, 3.95, 5)    | 2.46       | (3, 3.850, 6)
  16 | (1, 1.00, 1) | (4, 5.25, 7)    | 2.83       | (4, 5.550, 9)
  32 | (1, 1.00, 1) | (5, 7.05, 9)    | 3.34       | (5, 6.700, 12)
  64 | (1, 1.10, 2) | (6, 9.40, 12)   | 4.42       | (4, 6.150, 11)
 128 | (1, 1.50, 3) | (9, 13.95, 19)  | 6.75       | (4, 5.400, 9)
 256 | (1, 3.60, 7) | (11, 21.30, 29) | 11.14      | (4, 4.800, 8)

Table 3.1: Characteristics of the randomly generated graphs.

Each communication operation, such as message forwarding, incurs ATM protocol overhead. We measured these overheads on the ATM testbed in our laboratory. The testbed comprises Sun SPARC-10 workstations equipped with Fore SBA-200 adapters and connected by three Fore ASX-100 switches. From these measurements, we obtained the figure 600 µsec, which includes the overhead at both the sending and receiving switches.
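The bandwidth accounting above can be sketched numerically. Note that the 48-byte cell payload and the 12-byte dummy size below are our assumptions for illustration (the thesis derives l_a and l_d from the SPANS NNI specification, and AAL framing overhead is ignored); only the 4 + 28n LSA size is taken from the text:

```python
import math

CELL_PAYLOAD = 48   # assumed usable bytes per ATM cell; AAL framing ignored

def lsa_bytes(n_links):
    """Fore SPANS NNI: an LSA with n link descriptions is 4 + 28*n bytes [49]."""
    return 4 + 28 * n_links

def cells(nbytes):
    """Number of cells needed to carry nbytes of payload."""
    return math.ceil(nbytes / CELL_PAYLOAD)

def flooding_cost(b_lsa, b_dummy, n_links=10, dummy_bytes=12):
    """Cells sent by one flooding operation: B^l * l_a + B^d * l_d,
    with both lengths expressed in cells. dummy_bytes is an assumed
    size for a (source, sequence number) dummy."""
    return b_lsa * cells(lsa_bytes(n_links)) + b_dummy * cells(dummy_bytes)

# With 10 link descriptions, an LSA is 284 bytes, i.e., 6 cells, while a
# dummy fits in a single cell -- the source of the BE SAF savings.
lsa_bytes(10)            # -> 284
cells(lsa_bytes(10))     # -> 6
```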
The final simulation parameter is the duration of the timer used by the BE SAF protocol in awaiting responses from neighboring switches. In the simulations, we set the timer according to the average degree of the given network graph G. Specifically, we set the timer length to AvgDeg(G) × a, where a is the time needed to forward an LSA via a point-to-point link.

Experiment 1: Ideal cases. By ideal, we refer to situations in which the SMC is completely operational during flooding operations. Such is the case for periodic flooding during “normal” periods, when all network components function properly, and during the flooding of event LSAs when none of the faulty network components affect the operation of the SMC. The simulation results pertaining to this setting are plotted in Figure 3.15. As shown in Figures 3.15(a) and (b), the two SAF protocols offer a significant advantage in both LSA receipt time and flooding completion time, due to their use of the SMC as a “short cut” broadcast medium.

Figure 3.15: Comparisons of flooding alternatives with a correctly functioning SMC: (a) average receipt time (µsec); (b) average completion time (µsec); (c) number of links traversed; (d) bandwidth consumption.
Figure 3.15(c) plots the number of links traversed by a real LSA or its dummy for the three protocols. As predicted by earlier analysis, the Basic SAF protocol consumes more bandwidth than does the conventional flooding algorithm; neither method uses dummies, but the latter has smaller B_a values. With the use of dummy forwarding, the B_a values of the BE SAF protocol are significantly less than those of the conventional protocol, especially when the network is large. The actual bandwidth savings of the BE SAF protocol depend on the ℓ_a to ℓ_d ratio. Figure 3.15(d) plots the number of cells per link incurred by the three alternatives, assuming the use of Fore SPANS NNI and the inclusion of 10 link-state descriptions in the LSA.² Somewhat surprisingly, the BE SAF protocol consumes more bandwidth than do the other alternatives when the network is small (for example, containing fewer than 8 switches). A closer examination of simulation runs reveals that this phenomenon is due to the small average degrees in small networks, leading to premature timer firings, followed by unnecessary LSA forwardings. This problem can be fixed by introducing longer timeout periods for small networks. For larger networks, the bandwidth savings of the BE SAF protocol are substantial.

Experiment 2: Partitioned spanning MC. Next, we investigate the performance of the flooding algorithms when a (random) link used in the SMC fails, partitioning the MC into two segments. In this case, the two SAF protocols still use the MC to reach as many nodes as possible, and resort to link-by-link forwarding to reach the remaining nodes. The results are presented in Figure 3.16. As expected, the average receipt times for the two SAF protocols are larger than those of the previous experiment. However, they are still significantly smaller than those of the conventional flooding protocol.
Provided that the component failure rate of the network is low, the occurrences of simultaneous failures should be rare. We suspect that the results of this experiment, which represent single-failure situations, combined with those of the previous experiment, which represent zero-failure situations, cover a very large fraction of network flooding situations. Interestingly, the bandwidth consumption problem with the Basic SAF protocol diminishes with a partitioned SMC, as shown in Figure 3.16(d). This is because LSA broadcasts do not traverse all links of the partitioned SMC. On the other hand, the BE SAF protocol in this situation must forward real LSAs, rather than dummies, in order to reach switches that are not covered by the SMC. Hence, the protocol consumes more bandwidth than it does in the previous experiment.

²An LSA must account for the links to neighboring switches as well as the links to hosts. Considering that most popular ATM switches can accommodate at least 16 ports, and some even 96 ports, we believe that 10 incident links per switch may be a relatively conservative representative figure.

[Figure 3.16: Comparisons of flooding alternatives with partitioned SMC. Panels: (a) average receipt time (μsec); (b) average completion time (μsec); (c) number of links traversed, distinguishing real and dummy LSAs for BE SAF; (d) bandwidth consumption.]
However, its bandwidth consumption is still lower than that of the other alternatives.

Experiment 3: Non-existent spanning MC. Let us now consider the worst-case setting for the SAF protocols: when the SMC does not exist at all. This situation may happen after network re-initialization and prior to reconstruction of the SMC, and can also be considered as a worst-case situation with respect to multiple link failures that partition the SMC. As we can see in Figure 3.17, the conventional flooding protocol outperforms the two SAF protocols in receipt time and completion time. The time differences between the conventional flooding protocol and the Basic SAF protocol are marginal, but those between the conventional flooding protocol and the BE SAF protocol are much more significant. The BE SAF protocol suffers in this experiment because each switch has to perform two rounds of forwarding to neighbors, one for dummies and one for real LSAs. In this experiment, the three flooding alternatives consume essentially the same amount of network bandwidth, because the two SAF protocols, like the conventional flooding protocol, use only link-by-link forwarding when the SMC does not exist.
[Figure 3.17: Comparisons of flooding alternatives when SMC does not exist. Panels: (a) average receipt time (μsec); (b) average completion time (μsec); (c) number of links traversed, distinguishing real and dummy LSAs for BE SAF; (d) bandwidth consumption.]

Experiment 4: Performance of the SMC protocol. We also studied the performance of the SMC protocol under two scenarios: reorganization of an existing SMC when an SMC link fails, and construction of a new SMC. Corresponding simulation results, along with confidence intervals, are plotted in Figure 3.18. The results show that a partitioned SMC requires less than 2.5 milliseconds to reorganize, while constructing an SMC from scratch (a relatively rare event) requires less than 12 ms.

[Figure 3.18: Performance of the SMC protocol. Panels: (a) time to reorganize a partitioned SMC (μsec); (b) time to construct a new SMC (μsec); each panel plots the mean with the upper and lower bounds of the 95% confidence interval.]

Interpretation of results. Among the three flooding alternatives, the BE SAF protocol experiences the most variation in performance across the three experiments. The ideal setting for the protocol occurs when there are no event LSAs and network switches simply flood status information periodically. The LSAs in this setting tend to be long because periodic flooding must include descriptions for all incident links, including those connected to hosts. In Experiment 1, the BE SAF protocol is fast and consumes significantly less bandwidth than do the other two alternatives. Long LSAs also favor the protocol, as they increase the effectiveness of dummy forwarding.
Besides the bandwidth benefit, the use of dummy forwarding might also reduce ATM protocol overhead during link-by-link forwarding; researchers have reported that the ATM protocol overhead of one-cell packets (such as LSA dummies) can be dramatically reduced if these packets are treated as a special case [50]. We conclude that the BE SAF protocol is the best choice for periodic flooding during normal network operation. On the other hand, the worst-case behavior of the BE SAF protocol is the worst among the three flooding alternatives. However, the adverse scenarios considered in Experiments 2 and 3 are likely to stem from emergency events, such as component failures, whose LSAs are typically short. Given these results, we may conclude that a good heuristic would be to use the BE SAF protocol for periodic flooding and for advertising fluctuations in resource availability (for example, changes in the residual bandwidth of a link), but to invoke the Basic SAF protocol for the dissemination of network component failures.

3.5 Summary

In this chapter, we have proposed two switch-aided flooding protocols and an accompanying protocol to construct spanning MCs. The protocols are designed to exploit ATM hardware cell switching and cell duplication. SAF protocols use the SMC as a broadcast medium to reduce flooding time. However, the protocols do not rely entirely on the SMC, but rather revert to point-to-point message forwarding if the SMC is damaged or under construction. Two SAF protocols were described: the Basic SAF protocol and the BE SAF protocol. Under normal operating conditions, both protocols deliver network updates several times faster than the conventional flooding algorithm. The Basic SAF protocol is a relatively simple extension of the conventional flooding protocol and should be straightforward to implement.
Our simulation study shows that the difference in the bandwidth consumed by the Basic SAF protocol and the conventional flooding is significant for small networks, but is only marginal for large networks. The advantage of the Basic SAF protocol over the BE SAF protocol is its stability in performance under adverse circumstances, for example, when the SMC is partitioned or under construction. We also note that the bandwidth consumption of this protocol may be even smaller when flooding event LSAs, due to their short lengths; under the Fore SPANS implementation, event LSAs are one-cell packets.

The BE SAF protocol addresses the bandwidth consumption issue by introducing dummy LSA forwarding. The bandwidth savings of this method are particularly significant when the network size is large or when the LSAs are long. The performance of the BE SAF protocol is more sensitive to adverse network circumstances, however. As a simple heuristic, an "adaptive" network management system could use the BE SAF protocol for periodic flooding operations (whose corresponding LSAs are typically sufficiently long to benefit from the use of dummy forwarding), but switch to the Basic SAF protocol in the presence of emergency events, such as link failures.

The results in this chapter support the theme of this dissertation: the mutually beneficial relationship between LSR and group communication. We have demonstrated that group communication techniques help improve the performance of LSR. Specifically, we have used a spanning tree to improve the performance of flooding operations. In the next chapter, we will push farther in this direction, introducing another type of topology that has been used in host-level group communication, the ring topology, to further improve flooding performance. We will show that the combination of a spanning tree and a ring produces an optimal flooding method for use by ATM networks.
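The "adaptive" heuristic suggested in the summary can be phrased as a small dispatch function (a sketch of our own; the kind labels and function name are hypothetical):

```python
def choose_saf_protocol(lsa_kind):
    """Pick a flooding protocol by LSA kind: BE SAF for long
    periodic/utilization LSAs (dummy forwarding pays off), Basic SAF
    for short emergency LSAs and as a conservative default."""
    if lsa_kind in ("periodic", "utilization"):
        return "BE SAF"
    if lsa_kind in ("link-down", "switch-down"):
        return "Basic SAF"
    return "Basic SAF"
```

The rationale follows the experiments above: emergency LSAs are short and occur precisely when the SMC may be damaged, which is where the Basic SAF protocol is the more stable choice.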
Chapter 4

Optimal SAF Operations

In the previous chapter, we improved the performance of flooding operations in ATM networks by constructing a tree topology, a common technique for supporting group communication. Another type of topology that has been used in group communication is the ring topology, which connects the members of a group in a circular manner. Host-level applications of a "group ring" include barrier synchronization [51], leader election [52], reliable multicast [53], and maintaining consistent message-receipt orderings among group members [53]. In this chapter, we construct a ring that uses ATM VCs to connect all switches in an ATM network for use as the acknowledgment topology in flooding operations. Switches, after receiving an LSA from the SMC, exchange acknowledgments or dummy LSAs only with the neighboring switches defined by the ring, as opposed to all the neighboring switches defined by the physical topology. The resultant flooding protocol, called Efficient Reliable (ER) SAF, is optimal in terms of complexity, for it requires only O(1) time in both LSA receipt and flooding completion, and incurs only O(|V|) bandwidth for both LSA delivery and reliability implementation.

4.1 Motivation

In the previous chapter, we developed two SAF methods, namely the Basic SAF and BE SAF protocols. These two SAF protocols outperform the conventional flooding algorithm by using a hardware-based spanning tree, the SMC, to speed up the dissemination of LSAs. We note that the (remaining) overheads of the two SAF protocols stem from the requirement to guarantee the delivery of any LSA to all network nodes that are reachable from the originating node. In general, both the conventional flooding method and previous SAF protocols achieve reliability by means of a "neighbor watching" principle: every node, after receiving an LSA, makes sure that all its neighboring nodes have also received the LSA.
In the conventional flooding protocol, the principle is implemented by reliably forwarding an incoming LSA to all neighbors, except the one from which the LSA arrives. In the Basic SAF and BE SAF protocols, the principle is implemented by exchanging acknowledgments or dummy LSAs with all neighboring switches. The communication of every switch with all neighboring switches inevitably consumes O(|E|) bandwidth and requires O(D_G) time to complete a flooding operation, where D_G is the maximum node degree in the given network topology G.

To avoid the overheads that are associated with reliability, one could use the SMC for best-effort flooding and ignore the reliability issue altogether. In this method, which we refer to as the Unreliable SAF protocol, the source node of an LSA broadcasts the LSA on the SMC, but makes no effort to ensure receipt of the LSA by other switches. The speed and bandwidth complexities of the four flooding protocols discussed so far (the conventional, Basic SAF, BE SAF, and Unreliable SAF protocols) are compared in Table 4.1, where dia(G) denotes the diameter of network G. In the table, we distinguish two bandwidth metrics: delivery bandwidth refers to the number of links that an LSA has to traverse, and reliability bandwidth refers to the number of acknowledgments/dummies produced. As we can see in the table, the three SAF protocols are more efficient than the conventional flooding protocol. The Unreliable SAF protocol is the most efficient, of course, since it does not include acknowledgments: it exhibits constant complexities in both time metrics and consumes O(|V|) bandwidth.

Of course, we would like to use the most efficient flooding protocol available. One method to use the Unreliable SAF protocol is to distinguish two types of network status: topology status and utilization status. As discussed earlier, the topology status

Table 4.1: Complexities of various flooding protocols.
Flooding Method        | Receipt   | Completion        | Delivery | Reliability | Total
Conventional           | O(dia(G)) | O(dia(G) + deg_G) | O(|E|)   | O(|E|)      | O(|E|)
Basic SAF              | O(1)      | O(deg_G)          | O(|E|)   | O(|E|)      | O(|E|)
BE SAF                 | O(1)      | O(deg_G)          | O(|V|)   | O(|E|)      | O(|E|)
Unreliable SAF         | O(1)      | O(1)              | O(|V|)   | 0           | O(|V|)
ER SAF (this chapter)  | O(1)      | O(1)              | O(|V|)   | O(|V|)      | O(|V|)

(The Receipt and Completion columns are time complexities; the Delivery, Reliability, and Total columns are bandwidth complexities.)

of a network component (a switch or a communication link) refers to the operational state of the component; the present topology of the network is determined by the set of currently operational switches and links. The topology of a network can be expected to be relatively static, assuming that reliable components are used to construct the network. The utilization status reflects the availability of network resources. For example, the utilization status of a link includes the bandwidth in use, the delay over the link experienced by recent cells, the cell loss rate, and so forth. In ATM networks, utilization status can be very dynamic, as network resources are allocated and released when VCs are set up and torn down. As such, utilization status LSAs are expected to constitute the majority of flooding operations.

It has been argued [54, 55] that, while changes in topology status (such as the failures of network components) must be flooded reliably, dynamics in utilization status could use unreliable, or best-effort, flooding methods. This is because inaccurate resource utilization information would not lead to disastrous situations, but merely result in sub-optimal routing decisions. Moreover, since the utilization of network resources may change at a high rate, one should be concerned with the efficiency of disseminating such changes. It follows that the Unreliable SAF protocol best fits this purpose. We agree that efficiency is a major concern in the flooding of utilization status LSAs. However, in this chapter we will demonstrate that a reliable SAF protocol can be complexity-wise as efficient as the unreliable SAF protocol.
Furthermore, we contend that there are cases where the reliability of resource utilization flooding is important. Let us consider a switch x that has been overloaded by heavy traffic. According to ATM PNNI, at least one LSA indicating the utilization change will be flooded throughout the network so that other switches can avoid using switch x in future VCs. Should switch x advertise the congestion situation unreliably, some switches may not receive the corresponding LSA and thus will continue using the switch in new VCs, further exacerbating the congestion situation. Moreover, it is exactly when a switch is congested that it will most likely drop cells, including the ones pertaining to the utilization status LSAs that disseminate the congestion situation. The information about the congestion at x may not leave x at all, and the problem feeds on itself as new VCs make the congestion situation worse.

In this chapter, we continue the SAF work by developing a reliable SAF protocol that is more efficient than the Basic and BE SAF protocols. The Efficient Reliable (ER) SAF protocol constructs a second topology, a virtual ring, to provide reliability. As shown in Table 4.1, the new protocol exhibits speed and bandwidth complexities identical to those of the unreliable SAF protocol. Further, it retains the reliability of the conventional flooding protocol; that is, an LSA will be delivered to all switches that are reachable from the originating switch. Since a flooding protocol must deliver a given LSA at least once to every such switch, both the O(1) time complexities and the O(|V|) bandwidth complexities of the ER SAF protocol are optimal.

The remainder of this chapter is organized as follows. In Section 4.2, we describe the ER SAF protocol, including the use of the virtual ring for reliability and the issues that arise when decoupling construction/maintenance of the ring from on-going flooding operations.
Details of the ER SAF algorithms are provided in Section 4.3. In Section 4.4, we discuss the methods used to construct and maintain the virtual ring. While the ER SAF protocol achieves optimal complexities, its expected performance under real network conditions is of interest. In Section 4.5, we investigate through simulation the behavior of the ER SAF protocol both in "normal" situations and under adverse circumstances, where network component failures affect the operation of the SMC and/or the virtual ring. The results of our simulation reveal that the ER SAF protocol delivers network updates several times faster than conventional approaches in normal situations, and twice as fast in the presence of component failures. A summary of our SAF work is given in Section 4.6.

4.2 ER SAF Protocol Design

In this section, we describe the design issues and basic concepts of the ER SAF protocol. In the discussion, we assume that the network topology G = (V, E) is a connected graph, since our concern here is to efficiently flood LSAs to "reachable" nodes. To generalize our discussion to partitioned networks, we can simply apply the argument to each segment.

4.2.1 Basic Concept

The ER SAF protocol uses the hardware-based SMC to achieve constant LSA delivery time. However, it adopts a different approach to reliability than previous SAF protocols. Instead of implementing the neighbor watching principle over the physical network topology G, the ER SAF protocol constructs a virtual topology R = (V, E_R) to implement reliability. The topology R is a ring that visits all nodes in G exactly once. The topology is virtual because neighboring nodes in R are not necessarily adjacent in the physical network topology G. Rather, they are connected by ATM VCs that may traverse one or more intermediate nodes.
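To make the notion of a virtual ring concrete, the following sketch builds one with a nearest-neighbor heuristic (this construction is our own illustration, not the method of Section 4.4): hop distances come from BFS over the physical topology, so the total hop length of all ring VCs, |R|, can be compared against C × |V|.

```python
from collections import deque

def hop_dist(adj, src):
    """BFS hop counts from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def build_ring(adj, start):
    """Order all nodes into a ring, always hopping to the nearest
    unvisited node (a simple traveling-salesman-style heuristic)."""
    ring, seen = [start], {start}
    while len(ring) < len(adj):
        d = hop_dist(adj, ring[-1])
        nxt = min((v for v in adj if v not in seen), key=lambda v: d[v])
        ring.append(nxt)
        seen.add(nxt)
    return ring

def ring_length(adj, ring):
    """|R|: total hops of all ring VCs, including the closing VC."""
    return sum(hop_dist(adj, u)[ring[(i + 1) % len(ring)]]
               for i, u in enumerate(ring))
```

On a 4-node path graph, for example, the heuristic yields the ring 1-2-3-4-1 with |R| = 6 hops, i.e., C = 1.5.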
Specifically, each node x in G is connected to its predecessor in R, denoted as Pred(x), and to its successor in R, denoted as Succ(x), by VCs RVC_pred(x) and RVC_succ(x), respectively. (RVC stands for Ring VC.) We defer to Section 4.4 the discussion of the construction and maintenance of ring R. At this point, we merely emphasize that the ER SAF protocol must be able to work properly when the ring R is under construction or involved in maintenance operations.

In the ER SAF protocol, the neighbor watching principle is implemented as follows. Any node x, after receiving an LSA from the SMC, exchanges acknowledgments of the LSA with Pred(x) and Succ(x), rather than with all its neighboring nodes defined by the physical topology G; we will refer to acknowledgments sent via ring VCs as r_acks. If every node x ∈ G receives r_acks for a given LSA from Pred(x) and Succ(x), then the flooding operation is completed. Let us use an example to illustrate the operation of the ER SAF protocol in "normal" cases, where the topology of the network is stable and both the SMC and virtual ring R are fully operational. This example assumes the network and SMC topologies shown in Figure 3.1(c). Figure 4.1(a) depicts a virtual ring R connecting switches in the (alphabetic) order A, B, C, ..., M. We point out that some ring VCs, such as the H-I VC, traverse one or more intermediate nodes. Assuming that the SMC broadcast of an LSA ℓ successfully reaches all switches, as shown in Figure 4.1(b), the ensuing neighbor watching activities are depicted in Figure 4.1(c), where each node exchanges r_acks of ℓ with its succeeding and preceding nodes in R. The flooding operation is completed when every node receives two acknowledgments of ℓ.
[Figure 4.1: ER SAF flooding in normal cases. Panels: (a) a virtual ring R for the network; (b) a successful SMC broadcast; (c) exchange of acknowledgments in R.]

In ER SAF operations under normal conditions, nodes require O(1) time to receive the LSA, and must process O(1) r_acks. Hence, the per-switch workload (that is, the completion time metric) is of constant complexity. O(|V|) acknowledgments will be produced; the total number of links traversed by r_acks depends on the total length of the ring VCs, denoted as |R|. Various existing heuristics for the traveling salesman problem produce cycles where |R| < C × |V| and C is a constant [56]. Using such a heuristic in the construction of the ring, the bandwidth consumed by reliability activities is of complexity O(|V|) (our simulation results presented in Section 4.5 show that C is typically less than 1.5). Because the number of links that an LSA traverses in normal cases is exactly the number of SMC links, the bandwidth consumed by LSA delivery also exhibits complexity O(|V|). Thus, the total bandwidth consumption exhibits complexity O(|V|).

4.2.2 Operation Modes

In addition to the normal situations described above, the ER SAF protocol must handle more difficult scenarios where the SMC broadcast of the LSA does not reach all nodes, where cells pertaining to r_acks are lost, where ring VCs are damaged by network component failures, or where arbitrary combinations of these events occur. If an LSA is being flooded under such adverse circumstances, then there may exist a node x that possesses the LSA after the SMC broadcast but does not receive the r_ack of the LSA from a node y ∈ {Pred(x), Succ(x)}. (If no such node x exists, then the flooding is completed.) In this case, node x can retransmit the LSA to y using the corresponding ring VC, and repeat such retransmissions until y returns an r_ack.
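The retransmission behavior just described can be illustrated with a toy model (entirely our own simplification; real r_acks and retransmissions are asynchronous per-VC messages, not synchronized rounds):

```python
def r_mode_round(ring, has_lsa):
    """One round of ring retransmission: every node holding the LSA
    pushes it to both ring neighbors. Returns the new holder set and
    whether the flooding is now complete."""
    n = len(ring)
    new_has = set(has_lsa)
    for i, node in enumerate(ring):
        if node in has_lsa:
            new_has.add(ring[(i - 1) % n])
            new_has.add(ring[(i + 1) % n])
    return new_has, len(new_has) == n
```

When only one isolated node misses the SMC broadcast, a single round completes the flooding; but when many consecutive nodes miss it, coverage grows by only one node per direction per round, which is exactly the sequential degeneration addressed later by the two-three rule.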
For a given LSA, when the ER SAF protocol uses the virtual ring R for acknowledgments/retransmissions, we say that it is operating in the R mode.

Adverse network status changes and the R-mode operation create a cyclic dependency: in mode R, adverse network changes that damage the ring R can impede their own advertisements, while the repair of the ring R requires up-to-date network topology information contained in such advertisements. To avoid this dilemma, the ER SAF protocol has a second operation mode, called the G mode, that is used when the ring R is damaged or under construction (the letter G indicates the use of the physical topology G for reliability). When operating in the G mode, the ER SAF protocol is identical to the Basic SAF protocol: a node receiving the LSA on the SMC subsequently exchanges copies of the LSA (and acknowledgments) with each of its physical neighbors.

The ER SAF protocol needs a method to decide which mode to use for a given LSA. In general, the R mode, due to its efficiency, should be used whenever it can ensure reliability, that is, when the ring R is operational; otherwise, the G mode should be used. The source node a of an LSA uses its "local" status of R, that is, the operational status of RVC_succ(a) and RVC_pred(a), to determine the mode to use for the LSA. If both RVCs are operational, then the source node initiates the flooding operation in mode R. Otherwise, it starts the flooding in mode G. Of course, it is possible that the source of an LSA initiates a flooding operation in the R mode while there are link-down events that damage ring R and that have not yet been learned of by the source. In such circumstances, some node(s) other than the source must change the operation mode during the course of the flooding. In ER SAF, the flooding of a given LSA can change from the R mode to the G mode, but the reverse is not allowed.
Consider a scenario where a switch a is flooding a utilization status LSA ℓ using mode R, while in the meantime a link used by RVC(x, y) but not by the SMC has failed. Let us assume that the SMC broadcast of ℓ successfully arrives at all nodes in G. Although both x and y receive ℓ from the SMC, the two nodes cannot receive r_acks of ℓ from one another. Both nodes will try to retransmit ℓ to each other, but such retransmissions have no chance to succeed either. The R-mode flooding operation is bound to fail in this situation. Instead, node x, after realizing the problem with RVC(x, y), must switch to mode G, initiating a Basic SAF operation of ℓ on behalf of switch a. (Node x could learn of the problem via the link-down LSA produced by the endpoints of the faulty link, or when retransmissions fail a predetermined number of times.) In this manner, we are assured that ℓ will reach all network nodes while the ring R is under repair.

Even when the ring R is fully operational, there are cases where the R mode is unacceptably inefficient. Consider the example shown in Figure 4.2, where switch G, which is a leaf in the SMC, is advertising the failure of the (G, I) link, which is used by the SMC, but not by the virtual ring R (we assume the ring topology depicted in Figure 4.1(a)). In this situation, the SMC cannot deliver this link-down LSA to any node at all. After failing to receive the corresponding r_acks from Pred(G)=F and Succ(G)=H, switch G retransmits the LSA to the two nodes over ring VCs. Nodes F and H will also notice the lack of r_acks from E and I, respectively, and attempt to retransmit. The result is that the LSA traverses the ring R in a sequential, store-and-forward manner, as depicted in Figure 4.2(b). In general, retransmissions over the ring R degenerate into a sequential procedure whenever multiple nodes, consecutive in R, fail to receive the SMC-switched copy of an LSA.
To avoid this performance problem, we introduce a two-three rule as a mode-switching heuristic: whenever any two consecutive nodes in R do not receive a given LSA from the SMC, the ER SAF operation, with respect to that LSA, will switch to mode G. This rule can be formally stated as follows.

Two-Three Rule. With respect to a given LSA ℓ, the two-three rule is satisfied at a node x if any two of the three nodes x, Succ(x), and Pred(x) do not receive ℓ from the spanning MC. Precisely, any node x that is currently in mode R with respect to ℓ switches to mode G if either one of the following conditions is satisfied.

C1. Node x does not receive an r_ack of ℓ from either Succ(x) or Pred(x) after waiting for a predetermined length of time since the receipt of the SMC-relayed copy of ℓ.

C2. The first time x receives the LSA is from one of its ring neighbors (indicating that x itself has missed the SMC copy), but x has not received the r_ack from the other ring neighbor y (indicating that y may not have received the LSA either).

4.3 Algorithms

In this section, we present the algorithms used by the ER SAF protocol. We use the notation N_R(x) to denote the set {Succ(x), Pred(x)} and the notation N_G(x) to denote the set of neighboring nodes defined by the network topology G. We assume that the nodes in N_G(x) can be reached via ports numbered 1 to K_x, where K_x = |N_G(x)|. We further assume that LSAs are tagged with a sequence number and source switch ID: an LSA from node i with sequence number j will be denoted as LSA(j, i), and its corresponding acknowledgments will be denoted as either r_ack(j, i) or g_ack(j, i), depending on the operation mode of the LSA.

[Figure 4.2: A hypothetical scenario where LSA retransmissions over R degenerate into a bidirectional store-and-forward process. Panels: (a) the link (H, J) fails; (b) sequential store-and-forward in R.]
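The Two-Three Rule above can be sketched as a boolean predicate evaluated at a node x for one LSA (the parameter names are ours, not the dissertation's; first_copy_from is "succ" or "pred" when the first copy arrived over a ring VC, or None when it arrived on the SMC):

```python
def must_switch_to_G(timer_expired, ack_from_succ, ack_from_pred,
                     first_copy_from=None):
    if first_copy_from is None:
        # C1: x holds the SMC copy, but after the timeout at least one
        # ring neighbor has not returned an r_ack.
        return timer_expired and not (ack_from_succ and ack_from_pred)
    # C2: x missed the SMC copy (first copy came over a ring VC), and
    # the *other* ring neighbor has not acknowledged either -- so two
    # consecutive nodes appear to lack the SMC copy.
    other_acked = ack_from_pred if first_copy_from == "succ" else ack_from_succ
    return not other_acked
```

In both branches the predicate fires exactly when two of the three nodes x, Succ(x), and Pred(x) appear not to have received the SMC copy, matching the informal statement of the rule.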
Each switch maintains the following data structures: Seq[i], the sequence number of the current LSA from switch i, 1 ≤ i ≤ |V|; Mode[i] ∈ {R, G}, the operation mode for the flooding of LSA(Seq[i], i); Fsucc[i] and Fpred[i], two boolean flags indicating whether the switch x has received r_ack(Seq[i], i)/LSA(Seq[i], i) from nodes Succ(x) and Pred(x), respectively; and F[i][p], a boolean flag indicating whether the switch has received via port p the g_ack(Seq[i], i)/LSA(Seq[i], i), for 1 ≤ i ≤ |V| and 1 ≤ p ≤ K_x. Let us denote the sequence number of an LSA ℓ by Seq(ℓ) and the address of its source switch by Source(ℓ). Moreover, every LSA ℓ has a mode bit, denoted by Mode(ℓ), whose value can be either R (for mode R) or G (for mode G). The mode bit of an LSA may change value during the course of the corresponding flooding operation, and copies of the same LSA may be in different modes. We also assume that, when a switch x receives a copy of ℓ, it can discover the sender of this copy, denoted as Sender(ℓ). The sender information can be determined by the port/RVC/SMC on which ℓ arrives.

The source of an LSA invokes the routine FloodLSA, which is shown in Figure 4.3. Parameters to the routine include the ID x of the invoking switch and an LSA ℓ to be flooded. The switch x updates the sequence number of its current LSA to that of ℓ and clears the relevant Fsucc, Fpred, and F flags to indicate that it has not received anything about ℓ from its neighbors. Having finished these bookkeeping tasks, switch x next decides the operation mode of ℓ. The R mode is used if the incident ring VCs, RVC_succ(x) and RVC_pred(x), are operational and if at least one incident SMC link is operational. (The use of mode R even when some incident SMC links are malfunctioning encourages the use of fragments of the SMC to disseminate an LSA to as many nodes as possible.) In this case, switch x sends the LSA on the SMC with the mode bit set to R, and sends r_acks of the LSA to the nodes in N_R(x) via ring VCs.
Otherwise, the G mode is used; that is, switch x sends the LSA on the SMC with the mode bit set to G, and forwards the LSA to the nodes in NG(x) via physical links. The SMC broadcast may be skipped in mode G, however, if all SMC links incident to x have failed. In either mode, a timer is set up to wait for responses from the set of neighboring nodes determined by the chosen operation mode.

Algorithm: FloodLSA. Input: the switch ID x, and an LSA ℓ.
  Seq[x] = Seq(ℓ).
  F[x][p] = FALSE, for all 1 ≤ p ≤ K_x.
  Fsucc[x] = Fpred[x] = FALSE.
  IF (either RVCsucc(x) or RVCpred(x) is damaged) or (all SMC links incident to x are malfunctioning)
    Mode(ℓ) = Mode[x] = G.
  ENDIF
  IF (Mode[x] = R)
    Broadcast ℓ on the SMC.
    Send r_ack(Seq(ℓ), x) to the two nodes in NR(x) via ring VCs.
    Set up an r_timer(ℓ).
  ELSE /* Mode[x] = G */
    IF (at least one incident SMC link is operational)
      Broadcast ℓ on the SMC.
    ENDIF
    Forward ℓ to all nodes in NG(x) via physical links.
    Set up a g_timer(ℓ).
  ENDIF

Figure 4.3: The sender algorithm of the ER SAF protocol.

Switches that receive an LSA invoke the routine ReceiveLSA, which is shown in Figure 4.4. Parameter x indicates the ID of the invoking switch, and parameter ℓ is the LSA received. The routine first decides whether it is dealing with a new LSA by checking ℓ's sequence number against the current sequence number recorded locally. If a new LSA is observed, the corresponding Fsucc, Fpred, and F flags are cleared to indicate that nothing has yet been learned about this LSA from neighbors, and the network image at x is updated according to ℓ. Subsequent processing depends on the mode of ℓ. The processing of ℓ when Mode(ℓ) = G follows the Basic SAF protocol: if ℓ arrives on the SMC, then it is forwarded on all ports; otherwise, ℓ arrives on a port p and is forwarded on all the other ports. The processing of ℓ when Mode(ℓ) = R is somewhat more complicated. First, the switch needs to decide whether to change to the G mode.
If ℓ arrives on the SMC, its mode is changed when any incident RVC of x is damaged. If ℓ arrives on a ring VC, its mode is changed when condition C2 of the Two-Three Rule is satisfied. If mode switching does occur, ℓ is forwarded on all ports p, 1 ≤ p ≤ K_x. Otherwise, the r_ack of ℓ is sent to Pred(x) and Succ(x) if ℓ arrived on the SMC, or to Sender(ℓ) if ℓ arrived on a ring VC. This concludes the processing of ℓ when it is received for the first time. When switch x receives subsequent copies of ℓ, it discards such copies, unless the current mode at x with respect to ℓ is the R mode and the arriving copy is in mode G, forcing x to switch to the G mode and to forward ℓ along all incident links. We point out, again, that sending an acknowledgment is necessary even for duplicate copies of ℓ. Lastly, if ℓ did not arrive on the SMC, switch x must remember that it has received ℓ from its neighboring node Sender(ℓ).

Algorithm: ReceiveLSA. Input: the switch ID x, and an LSA ℓ.
  a = Source(ℓ).
  IF (Seq(ℓ) > Seq[a]) /* This is the first copy. */
    Seq[a] = Seq(ℓ), Fsucc[a] = Fpred[a] = FALSE, and F[a][q] = FALSE, for all 1 ≤ q ≤ K_x.
    Update the local network image at x according to ℓ.
    IF (Mode(ℓ) = R)
      IF (ℓ is received from the SMC)
        IF (both RVCsucc(x) and RVCpred(x) are operational)
          Send r_ack(Seq(ℓ), a) via the two RVCs, and set up an r_timer(ℓ).
        ELSE
          Change Mode(ℓ) to G, forward ℓ on port p, 1 ≤ p ≤ K_x, and set up a g_timer(ℓ).
        ENDIF
      ELSE /* ℓ is received from a ring VC v; check condition C2 of the Two-Three Rule */
        IF (Sender(ℓ) = Pred(x) and Fsucc[a] = FALSE) or (Sender(ℓ) = Succ(x) and Fpred[a] = FALSE)
          Change Mode(ℓ) to G, forward ℓ on port p, 1 ≤ p ≤ K_x, and set up a g_timer(ℓ).
        ELSE
          Send r_ack(Seq(ℓ), a) to Sender(ℓ) via RVC v.
        ENDIF
      ENDIF
    ELSE /* Mode(ℓ) = G */
      IF (ℓ is received from the SMC)
        Forward ℓ on port p, 1 ≤ p ≤ K_x, and set up a g_timer(ℓ).
      ELSE /* ℓ is received from port p */
        Forward ℓ on all ports, except p, and set up a g_timer(ℓ).
        Send g_ack(Seq(ℓ), a) to Sender(ℓ) via port p.
      ENDIF
    ENDIF
    Mode[a] = Mode(ℓ).
  ELSE /* This is an extra copy. */
    IF (Mode[a] = R) but (Mode(ℓ) = G)
      Change Mode[a] to G, forward ℓ on port p, 1 ≤ p ≤ K_x, and set up a g_timer(ℓ).
    ELSE
      Return an r_ack or g_ack to Sender(ℓ), depending on Mode(ℓ).
    ENDIF
  ENDIF
  IF (ℓ is received from a ring VC)
    Set Fpred[a] or Fsucc[a] to TRUE, depending on Sender(ℓ).
  ELSE IF (ℓ is received from port p)
    F[a][p] = TRUE.
  ENDIF /* No flag to set if ℓ is received from the SMC. */

Figure 4.4: The ReceiveLSA routine in the ER SAF protocol.

When a switch x receives the acknowledgment of an LSA ℓ, it invokes the ReceiveACK routine, shown in Figure 4.5. The purpose of the routine is straightforward: the receipt of an acknowledgment from a switch y assures x that y has already received the LSA. This situation is recorded in the corresponding Fsucc, Fpred, or F flags, according to the port/RVC on which the acknowledgment arrives.

Algorithm: ReceiveACK. Input: the switch ID x, and an acknowledgment d.
  a = Source(d).
  IF (d is received from Pred(x))
    Fpred[a] = TRUE.
  ELSE IF (d is received from Succ(x))
    Fsucc[a] = TRUE.
  ELSE /* d is received from port p */
    F[a][p] = TRUE.
  ENDIF

Figure 4.5: The ReceiveACK routine in the ER SAF protocol.

When the timer associated with an LSA ℓ fires at switch x, the switch invokes the TimeoutHandler routine, shown in Figure 4.6. The timer, however, may be ignored for two reasons: it was set up for an LSA with an obsolete sequence number, or the timer is an r_timer for an LSA whose operation mode at x has been changed to mode G since the setup of the timer. Subsequent processing, if required, depends on the type of the timer. For a g_timer, the routine forwards the associated LSA ℓ to the ports whose corresponding F flags have not been set to TRUE.
The processing of an r_timer is more complicated, however, as we have to decide whether to switch to mode G. The mode is changed when local RVCs are found to be damaged, or when condition C1 of the Two-Three Rule is satisfied. If the mode needs to be changed, then the LSA is forwarded on all ports p, 1 ≤ p ≤ K_x. Otherwise, the LSA is forwarded (or, more precisely in this case, retransmitted) to a node in NR(x) whose corresponding flag is FALSE (at most one such retransmission will be performed; otherwise the Two-Three Rule would have been satisfied). Lastly, a timer in the appropriate mode is set up to wait for responses from the neighboring nodes to which this LSA has been forwarded/retransmitted.

Algorithm: TimeoutHandler. Input: the switch ID x and a timer(ℓ).
  a = Source(ℓ).
  IF (Seq(ℓ) < Seq[a]) or (timer(ℓ) is an r_timer but Mode[a] = G)
    Return.
  ENDIF
  IF (timer(ℓ) is a g_timer)
    For 1 ≤ p ≤ K_x, forward ℓ on port p if (F[a][p] = FALSE).
    Set up a new g_timer(ℓ).
  ELSE /* timer(ℓ) is an r_timer */
    IF (either RVCsucc(x) or RVCpred(x) is damaged) or
       /* The next condition is C1 of the two-three rule. */
       (Fsucc[a] = FALSE and Fpred[a] = FALSE)
      /* Switch to the G mode. */
      Mode[a] = Mode(ℓ) = G.
      Forward ℓ via all incident ports.
      Set up a new g_timer(ℓ).
    ELSE /* Retransmit ℓ via a ring VC. */
      IF (Fsucc[a] = FALSE) THEN forward ℓ via RVCsucc(x).
      IF (Fpred[a] = FALSE) THEN forward ℓ via RVCpred(x).
      Set up a new r_timer(ℓ).
    ENDIF
  ENDIF

Figure 4.6: The timeout handler in the ER SAF protocol.

4.4 The Virtual Ring

In this section, we discuss the construction and maintenance of the virtual ring R. For this topic we must explicitly consider the handling of network partitioning, and hence we drop the assumption that the network G is connected. While a flooding operation is concerned only with delivering the LSA to all nodes reachable from the source node, the ring construction procedures of the ER SAF protocol must construct
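The r_timer branch of the handler can be sketched as follows; the function returns the action the switch takes for one pending LSA, and all names are illustrative rather than taken from the protocol.

```python
def on_r_timer(acked_succ: bool, acked_pred: bool,
               rvc_succ_ok: bool = True, rvc_pred_ok: bool = True) -> str:
    """Sketch of r_timer expiry: switch to mode G when an incident RVC
    is damaged or condition C1 holds (no r_ack from either ring
    neighbor); otherwise retransmit to the at most one neighbor that
    has not acknowledged."""
    if not (rvc_succ_ok and rvc_pred_ok) or (not acked_succ and not acked_pred):
        return "switch-to-mode-G"       # forward on all ports, start g_timer
    if not acked_succ:
        return "retransmit-to-Succ(x)"  # restart r_timer afterwards
    if not acked_pred:
        return "retransmit-to-Pred(x)"
    return "done"                       # both neighbors have acknowledged

print(on_r_timer(acked_succ=True, acked_pred=False))  # -> retransmit-to-Pred(x)
```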
a ring within each network segment during partitioning periods, and re-construct a new ring when two or more segments merge into one.

Within each segment, the construction of the virtual ring R comprises three steps. First, the leader switch in the segment, elected by the ATM leader election protocol, computes an ordering of the switches that are reachable from the leader according to the local network image of the leader. Second, the leader switch advertises the ordering using the Basic SAF protocol; such an LSA is called a switch ordering LSA. Third, every switch establishes a ring VC to its successor as defined by the ordering.

If we define the criterion of the ordering computation to be the minimum total length of RVCs, then the switch ordering problem becomes the well-known traveling salesman problem [56]. Since the problem is NP-complete, we use the following heuristic [56]: the leader computes a depth-first search tree that spans all reachable switches in its local network image, and uses the ordering determined by the pre-order traversal of the tree.

When a switch x receives a switch ordering LSA, it sets up a VC to the succeeding switch defined by the ordering, following the procedures described in the ATM UNI 3.1 standard [30]. It also accepts a VC-setup request from switch y, where y is the predecessor of x in that ordering. The paths of the two ring VCs are recorded so that, if subsequent link-down LSAs are received, switch x can detect damage to incident ring VCs.

The maintenance phase of the virtual ring is divided into two levels: repair and reconstruction. When a switch x learns of damage to its RVCsucc(x) from a link-down LSA, it shall try to establish a new VC to Succ(x). If this task succeeds, the ring is repaired and no further action is needed. Otherwise, the network has been partitioned.
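The ordering heuristic can be sketched as a pre-order depth-first traversal; the adjacency-list representation below is a hypothetical stand-in for the leader's local network image.

```python
def ring_ordering(adj, leader):
    """Compute a switch ordering by pre-order traversal of a DFS tree
    rooted at the leader (the traveling-salesman heuristic described
    above). `adj` maps each switch ID to its neighbor list."""
    order, seen = [], set()

    def dfs(u):
        seen.add(u)
        order.append(u)          # pre-order: record u before its subtree
        for v in adj[u]:
            if v not in seen:
                dfs(v)

    dfs(leader)
    return order                 # each switch sets up an RVC to its successor;
                                 # the last switch wraps around to the leader

# Example: a small graph; the ring is the pre-order sequence.
adj = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
print(ring_ordering(adj, "A"))   # -> ['A', 'B', 'D', 'C']
```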
Under such circumstances, a leader will be elected within each segment, and will subsequently compute and flood an ordering of the switches within the segment. Switches within each segment then follow the new ordering to establish new ring VCs, resulting in a new ring within the segment. Another situation requiring ring reconstruction occurs when network segments re-unify with each other (possibly because malfunctioning network components have recovered). As in the previous case, a leader election will take place, and the new leader will compute and flood a switch ordering of the merged segment. In general, the leader of a network/segment monitors the set of reachable switches defined by its local network image. When a membership change occurs to this "reachable set," or when the leader is newly elected, a (new) ordering of the switches in the set is computed and flooded, resulting in the (re)construction of the virtual ring.

4.5 Performance Evaluation

We studied the performance of the ER SAF protocol through simulation. The simulator is based on the CSIM package [48]. Confidence intervals were computed, but in most cases they are very small and, for clarity, are not shown in the plots.

Networks comprising up to 400 switches were simulated. Since each switch is likely to be attached to several hosts, such networks may include thousands of hosts. For each network size, 40 graphs were generated randomly, and 100 simulation runs were performed on each graph. Each run used a randomly selected core node for SMC construction and a randomly selected flooding source. The core node selected for a simulation session is also used as the root of the depth-first search tree, which determines the ring topology. Table 4.2 shows the characteristics of the graphs generated. These random graphs exhibit average node degrees conforming to those observed in some subnetworks of the Internet [57].
            degree                 diameter
  size   min.   avg.   max.    min.   avg.   max.
    10   1.95   3.87   5.80     2     2.950   4
    20   1.25   3.86   7.28     3     4.475   6
    40   1.05   4.02   7.95     5     5.700   8
    60   1.02   3.92   8.65     5     6.475   9
    80   1.00   4.00   9.38     6     7.000   8
   100   1.00   4.10   9.45     6     6.975   8
   120   1.00   4.18  10.22     6     7.450   9
   140   1.00   4.23  10.60     6     7.475  10
   160   1.00   4.30  10.20     7     7.525   9
   180   1.00   4.39  10.68     6     7.625   9
   200   1.00   4.43  11.18     7     7.750   9
   250   1.00   4.57  11.38     7     7.700   9
   300   1.00   4.72  12.05     7     7.950  10
   350   1.00   4.88  12.12     7     7.775   9
   400   1.00   5.08  12.65     7     7.625   9

Table 4.2: Characteristics of randomly generated graphs.

Each message transmission in a flooding operation incurs ATM protocol overhead. As in the simulation studies in the previous chapter, we measured these overheads on the ATM testbed in our laboratory. The testbed comprises Sun SPARC-10 workstations equipped with Fore SBA-200 adapters and connected by three Fore ASX-200 switches. From these measurements, we obtained the figure of 600 μsec, which includes the overhead at both the sending and receiving switches. The per-hop hardware switching delay was found to be 12 μsec.

Experiment 1: Normal cases. By normal, we refer to situations where the network topology is stable, and the SMC and the virtual ring R are operational. Such circumstances typically occur for the flooding of utilization status information and for periodic flooding. The simulation results pertaining to this setting are plotted in Figure 4.7. For the two time metrics, receipt time and completion time, we show both the average and worst-case results. As we can see in Figure 4.7(a), the ER SAF protocol delivers LSAs several times faster than does the conventional flooding protocol. This is especially true for large networks. When the network size is larger than or equal to 100, the average receipt time of the ER SAF protocol is less than one-fifth of that of the conventional flooding, and the worst-case time is less than one-eighth of that of the conventional flooding.
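For intuition about these results, a back-of-the-envelope model built from the two measured figures above (600 μsec of software protocol overhead per store-and-forward hop, 12 μsec of hardware switching per hop) already predicts a large gap. This simple model is our own illustration, not the analysis performed in the dissertation.

```python
SW_OVERHEAD_US = 600  # measured software overhead (send + receive) per hop
HW_SWITCH_US = 12     # measured per-hop hardware switching delay

def conventional_receipt_us(hops: int) -> int:
    """Conventional flooding: every hop pays the software overhead."""
    return hops * SW_OVERHEAD_US

def smc_receipt_us(hops: int) -> int:
    """SMC broadcast: software overhead paid once (at the source and
    the receiver), hardware switching at every hop along the way."""
    return SW_OVERHEAD_US + hops * HW_SWITCH_US

# For a node 7 hops from the source (a typical diameter in Table 4.2):
print(conventional_receipt_us(7), smc_receipt_us(7))  # -> 4200 684
```

Even this crude model reproduces the roughly fivefold-or-better advantage of hardware broadcast over store-and-forward flooding seen in the plots.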
Similarly, the completion time and LSA bandwidth consumption of the ER SAF protocol are only a small fraction of their counterparts in the conventional flooding (see Figures 4.7(b) and (c), respectively). Furthermore, the constant time complexities of the ER SAF protocol are clearly demonstrated by the flat curves.

In Figure 4.7(d), we plot the results regarding the reliability mechanisms of the two flooding protocols. For the conventional flooding protocol, we computed the average number of acknowledgments produced by the 4000 flooding operations for each graph size. For the ER SAF protocol, we computed both the average number of r_acks and the total number of links traversed by these r_acks. The latter metric is of interest in the ER SAF protocol because some ring VCs may involve more than one physical link. As shown in the figure, the average number of r_acks is less than the number of acknowledgments produced by the conventional flooding protocol, although the difference is not as dramatic as in the previous figures. We also note that the number of links traversed by r_acks is typically within 150% of the number of r_acks, suggesting that the average length of ring VCs is less than 1.5.

Figure 4.7: Comparisons of flooding alternatives with an operational SMC and virtual ring. (a) receipt time (μsec); (b) completion time (μsec); (c) LSA delivery bandwidth; (d) reliability bandwidth.

Experiment 2: Flooding of link-down events. Next, we studied the performance of the ER SAF protocol when used to disseminate network component failures, namely link-down events. For each graph, we randomly select and remove a link whose removal will not disconnect the graph, and select one of the endpoints of the link to advertise the event. The remaining parts of the network are assumed to be stable. Under the assumption that network components have a long MTBF (mean time between failures), it is unlikely that a second component will fail during the flooding of the first failure (though, of course, the ER SAF protocol can handle this situation). We expect flooding performance in single-failure scenarios to be representative of the flooding of topology status LSAs.

The corresponding simulation results are plotted in Figure 4.8. As we can see in Figures 4.8(a) and (b), the ER SAF protocol is still much faster than conventional flooding. In most cases of the receipt times, the ER SAF protocol is more than twice as fast as its competitor. We note that a link-down event does not necessarily force the ER SAF protocol to use the G mode. In our simulation, approximately 60% of the link-down events result in the use of the G mode. The performance of the remaining 40% should resemble that of the normal cases.
To investigate the performance of the ER SAF protocol in the G mode, we extracted those samples that use mode G and plotted the respective results in Figure 4.8(c). As shown, the average receipt times of LSAs flooded in mode G are only slightly higher than the overall average. This is because the ER SAF protocol also uses the (possibly fragmented) SMC to disseminate LSAs in mode G. After the SMC broadcast, all nodes that receive the LSA start forwarding the LSA via point-to-point links at nearly the same time, resulting in speedier point-to-point forwarding when compared to the conventional flooding protocol. With the presence of G-mode operations, the bandwidth consumption of the two flooding alternatives is approximately the same, as shown in Figure 4.8(d).

Figure 4.8: Comparisons of flooding alternatives in the performance of flooding link-down events. (a) receipt time (μsec); (b) completion time (μsec); (c) ER SAF receipt time in different modes (μsec); (d) bandwidth (the number of links traversed).

Experiment 3: Ring construction. Lastly, we investigated the time to construct a ring. As discussed earlier, the construction process comprises three phases: First, the leader switch computes the switch ordering using the depth-first search tree heuristic. Second, the leader switch broadcasts the ordering using the conventional flooding protocol. Third, every switch establishes a ring VC to its successor defined in the ordering. Since the first phase is simply a linear-time algorithm performed locally at the leader, we ignored this phase in the simulation. The average and worst-case ring construction times under this assumption are plotted in Figure 4.9. As we can see, the virtual ring R can be constructed within 22 milliseconds even for relatively large networks.

4.6 Summary

We have described an efficient reliable (ER) SAF protocol. The ER SAF protocol constructs an SMC to broadcast network status updates in hardware and uses a virtual ring topology to minimize reliability overhead. In normal cases, where the network topology is stable and the SMC and the virtual ring are operational, this protocol is optimal in terms of flooding time and bandwidth consumption. When network component failures affect the operation of the SMC or the ring, the ER SAF protocol resorts to a more conservative flooding method, namely the basic SAF protocol. Our simulation results reveal that the ER SAF protocol is several times faster than the conventional flooding protocol in normal cases, and is still twice as fast as the conventional flooding under adverse circumstances. The use of a ring topology in the ER SAF protocol serves as yet another example of how traditional
group communication techniques can be used to improve the performance of LSR.

Figure 4.9: The average/worst-case time to build a virtual ring (time in ms versus network size).

Chapter 5

A Generic Method of MC Construction

In the previous two chapters, we used group communication techniques, namely the construction of a multiparty communication channel, to improve the performance of LSR in ATM networks. In this chapter, we shift our attention to the other direction of the mutually beneficial relationship between group communication and LSR, that is, how the robustness of LSR and the complete topology information made available by LSR can help develop novel and efficient group communication solutions for use by multiparty communication applications. Specifically, we propose a protocol for the construction and maintenance of multipoint connections (MCs). A distinguishing feature of the protocol is its generality: the proposed solution can incorporate any MC topology computation algorithm, and hence can be used with MCs of different topology types or performance criteria, a requirement stemming from the diversity of multiparty communication applications. The protocol is based on LSR: information regarding multipoint connections is broadcast to network switches, which perform all MC topology computations locally. The protocol is free from routing loops, even transient ones, and tolerates any combination of link/node failures, including those that partition the network for a period of time. The correctness of the protocol, which is modeled as a consensus problem in a distributed system, is established by formal proofs. Results of a simulation study show that the generality of the protocol can be achieved with negligible to moderate signaling overhead.
5.1 Motivation

As described in Chapter 2, the applications that use multiparty communication vary widely and include teleconferencing, computer-supported cooperative work, distributed interactive simulation, remote teaching, tele-gaming, replicated file servers, parallel database search, and distributed parallel processing. Such applications have widely disparate needs with respect to network services. Among such services is the MC protocol itself, which defines the set of rules and conventions by which MCs are constructed and maintained, and which is executed among processing entities within a communications network.

We have discussed in Chapter 2 three major MC topology types, namely SSTs, SRTs, and ROSTs. Even for a fixed topology type, different topology computation algorithms could be used, depending on the relative importance of various performance criteria, such as bounds on transmission delays, network resource consumption, multicast packet loss rate, and so forth. Many existing multicast protocols can be considered distributed implementations of one, or a small set of, MC topology computation algorithms. For example, the MOSPF and DVMRP protocols implement distributed source-rooted tree algorithms that minimize transmission delays, whereas the CBT protocol implements a particular shared-tree algorithm first described in [8]. However, the emerging demand for routing based on quality-of-service (QoS) [58] has stimulated the development of many other MC topology computation algorithms that may be better suited for certain classes of applications [10, 34, 35, 59, 60]. Many such algorithms have not been incorporated into current multicast protocols.
The diversity of MC topology types demanded by multiparty communication applications, and the wide variety of MC topology computation algorithms designed for different performance criteria, give rise to the following question: Is it possible to develop an MC protocol "chassis," that is, a framework that is able to accommodate multiple existing, and future, MC topology algorithms? The ongoing development of new service models (available bit rate, controlled load, quality-of-service, and so on) further emphasizes the need for such a generic MC protocol. The main contribution of this chapter is to demonstrate that such a challenging goal is achievable in networks based on LSR.

In this chapter, we propose an LSR-based MC protocol, called the generic MC (GMC) protocol, which can be used as a distributed implementation of "any" MC topology algorithm. The GMC protocol extends LSR by including information about MCs in the network images maintained at switches. MC topology and membership information are broadcast throughout the network by means of extended LSAs. We emphasize that the GMC protocol is intended to construct and maintain MCs among switches, rather than hosts. As discussed in Section 2.2.2, a host in a network is attached to one or more switches, called the ingress switches of the host, and uses a local membership management protocol, such as IGMP [7], to inform its respective ingress switch of MC membership status. When one or more attached hosts of a switch are interested in an MC, the switch is said to be a member switch of the MC. Figure 5.1 shows the same MC as depicted in Figure 2.1(b), complete with member hosts.

Figure 5.1: Example MC showing member switches and attached hosts. (Legend: connection link; connection member switch; intermediate switch; connection member host; other host.)

The primary task of the GMC protocol is to keep MC images consistent and up-to-date, while incurring minimum protocol overhead.
Since both network topology information and MC image information are available at all switches, any method of computing MC topologies can be used. Indeed, the topology computation algorithm is a "plug-in" component of GMC, rather than an inherent part of the protocol. In addition to its ability to support multiple MC types and topology algorithms, the GMC protocol exhibits the following properties:

1. The protocol is free from routing loops. Due to delays in the dissemination of changes in network status, the participating switches in an MC protocol may have inconsistent knowledge of the network for short periods of time. Many multicast protocols produce transient routing loops under such circumstances [5, 6, 4, 2]. Routing loops, even temporary ones, may introduce network congestion under conditions of heavy traffic. As we will demonstrate later, the GMC protocol avoids routing loops entirely.

2. The protocol is robust. Being a link-state routing protocol, the GMC protocol has the intrinsic advantage of fault tolerance. The protocol handles faulty components in the network through topology computations that are triggered by link/nodal events. In fact, the protocol survives network partitioning, and is able to construct correct MC topologies after re-unification. Further, we will show that the GMC protocol can survive memory overflow problems at switches: given an MC, the protocol will be able to construct the MC as long as at least one member of the MC does not purge the image of the MC indefinitely.

3. The protocol exhibits a low level of computational redundancy compared to existing LSR-based MC solutions. As we will discuss in Section 5.2, LSR-based MC solutions, such as the MOSPF protocol, may perform identical topology computations at multiple switches, incurring the problem of computational redundancy.
Since an MC topology computation is typically a non-trivial task (for example, many of the Steiner tree heuristics are of O(N^2) or O(N^3) complexity), the GMC protocol is designed to minimize the number of topology computations. We will show via a simulation study that the performance of GMC in terms of computational overhead compares favorably with that of MOSPF.

Given the availability of network status information at all network switches, LSR provides a solid foundation for developing an MC protocol chassis such as GMC. While the idea of LSR-based MC protocols is not new, previous solutions have not achieved the versatility that is dictated by the diversity of multiparty communication applications. This chapter is a "proof of concept" that, at least in LSR-based networks, such a generic MC protocol can be constructed and can operate efficiently.

The remainder of this chapter is organized as follows. Section 5.2 discusses various background subjects as well as related work. Section 5.3 describes the design and operation of the GMC protocol. We prove in Section 5.4 the correctness of the GMC protocol. Section 5.5 presents the results of a simulation study, in which the behavior of the GMC protocol is evaluated under various workloads. A summary of this chapter is given in Section 5.6.

5.2 LSR-Based Multipoint Connections

As described in Chapter 2, switching elements in LSR-based networks use LSAs to advertise local status information. This approach can be extended to support multiparty communication by distributing MC membership information in LSAs [1]. That is, whenever a switch wants to join or leave an MC, a membership-event advertisement for the connection is flooded through the network. Such an advertisement should contain at least the ID of the connection and the address of the source switch. Switches in the network collect these advertisements and maintain member lists and MC topology information for all active MCs.
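The member-list maintenance just described can be sketched as follows; the data structures and names are illustrative, not taken from any particular protocol specification.

```python
def apply_membership_event(member_lists: dict, mc_id: str,
                           switch_id: str, joining: bool) -> None:
    """Sketch of how a switch processes a flooded membership-event
    advertisement, which carries at least the connection ID and the
    address of the source switch; every switch keeps one member list
    per active MC."""
    members = member_lists.setdefault(mc_id, set())
    if joining:
        members.add(switch_id)
    else:
        members.discard(switch_id)
        if not members:
            del member_lists[mc_id]  # drop local state for an empty MC

lists = {}
apply_membership_event(lists, "MC-1", "A", joining=True)
apply_membership_event(lists, "MC-1", "B", joining=True)
apply_membership_event(lists, "MC-1", "A", joining=False)
print(lists)  # -> {'MC-1': {'B'}}
```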
The differences among LSR-based MC protocols lie in how topology computations are triggered. The MOSPF protocol is an extension of the unicast OSPF protocol [11]. In the MOSPF protocol, the addresses of the hosts listening to a multicast address are broadcast in group-membership LSAs, and routers maintain complete member lists for all active multicast addresses. However, a router does not compute the topology of a multicast connection until it actually receives a datagram destined for the corresponding multicast address. Upon receiving a datagram for multicast address M, the router consults its local database for the member list of M and computes a shortest-path tree, rooted at the source of the datagram, that reaches all hosts listening to M. The router then saves this topology information in a routing cache and forwards the datagram along the appropriate outgoing links. This forwarding triggers further topology computations at downstream routers. Moreover, multicast routing entries created in this process must be cleared upon the arrival of LSAs that advertise membership or network changes, resulting in the repetition of the process when subsequent multicast datagrams arrive.

The MOSPF approach (on-demand, data-driven topology computations) is well suited to the construction of source-rooted trees. However, this method has limitations in other contexts. First, if the MC is an ROST, it is independent of the nodes sending to that MC; its topology computations cannot be triggered by packets from senders, but rather depend on the actions of receivers. Second, MOSPF performs identical topology computations at all members and intermediate nodes of an MC. This computational redundancy could produce heavy workloads at switches, given the high cost of topology computations. (MOSPF uses Dijkstra's shortest-path algorithm, which exhibits complexity O(N^2), where N is the number of switches in a network.)
Third, the MOSPF protocol requires the availability of MC membership information at a router to compute the topology of an MC. Losses of such information (for example, due to memory overflows) could lead to improper protocol operation. To summarize, while the MOSPF protocol serves some multicast scenarios well, it may not possess the efficiency and flexibility to accommodate many current and future distributed applications.

5.3 The GMC Protocol

The GMC protocol extends LSR by incorporating MC images at network switches. The essence of the protocol is to maintain consistency of MC images throughout the network. In the presence of group membership dynamics or changes to network topology, an MC protocol must update the MC topologies that are affected by those events. The GMC protocol uses an event-driven approach to this problem: the switches that detect events are required to compute and advertise new MC topologies. For example, a switch that detects a "link-down" event suggests alternative topologies for any MCs that were using the malfunctioning link. Similarly, a switch that changes its membership status with respect to an MC implicitly "detects" a membership change event, and suggests a new topology for that MC. Other switches in the network receive the topology proposal and, if it is accepted, modify MC links and/or routing entries accordingly.

5.3.1 Design Issues

A major problem that the GMC protocol must solve is the proposal of multiple, inconsistent MC topologies by switches that detect different events at nearly the same time. An example of this problem is illustrated in Figure 5.2. The example begins with the network and MC configuration depicted in Figure 5.2(a), where nodes A, B, and C are members of an MC. Let us assume that switches D and E request to join the MC at approximately the same time.
Without knowledge of each other's intentions, switch D sees member list (A, B, C, D) and proposes a topology spanning those four nodes, while switch E sees member list (A, B, C, E) and proposes a different topology (see Figure 5.2(b)). If updates to routing table entries are not handled properly, the two inconsistent proposals could result in the routing loop shown in Figure 5.2(c).

In the GMC protocol, the inconsistent-proposal problem can be detected by embedding membership information in MC topology proposals. In the above scenario, for example, if node D notices that it is absent from E's topology proposal, and node E notices a similar flaw in the proposal from D, then both of them will subsequently compute new (and correct) proposals. We will show in the formal presentation of GMC that, while this example concerns membership inconsistency problems, the same method is used to cope with inconsistency problems created by simultaneous network topology status changes.

During a busy period when multiple events take place concurrently, multiple proposals may be suggested and flooded through the network. Although some of these proposals are more up-to-date than others, the underlying flooding mechanism has no such knowledge and may deliver proposals in any order.

Figure 5.2: Problem created by inconsistent topology proposals. (b) Switches D and E request to join, and propose inconsistent topologies; (c) potential routing loop.

Consider the example shown in Figure 5.3, which continues the scenario in Figure 5.2. Here, we assume that switch F also requests to join the MC, after receiving the proposals from D and E. Figure 5.3(a) depicts the MC topology suggested by F. As shown, this proposal contains up-to-date membership information and should override proposals from other switches.
It is possible, however, for switch A to receive F's proposal before receiving the earlier ones (perhaps because A ignored the proposal advertisements from D and E due to a lack of buffer space, but later recovers and is able to accept the proposal from F). The proposals from D and E will eventually arrive at A by means of retransmission, incorrectly overriding the up-to-date MC image already established at A (see Figure 5.3(b)). The GMC protocol uses the well-known timestamp technique [61] to resolve this proposal ordering issue.

Figure 5.3: The topology ordering problem. (a) An up-to-date proposal from F; (b) a hypothetical configuration at A. If F's proposal is received before those of D and E, these obsolete proposals will override the up-to-date MC image at A.

Another desirable property of MC protocols is freedom from temporary routing loops. In the previous example, even if inconsistent proposals are detected and eventually resolved, any routing loop, however transient, can quickly lead to traffic congestion if heavy traffic loads are placed on the MC during that period. In the GMC protocol, a topology proposal is uniquely identified by its source switch ID and its timestamp value. This stamp-source pair serves as the ID of an MC topology. To prevent loops, the two switches at the ends of an MC link exchange the IDs of their local MC images before establishing the link as part of the MC. Only if the two IDs are identical will the MC link be established. Using this check, MC links, such as those in the loop of Figure 5.2(c), cannot all be permitted, because somewhere in the loop two adjacent nodes must have different MC topologies, and hence, different topology IDs.

5.3.2 Protocol Overview

With the above design issues in mind, the operation of the GMC protocol can be summarized as follows.

- Every switch x maintains a timestamp R_{x,m} for every active MC m.
The value of this timestamp is set to the largest timestamp value among the received LSAs relating to m. The switch will ignore topology proposals about the MC m with stamp values less than or equal to R_{x,m}.

- Every switch x maintains a mailbox for every active MC. The mailbox stores received, but not yet processed, LSAs that are relevant to the MC.

- When the switch detects a local event that affects the MC m (for example, the switch changes its membership status with respect to m or detects failure of an incident link that is used by m), the switch creates and floods an event LSA, which describes the event. The LSA may also contain a new topology proposal if the mailbox for m is empty. (There is no reason to compute a new topology for m if information regarding m from other switches is yet to be processed.)

- When the switch receives a topology proposal P for m, it checks the proposal for consistency problems. The receiving switch checks only "local" inconsistencies; that is, it checks whether its own membership status in P conforms with its current membership status and whether P includes any malfunctioning incident links of the switch. If a local inconsistency is detected, the switch objects: it computes and advertises a new topology proposal. LSAs that carry proposals produced in this manner are called triggered LSAs.

- A topology proposal P is accepted at a switch x if the switch finds no local inconsistency for P and if the timestamp of P is greater than R_{x,m}.

- To prevent the GMC protocol from being overly reactive to bursts of events, topology computations are subject to a hold-down period. The hold-down period guarantees that successive topology computations must be at least Δt seconds apart. Assuming that the current image for the MC m at switch x was received or computed at time a, and that the switch is ready to compute a new topology at time b, where b − a < Δt, the switch sets up a timer, called TC-TIMER (TC stands for Topology Computation), with length Δt − b + a.
The postponed topology computation is resumed if no locally consistent topology proposals are received before the timer fires. The proper choice of the Δt value is a subject of our performance study in Section 5.5.

5.3.3 GMC LSA Format

Before we present the details of the GMC protocol, we must define the format of LSAs. We use the term non-GMC LSA to refer to an advertisement produced and processed by the underlying unicast LSR protocol, and the term GMC LSA to refer to an advertisement produced by the GMC protocol. In a network comprising n switches, a non-GMC LSA is a tuple (S, seq, F, D), where S is the source of the LSA, seq is the sequence number of the LSA, F indicates that the LSA is used for the advertisement of a link/nodal event, and D encodes a description of the event. For example, a description of a link-down event must include at least the two end switches of the link. The exact format of link/nodal event descriptions is defined by the underlying unicast LSR protocol, and is not discussed further.

A GMC LSA is a tuple (S, seq, F, V, G, P, T), where S ∈ {0, 1, ..., n − 1} is the source address of the LSA, seq is the sequence number of the LSA, F with value gmc identifies this LSA as a GMC LSA, V ∈ {join, leave, link, none} specifies an event from the source switch S, G identifies the MC to which this LSA is relevant, P is either a topology proposal for G or the member list of G, and T is a timestamp.

An event of type "link" in a GMC LSA indicates that a link/nodal event affects the topology of an MC. Specifically, a link/nodal event will cause the unicast LSR protocol to produce exactly one non-GMC LSA and will cause the GMC protocol to produce k GMC LSAs, where k is the number of MCs whose topologies are affected by the event. We use the configuration shown in Figure 5.4 to illustrate.
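The GMC LSA tuple (S, seq, F, V, G, P, T) defined above can be modeled concretely. The following Python dataclass is an illustrative sketch; the class name GmcLSA and the field types are our own assumptions, not part of the protocol definition.

```python
# Hedged sketch of the GMC LSA tuple (S, seq, F, V, G, P, T).
from dataclasses import dataclass

@dataclass
class GmcLSA:
    S: int      # source switch address, in {0, 1, ..., n-1}
    seq: int    # sequence number
    F: str      # "gmc" identifies this as a GMC LSA
    V: str      # event: "join", "leave", "link", or "none"
    G: int      # ID of the MC this LSA concerns
    P: object   # topology proposal for G, or the member list of G
    T: int      # timestamp

# A hypothetical join advertisement from switch 4 for connection 7.
lsa = GmcLSA(S=4, seq=12, F="gmc", V="join", G=7, P=[4, 1, 2], T=9)
print(lsa.V, lsa.T)  # join 9
```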
Let us assume that the following events occur: switch X intends to join connection C1, switch E wishes to leave connection C2, and switch F detects the failure of the (F, B) link. As shown in Figure 5.5, the three events trigger five advertisements: one for the join event, one for the leave event, and three for the link event.

Given a link/nodal event that occurs at switch F, we assume that switch F floods the single corresponding non-GMC LSA before flooding the corresponding GMC LSAs. We further assume that the sequence number of a non-GMC LSA will be smaller than those of the corresponding GMC LSAs. A switch that receives an LSA out of order will not process it until the switch has received the preceding LSAs. Therefore, at any receiving switch, the processing of the non-GMC LSA (by the unicast LSR protocol) will precede the processing of the k GMC LSAs (by the GMC protocol). Hence, the event advertised in the non-GMC LSA will have been incorporated in the local network image at the switch before the ensuing GMC LSAs are used to update MC images.

Figure 5.4: A network/MC configuration. (Legend: C1 member; C2 member; C1 link; C2 link; link used by C1 and C2.)

As demonstrated in the previous example, the GMC protocol produces a set of GMC LSAs that disseminate all events relevant to an MC. (For example, in Figure 5.5, the protocol produces two GMC LSAs for connection C1: one for the join of X and another for the failure of the (F, B) link.) Therefore, the algorithms of the GMC protocol can be presented in a per-MC manner without loss of generality.

5.3.4 Data Structures and Protocol States

Besides the aforementioned timestamp R_{x,m}, every switch x in the network maintains a variable last_tc_time_{x,m} (last topology computation time) for each MC m, and uses a mailbox_{x,m} to store incoming GMC LSAs regarding m. Every switch x in the network also maintains a local image for each MC m, denoted by Image[x, m].
An MC image includes the topology of the MC (denoted by P(Image[x, m])), the ID of the switch that proposed the topology (denoted by S(Image[x, m])), and the timestamp of the topology (denoted by T(Image[x, m])). The switch x further maintains a list of MC members, denoted by Members[x, m], and a real-time clock, clock[x]. In the following discussion, subscripts and indices in this notation may be omitted, if they are clear from the context.

Figure 5.5: Events and advertisements in the GMC protocol.

With respect to an MC m, the GMC protocol at a switch can be in one of the four states shown in Figure 5.6: EVENT-HANDLING, RECEIVING-LSA, DELAYED-TC, or IDLE. Initially, the GMC protocol is in the IDLE state. Whenever an event relating to m occurs at a switch, the switch moves into the EVENT-HANDLING state and invokes its EventHandler routine. This routine creates and floods an event LSA, which describes the event and may also contain a new topology proposal. Whenever GMC LSAs are present in the mailbox of m at a switch, the switch enters the RECEIVING-LSA state, invokes the ReceiveLSA routine to process the incoming LSAs, and checks for inconsistency problems before accepting the topology proposal, if present, in the LSA. After the completion of either the EventHandler or the ReceiveLSA routine, the switch returns to the IDLE state.

Figure 5.6: The state-transition diagram of the GMC protocol. (IDLE is the initial state; transitions are labeled "TC-TIMER goes off," "EventHandler() done," "ReceiveLSA() done and no local events," and "TCTimerHandler() done.")
When a hold-down timer fires, the switch enters the DELAYED-TC (Delayed Topology Computation) state and invokes the TCTimerHandler routine. When local events, LSA arrivals, and timer firings occur simultaneously, the EVENT-HANDLING state has priority, followed by the RECEIVING-LSA state.

5.3.5 Protocol Algorithms

We are now ready to describe the algorithms for EventHandler, ReceiveLSA, and TCTimerHandler. In the following, we assume that the floodings of LSAs are reliable and that LSAs from the same switch are ordered by sequence number. Reliable flooding can be implemented using either a reliable hop-by-hop protocol, or by periodic re-flooding [12]. Different reliability mechanisms and flooding algorithms affect the timing behavior of the GMC protocol, but do not affect its correctness.

At a switch x and given an MC m, the GMC algorithms share the data structures described in the previous section. (Additional variables shared by these algorithms, such as the make_proposal_flag variable, will be introduced later.) We point out that simultaneous accesses to shared data structures and variables cannot occur, because at any moment in time the GMC protocol is in exactly one of the four states shown in Figure 5.6, and will leave the current state only if the corresponding routine is completed.

The EventHandler algorithm is given in Figure 5.7. The algorithm is presented in a per-MC manner; that is, when an event occurs, this routine is invoked for every connection affected by the event. This protocol entity is responsible for the generation of GMC LSAs only; the non-GMC LSA resulting from a link/nodal event is generated and flooded by the underlying unicast protocol. In Figure 5.7, the local switch is identified by parameter x, the event is given in parameter ev, and the affected connection is given by parameter m.
The EventHandler may be invoked because of membership change events (that is, when switch x joins or leaves the MC m), or link state events that affect the MC (for example, an incident link that is used by the MC fails). In both cases, the routine advances the timestamp R of m (line 1), updates the MC member list when necessary (lines 2-4), and computes a new MC topology P for m (line 6), if such an action is not prohibited by a hold-down period. If a new topology is not computed due to the hold-down period, then the TC-TIMER is set up at line 9 to defer the computation to TCTimerHandler (if the timer is already in use, line 9 restarts the timer). Even if the computation at line 6 is performed, the result P may be obsolete after the completion of the computation, due to the arrivals of new GMC LSAs regarding m. If P remains up-to-date after computation, it is flooded throughout the network (line 14) and accepted at x itself (by calling an auxiliary routine, AcceptTopology, at line 15). When the proposing of a topology is postponed due to either the hold-down period or obsolescence, the EventHandler at line 17 floods the event ev with a member list of m, rather than an MC topology, and defers to the ReceiveLSA routine to make sure that a correct MC image is eventually established. This information is passed to ReceiveLSA by setting a shared variable, make_proposal_flag, equal to TRUE (line 18). As with other GMC variables, at each switch x there is one make_proposal_flag variable for each MC m.

The AcceptTopology algorithm, shown in Figure 5.8, registers an MC topology P into the local database of the invoking switch x, and attempts to establish incident MC links according to the new topology. The local MC image Image, including the topology, the source switch ID, and the timestamp, is updated at lines 1 to 3. The routine then tries to establish MC links that are defined in P and incident to x.
Algorithm: EventHandler
Input: switch ID x, event ev, and connection m
1: R = R + 1.
2: IF (ev is for membership status change of x)
3:   Update Members(m) accordingly.
4: ENDIF
5: IF (clock − last_tc_time > tc_holddown),
6:   Compute a new topology proposal P for the connection m.
7:   last_tc_time = clock.
8: ELSE
9:   Set the TC-TIMER to value tc_holddown − clock + last_tc_time.
10: ENDIF
11: IF (a new topology P is computed) and
12:    (no LSAs for m received during the computation),
13:   /* proposal is still valid */
14:   Flood the GMC LSA (x, gmc, ev, m, P, R).
15:   AcceptTopology(x, m, P, R, x).
16: ELSE /* flood event but defer to ReceiveLSA to make proposal */
17:   Flood the GMC LSA (x, gmc, ev, m, Members(m), R).
18:   make_proposal_flag = TRUE.
19: ENDIF

Figure 5.7: The algorithm for EventHandler.

As described earlier, an MC link (x, y) is established only if the MC image at the neighboring switch y has a source switch ID and a timestamp identical to those at x (lines 6-9). Before completion, the routine sets the make_proposal_flag variable to FALSE, and records the current time in last_tc_time.

The algorithm for the ReceiveLSA routine is given in Figure 5.9. Parameter x identifies the local switch, and parameter m specifies the MC. The routine is invoked when the switch enters the RECEIVING-LSA state, that is, when there is at least one LSA in the mailbox for connection m. For every such LSA ℓ, ReceiveLSA updates the local member list of connection m if the event in ℓ is about a membership change (line 3). Next, the routine checks for inconsistency problems in the LSA and records the result in a variable, my_status_consistent (lines 4-9). As mentioned earlier, the switch x is only interested in local consistency; that is, the received LSA must contain correct membership information with respect to x (line 5) and any topology proposal in the LSA must not use any malfunctioning links that are incident to x (line 4).
Algorithm: AcceptTopology
Input: switch ID x, connection m, topology P, stamp T, and source S.
1: P(Image[x][m]) = P.
2: S(Image[x][m]) = S.
3: T(Image[x][m]) = T.
4: FOR (every link t in P that is incident to x) DO
5:   Let t be an (x, y) link.
6:   Exchange messages with y to learn S(Image[y][m]) and T(Image[y][m]).
7:   IF (S(Image[x][m]) = S(Image[y][m])) and (T(Image[x][m]) = T(Image[y][m]))
8:     Establish the (x, y) link for connection m.
9:   ENDIF
10: ENDDO
11: make_proposal_flag = FALSE.
12: last_tc_time = clock.

Figure 5.8: The algorithm for AcceptTopology.

Next, the routine decides if the LSA can be accepted (lines 10-13). For an LSA ℓ to be accepted, it must include a topology proposal that is more recent than the local one and that is locally consistent. The LSA ℓ is more up-to-date than the local MC image at x if it is tagged with a larger timestamp value; a tie in the timestamp comparison is resolved by the values of the source switch IDs (line 12). If the LSA is accepted, then the AcceptTopology routine is invoked to update the local MC image (line 14), and the make_proposal_flag for connection m is set to FALSE (line 15), since an up-to-date topology for the connection has been accepted. Otherwise, the switch checks whether its local status is consistent with the received topology proposal (line 17). If not, then the switch plans to construct a new topology proposal by setting its make_proposal_flag variable to TRUE (although it may need to process additional LSAs first). To conclude the processing of the current LSA ℓ, the ReceiveLSA routine advances the R timestamp for MC m to be at least as large as that of ℓ (line 19). Since the R timestamp will be advanced again before the switch x proposes and floods any topology in the future (line 1 of EventHandler, line 31 of ReceiveLSA, and line 8 of TCTimerHandler), the advancement at line 19 ensures that subsequent topology proposals will be tagged with timestamps larger than anything x has received.
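The acceptance test at lines 10-13 of ReceiveLSA can be condensed into a single predicate. The sketch below is illustrative only; the function name and argument layout are our own, not part of the protocol specification.

```python
# Hedged sketch of the ReceiveLSA acceptance test: a proposal is accepted
# only if it is locally consistent, its timestamp is at least R, and its
# (timestamp, source) pair exceeds that of the local MC image.
def accept_proposal(lsa_T, lsa_S, image_T, image_S, R, locally_consistent):
    if not locally_consistent or lsa_T < R:
        return False
    # A tie in the timestamp comparison is resolved by the source switch ID.
    return (lsa_T, lsa_S) > (image_T, image_S)

print(accept_proposal(9, 4, 9, 2, 8, True))    # True: same stamp, larger source
print(accept_proposal(9, 1, 9, 2, 8, True))    # False: loses the tie-break
print(accept_proposal(10, 1, 9, 2, 11, True))  # False: stamp below R
```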
Algorithm: ReceiveLSA
Input: switch ID x, connection ID m.
1: WHILE (there are LSAs for connection m in mailbox)
2:   Get next LSA ℓ = (S, gmc, V, m, P, T).
3:   Update member list of m accordingly, if V is for membership update.
4:   IF (for every link t used in P that is incident to x, t is ON) and
5:      (the membership of x in P(ℓ) is consistent with that in Members(m)),
6:     my_status_consistent = TRUE.
7:   ELSE
8:     my_status_consistent = FALSE.
9:   ENDIF
10:  IF (P(ℓ) is a topology proposal) and
11:     (T(ℓ) ≥ R) and
12:     (T(ℓ) > T(Image), or (T(ℓ) = T(Image) and S(ℓ) > S(Image))) and
13:     (my_status_consistent = TRUE),
14:    AcceptTopology(x, m, P(ℓ), T(ℓ), S(ℓ)).
15:    make_proposal_flag = FALSE.
16:  ELSE
17:    make_proposal_flag = TRUE, if (my_status_consistent = FALSE).
18:  ENDIF
19:  R = max{R, T(ℓ)}.
20: ENDWHILE
21: IF (make_proposal_flag = TRUE)
22:   IF (clock − last_tc_time > tc_holddown),
23:     Compute a new topology P for the connection m.
24:     last_tc_time = clock.
25:   ELSE
26:     Set up the TC-TIMER with length tc_holddown − clock + last_tc_time.
27:   ENDIF
28:   IF (a new topology P is computed) and
29:      (there are no LSAs in mailbox for connection m) and
30:      (no local events for connection m queued at x),
31:     R = R + 1.
32:     Flood (x, gmc, none, m, P, R).
33:     AcceptTopology(x, m, P, R, x).
34:   ENDIF
35: ENDIF

Figure 5.9: The algorithm for ReceiveLSA.

After consuming all the LSAs in the mailbox, the ReceiveLSA routine decides whether a new proposal should be computed, depending on the value of the make_proposal_flag variable (line 21) and the hold-down mechanism (line 22). If a topology is computed at line 23, two conditions must be satisfied before the proposal is actually flooded: 1) no new GMC LSAs arrive during the computation period (line 29), and 2) no local events take place during the period (line 30). If the proposal is still up-to-date at the end of the computation, then it is flooded to the other switches and accepted locally (lines 32-33).
Otherwise, it is withdrawn and the make_proposal_flag remains TRUE, indicating the lack of an up-to-date MC image for m at switch x. In the case where the topology computation is held down, the TC-TIMER is set up (or restarted, if it is already in use) at line 26 to defer the computation to TCTimerHandler.

The algorithm for the TCTimerHandler routine is given in Figure 5.10. Again, parameter x identifies the local switch, and parameter m specifies the involved MC. Before resuming a postponed topology computation, the routine first checks if this computation is still needed. The computation may no longer be necessary because, during the hold-down period, topology proposal(s) may have been received and accepted (hence setting the make_proposal_flag to FALSE), or there may be pending "news" about the MC (GMC LSAs in the mailbox or events in the event queue). Similar to the previous routines, the new topology P is actually flooded only if no further news about the MC is observed during the computation period.

Algorithm: TCTimerHandler
Input: switch ID x and connection m.
1: IF (make_proposal_flag = TRUE) and
2:    (there are no LSAs in mailbox for connection m) and
3:    (no queued events for connection m),
4:   Compute a new topology P for the connection m.
5:   last_tc_time = clock.
6:   IF (there are no LSAs in mailbox for connection m) and
7:      (no queued events for connection m),
8:     R = R + 1.
9:     Flood (x, gmc, none, m, P, R).
10:    AcceptTopology(x, m, P, R, x).
11:    make_proposal_flag = FALSE.
12:  ENDIF
13: ENDIF

Figure 5.10: The algorithm for TCTimerHandler.

5.3.6 MC Creation and Destruction

The creation and destruction of an MC require no special mechanisms. When the first member of an MC advertises its presence, the other switches allocate the necessary data structures for the MC and accept the topology proposal contained in the advertisement. When a switch detects an empty member list for an MC, the local data structures corresponding to the MC are deleted.
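The hold-down rule shared by EventHandler (lines 5-9), ReceiveLSA (lines 22-26), and TCTimerHandler can be captured in a small helper. This Python sketch is illustrative; the function name holddown_decision is our own.

```python
# Hedged sketch of the hold-down rule: successive topology computations must
# be at least tc_holddown seconds apart; otherwise a TC-TIMER defers the work.
def holddown_decision(clock, last_tc_time, tc_holddown):
    """Return ('compute', None) or ('defer', timer_length)."""
    if clock - last_tc_time > tc_holddown:
        return ("compute", None)
    # TC-TIMER length tc_holddown - clock + last_tc_time, as in line 9 of
    # EventHandler and line 26 of ReceiveLSA.
    return ("defer", tc_holddown - clock + last_tc_time)

print(holddown_decision(clock=20.0, last_tc_time=5.0, tc_holddown=10.0))
# ('compute', None)
print(holddown_decision(clock=12.0, last_tc_time=5.0, tc_holddown=10.0))
# ('defer', 3.0)
```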
5.4 Proof of Correctness

In this section, we formally show the correctness of the GMC protocol in two steps. In the first step, we consider the correctness of the protocol without memory shortage problems (that is, operational switches will not lose MC images). Under this assumption, we will show that, given a finite set of events, the algorithm will reach consensus about MC images among network switches by producing a finite number of LSA broadcasts. (Our simulation results, presented in Section 5.5, show that in practice the number of LSAs per event is likely to be small.) In the second step, we describe (minor) extensions to the GMC protocol to handle losses of GMC data structures, and establish a sufficient condition for the GMC protocol to work correctly in the presence of such switch memory overflows. For clarity, the discussion in this section is in terms of a single MC.

As illustrated in Figure 5.5, the GMC protocol, when given a set of events Π (link-state and/or MC membership changes), produces a set of GMC LSAs, L_m, exclusively for every MC m. Thus, the protocol activities associated with different MCs proceed independently; herein lies the generality of a proof regarding a single MC.

5.4.1 Correctness without Memory Overflows

Under the assumption that switches, unless crashed, will not lose MC images, the GMC protocol proceeds as described in Section 5.3. In the following discussion, we assume a finite set of events, denoted Π, that does not leave the network permanently partitioned. Our goal is to show that such an event set will not lead to infinitely looping GMC activities. We point out that temporary partitioning could be produced by such an event set Π; all but permanent partitions are handled by GMC.

Lemma 1 Given an MC m and a set of events Π as defined above, the GMC protocol produces a finite set of LSAs, L_m.
Proof: Since flooding operations are assumed to be reliable (see Section 5.3), there exists a time τ by which all the events in Π are learned by all switches. (If Π incurs temporary partitioning, reliability of flooding can be enforced by periodic re-flooding.) Any GMC LSA produced after time τ will incorporate the changes in Π. Such an LSA will not be objected to by any other switch (that is, the corresponding my_status_consistent values will be TRUE at all switches), and hence will not trigger any additional LSAs. Since an LSA must require a minimum time Δt to construct, we see that the GMC protocol is able to produce only a finite number of LSAs by time τ, and hence the set L_m must be finite. □

In the following discussion, a GMC LSA is said to be (locally) consistent at a switch y if the ensuing my_status_consistent value is TRUE at y. Also, we will drop the subscript m in the notation L_m, since all the discussions are about an MC m that has at least one member join event in Π (otherwise, the MC is inactive and is not relevant).

Definition 1 We denote by ℓ_max the LSA in L that has the maximum timestamp-source pair (T(ℓ_max), S(ℓ_max)).

The concept of the maximum element in L is well defined because the set L cannot be empty; at least one GMC LSA is generated for the member join assumed above. We will see that all network switches will accept ℓ_max, and the topology contained in ℓ_max will be the consensus among all network switches, due to the two properties stated in the subsequent lemmas. Recall that S(ℓ) and T(ℓ) are the source and timestamp of an LSA ℓ.

Lemma 2 The LSA ℓ_max includes a topology proposal.

Proof: Let us assume the opposite. A GMC LSA that does not contain a topology proposal must be produced by the EventHandler routine at line 17, a scenario that occurs when there are incoming GMC LSAs during the topology computation of ℓ_max.
If this happens to ℓ_max, the value of R at switch S(ℓ_max) at this moment is T(ℓ_max), and the make_proposal_flag variable is set to TRUE. Consider the GMC protocol activities at S(ℓ_max) after the production of ℓ_max (a GMC activity is an invocation of EventHandler, ReceiveLSA, or TCTimerHandler). The ReceiveLSA routine must be invoked at least once, to process the LSA(s) that arrived during the processing of ℓ_max. We exclude the possibility of post-ℓ_max events occurring at S(ℓ_max); otherwise, further invocations of EventHandler would advance the timestamp R, and flood LSAs with timestamps greater than that of ℓ_max, a contradiction to the choice of ℓ_max. Therefore, during the post-ℓ_max activities at S(ℓ_max), one of the following must happen: at least one LSA with a timestamp-source pair greater than (T(ℓ_max), S(ℓ_max)) is received (and accepted), or the TRUE value in the make_proposal_flag variable forces ReceiveLSA (or TCTimerHandler, if required by a hold-down period) to compute and flood at least one topology with the timestamp value R advanced. Since both cases imply the existence of timestamp-source pairs larger than (T(ℓ_max), S(ℓ_max)), they lead to contradictions to the choice of ℓ_max, completing the proof. □

Lemma 3 The topology P(ℓ_max) is consistent at all network switches.

Proof: Suppose to the contrary that ℓ_max is detected to be inconsistent at some switch y. In response to this situation, switch y sets its make_proposal_flag to TRUE (at line 17 of ReceiveLSA). In the meantime, the R variable at y is advanced to T(ℓ_max) (line 19 of ReceiveLSA), the maximum timestamp value in L, prohibiting subsequent LSAs from being accepted at y (line 11). After y's receipt of ℓ_max, there can be no local events at y; otherwise, EventHandler would produce GMC LSAs with timestamps larger than or equal to T(ℓ_max) + 1, a contradiction to the choice of ℓ_max. The TRUE value of make_proposal_flag will cause new topology computations at line 23 of ReceiveLSA or line 4 of TCTimerHandler.
The results of these computations, in the absence of further local events, could be dropped in response to incoming LSAs, which spawn additional GMC activities. However, since the number of LSAs in L is finite, eventually the result of a post-ℓ_max topology computation will be flooded with timestamp R + 1 = T(ℓ_max) + 1, a contradiction to the choice of ℓ_max. □

Since the GMC LSA ℓ_max has the maximum topology ID and includes a proposal that is consistent at all operational network switches, it shall be accepted by these switches. This observation leads us to the next theorem.

Theorem 1 Given an MC m and a finite, non-partitioning set of events Π, all operational network switches will reach consensus on the MC topology with a finite number of MC LSAs.

5.4.2 The Handling of Memory Overflows

Next, we consider scenarios where one or more network switches run out of memory space and must purge some entries in their local network images, including MC-related entries. Since a switch can always compute an entirely new MC topology if it has the member list of the MC, the loss of MC topology images (that is, the P(Image) data structures) will not cause problems. Hence, we are concerned only with the loss of MC member lists. Further, we assume that switches will not purge data structures other than member lists and MC topology images. The assumption is reasonable because those two are the most space-consuming data structures used by the GMC protocol.

We use an example to illustrate the minor extension to the GMC protocol needed to handle losses of member lists. Consider a scenario where a switch x runs out of storage space and decides to purge the member list of an MC m. The lost member list can be re-constructed when a new topology proposal arrives and is accepted. Should this be the case, the temporary loss of the data structure does no harm to
The more interesting case is when switch x must propose a topology after its member list has been purged. One solution is to have x create a member list that incorporates only its own membership status (that is, a member list {x} if x is a member, or else an empty list), and propose an MC topology according to this list. The topology will be found to be inconsistent at all other switches that are members of the MC, triggering topology proposals computed at these switches.

Actually, the GMC protocol with the above revision can survive even more adverse scenarios than isolated, temporary losses of member lists. To investigate the tolerance limit of the protocol on this issue, we establish a sufficient condition for the GMC protocol to converge in the presence of member list losses.

Lemma 4 Given a set of events Π and an MC m, let M be the set of members of m after Π. If there exists at least one switch y ∈ M that does not purge the Members[m] data structure indefinitely, then the set L_m is finite.

Proof: If L_m is infinite, there must exist switches that indefinitely flood GMC LSAs pertaining to m. Let X be the set of indefinitely flooding switches, and let τ be a moment in time after y has stopped purging Members[m] and after all events in Π have been learned by all switches. Define time τ′ ≥ τ to be a moment in time after switch y has constructed the membership status about switches in X (via the infinite number of LSAs from these switches) and after all switches not in X have stopped flooding. If switch y proposes a topology after time τ′, then switches in X shall no longer produce triggered LSAs, a contradiction to the selection of X. It can be concluded that y must not be in X, and hence remains silent after time τ′. (The possibility for y to flood an LSA without a topology proposal is excluded, because LSAs without proposals must be event LSAs, which do not exist after time τ.)
If y is to remain silent, all the LSAs produced by switches in X after time τ′ must be consistent at y. To remember the fact that y is a member of m, switches in X cannot purge Members[m] after time τ′. Let time τ″ ≥ τ′ be a time after all switches in X have stopped purging Members[m], and let us consider any switch x ∈ X. When other switches in X learn the status of x at some time after τ″ (recall that x floods indefinitely, so these switches have ample opportunity to learn this information), they shall not subsequently purge it. Subsequent LSAs will then be consistent at x, so x will become silent. Therefore, x cannot be in X, a contradiction. We are done. □

The next lemma is a counterpart of Lemmas 2 and 3 combined, in the presence of member list losses.

Lemma 5 Given a set of events Π and an MC m, let M be the set of members of m after Π. If there exists at least one switch y ∈ M that does not purge the Members[m] data structure indefinitely, then the ℓ_max LSA includes a topology proposal that is consistent at all switches.

Proof: With a finite set L_m, the maximum LSA, ℓ_max, in L_m is well defined. Lemma 2 showed that the ℓ_max LSA must contain a topology proposal; otherwise, the switch S(ℓ_max) would have suggested another LSA with a timestamp larger than T(ℓ_max). That argument is independent of the issue of member list losses and therefore still holds. However, we need to consider the possibility that the topology P(ℓ_max) is based on a newly created member list, which could be incomplete when P(ℓ_max) is computed. If the member list M′ that S(ℓ_max) uses to compute P(ℓ_max) is not equal to M, it will trigger LSAs from the switches in M − M′ and M′ − M. These LSAs will be tagged with timestamps larger than T(ℓ_max), a contradiction to the selection of ℓ_max. Hence, ℓ_max must be based on a complete member list. Lemma 3 then guarantees the correctness and network-wide acceptance of its topology proposal, concluding the proof. □
Hence, no LSAs will be able to override the maximum LSA, ℓ_max, which shall be the consensus on the MC m, even in the presence of member list losses. This leads us to the following theorem.

Theorem 2 Given a set of events Π and an MC m, let M be the set of members of m after Π. If there exists at least one switch y ∈ M that does not purge the Members[m] data structure indefinitely, then the GMC protocol will achieve consensus on the topology of the MC m using a finite number of GMC LSAs.

5.5 Performance Evaluation

A major objective of the GMC protocol is to reduce the redundancy in topology computation incurred by previous LSR-based solutions, while retaining the advantages of LSR (responsiveness, fault tolerance, and so on) and supporting a variety of MC types. In situations where events are relatively sparse, when a switch detects an event, the GMC protocol suggests a new topology and advertises it in an LSA, which will be accepted by all other switches. In this case, there is only one topology computation and one flooding operation per event. This compares very favorably with the MOSPF protocol, which requires a topology computation at every switch involved in the MC. However, it is also important to study the behavior of the GMC protocol when several events occur within a short period of time, during which switches detect inconsistencies in topology proposals and are triggered to prepare and advertise their own proposals. Such situations raise the concern of cascading reactions among switches, which could erode the advantage of GMC over other approaches. A simulation study was conducted to investigate the behavior of the GMC protocol under such circumstances. The simulator is based on the CSIM simulation package [62].

5.5.1 Simulation Methodology

Each simulation session is defined by a set of parameters, including topology computation time, LSA transmission time, event generation distributions, network size, and so forth.
In this section, we discuss the selection of parameter values. We use the symbol T_c to represent the time to compute a topology, and T_f to denote the flooding diameter of the network, that is, the time to complete a flooding operation in the worst case. We define the time T_f + T_c to be a round, which, as mentioned above, is the amount of time needed to handle sparse events in the GMC protocol.

In the GMC protocol, the value of the parameter T_c may vary from MC to MC, depending on the choice of the topology computation algorithm for that MC. In this study, we assume the use of Dijkstra's shortest path algorithm, and we measured the execution times of the algorithm (using our random graphs as input) on Sun SPARC-20 workstations. The rationale behind the use of Dijkstra's algorithm is its widespread use in computing source-rooted trees [6] and its applicability to several heuristics for computing shared-tree topologies (for example, the core-based tree heuristic [8] and the KMB algorithm [31]). Further, this assumption allows us to directly compare the GMC protocol with the MOSPF protocol, which also uses Dijkstra's algorithm.

To determine T_f values, the following flooding protocol is assumed: LSAs arriving at a switch for the first time are forwarded along all incident links, except the incoming one; LSAs arriving at a switch for the second time are dropped silently; LSAs are forwarded to neighboring switches one by one. For each LSA forwarding, we used software overheads measured on the ATM testbed in our laboratory. The testbed comprises Sun SPARC-10 workstations equipped with Fore SBA-200 adapters and connected by three Fore ASX-100 switches. From these measurements, we obtained the figure of 600 μsec, which includes the overhead at both the sending and receiving switches. Networks comprising up to 400 switches were simulated. For each network size, 40 graphs were generated randomly, and two simulation sessions were conducted on each graph.
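The assumed flooding rule lends itself to a simple worst-case model. The sketch below is our own illustration (the function name and the serialized-forwarding model are assumptions, not the simulator's actual code): it estimates T_f for a given graph by charging the measured 600 μsec overhead to each one-by-one LSA forwarding.

```python
import heapq

FORWARD_COST = 600e-6  # measured overhead per LSA forwarding, in seconds


def flooding_time(adj, origin):
    """Estimate the worst-case completion time of one flooding operation.

    adj maps each switch to the list of its neighbors.  An LSA arriving at
    a switch for the first time is forwarded on all incident links except
    the incoming one; a second copy is dropped silently.  Copies are
    forwarded to neighbors one by one, so the i-th copy departs
    i * FORWARD_COST after the LSA is accepted.
    """
    accepted = {origin: 0.0}          # switch -> first-arrival time
    heap = [(0.0, origin, None)]      # (time, switch, incoming neighbor)
    while heap:
        t, sw, came_from = heapq.heappop(heap)
        if t > accepted[sw]:
            continue                  # stale queue entry for a late copy
        slot = 0
        for nbr in adj[sw]:
            if nbr == came_from:
                continue              # never echo back on the incoming link
            slot += 1
            arrival = t + slot * FORWARD_COST
            if arrival < accepted.get(nbr, float("inf")):
                accepted[nbr] = arrival
                heapq.heappush(heap, (arrival, nbr, sw))
    return max(accepted.values())     # time the last switch hears the LSA
```

On a three-switch line A-B-C, flooding from A completes after two forwarding steps, i.e., about 1.2 milliseconds under this model.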
Table 5.1 shows the characteristics of the graphs generated. In the table, maximum, minimum, and mean values are averages over the 40 graphs of that size.

  Network |        degree         |    diameter     |   T_f   |  round
  size    |  min    mean    max   | min  mean  max  | (in ms) | (in ms)
  --------+-----------------------+-----------------+---------+--------
    10    | 1.675   3.6     5.675 |  2   3.25    5  |  3.555  |  3.621
    20    | 1.25    3.573   6.725 |  4   4.75    7  |  5.123  |  5.225
    40    | 1.025   3.733   8.025 |  5   6.175   9  |  6.705  |  6.892
    60    | 1.025   3.875   8.65  |  5   6.8    11  |  7.5    |  7.776
    80    | 1       3.914   8.925 |  6   7.075   9  |  7.867  |  8.235
   100    | 1       4.12    9.675 |  6   7.15    9  |  8.1    |  8.558
   120    | 1       4.103   9.725 |  6   7.525   8  |  8.423  |  8.999
   140    | 1       4.201  10.225 |  7   7.725   9  |  8.64   |  9.308
   160    | 1       4.220  10.375 |  7   7.575  10  |  8.798  |  9.576
   180    | 1       4.307  10.5   |  7   7.75   10  |  8.9623 |  9.851
   200    | 1       4.289  10.825 |  7   7.95   10  |  9.075  | 10.078
   250    | 1       4.503  11.275 |  7   7.95   10  |  9.338  | 10.666
   300    | 1       4.704  11.725 |  7   7.975  10  |  9.465  | 11.123
   350    | 1       4.873  12.325 |  7   7.85    9  |  9.533  | 11.422
   400    | 1       5.065  12.65  |  7   7.85    9  |  9.623  | 11.829

  Table 5.1: Characteristics of randomly generated graphs.

The durations of hold-down intervals are uniformly distributed, and are selected randomly each time a hold-down timer is set up by a switch. We investigated the performance of the GMC protocol with no hold-down timers (that is, hold-down intervals of length zero), a short hold-down interval distributed from 2 to 10 rounds, a medium hold-down interval from 20 to 100 rounds, and a long hold-down interval from 200 to 1000 rounds. For example, when a round is 10 milliseconds, these interval lengths translate into 0.02 to 0.1 seconds, 0.2 to 1 second, and 2 to 10 seconds, respectively.

We are interested in two performance metrics: topology computations per event and flooding operations per event. The first metric reveals the computational overhead incurred by an MC protocol, and the second measures the communication overhead. In the GMC protocol, the two metrics are not necessarily directly proportional to one another, since computed topologies might not be flooded due to the arrival of new LSAs.
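The batching role of hold-down timers can be illustrated with a small model. The following sketch is ours, not the simulator's code, and it simplifies the protocol by using a fixed hold-down length (the simulations draw each interval uniformly at random): events that occur during a hold-down interval are deferred and covered by a single batched topology computation when the timer expires.

```python
def computations_with_holddown(event_times, holddown):
    """Count topology computations under a hold-down discipline.

    An event occurring outside any hold-down interval triggers an
    immediate computation; events occurring inside the interval are
    deferred and covered by one batched computation when the timer
    expires.  Each computation starts a new hold-down interval.
    """
    events = sorted(event_times)
    computations = 0
    timer_expiry = 0.0
    i = 0
    while i < len(events):
        # run the computation now, or when the current timer expires
        run_at = max(events[i], timer_expiry)
        computations += 1
        timer_expiry = run_at + holddown
        # every event that occurred before this computation ran is covered
        while i < len(events) and events[i] <= run_at:
            i += 1
    return computations
```

For a burst of four events at times 0, 1, 2, and 3 with a hold-down of 10, the model performs only two computations: the immediate one for the first event, and one batched computation covering the remaining three.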
5.5.2 Group Creation Periods

In the first set of experiments, we study the behavior of the GMC protocol during group creation periods. That is, we assume that a group has a predetermined start time and that a potentially large number of group members join the group at or about that time. (Such a scenario could occur when, for example, a large number of users join a live broadcast at the beginning of the broadcast.) Specifically, we assume that member arrival times are normally distributed with mean 0, the start time. We chose standard deviation values such that 99% of members arrive within a chosen interval length; specifically, we used standard deviations such that 99% of members arrive within 1 second, 10 seconds, 30 seconds, and 10 minutes, respectively. The extremely short creation periods, such as the 1-second and 10-second ones, are designed to stress the GMC protocol during very busy periods. To make the group creation periods as busy as possible, we assume that all switches are group members.

Short arrival intervals. The performance of the GMC protocol with the 1-second member arrival interval is plotted in Figure 5.11. Figure 5.11(a) plots the number of topology computations per event, and Figure 5.11(b) plots the number of floodings per event. When a large number of group members arrive within such a short period of time, cascading interactions among switches could occur if the GMC protocol reacted to events too quickly. This behavior is illustrated by the curves corresponding to no use of hold-down timers. These plots start in the vicinity of one (that is, approximately one computation and one flooding per event), because, when the number of group members is small, member join events are still relatively sparse and do not interfere with one another.
As the number of switches/members grows and join events collide with each other, these plots reach approximately 5.8 topology computations and 2.9 flooding operations per event, indicating the presence of cascading reactions among switches. However, the curves pertaining to the use of hold-down timers, even the short timer, show that the over-reaction of the GMC protocol can be curbed. With the medium hold-down interval, the number of topology proposals per event approaches zero, and the number of flooding operations per event approaches one. (The latter metric must be greater than or equal to one because the GMC protocol always advertises events immediately, with or without topology proposals.)

We note that the number of flooding operations per event when no hold-down is used is not always increasing; see Figure 5.11(b). In general, the number of flooding operations does not necessarily grow with "event density," which is determined by the size of the group when given a fixed arrival interval. This issue will be further addressed later. Results for the 10-second and 30-second creation periods are shown in Figures 5.12 and 5.13, respectively. These results are similar to those for the 1-second periods, although the values are much lower.

Figure 5.11: Performance of the GMC protocol under the 1-second arrival interval. (a) Computations per event. (b) Floodings per event.
Figure 5.12: Performance of the GMC protocol under the 10-second arrival interval. (a) Computations per event. (b) Floodings per event.

Figure 5.13: Performance of the GMC protocol under the 30-second arrival interval. (a) Computations per event. (b) Floodings per event.

10-minute arrival intervals. The performance of the GMC protocol with the 10-minute member arrival interval is plotted in Figure 5.14. With this relatively long arrival interval, the interaction between the event density and the lengths of hold-down intervals becomes clear. When the GMC protocol uses an average hold-down interval of Δt seconds, two events must be Δt apart so that their processing does not interfere with each other (otherwise, the second event will fall within the hold-down interval created by the first, forcing the GMC protocol to postpone its topology computation).
For the short hold-down interval, even the largest networks create "isolated" join events, resulting in the normal operation of the GMC protocol (that is, one topology computation and one flooding operation per event), as seen in Figures 5.14(a) and (b).

With longer hold-down intervals and larger networks, interference among the processing of events can be observed. However, this inter-event interference affects the two performance metrics differently. Considering the number of topology computations per event: the more inter-event interference, the more topology computations are suppressed, and hence the fewer topology proposals per event; see the results for the medium and long hold-down intervals in Figure 5.14(a). For flooding operations per event, however, suppressing the topology computation when an event takes place can introduce later floodings of the delayed topology proposals, resulting in more flooding operations per event. Since the number of topology computations per event decreases as the network size increases, these extra flooding operations (the ones for the delayed topology proposals) become increasingly rare. This behavior is illustrated in Figure 5.14(b) by the curve pertaining to the long hold-down interval. A similar phenomenon can be observed in Figure 5.11(b) for the no-hold-down case.

Figure 5.14: Performance of the GMC protocol under the 10-minute arrival interval. (a) Computations per event. (b) Floodings per event.

In summary, the high rates of join events during group creation periods can be very demanding of MC protocols.
Our simulation results show that, even in extremely busy periods, the GMC protocol is able to avoid cascading protocol activities by using relatively short hold-down intervals (for example, ones shorter than 0.1 seconds). With longer hold-down intervals, the GMC protocol processes bursty events effectively in batch mode, so that only one topology computation is incurred in response to multiple events. Although this simulation study targets group creation periods, the results apply to any period with high event density.

5.5.3 Normal Operations

During "normal" operation periods of multiparty communication applications, participants join and leave MCs occasionally, and MC protocols may behave differently than during busy periods, such as group creation periods. In this section, we investigate the behavior of GMC under such circumstances. We assume that inter-arrival times of events are exponentially distributed, and we set the event inter-arrival rates such that an N-switch network has approximately 0.3 × N events during a period of 3600 rounds, equivalent to one hour if a round is 10 milliseconds.

Figure 5.15: Performance of the GMC protocol in normal operations. (a) Computations per event. (b) Floodings per event.

Under these conditions, it is unlikely (but not impossible) that the processing of one event will interfere with that of the preceding or succeeding events.
For such interference to occur, very long hold-down intervals must be used. This behavior is illustrated by the results presented in Figure 5.15, in which only the long hold-down intervals produce slightly fewer than one topology computation per event and more than one flooding operation per event.

5.5.4 Comparison with the MOSPF Protocol

The efficiency of the MOSPF protocol depends on three factors: the event rate, the datagram arrival rate, and the size of the MC. In this protocol, the topology of an MC is cleared whenever an event LSA arrives, and is recomputed when the next datagram for that MC arrives. Once triggered, this computation is performed at all the switches currently involved in the connection. Consider an MC that involves k switches (members and intermediate nodes), and let p be the probability that no datagram arrives between two consecutive events; then the MOSPF protocol incurs k(1 − p) topology computations per event. The probability p is determined by the ratio of the event arrival rate to the datagram arrival rate.

Figure 5.16: Topology computations per event of the MOSPF protocol. (a) 30-second creation interval. (b) Normal operation.
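The k(1 − p) estimate can be made concrete. If we model events and datagrams as independent Poisson arrivals (our assumption, for illustration only), the probability that no datagram arrives between two consecutive events is the event rate divided by the sum of the two rates:

```python
def mospf_computations_per_event(k, event_rate, datagram_rate):
    """Expected MOSPF topology computations per event for an MC
    spanning k switches (members plus intermediate nodes).

    p is the probability that no datagram arrives between two
    consecutive events; with independent Poisson arrivals, the next
    event precedes the next datagram with probability
    event_rate / (event_rate + datagram_rate).
    """
    p = event_rate / (event_rate + datagram_rate)
    return k * (1.0 - p)


# e.g., an MC involving 100 switches, 1 event/s, 5 datagrams/s:
# p = 1/6, giving roughly 83 topology computations per event, whereas
# the GMC protocol would incur about one for such sparse events.
```

Note how the overhead rises with the datagram rate: the more often datagrams arrive between events, the more often every involved switch must recompute the topology.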
We investigated the performance of the MOSPF protocol during 30-second group creation periods and during normal operation periods, using the same event distributions as in the GMC simulations and using three datagram arrival rates: 0.2 datagrams per second, 1 datagram per second, and 5 datagrams per second. The results are presented in Figure 5.16. As we can see, with a relatively moderate data rate of 5 datagrams per second, the number of topology computations per event can grow as high as 70 during 30-second creation periods and 120 during normal operation periods. Even with the low data rate of 0.2 datagrams per second, the typical number of topology computations per event is somewhere between 4 and 5 during 30-second creation periods, and can be as high as 40 during normal operation periods. We conclude that the MOSPF protocol incurs significantly higher computational overhead than does the GMC protocol. Regarding the number of flooding operations per event, the MOSPF protocol incurs one flooding operation per event in all circumstances. Although GMC can produce a larger number of floodings, we have seen that a hold-down timer can effectively curb this number.

In summary, the GMC protocol incurs far less computational workload at switches than does the MOSPF protocol, during both group creation periods and normal operation periods. This would allow the GMC protocol to sustain a larger number of simultaneous multicast groups. The use of hold-down timers in GMC offers a tradeoff between responsiveness to events and protocol overhead. GMC's ability to survive memory shortages is another significant advantage, especially when the number of groups is large and the demands on router memory are heavy. Combining all these factors, we conclude that the GMC protocol is more efficient in supporting individual multicast groups, and scales better in the number of groups, than does the MOSPF protocol.
Further, we emphasize again that the GMC protocol can accommodate topology algorithms other than Dijkstra's shortest path algorithm.

5.6 Summary

We have developed an LSR-based generic MC protocol that can be considered a distributed implementation of any MC topology computation algorithm. Its generality stems from the availability of two pieces of information at every switch: network topology information and MC images. Moreover, this generality enables the use of a single protocol for the construction of MCs of different types, optimized for different performance criteria. The correctness of the GMC protocol is established by formal proofs, and the behavior of the protocol is studied through simulation. The results of these simulations show that the protocol is able to efficiently handle bursts of membership changes in "batch mode," dramatically reducing protocol overheads during busy periods, while retaining its event-driven nature during normal operation periods. The GMC work shows that LSR provides a solid foundation for supporting one important aspect of group communication, namely the construction of multiparty communication channels. In the next chapter, we develop LSR-based solutions for two other facets of group communication, specifically, membership management and leadership consensus.

Chapter 6

Group Leader Election under Link-State Routing

In this chapter, we investigate an issue involved in both LSR and group communication: the leader election problem. To argue for including leader election as a core network service, we identify applications that can benefit from a network-level leader election protocol, including hierarchical LSR, address mapping, and multicast. A solution to the problem, called the Network Leader Election (NLE) protocol, is proposed for use in LSR-based networks. The protocol is robust, for it achieves leadership consensus in the presence of adverse events, such as leader failures and network partitioning.
The correctness of the protocol is proved formally. A simulation study reveals that the NLE protocol incurs low overhead in handling leader failures and in group creation, and compares favorably with a previous LSR-based election protocol, the ATM domain leader election protocol.

6.1 Introduction

The problem of leader election concerns the selection of a distinguished member from a set of computing systems that are interconnected by a network. This problem has been extensively studied in the context of distributed computing systems, for example, in coordinating access to shared resources [63] and in implementing fault-tolerant objects [64]. Generally speaking, solutions to the problem are distributed "host-level" algorithms that make use of various services provided by the network, such as reliable delivery of messages, in order to monitor the working status of the established leader or to cast ballots for a new leader. Well-known contributions in this area include the Bully algorithm [65] and the Ring algorithm [52]; more recent developments are described in [66].

In this chapter, we address the leader election problem as it occurs "inside" the network. The participants in the election process are assumed to be switches (or, interchangeably, routers), rather than hosts or application processes. Solutions to this problem are intended to support underlying network functions, as opposed to being directly invoked by user applications. Whereas a host-level election protocol typically treats the underlying network as a "black box," a network-level election protocol can see and take advantage of the internal operation of the network, in particular, the underlying routing protocol. Several network functions can make use of an efficient leader election protocol.
First, in Asynchronous Transfer Mode (ATM) networks and other hierarchical networks, the switches in a low-level subnetwork (called a routing domain) select a switch to represent the domain at the next routing level [13]; a solution to this domain leader election problem supports routing operations within the network. Second, many address-mapping services, such as the mapping between group addresses and member addresses [67] and the mapping between network addresses and link-layer addresses [68], use a central-server approach; a solution to this server assignment problem selects a leader to undertake the server responsibilities. Third, some IP multicast protocols, such as CBT [4] and PIM [2], identify a network node, called a core node, as the traffic transit center for each multicast group; a solution to this multicast core management problem supports multicast services provided by the network. A common requirement of solutions to the above problems is fault tolerance: since network functions and services are expected to survive not only single-point failures, but also component failures that may partition the network, the solutions to these problems must also survive such adverse scenarios.

Our proposed NLE protocol is based on LSR. Specifically, the NLE protocol extends LSR to include group-leader binding LSAs, which are used by group members to advertise their choice of leader to the rest of the group. Upon receiving such an LSA, other switches in the network either accept this selection, or choose and advertise an alternative leader. The objective of the NLE protocol is to achieve network consensus on leader bindings, even in the presence of adverse conditions. The efficiency of the protocol stems from the use of timestamps to identify obsolete advertisements.
We argue that previous solutions to the network-level group leader election problem either do not meet the stringent fault tolerance criteria discussed above, or are more costly (in terms of bandwidth consumption and switch workload) when compared to the NLE protocol. As an extension to LSR, the NLE protocol achieves the following fault tolerance properties.

1. [Leadership Consensus Property] Given a group G and a network that has been partitioned into a set of segments S_1, S_2, ..., S_k, k ≥ 1, there will be consensus on the leader within each segment S_i, and that leader will be an operational switch within the segment.

2. [Mutual Consensus Property] By requiring group members to report to the established leader, the NLE protocol ensures that, within each network segment S_i, the established leader maintains a member list for the group that includes those, and only those, group members in S_i.

It is to be noted that, when the network is not partitioned, the above consensus properties hold throughout the network. Simply put, the NLE protocol can handle leader failures and work properly under catastrophic scenarios such as network partitioning. Results of a simulation study show that these features can be achieved with minimal protocol overhead.

The remainder of this chapter is organized as follows. The design of the NLE protocol is presented in Section 6.2, and the correctness of the protocol, which is modeled as a consensus problem under LSR, is formally proved in Section 6.3. The performance of the NLE protocol and the ATM domain leader election protocol are compared via simulation in Section 6.4. In Section 6.5, we discuss the application of the NLE protocol to the address resolution problem and to the multicast core management problem; included are simulation results regarding the performance of NLE in creating multicast groups. Finally, a summary of this chapter is given in Section 6.6.

6.2 The NLE Protocol
6.2.1 Overview

Since some decision-making processes of the NLE protocol, such as the leader selection policy, are application dependent, we discuss the protocol operation in the context of the domain leader election problem. As described in Chapter 2, ATM's domain leader election protocol uses a rank-based scheme to select the leader (the switch with the highest leader priority becomes the leader). Adaptation of the NLE protocol to other problems is discussed in Section 6.5. The operation of the NLE protocol is summarized as follows.

1. For every group g, each switch x in the network maintains a leader binding, denoted as Binding_x(g), whose value is a triple (Leader_x(g), Source_x(g), Stamp_x(g)), where Leader_x(g) is the leader of the group g as perceived by x, Source_x(g) is the switch that suggested this binding, and Stamp_x(g) is the timestamp associated with the binding. The goal of the NLE protocol is to maintain consensus on Binding_x(g) values across the network.

2. When a switch x joins a group g, it searches for the Leader_x(g) entry in its local database. If the entry is not found, group g is said to be unbound at x. In this case, switch x selects a switch c as the leader of the group according to a leader selection policy, sets Leader_x(g) to c, and broadcasts this binding. For the domain leader election problem, the leader selection policy selects a reachable switch with the highest leader priority.

3. Once the switch x has a Leader_x(g) entry, it sends a JOIN-REQUEST message to switch Leader_x(g). The join operation is not considered successful until a JOIN-ACK returns from Leader_x(g). Further, the switch x must re-join g (that is, repeat the join process) each time the Leader_x(g) value changes.

4. When a switch x leaves a group g, it sends a QUIT-REQUEST to switch Leader_x(g). Again, the quit process does not finish until the corresponding QUIT-ACK returns from Leader_x(g).

5. When Leader_x(g) = x, switch x acts as the leader of the group g: it processes JOIN-REQUEST and QUIT-REQUEST messages, and returns the appropriate acknowledgments. Further, via the join and quit requests from members, the leader maintains a member list for g, denoted as ML_x(g). A member of g will be dropped from ML_x(g) if it sends a QUIT-REQUEST message or if it becomes unreachable from the leader x. We point out that, since members are required to re-join the group each time a new leader is elected, a new member list will be compiled at the new leader. Member lists are not required at switches other than the leader.

6. When a switch x that is a member of a group g finds the switch Leader_x(g) unreachable, switch x selects and broadcasts a new leader binding for g. To avoid a rush of new leader bindings from all members of g, a delay timer of random length is used to postpone the re-selection task. Typically, one member wakes up before the others and advertises a new binding; the remaining members simply accept the binding and re-join the group.

7. Even when switch Leader_x(g) is still reachable from x, the switch x may decide, according to application-specific leader performance criteria, to select and advertise a new leader for group g. Given a group g, an objection policy determines when a switch objects to the current leader binding and selects a new leader. For the domain leader election problem, a switch objects to the current domain leader when it discovers a reachable switch that has a higher leader priority
At a switch y != x, the binding Binding_y(g) for such a group g will be aged out if this periodic flooding is not received for a predetermined length of time.

6.2.2 State Machines and Events

At a switch x, the NLE protocol defines two finite state machines (FSMs) for each active group g: a Membership Status Machine, denoted as MSM(x, g), and a Leadership Consensus Machine, denoted as LCM(x, g). Both the LCM(x, g) and MSM(x, g) machines access the Binding_x(g) entry; such accesses are assumed to be atomic to avoid race conditions. Figure 6.1 shows the events processed by the two machines. The LCM(x, g) processes incoming leader bindings for the group g, and reacts to events that indicate problems with the current leader, such as leader-unreachable events and objection events defined by the objection policy. The MSM(x, g) handles join and quit events and is responsible for ensuring that the current leader, Leader_x(g), holds correct information regarding the membership status of the switch x. The MSM(x, g) also processes leader-change events, which are raised whenever the LCM(x, g) accepts a new binding for group g.

Figure 6.1: The finite state machines in NLE. (The MSM handles join and quit events and leader-change notifications from the LCM; the LCM handles binding LSA arrivals, leader-unreachable events, and other objection cases.)

6.2.3 The Operation of LCM

The state transition diagram for the LCM is depicted in Figure 6.2. As shown, an LCM comprises four states: EMPTY, PENDING, REMOTE, and LOCAL. The EMPTY state is the initial state of LCMs. When there is no binding regarding g at x, the LCM(x, g) is in the EMPTY state; the values of Leader_x(g) and Source_x(g) are undefined, and the value of Stamp_x(g) is defined to be zero. An LCM(x, g) is in the LOCAL state when Leader_x(g) = x, and in the REMOTE state when Leader_x(g) != x. The LCM sometimes uses a timer to postpone the task of leader selection.
When this happens, the machine enters the PENDING state, waiting for time-out.

Figure 6.2: The leadership consensus machine at a switch x for a group g (LCM(x, g)).

A binding LSA is a pair (g, (c, s, t)), where the first element g specifies the group and the second element (c, s, t) is the value of this binding. The LCM processes binding LSAs according to the rules below:

A1 An incoming binding LSA l = (g, (c, s, t)) will be accepted at a switch x if (t, s) > (Stamp_x(g), Source_x(g)); otherwise it is rejected at x. (The comparison is in lexicographical order.) This rule guarantees that more recent bindings override old ones but that the reverse will not happen. When l is accepted, its value (c, s, t) becomes the value of Binding_x(g). Subsequently, the LCM(x, g) enters either the LOCAL or REMOTE state, depending on whether the new Leader_x(g) is x or not.

A2 When a switch x proposes and advertises a leader c for a group g, it 1) increases Stamp_x(g) by one, 2) sets Source_x(g) to x and Leader_x(g) to c, and 3) floods a binding LSA (g, Binding_x(g)). The LCM(x, g) then enters either the LOCAL or REMOTE state, depending on whether the new Leader_x(g) is x or not.

There are two situations where the LCM(x, g) may use Rule A2 to propose and advertise new leader bindings for the group g: when Leader_x(g) becomes unreachable, and when an objection event is raised according to the objection policy. In the latter case, the LCM proposes a new leader only if the machine is in the REMOTE or LOCAL state.
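Rules A1 and A2 can be sketched as follows (a minimal sketch, assuming string switch identifiers and a caller-supplied flood callback; the class and method names are illustrative, not part of the protocol specification):

```python
# Sketch of LCM rules A1 and A2; all names are illustrative.
class LCM:
    def __init__(self, x, g, flood):
        self.x, self.g = x, g            # this switch and the group
        self.leader = None               # Leader_x(g); undefined in EMPTY
        self.source = None               # Source_x(g); undefined in EMPTY
        self.stamp = 0                   # Stamp_x(g); zero in EMPTY
        self.flood = flood               # callback that broadcasts an LSA

    def on_binding_lsa(self, c, s, t):
        """Rule A1: accept (c, s, t) iff (t, s) beats the stored (stamp, source)."""
        if (t, s) > (self.stamp, self.source or ""):
            self.leader, self.source, self.stamp = c, s, t
            return "LOCAL" if c == self.x else "REMOTE"
        return "rejected"

    def propose(self, c):
        """Rule A2: bump the stamp, record x as source, and flood the binding."""
        self.stamp += 1
        self.source, self.leader = self.x, c
        self.flood((self.g, (self.leader, self.source, self.stamp)))
        return "LOCAL" if c == self.x else "REMOTE"
```

Because the comparison is lexicographic on (t, s), a freshly proposed binding (with an incremented stamp) always overrides older ones, while ties on the stamp are broken deterministically by the source identifier.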
When the current leader of g becomes unreachable from a switch x, the switch is triggered to select and advertise a new leader. To avoid a rush of simultaneous leader binding LSAs from group members, the LCM(x, g) sets up a delay timer and enters the PENDING state. There are two ways for the LCM to leave the PENDING state: 1) the timer fires, and the machine selects/advertises a new leader according to Rule A2, or 2) an "acceptable" binding LSA arrives before time-out. In case 2, the delay timer is canceled, and the LCM processes the LSA according to Rule A1. We will discuss the effects of various timer values in Section 6.4.

When the LCM(x, g) enters the LOCAL state, switch x must create a member list for group g and process JOIN-REQUEST/QUIT-REQUEST messages from members of g. The member list of g is created every time LCM(x, g) enters the LOCAL state, and is destroyed every time LCM(x, g) leaves that state. JOIN-REQUEST/QUIT-REQUEST messages will be acknowledged and used to update the member list when LCM(x, g) is in the LOCAL state, but are discarded silently when the machine is in any other state.

When an unreachability event concerns a switch y that is not the leader of the group g, the action of LCM(x, g) depends on whether x considers itself to be the leader. If so (that is, x = Leader_x(g)), x removes y from the member list of g; otherwise, it discards the event.

6.2.4 The Operation of MSM

The MSM at a switch x for a group g, denoted as MSM(x, g), reacts to join(g) and quit(g) events. An MSM has four states: MEMBER, JOINING, NON-MEMBER, and LEAVING, among which the NON-MEMBER state is the initial state. With respect to a group g, the MSM at a switch x is in the JOINING state if it wishes to join the group but has not completed the "registration" procedure, namely, the exchange of JOIN-REQUEST and JOIN-ACK messages with the leader, Leader_x(g). After the JOIN-ACK message is received, the joining member enters the MEMBER state.
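The join side of these transitions can be sketched as follows (a minimal sketch, assuming a caller-supplied send callback; retransmission on message loss is omitted, and the names are illustrative):

```python
# Join-side sketch of the MSM; quit handling is analogous.
class MSM:
    def __init__(self, x, g, send):
        self.x, self.g, self.send = x, g, send
        self.state = "NON-MEMBER"        # initial MSM state

    def join(self, leader):
        """join(g) event: enter JOINING and register with the current leader."""
        self.state = "JOINING"
        self.send(leader, ("JOIN-REQUEST", self.g, self.x))

    def on_join_ack(self):
        """A JOIN-ACK from the leader completes the registration procedure."""
        if self.state == "JOINING":
            self.state = "MEMBER"

    def on_leader_change(self, new_leader):
        """A leader-change event forces a member to re-join under the new leader."""
        if self.state in ("MEMBER", "JOINING"):
            self.join(new_leader)
```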
Defined similarly, a member of g is in the LEAVING state during the exchange of QUIT-REQUEST and QUIT-ACK messages with the leader, and will enter the NON-MEMBER state after completion. Retransmissions of REQUEST messages may be necessary to ensure successful delivery.

Figure 6.3: The membership status machine at a switch x for a group g (MSM(x, g)).

The MSM does not deal directly with leader-unreachable events. However, when the LCM(x, g) changes the leader binding due to leader-unreachable events or other objection events, it generates a leader-change event to be handled by the MSM(x, g). If the switch is a member of the group g, the switch must re-join the group; that is, the MSM(x, g) machine enters the JOINING state so that JOIN-REQUEST and JOIN-ACK messages are exchanged with the new leader.

If a switch x joins a group g when the LCM(x, g) is in the EMPTY state, the switch must select and advertise a leader for the group, following the procedure defined in Rule A2. This advertisement will be received by LCM(v, g) for every switch v in the network, including x itself.

6.3 Proof of Correctness

We prove in this section that the NLE protocol achieves consensus on group-leader bindings throughout the network. However, we must be careful when defining what can be proved and what cannot be proved. For example, consider a hypothetical scenario where, whenever a switch x is suggested as the leader of a group g, that switch crashes immediately. The other network switches will detect the unreachability of x and (some of them) will propose new leaders.
Meanwhile, switch x resumes execution shortly after new leader binding proposals are made. If the scenario repeats indefinitely and every newly suggested leader immediately crashes, then it is impossible for any leader-management algorithm to maintain stable and consistent leader bindings for the group. We conclude that a more reasonable goal is to study the behavior of the NLE protocol in response to a finite set of events. (Similar assumptions have been used in the "classic" LSR consensus problem, where the switches must reach consensus on network images [45].) Precisely, let us be given a group g and a finite set of events, E, which may include network status dynamics, group membership changes, unreachability events, and other objection events. In addition, E is assumed to contain at least one join event regarding g (otherwise the group is inactive and does not participate in the protocol). We will show that consensus is eventually reached throughout the network on the Binding(g) entries.

We begin with the leadership consensus property. We first prove the property for the case where the network remains connected after the event set E. Following that, we consider the case where the network is partitioned.

Theorem 3 [Weak Leadership Consensus Property] Let E be a set of network events, including group membership change and network status change events, and let g be a group with at least one join event in E. Assuming that the network is not partitioned after all events in E have taken place, all network switches will eventually agree upon the same leader binding for group g.

Proof: Considering the set of occurrence times of the events in E, we are interested in the maximum such element, t_last, the time of the last event in E. It is to be noted that the assumption of a connected network after time t_last does allow for some non-operational switches, as long as the "survivors" remain reachable from one another.
Let M_E be the set of switches that are operational and are members of g after E. If M_E is empty, the theorem is true vacuously. Let us consider the more interesting cases where at least one member switch survives E. Let A be the non-empty set of switches that are operational after t_last, and let B be the set of leader binding LSAs for the group g produced in response to the events in E. Let B_A be the subset of B such that (g, (c, s, t)) is in B_A if and only if c is in A. In other words, B_A is the set of leader binding LSAs for which the designated leader c is operational after t_last.

We claim two properties regarding the set B_A. First, the set B_A cannot be empty, since members of g will select new reachable leaders if there are no valid bindings for g. (A binding is invalid at a switch x if the designated leader is unreachable from x.) Second, the set B_A is finite. In fact, we claim that the set B is finite. This property comes from the fact that the NLE protocol produces a finite number of bindings in response to a single event of any type. The worst case is M binding LSAs per event, where M is the number of members in the group; this happens when the current leader fails and all M members select/advertise new leaders. Hence, a very loose upper bound on the cardinality of B (and B_A) is N x |E|, where N is the number of switches in the network and an upper bound of M.

Recall that, for two bindings (c1, s1, t1) and (c2, s2, t2), (c1, s1, t1) > (c2, s2, t2) if and only if (t1, s1) > (t2, s2). Let b_max = (c, s, t) be the maximum binding in B_A. This maximum element is well-defined because the set B_A is finite and non-empty. Since the switch c is connected to the switches in A after t_last, the periodic advertisements of b_max from c will eventually be received by all switches in A, which must accept the binding and ignore any others, due to the maximality of b_max. The binding b_max becomes the final consensus binding among all operational switches, and the theorem is proved.
The proof of the theorem also suggests that the final leader binding is "correct" in the sense that b_max is in B_A (that is, the final winner is an operational switch after E). Somewhat surprisingly, showing consensus when the assumption of eventual network connectivity is removed is not difficult at all, as shown in the following theorem.

Theorem 4 [Leadership Consensus Property] Let E be a finite event set that partitions the network into k segments, S1, S2, ..., Sk, where k >= 1. Let g be a group with at least one join event in E. The NLE protocol will achieve consensus on leader bindings for g within each segment S_i, for 1 <= i <= k.

Proof: To see the correctness of the theorem, we apply the argument regarding the set A in the previous proof to each segment S_i. That is, we simply consider switches in S_i to be operational and all other switches to be non-operational.

Next, we consider the mutual consensus property of the NLE protocol.

Theorem 5 [Mutual Consensus Property] Given a group g and a set of events E that partitions the network into k segments, S1, S2, ..., Sk, where k >= 1, the consensus leader of g in S_i produces a member list that includes those members, and only those members, in S_i.

Proof: In the following discussion, the consensus leader of g in S_i is denoted as Leader_i(g), and the member list maintained by the leader is denoted as ML_i(g). It is not difficult to see that members not in S_i after E will eventually be removed from ML_i(g), due to unreachability events about these members. It remains to be shown that all members of g in S_i will be added to the list ML_i(g).

A property of the MSM, shown in Figure 6.3, is that the MSM insists on having the current leader hear about the current membership status. However, some previous membership changes may not be learned by the leader.
For example, if a switch x decides to leave a group g while it is in the JOINING state with respect to g, the MSM simply enters the LEAVING state and issues a QUIT-REQUEST; the previous JOIN-REQUEST and JOIN-ACK exchange process is aborted. As a result of this design, given a sequence of interleaved join/quit events, the MSM does not guarantee the success of all respective REQUEST-ACK exchanges, but will enforce the successful exchange with respect to the last event in the sequence.

Let us assume that there is a switch y in S_i that is a member of g after the events in E, but y is not in ML_i(g). By the previous observation, we are concerned only with the REQUEST-ACK exchange process of the last membership change event, which must be a join event. The assumption that y is not in ML_i(g) implies that the leader in S_i does not receive a JOIN-REQUEST message from y, and hence will not return a JOIN-ACK message. Consequently, the switch y remains in the JOINING state, where the JOIN-REQUEST message will be issued repeatedly until the corresponding acknowledgment is heard. Since y and Leader_i(g) are connected, this process will eventually complete, putting y on the ML_i(g). This is a contradiction to the assumption about y, concluding the proof.

6.4 Performance Evaluation

In this section, we investigate the performance of the NLE protocol in handling leader failures. Specifically, the NLE protocol is compared against the ATM domain leader election protocol [13]. In our simulations, networks comprising up to 400 switches were used. For each network size, 40 graphs were generated randomly, and two simulation sessions were conducted on each graph. Table 6.1 shows the characteristics of the graphs generated. In the table, the symbol T_f denotes the worst-case time to perform a flooding operation in a given network. As in the simulations described in previous chapters, we used software overheads of 600 usec in each LSA forwarding.

Network size   Avg. degree   Avg. diameter   T_f (in ms)
 10            3.6           3.25            3.56
 20            3.57          4.75            5.12
 40            3.73          6.18            6.71
 60            3.88          6.8             7.5
 80            3.91          7.08            7.87
100            4.12          7.15            8.1
120            4.10          7.53            8.42
140            4.20          7.73            8.64
160            4.22          7.58            8.8
180            4.31          7.75            8.96
200            4.29          7.95            9.08
250            4.50          7.95            9.34
300            4.70          7.98            9.47
350            4.87          7.85            9.53
400            5.07          7.85            9.62

Table 6.1: Characteristics of randomly generated graphs.

We consider two metrics for the performance of leader election: the leader-binding convergence time and the number of leader binding LSAs produced for an election. The former refers to the length of the period from the moment the election begins to the moment that all network switches agree on the same leader node. (When an election is held due to the failure of the current leader, the election begins at the moment the leader fails.) The latter measures the number of leader-binding LSAs that are sent before consensus on the leader node is reached. In addition, we measured the bandwidth consumption of the two approaches. This is motivated by the fact that switches use point-to-point messages to cast ballots in the NLE protocol, but must use flooding operations in the ATM election protocol.

When a leader fails under the NLE protocol, group members select a new leader and send join requests to that switch. Since all members are informed (by corresponding LSAs) almost simultaneously, they all could potentially rush to suggest new leaders, resulting in a large number of conflicting leader binding LSAs. The NLE protocol avoids this problem by deferring member rejoins with a random timer. We assume that the current leader crashes at time 0, and that delay timers are uniformly distributed between 0 and a simulation parameter max_delay. We used max_delay values of 0.1 seconds, 1 second, and 10 seconds.

The results regarding the metric of the number of bindings are plotted in Figure 6.4(a). Even the very short maximum delay value (0.1 seconds) introduces fewer than 16 bindings in 400-switch networks.
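The intuition behind these numbers can be illustrated with a small Monte Carlo sketch (illustrative only, not the simulator used in this study): roughly, the members that generate conflicting bindings are those whose random timers fire before the earliest proposal, which takes about T_f to flood, can reach them.

```python
import random

def avg_conflicting_bindings(members, max_delay, t_flood, trials=2000):
    """Estimate the number of conflicting binding LSAs after a leader failure:
    a member proposes only if its uniform random timer fires before the
    earliest proposal (flooded in roughly t_flood) can reach it."""
    total = 0
    for _ in range(trials):
        timers = [random.uniform(0.0, max_delay) for _ in range(members)]
        first = min(timers)
        # Members whose timers fire within t_flood of the earliest timer
        # have not yet heard the first binding, so they also propose.
        total += sum(1 for t in timers if t < first + t_flood)
    return total / trials

# Widening the deferral window sharply reduces concurrent proposals; e.g.,
# with 400 members and t_flood of about 10 ms, a 0.1 s window yields tens
# of proposals while a 10 s window yields close to one.
```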
When the maximum delay is set to 1 second, fewer than 3 bindings are generated in large networks. When the maximum delay value of 10 seconds is used, only one binding is created in almost all simulation sessions. Although not shown in the figure, the current ATM election protocol produces N preferred-leader LSAs, the equivalent of binding LSAs, in an N-switch network for every leader failure event.

The results for convergence time are plotted in Figure 6.4(b). For this performance metric, the shorter the maximum delay value, the faster the convergence, since a short maximum delay value produces early time-out of the delay timers, and hence switches take less time to flood new leader bindings. Also, the larger the network size, the faster the bindings converge; not surprisingly, a large number of switches that set up random delay timers tends to produce one that times out quickly.

The results for bandwidth consumption are plotted in Figures 6.5(a) and (b). Bandwidth consumption is measured by counting the total number of links traversed by every LSA and JOIN-REQUEST/ACK message associated with the election. Figure 6.5(a) shows the results of the NLE protocol, which uses flooding operations to broadcast leader bindings and point-to-point messages to cast ballots (that is, to send JOIN-REQUEST messages). The curves in Figure 6.5(a) conform with those in Figure 6.4(a); that is, the more concurrent bindings produced, the more bandwidth consumed. The bandwidth consumption of the ATM election protocol is significantly larger than that of the NLE protocol, as shown in Figure 6.5(b).

Figure 6.4: Performance of the NLE protocol: (a) number of bindings, (b) convergence time.

Figure 6.5: Bandwidth usage of alternative election protocols: (a) NLE bandwidth, (b) ATM bandwidth.

In summary, compared to the ATM leader election protocol, the NLE protocol incurs far fewer flooding operations and consumes a small fraction of the bandwidth. We further emphasize that the ATM leader election protocol requires every switch to periodically advertise its preferred leader, while the NLE protocol requires only the leader to periodically broadcast its leader status. We conclude that the NLE protocol is more efficient than the ATM leader election protocol, while being equally robust.

6.5 Other Potential Uses of The NLE Protocol

We have discussed the use of the NLE protocol for the domain leader election problem. In this section, we briefly discuss the application of the protocol to two other important network services, namely, multicast address resolution and multicast core management. In addition, we evaluate the performance of the NLE protocol in group creation.

6.5.1 Multicast Address Resolution

In the last several years, a great deal of research has addressed the issue of implementing IP over new link layer protocols, such as ATM/AAL5.
One of the difficult tasks in implementing IP over ATM networks is how to handle multicast addressing. Whereas IP allows a source node to send a datagram to an abstract multicast group address, the current ATM standard does not support such an abstraction. Rather, ATM supports multicasting through point-to-multipoint unidirectional virtual channels, which require the sender to explicitly establish a connection to each destination.

One approach to this problem is to use a Multicast Address Resolution Server (MARS) [67], a central server that acts as a registry, associating IP multicast group identifiers with the ATM interfaces representing the members of the groups. The MARS is queried when an IP multicast address needs to be resolved, and hosts and routers must update the MARS when they join and leave groups. As a centralized solution, however, the potential for MARS failure is an important issue. The approach described in [67] is to manually configure nodes with the addresses of one or more backup MARS nodes that they can contact in descending order of preference.

An alternative method is to use an election protocol, such as NLE, to "automatically" handle MARS failures and, just as important, accommodate network partitions. Such an implementation might work as follows. A specific group identifier (call it MARS-GID) is reserved for the election of the MARS; every switch is assumed to be a member of this group. The selection and objection policies of the MARS follow a ranking scheme similar to those for domain leader election. If the current MARS crashes, the NLE protocol is used to establish consensus on a new Leader(MARS-GID) binding. For the new MARS to operate properly, member lists must be re-collected. To this end, every switch x in the network maintains an interested multicast addresses (IMA) list, M_x = {m1, m2, ..., mk}, where each element m_i is a multicast address that one or more of the attached hosts is interested in.
This list is usually maintained by a local membership management protocol, such as IGMP [7]. Since every switch must send a JOIN-REQUEST to a newly elected MARS, reconstruction of member lists at the MARS can be implemented by augmenting each JOIN-REQUEST message from switch x to include a copy of M_x. The consensus properties of the NLE protocol guarantee that, should the network be partitioned, there will be a MARS within each segment that maintains multicast group member lists for those, and only those, switches in the segment.

6.5.2 Multicast Core Management

As discussed in Chapter 2, some prominent IP multicast protocols, such as CBT [4] and PIM [2], associate a multicast traffic transit center, or core node, with each multicast group. In such approaches, datagrams destined to a multicast group are first forwarded to the core node, from which they are distributed along a multicast tree to reach group members. The association of the core node with a multicast group can be modeled as a leader election problem, and the NLE protocol can be applied.

One approach is as follows. We assume that a core election is held whenever a multicast group is created (that is, when the first member joins), and that a new core is elected if the current core fails. Regarding the core/leader selection policy, we can assume that the default is the random member policy [46]: whenever a member is required to select and advertise a core node (including at group creation time), the member simply recommends itself. A number of other core selection policies are discussed in [46] and could be incorporated into the NLE protocol. As with the applications discussed earlier, the mutual consensus property of the NLE protocol enables arbitrary multicast groups to handle network partitions and re-unifications.

The maintenance of the leader binding of every active group at every switch in a network may raise the concern of scalability.
In Chapter 7, we describe another core-management method that addresses this issue by using the NLE protocol to select a central server to maintain the leader bindings of all active groups. The method presented above, however, has an advantage in group-join time, because joining switches do not have to query a server to resolve leader-group bindings. This feature is important in situations where members of a group join and exit at a high rate.

6.5.3 Performance of Multicast Group Creation

When the NLE protocol is used for domain leader election and MARS election, all switches in the network are members of the (single) group, and the group is assumed to be created at network initialization, a relatively rare event. In the case of multicast core management, on the other hand, there are many multicast groups directly tied to applications, and group creation time may be important to the performance of those applications. Therefore, we conducted a study to evaluate the performance of the NLE protocol when a multicast group is created. As in the previous performance study, we are interested in two performance metrics: convergence time and the number of binding LSAs. It turns out that the convergence time in this case is quite predictable, as shown in the following theorem.

Theorem 6 Given the flooding diameter T_f of a network (the worst-case time to finish a flooding operation), the convergence time for group creation under NLE is less than 2T_f, assuming that no network component failures occur during group creation.

Proof: Assume that the first member joins a group at time 0. This member finds the group unbound and advertises a leader binding, which will reach all network switches by T_f. Assuming no component failures, any other switch must join the group by time T_f if it is to find the group unbound and propose its own binding. Flooding of any such additional bindings will require another T_f time to finish in the worst case.
Therefore, after time 2T_f, all network switches will have received all the bindings that have been flooded, and will agree upon the one with the largest value. Hence, the worst-case convergence time for leader binding is 2T_f.

To investigate the number of binding LSAs produced by group creation, we simulated the creation periods of multicast sessions with M participants. The arrival time of each participant is normally distributed with mean zero, the predetermined startup time of the group. The standard deviation value is set in such a way that 99% of the participants arrive within a predetermined time interval; this interval will simply be called an arrival interval. We used arrival intervals of lengths 1 second and 0.1 seconds. A switch joins the multicast group when its first attached participating host arrives. In a simulation session, the size of the participant population, M, is controlled by the participant-to-switch ratio; we used the values of 1 and 10 in this investigation. As such, our simulation study covers a wide range of participant population sizes, from 10 (obtained by 10-switch networks with 1 participant per switch) to 4000 (obtained by 400-switch networks with 10 participants per switch). Simulation sessions involving a small number of participants could represent teleconferencing applications, whereas those involving very large population sizes may represent Distributed Interactive Simulation (DIS) applications. The combination of very short arrival intervals with very large population sizes produces extremely busy group creation periods, in order to stress the NLE protocol.

Figure 6.6 shows the results of this study. Figure 6.6(a) plots the results when using the 0.1 second arrival interval. The worst case in the figure is only 3.0, meaning that even when 4000 participants join a multicast group within 0.1 seconds, the NLE protocol produces only three leader binding LSAs.
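As an aside, the standard deviation implied by the arrival model above can be computed directly; a short sketch using Python's standard library (the central 99% of a normal distribution lies within about +/-2.576 standard deviations of the mean, so sigma = interval / (2 x 2.576)):

```python
from statistics import NormalDist

def arrival_sigma(interval):
    """Standard deviation such that 99% of N(0, sigma) arrival times fall
    within [-interval/2, +interval/2]."""
    z = NormalDist().inv_cdf(0.995)      # ~2.576, the 99.5th percentile
    return (interval / 2) / z

# For the two arrival intervals used in the study:
# arrival_sigma(0.1) is about 0.0194 s, and arrival_sigma(1.0) about 0.194 s.
```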
Figure 6.6(b) plots the results for the 1 second arrival interval. As shown, this relatively longer (but still very short) arrival interval produces virtually no redundant bindings; that is, there is one leader binding produced per group creation event. We believe that the results in Figure 6.6 demonstrate that the NLE protocol is a viable method for handling many real-world situations.

Figure 6.6: Number of bindings generated for group creation: (a) 0.1 second arrival intervals, (b) 1 second arrival intervals.

6.6 Summary

We have addressed two facets of group communication in LSR-based networks. Specifically, the leader election problem and the membership management problem have been studied in a context where the participants of the election process are switching elements in LSR-based networks. The proposed solution, called the Network-level Leader Election protocol, models the group-leader binding problem as a consensus problem under link-state routing. In this model, the local network images at switches are extended with leader binding entries, whose network-wide consistency is guaranteed by the protocol. We have formally proved the correctness of the NLE protocol, including its leadership consensus property and mutual consensus property, under any combination of group member and network status changes. Our simulation studies reveal that the NLE protocol incurs minimal overheads for multicast group creation and moderate overheads to handle leader failures.
The performance of the NLE protocol compares favorably with a previous network group leader election protocol for ATM networks. The efficiency of the NLE protocol enables its use by both the internal operations of LSR (such as hierarchical routing and address mapping) and multiparty communication applications (for example, those that use core-based multicast). In the next chapter, we propose a second multicast core management method that uses the NLE protocol to select a central server, which manages the core nodes for active multicast groups.

Chapter 7

Multicast Core Management

The problem of multicast core management concerns assigning a network switching element to each multicast group for use as the root of the multicast tree of the group. In the previous chapter, we applied the NLE protocol to this problem in a per-group manner; that is, each group individually holds an election to select a respective core node. In this chapter, we pursue an alternative approach to the problem. The proposed method, called the LSR-based Core Management (LCM) protocol, uses the NLE protocol to elect a central server, called the core binding server (CBS), to manage core-group bindings for all active groups within the network. The LCM protocol selects core nodes for groups automatically, handles the failures of both core nodes and the CBS itself, supports core migration whereby multicast groups can adapt to membership and network status changes, and survives network partitioning. The LCM protocol is based on LSR: it uses the network status information provided by LSR to monitor the operational status of current core nodes and takes advantage of the shortest-path trees computed by LSR to support core migration. Our simulation results reveal that the central server can sustain extremely high workloads, and demonstrate the effectiveness of our core selection and core migration methods.
7.1 Introduction

As discussed in Chapter 2, a common technique to support multicast, found in the CBT [4, 3] and PIM [2] protocols, is core-based forwarding (CBF). A CBF multicast protocol associates a core node with each multicast group; the multicast tree of the group is defined to be the union of core-to-member shortest paths. Messages destined for the group are first sent to the core node, which forwards the message along branches of the tree. An advantage of CBF multicast protocols is that they enable simple methods for nodes to join and leave the group. We illustrated in Figure 2.3 the member join operation of the CBT protocol. That example assumes that the joining member has learned a priori the identity of the core node of the target group. Indeed, many CBF multicast protocols do not concern themselves with core management issues, such as who selects the core node (for example, an administrative authority, users/hosts, or the network), how a core node is selected (that is, which core selection algorithm to use), when a core node is selected (for instance, at the moment a group is created and/or some other time(s) during the life span of the group), how the identity of the core is disseminated to interested parties, and where the identities of the cores of active groups are stored. Before addressing these questions, we identify three basic requirements for core management.

1. Network-level core selection. If the task of core selection is performed by hosts, then the multicast interface between hosts and the network depends on the type of the multicast protocol used by the network. (In networks that use a CBF multicast protocol, for example, a join-group request from a host must include the core address of the group, whereas in networks that use other types of multicast protocols such information is not required.) Hence, automatic core selection by the network is preferred over host-level approaches, such as [69].

2. Core failure handling.
A potential weakness of CBF multicast is the single point of failure at the core. Methods are needed to assign new cores to multicast groups whose current cores have failed.

3. Core migration. During the lifetime of a multicast application, the members of a group may change, and the resource availability in the network may fluctuate. The purpose of core migration is to identify a new core node for the group whose corresponding multicast tree, determined by the current set of group members and present network status, will likely result in significantly better multicast performance than the tree based on the current core.

In Chapter 6, we discussed how the NLE protocol can be applied to the core management problem. In that approach, the NLE protocol is applied on a per-group basis to elect core nodes for multicast groups and handle core failures. The practice of storing core-to-group bindings (that is, leader bindings in NLE's terminology) for all the active groups at every router in the network has advantages and disadvantages. On the positive side, the approach adds no additional delays and overheads to group join operations in CBF multicast, because joining members can resolve core-group mappings locally. This merit may be important for multiparty communication applications whose participants join and leave at a high rate. On the negative side, however, the approach raises the concern of scalability when used to support a very large number of simultaneous multicast groups. Alternatively, one could use a bootstrap mechanism, as proposed by the PIM community [33]. In this method, when a multicast group is created or the core node of an existing group has failed, a hash function is used to map the address/ID of the group to a router in the network as its core node. As such, core bindings need not be stored at all, for all members of a group will map the ID of the group to the same core node.
Complexities of the bootstrap mechanism, however, stem from the tasks of discovering and disseminating the identities and operational status of routers in the network. Further, core migration is not supported. To remedy this problem, an independent core migration protocol can be used; at least one such protocol has been proposed by Donahoo et al. [70]. In Donahoo's protocol, the core node of a multicast group periodically sends probing messages to discover a subset of group members and a set of nodes which, if designated as the new core, may improve multicast performance. The core node then sends the list of "representative" members to the selected core candidates, which use sophisticated heuristics to evaluate their performance as the core. Evaluation results are sent back to the current core node, which selects the new core.

In this chapter, we propose a network-level core management method for use in LSR-based networks. The resulting LCM protocol uses the NLE protocol to select a core binding server (CBS), which manages the core-to-group bindings for all active multicast groups within a network. The LCM protocol works closely with LSR, using such information as the identities and operational status of network routers and the topology of the network, to support all three core management issues listed above. A contribution of this work is to demonstrate that a single, and yet relatively simple, core management solution can be developed under LSR. We emphasize that LSR-based protocols, such as LCM, are not intended for direct implementation in very large networks or internets, due to the scalability issues of LSR discussed in Chapter 2. For some CBF multicast protocols, such as the PIM protocol, a core node for a multicast address/group m is assigned within each routing domain that contains at least one member of m. Core management issues under such circumstances are by definition "local"; LCM could be used directly by such protocols.
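The bootstrap hashing idea discussed above can be illustrated with a short sketch; the hash scheme and names here are hypothetical, intended only to show why no core bindings need be stored when every member applies the same hash to the same router list.

```python
import hashlib

def bootstrap_core(group_id, routers):
    """Deterministically map a multicast group ID to a core router
    (bootstrap-style sketch): every member hashing the same ID over
    the same router list picks the same core, so core bindings need
    not be stored anywhere."""
    digest = hashlib.sha1(group_id.encode()).digest()
    return sorted(routers)[int.from_bytes(digest[:4], "big") % len(routers)]

routers = ["r1", "r2", "r3", "r4"]
core = bootstrap_core("224.0.1.9", routers)
assert core == bootstrap_core("224.0.1.9", routers)  # same answer at every member
```

As the text notes, the difficulty lies not in the hash itself but in keeping every member's view of the live router list consistent.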
In other CBF multicast protocols, such as the CBT protocol, there is only one core node for a given group m throughout the entire Internet. If the members of m are not restricted to a routing domain, hierarchical core management must be used. In this chapter, we present the "basic" LCM protocol; its extension to hierarchical networks is part of our ongoing research. Hereafter, we use the term "network" to refer to a set of routers that are governed by a single administrative authority and which collectively execute LSR.

The remainder of this chapter is organized as follows. We present the LCM protocol in Section 7.2. Various performance issues, including the workload at the CBS and multicast performance, are investigated through a simulation study, whose results are presented in Section 7.3. These results justify the use of a central server for core management, and show that the performance of multicast can be improved significantly by the simple core migration heuristic supported by LCM. A summary of this work is given in Section 7.4.

7.2 The LCM Protocol

As discussed, the LCM protocol uses a central server, the CBS, to manage core-to-group bindings. Precisely, the CBS of a network maintains a list of core bindings C = {Core(m) | m is an active multicast address}. When a host wishes to join a multicast group m, its local router x sends a CORE-MAPPING(m) message to the CBS. If the binding Core(m) is contained in the list C, then the CBS places this binding in a CORE-ADDRESS message that is returned to x. Otherwise, the CBS selects a core for m according to an initial core selection heuristic, and adds this binding to C before returning the CORE-ADDRESS message. After obtaining the binding Core(m), router x attaches itself to the multicast tree of m using the procedure defined by the underlying CBF multicast protocol.
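The CBS lookup path just described can be summarized in a few lines. This is an illustrative sketch (class and method names are ours), folding in the first-member heuristic that LCM adopts for initial core selection:

```python
class CoreBindingServer:
    """Illustrative CBS sketch: resolves Core(m) from the binding list C,
    creating a binding on first contact via the first-member heuristic
    (the requesting router itself becomes the core)."""

    def __init__(self):
        self.bindings = {}  # the list C: multicast address m -> core router

    def core_mapping(self, m, requesting_router):
        """Handle a CORE-MAPPING(m) request; the return value models
        the CORE-ADDRESS reply."""
        if m not in self.bindings:
            self.bindings[m] = requesting_router  # first-member heuristic
        return self.bindings[m]

cbs = CoreBindingServer()
print(cbs.core_mapping("g1", "x"))  # first member "x" becomes the core
print(cbs.core_mapping("g1", "y"))  # later joiners receive the same binding
```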
When all attached hosts of router x have departed from group m, router x follows the procedure of the given multicast protocol to exit the multicast tree, without the involvement of the CBS.

Initial core selection. When the first router member of a multicast group m asks the CBS for the core identity of m, group m becomes active and the CBS must select a core node for m. Since at this moment no further membership information regarding m is available, solutions to this initial core selection problem are limited. Previously proposed methods include random selection (randomly pick a router in the network) and random member (randomly pick a router member) [46]. The LCM protocol adopts a variation of the random member heuristic, called the first-member heuristic, which operates as follows: when the CBS receives a CORE-MAPPING(m) request from router x for a group m whose Core(m) does not exist in C, it sets Core(m) to x.

CBS election. The identity of the CBS is not statically configured, but rather is dynamically chosen by the NLE protocol. A specific group identifier (call it CBS-GID) is reserved for the election of the CBS. Every router x in the network is assumed to be a member of this group, and maintains a leader binding Leader_x(CBS-GID), to which CORE-MAPPING messages are sent. The selection and objection policies of the CBS follow a ranking scheme similar to those for domain leader election.

CBS failure handling. To handle CBS failures properly, not only must a new CBS be elected, but also the core binding list C must be re-collected at the new CBS. For this purpose, each router x maintains a list of core bindings that designate itself as the core; that is, each router x maintains C_x = {Core(m) | m is an active multicast address and Core(m) = x}. The list C_x is included in the ballot(s) sent from x during election.
Since the CBS will receive ballots from all the routers within the network/segment, it can collect all bindings in C, except those that designated the old CBS as the core of a group, which are discussed below. Let us consider a network that comprises 4 routers: W (the current CBS), X, Y, and Z. Let C = {(1,W), (2,X), (3,X), (4,X), (5,Y), (6,Z), (7,Z)}, where the pair (m, x) denotes Core(m) = x. The entire list C is maintained at W, and the partial binding lists at individual routers are C_W = {(1,W)}, C_X = {(2,X), (3,X), (4,X)}, C_Y = {(5,Y)}, and C_Z = {(6,Z), (7,Z)}. Let us assume that router W has failed and that router Y is elected as the new CBS. Since Y will receive partial binding lists, which are contained in respective ballots, from routers X and Z, it reconstructs a new binding list C = {(2,X), (3,X), (4,X), (5,Y), (6,Z), (7,Z)}. However, the core bindings relating to W are missing. To remedy this problem, bindings relating to the old CBS must be treated as a special case. In LCM, any router x that is a member of a group m whose Core(m) = CBS(x) must clear binding Core(m) whenever the value of CBS(x) changes, and must consult the new CBS for a new core binding for m. In the previous example, if group 1 has two members X and Z, then both routers must clear their local Core(1) entries and ask the new CBS Y to provide a new binding for group 1. Of course, such a binding does not exist in the (re-created) binding list C, and consequently the new CBS Y considers group 1 as a newly created group, and uses the initial core selection method to choose a new core node for group 1.

Core failure handling. The CBS uses the network topology information provided by the underlying LSR protocol to monitor all the core nodes listed in C.
Specifically, whenever the CBS loses connectivity to the core node of a group m, it randomly selects a router as the new core of m and advertises this new core binding throughout the network, using the flooding algorithm supported by the underlying LSR protocol. Both the router connectivity information required in core failure detection and the router identities required by the random selection heuristic are made available to the CBS by the underlying LSR protocol. Although our simulation results, presented in the subsequent section, reveal that randomly selecting the core node from among all routers typically does not result in good multicast trees when compared to many other core selection methods, the new core node of the "victim" group can invoke LCM's core migration method, discussed below, to regain (or obtain even better) performance.

Core migration. The core migration method used in LCM assumes that the core node of a group m maintains a (router) member list of m. This list can be compiled and updated if the JOIN-REQUEST and QUIT-REQUEST messages defined by the underlying CBF multicast protocol are delivered to the core node, in addition to the first router on the tree. (The PIM protocol satisfies this requirement. However, minor changes are required for other CBF protocols to meet this requirement.) Periodically, the core node computes a shortest-path tree to reach the members of m, and finds the center of the resulting tree. If the center is not the core itself, the core node voluntarily steps down by sending a CHANGE-CORE message to the CBS, which updates the binding list C accordingly and floods the new Core(m) value throughout the network. Subsequently, router members of m send JOIN-REQUEST messages to the new core to construct a new multicast tree.
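The center-finding step in core migration can be implemented with the classic leaf-pruning technique; the sketch below (a hypothetical helper, not LCM's actual code) removes leaves layer by layer until at most two central nodes remain, touching each node and edge once.

```python
def tree_center(tree):
    """Find a center of a tree by repeated leaf pruning; cost is linear
    in the tree size.  tree: node -> set of neighbouring tree nodes."""
    adj = {v: set(nbrs) for v, nbrs in tree.items()}
    nodes = set(adj)
    leaves = {v for v in nodes if len(adj[v]) <= 1}
    while len(nodes) > 2:
        nodes -= leaves                 # prune the current layer of leaves
        nxt = set()
        for leaf in leaves:
            for nbr in adj[leaf]:
                adj[nbr].discard(leaf)
                if nbr in nodes and len(adj[nbr]) == 1:
                    nxt.add(nbr)        # neighbour has become a leaf
        leaves = nxt
    return min(nodes)  # one of the (at most two) central nodes

# A five-router path a-b-c-d-e: the center is "c".
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
print(tree_center(path))  # c
```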
We point out that the above shortest-path-tree computation is performed by the underlying LSR routing protocol as part of its normal duties, and that the task of finding the center of a tree can be performed in O(N) time, where N is the number of routers on the tree. As an example, let us consider the network shown in Figure 7.1, where the above core migration method is applied to a group comprising three members A, B, and C. In Figure 7.1(a), we assume that router A is the first node to join the group, and hence is the core of the initial multicast tree of the group. Also shown in the figure is the tree center, router D. In LCM, router A will (eventually) transfer the responsibility of the core node to D, resulting in the multicast tree depicted in Figure 7.1(b). Regarding the performance of the two multicast trees, the tree in Figure 7.1(a) imposes a maximum member-to-core distance of 4 and an average distance of (0 + 2 + 4)/3 = 2, while the tree in Figure 7.1(b) imposes a maximum distance of 2 and an average distance of (2 + 2 + 1)/3 = 5/3.

Figure 7.1: Core migration in LCM. (a) the initial multicast tree; (b) the tree after core migration.

Core binding destruction. When the core node of a group m detects an empty group, it sends a DELETE-BINDING message to the CBS, removing the Core(m) entry from the binding list C.

7.3 Performance Evaluation

In this section, we investigate various aspects of the performance of the LCM protocol, including the workload at the CBS and the effectiveness of LCM's core selection/migration policies. Since LCM uses the NLE protocol to select the CBS, simulation results regarding NLE's performance, presented in Chapter 6, apply to CBS election and will be omitted here.

CBS workload. The use of a centralized server for the management of core-group bindings raises the concern of the workload at the server. We investigated this issue via simulation. Our experiments were designed to stress the CBS as much as possible.
To this end, we assume that K multicast groups of size S are created simultaneously at time 0. The values of K range from 10 to 200, and those of S range from 20 to 200. Given a multicast group, member arrival times (that is, the times members join the group) are normally distributed with mean 0. We chose the standard deviation value such that 99% of arrival times are within a 1-minute interval centered at time 0 (that is, from -30 seconds to +30 seconds). In the busiest cases, 200 groups of 200 members each are created within 1 minute, producing 40000 CORE-MAPPING requests within that interval. We assumed the service time of such a request to be 700 µsec, which is a typical IP/UDP software overhead observed on many platforms [71]. We used this figure because the look-up of the core binding list can be implemented efficiently, requiring O(log S) time using a tree-based data structure or O(1) time using a hash function. The overhead of this task should be negligible when compared to the software overhead of receiving and returning messages. Results of this study are presented in Figure 7.2. As we can see in Figure 7.2(a), the average queue length at the CBS is less than 2, even for the highest event rates. We point out that the queue length is averaged only over the periods where the CBS is busy. Hence, the smallest possible value of the average queue length metric is one. The maximum queue length at the CBS is plotted in Figure 7.2(b). Although the maximum queue length was between 10 and 20 in some experiments, we point out that a queue containing 20 requests can be served within 14 milliseconds. We conclude that the CBS can accommodate even the busiest scenarios in our simulation.

Figure 7.2: Queue length at the CBS. (a) average; (b) maximum. [Plots of queue length versus number of groups (0 to 200) for group sizes 20, 50, 100, 150, and 200.]

Core selection/migration. In addition to the operational overhead of the LCM protocol, we also investigated the characteristics of multicast trees that result from the LCM core selection and core migration methods. Specifically, we studied the core-to-member distances of such multicast trees. We randomly generated 100 graphs of 144 nodes (that is, routers) with average node degree 4. The average diameter of these graphs is approximately 10. We randomly generated 1000 groups of size S, where values of S range from 2 to 50. For each group, two multicast trees were generated on each graph. First, a member of the group is randomly selected as the "first member" and is used as the core to construct a multicast tree T. Next, we compute the center of T, which is used as the core to construct a second multicast tree for the group. Furthermore, for each group-graph combination, we tried every node in the graph as the core node and recorded the average performance of the resulting trees, in order to obtain the performance of the random (core) selection heuristic. The results for average core-to-member distances are plotted in Figure 7.3(a). As we can see, the performance of the first-member heuristic is significantly better than that of the random selection method when group size is small, and approaches that of random selection as group size increases. (The flat curve for the random selection method results because the average distance from a group of nodes to a randomly selected node is approximately half the diameter of the network.)
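The distance metrics in this study can be reproduced with a single breadth-first search from the candidate core; a minimal sketch (function and variable names are ours, not the simulator's):

```python
from collections import deque

def core_to_member_distances(adj, core, members):
    """Breadth-first search from the core; returns (average, maximum)
    hop distance to the members.  adj: node -> iterable of neighbours."""
    dist = {core: 0}
    queue = deque([core])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    hops = [dist[m] for m in members]
    return sum(hops) / len(hops), max(hops)

# Toy path a-b-c-d with core "b" and members {a, c, d}:
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(core_to_member_distances(adj, "b", ["a", "c", "d"]))  # average 4/3, maximum 2
```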
Results for the maximum core-to-member distances (that is, the depths of trees) are plotted in Figure 7.3(b). With respect to this metric, the first-member and random selection heuristics exhibit approximately identical behaviors. In both Figures 7.3(a) and 7.3(b), the results of the tree-center core selection method clearly demonstrate the benefits of the LCM core migration method, when compared to protocols that do not support migration. In summary, the results presented here support the core management policies of the LCM protocol, which simply assigns the first member of a multicast group as the initial core node, and changes the core node of the group to the center of the current multicast tree after membership information has been revealed and remained stable for a predetermined length of time.

Figure 7.3: Core-to-member distances produced by various core selection methods. (a) average; (b) maximum. [Plots of distance versus group size (2 to 50) for the random, first-member, and tree-center heuristics.]

7.4 Summary

We have proposed a central-server-based core management protocol, the LCM protocol, for use by CBF multicast protocols under LSR.
Based on the information provided by LSR, the protocol addresses three aspects of multicast core management, namely, automatic core selection, core failure handling, and core migration, and can survive any combination of network component failures, including those that partition the network. Our simulation study has shown that the CBS can handle extremely heavy workloads, and has demonstrated the improvements in multicast performance achieved by LCM's core migration method. This work once again illustrates the strength of LSR in supporting group communication.

Chapter 8

Tree-Based Link State Routing

In this chapter, we come full circle, combining group communication techniques discussed earlier to develop a novel link-state routing protocol, called the Tree-based LSR (T-LSR) protocol, for use in general-purpose LSR-based networks, such as the Internet. In the T-LSR protocol, a leader router is elected to perform periodic network status broadcasts on behalf of all the other routers, reducing the overhead associated with periodic flooding, and a spanning tree is constructed for use in the broadcast of network status updates. We prove the correctness of the T-LSR protocol, that is, its ability to maintain consistent routing information and leader preferences throughout the network under any combination of network component failures, partitioning scenarios, and undetected transmission errors. The results of a simulation study reveal that the T-LSR protocol imposes a small fraction of the overhead of the conventional LSR method during normal operation periods, and incurs moderate overhead during adverse periods when an election is in progress or the spanning tree is under repair/construction.

8.1 Motivation

In this chapter, we return to the topic of reducing the operational overhead of LSR. Before presenting our approach, let us take another look at important performance issues of previous LSR protocols.
As discussed earlier, many LSR protocols use the conventional flooding algorithm, which forwards every LSA on every communication link. Thus, each router must process, on average, D copies of a given LSA in a network with average node degree D. Second, all routers are required to flood local status periodically. If the flooding period is T seconds, then each router has to process approximately (N × D)/T LSAs per second in an N-router network. Hereafter, we use the term conventional LSR, or C-LSR for abbreviation, to refer to any LSR protocol that uses the conventional flooding algorithm and that requires every router to perform periodic flooding. Both the OSPF protocol [11] and the LSR method described in ATM standards [13] fall into this category.

Previous efforts to reduce the overhead of LSR have focused largely on flooding operations. Specifically, Gopal [72] described several hardware implementations of the conventional flooding algorithm. In these implementations, however, a broadcast message still has to traverse all communication links. A software-based, spanning-tree flooding method was discussed in [73]. The main concern of that work was to seamlessly integrate routers that use conventional flooding with those that use tree-based flooding. It is not clear if that method could survive routing information/transmission corruption problems. Rajagopalan [74] described a flooding method whereby every router builds a source-rooted tree to advertise its local status. By contrast, the T-LSR protocol constructs a single spanning tree shared by every router. Using only one tree reduces the number of protocol states that the underlying flooding algorithm must maintain. Our previous efforts to reduce LSR overhead, namely, the SAF protocols, construct a spanning MC to broadcast LSAs; the idea of hardware-based, spanning-tree broadcast of routing information has also been exploited by other researchers [55, 75].
The T-LSR protocol does not assume any capability in hardware and hence can be applied to a wider range of networking platforms. Furthermore, while the above flooding methods improve the performance of individual flooding operations, none of them is concerned with the bigger picture of the entire flooding cycle. In this chapter, we propose a novel LSR protocol, Tree-based LSR (T-LSR), which constructs a single spanning tree that is used by all routers for the dissemination of status information. Moreover, the T-LSR protocol elects a leader router to undertake the duty of periodic flooding on behalf of other routers. In Figure 8.1, we give an example to illustrate the concept of tree-based flooding. Using the spanning tree topology shown in Figure 8.1(a), the flooding operation in this example requires four steps. A tree-based flooding operation performs only O(|V|) LSA message forwardings, as opposed to O(|E|) LSA forwardings in the C-LSR protocol. Using the T-LSR protocol, each router in an N-node network that uses T-second flooding cycles processes, on average, only 1/T advertisements per second produced by periodic flooding, and O(1) copies of any LSA.

Figure 8.1: An example of tree-based flooding. [Steps 1 through 4 of an LSA broadcast along tree links, marking nodes that have received the LSA and nodes that have finished flooding.]

Of course, the major challenge in designing such a "lightweight" LSR protocol is to provide the same level of robustness as the C-LSR protocol. As we discussed in Chapter 2, one of the critical fault-tolerance requirements of an LSR protocol is to survive undetected transmission errors. (The entire ARPANET was brought down by such errors in 1980 [76].) While the problems of leader election and spanning tree construction have been studied extensively [13, 55, 52, 77], previous solutions deal mainly with component failures (such as leader or tree link failures) and partitioning of the network.
Solutions to these problems that also survive message corruption events are relatively unexplored. A class of problems, collectively referred to as the incorrect leadership problem, arises when corrupted network topology information is used in the computation of the spanning tree topology, or when undetected transmission errors occur during the establishment of leadership and the construction of the spanning tree. We will formally prove the correctness of the T-LSR protocol, that is, its ability to maintain consistent routing information, construct a correct spanning tree, and achieve leadership consensus under any combination of network component failures, partitioning scenarios, and corruption problems.

The remainder of this chapter is organized as follows. An overview of the T-LSR protocol is first given in Section 8.2. Algorithm details of the T-LSR protocol are presented in Section 8.3, followed by the proof of correctness in Section 8.4. The performance of the T-LSR protocol is investigated through simulation. The results of this study, presented in Section 8.5, reveal that the T-LSR protocol imposes a very small fraction of the overhead of the C-LSR protocol during normal operation periods, and incurs only moderate overheads during adverse periods when the spanning tree is under repair/construction and leader election is in progress. Finally, a summary of this work is given in Section 8.6.

8.2 Overview

In this section, we present the operation of the T-LSR protocol. In the discussion, we assume a connected network G = (V, E), where V is the set of routers and E the set of communication links that connect routers. To generalize our discussion to partitioned networks, we simply consider segments individually. For the purpose of cross-reference in the subsequent discussion, important rules/conditions are labeled.
(For example, the statement "when a router receives the first copy of a given LSA, the LSA is forwarded along all the links incident to the router except the one on which the LSA arrives" could be labeled as Forward-LSA-Rule-1.) Before the discussion, we give the control message formats and data structures of the T-LSR protocol in Tables 8.1 and 8.2, respectively.

LSA(x, s, m): a link-state advertisement with sequence number s and flooding mode m that contains the local status of router x.
CTA(a, G'_a, T, c): a complete-topology advertisement that contains the reachable network image G'_a of the leader router a and a spanning tree topology T with epoch number c.
Ballot(z, a, c): a ballot message from a child z that specifies a as the leader and is used to establish the spanning tree of epoch number c.
LEA(a, c): a leadership establishment advertisement that broadcasts the establishment of the leadership of router a and the completion of the construction of the spanning tree with epoch number c suggested by a.

Table 8.1: Control messages in the T-LSR protocol.

Rank(x): the rank (leader priority) of router x.
Leader(x): the preferred leader of x.
Mode(x): the operation mode (either T or G) of x.
Epoch(x): the epoch number of the current spanning tree.
Flag_x[z]: a boolean flag that indicates if x has received the ballot for Leader(x) from its child z in the current spanning tree.

Table 8.2: T-LSR data structures at a router x.

LSA model. We assume that every LSA originated by a router contains the complete status of the router. If a router x has five incident links, for example, then every LSA from x contains descriptions of all five links. When x wishes to advertise the failure of one of its incident links, it floods an LSA that describes the working status of four links and the non-operational status of the fifth. In this way, an LSA can be uniquely identified by its source router ID and a sequence number.
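The full-status LSA model just described can be sketched as a small data type; the fields mirror Table 8.1, and the acceptance test shows how the (source, sequence number) pair identifies an LSA. This is an illustrative sketch, not the protocol's wire format.

```python
from dataclasses import dataclass, field

@dataclass
class LSA:
    """Full-status LSA sketch: uniquely identified by (source, seq);
    `links` carries the state of every incident link of the source."""
    source: str
    seq: int
    mode: str = "T"                            # flooding mode flag, T or G
    links: dict = field(default_factory=dict)  # neighbour -> "up" / "down"

def accept(lsa, image):
    """Install the LSA in the local network image only if it is fresher
    than the copy already held for its source router."""
    held = image.get(lsa.source)
    if held is None or lsa.seq > held.seq:
        image[lsa.source] = lsa
        return True
    return False

image = {}
print(accept(LSA("x", 1, links={"y": "up"}), image))  # True: first copy
print(accept(LSA("x", 1, links={"y": "up"}), image))  # False: duplicate
```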
This LSA model is similar to that of the OSPF protocol [11]. Other LSR protocols use a more refined model, where each component of the local status of a router (for example, a specific link) is assigned an LSA ID [13], and an LSA must be identified by a (router ID, LSA ID, sequence number) triple. This allows an LSA to contain only a part of the local status of a router and is economical in terms of bandwidth consumption if the router frequently advertises changes in individual state components. The T-LSR protocol could be generalized to handle such LSA models.

Network image. The network image at a router x, denoted G_x, is defined to be the set of LSAs maintained at x. We note that G_x could include unreachable routers because, for example, when x loses connectivity to another router y, the LSA regarding y is still maintained by x until it is aged out. We denote by G'_x the set of LSAs maintained at x that regard routers reachable from x in the topology defined by G_x. If the network is connected, then G'_x = G_x. When the network is partitioned, G'_x is a proper subset of G_x, and the LSAs in G'_x describe the topology of the network segment in which x resides.

Operation modes. The T-LSR protocol elects a leader router to perform periodic flooding on behalf of all the other routers and uses only tree links in the dissemination of network status updates; details are given later. However, there are periods of time when the election is in progress and/or the spanning tree is under construction. During such adverse periods, the T-LSR protocol reverts to the C-LSR protocol to ensure uninterrupted routing operation. To distinguish adverse periods from normal operation periods, each router operates in one of two modes: mode T and mode G. We denote by Mode(x) the operation mode at router x.

• During periods when leadership consensus has been achieved and the spanning tree is operational, all the routers in the network operate in mode T.
When a router is in mode T, it floods only changes in local status and uses only spanning tree links in the flooding of LSAs; it does not perform periodic flooding. Every LSA flooded by a T-mode router is tagged with a mode flag of value T; such an LSA is termed a T-mode LSA, and its respective flooding is termed T-mode flooding.

• When a router is in mode G, it effectively executes the C-LSR protocol: it performs both periodic and event-driven flooding, which in turn use all communication links. Every LSA flooded by a G-mode router is tagged with a mode flag of value G; such an LSA is termed a G-mode LSA, and its respective flooding is termed G-mode flooding.

The arrival of a G-mode LSA at a T-mode router forces the router to switch to mode G. The existence of any router in the network that is in mode G indicates a lack of leadership consensus within the network.

Leader election and spanning tree construction. Every router x is configured with a leader priority, denoted by Rank(x), which constitutes a part of the local status of the router and which therefore is included in LSAs flooded by x. Further, router x searches V(G'_x), the set of routers known by x to be reachable, for the router with the highest rank, and calls the result of this search its preferred leader, denoted Leader(x). Subsequent actions taken by router x depend on whether or not the value of Leader(x) is x itself.

If Leader(x) is set to x, then router x immediately undertakes the responsibilities of the leader router (although at this point not all routers necessarily agree on its leadership). Leader responsibilities include the computation of a spanning tree topology T and the periodic broadcast of complete topology advertisements (CTAs). A CTA from x contains all the LSAs in G'_x as well as the spanning tree topology T. To broadcast a CTA, x forwards the CTA along the branches of T.
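The reachable image G'_x, on which both leader election and CTA contents depend, can be sketched as a breadth-first search over the stored LSAs; the dictionary encoding of G_x below is an assumption made for illustration.

```python
from collections import deque

def reachable_image(G_x, x):
    """Compute G'_x: the subset of LSAs in G_x whose routers are reachable
    from x in the topology that G_x itself describes.  G_x is encoded as
    router ID -> set of advertised neighbors (an illustrative stand-in
    for full LSA contents)."""
    seen, frontier = {x}, deque([x])
    while frontier:
        u = frontier.popleft()
        for v in G_x.get(u, set()):
            if v in G_x and v not in seen:  # follow only routers we hold LSAs for
                seen.add(v)
                frontier.append(v)
    return {r: nbrs for r, nbrs in G_x.items() if r in seen}

# Router 'd' is still stored (not yet aged out) but unreachable from 'a'.
G_x = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}, 'd': {'e'}, 'e': {'d'}}
print(sorted(reachable_image(G_x, 'a')))     # ['a', 'b', 'c']
```

In a connected network the result equals G_x itself; under partitioning it covers only the segment containing x, matching the G'_x ⊂ G_x distinction in the text.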
On the other hand, if router x has some other preferred leader, that is, Leader(x) = α and x ≠ α, then x must await a CTA from α; CTAs from other routers will be silently discarded (Discard-CTA-Condition-1). Upon receiving a CTA from its preferred leader, router x processes the LSAs contained in the CTA, extracts the spanning tree T, and forwards the CTA to its children in T. The second task of router x is to receive ballot messages for α from all its children. After the completion of this task, x sends its own ballot to its parent y in T. This ballot also serves to establish the x-y tree link. After router α collects all the ballots from its own children, it claims victory by broadcasting a leadership establishment advertisement (LEA), again using only T links. Receipt of the LEA changes the operation mode of every router to T, and the network enters the normal operation of the T-LSR protocol.

Re-election. In the T-LSR protocol, leader re-election is triggered by changes in the set of reachable routers. Specifically, when a router x observes a change in the set V(G'_x), it must re-compute its preferred leader (Compute-Leader-Condition-1). To enable router ranks to be changed during protocol operation, x also re-computes its leader preference when it detects any change in router ranks (Compute-Leader-Condition-2). In either case, router x switches to mode G (Enter-Mode-G-Condition-1) and participates in a new election.

For illustration, let us consider a network where the administrator has configured a default leader α with rank 3 and a backup leader β with rank 2. All the other routers are configured with rank 1. Consider a scenario where the current leader α has just failed. First, the neighboring routers of α notice the failure of the links incident to α, and flood LSAs that describe the malfunctioning status of such links.
Via these LSAs, every router x detects a change in V(G'_x) (specifically, that α has been removed from V(G'_x)), switches to mode G, and sets Leader(x) to the router with the next highest rank, namely β. Router β also discovers that it is itself of highest rank, so it broadcasts CTAs and collects ballots to establish its leadership and construct a new spanning tree.

Maintenance of the spanning tree. When a link used by the spanning tree fails, the routers incident to the link switch to mode G (Enter-Mode-G-Condition-2) and flood G-mode LSAs that contain the new state of the link. Upon receipt of such an LSA, every router in the network switches to G-mode operation. Routers remain in this mode until a new spanning tree (contained in the next CTA from the leader) has been constructed and the leader has broadcast an LEA.

As illustrated above, the spanning tree topologies contained in the periodic broadcasts of CTAs from a given leader may change over time in response to network topology changes. The sequence of tree topologies proposed by a leader router is divided into one or more epochs. Consecutive, identical tree topologies are tagged with the same epoch number; a change in the tree topology is reflected by an increment in the epoch number. During each epoch, routers remain in mode T. When a change in epoch number is detected, routers switch to mode G (Enter-Mode-G-Condition-3) until the construction of a new tree is completed. Each router x records the current epoch number in the data structure Epoch(x). Any CTA that contains a spanning tree with an epoch number smaller than Epoch(x) will be discarded by x (Discard-CTA-Condition-2). We emphasize that routers must cast ballots in every round of tree topology broadcast (that is, every CTA broadcast), regardless of the presence or absence of epoch number changes.
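The rank-based election illustrated above can be sketched as a one-line selection over V(G'_x); the tie-break on router ID is an assumption, since the text does not say how equal ranks are resolved.

```python
def preferred_leader(ranks):
    """Return Leader(x) given a mapping of reachable router ID -> Rank,
    i.e. the routers in V(G'_x).  Highest rank wins; ties are broken by
    the larger router ID (an assumed tie-break)."""
    return max(ranks, key=lambda r: (ranks[r], r))

# Default leader 'alpha' (rank 3), backup 'beta' (rank 2), others rank 1.
ranks = {'alpha': 3, 'beta': 2, 'x': 1, 'y': 1}
print(preferred_leader(ranks))      # alpha
del ranks['alpha']                  # the configured leader fails
print(preferred_leader(ranks))      # beta takes over, as in the example
```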
Before the broadcast of a CTA, the leader computes a new spanning tree topology if it is currently in mode G (Compute-Tree-Condition-1), and increments the epoch number. After receiving all the ballots pertaining to the CTA, if the leader is currently in mode G (Issue-LEA-Condition-1), it broadcasts an LEA(α, c), where c = Epoch(α). Failure to collect any necessary ballot switches the leader to mode G (Enter-Mode-G-Condition-4). Upon receiving the LEA, any router x for which Leader(x) = α and Epoch(x) = c switches to mode T (Enter-Mode-T-Condition-1) and forwards the LEA to its children in the current spanning tree. Otherwise, the LEA is discarded by x.

Flooding algorithm. In the T-LSR protocol, a router can operate in mode T or mode G, and an LSA can also be flooded in either of the two modes. When an LSA arrives at a router, there are therefore four (flooding mode, operation mode) combinations. Before formally presenting the LSA-forwarding rules under these combinations, let us use the example shown in Figure 8.2 to discuss important scenarios. In the example, router X detects a significant change in the queueing delay over the (X, A) link and disseminates this information by flooding an LSA ℓ_X in mode T. Simultaneously, another router Y floods an LSA ℓ_Y (in mode G) to advertise the failure of the (Y, B) link, which is used in the spanning tree depicted in Figure 8.2(a). Let us assume that all routers except Y are initially in mode T. Figures 8.2(b) and (c) depict the first and second forwarding steps of the two flooding operations. As shown in Figure 8.2(c), the T-mode LSA ℓ_X encounters two routers, W and Z, whose modes have been changed to G by ℓ_Y. As shown in Figure 8.3(a), when routers W and Z receive ℓ_X, they change the mode of ℓ_X to G and forward ℓ_X along all their respective incident links, including the ones on which the T-mode ℓ_X arrived, specifically, the (W, H) and (Z, H) links.
The G-mode copy of ℓ_X will be considered more recent than its T-mode counterpart. When the G-mode ℓ_X arrives at a router that has already received the T-mode ℓ_X, it will be treated as being seen for the first time; in Figure 8.3(b), router X forwards the G-mode copy of ℓ_X to its neighbor A, as if it were receiving ℓ_X for the first time.

[Figure 8.2: The flooding of two LSAs in different modes. Legend: G-mode node; T-mode node; tree link; G-mode flooding; T-mode flooding. (a) initial configuration; (b) first forwarding step; (c) second forwarding step.]

[Figure 8.3: The completion of the T-mode flooding in mode G. (a) first forwarding step; (b) second forwarding step.]

However, the above situation, in which the G-mode copy of ℓ_X returns to X itself, raises the concern that ℓ_X may have been corrupted before being processed by X the second time. If X blindly accepted the corrupted G-mode copy of ℓ_X, then router X would have incorrect knowledge about its own status. To cope with this problem, when any (G-mode) LSA that contains the local status of a router x arrives at x itself, router x compares the LSA against its local status and discards the LSA if any inconsistency is detected (Discard-LSA-Condition-1).

We now present the flooding rules of the T-LSR protocol. Let ℓ' = LSA(y, s', m') ∈ G_x, where y is the ID of the source router, s' is the sequence number, and m' is the mode of the LSA, be the LSA regarding y that is maintained at x. When an LSA ℓ = LSA(y, s, m) arrives at x, it is ignored by x if (s, m) ≤ (s', m') (Discard-LSA-Condition-2), where the comparison is in lexicographic order and mode G is defined to be greater than mode T. (Thus, given two LSAs regarding the same router and with identical sequence numbers, the one in mode G overrides the one in mode T.) If ℓ is not discarded, it replaces ℓ' in G_x and is forwarded according to the three cases below.
In the discussion, we denote by E(x, T) the set of tree links that are incident to x, by E(x, G) the set of all incident links of x, and by p the link on which ℓ arrives.

LSA-Forwarding-Case-1: m = Mode(x). Forward ℓ along links in E(x, m) − {p}.

LSA-Forwarding-Case-2: m = G and Mode(x) = T. Set Mode(x) to G (Enter-Mode-G-Condition-5), and forward ℓ along links in E(x, G) − {p}.

LSA-Forwarding-Case-3: m = T and Mode(x) = G. Forward LSA(y, s, G) along links in E(x, G).

The first case occurs when LSA ℓ and router x are in the same mode. The last two cases take place when the network is in mode transition. In Case 2, the arrival of a G-mode LSA at a T-mode router switches the router to mode G; the LSA itself is forwarded according to the conventional flooding algorithm. In Case 3, when a G-mode router receives a T-mode LSA, the router changes the flooding mode of the LSA to G and forwards it along all its incident links.

Aging. Like the C-LSR protocol, the T-LSR protocol uses an aging mechanism to curb the lifespan of corrupted LSAs. At a non-leader router x, the LSA in G_x regarding a router y is removed from G_x t_aging seconds after its arrival at x. This rule, of course, cannot be applied at the leader router itself, because other routers do not perform periodic flooding and hence the leader may not receive new LSAs from other routers for long periods of time. At the leader router, once leadership has been established, LSAs regarding reachable routers are immune to aging. Specifically, let us consider a given router x and an LSA ℓ ∈ G_x that regards another router y. When its associated aging timer fires, ℓ is removed from G_x only if any of the following three conditions is satisfied: Leader(x) ≠ x (Aging-Condition-1), Mode(x) = G (Aging-Condition-2), or y is unreachable from x in G_x (Aging-Condition-3). If ℓ is not removed, then a new associated aging timer is created.
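The discard rule (Discard-LSA-Condition-2) and the three forwarding cases above can be combined into a single sketch; the (seq, mode) tuples and plain-set link encoding are illustrative choices, not the dissertation's data layout.

```python
def process_incoming(mode_x, stored, lsa, tree_links, all_links, p):
    """Sketch of Discard-LSA-Condition-2 plus the three forwarding cases.
    `stored` and `lsa` are (seq, mode) pairs for the same source router;
    returns (new router mode, mode stamped on the LSA, links to forward on).
    p is the arrival link."""
    order = {'T': 0, 'G': 1}                 # mode G outranks mode T
    (s, m), (s2, m2) = lsa, stored
    if (s, order[m]) <= (s2, order[m2]):     # lexicographic comparison
        return mode_x, None, set()           # ignored: obsolete copy
    if m == mode_x:                          # Case 1: same mode
        links = tree_links if m == 'T' else all_links
        return mode_x, m, links - {p}
    if m == 'G':                             # Case 2: T-mode router joins G
        return 'G', 'G', all_links - {p}     # (Enter-Mode-G-Condition-5)
    # Case 3: T-mode LSA at a G-mode router: re-flood as mode G on all
    # incident links, including the one it arrived on.
    return 'G', 'G', set(all_links)

tree, links = {'p1'}, {'p1', 'p2', 'p3'}
new_mode, out_mode, out = process_incoming('T', (4, 'T'), (5, 'G'), tree, links, 'p1')
print(new_mode, out_mode, sorted(out))       # G G ['p2', 'p3']
```

Note that with equal sequence numbers, a G-mode copy supersedes a stored T-mode copy but not the other way around, matching the lexicographic rule in the text.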
However, because a corrupted LSA maintained by the leader regarding a reachable router is not subject to aging, such corruption may persist for prolonged periods of time if not handled properly. Further, the corruption could propagate throughout the network as the leader includes the LSA in CTAs. This problem is detected and corrected as follows. Let ℓ be the LSA in G_α regarding a router x, where α is the leader router. Let ℓ' be the LSA in G_x regarding x itself. When router x receives a CTA from α, which includes ℓ, x checks ℓ against ℓ'. If any inconsistency is found, x switches to mode G (Enter-Mode-G-Condition-6), discards the CTA (Discard-CTA-Condition-3), and hence will not vote for α, forcing α also to switch to mode G and thus allowing the corrupted information to be aged out. Meanwhile, router x, now in mode G, floods periodically to provide α with its correct local status information. In order to avoid premature mode switching due to delays in LSAs reaching the leader, the above consistency check is performed only when the CTA is received more than t_objection_delay seconds after the creation of ℓ'.

In the T-LSR protocol, the leadership of the established leader is also subject to aging. Even during periods when there are no network topology changes, or when such changes do not affect its leadership, an established leader must periodically flood CTAs that contain the current spanning tree topology and epoch number. If a router does not receive such CTAs for a predetermined length of time, it must revert to mode G operation (Enter-Mode-G-Condition-7). Leadership aging addresses the concern that corruption problems in the epoch number of a previous CTA could prohibit the acceptance of subsequent CTAs for prolonged periods of time.

The handling of network partitioning.
As in the case of handling network component failures, the T-LSR protocol copes with network partitioning scenarios by having every router x monitor the set of routers reachable from x, V(G'_x). Let us consider a scenario where router α is the current leader and a component failure partitions the network into two segments, S1 and S2. Let us assume that α ∈ S1. Routers in S2 will notice the loss of connectivity to the current leader, switch to mode G (Enter-Mode-G-Condition-1), and select a new leader, call it β. Router β will also select itself as the new leader, and, since it is in mode G, will compute and construct a spanning tree within S2 (Compute-Tree-Condition-1). In the meantime, router α will switch to mode G due to changes in the set V(G'_α) and hence will compute a new spanning tree for use in S1. Should the segments S1 and S2 be merged later, routers in S2, including β, will change their preferred leaders to α. Simultaneously, router α will switch to mode G due to changes in V(G'_α) and hence compute a new spanning tree to cover the entire network.

Handling incorrect leadership problems. As defined earlier, the term incorrect leadership problem refers to any corruption problem involved in leader election and spanning tree construction. Let us consider the example shown in Figure 8.4. In the example, the network image at the leader router α is corrupted in such a way that router X, which is reachable from the leader in the real network topology depicted in Figure 8.4(a), is considered unreachable by the leader (see Figure 8.4(b)). Consequently, leader α constructs the incorrect spanning tree T depicted in Figure 8.4(c), which of course does not cover router X. Presuming that every router selects α as the preferred leader, router α will obtain the votes from all the routers covered by T and broadcast an LEA. Subsequently, all routers except X operate in mode T, and any event-driven flooding from a non-X router will use T links and will not reach router X.
[Figure 8.4: An example of the incorrect leadership problem. (a) real network topology; (c) resultant incorrect spanning tree T.]

It may be argued that, since X does not receive the above LEA and will remain in mode G, the periodic flooding of LSAs from X, which are in mode G, will switch the operation modes of the other routers back to G, at least curbing the lifespan of the above situation to within a flooding cycle. To see that this mechanism does not necessarily work, let us further assume that ℓ = LSA(X, s) is the LSA in G_X that regards X itself and that ℓ' = LSA(X, s') is the corrupted copy of ℓ in G_α. Through the CTA broadcasts from α, LSA ℓ' is propagated throughout the network and incorporated into the network images at all routers but X. If s' = s + 2^28 and X floods on average every 60 seconds, then it will take X more than 500 years to use sequence numbers larger than s'. Before then, all the LSAs from X are ignored by other routers. The incorrect spanning tree T and the false leadership of router α in Figure 8.4 can last for a prolonged period of time.

To cope with this problem, every router, upon receiving a CTA containing a tree topology T, checks whether all its neighboring nodes are present in T. If T fails this test at any router, that router will discard the CTA (Discard-CTA-Condition-4) and revert to G-mode operation (Enter-Mode-G-Condition-8). In the example of Figure 8.4, router Y will notice the absence of X from the spanning tree in Figure 8.4(c) and refuse to vote for the leader. This keeps the leader router (and all other routers as well) in mode G, enabling the corrupted information regarding X to be aged out.
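The neighbor-presence test (Discard-CTA-Condition-4) amounts to a purely local set check; the (u, v) edge-set encoding of T below is an assumption made for illustration.

```python
def cta_tree_acceptable(my_neighbors, tree_links):
    """Local test behind Discard-CTA-Condition-4: every physical neighbor
    of this router must appear somewhere in the advertised tree T.
    `tree_links` is a set of (u, v) tree edges; a sketch only."""
    covered = {r for link in tree_links for r in link}
    return my_neighbors <= covered

# Router Y is physically adjacent to X, but the advertised tree omits X,
# as in Figure 8.4(c); Y therefore refuses to vote for the leader.
tree = {('alpha', 'Y'), ('alpha', 'W')}
print(cta_tree_acceptable({'alpha', 'X'}, tree))   # False: discard the CTA
print(cta_tree_acceptable({'alpha'}, tree))        # True
```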
It is proved in Section 8.4 that, even when the current, incorrect leadership and spanning tree hinder the dissemination of subsequent network status updates, this simple test is sufficient for correct leadership and a correct spanning tree to eventually be constructed, provided that corruption does not indefinitely afflict the transmission of T and the ensuing ballots.

8.3 Algorithms

In this section, we present the algorithms of the T-LSR protocol. In the discussion, for a given router x, we denote by Children(x) the set of children of x in the current spanning tree, relative to Leader(x), and by Parent(x) the parent of x in the tree.

When a router x needs to flood its local status, either for periodic flooding or to broadcast changes in its local status, it invokes the FloodLocalStatus routine shown in Figure 8.5. Parameter x in the routine indicates the ID of the caller router. The routine first updates the content of LSA ℓ, the LSA regarding x in its own network image G_x, to reflect the current local status. Next, x switches to mode G and searches for a new preferred leader if there is any change in the set V(G'_x) after the update of ℓ (Compute-Leader-Condition-1) or if the rank of x itself has changed (Compute-Leader-Condition-2). Router x must also switch to mode G if any incident tree link is found malfunctioning (Enter-Mode-G-Condition-2). Finally, router x increments the sequence number of ℓ and forwards ℓ along the set of links defined by its current operation mode.

Algorithm: FloodLocalStatus.
Input: router ID x.
  U = V(G'_x).
  Let ℓ = LSA(x, s, m) be the LSA regarding router x in G_x.
  Update the content of ℓ (and, hence, G_x and G'_x) to reflect the
    current local status of x.
  IF (Compute-Leader-Condition-1: U ≠ V(G'_x)) OR
     (Compute-Leader-Condition-2: Rank(x) has changed) THEN
    Mode(x) = G.  (Enter-Mode-G-Condition-1)
    SetPreferredLeader().
  ELSE IF (Enter-Mode-G-Condition-2: ∃e ∈ E(x, T) that has failed) THEN
    Mode(x) = G.  Epoch(x) = −1.
  ENDIF
  s = s + 1.
  Forward LSA(x, s, Mode(x)) along links in E(x, Mode(x)).

Figure 8.5: The routine that floods router local status.

Shown in Figure 8.6 is the routine that processes incoming LSAs. In the routine, parameter x indicates the ID of the caller router, and ℓ is an incoming LSA that regards router y, with sequence number s and mode m, arriving on link p. The first task of the routine is to check whether x should discard ℓ according to Discard-LSA-Condition-1 and Discard-LSA-Condition-2. Should ℓ pass these tests, it is accepted by x and incorporated into G_x. Subsequently, the ProcessLSA routine checks for changes in the reachable set, V(G'_x), and in the rank of router y. Whenever such a change is detected, router x switches to mode G and recomputes its preferred leader. Lastly, the routine forwards ℓ according to LSA-Forwarding-Case-1, LSA-Forwarding-Case-2, and LSA-Forwarding-Case-3.

Algorithm: ProcessLSA.
Input: router ID x and ℓ = LSA(y, s, m) that arrives on link p.
  U = V(G'_x).
  IF (x = y) THEN  /* Check for corruption in LSAs regarding myself */
    Let ℓ' = LSA(y, s', m') be the LSA regarding router y in G_x.
    IF /* Discard-LSA-Condition-1 */
       (s > s') OR ((s = s') but (ℓ ≠ ℓ')) OR
       ((s < s') and (ℓ' has existed for more than t_objection_delay seconds)) THEN
      Mode(x) = G.  Exit.  /* ℓ is discarded */
    ENDIF
  ELSE IF (Discard-LSA-Condition-2: (s, m) ≤ (s', m')) THEN
    Exit.
  ENDIF
  Replace ℓ' with ℓ in G_x, and set up an aging timer for ℓ.
  /* Changes in leadership rank or the set of reachable routers? */
  IF (Compute-Leader-Condition-1: U ≠ V(G'_x)) OR
     (Compute-Leader-Condition-2: the ranks of y differ in ℓ and ℓ') THEN
    Mode(x) = G.  SetPreferredLeader().
  ENDIF
  IF (LSA-Forwarding-Case-1: m = Mode(x)) THEN
    Forward ℓ along links in E(x, m) − {p}.
  ELSE IF (LSA-Forwarding-Case-2: m = G and Mode(x) = T) THEN
    Mode(x) = G (Enter-Mode-G-Condition-5), and forward ℓ along
      links in E(x, G) − {p}.
  ELSE IF (LSA-Forwarding-Case-3: m = T and Mode(x) = G) THEN
    Forward LSA(y, s, G) along links in E(x, G).
  ENDIF

Figure 8.6: Processing incoming LSAs.

The routine that a router x uses to set its preferred leader, Leader(x), is presented in Figure 8.7. As stated, the preferred leader of x is set to the reachable router w with the highest rank, according to the local network image of x. If x changes its preferred leader, then the current epoch number is set to −1, and, as such, router x can accept a tree topology from the new leader with any epoch number. If the new value of Leader(x) is x itself, the routine invokes the BroadcastCTA routine, discussed next.

Algorithm: SetPreferredLeader.
Input: router ID x.
  old_leader = Leader(x).
  Let w be the router with the highest rank in G'_x.
  Leader(x) = w.
  IF (old_leader ≠ Leader(x)) THEN Epoch(x) = −1. ENDIF
  IF (Leader(x) = x) THEN BroadcastCTA(). ENDIF

Figure 8.7: Setting the preferred leader.

A router x for which Leader(x) = x periodically invokes the BroadcastCTA routine, shown in Figure 8.8. The routine first checks the ballots corresponding to the previous CTA broadcast and reverts to mode G if any ballot is missing (Enter-Mode-G-Condition-4). (If this is the first CTA broadcast by x, then the check is bound to fail, a result consistent with the fact that x has not yet established its leadership and must be in mode G.) If x is in mode G, meaning that it is still establishing its leadership, then it must compute a new spanning tree topology T. The routine then broadcasts a CTA that contains the tree topology T and the network image G'_x. Lastly, the routine clears the Flag data structures, and router x awaits ballots corresponding to this round of CTA broadcast.

When a CTA(α, G'_α, T, c) arrives at a router x via link p, the router invokes the ProcessCTA routine.
Algorithm: BroadcastCTA.
Input: router ID x.
  /* Check the ballots of the previous round of votes */
  IF (Enter-Mode-G-Condition-4: ∃z ∈ Children(x) such that Flag_x[z] = FALSE) THEN
    Mode(x) = G.
  ENDIF
  IF (Compute-Tree-Condition-1: Mode(x) = G) THEN
    Compute a tree T that spans V(G'_x).
    Tree(x) = T, and Epoch(x) = Epoch(x) + 1.
  ENDIF
  Forward CTA(x, G'_x, T, Epoch(x)) to Children(x).
  /* To track ballots for this round of votes: */
  Flag_x[z] = FALSE, ∀z ∈ Children(x).

Figure 8.8: The BroadcastCTA routine.

The ProcessCTA routine, shown in Figure 8.9, discards an arriving CTA if it is not from the preferred leader of x (Discard-CTA-Condition-1), if it contains an obsolete spanning tree topology (Discard-CTA-Condition-2), if the LSA regarding x in the CTA is inconsistent with the local status of x (Discard-CTA-Condition-3), or if some neighboring routers of x are absent from the spanning tree T contained in the CTA (Discard-CTA-Condition-4). If the CTA is accepted by x, the LSAs contained in the CTA are incorporated into the network image of x. Lastly, if the CTA contains a more recent tree topology than the one stored locally, x updates its Epoch(x) data structure accordingly, and switches to mode G to avoid the use of tree-based flooding (Enter-Mode-G-Condition-3). Since routers must cast ballots in every round of CTA broadcast, router x, before ending the routine, sets Flag_x[z] to FALSE for all z ∈ Children(x) to await ballots from its children in the tree T.

Algorithm: ProcessCTA.
Input: router ID x and an arriving CTA(α, G'_α, T, c).
  IF (Discard-CTA-Condition-1: Leader(x) ≠ α) OR
     (Discard-CTA-Condition-2: c < Epoch(x)) THEN
    Exit.
  ENDIF
  Let ℓ = LSA(x, s, m) be the LSA in the CTA regarding x.
  Let ℓ' = LSA(x, s', m') be the LSA in G_x regarding x.
  IF (m ≠ m') OR (s > s') OR ((s = s') but (ℓ ≠ ℓ')) OR
     ((s < s') and (ℓ' has existed for more than t_objection_delay seconds)) THEN
    Mode(x) = G.  /* (Enter-Mode-G-Condition-6) */
    Exit.
         /* (Discard-CTA-Condition-3) */
  ENDIF
  /* Check for corruption in T */
  IF (Discard-CTA-Condition-4: ∃ a neighbor of x ∉ V(T)) THEN
    Mode(x) = G.  Exit.
  ENDIF
  FOR (each LSA ℓ = LSA(y, s, m), y ≠ x, contained in the CTA) DO
    Let ℓ' = LSA(y, s', m') be the LSA regarding router y in G_x.
    IF ((s, m) ≥ (s', m')) THEN
      Replace ℓ' with ℓ in G_x, and reset the aging timer for ℓ.
    ENDIF
  ENDFOR
  Forward this CTA to routers in Children(x).
  IF (c > Epoch(x)) THEN
    Mode(x) = G.  (Enter-Mode-G-Condition-3)
    Tree(x) = T, and Epoch(x) = c.
  ENDIF
  Flag_x[z] = FALSE, for each z ∈ Children(x).

Figure 8.9: The processing of incoming CTAs.

When a Ballot(z, α, c) message arrives at a router x, the router calls the ProcessBallot routine shown in Figure 8.10. A ballot is processed only if it is for the preferred leader α of x, if it belongs to epoch Epoch(x), and if it comes from a child of x in Tree(x). To process such a ballot, x sets the flag corresponding to the child z and establishes the x-z tree link. When the Flag data structures indicate the receipt of legitimate ballots from all the children of x, the remaining actions of the routine depend on whether x is the leader. If Leader(x) = x, then router x issues an LEA(x, Epoch(x)) message to broadcast the successful establishment of its leadership. Otherwise, router x casts its own ballot by sending a Ballot(x, α, c) message to its parent.

Algorithm: ProcessBallot.
Input: router ID x and a message Ballot(z, α, c).
  IF (α = Leader(x)) AND (c = Epoch(x)) AND (z ∈ Children(x)) THEN
    Flag_x[z] = TRUE.  Establish the x-z tree link.
    IF (∀z' ∈ Children(x), Flag_x[z'] = TRUE) THEN
      IF (Leader(x) = x) THEN
        IF (Issue-LEA-Condition-1: Mode(x) = G) THEN
          Forward an LEA(x, c) along all incident links of Tree(x).
        ENDIF
      ELSE
        Send message Ballot(x, α, c) to Parent(x).
      ENDIF
    ENDIF
  ENDIF

Figure 8.10: The processing of ballot messages.
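The Flag_x bookkeeping shared by Figures 8.8 and 8.10 can be sketched as a small tracker; class and method names here are illustrative.

```python
class BallotTracker:
    """Sketch of the Flag_x bookkeeping: a router waits for a ballot from
    every child in the current tree before acting (casting its own ballot
    upward, or, at the leader, issuing the LEA)."""
    def __init__(self, children):
        self.flags = {z: False for z in children}

    def record(self, child, leader, epoch, my_leader, my_epoch):
        # Accept only ballots for my preferred leader, in the current
        # epoch, and from an actual child in the tree; return True once
        # all children have voted.
        if leader == my_leader and epoch == my_epoch and child in self.flags:
            self.flags[child] = True
        return all(self.flags.values())

t = BallotTracker(['y', 'z'])
print(t.record('y', 'a', 3, 'a', 3))   # False: still waiting on z
print(t.record('z', 'a', 3, 'a', 3))   # True: all ballots collected
```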
When an LEA(α, c) arrives at a router x, the router invokes the ProcessLEA routine shown in Figure 8.11. The LEA is first checked for the choice of the leader and the current epoch number. If both conditions are satisfied, then router x switches to mode T and forwards the LEA to its children in the current spanning tree. Hereafter, router x enters the normal operation period of the T-LSR protocol: it stops periodic flooding and uses only tree links to advertise local status updates.

Algorithm: ProcessLEA.
Input: router ID x and an LEA(α, c).
  IF (Enter-Mode-T-Condition-1: α = Leader(x) and c = Epoch(x)) THEN
    Mode(x) = T.
    Forward the LEA to all z ∈ Children(x).
  ENDIF

Figure 8.11: The processing of LEAs.

8.4 Proof of Correctness

In this section we prove the correctness of the T-LSR protocol. As with any LSR protocol, we must be careful when defining what can and cannot be proved. For example, consider the problem of establishing leadership consensus in a hypothetical scenario where, whenever a router α is elected as the leader, that router immediately crashes. The other routers will detect the loss of connectivity to α, prompting a new election. Further, let us assume that router α resumes execution shortly after a new leader is elected. If this scenario repeats itself indefinitely, and every newly suggested leader immediately crashes, then it is impossible for any leader-management protocol to maintain stable consensus.

We conclude that a more reasonable goal is to study the behavior of the T-LSR protocol in response to a finite set of events. This model reflects real-world circumstances in which bursts of adverse events are followed by quiet periods, which allow an LSR protocol to return to normal operation. We denote by ℰ a finite set of network status change and transmission corruption events, and by t_0 a time after the last event in ℰ. Let G be the network topology after ℰ, and let α be the router with the highest rank, Rank(α), in G.
Let us assume here that the network topology G is connected. To accommodate disconnected networks, one can simply apply the following argument to individual network segments. It is further assumed that the events in ℰ leave the T-LSR protocol in a chaotic state. Specifically, we assume the following at time t_0.

• The elements of network images are assumed to be random. Specifically, at any router x, V(G) − V(G_x) may not be empty (that is, some routers may be absent from G_x) and V(G_x) − V(G) may not be empty ("ghost routers" could exist in the network image of x). Further, the content of the LSA regarding a router y ≠ x in G_x is also assumed to be random. For example, Rank(y) may be corrupted at x, some incident links of y may be missing in G_x (and hence G_x may not be connected), and ghost incident links of y may exist in G_x.

• At any router x, the values of the Mode(x) and Epoch(x) data structures are assumed to be random.

The goal of this section is to show that the T-LSR protocol will establish correct leadership, construct a correct spanning tree, and achieve consistent network images in the presence of such chaotic states. We do assume, however, that every router x possesses correct knowledge about its own status and local surroundings; specifically, the LSA regarding x itself is not corrupted in G_x. We emphasize that the chaotic states described above can only be created by corruption problems, a very rare type of event. Network status changes, a type of event that happens much more frequently than corruption, always leave the T-LSR protocol in consistent states. The behavior of the T-LSR protocol in the handling of network status changes is investigated by a simulation study; results of the study are presented in Section 8.5.

First, we deal with a type of corruption problem, involving ranks, that could hinder the establishment of leadership consensus.
Let Φ_x(t) be the set of routers whose (corrupted) ranks in G_x at time t are higher than Rank(α). There are two possible causes for a router y to be in Φ_x(t₀): the Rank(y) information is corrupted at x, or y itself is a ghost router. The latter case might happen due to the arrival of an LSA(z, s) whose router ID is corrupted and transformed into a non-existent router ID y and whose rank is corrupted and is larger than Rank(α). Of course, a non-empty Φ_x(t) will prohibit x from selecting α as its preferred leader. In our first lemma, we show that the set Φ_x will become empty t_aging seconds after t₀, where t_aging is the length of aging timers.

Lemma 6: At any router x, Φ_x(t) is empty at any time t > t₀ + t_aging.

Proof: Let y be any router in Φ_x(t₀). Regarding what may happen to y during the [t₀, t₀ + t_aging] period, there are two cases. First, an LSA or CTA originated at router y might arrive and be accepted during the period, fixing the incorrect rank information regarding y at x and consequently removing y from Φ_x(t). Second, if neither such LSAs nor CTAs from y are accepted by x during the period, then we claim that the aging mechanism of the T-LSR protocol will remove y from Φ_x(t). Let ℓ_y be the LSA in G_x that is regarding y, and let t′, t₀ < t′ ≤ t₀ + t_aging, be the time when the aging timer of ℓ_y fires. Depending on the value of Leader(x) at t′, we further consider two subcases. In the discussion, we recall that in the T-LSR protocol all the LSAs maintained by a non-leader router x, one whose Leader(x) ≠ x, are subject to aging, whereas at an established leader only LSAs regarding unreachable routers are subject to aging.

1. Leader(x) ∈ Φ_x(t′). In this case, Leader(x) ≠ x because, due to the assumption that x possesses correct knowledge of its local status, including its rank information, x cannot be in Φ_x(t′). By Aging-Condition-1, ℓ_y ages out, and y is removed from Φ_x.

2. Leader(x) ∉ Φ_x(t′). In this case, Leader(x) may be x.
However, routers in Φ_x(t′), including y, must be disconnected from x in G_x at time t′; otherwise Leader(x) would be set to a reachable router in Φ_x(t′). By Aging-Condition-3, ℓ_y ages out, and y is removed from Φ_x.

In either of the above cases, every element y ∈ Φ_x(t₀) will be removed from the set by time t₀ + t_aging, concluding the proof. □

The next lemma shows how the T-LSR protocol correctly handles incomplete spanning tree topologies.

Lemma 7: Given a spanning tree topology T that is broadcast by a router x, if T does not include every router in the network, then T cannot win all the votes for x from the routers covered by T.

Proof: Let V_T = V(T) ∩ V(G) be the set of routers covered by T, and V̄_T = V(G) − V(T) be the set of routers not covered by T. Consider any router y ∈ V̄_T and any x-to-y path P in the connected, physical topology G. Since x ∈ V_T and y ∈ V̄_T, there must exist two consecutive nodes w and z in P such that w ∈ V_T and z ∈ V̄_T. If T is discarded by any router in V(T) before arriving at w, then of course T cannot win all the votes for x. Otherwise, when T arrives at w, router w will detect the absence of its neighbor z in T. By Discard-CTA-Condition-4, router w will not vote for x. We are done. □

Next, we investigate what happens to a leader candidate when some other router does not prefer the candidate.

Lemma 8: Given any two routers x and y such that Leader(x) = x and Leader(y) ≠ x at a time t ≥ t₀, if Leader(x) is not changed after time t and if Leader(y) is never set to x after time t, then there exists a time t′ ≥ t such that router x will remain permanently in mode G after t′.

Proof: Since Leader(x) is set to x permanently, router x broadcasts an infinite sequence of spanning trees (T₁, T₂, T₃, ...) after time t. Let us consider the topology T₁. If T₁ covers y, then, of course, router x cannot obtain the vote of router y. If T₁ does not cover y, then, by Lemma 7, T₁ cannot win the votes from all the routers in V(T₁).
In either case, router x must set its mode to G at a time t′ ≥ t. Because the above argument also applies to every subsequent tree topology T_i, i > 1, router x will stay in mode G indefinitely after time t′. We are done. □

With the above properties established, we are ready for the first major result. In the proof, we use the expression "by Lemma 8 (w, z, t)" to cite that lemma with router w in the position of x and z in the position of y, using the time reference point t. Recall that α is assumed to be the router in G with the highest rank.

Theorem 7 (Leadership Consensus Property): There exists a time t₁ ≥ t₀ + t_aging such that at any time t ≥ t₁ and for every x ∈ V(G), Leader(x) = α.

Proof: Given a router x, we denote by t_x the earliest time when Φ_x is empty. At router α, Leader(α) at time t_α must be α itself. Moreover, if any router x sets Leader(x) to α at time t_x, then Leader(x) will not be changed after t_x, because no (rank) corruption will occur after time t₀. It follows that Leader(α) will not be changed after t_α.

Next, we consider a router x ≠ α whose Leader(x) is never set to α after t_x. Although Lemma 6 assures us that Φ_x will be empty by time t₀ + t_aging, Leader(x) is not guaranteed to be set to α; the rank of α in G_x itself, denoted Rank_x(α), may be corrupted. A different preferred leader, whose rank is higher than Rank_x(α), may be selected by x at t_x. Under such circumstances, by Lemma 8 (α, x, t′), where t′ = max{t_x, t_α}, by our earlier argument that Leader(α) will not change value after time t_α, and by the selection of x, router α will operate in mode G indefinitely after some time t″ ≥ t′. Next, we must show that the corrupted Rank_x(α) information at x is subject to aging, allowing the G-mode periodic flooding from α to correct the corruption. Let ℓ_α be the LSA in G_x that is regarding router α and that contains the corrupted rank information Rank_x(α). Let Γ = (t₁, t₂, t₃, ...)
be the sequence of times when the aging timer associated with ℓ_α fires. If there exists any t_i ∈ Γ such that any one of the three aging conditions holds at time t_i, then ℓ_α ages out at time t_i (in this case, t_i is the last, largest element in Γ). Assuming otherwise (that is, that none of the three aging conditions holds at any time t_i ∈ Γ) would result in an infinite Γ. Under such circumstances, let t_a be the smallest element in Γ that is larger than t_α, the time when Leader(α) is permanently set to α. By Aging-Condition-1 and the selection of the elements of Γ, Leader(x) = x at time t_a. It follows that, by Lemma 8 (x, α, t_a) and the fact that Leader(α) will not change value after time t_α, router x must change permanently to mode G at some time t′_x ≥ t_a. Let t_b be any element in Γ that is larger than t′_x. At time t_b, Mode(x) = G, a contradiction to the assumption that none of the three aging conditions, including the condition Mode(x) = G, holds at time t_b. Hence, ℓ_α will be aged out, enabling router x to accept the G-mode flooding from α and learn the correct rank of α. Consequently, Leader(x) will be set to α, a contradiction to our assumption that Leader(x) is never set to α. We are done. □

Next, we turn our attention to the problem of achieving consistent routing information. Specifically, we show that all routers will possess network images identical to G. The next lemma establishes this property at the leader router α. In the proof, we use the notation G_x(t) to denote the network image of router x at time t.

Lemma 9: The network image at the leader α will converge to G.

Proof: We denote by Ω(t) the set of routers whose respective information is incorrect in G_α at time t. There are three causes for a router x to be included in Ω(t₀): x is a ghost router, x is a real router that is absent from G_α, or x is a real router that is present in G_α but whose LSA ℓ_x in G_α is corrupted.
We note that new elements (that is, routers) cannot be added to the "corruption set" Ω after time t₀ because corruption problems cannot happen after that time. If there exists a time t ≥ t₀ such that Ω(t) = ∅, then G_α converges to G at that time and will remain so thereafter. Let us assume the opposite, that is, that Ω(∞) is not empty. As argued earlier, there exists a time t_α when Leader(α) is set to α permanently, and hence router α will broadcast, regardless of the presence or absence of corruption problems in G_α, an infinite sequence of tree topologies after time t_α. Let us denote by t_Ω the time when the corruption set Ω has stabilized to Ω(∞). Since the incorrect parts of G_α stabilize at time t_Ω, the network image G_α itself also stabilizes at that time. This stabilized image will be denoted by G_α(∞). We further denote by 𝒯 = (T₁, T₂, T₃, ...) the infinite sequence of tree topologies broadcast by α after time max{t_Ω, t_α}.

First, we claim that there are no ghost routers left in Ω(∞). To see this property, we assume that x is a ghost router that remains in V(G_α) indefinitely. If x is disconnected from α in G_α(∞), then by Aging-Condition-3 it will be aged out by time t_Ω + t_aging, a contradiction. If x is connected to α in G_α(∞), then we claim that x is covered by every T ∈ 𝒯. Since x cannot vote, router α will have to remain in mode G indefinitely and age out x, a contradiction. To see why a ghost router x that is connected to α in G_α(∞) must be in V(T) for any T ∈ 𝒯, let us first consider the tree T₁ in 𝒯. We point out that the fact that T₁ is used after t_Ω does not imply that it was computed after that time. Therefore, one cannot infer the coverage of x by T₁ directly from the presence of x in G_α(∞). Let us assume that x is disconnected from α in G_α when T₁ is computed.
It follows that, from the point of the computation of T₁ to time t_Ω, router α must detect at least one change in the set V(G_α) (specifically, the addition of x) and must compute a spanning tree to cover x, rendering T₁ obsolete by time t_Ω, a contradiction to the definition of T₁. Hence, T₁ must cover x. Moreover, if router α never re-computes the spanning tree after T₁, then it uses T₁ indefinitely after t_Ω (that is, T_i = T₁ for any i > 1). If router α does perform this re-computation after T₁, then subsequent tree topologies in 𝒯 are based on G_α(∞) and must contain x. In both cases, every tree T ∈ 𝒯 covers router x.

Next, let us deal with any real router x in Ω(∞) whose corrupted LSA ℓ_x remains in G_α indefinitely (that is, ℓ_x is never aged out). Let us consider any tree topology T ∈ 𝒯. If T contains x, then router α needs the vote of x. When T arrives at x, router x will detect the corruption in ℓ_x and refuse to vote for α (Discard-CTA-Condition-3), forcing α to switch to mode G. If T does not contain x, then, by Lemma 7, T cannot win all the necessary votes for α and router α must also switch to mode G. Since the above argument applies to all the tree topologies in 𝒯, router α will remain in mode G indefinitely. It follows that ℓ_x will age out, a contradiction to the selection of x.

As such, there can be only one type of corruption problem for elements in Ω(∞): they must be real routers that are absent from G_α after time t₀. Let x be any such router in the set Ω(∞). By Lemma 7, router α cannot establish its leadership and will be permanently in mode G after some time t′. By Theorem 7, Leader(x) will be permanently set to α at some time t_x, and thus the LSA regarding α at x must be subject to aging after t_x. This ensures that the G-mode flooding from α will be accepted by x, turning the operation mode of x to G.
Consequently, x periodically floods its local status in mode G, and this flooding is guaranteed to be accepted by router α, which does not have x in its network image at all. However, this contradicts the assumption that x is absent from G_α indefinitely. We have excluded all the possible causes of a non-empty Ω(∞), and hence have shown that G_α will converge to G. □

After corruption problems in G_α have been "cleaned up," the spanning tree topologies computed by α will be correct and accepted by all the routers in the network, as we show below. Recall that t_α denotes the time when Leader(α) is permanently set to α. Further, since we have shown that Ω(∞) = ∅, t_Ω denotes the time when all the corruption problems in the network image of leader α have been removed.

Theorem 8 (Tree-Topology Consensus Property): There exists a time after which all the routers in the network agree on the same spanning tree topology, which is a correct spanning tree topology proposed by α.

Proof: Let us consider the first spanning tree T₁ in 𝒯, the sequence of spanning tree topologies broadcast by α after time max{t_Ω, t_α}. Since T₁ may be computed before t_Ω, that is, before all corruption problems in G_α are resolved, it could contain the three types of flaws listed below.

1. Tree T₁ contains a ghost router x. In this case, of course, x will not vote for α, which must in response switch to mode G.

2. Tree T₁ does not cover a (real) router x. (This case happens because, when T₁ is computed, router x is absent from G_α.) By Lemma 7, T₁ cannot win all the votes from routers in V(T₁), and router α must switch to mode G.

3. Tree T₁ uses a non-existent link x-y, where both x and y are real routers. (This case happens when LSAs regarding x and/or y are corrupted in G_α when T₁ is computed.) Without loss of generality, we assume that x is the parent of y in T₁.
In this case, since x cannot deliver the CTA that contains T₁ to y, router y will not vote for α, again forcing α to switch to mode G.

If T₁ suffers from any one of the above flaws, then α switches to mode G and must compute a new spanning tree for the next CTA broadcast. The new tree, computed after t_Ω, will be a correct spanning tree for G and will be included in CTA broadcasts thereafter. If, on the other hand, T₁ is free from the above problems, then T₁ is a correct spanning tree topology (it covers every router and contains neither non-existent routers nor non-existent links), and it will be used as the tree topology after t_Ω. In the case where T₁ is the permanent spanning tree topology after t_Ω, it is worth pointing out that, when T₁ is computed, G_α may still contain corruption problems that do not affect the correctness of spanning tree computation, such as disconnected ghost routers. Further, the above argument does not exclude the possibility that T₁ is computed before t₀; as such, the argument also applies to an empty event set E.

Let T be the final, correct spanning tree topology computed by α, and let c be the epoch number of T. After all larger-than-c values of the Epoch(x) data structures throughout the network age out, and after all routers select α as their preferred leader, T will be accepted by all the routers in the network, concluding the proof. □

Finally, we are ready to establish the most important property of any LSR protocol: the capability to maintain correct, consistent network images throughout the network.

Theorem 9 (Network Image Consensus Property): The network image G_x of every router x ∈ V(G) will converge to G.

Proof: After a correct spanning tree is constructed, router α will broadcast an LEA, and all routers will enter mode-T operation. Consequently, all non-leader routers stop performing periodic flooding. Since there are no events after t₀, non-leader routers will not flood event-driven LSAs either.
Let ℓ = LSA(y, s, m) be the LSA regarding y in G_α, and let ℓ′ = LSA(y, s′, m′) be the LSA regarding y in G_x, where x is any non-leader router. Upon receiving a CTA from α, which contains ℓ, if (s′, m′) < (s, m), then router x will accept ℓ, learning the correct status of y. If (s′, m′) ≥ (s, m), then, since x will not receive periodic flooding from y, ℓ′ will be aged out in t_aging seconds, again allowing ℓ to be accepted by x. We are done. □

8.5 Performance Evaluation

We studied the performance of the T-LSR protocol through simulation. The simulator is based on the CSIM package [48]. Confidence intervals were computed, but in most cases they are very small and, for clarity, are not shown in the plots. Networks comprising up to 400 routers were simulated. Such network sizes conform with those supported by existing LSR standards. (For example, the OSPF protocol supports networks with up to 200 routers.) For each network size, 100 graphs were generated randomly. To conform to network topology characteristics observed in the Internet [57], the average node degrees of these graphs are typically small, ranging from 2.25 (for 10-node graphs) to 4 (for 400-node graphs).

Each message transmission incurs software overheads, including message copying, error checking, processor interrupts, and so forth. Of course, such overheads vary from platform to platform. In this study, we measured these overheads on the ATM testbed in our laboratory. The testbed comprises Sun SPARC-10 workstations equipped with Fore SBA-200 adapters and connected with three Fore ASX-200 switches. From these measurements, we obtained the figure 600 μsec, which includes the overhead at both the sending and receiving switches. Since this figure also conforms with typical raw IP overheads that researchers have observed on a variety of workstations [71], the results reported here may be applicable to other LSR-based networking platforms.

Comparison of periodic flooding overhead.
First, we compare the T-LSR protocol with the C-LSR protocol when performing periodic flooding. In one cycle of periodic flooding, called a re-flooding cycle, each router floods exactly once under the C-LSR protocol. In the case of the T-LSR protocol, the leader router broadcasts a CTA, and the other routers cast ballots. For either LSR protocol, we measured the number of messages processed, including acknowledgments, per router per re-flooding cycle. The results of this study, presented in Figure 8.12(a), illustrate a major advantage of the T-LSR protocol, namely, reducing the number of message interrupts at each router.

To account for the differences in message size (the T-LSR protocol uses relatively large messages, namely CTAs, in periodic flooding), we also measured the total message processing time at each router, using the per-byte software overhead of 1.09 as reported in [71]. The resulting metric can be considered the average workload at each router. To compute the lengths of CTAs, we used the LSA format of the OSPF protocol, where an LSA of a router with d incident links is 24 + 12 × d bytes long (excluding the IP header). Thus, in a network with N routers and with average node degree D, a CTA comprises N × (24 + 12 × D) bytes. As we can see in Figure 8.12(b), the T-LSR protocol imposes only a small fraction of the workload of the C-LSR protocol.

[Figure 8.12: Comparison of periodic-flooding overhead: (a) messages per router and (b) average processing time per router, versus network size (routers).]

To further understand the behavior of the periodic flooding mechanism of the T-LSR protocol, we plot in Figure 8.13 the times used by CTA broadcasts in networks of different sizes.
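The CTA length formula quoted above can be checked with a quick computation; the function names below are illustrative, and the degree value is taken from the largest simulated graphs.

```python
# Worked example of the CTA length formula (OSPF LSA format): an LSA for a
# router with d incident links occupies 24 + 12*d bytes (excluding the IP
# header), so a CTA carrying one LSA per router in a network of N routers
# with average node degree D comprises roughly N * (24 + 12*D) bytes.
# Function names are assumptions made for this example.

def lsa_bytes(d):
    """Length of one OSPF-format LSA for a router with d incident links."""
    return 24 + 12 * d

def cta_bytes(n_routers, avg_degree):
    """Approximate length of a CTA carrying one LSA per router."""
    return n_routers * lsa_bytes(avg_degree)

# The largest simulated graphs have 400 routers with average degree about 4:
assert lsa_bytes(4) == 72
assert cta_bytes(400, 4) == 28800   # roughly 28.8 KB per CTA broadcast
```

This back-of-envelope figure makes it plausible that, as noted above, the overhead of CTA broadcasts stems primarily from the large sizes of the CTAs themselves rather than from the number of messages.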
A CTA broadcast begins at the moment when the leader starts sending the corresponding CTA and ends at the moment when the leader receives all necessary ballots. Since, when the leader router is in mode G, a CTA broadcast is also used for leader (re)election and spanning tree construction, the results in Figure 8.13 can also be interpreted as the leader election and spanning tree construction times of the T-LSR protocol. As we can see in the figure, a CTA broadcast can typically be completed within 350 milliseconds. The overhead of CTA broadcasts primarily stems from the large sizes of CTAs.

[Figure 8.13: Efficiency of CTA broadcast: completion time (ms) versus network size (routers).]

Performance of individual flooding operations. In addition to periodic flooding, both LSR protocols use event-driven flooding to disseminate changes in network status. For such flooding operations, we are interested in three performance metrics: the LSA receipt time, the flooding completion time, and bandwidth consumption. The LSA receipt time of a given router is the time when the first copy of the LSA arrives at the router, whereas the flooding completion time at the router is the time when the router finishes processing the last acknowledgment pertaining to this flooding operation. The bandwidth metric refers to the total number of LSA forwardings incurred by a flooding operation. The averaged results regarding these metrics are presented in Figure 8.14.

As seen, the C-LSR protocol outperforms the T-LSR protocol in both time metrics. This is because, under the conventional flooding algorithm, a router acts aggressively, forwarding an LSA to all its neighboring nodes (rather than only the neighbors defined by a spanning tree) and thus causing its neighboring nodes to receive the LSA earlier.
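The gap between the two protocols in the bandwidth metric can be approximated with a back-of-envelope model. This is a simplification of my own, not the simulator's exact accounting: assume that in conventional flooding the source sends the LSA to all of its neighbors and every other router relays it to all neighbors except the one it came from, giving 2|E| − (N − 1) forwardings, while tree-based flooding sends the LSA once over each of the N − 1 tree links.

```python
# Back-of-envelope model of the bandwidth metric (number of LSA forwardings).
# Simplifying assumptions (not the simulator's accounting): in conventional
# flooding the source forwards to all neighbors and every other router
# forwards to all neighbors but one, i.e., 2|E| - (N - 1) forwardings in
# total; tree-based flooding uses each of the N - 1 tree links exactly once.

def conventional_forwardings(n, avg_degree):
    """Estimated forwardings for conventional flooding in a graph with
    n routers and the given average node degree."""
    edges = n * avg_degree // 2          # |E| = N * D / 2
    return 2 * edges - (n - 1)

def tree_forwardings(n):
    """Forwardings for tree-based flooding: one per tree link."""
    return n - 1

# 400 routers with average node degree 4, as in the largest simulated graphs:
assert conventional_forwardings(400, 4) == 1201
assert tree_forwardings(400) == 399
```

For the 400-router graphs, the model predicts roughly 1200 forwardings for conventional flooding versus about 400 for tree-based flooding, which is consistent with the roughly three-to-one bandwidth advantage of T-LSR discussed below.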
However, this aggressiveness also implies that the router has to perform larger numbers of LSA forwardings and process more acknowledgments, as clearly shown in the results regarding the bandwidth metric plotted in Figure 8.14(b). In this metric, the T-LSR protocol enjoys a comfortable lead, of course, because it uses only tree links to forward LSAs.

[Figure 8.14: Comparison of event-driven flooding performance: (a) LSA receipt and completion times for C-LSR and T-LSR, and (b) number of LSA forwardings, versus network size (routers).]

In summary, during normal operation periods of the T-LSR protocol, a flooding operation is somewhat slower to deliver the respective LSA, but much more economical in terms of operational overhead than its conventional counterpart. Since the completion times of T-LSR are typically less than 12 milliseconds, the T-LSR protocol still retains the responsiveness of C-LSR. Moreover, in the C-LSR protocol, while an LSA may be received earlier, the ensuing processing of the LSA, such as the updating of routing tables, will be slowed down by the remaining tasks of the flooding operation. It is our belief that T-LSR's large advantage in processing overhead outweighs its slightly slower LSA receipt time.

Evaluation of flooding mode switching. In the T-LSR protocol, a T-mode flooding operation has to switch to mode G if it cannot complete the operation in mode T, for example, due to the failure of a tree link. In this part of our study, we evaluate the overhead imposed by flooding mode switching.
Specifically, we assume that two events e₁, advertised by router x, and e₂, advertised by router y, occur simultaneously, where e₁ affects the spanning tree but e₂ does not. (Prior to the events, all the routers are in mode T.) In the C-LSR protocol, the advertisements of both e₁ and e₂ use the conventional flooding algorithm. In the T-LSR protocol, router x will advertise e₁ in mode G, but router y, without knowing the malfunctioning status of the spanning tree, will initiate the advertisement of e₂ in mode T. The flooding of e₂ will switch to mode G later in order to reach all routers. We simulated these two flooding operations and measured the completion time and bandwidth consumption of the entire "scenario," that is, the flooding of both e₁ and e₂, under the different LSR protocols. The respective results are plotted in Figure 8.15. As shown, the T-LSR protocol performs slightly less efficiently in both performance metrics. This should not be a surprise, because the advertisement of e₂ in the T-LSR protocol incurs an unfinished T-mode flooding and a complete G-mode flooding.

[Figure 8.15: Overhead of flooding mode switching: (a) completion time and (b) number of LSA forwardings, versus network size (routers).]

We emphasize, however, that flooding mode switching is not expected to occur frequently. For a T-mode flooding to change to mode G, there must be a simultaneous G-mode flooding that has not reached the source of the T-mode flooding. Since the T-LSR protocol advertises in mode G the failure of a network component that damages the spanning tree, according to Figure 6.4(a) such an event can be advertised throughout the network within 10 milliseconds.
Thus, only if a T-mode flooding were initiated in the 10-millisecond window after such a failure would a mode switching occur. As a result, the fraction of T-mode flooding operations that must change mode is likely to be small.

8.6 Summary

We have proposed a novel LSR architecture, called the T-LSR protocol, that elects a leader to perform periodic flooding on behalf of other routers and constructs a spanning tree to reduce the overhead of advertising network status updates. Three consensus properties of the T-LSR protocol, namely, the Leadership Consensus Property, the Tree-Topology Consensus Property, and the Network Image Consensus Property, have been proved formally under any combination of network component failures, network partitioning, and message corruption events. Our simulation results show that the T-LSR protocol incurs a small fraction of the overhead of the C-LSR protocol during its normal operation periods, and only moderate overhead in adverse circumstances where the spanning tree is under repair/construction and the leader is being elected. The development of such a lightweight and robust LSR protocol is especially beneficial to communication applications, such as multimedia applications, that demand frequent updates of network status and resource availability information to ensure smooth transit of traffic streams.

Chapter 9

Conclusions and Future Work

If I were to conclude this work in one sentence, I would say that it "re-visits conventional group communication/distributed computing problems under an unusual assumption that complete information about the entire communication network is universally available." Of course, this assumption is not true in general cases; it holds only in a special computing environment where distributed algorithms are executed by routers/switches to implement networking protocols in LSR-based networks.
It should not be difficult to see that this computing environment provides powerful facilities that can greatly reduce the complexities of group communication problems. As an example, using the network images maintained by LSR, every participant of an election can learn of the loss of connectivity to the current leader "for free," without using any probing or monitoring mechanism. On the other hand, the low-level nature of this computing environment presents unique challenges. The biggest challenge is that, since networks are expected to continue providing communication services even in the presence of exceedingly rare but catastrophic adverse events, such as network partitioning and undetected transmission errors, the algorithms executed within the network to implement these services must also survive such events. A fundamental contribution of this dissertation is to show that developing distributed algorithms specifically for this computing environment results in better group communication solutions and, moreover, improvements to LSR itself.

Using the network images maintained by LSR, we have developed the GMC protocol, which can be considered a generic distributed implementation of MC/multicast routing algorithms. The ability to support different MC topology types and computation algorithms is important when a wide spectrum of multiparty communication applications, each with unique characteristics and expectations of the network, are deployed.

Using router/switch connectivity information provided by LSR, we have developed a network-level leader election protocol, the NLE protocol. We have discussed important network services that could benefit from the NLE protocol, including hierarchical routing, address mapping services, and multicast. Based on the NLE protocol, we have designed a centralized solution to the problem of multicast core management, namely, the LCM protocol.
In addition to using network images, the LCM protocol further makes use of the shortest-path routing trees computed by LSR to support certain tasks of multicast core management, such as core migration.

Finally, one of the most important group communication problems is network routing itself. Since every network switching element can observe only its local surroundings, the task of finding paths to relay communication traffic across the network must be performed by all routers/switches collectively. In this dissertation, we have advocated the use of group communication techniques to improve the performance of LSR. For ATM networks, we have developed a family of efficient flooding algorithms, the SAF protocols, that take advantage of the hardware switching capabilities of such networks. These protocols construct a spanning tree and a ring in a given ATM network to improve the performance of flooding operations in the network. For other LSR-based networks, such as many autonomous systems in the Internet, we have developed the T-LSR protocol to reduce the overhead associated with both periodic and event-driven flooding, using two group-communication-based techniques, spanning tree construction and leader election. Considering all these results, we have clearly demonstrated the mutually beneficial relationship between group communication and LSR.

The research of this dissertation can be extended in several directions, described as follows.

As pointed out earlier, LSR is not intended for direct implementation in large networks. This restriction inevitably raises the question of how our group communication protocols, which are all LSR-based, can be applied in such networks. In the case of ATM networks, the entire network is recursively divided into smaller routing domains, and the same routing method, namely LSR, is applied at all routing levels.
In such circumstances, our group communication solutions can be executed recursively in the routing hierarchy. For example, to construct a receiver-only MC in a large ATM network, a top-level MC can be constructed at the top routing level; the members of this MC are representative group members elected in those second-highest-level routing domains that contain at least one member of the group. Subsequently, each such domain constructs a second-level MC within that domain. The low-level MCs and the top-level MC are connected together using the representative group members in low-level domains as contact points. This process is repeated until the lowest routing level is reached. In fact, a well-defined routing hierarchy may enable the use of different MC topology types and computation algorithms at different levels for the same network group. We point out that the PIM protocol already supports such "hybrid MCs," to a limited extent, by constructing source-rooted trees at the inter-AS level and shared trees within ASs. The generalization of the GMC protocol to allow any MC type at any routing level and the potential applications of hybrid MCs constitute an interesting area of future research.

However, in other networks, most prominently the Internet, LSR is restricted to individual routing "islands" (that is, routing domains or autonomous systems), and another routing method is used to perform routing among these islands. In such cases, an LSR-based group communication protocol must cooperate seamlessly with a high-level protocol, which may not be LSR-based. Such integration issues require further investigation. For this integration problem, the technique that uses a leader election to reduce an LSR-based domain to a single node could play an important role. Considering again the example of the construction of a network-wide MC, an inter-AS MC protocol can treat an LSR-based AS as a single node by electing a representative member in that domain.
199 Another important area of future research is the support of QOS routing, which finds paths to carry resource-demanding, multimedia traffic. Many methods devel- oped in this dissertation address the Operational aspects of network routing and could be used in the design of mechanisms that timely disseminate the information required by QOS routing. For example, our methods could be used to elect rout- ing server/center, and/or reduce the workload of individual routers/switches. One promising possibility is to elect a leader router to periodically collect and broadcast the resource utilization status of the entire network. This collect-and-broadcast pro- cess could be a variation of the CTA broadcast and ballot collection process used in the T-LSR protocol (specifically, each router includes its up—to-date local status in its ballots). When the resource utilization status of the network fluctuates at a high rate, for example, due to a long burst of VC establishment and destruction requests, using an orderly process of information collection and dissemination might produce much more eflicient routing Operations, when compared to having all routers/ switches flood their status changes individually. Furthermore, an adaptive LSR protocol could be developed to adjust the period of the above collect-and-broadcast process so that the process is executed more frequently when the status of the network changes rapidly, and less frequently when the network is stable. Bibliography [1] S. E. Deering and D. R. Cheriton, “Multicast routing in datagram internetworks and extended LANS,” ACM Transactions on Computer Systems, vol. 8, pp. 85—110, May 1990. [2] S. Deering, D. L. Estrin, D. Farinacci, V. Jacobson, C.-G. Liu, and L. Wei, “The PIM architecture for wide-area multicast routing,” IEEE/A CM Trans. on Networking, vol. 4, pp. 153-162, April 1996. [3] A. Ballardie, P. Francis, and J. 
Crowcroft, “Core based trees,” in Proceedings of the ACM SIG COMM ’93, (San Francisco, CA), September 1993. [4] A. Ballardie, “Core based trees (CBT version 2) multicast routing.” Internet RFC 2189, September 1997. [5] D. Waitzman, C. Partridge, and S. Deering, “Distance vector multicast routing proto- col.” Internet RFC 1075, November 1988. [6] J. Moy, “Multicast extensions to OSPF.” Internet RFC 1584, March 1994. [7] S. Deering, “Host extensions for IP multicasting.” Internet RFC 1112, August 1989. [8] D. W. Wall, Mechanisms for Broadcast and Selective Broadcast. PhD thesis, Stanford University, June 1980. [9] Q. Zhu, M. Parsa, and J. J. Garcia-Luna—Aceves, “A source-based algorithm for delay- constrained minimum-cost multicasting,” in Proceedings of the IEEE INFOCOM ’95, pp. 377-385, 1995. [10] F. Bauer and A. Varma, “Degree-constrained multicasting in point-tO-point networks,” in Proceedings of the IEEE INFOCOM ’95, pp. 369—376, 1995. [11] J. Moy, “OSPF version 2.” Internet RFC 1583, March 1994. [12] J. M. McQuillan, I. Richer, and E. C. Rosen, “The new routing algorithm for the ARPANET,” IEEE Transactions on Communications, pp. 711—719, May 1980. [13] ATM Forum, “Private network-network interface specification version 1.0.” ATM FO- rum technical specification af-pnni-0055.0000, March 1996. [14] W. J. Clark, “Multipoint multimedia conferencing,” IEEE Communications Magazine, May 1992. 200 [15] [10] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [301 201 S. R. Ahuja and J. R. Esnor, “Co-ordination and control of multimedia conferencing,” IEEE Communications Magazine, May 1992. J. Udell, “Computer telephony,” Byte, vol. 19, no. 07, pp. 80—99, 1994. J. Oikarinen and D. Reed, “Internet relay chat protocol.” Internet RFC 1459, May 1993. W. Reinhard, J. Schweitzer, G. Vlksen, and M. Weber, “CSCW tools: Concepts and architectures,” IEEE- Computer, May 1994. M. Harrick, P. V. Rangan, and M. 
Chen, “System support for computer mediated multimedia collaborations,” in Proceedings of the 1992 ACM Conference on Computer Supported Cooperative Work ( 050 W ’92), pp. 203—209, November 1992. J. M. Pullen, M. Myjak, and C. Bouwens, “Limitations of Internet protocol suite for distributed simulation in the large multicast environment.” Internet draft draft-pullen— lame-00.txt, September 1996. H. W. Holbrook, S. K. Singhal, and D. R. Cheriton, “Log-based receiver-reliable mul- ticast for distributed interactive simulation,” in Proceedings of SI G COMM ’95, (Cam- bridge, MA USA), pp. 328—341, 1995. M. Satyanarayanan, J. J. Kistler, P. Kumar, M. E. Okasaki, E. H. Siegel, and D. C. Steere, “Coda: A highly available file system for a distributed workstation environ- ment,” IEEE Transactions on Computers, vol. 39, April 1990. J. Postel and J. Reynolds, “File transfer protocol (FTP).” Internet RFC 959, October 1985. T. Berners-Lee, “Hypertext transfer protocol (HTTP).” available at ftp: / / info.cern.ch/ pub/ www/ doc/ http-spec.txt.Z, November 1993. T. Berners-Lee and D. Connolly, “Hypertext markup language 2.0.” Internet RFC 1866, November 1995. J. R. Cooperstock and S. Kotsopoulos, “Why use a fishing line when you have a net? an adaptive multicast data distribution protocol,” in Proceedings of USENIX Technical Conference ’96, 1996. S. Floyd, V. Jacobson, S. McCanne, C.-G. Liu, and L. Zhang, “A reliable multicast framework for light-weight sessions and applications level framing,” in Proceedings of SIGCOMM ’95, (Cambridge, MA USA), pp. 342-356, 1995. M. Hofmann, T. Braun, and G. Carle, “Multicast communication in large scale net- works,” in Proceedings of Third IEEE Workshop on High Performance Communication Subsystems (HPCS), (Mystic, Connecticut USA), August 1995. J. C. Lin and S. Paul, “RMTP: A reliable multicast transport protocol,” in Proceedings of IEEE INFOCOM ’96, March 1996. ATM Forum, ATM User-Network Interface (UNI) Specification Version 3.1. 
Prentice Hall, September 1994. 202 [31] P. Winter, “Steiner problem in networks: a survey,” Networks, pp. 129—167, 1987. [32] A. J. Ballardie, A New Approach to Multicast Communication in a Data- gram Internetwork. Ph.D. thesis, Department of Computer Science, Uni- versity College London, May 1995. Available via anonymous ftp from cs.uc1.ac.uk:darpa/IDMR/ballardie-thesis.ps.Z. [33] D. Estrin, D. Farinacci, A. Helmy, D. Thaler, S. Deering, M. Handley, V. Jacobson, C. Liu, P. Sharma, and L. Wei, “Protocol independent multicast sparse mode (PIM- SM): Protocol specifications.” Internet RFC 2117, June 1997. [34] M. Imase and B. M. Waxman, “Dynamic Steiner tree problem,” SIAM Journal on Discrete Mathematics, vol. 4, pp. 369—384, August 1991. [35] B. M. Waxman, “Performance evaluation of multipoint routing algorithms,” in Pro- ceedings of INFOCOM’ 93, 1993. [36] A. Thyagarajan and S. Deering, “Hierarchical distancewector multicast routing for the Mbone,” in Proceedings of ACM SIG COMM, (Cambridge, Massachusetts), August 1995. [37] A. Ballardie, “Core based trees (CBT) multicast routing architecture.” Internet RFC 2189, September 1997. [38] C. Shields and J. J. Garcia-Luna-Aceves, “The ordered core based tree protocol,” in Proceedings of IEEE INF OCOMM, (Kobe, Japan), April 1997. [39] S. Kumar, P. Radoslavov, D. Thaler, C. Alaettinoglu, D. Estrin, and M. Handley, “The MASC/BGMP architecture for inter-domain multicast routing,” to appear in Proceedings of ACM SIGCOMM, (Vancouver, Canada) August, 1998. [40] Fore Systems, Inc., ForeRunner SBA-200 ATM SBus Adapter User Manual, 1993. [41] D. Dykeman, H. L. Truong, and H. J. Sandick, “Alternatives for the support of the ATM group services.” ATM Forum internal contribution 95-0438, April 1995. [42] F. Liaw, “A straw man proposal for ATM group multicast routing and signaling pro- tocol: Architecture overview.” ATM Forum internal contribution 94-0995, November 1994. [43] R. 
Perlman, “Fault-tolerant broadcast of routing information,” in Proceedings of IEEE Infocom ’83, (San Diego), 1983. [44] D. E. Corporation, “Information processing systems — data communications — interme- diate system to intermediate system intra- domain routing protocol,” October 1987. Also available as Internet RFC 1142. [45] D. Bertsekas and R. Gallager, Data Networks. Prentice—Hall, 1987. [46] K. L. Calvert, E. W. Zegura, and M. J. Donahoo, “Core selection methods for multicast routing,” in Proceedings of IEEE I CCCN ’95, (Las Vegas, Nevada), 1995. 203 [47] E. Fleury, Y. Huang, and P. K. McKinley, “On the performance and feasibility of mul- ticast core selection heuristics,” Tech. Rep. MSU-CPS-97—42, Department of Computer Science, Michigan State University, East Lansing, Michigan, October 1997. [48] H. D. Schwetman, “CSIM: A C-based, process-oriented simulation language,” Tech. Rep. PP-080-85, Microelectronics and Computer Technology Corporation, 1985. [49] FORE Systems, Inc., SPANS NNI: Simple Protocol for ATM Network Signaling ( N etwork- to-N etwork Interface) Release 3. 0, 1993. available at ftp: / / ftp.fore.com / pub / docs / spans / spans3nni.ps. [50] T. von Eicken, A. Basu, V. Buch, and W. Vogels, “U-Net: A user-level network in- terface for parallel and distributed computing,” in Proc. of the 15th ACM Symposium on Operating Systems Principles, (Copper Mountain, Colorado), pp. 40—53, December 1995. [51] D. Johnson, D. Lilja, and J. Riedl, “A circulating active barrier synchronization mech- anism,” in Proceedings of the 1995 International Conference on Parallel Processing, vol. I, pp. 202—209, August 1995. [52] N. Fredrickson and N. Lynch, “Electing a leader in a synchronous ring,” Journal of the ACM, vol. 34, pp. 98—115, January 1987. [53] L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, R. K. Budhia, and C. A. Lingley- Papadopoulos, “Totem: A fault-tolerant multicast group communication system,” Communications of the ACM, vol. 39, no. 4, 1996. [54] I. 
Cidon, T. Hsiao, A. Khamisy, A. Parekh, R. Rom, and M. Sidi, “The OpeNet architecture,” Tech. Rep. 95-37, Sun Microsystems, December 1995. [55] I. Cidon, A. Gupta, T. Hsiao, A. Khamisy, A. Parekh, R. Rom, and M. Sidi, “OPENET: An Open and efficient control platform for ATM networks,” in Proc. INFOCOM ’98, (San Francisco, CA), pp. 824-831, March 1998. [56] M. R. Garey and D. S. Johnson, Computers and Intractability, A Guide to the Theory of NP-Completeness. 41 Madison Avenue, New York, NY. 10010: W. H. Freeman and Company, 1979. [57] E. W. Zegura, K. L. Calvert, and S. Bhattacharjee, “How to model an internetwork,” in Proceedings of IEEE INFOCOM ‘96, (San Francisco, California), March 1996. [58] E. Crawley, R. Nair, B. Rajagopalan, and H. Sandick, “A framework for QoS-based routing in the Internet.” Internet draft draft-ietf-qosr-framework-OO.txt, March 1996. [59] F. Bauer and A. Varma, “ARIES: A rearrangeable inexpensive edge-based on-line Steiner algorithm,” in Proceedings of IEEE Infocom ’96, (San Francisco, California), pp. 361-368, March 1996. [60] B. M. Waxman, “Routing of multipoint connections,” IEEE Journal of Selected Areas in Communications, vol. 6, no. 9, pp. 1617-1622, 1988. [61] L. Lamport, “Time, clocks, and the ordering of events in a distributed system,” Com- munications of the ACM, vol. 21, pp. 558—565, July 1978. [62] [63] [64] [05] [66] [07] [08] [09] [70] [71] [72] [73] [74] [75] [70] 204 H. D. Schwetman, “CSIM: A C-based, process-oriented simulation language,” Tech. Rep. PP-080-85, Microelectronics and Computer Technology Corporation, 1985. D. Menasce, R. Muntz, and J. Popek, “A locking protocol for resource coordination in distributed databases,” ACM TODS, pp. 103—138, 1980. K. Birman, “Implementing fault tolerant distributed objects,” IEEE Transaction on Software Engineering, pp. 502—508, 1985. H. Garcia-Molina, “Elections in a distributed computing system,” IEEE Trans. on Computers, vol. 31, pp. 48—59, January 1982. S. Singh and J. 
Kurose, “Electing ‘good’ leaders,” Journal of Parallel and Distributed Computing, vol. 21, pp. 184—201, May 1994. G. Armitage, “Support for multicast over UNI 3.0/3.1 based ATM networks.” Internet RFC 2022, November 1996. M. Laubach, “Classical IP and ARP over ATM.” Internet RFC 1577, January 1994. M. Handley and V. Jacobson, “SDP: Session description protocol.” Internet draft draft- ietf-mmusic-sdp-03.txt, March 1997. M. J. Donahoo and E. W. Zegura, “Core migration for dynamic multicast routing,” in Proceedings of IEEE ICCCN ’96, (Rockville, Maryland), October 1996. D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Shauser, E. Santos, R. Subramonian, and T. von Eicken, “LogP: Towards a realistic model of parallel computation,” in Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), (San Diego, California), pp. 1—12, Association for Computing Machinery, May 1993. A. Gopal, I. Gopal, and S. Kutten, “Broadcast in fast networks,” in Proc. INFO- COM ’90, (San Francisco, CA), June 1990. E. Basturk and P. Stirpe, “A hybrid spanning tree for efficient topology distribution in PNNI,” Tech. Rep. Research Report RC 20922, IBM Research Division, July 1997. B. Rajagopalan, “Efficient link state routing.” NEC Technical Report, 1997. I. Cidon, I. Gopal, M. Kaplan, and S. Kutten, “A distributed control architecture of high-speed networks,” IEEE Trans. Commun., vol. 43, no. 5, pp. 1950—1960, 1995. E. C. Rosen, “Vulnerabilities of network control protocols: An example,” in SI G COMM Computer Communications Review, pp. 10—16, July 1981. (also published as RFC 789). [77] Y. Huang and P. K. McKinley, “Group leader election under link-state routing,” in Pro- ceedings of International Conference on Network Protocols, 1997., (Atlanta, Geogia), October 1997.