This is to certify that the thesis entitled DESIGN AND IMPLEMENTATION OF A RELIABLE MULTICAST PROTOCOL, presented by Robin F. Wright, has been accepted towards fulfillment of the requirements for the Master's degree in Computer Science & Engineering, Michigan State University.

ABSTRACT

DESIGN AND IMPLEMENTATION OF A RELIABLE MULTICAST PROTOCOL

By Robin F. Wright

Multicast communication, in which a single sender communicates with two or more receivers, is becoming an important part of computer networking. Multicast communication can be beneficial to many types of applications, including computer supported cooperative work, data distribution services, distributed interactive simulation, tele-gaming, and distributed parallel processing. IP multicast provides a mechanism for best-effort multicast. However, the differing requirements of applications have resulted in many reliable multicast protocols, but little consensus on a single standard. Moreover, few reliable multicast protocols are currently provided as operating system services.

In this research, we design and implement RMC, a reliable multicast protocol suited for integration in an operating system kernel. RMC takes advantage of an underlying best-effort multicast delivery system, such as IP multicast, and provides end-to-end reliable delivery of a stream of data to multiple destinations. In this thesis, we use the RMC protocol to study three key issues in reliable multicast protocols: reliability, group management, and flow control. We focus on NAK-based reliability because it scales better than ACK-based reliability, and we limit work at the sender by supporting anonymous group membership. While NAK-based protocols typically use rate-based flow control, kernel-level protocols typically use window-based flow control, due to limited buffer space. We explore the combination of rate-based and window-based flow control to allow for the use of NAK-based reliability in a kernel-level implementation.

Our contributions can be summarized in the following three areas. First, we have designed the RMC reliable multicast protocol. RMC uses negative acknowledgments for reliability and a combination of window-based and rate-based flow control. While typical window-based protocols use positive acknowledgments to slide the window forward, RMC uses purely negative information, and a timer that is based on the round-trip time to the most distant receiver, to prevent the window from sliding forward too quickly. Further, RMC uses anonymous group membership to further reduce the work that is required of the sending host. Second, we have implemented the RMC protocol in the Linux operating system kernel. RMC is implemented atop IP, enabling its use across wide area networks, and beneath the BSD socket interface, enabling applications to access it via standard system calls.
Moreover, the implementation required us to address a number of important practical problems that might not arise if we had limited the study to analysis and/or simulation; examples include buffer management, interrupt handling, and data locking. Third, we have evaluated the performance of the RMC protocol through both experimentation and simulation. We first evaluate the performance of RMC as implemented in the Linux kernel on machines in a local area network. In this environment, we are able to observe the behavior of RMC in a working environment. The experimental tests reveal that RMC performs very well in a local area network, where packet loss and retransmissions are minimal. We further evaluate the performance of RMC in a simulated network, where we can set network parameters such as network delay and loss rates to simulate wide area networks. Results of the simulation tests show that RMC is scalable up to 100 receivers in a local area network environment, where loss rates and network delay are low. Although throughputs are lower in a wide area network environment, due to higher loss rates, the RMC protocol still provides good performance and is able to accommodate a large number of NAKs, even without the use of global NAK suppression. Moreover, the efficiency that is gained by using reliable multicast over reliable unicast is very large.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
2 Background and Related Work
2.1 Applications of Multicast Communication
2.2 Multicast Communication in the Internet
2.2.1 LAN multicast
2.2.2 Multicast Routing
2.2.3 Multicast Group Membership
2.3 Reliable Multicast Protocols
2.3.1 Reliability
2.3.2 Connection Management
2.3.3 Flow control
3 RMC Protocol Description
3.1 RMC Protocol Overview
3.2 RMC Packet Format
3.3 NAK-Based Reliability
3.4 RMC Connection Management
3.4.1 Port numbers and connections
3.4.2 Establishing a connection
3.4.3 Closing a connection
3.4.4 Round-trip time estimation
3.5 RMC Flow Control
3.5.1 Window-based control
3.5.2 Rate-based control
3.5.3 Slow start and congestion avoidance
4 RMC Implementation in Linux
4.1 Linux Networking
4.1.1 BSD Sockets
4.1.2 INET Sockets
4.1.3 The IP Interface
4.1.4 Network Devices and Bottom Half Handling
4.2 The RMC Interface and Data Structures
4.2.1 The RMC Interface
4.2.2 RMC Data Structures
4.2.3 RMC Ports
4.3 RMC Protocol Implementation
4.3.1 RMC Socket Creation and Connection Initialization
4.3.2 The RMC Receiver
4.3.3 The RMC Sender
4.3.4 RMC Socket Close
5 Performance Evaluation
5.1 Experimental Study
5.1.1 Experimental Conditions
5.1.2 Experimental Results
5.2 Simulation Study
5.2.1 Simulation Environment
5.2.2 Simulation Results for "Fast" Receivers
5.2.3 Simulation Results for "Slow" Receivers
6 Conclusions and Future Work

LIST OF TABLES

3.1 RMC packet types
4.1 RMC-supported BSD functions
5.1 Test machines
5.2 Characteristic groups
5.3 Simulation tests
5.4 Simulation tests

LIST OF FIGURES

2.1 Multicasting in single-hop networks
2.2 Multicasting in multiple-hop networks
2.3 The TCP/IP network layers
3.1 Flow of data in RMC
3.2 The RMC header
3.3 NAK-based reliability with centralized recovery
3.4 Keepalive messages sent with exponential backoff [1]
3.5 The send window
3.6 The receive window
3.7 The receive window
4.1 Linux networking layers
4.2 BSD socket system calls
4.3 Sequence of system calls for a simple client and server
4.4 The net_families table
4.5 The proto_ops structure used by BSD as the interface to different address families
4.6 Linux BSD and INET socket structures
4.7 The sock data structure
4.8 A socket buffer
4.9 The lock_sock and release_sock functions
4.10 The inet_protocol structure
4.11 IP sending functions
4.12 A receiving application
4.13 A sending application
4.14 The RMC network protocol family structure
4.15 RMC receiver specific information
4.16 RMC sender specific information
4.17 RMC socket buffer information
4.18 RMC's hashing table structure
4.19 The RMC socket structures
4.20 The RMC receiver
4.21 Implementation of the receive window
4.22 The receive function
4.23 The RMC ck_wnd function
4.24 The RMC sender
4.25 Implementation of the send window
4.26 The rmc_send_skb function
4.27 The retransmission request structure
5.1 Throughput of RMC in a local environment
5.2 Total rate requests for the 10 MB disk test
5.3 Feedback for the 40 MB disk test
5.4 Average throughput with minimum and maximum throughput values
5.5 Simulation Architecture
5.6 Simulation Throughput
5.7 Results for 10 fast receivers
5.8 Results for 25 fast receivers
5.9 Results for 100 fast receivers
5.10 Results for 10 slow receivers
5.11 Results for 25 slow receivers
5.12 Results for 100 slow receivers

Chapter 1

Introduction

Computer networking has become an essential part of the computing field, both in local area and wide area environments. In local area environments, computer networks are being used to create distributed computing environments and parallel servers. In addition, as networking technology has become more advanced, wide area networks have grown both in size and utilization, to reach not only research institutions and large corporations, but also homes and small businesses.

Many of the applications that execute in local area and wide area networks involve multicast communication, in which a single source host transmits information to a limited group of receiving hosts. Classes of applications that utilize multicast communication include collaborative applications such as shared whiteboards, video conferencing, and distance education; applications for distributing data such as software and databases; distributed interactive simulation applications; and applications in the entertainment industry including video and audio broadcasts.

Currently, Ethernet multicast and IP multicast provide multicast services in local area and wide area networks, respectively. Both of these services are unreliable, which means that the data is not guaranteed to arrive safely and in order at the destinations. However, many of the applications that use multicast communication require the communication to be reliable. Because there is no current standard for reliable multicast, these applications typically manage their own reliability at user level.

In light of the growing need for reliable multicast, much work has been done on developing reliable multicast protocols. Most reliable multicast protocol implementations must address three issues: reliability, flow control, and group management.

Reliability is the basis for reliable multicast protocols, and ensures that all members of the multicast group receive complete and correct data. Usually, reliability is provided by either a sender-initiated or a receiver-initiated method. In sender-initiated, or ACK-based, protocols, each receiver in the multicast group confirms safe data arrival by sending a positive acknowledgment (ACK) for each packet that is correctly received. The sender is responsible for detecting data loss and subsequently retransmitting the lost data. Because the nature of ACK-based protocols requires a great deal of receiver feedback, ACK-based protocols suffer from the well-known ACK-implosion problem [2], in which the sender becomes overwhelmed with feedback messages as the number of receivers grows large. In receiver-initiated, or NAK-based, protocols, each receiver sends a negative acknowledgment (NAK) to the sender only when data loss is detected at that receiver. NAK-based protocols lessen the need for receiver feedback, but may still suffer from implosion in the event of a highly correlated data loss.
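To make the receiver-initiated approach concrete, the sketch below shows how a receiver might detect a gap in incoming sequence numbers and request a retransmission. This is an illustration only, not code from the RMC implementation; the function name and the reporting mechanism are hypothetical.

    #include <stdint.h>
    #include <stdio.h>

    /* Next sequence number this receiver expects (one per connection). */
    static uint32_t next_expected;

    /* Called for each arriving data packet carrying 'len' octets
     * starting at sequence number 'seq'. */
    static void on_data_packet(uint32_t seq, uint32_t len)
    {
        if (seq != next_expected) {
            /* Gap detected: ask the sender to retransmit the missing
             * octets. Unsigned arithmetic is modulo 2^32, so the
             * subtraction is correct even across wraparound. */
            printf("send NAK: seq=%u len=%u\n",
                   (unsigned)next_expected,
                   (unsigned)(seq - next_expected));
        }
        next_expected = seq + len;  /* reordering is ignored in this sketch */
    }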
Group management concerns how, and to what extent, the sender is required to maintain the state of each member in the multicast group. The extent to which a protocol maintains group membership information varies widely across different reliable multicast protocols. Under a tight group management policy, for example, the sender may maintain complete state information on each member of the multicast group, including the state of each member's receive window. In contrast, under a loose group management policy, the sender may maintain minimal information about the multicast group; for example, the sender may only maintain the number of receivers in the group. The type of reliability that the reliable multicast protocol uses often determines the type of group management that is needed. Protocols that use ACK-based reliability often require careful management of the group, while protocols that use NAK-based reliability are less reliant on the composition of the group.

The purpose of flow control in a reliable multicast protocol is to prevent the sender from causing packet loss at a given receiver by causing the receiver's buffer to overflow. Flow control for reliable multicast protocols may be rate-based or window-based. In rate-based flow control, timers are used to limit the sender to transmitting data at or below a given rate. Data loss is used as an indication of congestion, and sending rates are adjusted accordingly. In window-based flow control, the sender is limited to sending data that falls within its current send window. As receivers confirm that data has been correctly received, the send window slides forward, allowing the sender to transmit additional data.

In this thesis, we study the issues of reliability, group management, and flow control as they apply to a reliable multicast protocol that is designed to be implemented in the operating system kernel. We focus on NAK-based reliability because it scales better than ACK-based reliability, and we limit work at the sender by supporting anonymous group membership. While NAK-based protocols typically use rate-based flow control, kernel-level protocols, due to limited buffer space, typically use window-based flow control. We explore the combination of rate-based and window-based flow control to allow for the use of NAK-based reliability in a kernel-level implementation.

Thesis Statement: Using a combination of window-based and rate-based flow control, in conjunction with NAK-based recovery and anonymous group membership, it is possible to develop an efficient and scalable reliable multicast protocol that fits well into the operating system kernel.

The major contributions of this work can be summarized as follows.

1. Reliable Multicast Protocol (RMC) design. RMC is a reliable multicast protocol that uses negative acknowledgments for reliability and a combination of window-based and rate-based flow control. While typical window-based protocols use positive acknowledgments to slide the window forward, RMC uses purely negative information, and a timer that is based on the round-trip time to the most distant receiver, to prevent the window from sliding forward too quickly. Further, RMC uses anonymous group membership to further reduce the work that is required of the sending host.

2. RMC implementation in the Linux kernel. The RMC protocol has been carefully integrated in the Linux kernel, and provides an efficient and easily accessible solution to the need for reliable multicast. The implementation required us to address a number of practical problems, including buffer management, interrupt handling, and data locking.

3. Performance evaluation through both experimentation and simulation.
We first evaluate the performance of RMC as implemented in the Linux kernel on machines in a local area network. In this environment, we are able to observe the behavior of RMC in a working environment. We further evaluate the performance of RMC in a simulated network, where we can set network parameters such as network delay and loss rates to simulate wide area networks.

The remainder of this thesis is organized as follows. In Chapter 2, we present material related to this work, including background material on multicast communication and applications that use multicast, the requirements for reliable multicast protocols, and a survey of NAK-based reliable multicast protocols. We present the design of the RMC protocol in Chapter 3, and subsequently describe the Linux implementation of the RMC protocol in Chapter 4. Chapter 5 presents performance results of the RMC protocol in two settings: the Linux kernel and a simulated network environment. Finally, in Chapter 6 we discuss conclusions and possible future directions of this work.

Chapter 2

Background and Related Work

Three basic types of communication take place on a network: unicast, broadcast, and multicast. Unicast communication is a point-to-point operation in which a single source host, the sender, transmits data to a single destination host, the receiver. In contrast, broadcast communication is a point-to-multipoint operation in which a single source host sends data to all the hosts on a network. Finally, multicast communication occurs when a single source host transmits data to a limited group of receivers.

Multicasting is implemented in different ways, depending on the network architecture. In a single-hop network, such as a local area network (LAN), data can be delivered directly to destinations. If the medium is shared, as in a traditional Ethernet network, the network interfaces of receiving nodes can be programmed to accept packets containing certain multicast addresses, as shown in Figure 2.1(a). If the medium is switched, as in an Ethernet switch, the switch can be programmed to forward multicast packets to a specified set of nodes, as shown in Figure 2.1(b).

[Figure 2.1: Multicasting in single-hop networks. (a) shared medium (bus, hub); (b) switched medium.]

In a multiple-hop network, multicasting is implemented by sending data over a multicast tree that is rooted at the sender and that includes all members of the multicast group. The concept of a multicast tree is shown in Figure 2.2. Routers and switches receive multicast packets and forward or replicate them as necessary to reach the intended set of destinations. Whether the network is single-hop or multiple-hop, using multicast to communicate to a limited group of receivers saves a large amount of bandwidth compared to separately unicasting the data to each member of the group. Multicasting is also a better choice than broadcasting the data, which would unnecessarily interrupt processors that do not wish to participate in the multicast operation.

[Figure 2.2: Multicasting in multiple-hop networks. Legend: switch/router; sending host; receiving host.]

2.1 Applications of Multicast Communication

Applications of multicast communication are widespread in both local and wide area networks. In local area networks, networked collections of commodity workstations may be used as a platform for distributed parallel computing [3, 4]. Here, multicast supports many of the operations that are necessary to implement both distributed shared memory and message passing.
In distributed shared memory, for example, multicast may be used to perform memory updates and invalidations. In message passing, multicast may be used to distribute data in parallel numerical algorithms such as matrix operations.

Another application of multicast occurs in distributed operating systems [5]. In this case, multicast serves as an efficient method for communicating data or events among participating personal computers, workstations, and servers. Multicast is also useful in a LAN for communication between machines participating in distributed database management and parallel web servers.

There are several classes of wide area applications that benefit from multicast. Collaborative applications form one of these classes. In collaborative applications, humans use computers and networks as a tool to work together in groups or teams. For example, a group of people may use video conferencing, shared whiteboards, or interactive chat to share their ideas on a project [6]. An instructor interacting with a group of students through a distance education application is another example of a collaborative application [7, 8]. In collaborative applications, multicast is used to distribute and synchronize events and data among the participating members.

A second class of Internet applications that uses multicast is Distributed Interactive Simulation (DIS). Like collaborative applications, DIS also involves human interaction. However, in DIS, participating players interact in a common virtual world that is created through simulations. DIS was originally developed for use in the military to simulate battlefield scenarios for training purposes. Today, DIS continues to be used to train people to function in complex environments such as airplane cockpits and space vehicles [9, 10]. DIS is also increasingly being used in interactive gaming applications. Similar to collaborative applications, in DIS, multicast is used to keep the state of the simulation synchronized at all participants on a real-time basis.

Data dissemination applications form a third class of Internet applications that benefit from multicast. Multicast communication provides this type of application with an efficient way to distribute data to a group of receivers. For example, data dissemination applications may be used to distribute live data feeds such as stock quotes or bank transactions. These applications are also commonly used to distribute bulk data such as software and database updates.

Multicast communication also has great potential in the entertainment industry. Multicast may be used in the entertainment realm for computer versions of video and audio broadcasting as well as for services such as video on demand.

2.2 Multicast Communication in the Internet

Multicast is supported at two levels in the TCP/IP network hierarchy, shown in Figure 2.3. Multicast services within a single local area network are supported at the MAC, or hardware, layer. At the network protocol level, which sits above the MAC layer in the network protocol stack, multicast routing protocols control the forwarding of multicast packets through network interconnects such as routers. Finally, the Internet Group Management Protocol (IGMP) manages multicast group membership at the network protocol level.

[Figure 2.3: The TCP/IP network layers — the application layer (user level), TCP at the transport layer, IP at the network layer, and MAC at the hardware layer, above the physical network.]

2.2.1 LAN multicast

In local area Ethernet networks, Ethernet cards use the low-order bit of the high-order octet in the 48-bit hardware address to distinguish unicast addresses (0) from multicast addresses (1). By default, an Ethernet card filters only packets from the network that are addressed to either its built-in hardware address or to the broadcast address. However, multicast-capable Ethernet cards can be programmed to also accept packets that are addressed to specified multicast addresses. Thus, an Ethernet packet can be locally multicast by simply setting the Ethernet header destination field to a multicast address. Further, a host can join and leave a multicast group on the local network by requesting its network interface card to accept or reject packets for a specified multicast address [11].
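As a small illustration of the addressing rule just described, the check below tests the multicast bit of a 48-bit Ethernet address; the function name is ours, not taken from any particular driver.

    #include <stdint.h>

    /* The low-order bit of the high-order (first) octet of an Ethernet
     * address is 0 for unicast and 1 for multicast addresses. */
    static int mac_is_multicast(const uint8_t mac[6])
    {
        return mac[0] & 0x01;
    }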
2.2.2 Multicast Routing

The Internet Protocol (IP) supports multicast services and has reserved IP addresses in the range 224.0.0.0 through 239.255.255.255 as multicast addresses. Multicast-enabled routers use multicast routing protocols to determine how to route multicast packets from the source host to each member of the multicast group. In general, when a new multicast group is formed, multicast routing protocols construct multicast trees that connect all members of the multicast group. As multicast communication commences, routers use the trees to forward multicast packets from the source to each member of the multicast group. Depending on whether the multicast group is expected to be densely or sparsely distributed throughout a network, routing protocols may use either a dense mode or a sparse mode routing algorithm [12].

Dense mode multicast routing protocols use sender-initiated flooding techniques to propagate routing information to multicast routers. The Distance Vector Multicast Routing Protocol (DVMRP) [13], Multicast extensions to Open Shortest Path First (MOSPF) [14], and Protocol Independent Multicast-Dense Mode (PIM-DM) [15] can be classified as dense-mode protocols. Dense mode multicast routing protocols are efficient when the multicast group is widely represented throughout a network or when senders and receivers are located close to each other in the network. However, due to the periodic flooding that these protocols use to collect routing information, dense mode routing protocols generally do not work well for situations where members of the multicast group are located sparsely throughout the network, or where bandwidth is not readily available.

Sparse mode multicast routing protocols, including Core Based Trees (CBT) [16] and Protocol Independent Multicast-Sparse Mode (PIM-SM) [17], are intended for multicast groups that are sparsely distributed over a large network. These protocols use receiver-initiated techniques through which routers explicitly join multicast trees. That is, a router only becomes involved in construction of the tree at the time a host in its subnet becomes a member of the given multicast group. In contrast to dense mode protocols, where a tree is created for each (source, multicast group) pair, sparse mode multicast routing protocols construct a single tree that is shared by all members of the group, regardless of which host is the sender.
Compared to creating a tree for each (source, multicast group) pair, using a shared tree greatly saves on the amount of state information that must be stored at each router.

It is possible that not all routers in a network will be multicast-enabled. To allow multicast communication to span parts of a network that are not connected by multicast-enabled routers, multicast tunneling may be used. In multicast tunneling, multicast packets are encapsulated in unicast packets to be transported through the portion that is not multicast-enabled. The Internet Multicast BackBone (MBone) is a virtual network within the Internet that is composed of interconnected subnetworks and routers that support IP multicast traffic. It consists of islands of network that have multicast routing capabilities connected to other islands through multicast tunneling. The MBone has been used for the development of many distributed multimedia applications as well as for the delivery of live events [18].

2.2.3 Multicast Group Membership

Management of the multicast group membership within the IP layer is implemented by the Internet Group Management Protocol (IGMP) [19]. When a host wishes to join an IP multicast group, it issues an IGMP join request for the appropriate IP multicast address. Local multicast-capable routers that receive the join request set up the proper routing information and propagate the group membership information to other multicast-capable routers in the Internet. To maintain group membership information, multicast routers periodically poll the local network to determine the set of locally active multicast groups. If several polls for a given multicast address occur without response, the router assumes that no hosts on the local network are interested in the group, and it stops advertising its membership to other routers.
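For illustration, the fragment below shows how a user-level application typically asks the kernel to join an IP multicast group through the standard sockets API; the kernel then issues the IGMP join described above. This is a generic sketch with error handling omitted, not code from the thesis.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Join the multicast group 'group' (e.g. "224.1.2.3") on the default
     * interface; returns 0 on success, -1 on error. */
    static int join_group(int sock, const char *group)
    {
        struct ip_mreq mreq;

        memset(&mreq, 0, sizeof(mreq));
        mreq.imr_multiaddr.s_addr = inet_addr(group);
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        /* Triggers an IGMP membership report and programs the local
         * interface to accept the group's packets. */
        return setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                          &mreq, sizeof(mreq));
    }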
2.3 Reliable Multicast Protocols

Both hardware multicast and IP multicast provide best effort, or unreliable, data delivery. This means that they do not guarantee complete and correct delivery of the data. However, many of the applications that utilize multicast services require the data to be delivered reliably. For example, operations in a distributed operating system or in distributed parallel computing are assumed to be reliable. In addition, many of the applications that use multicast in wide area networks need, or would benefit from, a reliable multicast service. Mass distribution of software, for example, would lose many of the benefits of multicasting if the software was delivered incorrectly to a number of sites and had to be redelivered anyway. Data, such as stock prices, is of little use to stock brokers if they cannot rely on the data being correct. In addition, although multimedia often must be delivered with time constraints, a certain degree of reliability will provide end users with better quality. Reliable multicasting also makes it more straightforward to implement encryption and compression on a multimedia data stream.

In light of the growing need for reliable multicast communication, a great deal of research has been devoted to this topic in the last several years. Numerous reliable multicast protocols have been designed and analyzed, and a number have been implemented. In addition to the basic transfer of data that is provided by underlying network layers, there are three main requirements for reliable multicast protocols: reliability, flow control, and connection management. Each of these is discussed in turn.

2.3.1 Reliability

In multicast communication, reliability is the property of ensuring that each receiver in the multicast group receives correct and complete data. Traditionally, reliability is achieved by retransmitting any data packets that are lost or damaged during transmission. Depending on the network loss characteristics, retransmitted data may be unicast only to those receivers that request it, or it may be multicast to the entire group. When losses are highly correlated among the receivers, it is more efficient in terms of bandwidth to multicast the retransmissions. However, when loss is uncorrelated, it is more efficient to unicast the retransmissions. In cases where both correlated and uncorrelated losses occur, protocols may choose to use an adaptive retransmission strategy. In this method, senders collect feedback information for a period of time and then either unicast or multicast the retransmission based on the degree of loss.

ACK-Based Protocols. The responsibility of ensuring that data is reliably delivered in a multicast protocol can be placed either at the sender or at each of the receivers. Sender-initiated protocols rely on the sender to ensure that all data packets are reliably delivered to all members of the multicast group. In sender-initiated protocols, each receiver in the multicast group sends positive acknowledgments (ACKs) to the sender to confirm the safe arrival of each data packet. The sender detects data loss by setting a timer while waiting for acknowledgments from all the receivers. If the timer fires before the sender receives ACKs from all of the receivers, the sender assumes that the data has been lost and retransmits the packet to the multicast group. In this thesis, we will refer to sender-initiated protocols as ACK-based protocols.

The Xpress Transport Protocol (XTP) [20] is one protocol that implements an ACK-based reliable multicast service. In XTP, the sender periodically requests receiver status by setting a bit in the outgoing packets. Receivers respond to this request by returning an ACK to the sender. The sender aggregates ACK information from the multicast group and retransmits data to the entire multicast group.

Single Connection Emulation (SCE) [21] is another ACK-based reliable multicast protocol. SCE is implemented by adding a network layer between the IP and TCP layers. The SCE layer uses IP multicast to transmit data to the multicast group, and receivers send ACKs back to the sender as they would in a TCP connection. At the sender, the SCE layer consolidates ACKs and other feedback information and passes it up to the TCP layer. The TCP layer, in turn, interprets the feedback as it would for a unicast connection and retransmits data accordingly.

ACK-based protocols work reasonably well when the number of receivers from which the sender must process feedback is small. However, ACK-based protocols require the sender to maintain state information for each receiver, which is expensive for large groups. More importantly, the number of ACKs grows linearly with the number of receivers. As a result, the sender can be quickly overwhelmed when the number of receivers is large. This problem is commonly known as the ACK implosion problem [2].

NAK-Based Protocols. In contrast to sender-initiated reliable multicast protocols, receiver-initiated protocols transfer the responsibility of reliability to each receiver in the multicast group. That is, receivers must detect missing or damaged packets in the incoming data stream, and request retransmission of the needed data. These retransmission requests may also be referred to as negative acknowledgments (NAKs). We will refer to receiver-initiated protocols as NAK-based protocols.

Since receivers detect data loss by gaps in the sequence numbers of incoming packets, it is possible for the last packet in a transmission to be lost without being
Rather than using a log server, how- ever, in RMC NAKs are sent directly to the sender and the sender is responsible for buffering sent data. During idle periods, LBRM sends heartbeat packets at a variable rate rather than a constant one. Each sender maintains a heartbeat timer that represents the time from the last transaction to the time the next heartbeat will be sent. This timer is initialized to a minimum value after a data transmission and is doubled after each heartbeat transition until it reaches a maximum value. This method enables lost packets to be detected quickly and reduces the number of heartbeat messages compared to a fixed rate protocol. RMC also sends keepalive messages according to an exponential backoff timer during idle periods. The Reliable Adaptive Multicast Protocol (RAMP) [22] is a NAK-based reli- able multicast protocol that was designed and implemented by researchers at TASC. 16 RAMP is a transport layer protocol that is layered on top of IP multicast services. RAMP is intended to support collaborative-interactive applications, where the ap- plication typically is both a sender and a receiver of a single multicast group. In addition, this protocol was designed for an all-optical, circuit-switched, gigabit net- work developed by the ARPA-sponsored Testbed for Optical Networking (TBONE) project. In this network, packet loss mainly occurs due to buffer overflow at the receiver. As a result, packet loss is highly uncorrelated among the receivers. In light of the given network characteristics, reliability in RAMP is provided by immediate NAKs unicast from the receivers to the sender, followed by unicast retrans- missions of the requested data from the sender to the requesting receiver. Unicast retransmissions eliminate unnecessary processing of redundant packets at other re- ceivers. Further, the use of immediate rather than delayed NAKs improves receiver performance by decreasing recovery time and, as a result, giving applications earlier access to the data so that protocol buffer space may be freed for incoming data. Like RAMP, in RMC receivers send immediate NAKs to the sender. However, in RMC loss is assumed to be correlated and, consequently, retransmissions are multicast. Assuming that packets are successfully received more often than they are not, NAK-based protocols are more scalable than ACK-based protocols. However, NAK- based protocols may also suffer from an implosion of feedback at the sender in the case of large, highly correlated loss of a portion of the data stream. In addition, if there is a disruption in the connection between the sender and a subset of the multicast group, the sender might incorrectly interpret the lack of NAKs as an indication that data is being received successfully at all receivers. As a result, the sender may prematurely release buffered data. 17 Local Recovery Mechanisms. In order to reduce implosion and workload at the sender, local recovery may be used. In this method, rather than placing the entire responsibility of retransmission on the sender, some or all of the receivers are also allowed to retransmit requested data. Ideally, the retransmissions will come from a receiver that is close to the requesting receiver, so that loss recovery time will also be reduced. Local recovery may be used in both ACK- and NAK-based protocols. Depending on which receivers are allowed to perform the retransmissions, local recovery is referred to as either unstructured or structured. 
The Reliable Adaptive Multicast Protocol (RAMP) [22] is a NAK-based reliable multicast protocol that was designed and implemented by researchers at TASC. RAMP is a transport layer protocol that is layered on top of IP multicast services. RAMP is intended to support collaborative-interactive applications, where the application typically is both a sender and a receiver of a single multicast group. In addition, this protocol was designed for an all-optical, circuit-switched, gigabit network developed by the ARPA-sponsored Testbed for Optical Networking (TBONE) project. In this network, packet loss mainly occurs due to buffer overflow at the receiver. As a result, packet loss is highly uncorrelated among the receivers.

In light of the given network characteristics, reliability in RAMP is provided by immediate NAKs unicast from the receivers to the sender, followed by unicast retransmissions of the requested data from the sender to the requesting receiver. Unicast retransmissions eliminate unnecessary processing of redundant packets at other receivers. Further, the use of immediate rather than delayed NAKs improves receiver performance by decreasing recovery time and, as a result, giving applications earlier access to the data so that protocol buffer space may be freed for incoming data. Like RAMP, in RMC receivers send immediate NAKs to the sender. However, in RMC loss is assumed to be correlated and, consequently, retransmissions are multicast.

Assuming that packets are successfully received more often than they are not, NAK-based protocols are more scalable than ACK-based protocols. However, NAK-based protocols may also suffer from an implosion of feedback at the sender in the case of large, highly correlated loss of a portion of the data stream. In addition, if there is a disruption in the connection between the sender and a subset of the multicast group, the sender might incorrectly interpret the lack of NAKs as an indication that data is being received successfully at all receivers. As a result, the sender may prematurely release buffered data.

Local Recovery Mechanisms. In order to reduce implosion and workload at the sender, local recovery may be used. In this method, rather than placing the entire responsibility of retransmission on the sender, some or all of the receivers are also allowed to retransmit requested data. Ideally, the retransmissions will come from a receiver that is close to the requesting receiver, so that loss recovery time will also be reduced. Local recovery may be used in both ACK- and NAK-based protocols. Depending on which receivers are allowed to perform the retransmissions, local recovery is referred to as either unstructured or structured.

Unstructured local recovery is used only in NAK-based protocols, and requires NAKs to be multicast to the entire receiver group. Typically, any receiver that overhears a NAK and is able to supply the missing data can retransmit the data to the multicast group. Since both NAKs and retransmissions are multicast, and only one NAK and one retransmission are necessary for each data loss, NAKs and retransmissions from multiple nodes are redundant and cause unnecessary network traffic. To address this problem, local recovery is normally used in combination with NAK suppression. In NAK suppression, when a receiver detects data loss, instead of immediately sending the NAK, it waits a random period of time and then multicasts the NAK to the multicast group. If a receiver "overhears" a NAK for the same data while it is waiting to send its own NAK, it cancels its NAK and allows the other NAK to report the loss. Similarly, to restrict the number of responses to a NAK, each receiver waits a random amount of time before retransmitting data, and if it hears another retransmission before doing so, it cancels the retransmission.

Unstructured local recovery with NAK suppression is used by the Scalable Reliable Multicast (SRM) [6] protocol. Rather than multicasting NAKs and retransmissions to the entire group, SRM defines local loss neighborhoods based on the number of hops from the requesting host. When a receiver detects a loss, it multicasts the NAK to its local neighborhood by specifying a limited number of routing hops, or time-to-live (TTL). Any receiver that hears the NAK multicasts the retransmission with the same TTL. In this way, a receiver that overhears a NAK and cancels its own will also overhear the retransmission. Receivers suppress both NAKs and retransmissions based on the distance from the host that originally sent the data or NAK.

Unstructured local recovery with NAK suppression reduces the feedback implosion and the workload at the sender. However, multicasting NAKs and retransmissions not only introduces extra network traffic, it also interrupts receivers with unnecessary data. Structured local recovery, which may be used in both NAK- and ACK-based protocols, attempts to reduce unnecessary network traffic by directing feedback from a receiver to a specific destination. In structured local recovery, receivers are arranged into a logical hierarchical structure. Within this structure, internal nodes are responsible for receiving feedback from, and retransmitting data to, their children in the hierarchical structure. Feedback is unicast to a receiver's parent, and retransmissions may be unicast to each requesting receiver or locally multicast to a host's children. The number of children for any given host is assumed to be relatively small, so that feedback implosion does not occur.

The Reliable Multicast Transport Protocol developed at Bell Laboratories (BL RMTP) [23] uses structured local recovery in an ACK-based protocol. In this protocol, data is first multicast to the multicast group using a best-effort multicast service such as IP multicast. Each receiver is assigned a designated receiver to which it sends periodic ACKs. Designated receivers process the ACKs to determine which packets need to be retransmitted, and either unicast or multicast the requested data, depending on the number of receivers that report the loss.
The Tree-based Multicast Transport Protocol (TMTP) [24] is another protocol that uses structured local recovery. In TMTP, participants are organized into domains, within which a domain manager is responsible for error recovery and local retransmissions. TMTP uses periodic ACKs to provide reliability between the set of domain managers and the sender, as well as between domain members and their domain manager. Because ACKs are sent only periodically, domain members may locally multicast NAKs to the domain manager for faster error recovery.

While structured local recovery reduces the number of unnecessary NAKs and retransmissions compared to unstructured local recovery, structured local recovery requires extra work to construct and maintain the logical structure on the receiver group. In addition, local recovery requires receivers to buffer received data and perform extra work to process NAKs and retransmissions. This may result in unfair treatment of the receivers, especially if some of the receivers are relatively slow or have limited memory.

RMC uses neither unstructured nor structured recovery. Instead, recovery is centralized at the sender. Although this results in increased work at the sender, it also allows RMC to treat all receivers equally. In addition, RMC completely separates membership issues from the operation of the protocol. Thus, there is no need to adapt the protocol operation every time a host joins or leaves the group.

2.3.2 Connection Management

The second major requirement for reliable multicast protocols is connection management. Connection management includes maintaining group membership information as well as managing connection setup and tear down. Connection management is an issue for both reliable and best-effort multicast.

One issue in connection management concerns the degree to which the sender must maintain explicit knowledge of the members of the multicast group. ACK-based protocols rely on the sender to keep track of not only which hosts are members of the group, but also the state of each member. NAK-based protocols can provide reliability while maintaining minimal information about the receiver group, but some NAK-based protocols also require the sender to manage group membership in more detail for other purposes. Doing so is helpful to the sender, for example, when it is deciding whether to unicast or multicast retransmissions. Another reason that the sender may maintain more detailed information on the receiver group is to protect itself against false retransmission requests from hosts that do not have permission to be in the multicast group. Finally, some applications that operate on top of a multicast protocol may need to have access to group membership information. However, while group membership information is useful, and often necessary, for correct protocol operation, it also consumes a large amount of sender resources when the multicast group is large. Not only must the sender maintain the information in memory; it must also use valuable processor time to process and update membership status. In order to reduce the responsibilities at the sender, RMC supports anonymous group membership. Explicit group membership may be added in the application if necessary.

A second issue in connection management is connection setup and tear down. Here, the protocol must specify how and when a receiver may join and leave a multicast group at the protocol level, as well as how and when the sender may open and close a multicast connection.
Some protocols may specify that receivers can only join at the beginning of a transmission and leave at the end of a transmission, while others may allow late joins and early leaves. For example, in protocols that are designed for bulk data distribution, it doesn't make sense for receivers to join mid-stream, or to leave the connection early. In addition, if the protocol is required to maintain a structure on the receiver group, as is the case for TMTP [24] and BL RMTP [23], the protocol can avoid adjusting the structure by disallowing late joins and early leaves. However, for protocols that are designed for ongoing interaction, as is the case in RAMP [22] and SRM [6], it is useful to allow new users to join late or leave early. RMC can be used for both interactive applications and for bulk data distribution. Because RMC supports anonymous group membership, it can easily support late joins and early leaves. Applications may disallow late joins and early leaves if necessary.

2.3.3 Flow control

The third major requirement of reliable multicast protocols is flow control. The goal of flow control is to prevent the sender from overrunning a receiver's buffer space. Packets that are sent when a receiver's buffer is full must be dropped and, consequently, retransmitted. Thus, flow control is necessary for efficient, reliable data transfer. Flow control is especially important when the transport protocol is implemented in the kernel, where buffer space is limited.

The two most commonly used methods for flow control in reliable multicast protocols are rate-based and window-based control [25]. In rate-based flow control, timers are used to limit the sender to transmitting data at or below a given rate. Usually, the amount of data loss that is reported by the receivers is used as an indication of congestion, and sending rates are adjusted accordingly. In window-based flow control, as used by TCP, a sender is limited to sending only those packets that fall within its send window. The send window advances as the sender receives sufficient acknowledgments from receivers. In addition, in order for the protocol to make the most efficient use of the network, the size of the window is allowed to increase linearly during periods of successful transmission without loss. When transmission is unsuccessful, as signaled by a NAK or missing ACKs, the window size is reduced to prevent further loss of packets.

For reliable multicast, the choice of flow control is influenced by whether the protocol is ACK- or NAK-based. In ACK-based protocols, the sender receives feedback from the multicast group on a regular basis, so it is fairly straightforward to use window-based flow control. Further, the sender can monitor round-trip times to estimate the amount of congestion in the network and adjust its window size appropriately. The XTP, SCE, and BL RMTP protocols all use window-based flow control.

TMTP uses a combination of rate-based and window-based flow control. Windows are maintained both at the sender and at each of the receivers. Domain managers track the windows of each of the receivers in their domain to determine the amount of data that may be sent at a given time. Rate-based control is used to control the transmission rate of the data within the current window. Adding rate-based control is intended to prevent congestion in the network due to bursty traffic [24]. Like TMTP, RMC uses a combination of window-based and rate-based flow control. However, TMTP uses ACKs to relay information about the state of the windows, while RMC is a purely NAK-based protocol.
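Before turning to rate-based control, the window adjustment policy described above (linear growth during loss-free periods, reduction when loss is signaled) can be sketched as follows. The constants are illustrative, not taken from any of the protocols discussed, and halving on loss is one common choice rather than a requirement.

    enum { WND_MIN = 1, WND_MAX = 128 };   /* window size in packets */

    static unsigned wnd = 4;

    /* Called after an interval of successful transmission with no loss. */
    static void on_no_loss_interval(void)
    {
        if (wnd < WND_MAX)
            wnd += 1;                      /* linear increase */
    }

    /* Called when loss is signaled by a NAK (or a missing ACK). */
    static void on_loss_signal(void)
    {
        wnd /= 2;                          /* back off multiplicatively */
        if (wnd < WND_MIN)
            wnd = WND_MIN;
    }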
In NAK-based protocols, senders receive feedback only when negative events occur. The sender does not know the state of the receivers' buffers and cannot use this information to maintain a send window. Thus, many NAK-based protocols use rate-based flow control. In SRM, for example, a single transmission rate is specified to control transmission of both original and retransmitted packets. Usually this rate is based on the frequency of incoming NAKs.

RAMP also uses rate-based flow control. To determine the appropriate sending rate, RAMP records the number of NAKs received and the number of packets sent in defined intervals of the transmission. The delay between the transmission of individual packets is based on the ratio of the NAKs received to the number of packets sent during a given interval. To encourage the sender to throttle the sending rate before a receiver's buffer is actually full, the receivers intentionally drop segments when their buffers become 80% full. This drop is reported to the sender in the form of a NAK, causing the sender to slow its transmission rate.

Researchers at TASC have also proposed a more proactive flow control method for the RAMP protocol [22]. In this new approach, as a receiver's buffer reaches a critical occupancy, the receiver estimates a reduced transmission rate and rate duration required for recovery. This information is sent to the sender in the form of a throttle request. A sender satisfies throttle requests by falling back to the requested transmission rate for the given period, and then resuming normal transmission. This approach is better than the previous method in that it uses explicit receiver state to set the transmission rate, rather than relying on implicit NAK counts. In addition, the throttle messages may be treated anonymously, freeing the sender from maintaining the state of each receiver in the group. RMC uses a method similar to that proposed by RAMP to control the transmission rate. Rather than calculating a reduced transmission rate and rate duration, in RMC receivers are allowed to send one rate request to the sender per round-trip time for as long as their buffers remain past the given threshold. The sender responds to each rate request by further slowing or continuing to halt the forward transmission.
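As a rough illustration of this NAK-ratio scheme, the function below computes an inter-packet delay that grows with the fraction of packets NAKed in the previous interval. The base delay and scaling factor are invented for the example; RAMP's actual calculation is not given here.

    /* Inter-packet delay (microseconds) for the next interval, derived
     * from feedback gathered in the previous one. */
    static double inter_packet_delay_us(unsigned naks, unsigned packets_sent)
    {
        const double base_us = 100.0;    /* pacing delay under no loss */
        double ratio;

        if (packets_sent == 0)
            return base_us;
        ratio = (double)naks / (double)packets_sent;
        return base_us * (1.0 + 10.0 * ratio);  /* 10% NAKs => 2x delay */
    }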
However, if the protocol is implemented in the kernel, buffer space is limited, and another approach must be taken to prevent deadlock. One approach that many NAK-based protocols take is to require the receivers to send periodic ACKS, enabling the sender to safely release buffer space. This technique is used by both TMTP [24] and LBRM [1]. RAMP [22] takes a slightly different approach by requiring receivers to send an ACK only when the connection is closing. Finally, other implementations may require the application to buffer the data. Application buffering is appropriate for applications such as shared Whiteboards, where data is maintained within the application. Because RMC uses a purely NAK-based approach and is designed to be imple- mented in the kernel, there is a possibility that the protocol may release data pre- maturely. RMC approaches this problem by following a conservative approach to releasing data. In RMC, senders are required to buffer data for a minimum of 10 round-trip times. If data is released too early, both the sending and the receiving applications are informed of the retransmission error. At this point, the applications may take appropriate actions to retrieve or recreate the lost data. 25 Chapter 3 RMC Protocol Description The RMC reliable multicast protocol is designed to be scalable and to fit well into an operating system kernel. To provide scalability, RMC limits work of the sender whenever possible. RMC shifts the responsibility of reliability to the receivers by using NAK-based reliability. Further, RMC supports anonymous group membership so that the sender is required to maintain minimal information about the receiver group. Pure NAK-based protocols typically use rate-based flow control, which requires an infinite buffer. However, kernel-level protocols require a limited buffer and typically used window-based flow control. To support a limited buffer with NAK-based reliability, RMC uses a combination of window-based and rate-based flow control. In this chapter, we first give an overview of the RMC protocol and the RMC packet format. We then describe the strategies that RMC uses to implement reliability, connection management, and flow control. 3.1 RMC Protocol Overview Multicast communication by means of the RMC protocol involves three key entities: the sending and receiving applications, the RMC protocol itself, and an underlying 26 best-effort multicast service, such as IP multicast. Figure 3.1 depicts the data flow among these entities. The application process at the sending host passes a data stream to the RMC protocol for transmission. The RMC protocol fragments this data stream into a sequence of data packets, each of which is assigned a sequence number and prefixed with an RMC header. The data packets are then transmitted using the best-effort network multicast service, which attempts to deliver a copy of each packet to each of the hosts in the multicast group. As the data packets arrive at the receiving hosts, the RMC protocol checks the packets for correctness and reassembles the individual packets into a data stream that is identical to the data stream that was sent from the source application. Finally, the RMC protocol at each of the hosts delivers the reassembled data stream to the receiving application at that host. Reliability. The responsibility of ensuring reliability in RMC is placed at each of the receivers. To detect data loss, receivers scan the sequence numbers of incoming data packets for possible gaps. 
When a receiver detects a loss, it uses a negative acknowledgment (NAK) to request retransmission of the lost data. RMC uses centralized recovery, which means that NAKs are sent directly to the sender, and the sender is solely responsible for retransmitting lost data.

Figure 3.1: Flow of data in RMC.

A common problem with pure NAK-based error recovery is the possible loss of the last packet in a transmission. Because receivers detect loss by finding gaps in the sequence numbers of incoming packets, if the last packet is lost or the next packet transmission has an unbounded delay, a receiver will not detect the loss in a timely manner. To allow the receivers to quickly detect data loss that may have occurred at the end of the current data stream, the sender transmits keepalive messages to the receiver group during times when the sender is otherwise idle. Keepalive messages are sent according to an exponential backoff algorithm: the sender transmits a keepalive message immediately following a transmission, enabling the receivers to quickly detect a loss, but decreases the frequency of keepalive messages over time so as to reduce network traffic.

Because NAKs may also be lost during transmission, receivers must be prepared to retransmit NAKs. Receivers should be careful, however, not to retransmit NAKs before the sender has had sufficient time to retransmit the data. In RMC, local NAK suppression is used to determine when to retransmit NAKs. In this approach, pending NAKs are retransmitted at suppression intervals, which are set to a multiple of the round-trip time from the receiver to the sender.

Connection Management. To reduce the workload at the sender, group management is very basic in the RMC protocol. Receivers send join requests when they begin to listen to the multicast connection, and leave messages when they stop listening to the connection. From these join and leave messages, the sender maintains a count of the number of receivers, which it uses to signal the start and end of transmission. In addition to maintaining a receiver count, the sender also maintains an estimate of the round-trip time to the most distant receiver. An initial estimate is computed from the join requests, and the estimate is kept up to date by NAKs and throttle requests. This round-trip time estimate is used throughout the protocol, for example, to determine when to release buffer space.

Flow Control. RMC uses a combination of window-based and rate-based methods for flow control. Like traditional window-based protocols, such as TCP, in RMC the sender maintains a send window that defines a window of bytes that may currently be sent. The send window fills as the sending application passes data to the protocol. Each data byte must remain in the window for a minimum retention time after it is transmitted. The send window may advance past each byte as its retention time expires. Similarly, each of the receivers maintains a receive window that defines the range of data that the receiver is currently able to receive. The receive window fills as data arrives from the network and advances as the receiving application consumes the received data. In the rate-based part of flow control, the sender transmits packets within the send window at a specified transmission rate. The transmission rate is limited by a maximum rate, which is defined by the speed of the underlying network, and is influenced by the frequency of incoming NAKs.
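The following sketch illustrates how a sender can combine the two mechanisms just described: a byte may be transmitted only if it lies inside the send window, and only as fast as the current rate allows. This is an illustrative model (a simple token-bucket pacer), not the thesis implementation; all names are hypothetical.

#include <stdint.h>
#include <stdbool.h>

struct rmc_send_state {
    uint32_t snd_wnd;        /* first sequence number in the send window */
    uint32_t snd_wnd_size;   /* window size in bytes */
    uint32_t snd_nxt;        /* next sequence number to transmit */
    double   rate_bps;       /* current transmission rate (bytes/sec) */
    double   credit;         /* bytes the pacer currently allows */
};

/* Accumulate sending credit as time passes. */
void pace_tick(struct rmc_send_state *s, double elapsed_sec)
{
    s->credit += s->rate_bps * elapsed_sec;
}

/* May we transmit 'len' bytes starting at snd_nxt right now?
 * Both the window rule and the rate rule must permit it. */
bool may_send(const struct rmc_send_state *s, uint32_t len)
{
    bool in_window =
        (uint32_t)(s->snd_nxt + len - s->snd_wnd) <= s->snd_wnd_size;
    return in_window && s->credit >= (double)len;
}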
The responsibility for preventing buffer overflow is located primarily at each of the receivers. In order to monitor the data flow, each of the receivers divides its receive window into three regions: the safe region, the warning region, and the critical region. Within the safe region, the receiver assumes there is no danger of overflow, and takes no flow control action. If the window fills into the warning region, however, the receiver sends throttle messages that request the sender to slow its sending rate. Finally, if a receiver's window fills into the critical region, the receiver sends urgent throttle messages to stop forward transmission by the sender. In response to throttle messages and NAKs, the sender goes through slow start and congestion avoidance phases, which are similar to those used by TCP. In response to urgent throttle messages, the sender stops forward transmission and sends a keepalive message, followed by a period of slow start.

3.2 RMC Packet Format

RMC segments are prefixed with a 20-byte header before they are passed to the multicast service for transmission. The RMC header, shown in Figure 3.2, includes information necessary for data delivery, as well as information used to manage the RMC connection. The reader will recognize similarities and differences with the header used by TCP [27].

Figure 3.2: The RMC header (source port, destination port, sequence number, rate advertisement, length, checksum, tries, type, and the URG and FIN flags).

The fields in the header are as follows:

Source Port. The 16-bit source port field is used to identify the sending process on the source host.

Destination Port. The 16-bit destination port field is used to identify the receiving process on the destination hosts.

Sequence Number. Like TCP, every octet of data that is sent over an RMC connection is assigned a sequence number. Sequence numbers give the sender and the receivers a common way to refer to portions of the data stream. For example, receivers use sequence numbers to reassemble incoming data packets into the original data stream and to detect missing portions of the data stream. At the sender, sequence numbers index buffered data so that it can be easily identified for retransmission. The sequence number field in the RMC header is a 32-bit unsigned integer that contains the sequence number of the first octet of data in the packet. Since sequence numbers are stored as 32-bit unsigned integers, the sequence number space is finite, ranging from 0 to 2^32 - 1. For this reason, all arithmetic that deals with sequence numbers is done modulo 2^32. The initial sequence number for a connection is a random number chosen by the sender when the connection is opened. Each octet from the data stream is labeled in sequence from this point.

Rate Advertisement. The rate advertisement is a 32-bit unsigned integer that is used for flow control purposes. In packets from the sender, this field is used to inform the receivers of the current transmission rate. Receivers use this field in feedback messages to request a lower sending rate from the sender.

Length. The length field is a 32-bit integer that gives the number of octets of data contained in each data packet. This field is also used in negative acknowledgments to identify the length of the requested data retransmission.

Checksum. Similar to the TCP header checksum, the checksum field is the 16-bit one's complement of the one's complement sum of all 16-bit words in the header and text.
If a segment contains an odd number of data octets, the last octet is padded on the right with zeros to form a 16-bit word for checksum purposes. The padding is not transmitted as a part of the segment. The checksum field is set to all zeroes while the checksum is being computed.

Tries. The 8-bit tries field indicates the number of times that this packet has been transmitted. The main purpose of this field is to distinguish original transmissions from retransmissions. Similar to Karn's algorithm [28], round-trip time estimates are based only on original transmissions.

Type. There are presently nine RMC packet types. These packet types and their uses are summarized in Table 3.1. DATA and KEEPALIVE packets are always multicast by the sender to the entire multicast group. Feedback messages, including NAK, JOIN, LEAVE, and CONTROL packets, are always unicast by the requesting receiver to the sender. NAKERR, JOINACCEPT, and LEAVEOK packets are unicast by the sender to the receiver that made the corresponding request.

Urgent. The urgent field is a 1-bit flag that is set to true to distinguish a throttle request as urgent.

Fin. The fin field is also a 1-bit flag that is set by the sender in DATA and KEEPALIVE packets when the sender has completed transmission for the current connection.

    DATA        Carries application data; an original transmission or a retransmission.
    KEEPALIVE   Multicast by the sender during idle time to allow loss detection.
    NAK         Sent by a receiver to the sender to request retransmission of lost data.
    JOIN        Sent by a receiver to join the multicast connection.
    JOINACCEPT  Sent by the sender to confirm that a join request has been received.
    LEAVE       Sent by a receiver to leave the multicast connection.
    LEAVEOK     Sent by the sender to confirm that a leave request has been received.
    CONTROL     Sent by a receiver to request a lower transmission rate (throttle request).
    NAKERR      Sent by the sender to a receiver whose requested data could not be retransmitted.

Table 3.1: RMC packet types.

3.3 NAK-Based Reliability

RMC uses a NAK-based, centralized recovery strategy to implement reliability. As each receiver reassembles the data stream, it detects missing segments by monitoring the sequence numbers of incoming data packets. When a gap occurs between the sequence numbers of two incoming packets, the receiver knows that one or more data packets have been lost or delivered out of order. The receiver immediately sends a NAK for the missing data directly to the sender, provided that a NAK for this range of data has not previously been sent. The sender buffers previously sent data and responds to NAKs by retransmitting the requested data. Out-of-order data that is received following a data loss cannot be incorporated into the data stream until the lost data is recovered. This out-of-order data does not cause redundant NAKs to be generated, but is stored in an out-of-order queue for integration after the missing data arrives. A sketch of this gap-detection logic follows Figure 3.3.

An example of RMC's basic recovery process is shown in Figure 3.3. In the figure, the receiver successfully receives packets 1 and 2. Packet 3 is lost during transmission, and packet 4 is received successfully. On the arrival of packet 4, the receiver detects a gap in the incoming data stream at packet 3, and immediately sends a NAK for packet 3. While the receiver is waiting for packet 3, packets 5 and 6 arrive and are buffered with packet 4 in an out-of-order queue. When packet 3 arrives, packets 3, 4, 5, and 6 are added to the end of the data stream and made available to the application.

Figure 3.3: NAK-based reliability with centralized recovery.
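The sketch below shows the gap-detection logic, assuming the modulo-2^32 sequence arithmetic described in Section 3.2. The helper seq_before uses the usual serial-number comparison trick; all names are illustrative rather than taken from the implementation.

#include <stdint.h>
#include <stdbool.h>

/* True if a precedes b in the circular sequence space, mod 2^32. */
static bool seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

struct rmc_rcv;   /* receiver state, opaque in this sketch */
void send_nak(struct rmc_rcv *r, uint32_t seq, uint32_t len);
void queue_out_of_order(struct rmc_rcv *r, uint32_t seq, uint32_t len);
void append_to_stream(struct rmc_rcv *r, uint32_t seq, uint32_t len);

/* rcv_nxt is the next expected sequence number. */
void on_data(struct rmc_rcv *r, uint32_t *rcv_nxt, uint32_t seq, uint32_t len)
{
    if (seq_before(*rcv_nxt, seq)) {
        /* Gap: [rcv_nxt, seq) is missing. NAK it (once per range) and
         * hold the out-of-order segment for later integration. */
        send_nak(r, *rcv_nxt, seq - *rcv_nxt);
        queue_out_of_order(r, seq, len);
    } else if (seq == *rcv_nxt) {
        append_to_stream(r, seq, len);
        *rcv_nxt += len;
        /* Out-of-order segments now contiguous with rcv_nxt would be
         * integrated here as well. */
    }
    /* A segment entirely before rcv_nxt is a duplicate and is ignored. */
}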
RMC uses centralized recovery for data retransmissions. This means that all NAKs are sent directly to the sender, and the sender is solely responsible for retransmitting data to the receivers. Although centralized recovery places the burden of retransmission entirely on the sender, it is less complicated than local recovery. First, centralized recovery does not require any of the receivers to perform extra work or to buffer the received data, as does local recovery. Second, centralized recovery does not require a structure to be imposed and maintained on the receiver group. This lack of structure requires less work by the sender than does local recovery, especially when the receiver group membership changes frequently.

As discussed earlier, an important issue in multicast reliability is whether retransmissions are unicast to each requesting receiver or multicast to the entire group. In the current version of RMC, all data retransmissions are multicast. Depending on the loss characteristics of the underlying network and the size of the multicast group, however, it may be beneficial to adopt adaptive retransmission. Under adaptive retransmission, after receiving a NAK for a given segment of data, rather than immediately resending the requested data, the sender would collect NAKs for the same data for a given amount of time. After that period of time, if the number of NAKs exceeded some threshold, the sender would multicast the retransmission. Otherwise, the sender would unicast the retransmission to each of the requesting receivers.

Another issue in NAK-based protocols is the possibility of last packet loss. Because receivers detect loss by finding gaps in the sequence numbers of incoming packets, if the last packet is lost, a receiver will never detect the loss. To allow the receivers to quickly detect data loss that may have occurred at the end of the data stream, RMC sends keepalive messages to the receiver group during times when the sender is idle. In keepalive messages, the sequence number field of the header is set to the next unused sequence number at the sender. Receivers handle keepalive messages by checking the sequence number from the keepalive message against the next expected sequence number. If there is a gap, the receiver knows that data has been lost, and sends a NAK for the lost data.

In RMC, keepalive messages are sent according to an exponential backoff timer, as in LBRM [1]. An example of keepalive with exponential backoff is shown in Figure 3.4. The sender maintains a keepalive interval k, which specifies the time between when the previous packet was sent and the time the next keepalive message should be sent. Each time a data packet is sent, k is reset to a minimum value k_min, which is defined as one round-trip time. Each time a keepalive message is sent, the value of k is doubled, up to a maximum value k_max, which is currently defined as five round-trip times. The idea of using exponential backoff is to cluster the keepalive messages following a data transmission, enabling receivers to quickly detect isolated packet losses. On the other hand, keepalive messages are sent less frequently during long idle periods, so as to reduce network traffic.

Figure 3.4: Keepalive messages sent with exponential backoff [1].
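A minimal sketch of this backoff follows: k resets to k_min (one RTT) after each data packet and doubles after each keepalive, capped at k_max (five RTTs). The timer API and names are illustrative, not the kernel implementation.

struct rmc_snd;   /* sender state, opaque in this sketch */
void send_keepalive(struct rmc_snd *s);
void schedule_timer(struct rmc_snd *s, unsigned long delay_ms);

struct ka_state {
    unsigned long rtt_ms;   /* estimated RTT to the most distant receiver */
    unsigned long k_ms;     /* current keepalive interval */
};

/* Called whenever a data packet is transmitted. */
void on_data_sent(struct rmc_snd *s, struct ka_state *ka)
{
    ka->k_ms = ka->rtt_ms;              /* k_min = 1 RTT */
    schedule_timer(s, ka->k_ms);
}

/* Called when the keepalive timer expires with the sender still idle. */
void on_ka_timer(struct rmc_snd *s, struct ka_state *ka)
{
    send_keepalive(s);
    ka->k_ms *= 2;                      /* exponential backoff */
    if (ka->k_ms > 5 * ka->rtt_ms)      /* k_max = 5 RTTs */
        ka->k_ms = 5 * ka->rtt_ms;
    schedule_timer(s, ka->k_ms);
}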
Because the underlying data delivery service is unreliable, receivers must be prepared to resend NAKs in the event that they are lost. Ideally, only one NAK should arrive at the sender for each segment that is lost at each receiver. Thus, receivers must be careful not to resend a NAK before the original NAK has had ample time to be received and processed by the sender and the retransmitted data has had time to traverse the network. RMC uses local NAK suppression to control NAK retransmissions. In local NAK suppression, after a NAK is sent, it is placed on a list of pending NAKs. Each pending NAK is then resent at suppression intervals, until the requested data is delivered. The suppression interval defines an ample amount of time for a NAK to be processed and the data to be resent under normal circumstances. In the current version of RMC, the suppression interval is defined as four round-trip times to the sender.
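The sketch below shows how the pending-NAK list might be scanned: a NAK is resent only after a suppression interval (four RTTs here) has passed since it was last transmitted. The list type and helper names are illustrative, not the thesis code.

struct pending_nak {
    unsigned int seq;              /* first missing octet */
    unsigned int len;              /* length of the missing range */
    unsigned long last_sent_ms;    /* when this NAK was last transmitted */
    struct pending_nak *next;
};

void transmit_nak(unsigned int seq, unsigned int len);

/* Walk the pending list from a periodic timer at the receiver and
 * resend any NAK whose suppression interval has expired. */
void resend_pending_naks(struct pending_nak *list,
                         unsigned long now_ms, unsigned long rtt_ms)
{
    const unsigned long suppress_ms = 4 * rtt_ms;
    struct pending_nak *p;

    for (p = list; p != NULL; p = p->next) {
        if (now_ms - p->last_sent_ms >= suppress_ms) {
            transmit_nak(p->seq, p->len);
            p->last_sent_ms = now_ms;
        }
    }
}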
3.4 RMC Connection Management

In order to keep the workload at the sender small, RMC requires very little of the sender in terms of group membership and connection management. RMC supports anonymous group membership and allows receivers to join after transmission has started and to leave before the transmission completes. A late joiner may not request retransmission of data that was sent before the first data packet that it received after joining the group. Receivers are required to send JOIN messages when joining a multicast connection and LEAVE messages when leaving the connection. The sender uses these requests to maintain a count of the receivers in the group, but maintains no information about individual receivers. This type of membership frees the sender from maintaining and ensuring the correctness of per-receiver state. If needed, explicit membership may be maintained at a higher level by the user application.

3.4.1 Port numbers and connections

Like UDP and TCP, RMC uses port numbers to identify the sending and receiving processes on the participating machines. Sending and receiving processes must bind to a local port before any data communication is allowed to take place. Conceptually, an RMC connection is defined by a pair of endpoints: one for the sender and one for the multicast group. An endpoint is identified by an (address, port) pair, where the address is the identification that the lower network layer assigns the machine or the multicast group, and the port is an RMC port on that host. The sender sends data to the multicast receiver endpoint, and receivers send feedback to the sender endpoint. The multicast receiver endpoint may be further defined by the set of hosts that have joined the multicast group. This means that all receivers must bind to a common port.

3.4.2 Establishing a connection

Because RMC supports anonymous group membership, establishing a connection at the sending host is quite simple. The only requirements are that the sender binds to a local port and establishes a destination endpoint consisting of the multicast address and the port number. Receivers are required to perform two tasks in order to establish a connection. First, the receiver must inform the supporting network layer that it wishes to join the multicast group. Second, the receiver must send a JOIN message to the sender in response to the first data packet that is received from the data stream. Although the sender keeps no specific information on each of the receivers, the join request serves two purposes. The primary use of the join request is to let the sender keep an estimate of the number of receivers that are listening to the transmission. In addition, the sender uses the join request to calculate a round-trip time for the new receiver, which is used to update the round-trip time value used by the sender to manage data release and keepalive timers. To confirm that the join request was received, the sender unicasts a JOINACCEPT packet back to the joining receiver. The receiver uses this response to estimate the round-trip time to the sender.

3.4.3 Closing a connection

In order to close a connection, a receiver is again required to complete two tasks. First, the receiver informs the supporting network layer that it wishes to leave the multicast group. This allows the supporting service to stop delivering multicast packets for that multicast group to the receiving host. Second, a receiver informs the sender that it is leaving the RMC connection by sending a LEAVE message to the sender. This enables the sender to adjust its count of the listening receivers. The sender responds to LEAVE messages by unicasting a LEAVEOK packet back to the requesting receiver. At this point the receiver is free to close its connection.

Closing a connection on the sender endpoint is more complicated, because the sender must be sure that all sent data has been reliably delivered to all of the receivers before it closes the connection. First, the sender must ensure that any data that was previously queued to be sent is pushed onto the network. Next, it must give receivers ample time to request retransmissions for any lost data. The algorithm for determining this time is based on the algorithm that determines when data packets may be released from the send buffer. That is, the connection may be closed after the last data packet has been released from the send buffer.

3.4.4 Round-trip time estimation

Both the sender and the receivers make use of round-trip time information to control certain aspects of the protocol behavior. The receivers use the round-trip time information mainly for suppressing feedback messages. Receivers obtain an initial estimate of the round-trip time by sending a join request when they receive the first data packet. In addition, each receiver updates the round-trip time each time a data packet arrives in response to a NAK. Although the receiver cannot be sure that its own NAK triggered the retransmission, the resulting estimate is good enough for the intended purposes.

The sender uses an estimate of the round-trip time to set the congestion window timer and to determine how long to buffer data. The sender calculates an initial estimate of the round-trip time from join requests, and the estimate is kept up to date by NAKs and throttle requests.
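The sketch below illustrates one way the sender-side estimate could be maintained. The thesis states only that the estimate targets the most distant receiver and is refreshed by JOIN requests, NAKs, and throttle requests; the smoothing policy below (keep the maximum, decay slowly toward recent samples) is an assumption for illustration, not the documented algorithm.

struct rtt_est {
    unsigned long rtt_ms;   /* current estimate: most distant receiver */
};

/* Called with a new sample measured from a JOIN, NAK, or throttle
 * exchange (original transmissions only, in the spirit of Karn's
 * algorithm). */
void rtt_sample(struct rtt_est *e, unsigned long sample_ms)
{
    if (sample_ms > e->rtt_ms) {
        e->rtt_ms = sample_ms;   /* a more distant receiver appeared */
    } else {
        /* Decay toward recent samples so a departed distant receiver
         * does not pin the estimate forever. */
        e->rtt_ms = (7 * e->rtt_ms + sample_ms) / 8;
    }
}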
3.5 RMC Flow Control

As previously discussed, RMC uses a combination of window-based and rate-based flow control. The window-based portion of the flow control manages the buffer space. The rate-based portion, on the other hand, sets a limit on how quickly data within the send window may be transmitted.

3.5.1 Window-based control

As a part of the window-based control, the sender and each of the receivers enforce a set of rules on the local sequence number space that define a window of data that may be sent and received, respectively. The following paragraphs discuss in detail the rules that the sender and the receivers use to define the sequence space, as well as the terminology used to refer to it.

Send Sequence Space. The send sequence space is shown in Figure 3.5. The send window, shown in gray in the figure, includes all data that is currently buffered or that can currently be buffered by the protocol. The send window begins at snd_wnd and has a size of snd_wnd_size. The snd_nxt index marks the boundary in the sequence space between data that has been sent and data that is yet to be sent. Because sent data must be within the buffer, snd_nxt will always fall at or within the boundaries of the send window.

Figure 3.5: The send window.

As shown in Figure 3.5, snd_wnd, snd_nxt, and snd_wnd + snd_wnd_size divide the send sequence space into four regions. Region S1, which falls to the left of the send window, refers to the part of the data stream that has already been sent and subsequently released from the buffer. Data that has been sent and that is being buffered for possible retransmission falls within region S2. Data in region S3 has not yet been sent, but will fit in the send window. Data in region S4 does not currently fit into the send window.

Unlike positive acknowledgment protocols, where the receiver explicitly informs the sender that data has been successfully received and can be released from its buffer, in NAK-based protocols the receivers inform the sender only when data has not been received. Thus, without using an infinite buffer, it is impossible to provide 100% reliable delivery of data to the entire receiver group. RMC handles possible data loss by informing both the sending and the receiving application. In RMC, buffer release is based on when a packet was most recently sent and on an estimate of the round-trip time to the most distant receiver. In the current version of RMC, the minimum time that any data packet must be buffered is ten round-trip times from the most recent time the packet was sent, although this parameter is configurable.
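The following sketch captures this retention-based release rule: a buffered packet may be released, and the send window advanced past it, only after the minimum retention time (ten RTTs by default) has elapsed since its most recent transmission. Structure and helper names are illustrative.

struct sent_pkt {
    unsigned int  seq;            /* first octet in the packet */
    unsigned int  len;
    unsigned long last_sent_ms;   /* updated on every (re)transmission */
};

int min_rtts = 10;                /* configurable retention parameter */

int may_release(const struct sent_pkt *p,
                unsigned long now_ms, unsigned long rtt_ms)
{
    return now_ms - p->last_sent_ms >= (unsigned long)min_rtts * rtt_ms;
}

/* The sender would walk its retransmission buffer in sequence order,
 * releasing packets while may_release() holds and advancing snd_wnd
 * past each packet released. */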
Receive Sequence Space. The receive sequence space is depicted in Figure 3.6. Like the sender, receivers also define a window on the sequence space. The receive window identifies the portion of the data stream that has been received and is being buffered for the application, as well as data yet to be received for which buffer space is available. The rcv_wnd index marks the front of the receive window and is the sequence number of the first octet that is being buffered and is ready to be delivered to the receiving application. The receive window has size rcv_wnd_size. Within the receive window, the rcv_nxt index marks the boundary between data that has been received and data that is yet to be received. Because packets that do not fit in the buffer are dropped, rcv_nxt will always fall at or within the boundaries of the receive window.

Figure 3.6: The receive window.

Similar to the sending side, the receive sequence space is also divided into regions, as shown in Figure 3.6. Region R1 refers to the part of the data stream that has been received and subsequently consumed by the receiving application. Region R2 refers to data that has been received and is being buffered, but which has not yet been read by the application. Data that can be immediately received and buffered falls in region R3. Finally, R4 is the part of the data stream to be received in the future, but which does not fit into the current receive window and cannot currently be buffered.

3.5.2 Rate-based control

In the rate-based portion of RMC's flow control, the sender maintains a current transmission rate that defines how quickly the sender may transmit data from within the send window. The value of the current transmission rate is advertised in every outgoing packet, and is limited by the speed of the underlying network.

Figure 3.7: The receive window divided into the safe, warning, and critical flow regions by the thresholds T1 and T2.

The responsibility of ensuring that the receivers' buffer spaces do not overflow is placed at each of the receivers. Receivers monitor data flow by dividing the receive window into three flow regions, and then taking different flow control actions depending on how full the receive window becomes. These flow regions are defined by two thresholds, T1 and T2, as shown in Figure 3.7. T1 marks the beginning of the warning region, and is set by the protocol to the point in the window where there is room to receive four round-trip times' worth of data at the maximum transmission rate. Similarly, T2 marks the beginning of the critical region and is set to the point where there is room to receive two round-trip times' worth of data. The area to the left of T1 is considered a safe region.

Receivers use throttle requests to ask the sender to reduce its sending rate when the receiver has determined there is a possibility of overflow in the local receive window. Each receiver is locally suppressed to sending one throttle request per round-trip time to the sender. On the receipt of each packet, a receiver uses the advertised rate to evaluate the state of its receive window. The receiver uses the following rules to determine when to send a throttle request; a sketch of this evaluation appears at the end of this subsection.

1. If the receive window is filled only into the safe region, no flow control action is taken.

2. If the receive window is filled into the warning region, the receiver sends a throttle request only if the amount of data that may be sent at the advertised rate over the next four round-trip times is greater than the empty portion of the receive window.

3. If the receive window is filled into the critical region, the buffer is considered dangerously full. The receiver sends a throttle request to stop forward transmission regardless of the advertised rate.

If the throttle requests fail to arrive at the sender in time to sufficiently slow or stop forward transmission and a receiver's window becomes full, the receiver must drop those packets that do not fit in the window. As space in the receive window opens up, the receiver sends NAKs to request retransmission of those packets that it previously dropped.
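The sketch below encodes rules 1-3. The region tests compare the free space in the receive window against how much data the advertised and maximum rates could deliver over the relevant number of round-trip times; the names are illustrative and overflow handling is omitted.

enum throttle_action { NONE, THROTTLE, THROTTLE_URGENT };

struct rcv_wnd_state {
    unsigned long wnd_size;     /* receive window size, bytes */
    unsigned long used;         /* bytes currently occupied */
    unsigned long rtt_ms;
    unsigned long max_rate;     /* maximum transmission rate, bytes/sec */
};

enum throttle_action evaluate_window(const struct rcv_wnd_state *w,
                                     unsigned long advertised_rate)
{
    unsigned long free_bytes = w->wnd_size - w->used;
    unsigned long four_rtts  = advertised_rate * 4 * w->rtt_ms / 1000;
    unsigned long warn_room  = w->max_rate * 4 * w->rtt_ms / 1000;  /* T1 */
    unsigned long crit_room  = w->max_rate * 2 * w->rtt_ms / 1000;  /* T2 */

    if (free_bytes <= crit_room)
        return THROTTLE_URGENT;          /* critical: stop transmission */
    if (free_bytes <= warn_room && four_rtts > free_bytes)
        return THROTTLE;                 /* warning: request a slower rate */
    return NONE;                         /* safe region: no action */
}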
3.5.3 Slow start and congestion avoidance

At the beginning of data transmission for a new connection, and any time following a NAK or a throttle request, the sender sets the transmission rate to a minimum value, which is approximately one-tenth of the maximum transmission rate. In this way, the sender can test the transmission rate under the current network conditions without quickly overwhelming the network or the receivers. However, because the sender receives notification from receivers only of negative events, it has no indication of when, or by how much, it may increase the transmission rate to make full use of the potential of the network. For this reason, the sender uses slow start and congestion avoidance phases to increase the transmission rate following a rate decrease. RMC's slow start and congestion avoidance phases are based on the slow start and congestion avoidance methods used by TCP [29], but are applied to the transmission rate, rather than to the size of the window.

In the slow start phase, the transmission rate is allowed to increase on an exponential scale. That is, during slow start, the transmission rate starts at the minimum rate and is allowed to double every round-trip time. In contrast, during the congestion avoidance phase, the transmission rate is only allowed to increase linearly. Following a rate decrease, the sender starts in the slow start phase and uses a rate threshold to determine when to move to the more conservative congestion avoidance phase. The rate threshold is set at what is considered to be a "safe" rate at the time the rate decrease occurs. If a NAK initiates the rate decrease, the rate threshold is set at one half of the current transmission rate. If, however, a throttle request initiates the rate decrease, the rate threshold is set at one half of the transmission rate that is requested by the receiver in the throttle request. If the sender receives another NAK or throttle request during its slow start or congestion avoidance phases, it sets a new rate threshold and begins slow start again.
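A compact sketch of the rate adjustment follows: exponential growth up to the rate threshold, then linear growth, with the threshold reset on each negative event. The linear increment chosen here (one minimum-rate step per RTT) is an assumption; the thesis specifies only that growth becomes linear.

struct rate_ctl {
    unsigned long rate;        /* current transmission rate, bytes/sec */
    unsigned long max_rate;    /* network-limited maximum */
    unsigned long min_rate;    /* approximately max_rate / 10 */
    unsigned long ssthresh;    /* rate threshold between the two phases */
};

/* Invoked once per round-trip time while no negative feedback arrives. */
void rate_increase(struct rate_ctl *c)
{
    if (c->rate < c->ssthresh)
        c->rate *= 2;                    /* slow start: double per RTT */
    else
        c->rate += c->min_rate;          /* congestion avoidance: linear */
    if (c->rate > c->max_rate)
        c->rate = c->max_rate;
}

/* Invoked on a NAK (basis_rate = current rate) or a throttle request
 * (basis_rate = the rate requested by the receiver). */
void rate_decrease(struct rate_ctl *c, unsigned long basis_rate)
{
    c->ssthresh = basis_rate / 2;        /* new "safe" threshold */
    c->rate = c->min_rate;               /* restart from the minimum */
}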
Chapter 4
RMC Implementation in Linux

For this thesis project, the RMC protocol was implemented as a network driver in the Linux operating system kernel. We chose to implement the RMC protocol at the kernel level rather than at user level for performance reasons. In most cases, kernel-level network drivers perform better than user-level drivers because kernel-level drivers have closer access to the network and run at higher priority than user-level code. In addition, when the protocol is located in the kernel, much of the protocol processing can take place without having to frequently cross the user/kernel boundary. The Linux kernel was chosen as the medium for the protocol implementation because the Linux kernel source code is publicly available. Further, the Linux operating system is widely used and is fairly well tested.

This chapter provides a description of the implementation of the RMC protocol in the network stack of the Linux kernel. First, we present background material on the implementation of the Linux networking stack, focusing on the TCP/IP network stack. We then describe where and how RMC fits into the Linux network stack. Finally, we discuss details of the RMC protocol implementation, including the implementation of socket creation, the architecture of the receiver and sender, and the implementation of socket close.

4.1 Linux Networking

Linux is a freely available UNIX-based operating system that began as a project of Linus Torvalds, a Finnish student of computer science. Linux is distributed under the GNU Public License, which means that the full source code is available to users. The Linux source code was first made available on the Internet in 1991, and continues to evolve due to the support of developers located worldwide [30].

Networking support is an integral and important part of the Linux kernel. Networking is important to Linux not only because of the growing need for networked computers, but also because the success of the Linux operating system is due largely to the collaborative environment that the Internet provides [31]. Linux supports several types of networking sockets, also known as address families. The most widely used address family is the INET address family, which implements the TCP/IP network protocol. In addition to the INET address family, Linux also supports the UNIX, AX25, IPX, AppleTalk, and X25 socket address families.

Linux networking is implemented as a series of connected layers, as shown in Figure 4.1. User-level applications access networking capabilities through the BSD socket interface. In the INET address family, BSD sockets are supported by the INET socket layer, which manages communication endpoints for IP-based transport protocols. INET sockets normally interface with either the TCP or UDP transport layer protocols, but may also access the IP layer directly, given that the application is being run by a superuser. The IP layer implements the Internet Protocol and is supported directly by the network devices, including Ethernet, PPP, and SLIP [31]. Within the Linux networking layers, RMC is implemented as a transport layer protocol and is located between the BSD socket layer and the IP layer.

Figure 4.1: Linux networking layers (network applications, BSD socket interface, protocol layers, IP, and devices, divided between user and kernel space).

4.1.1 BSD Sockets

BSD sockets provide a unified interface between user-level applications and kernel-level networking services from the supported address families. The BSD socket interface, shown in Figure 4.2, consists of a set of system calls that user-level applications use to create network sockets, initialize communication endpoints, and communicate with other applications on remote machines. RMC supports a subset of the BSD socket system calls and provides a BSD socket interface to user-level applications.

int socket(int addr_family, int type, int protocol);
int bind(int sockfd, struct sockaddr *my_addr, int addrlen);
int connect(int sockfd, struct sockaddr *dest_addr, int addrlen);
int send(int sockfd, const void *msg, int len, unsigned int flags);
int sendto(int sockfd, const void *msg, int len, unsigned int flags,
           const struct sockaddr *to, int tolen);
ssize_t write(int sockfd, const void *buf, size_t count);
int recv(int sockfd, void *buf, int len, unsigned int flags);
int recvfrom(int sockfd, void *buf, int len, unsigned int flags,
             struct sockaddr *from, int *fromlen);
ssize_t read(int sockfd, void *buf, size_t count);
int setsockopt(int sockfd, int level, int optname, const void *optval,
               int optlen);
int close(int sockfd);

Figure 4.2: BSD socket system calls.

Often, two communicating applications take on a client-server relationship, where the server continuously satisfies requests by the client. A typical sequence of calls made by the client and the server to carry out such a data transfer is shown in Figure 4.3. The server runs in a continuous loop, accepting connections, receiving and processing requests from each connection, and closing the connection [32].
Figure 4.3: Sequence of system calls for a simple client (socket, connect, write, read, close) and server (socket, bind, listen, accept, read, write, close).

BSD Socket Structure. For each socket that is created, BSD allocates and initializes a socket structure to maintain high-level information about the socket. The type field and the ops field in the socket structure are used to distinguish which address family the socket belongs to. The type field holds the identifier for the connection type. For example, the SOCK_STREAM type corresponds to a TCP socket, while the SOCK_DGRAM type corresponds to a UDP socket. The ops field is a pointer to a vector of address family protocol operations.

BSD Socket Interface. In the Linux kernel, the initial interface between BSD sockets and each lower layer address family is defined by a net_proto_family structure, which includes a unique address family identifier and a pointer to a create function. During initialization of the networking services at system startup, each address family registers a net_proto_family vector with the BSD layer. The BSD layer stores the net_proto_family vectors in the net_families table. An example of the net_families table after system initialization is shown in Figure 4.4.

    net_families:
    AF_INET    inet_create()
    AF_UNIX    unix_create()
    AF_AX25    ax25_create()
    AF_IPX     ipx_create()

Figure 4.4: The net_families table.

Each address family also initializes a protocol operations vector that defines additional functions specific to the address family. The protocol operations vector is defined by the proto_ops structure, which is shown in Figure 4.5. The family field indicates which address family the structure represents and is initialized to the same unique address family identifier that is included in the net_proto_family structure. The remaining fields are function pointers to other address-family-specific functions. For example, in the AF_INET address family's protocol operations vector, the sendmsg and recvmsg fields are initialized to point to the inet_sendmsg and inet_recvmsg functions.

struct proto_ops {
    int family;                /* Unique address family identifier. */
    int (*dup)(...);           /* Duplicate a socket. */
    int (*release)(...);       /* Release a socket structure. */
    int (*bind)(...);          /* Bind a socket to a local port. */
    int (*connect)(...);       /* Connect a socket to a remote addr. */
    int (*accept)(...);        /* Accept connections from remote hosts. */
    int (*listen)(...);        /* Listen for connections. */
    int (*shutdown)(...);      /* Close a socket connection. */
    int (*setsockopt)(...);    /* Set option specific to addr. family. */
    int (*getsockopt)(...);    /* Get option specific to addr. family. */
    int (*sendmsg)(...);       /* Send a message on the socket. */
    int (*recvmsg)(...);       /* Receive a message on the socket. */
};

Figure 4.5: The proto_ops structure used by BSD as the interface to different address families.

After the system is up and running, user applications use the socket system call to create new network sockets, specifying the address family identifier as a parameter. In the implementation of the socket system call, BSD uses the address family identifier as an index into the net_families table and calls the corresponding create function. Each address family's create function is responsible for completing any address-family-specific initialization for the new socket, including setting the BSD socket type field to the connection type and the BSD socket ops field to point to the address family's proto_ops vector. The resulting data structures are shown in Figure 4.6.

Figure 4.6: Linux BSD and INET socket structures (the BSD socket's ops field points to the proto_ops vector, e.g. inet_bind and inet_sendmsg for AF_INET, and its sk field points to the INET socket).
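The following is a simplified, self-contained model of the dispatch just described, showing how the socket system call can use the address family identifier to reach the registered create function. It is a toy version of the mechanism in net/socket.c, not the actual kernel source.

#include <stddef.h>

#define NPROTO 32                       /* max address families (illustrative) */

struct socket;                          /* BSD socket, opaque here */

struct net_proto_family {
    int family;
    int (*create)(struct socket *sock, int protocol);
};

static struct net_proto_family *net_families[NPROTO];

/* Called by each address family at system startup. */
int register_family(struct net_proto_family *fam)
{
    if (fam->family < 0 || fam->family >= NPROTO)
        return -1;
    net_families[fam->family] = fam;    /* e.g., AF_INET -> inet_create */
    return 0;
}

/* The heart of the socket() system call: index the table by family
 * and delegate creation to the registered create function. */
int dispatch_create(struct socket *sock, int family, int protocol)
{
    if (family < 0 || family >= NPROTO || net_families[family] == NULL)
        return -1;                      /* unsupported address family */
    return net_families[family]->create(sock, protocol);
}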
Later, as the network sockets are used, BSD references the protocol operations vector in the given BSD socket structure to complete address-family-specific operations for that socket.

4.1.2 INET Sockets

The INET socket layer is located one layer below BSD sockets in the Linux networking stack. The purpose of the INET socket layer is to provide a common interface between BSD sockets and the TCP, UDP, and IP protocols, as well as to provide common supporting functions for those protocols that are built on top of the IP layer. RMC bypasses the INET socket layer, but uses many of the data structures and functions that the INET socket layer provides, including the INET socket data structure, socket buffers and the functions used to manage socket buffers, and the socket locking and socket memory management functions.

The INET Socket Structure. The INET socket layer uses its own socket data structure, the sock structure, to maintain network information that is specific to those protocols that are based on the IP layer. A sock structure is allocated and initialized for each INET socket as part of the inet_create function. Because RMC is based on the IP layer, it also allocates an INET socket to maintain information about each RMC socket. A portion of the sock structure is shown in Figure 4.7. Among other things, the sock structure includes IP address and transport layer port information for the local and remote communication endpoints. The sock structure also includes a series of variables that track memory allocation for the socket, a series of variables that define the data queues for incoming and outgoing packets, and a transport protocol control block for transport-layer-specific information. Other fields in the sock structure maintain information about the local state of the socket and miscellaneous information about the state of the connection.

Socket buffers. Linux uses socket buffers, defined by the sk_buff structure, to manage individual communication packets and to pass them between the different socket layers and the device drivers. RMC receives packets from and passes packets to the IP layer in the form of socket buffers. The sk_buff structure, depicted in Figure 4.8, is composed of a data block with an attached control block. The data block holds the data that will be sent to, or that was received from, the network. The control block consists of length and pointer fields that allow each protocol layer to manipulate the application data using standard methods.

Normally, when a process allocates an sk_buff for an outgoing packet, it allocates enough space for both the data and all of the network headers. It copies the data into the data block, reserving enough space in front of the data for the lower network layers to add in their network headers.
struct sock {
    /* Socket addressing information. */
    __u32           daddr;           /* Foreign IPv4 addr */
    __u32           rcv_saddr;       /* Bound local IPv4 addr */
    __u32           saddr;           /* Sending source */
    __u16           dport;           /* Destination port */
    unsigned short  num;             /* Local port */
    __u16           sport;           /* Source port */

    /* Memory allocation information. */
    int             rcvbuf;          /* Size of receive buffer in bytes */
    atomic_t        rmem_alloc;      /* Receive queue bytes committed */
    int             sndbuf;          /* Size of send buffer in bytes */
    atomic_t        wmem_alloc;      /* Transmit queue bytes committed */
    unsigned int    allocation;      /* Allocation mode */

    /* Queues for incoming and outgoing packets. */
    struct sk_buff_head receive_queue;  /* Incoming packets */
    struct sk_buff_head write_queue;    /* Packet sending queue */
    struct sk_buff_head back_log;       /* Backlog packet queue */

    /* Transport protocol specific information. */
    union {
        struct tcp_opt af_tcp;
    } tp_pinfo;

    unsigned short  mss;             /* Current eff. mss */
    volatile unsigned char state;    /* Connection state */
    unsigned char   shutdown;        /* Ready to shut down */
    struct wait_queue **sleep;       /* Sock wait queue */
    int             sock_readers;    /* User count */
    struct proto    *prot;           /* Transport layer functions */
    unsigned char   protocol;        /* Transport layer id */
    struct socket   *socket;         /* BSD socket */

    /* Callbacks */
    void (*state_change)(struct sock *sk);
    void (*data_ready)(struct sock *sk, int bytes);
    int  (*backlog_rcv)(struct sock *sk, struct sk_buff *skb);
};

Figure 4.7: The sock data structure.

Figure 4.8: A socket buffer: a control block (head, data, tail, and end pointers, link, network, and transport layer header pointers, and a transport layer control block) attached to a data block holding the packet to be transmitted.

By initially reserving space for headers in the data block, each of the network layers can easily insert its header into the data block and avoid costly copy operations. The control block of an sk_buff also includes pointers to the network headers in the data packet. These are useful, for example, as an incoming data packet is being passed up the network stack. As each layer processes the packet, it sets the appropriate header pointer to point to the beginning of its header. It then sets the data pointer to point just past its network header. Thus, when the next level receives the packet, it can easily find the beginning of its header. In addition, if the current network level needs to access lower layer headers, it can easily locate them using the header pointers that were set by the lower layers.

Like the INET socket structure, socket buffers also have a transport layer control block that transport layer protocols may use to maintain transport-layer-specific information. For example, TCP may use the transport layer control block to maintain information such as the beginning and ending sequence numbers of the data packet.
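The pointer arithmetic behind this header reservation is easy to see in a toy model. The sketch below mimics the kernel's skb_reserve()/skb_push() operations on a self-contained structure; it is an illustration of the idea, not the real sk_buff.

struct toy_skb {
    unsigned char buf[2048];
    unsigned char *head, *data, *tail, *end;
};

void toy_init(struct toy_skb *skb)
{
    skb->head = skb->data = skb->tail = skb->buf;
    skb->end = skb->buf + sizeof(skb->buf);
}

/* Reserve headroom for headers that lower layers will add later. */
void toy_reserve(struct toy_skb *skb, int len)
{
    skb->data += len;
    skb->tail += len;
}

/* Prepend a header by moving the data pointer back. The payload is
 * never copied, which is the point of reserving headroom up front. */
unsigned char *toy_push(struct toy_skb *skb, int len)
{
    skb->data -= len;
    return skb->data;
}

/* Usage: reserve room for the RMC, IP, and link headers, copy in the
 * payload, then let each layer push its header in front of the data
 * as the packet descends the stack. */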
Socket Locking. Incoming data packets are delivered by interrupt and bottom half handlers, which are allowed to interrupt other processes. The processing that occurs as a result of an incoming packet has the potential of changing the socket state, possibly disrupting work that is being done in the interrupted process. In addition, it is common for network protocols to use timers to schedule future work. Often, the scheduled work involves updating the socket state. Thus, a timer interrupt could also have an unpredictable effect on the socket. Packets are delivered to RMC as a part of the network bottom half handler. In addition, RMC uses timers to schedule various events. To prevent sockets from being put in an inconsistent state, the RMC protocol must be careful to lock sockets while they are being accessed or modified.

Socket locking is implemented by maintaining a socket reader count for each socket. To check the lock status of a socket, interrupt handlers simply check the value of the socket reader count. If it is zero, the socket is considered unlocked and the handler continues. If, however, the count is one or more, the socket is locked, and the handler's processing is delayed. The socket reader count is maintained in the sock_readers field of the INET socket structure.

The lock_sock and release_sock functions, shown in Figure 4.9, are used to lock and unlock a socket. To lock a socket, the lock_sock function increments the socket reader count. Likewise, the release_sock function decrements the socket reader count to unlock a socket. In addition, if a socket becomes unlocked as a result of the release, packets on the backlog queue are processed and delivered to the socket via a backlog receive function.

static inline void lock_sock(struct sock *sk)
{
    sk->sock_readers++;
    barrier();
}

static inline void release_sock(struct sock *sk)
{
    barrier();
    if ((sk->sock_readers = sk->sock_readers - 1) == 0)
        __release_sock(sk);
}

void __release_sock(struct sock *sk)
{
    if (!sk->prot || !sk->backlog_rcv)
        return;

    /* See if we have any packets built up. */
    start_bh_atomic();
    while (!skb_queue_empty(&sk->back_log)) {
        struct sk_buff *skb = sk->back_log.next;
        __skb_unlink(skb, &sk->back_log);
        sk->backlog_rcv(sk, skb);
    }
    end_bh_atomic();
}

Figure 4.9: The lock_sock and release_sock functions.

Socket Memory Management. Memory is a valuable and limited resource at the kernel level of an operating system, and must be carefully managed. RMC manages socket memory in the same way that the INET socket layer does. In the INET layer, each INET socket that is created is given a limited amount of memory that it may use for processing the incoming and outgoing data stream. The maximum amount of memory that may be allocated for writing and reading is set in the sndbuf and
Among other fields, the inet_protocol structure includes the handler field, which is a pointer to the transport layer’s receiver inter- rupt handler for processing incoming data packets, and the err_handler field, which is a pointer to the transport layer’s error interrupt handler for handling ICMP pack- ets. The inet_protocol structure also includes a protocol field, which is used for 58 protocol identification. During system initialization, each transport protocol registers an inet_protocol vector with IP. Later, as data or ICMP packets arrive from the network, IP uses the protocol identification field in the IP header as an index into the registered inet_protocol vectors. IP then delivers packets to the appropriate transport layer protocol by calling the corresponding handler. struct inet_protocol { int (*handler)(struct sk_buff *skb, unsigned short len); void (*err_handler)(struct sk_buff #skb, unsigned char *dp, int len); struct inet_protocol *next; unsigned char protocol; unsigned char capyzl; void *data; const char tname; Figure 4.10: The inet_protocol structure. Sending IP Packets. The interface between IP and transport layer protocols in the outgoing direction is defined by two main sending functions: ip_queue_xmit and ip_build_and_send_pkt. The function definitions for these functions are shown in Figure 4.11. Both functions take a socket buffer as input, prefix the packet stored in the socket buffer with an IP header, and queue the packet to be sent out by the lower level device driver. The ip_queue_xmit determines to which socket the socket buffer belongs and uses the bound destination address from the socket to route the outgoing packet. The ip_queue_xmit function is commonly used by the TCP protocol since sockets are bound to a single destination. The ip_build_and_send_pkt function, on the other hand, allows the calling entity to specify the destination address independent of the bound destination. However, it also requires the calling entity to have already 59 calculated the routing information for the desired destination and have stored it in the control block of the socket buffer. To calculate the routing information, IP provides another function, ip_route_output, which is normally called immediately before ip_build_and_send_pkt. The ip_build_and_send_pkt function is commonly used by UDP, where sockets may send to a number of different destinations. void ip_queue_xmit(struct sk_buff Itskb); void ip_build_and_send_pkt(struct sk_buff *skb, struct sock *sk, u32 saddr, u32 daddr, struct ip_options Itopt); Figure 4.11: IP sending functions. 4.1.4 Network Devices and Bottom Half Handling At the lowest layer of the network stack in the Linux kernel, packets are received from and sent to the network by network device drivers. When a network device receives data from the network, it triggers an interrupt in the operating system, which causes the corresponding interrupt handler in the device driver to be invoked. Because interrupts suspend the operating system from all other processing while the interrupt is handled, it is desirable to have interrupt handlers do only the “impor- tant” processing and delay less important processing to a time that is more convenient for the operating system. Linux uses bottom half handlers to delay non-critical in- terrupt processing to a later time. In Linux networking, a bottom half handler is used by the network device drivers to deliver incoming data packets to the network level protocols. 
Instead of immediately delivering incoming packets to the IP layer, for example, the interrupt handler of the network device driver puts the packet on a backlog queue and marks the network bottom half as active. Later, on the return from every system call and every slow interrupt, if no further interrupt is running at the time, a list of 32 bottom halves is scanned. Each bottom half routine is run if it is marked active, and is then automatically marked inactive. The network bottom half reads packets from the backlog queue and delivers them to the appropriate protocol as identified in the Ethernet header, in this case, the IP layer. As a part of the network bottom half, the IP layer delivers each packet to the appropriate transport layer protocol, such as TCP or RMC.

4.2 The RMC Interface and Data Structures

In this section, we discuss the user interface to RMC and the RMC interface to the BSD and IP layers of the network stack. We also discuss the data structures that RMC uses to manage sockets.

4.2.1 The RMC Interface

The RMC protocol has been implemented to support a subset of the BSD socket calls, as listed in Table 4.1.

    socket                  Create an RMC socket.
    bind                    Bind the socket to a local port.
    connect                 Connect to a multicast destination endpoint.
    setsockopt              Set socket options, e.g., join a multicast group.
    send, sendto, write     Send outgoing data.
    recv, recvfrom, read    Receive the next incoming data.
    close                   Close the socket.

Table 4.1: RMC-supported BSD functions.

RMC sockets support half-duplex communication. An RMC socket can be a sender or a receiver, but not both. Data flows in the forward direction, from the sender to the receivers, and control information flows in the backward direction, from a receiver to the sender.

A sample receiving application is shown in Figure 4.12. The receiver first creates a socket with protocol family AF_RMC, type SOCK_IP, and protocol IPPROTO_RMC. After binding to a well-known multicast port, the receiving application uses setsockopt to join a multicast group. It then uses the recv system call to receive data on the RMC socket. When the application finishes using the connection, it calls close.

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MCPORT 4096

int main(int argc, char **argv)
{
    int sock, c = 0;
    struct sockaddr_in my_addr;
    struct ip_mreq mreq;
    char buf[BUFSIZ];
    long int mcaddr = inet_addr("234.5.5.5");

    /* Set up the RMC socket. */
    if ((sock = socket(AF_RMC, SOCK_IP, IPPROTO_RMC)) < 0) {
        perror("socket");
        exit(1);
    }

    /* Bind to the data (multicast) port. */
    memset(&my_addr, 0, sizeof(my_addr));
    my_addr.sin_family = AF_RMC;
    my_addr.sin_addr.s_addr = INADDR_ANY;
    my_addr.sin_port = htons(MCPORT);
    if (bind(sock, (struct sockaddr *)&my_addr, sizeof(my_addr)) < 0) {
        perror("bind");
        exit(1);
    }

    /* Join the multicast group. */
    mreq.imr_multiaddr.s_addr = mcaddr;
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(sock, SOL_RMC, RMC_ADD_MEMBERSHIP,
                   &mreq, sizeof(mreq)) < 0) {
        perror("setsockopt");
        exit(1);
    }

    /* Receive the data. */
    if ((c = recv(sock, buf, sizeof(buf), 0)) < 0) {
        perror("recv");
        exit(1);
    }

    /* Close the socket. */
    close(sock);
    return 0;
}

Figure 4.12: A receiving application.

The corresponding sending application is shown in Figure 4.13. The sender first creates a socket in the same way that the receiver did. The sender then binds to
RMC bypasses the INET socket layer and provides a direct interface to the BSD layer for two reasons. First, by going around the INET socket layer, the RMC implementation remains separate from the existing implementation of TCP and UDP. While RMC uses many of the data structures and functions that are provided by the INET socket layer, very few changes were made to the existing code to accommodate RMC. Second, by bypassing INET sockets, the path between the user application and the network is shorter, so the performance should be better. The rmc_family_ops and the rmc_proto_ops vectors, which are shown in Fig- ure 4.14, define the interface between BSD and RMC. The rmc_fami1y_ops vector provides BSD with an entry point into the RMC protocol when new sockets are created. This vector is registered with BSD upon initialization of the system, and is stored as a part of the net_families table, which is shown in Figure 4.4. The 63 #include #include #include #define HCPDRT 4096 #define HYPORT 4097 int main(int argc, char ttargv) { int sock,c = 0; struct sockaddr_in my_addr, data_addr; char bufEBUFSIZ]; unsigned long myaddr = inet_addr(”35.9.26.154"); unsigned long mcaddr = inet_addr("234.5.5.5"); /* Initialize the data buffer. ... t/ /* Set up the socket. */ if ((SOCk B socket(AF_RMC,SOCK_IP,IPPROT0_RMC)) < 0) { perror("socket"); exit(1); } /# Bind to my address. */ memset(&my_addr,0,sizeof(my_addr)); my_addr.sin_family = AF_RMC; my_addr.sin_addr.s_addr = INADDR_ANY; my_addr.sin_port - htons(MYPORT); if (bind(sock,(struct sockaddr*)&my_addr,sizeof(my_addr)) < 0) { perror("bind"); exit(1); } /* Connect to the multicast address. */ memset(&data_addr,0,sizeof(data_addr)); data_addr.sin_fami1y = AF_RMC; data_addr.sin_addr.s_addr - mcaddr; data_addr.sin_port = htons(HCPORT); if (connect(sock,(struct sockaddr*)&data_addr,sizeof(data_addr)) < 0) { perror("connect"); exit(1); } /* Send the data. */ if ((c - send(sock,(void*)buf,sizeof(buf),0)) < O) { perror("sen "); exit(1); } /* Close the socket. */ close(sock); Figure 4.13: A sending application. 64 rmc_proto_ops vector specifies the remainder of the RMC specific functions as de- fined by the proto_ops structure, which is defined in Figure 4.5. struct net_proto_family rmc_family_ops = { AF_RMC, /* Unique identifier for RMC. */ rmc_create /# Create a new RMC socket. */ }; static struct proto_ops rmc_proto_ops = { AF_RHC, /* Unique identifier for RMC. */ rmc_dup, /* Duplicate an RMC socket. */ rmc_release, /* Release an RMC socket. Free all assoc. mem. */ rmc_bind, /* Bind an RMC socket to a local port. */ rmc_connect, /* Connect an RMC socket to a multicast addr. */ rmc_accept, /* Not supported. */ rmc_listen, /* Not supported. */ rmc_shutdown, /* Close an RMC socket connection. */ rmc_setsockopt, /* Set RMC options. */ rmc_getsockopt, /* Get RMC option values. */ rmc_sendmsg, /* Send a message on the RMC socket. */ rmc_recvmsg, /* Receive a message on the RMC socket. */ Figure 4.14: The RMC network protocol family structure. The interface between RMC and IP is defined by the rmc_protocol vector, which defines the inet_protocol structure (Figure 4.10) for RMC. During initializa- tion of the system, RMC also registers the rmc_protocol vector with the IP layer. Among other fields, the rmc_protocol vector includes a pointer to the RMC han- dler, rmc_ip_rcv, and a unique protocol ID, IPPROT0_RMC. To identify IP packets as RMC packets, RMC sets the protocol field of the IP header of all outgoing packets to IPPROT0-RMC. 
As IP processes incoming packets, it uses the protocol field in the IP header to identify packets of type IPPROTO_RMC and deliver them to the rmc_ip_rcv function.

For sending packets, the RMC protocol implementation makes use of two IP sending routines. The ip_queue_xmit routine takes care of building IP packets and routing them to the destination specified in the INET socket structure. For the sender, the specified destination is the multicast group, and for the receivers, the specified destination is the sender. Since most traffic flows from the sender to the multicast group and from multicast group members back to the sender, RMC uses ip_queue_xmit to transmit most outgoing data packets. However, JOINACCEPT and NAKERR messages are delivered only to the group member that is joining the group or the group member affected by the error. Thus, packets of these types must be routed to a destination other than the one specified in the sock structure. In this case, RMC first calls ip_route_output to route the packet to the appropriate destination, and then calls ip_build_and_send_pkt to build the IP packet and send it to the destination specified in the function parameters.

4.2.2 RMC Data Structures

RMC uses the control blocks in both the INET socket structure and in socket buffers to maintain information that is specific to RMC sending and receiving sockets. In the INET socket structure (Figure 4.7), the RMC socket control block is defined by the rmc_opt structure, and is stored as a part of the tp_pinfo union. The rmc_opt structure is shown in Figures 4.15 and 4.16. All RMC sockets use the multicast hash information in the RMC socket control block to support quick access to each RMC socket. For receiving sockets, the RMC socket control block includes variables that maintain the receive window, the NAK list, and the NAK timer, as well as the out-of-order queue. Fields that are relevant only to sending sockets include those that maintain the send window and rate window, as well as retransmission and keepalive information.

struct rmc_opt {
    /* Multicast hash info */
    struct sock *mc_next;   /* Next RMC socket in this hash list. */
    struct sock *mc_prev;   /* Previous RMC socket in this hash list. */
    unsigned short master;  /* Is this the master receiver for RMC? */

    /* Receive information */
    __u32 rcv_wnd;          /* Seq. number of first byte in receive window. */
    __u32 rcv_wnd_size;     /* Size of the receive window. */
    __u32 rcv_nxt;          /* Seq. number that we want to receive next. */
    __u32 unnaked_nxt;      /* Last unnaked sequence number. */
    __u32 copied_seq;       /* Next seq. number to copy to the user. */
    int wnd_time;           /* Next time we can request a window reduction. */
    __u32 rcv_thresh1;      /* Border between safe and warning regions. */
    __u32 rcv_thresh2;      /* Border between warning and critical regions. */

    /* NAK structures */
    struct timer_list nak_timer;            /* Timer to control resending of NAKs. */
    struct sk_buff_head nak_list;           /* List of pending NAKs. */
    struct sk_buff_head out_of_order_queue; /* Out-of-order segments. */
}; /* struct rmc_opt */

Figure 4.15: RMC receiver specific information.

struct rmc_opt {
    /* Send information */
    int members;               /* Number of receivers in the multicast group. */
    __u32 snd_wnd;             /* Seq. number of first packet in send window. */
    __u32 snd_wnd_size;        /* Size of the send window. */
    __u32 snd_nxt;             /* First seq. number to use on next send. */
    __u32 prev_snd_nxt;        /* Beginning of the current rate window. */
    int advance_time;          /* How long to keep an skb in the window. */
    int min_sends;             /* Min. number of sends before releasing an skb. */
    __u32 write_seq;           /* Next seq. number to use for a new packet. */
    int total_sends;           /* Total number of sends/resends in the window. */
    int wait_for_mem;          /* Flag, true if waiting for memory. */
    struct sk_buff *send_head; /* Next packet to send on the backlog. */

    /* Rate-based flow control information. */
    __u32 last_pkt;            /* End of the rate window. */
    __u32 snd_rate_wnd;        /* Size of the rate window. */
    __u32 max_snd_rate_wnd;    /* Maximum value for the rate window. */
    __u32 snd_ssthresh;        /* Slow start size threshold. */
    struct timer_list slow_start_timer; /* Timer to control packet trans. */
    __u32 wnd_increase;        /* Size of possible rate window increase. */
    struct probe join_req;     /* Initial seq. number join information. */
    int rtt;                   /* Estimated round-trip time. */

    /* Retransmission information */
    struct retrans_req *retrans_queue; /* List of NAKs to service. */
    struct timer_list retrans_timer;   /* Timer to control packet retrans. */

    /* Keepalive structures */
    struct timer_list ka_timer; /* Timer to control keepalive pkt. trans. */
    int ka_time;                /* Current keepalive interval. */
    int sent_kas;               /* Number of consecutive keepalive pkts. */
    unsigned int idle_time;     /* Last time non-keepalive pkt. sent. */
    __u8 flags;                 /* Miscellaneous flags. */
}; /* struct rmc_opt */

Figure 4.16: RMC sender specific information.

Packets are passed to and from the RMC layer in socket buffers. RMC uses the control block in the socket buffer structure to hold RMC-specific data. The control block is a block of 48 bytes set aside by the socket buffer structure for transport layer specific information. The RMC socket buffer control block is defined by the rmc_skb_cb structure, which is shown in Figure 4.17.

struct rmc_skb_cb {
    struct inet_skb_parm header; /* For incoming frames. */
    __u32 seq;                   /* Starting sequence number. */
    __u32 end_seq;               /* Ending sequence number. */
    unsigned long when;          /* Time the packet was last sent. */
    unsigned long timeout;       /* For NAKs, when to resend. */
    unsigned char type;          /* RMC packet type. */
    __u8 flags;                  /* Miscellaneous flags. */
    __u8 tries;                  /* Times this packet has been sent. */
};

Figure 4.17: RMC socket buffer information.
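The code shown in later figures accesses this control block through an RMC_SKB_CB macro. Its definition is not reproduced in the figures; a plausible form, following the pattern Linux uses for TCP's socket buffer control block, is:

/* Overlay rmc_skb_cb on the 48-byte cb[] scratch area of the socket
 * buffer (the same technique Linux uses for TCP_SKB_CB). */
#define RMC_SKB_CB(skb) ((struct rmc_skb_cb *)&((skb)->cb[0]))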
Information in the RMC control block includes the beginning and ending sequence numbers of the data packet, as well as time stamps for when the packet arrives or is sent and for when the packet expires.

4.2.3 RMC Ports

Like TCP and UDP, RMC uses ports to identify participating processes on the sending and receiving machines. RMC hashes each socket by its bound port number, which is stored in the num field in the INET socket structure (Figure 4.7). References to RMC sockets are stored in a hash table so that destination sockets can be quickly located as packets arrive from the network.
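A lookup over this table plausibly takes the form sketched below; the bucket count and the helper's exact signature are assumptions rather than the RMC source (the actual lookup is performed by the rmc_lookup function used in the receive path, discussed in Section 4.3.2).

#define RMC_HTABLE_SIZE 128   /* assumed number of hash buckets */

/* Sketch: find the socket bound to a destination port by walking
 * the normal hash chain. */
static struct sock *rmc_lookup(unsigned short dport)
{
    struct sock *sk;

    for (sk = rmc_hash[dport & (RMC_HTABLE_SIZE - 1)]; sk != NULL;
         sk = sk->next)
        if (sk->num == dport)
            return sk;
    return NULL;
}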
In a multicasting environment, it is possible that multiple processes on the same machine wish to receive data from the same multicast source. In this situation, these processes bind to the same local port, but each process in the group should be given its own copy of the incoming multicast data. One way to handle this situation is to immediately deliver a copy of each incoming packet destined for the shared multicast endpoint to each of the destination sockets in the group. As a result, however, each socket in the group has to reassemble the incoming data stream and handle its own error recovery and flow control. This method clearly results in a replication of effort among the sockets on the same host that are listening to the same multicast group.

To reduce both the local processing needs and the amount of feedback to the sender, it is preferable to have a single controlling mechanism for the group of sockets on the same host that are listening to the same multicast endpoint. RMC implements this idea by assigning one of the sockets to be a master socket and making all other sockets in the group children of the master socket. Master sockets are marked by setting the master field in the INET socket structure (Figure 4.7) to 1. All incoming data is first delivered to the master socket, which reassembles the incoming data stream and handles error recovery and flow control for the group. The master socket delivers a separate copy of the in-order data to each of its children.

The RMC hashing structure is shown in Figure 4.18. RMC supports master and children sockets by adding the multicast hash list fields mc_next and mc_prev to the RMC socket control block (Figure 4.15). These fields are used to maintain a linked list of the RMC sockets listening to the same multicast source. A master socket is hashed by its bound address and uses the mc_next variable to point to the list of its children. In the figure, the normal hash list extends from the RMC hash table entry, and the multicast hash list extends down from the master socket. Hashing the master socket in this way makes it easy for the receive interrupt handler to initially deliver incoming data to the master socket. Extending the multicast hash list from the master socket makes it easy for the master socket to distribute in-order data to all its children.

Figure 4.18: RMC's hashing table structure.

4.3 RMC Protocol Implementation

Now that we have discussed where and how RMC fits into the Linux networking stack, as well as the data structures that RMC uses to maintain sockets, we will describe the implementation of the RMC protocol. We will first discuss the implementation of RMC socket creation and connection initialization. We will then discuss the architecture of the RMC receiver, followed by the architecture of the RMC sender. We will finish with a discussion of RMC connection teardown.

4.3.1 RMC Socket Creation and Connection Initialization

Both the sender and the receiver must perform special tasks when an RMC socket is created and when the RMC connection is first established. User applications use the socket system call to create RMC sockets. As a result of this call, the BSD layer allocates a BSD socket structure and calls the rmc_create function. The rmc_create function allocates an INET socket structure for the new socket and creates links between the BSD and INET socket structures. In addition, the ops field of the BSD socket structure is set to point to RMC's protocol operations vector, rmc_proto_ops. Further initialization of the INET socket structure is performed by the sock_init_data and rmc_init_sock functions. This initialization includes setting the size of the rate window to a single segment and the state of the socket to RMC_LISTEN. The resulting structures are shown in Figure 4.19.

Figure 4.19: The RMC socket structures.
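We do not reproduce rmc_create itself; based on the description above, it plausibly reduces to the following sketch (the exact arguments of sk_alloc varied across 2.1 kernels, and the body is an illustration rather than the RMC source):

/* Sketch: create an RMC socket in response to the socket system call. */
static int rmc_create(struct socket *sock, int protocol)
{
    struct sock *sk;

    if (protocol && protocol != IPPROTO_RMC)
        return -EPROTONOSUPPORT;

    sk = sk_alloc(AF_RMC, GFP_KERNEL, 1);  /* allocate the INET socket */
    if (sk == NULL)
        return -ENOMEM;

    sock->ops = &rmc_proto_ops;  /* point BSD at RMC's operations vector */
    sock_init_data(sock, sk);    /* generic init; links BSD and INET structs */
    rmc_init_sock(sk);           /* rate window = one segment,
                                    state = RMC_LISTEN */
    return 0;
}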
After a receiving application creates a new RMC socket, and before it is able to receive data on that socket, the application must bind the socket to a well-known multicast port and join an IP multicast group. The application uses the bind system call to bind the socket. In RMC, the bind system call is implemented by the rmc_bind function, which initializes the num field in the INET socket structure to the bound port. The application calls setsockopt with RMC_ADD_MEMBERSHIP and the desired multicast address as parameters to join the IP multicast group. The rmc_setsockopt function, which implements the setsockopt system call for RMC, calls the IP ip_add_mc_membership function to carry out the join process.

After the receiving application has bound the socket and joined the IP multicast group, it may begin to receive data on the socket by calling one of the recv system calls. On receipt of the first data packet, the receiver sets up the receive window by initializing the rcv_nxt, rcv_wnd, unnaked_nxt, and copied_seq variables in the RMC socket control block (Figure 4.15) to the beginning sequence number of the packet. The receiver then calls rmc_join_mc to send an RMC join request to the sender.

Before transmitting data on an RMC socket, the sending application must also bind the socket to a local port using the bind system call. In addition, the sending application must then connect the socket to the multicast port. To achieve this, the application calls the connect system call with the desired destination address as a parameter. The rmc_connect function, which implements the connect system call for RMC, initializes the daddr and dport variables in the INET socket structure (Figure 4.7) to the specified multicast destination address and port. After the socket is bound and connected, the sender may use one of the send system calls to transmit data on the RMC socket.

The sending socket is initially in the RMC_LISTEN state and remains in this state until it receives the first join request. Because it is useless to transmit data to an empty multicast group, the sending function transmits the first data packet and puts itself to sleep while waiting for incoming join requests. When the keepalive controller runs, it observes that the state is RMC_LISTEN and transmits the first packet instead of normal keepalive packets. After the sender receives the first join request, the socket state is set to RMC_ESTABLISHED and the sending function is awakened. At this point, the connection is established and normal transmission is allowed to continue.
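The sender side of this handshake can be sketched as follows (an illustration reconstructed from the description above, not the RMC source; interruptible_sleep_on and the helper name are assumptions):

/* Sketch: transmit the first packet, then sleep until the first join
 * request arrives and the receive path sets RMC_ESTABLISHED. */
static void rmc_wait_for_first_member(struct sock *sk)
{
    struct rmc_opt *rp = &(sk->tp_pinfo.af_rmc);

    complete_skb_xmit(sk, rp->send_head, GFP_KERNEL);
    while (sk->state == RMC_LISTEN) {
        /* While we sleep, the keepalive controller re-sends the first
         * packet in place of normal keepalive packets. */
        interruptible_sleep_on(sk->sleep);
    }
}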
4.3.2 The RMC Receiver

The architecture of the RMC receiver is shown in Figure 4.20. The RMC receiver has three major components: the packet processor, the NAK manager, and the application interface. The packet processor manages incoming data packets, the NAK manager controls NAK retransmissions, and the application interface delivers available data to the receiving application.

Figure 4.20: The RMC receiver.

The receiver uses three packet queues to manage data packets as they pass through the receiver. The backlog queue, which is implemented by the back_log socket buffer list in the INET socket structure (Figure 4.7), is used to temporarily hold data packets that arrive while the destination socket is locked. The out-of-order queue is implemented by the out_of_order_queue socket buffer list in the RMC socket control block (Figure 4.15) and holds packets that cannot yet be integrated into the data stream due to one or more gaps in the data stream. The receive queue, implemented by the receive_queue socket buffer list also located in the RMC socket control block, holds in-order packets that are available to the receiving application.

The receive window, as shown in Figure 4.21, is composed of the receive queue and the out-of-order queue. Several variables in the RMC socket control block (Figure 4.15) are used to mark different points in the receive window. The rcv_wnd variable marks the sequence number of the first data byte in the receive queue and the rcv_nxt variable marks the sequence number of the last in-order data byte stored in the receive queue. The rcv_wnd_size variable gives the size of the receive window. It can be added to rcv_wnd to calculate the right edge of the receive window. The variable unnaked_nxt gives the sequence number of the last data byte received so far, in-order or not, within the receive window. In addition to the three data queues, the NAK list, implemented by nak_list in the RMC socket control block (Figure 4.15), holds pending NAKs.

Figure 4.21: Implementation of the receive window.

Packet processor. Incoming data packets are initially delivered to the RMC protocol through the packet processor, which has two parts. Packets are first processed by the initial packet processor, which is implemented by the rmc_ip_rcv function. This function receives packets from the IP layer and calls the rmc_lookup function to locate the destination socket in the hash table, rmc_hash, according to the packet's destination. If the destination socket is located and it is not currently locked, the packet is delivered immediately to the main packet processor. Otherwise the packet is appended to the backlog queue for the socket. Packets in the backlog queue are passed to the main packet processor the next time the socket becomes unlocked.

The main packet processor is implemented by the rmc_rcv_data function, which is shown in Figure 4.22. The responsibilities of the main packet processor include reassembling the incoming data packets into the original data stream, determining when to generate NAKs, and monitoring the receive window to determine when to send rate requests. The main packet processor calls the rmc_send_nak function to generate and send NAKs for missing data.
Each new NAK is inserted into nak_list and is given a timeout value by setting the timeout field in the RMC control block for the socket buffer (Figure 4.17) to the current time plus four times the estimated round-trip time. If it has not already been activated, the NAK monitor is scheduled to run when the new NAK will expire.

Rate requests are prepared and sent by the rmc_req_rate_reduce function. To limit the number of rate requests, each function that calls rmc_req_rate_reduce is responsible for making sure that the appropriate amount of time has passed since the last rate request was sent before sending a new rate request. The socket control block variable wnd_time (Figure 4.15) holds the earliest time when the next rate request may be sent. Before making a rate request, the calling function compares the value of wnd_time to the current time. If the current time is greater than or equal to wnd_time, the rmc_req_rate_reduce function may be called. In addition, every time the rmc_req_rate_reduce function is called, the calling function sets wnd_time to the earliest time that the next rate request may be sent.

When the main packet processor receives a packet, it determines what actions to take by comparing the sequence number range of the packet, indicated by the RMC socket buffer control block variables seq and end_seq, to the sequence space of the receive window.

int rmc_rcv_data(struct sock *sk, struct sk_buff *skb)
{
    struct rmchdr *rh = skb->h.rmc;
    __u32 inseq = RMC_SKB_CB(skb)->seq;
    struct rmc_opt *rp = &(sk->tp_pinfo.af_rmc);

    if (rp->rcv_nxt == 0) { /* This is the first packet. */
        /* Update the RMC socket variables and send a join request. ... */
    }

    /* Check where this packet fits into the incoming data stream. */
    if (after(RMC_SKB_CB(skb)->end_seq, (rp->rcv_wnd + rp->rcv_wnd_size))) {
        /* Out of bounds, discard the packet. */
        in_oob(sk, skb);
        kfree_skb(skb);
        rmc_statistics.RmcInOOB++;
        return 0;
    }

    if (inseq == rp->rcv_nxt) {
        /* In sequence. */
        inorder_deliver(sk, skb);
        /* Check the out-of-order queue and update NAKs. */
        ck_ooq(sk);
        advance_nak_list(sk);
    } else if (after(inseq, rp->rcv_nxt)) {
        /* Out of order, possibly not yet received; store in the OOO queue. */
        qinsert(&rp->out_of_order_queue, skb);
        /* Update pending NAKs and send new NAKs, if necessary. */
        update_nak_list(sk, skb);
        if (after(inseq, rp->unnaked_nxt)) /* Need to NAK. */
            rmc_send_nak(sk, skb->nh.iph->saddr, rh->src,
                         rp->unnaked_nxt, inseq);
    } else {
        /* Previously received, free the buffer. */
        rmc_statistics.RmcInDiscards++;
        kfree_skb(skb);
        return 0;
    }

    /* Advance the unnaked seq. number if necessary. */
    if (after(RMC_SKB_CB(skb)->end_seq, rp->unnaked_nxt))
        rp->unnaked_nxt = RMC_SKB_CB(skb)->end_seq;

    /* Check the window for overflow. */
    ck_wnd(sk, skb);
    return 1;
} /* rmc_rcv_data */

Figure 4.22: The receive function.

The packet processor may take one of the four actions listed below.

1. If end_seq is greater than the end of the receive window, the packet will not fit in the window. The packet is discarded and the in_oob function is called. The in_oob function calls rmc_req_rate_reduce to initiate an urgent rate request. In addition, if the packet leaves a gap at the end of the receive window that is large enough to hold an additional packet, in_oob calls rmc_send_nak to generate a new NAK for the free space.

2. If seq is equal to rcv_nxt, the packet is in sequence, and it is appended to the receive queue. The data_ready function is called to wake up any kernel processes waiting for data to be put on the receive queue.
The ck_ooq function is then called to check the out-of-order queue for packets that may now be in sequence. In-order packets are removed from the out-of-order queue and appended to the receive queue. In addition, advance_nak_list is called to remove any NAKs from the NAK list for data with sequence numbers less than or equal to end_seq.

3. If seq is greater than rcv_nxt, the packet is not in order and has not been previously received. It is stored in the out-of-order queue. The update_nak_list function is called to remove any NAKs from the NAK list for data in the range of the out-of-order packet only.

4. If none of the above conditions apply, the packet has previously been received and is discarded.

For each packet that is successfully received and stored in the receive window, the main packet processor checks the state of the receive window for possible overflow by calling the ck_wnd function, which is listed in Figure 4.23. The ck_wnd function first checks whether enough time has passed since the last rate request by examining the wnd_time variable. If wnd_time is less than or equal to the current time, the ck_wnd function calculates the amount of free space in the receive window. If the free space is less than the critical threshold, rcv_thresh2, the ck_wnd function calls rmc_req_rate_reduce with the urgent parameter set to 1. If the free space is less than the warning threshold, rcv_thresh1, ck_wnd calls rmc_req_rate_reduce with the urgent parameter set to 0. If a rate request is sent, the wnd_time variable is reset to one round-trip time from the current time.

static void ck_wnd(struct sock *sk, struct sk_buff *skb)
{
    struct rmc_opt *rp = &(sk->tp_pinfo.af_rmc);
    __u32 curr_last_pkt;
    __u32 ad_rate, curr_wnd;
    __u32 time = JIFFIES;

    /* Calculate the windows. */
    ad_rate = ntohl(skb->h.rmc->rate);
    curr_last_pkt = rp->rcv_wnd + rp->rcv_wnd_size;
    curr_wnd = curr_last_pkt - rp->rcv_nxt;

    if (rp->wnd_time <= time) { /* Send only one request per RTT. */
        if (curr_wnd <= rp->rcv_thresh2) {
            /* Send urgent control packet. */
            rmc_req_rate_reduce(sk, skb, 1);
            rp->wnd_time = time + rp->rtt;
        } else if ((curr_wnd <= rp->rcv_thresh1) &&
                   (curr_wnd < (4 * rp->rtt * ad_rate))) {
            /* Send warning control packet. */
            rmc_req_rate_reduce(sk, skb, 0);
            rp->wnd_time = time + rp->rtt;
        }
    }
    return;
} /* ck_wnd */

Figure 4.23: The RMC ck_wnd function.

NAK manager. The second part of the receive architecture is the NAK manager, whose duty it is to monitor the NAK list and resend pending NAKs at appropriate intervals. The NAK manager is implemented by the nak_timer timer, which is stored as a part of the RMC socket control block (Figure 4.15). The interrupt handler for the NAK timer is the nak_timer function. Each time the NAK list becomes non-empty, the NAK manager is scheduled to run by setting nak_timer to fire at the time the newly created NAK will expire. When the timer fires, the nak_timer function examines each NAK on the NAK list. If the value of the timeout field in the NAK's control block is less than the current time, the NAK has expired. The NAK is resent and the timeout is reset to four round-trip times from the current time. Finally, if the NAK list is non-empty, the NAK manager reschedules itself by setting nak_timer to fire at the earliest time when a NAK on the list will expire.

Application interface. The final part of the receiver is the application interface. The application interface implements the receive system calls and delivers data to receiving applications. The rmc_recvmsg function implements the application interface.
This function pulls data packets from the receive queue and copies the requested number of data bytes to user-level memory using the Linux memcpy_toiovec function. It uses the copied_seq variable from the RMC socket control block (Figure 4.15) to track the sequence number of the next data byte that should be delivered to the application. After a data packet has been completely consumed by the application, the data packet is released from the receive window and the receive window is slid past the consumed packet by setting rcv_wnd to the ending sequence number of the packet. If there are no data packets in the receive queue when rmc_recvmsg is called, the process releases the lock on the socket and puts itself to sleep by setting its state to TASK_INTERRUPTIBLE and calling schedule. The process will be awakened by the main packet processor when new data has been inserted in the receive queue. At this point, rmc_recvmsg relocks the socket and continues to copy available data to the application.

4.3.3 The RMC Sender

An overview of the RMC sender architecture is shown in Figure 4.24. The RMC sender may be thought of as five concurrent tasks: the application interface, the transmitter, the feedback processor, the retransmitter, and the keepalive manager. The application interface and the transmitter work together to transmit original data. The feedback processor processes rate requests and works with the retransmitter to resend data packets. Finally, the keepalive controller sends keepalive messages during times when both the transmitter and the retransmitter are idle.

Figure 4.24: The RMC sender.

The sender uses the send window, shown in Figure 4.25, to buffer data packets that have been sent and data packets that are waiting to be sent. The send window is implemented by the write_queue socket buffer queue, which is stored as part of the INET socket structure (Figure 4.7). All other variables that are used to maintain the send window are stored in the RMC socket control block (Figure 4.16). The snd_wnd variable holds the beginning sequence number of the first packet in the send window. The snd_wnd_size variable holds the size of the send window, and can be added to snd_wnd to calculate the ending sequence number of the send window. The send_head variable points to the next packet queued to be sent. If all data packets in the send window have been sent, the send_head variable is set to NULL. The snd_nxt variable holds the value of the sequence number of the next data byte to be sent.

Figure 4.25: Implementation of the send window.

Because kernel processes should not block, it is impractical to implement rate-based transmission in the kernel by introducing a delay between each packet. For this reason, rate-based transmission is implemented as a window within the send sequence space.
The rate window, outlined by the dashed line in Figure 4.25, begins at the point prev_snd_nxt and extends to the point last_pkt. The rate window advances at set send intervals, and packets within the rate window may be sent at any rate. In this implementation, the send interval is ten milliseconds, which is the basic unit of time in the Linux kernel. The size of the window, held in the snd_rate_wnd variable in the RMC socket control block (Figure 4.16), is determined by the desired rate, which is expressed as the number of bytes that can be sent in one sending interval. The upper bound on the size of the rate window is held in the max_snd_rate_wnd variable, also located in the RMC socket control block. The value of max_snd_rate_wnd is currently set to 12450 bytes. This rate window allows a maximum transmission rate of approximately 9.5 Mbps and was chosen because the underlying network of the test environment was 10 Mbps Ethernet.

Application interface. Data enters the protocol through the application interface, which is initiated when the sending application calls one of the send system calls on an RMC socket. The main responsibility of the application interface, which is implemented by the rmc_sendmsg function, is to fragment the data into RMC packets and make the packets available for transmission by inserting them in the send window. Because it is more efficient to send larger data packets, the application interface first attempts to add data to the last packet in the queue of packets to be sent. The remaining data is fragmented into segments that do not exceed the maximum segment size. Memory is allocated for each segment by calling the sock_wmalloc function. If all the write memory for the socket is currently being used, the process puts itself to sleep by adding itself to a wait queue, setting its state to TASK_INTERRUPTIBLE, and calling schedule. Each time write memory is released in the sock_wfree function, the write_space function is called to awaken all processes waiting for write memory. After memory has been allocated, the application interface reserves space for lower layer headers, copies the data from user space using the Linux csum_and_copy_from_user function, and builds the RMC header. Each data packet is then passed to the rmc_send_skb function.

The rmc_send_skb function, which is shown in Figure 4.26, is responsible for adding packets to the send window and either sending them immediately or queuing them for later transmission. Each packet is stored in the send window by calling skb_queue_tail for the write_queue socket buffer queue. Two conditions are tested to determine whether the packet may be sent immediately. First, the send_head variable must be NULL, signifying that there are no other packets waiting to be sent. Second, the ending sequence number of the packet must be less than the value of last_pkt, meaning that the packet falls within the current rate window. If the packet cannot be sent immediately, it is queued in the backlog for later transmission. If, however, the packet can be sent immediately, the keepalive timer is stopped and the packet is passed to the complete_skb_xmit function.

int rmc_send_skb(struct sock *sk, struct sk_buff *skb, int force_queue)
{
    struct rmc_opt *rp = &(sk->tp_pinfo.af_rmc);

    /* Save in the queue and send the packet. */
    RMC_SKB_CB(skb)->tries = 0;
    __skb_queue_tail(&sk->write_queue, skb);

    if (!force_queue && rp->send_head == NULL && rmc_snd_test(sk, skb)) {
        /* OK to send right away. */
        cancel_keepalive(sk);
        complete_skb_xmit(sk, skb, GFP_KERNEL);
        rp->snd_nxt = RMC_SKB_CB(skb)->end_seq;

        /* Free send buffer space if nearly full of sent data. */
        if (rp->snd_nxt > (rp->snd_wnd + rp->snd_wnd_size - 2 * sk->mss))
            advance_window(sk);

        reset_keepalive(sk, RMC_MIN_KATIME);
        return 0;
    }

    /* Queue the packet. */
    if (rp->send_head == NULL)
        rp->send_head = skb;
    return 0;
} /* rmc_send_skb */

Figure 4.26: The rmc_send_skb function.

The complete_skb_xmit function completes the RMC header for the packet and records the sending time in the when field in the socket buffer's control block (Figure 4.17). A checksum is computed by calling the rmc_send_check function, and the packet is passed to ip_queue_xmit for transmission.
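complete_skb_xmit itself is not listed in the figures; from the description above, it plausibly reduces to the following sketch (an illustration, not the RMC source; the exact header fields that are filled in may differ):

/* Sketch: finish the RMC header, timestamp the buffer, checksum it,
 * and hand it to IP. */
static void complete_skb_xmit(struct sock *sk, struct sk_buff *skb, int gfp)
{
    RMC_SKB_CB(skb)->when = jiffies;  /* record the sending time */
    RMC_SKB_CB(skb)->tries++;         /* count this (re)transmission */
    rmc_send_check(sk, skb);          /* compute the checksum */
    ip_queue_xmit(skb);               /* build the IP packet and route it */
}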
On return from complete_skb_xmit, rmc_send_skb attempts to free any expired buffer space and then restarts the keepalive timer at the minimum keepalive interval.

Transmitter. The transmitter is responsible for taking packets from the backlog and sending them to the multicast group according to the current rate. The transmitter, which is implemented by the slow_start_timer timer in the RMC socket control block (Figure 4.16), enforces both the slow start and the congestion avoidance phases of the rate control. Each time the transmitter fires, it determines how many bytes of data were sent since the previous run by subtracting the beginning of the current rate window, prev_snd_nxt, from the current value of the next sequence number to be sent, snd_nxt. It uses this value as an upper bound on the increase of the rate window size, snd_rate_wnd. The transmitter calculates a new value for snd_rate_wnd according to whether the transmission is in slow start or congestion avoidance. If the transmission is in the slow start phase, the rate window is allowed to increase by the number of bytes sent during the previous interval. If, however, the transmission is in the congestion avoidance phase, the rate window is only allowed to increase by the maximum segment size. In addition, the value of max_snd_rate_wnd is used as an upper bound on the value of snd_rate_wnd. After the size of the new rate window has been calculated, the transmitter advances prev_snd_nxt to snd_nxt and last_pkt to snd_nxt plus snd_rate_wnd. The transmitter begins sending packets, starting with the packet at send_head. It continues to send all packets that fall within the rate window, advancing send_head and snd_nxt as necessary. After all data packets in the current rate window have been sent, the transmitter reschedules itself by setting slow_start_timer to expire in 10 ms.

Feedback processor. The feedback processor handles feedback from the receivers in the form of NAKs and rate requests and forwards the feedback to the appropriate entities in the sender. When a NAK arrives at the sender, the add_retrans_req function is called to allocate and initialize a new retrans_req structure. The retrans_req structure, as shown in Figure 4.27, is composed of the beginning and ending sequence numbers for the request as well as a pointer to the next socket buffer that should be sent for the request. The new request is inserted in the retrans_queue list in the RMC socket control block (Figure 4.16). Multiple requests for the same data range are discarded. Because retransmissions have priority over both original transmissions and keepalive messages, the slow_start_timer and ka_timer timers are suspended. To initiate the congestion avoidance phase, the feedback processor halves the value of snd_rate_wnd, and the retrans_timer function is called to start the retransmitter.

struct retrans_req {
    __u32 begin;
    __u32 end;
    struct sk_buff *retrans_head;
    struct retrans_req *next;
};

Figure 4.27: The retransmission request structure.
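Putting these steps together, the NAK path through the feedback processor can be sketched as follows (an illustration based on the description above; rmc_rcv_nak and the timer manipulation are assumptions, not the RMC source):

/* Sketch: react to a NAK for the byte range [begin, end). */
static void rmc_rcv_nak(struct sock *sk, __u32 begin, __u32 end)
{
    struct rmc_opt *rp = &(sk->tp_pinfo.af_rmc);

    add_retrans_req(sk, begin, end);    /* duplicates are discarded */
    del_timer(&rp->slow_start_timer);   /* retransmissions have priority */
    del_timer(&rp->ka_timer);
    rp->snd_rate_wnd /= 2;              /* initiate congestion avoidance */
    retrans_timer((unsigned long)sk);   /* start the retransmitter */
}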
Rate requests are handled by the reduce_rate function. If the rate request is urgent, reduce_rate stops forward transmission by setting snd_rate_wnd to zero. The flags field in the RMC socket control block is set to RMC_FLAG_KA, and the keepalive controller is set to fire in one round-trip time. For non-urgent rate requests, reduce_rate sets snd_rate_wnd to the size of a single segment to initiate a slow start phase. The threshold between slow start and congestion avoidance is marked by setting the snd_ssthresh field in the RMC socket control block to one-half of the rate requested in the incoming rate request. The slow start timer is then restarted to continue forward transmission.

Retransmitter. It is the retransmitter's responsibility to serve retransmission requests from the retransmission request list, retrans_queue. The retransmitter is implemented by the retrans_timer timer in the RMC socket control block (Figure 4.16), and the handler for the timer is the retrans_timer function. When the retransmission timer fires, retrans_timer calls rmc_retrans_xmit to send a window of packets according to the current rate. If the retrans_queue is non-empty after sending the window of packets, retrans_timer reschedules itself to run in the next send interval. When the retransmitter has completed resending all requested data, the retrans_timer function resets the keepalive and slow start timers to continue forward transmission.

Keepalive controller. The final piece of the sender architecture is the keepalive controller, which is responsible for sending keepalive packets during idle times. The keepalive controller is implemented by the ka_timer timer in the RMC socket control block (Figure 4.16). After the first packet in the outgoing data stream is sent, the ka_timer timer is scheduled to run in the minimum keepalive time by calling reset_keepalive. Each time the keepalive controller fires, the ka_timer function is executed. This function sends a keepalive message to the group and sets the keepalive time value, ka_time, to twice the previous value, up to the maximum keepalive time value, MAX_KA_TIME, which is 2 seconds. The keepalive controller then reschedules itself to run in the time specified by the ka_time variable. Because the keepalive controller has the lowest transmission priority, it is suspended by calling cancel_keepalive when either the transmitter or the retransmitter becomes active. When the transmitter and the retransmitter complete their current tasks, they reschedule the keepalive controller by calling reset_keepalive. Each time the keepalive controller is suspended and reset, ka_time is set to the minimum value RMC_MIN_KATIME, which is 100 ms.

The keepalive controller handles two special cases. The first special case occurs when the sending socket is in the RMC_LISTEN state. Each time the ka_timer function runs, it checks the socket state. If the state is RMC_LISTEN, ka_timer sends the packet at the head of the send window instead of a keepalive packet. The second special case occurs when the feedback processor sets the RMC_FLAG_KA flag after receiving an urgent rate request. In this case, to restart forward transmission, the ka_timer function clears the flag and reschedules the transmitter by setting the slow_start_timer timer to two round-trip times. Two round-trip times are given rather than one to give the receiving hosts ample time to respond to the keepalive message.
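Folding in both special cases, the keepalive handler plausibly has the following shape (an illustrative sketch reconstructed from the description above, not the RMC source; rmc_send_keepalive is an assumed helper):

/* Sketch: the keepalive timer handler. */
static void ka_timer(unsigned long data)
{
    struct sock *sk = (struct sock *)data;
    struct rmc_opt *rp = &(sk->tp_pinfo.af_rmc);

    if (sk->state == RMC_LISTEN) {
        /* No members yet: re-send the first data packet instead of a
         * keepalive. */
        complete_skb_xmit(sk, skb_peek(&sk->write_queue), GFP_ATOMIC);
    } else if (rp->flags & RMC_FLAG_KA) {
        /* An urgent rate request stopped transmission: restart the
         * transmitter, allowing two RTTs for receivers to respond. */
        rp->flags &= ~RMC_FLAG_KA;
        rp->slow_start_timer.expires = jiffies + 2 * rp->rtt;
        add_timer(&rp->slow_start_timer);
        return;
    } else {
        rmc_send_keepalive(sk);        /* normal keepalive message */
        rp->ka_time = 2 * rp->ka_time; /* exponential backoff... */
        if (rp->ka_time > MAX_KA_TIME)
            rp->ka_time = MAX_KA_TIME; /* ...capped at 2 seconds */
    }

    rp->ka_timer.expires = jiffies + rp->ka_time;  /* reschedule */
    add_timer(&rp->ka_timer);
}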
4.3.4 RMC Socket Close

After the user application has finished communicating on an RMC socket, it calls the close system call to tear down the RMC connection. The rmc_shutdown function implements the close system call for RMC. The main responsibility of this function is to stop all activity occurring on behalf of the socket and release all the socket data.

For a receiving socket, closing the socket is simply a matter of sending a LEAVE message to the sender by calling rmc_leave_mc, and deallocating all of the data structures that were used to maintain the socket by calling the rmc_destroy_sock function.

At the sender, care must be taken that all data in the send window has been transmitted and receiver sockets have had a chance to request retransmissions before the sender frees all of its socket data structures. To ensure this, when a sender socket issues a close, it must go through two wait periods. During the first wait period, the sender calls rmc_send_fin, which sets the FIN flag in the last outgoing packet. The FIN flag signals receivers that the sender has completed all transmissions. The sender then puts itself to sleep by adding itself to a wait queue, setting its state to TASK_INTERRUPTIBLE, and calling schedule. After the transmitter sends the last packet in the send window, the value of send_head becomes NULL and the transmitter calls state_change to awaken the waiting process.

The second wait period is used to wait for retransmission requests from the receivers. This wait period is implemented by putting the sending process to sleep while all the data packets in the send window are allowed to expire. During this phase, the keepalive controller continues to send periodic keepalive messages. After all packets in the send window have expired and a minimum amount of idle time has passed, the keepalive controller awakens the sleeping process by calling state_change. After the second wait period expires, the rmc_destroy_sock function is called to delete timers for the socket and free the data structures used to maintain the socket.

Chapter 5

Performance Evaluation

In this chapter we evaluate the performance of the RMC protocol. We first investigate the behavior of RMC as implemented in the Linux kernel, through experimentation in a local network testbed. In this environment, we are able to observe the performance of the protocol with up to eight receivers. We then use simulation to study the performance of RMC in larger networks. Using a simulated environment allows us to test RMC with up to 100 receivers, while varying network conditions such as network speed, network delay, and loss characteristics. In each test, we vary the kernel buffer size from 64 KB to 1 MB and use throughput to assess the protocol's effectiveness.

5.1 Experimental Study

We used a local network testbed to evaluate the performance of RMC in a working local area network. In this section we discuss the conditions and results of this experimental study.

Table 5.1: Test machines.

5.1.1 Experimental Conditions

For the experimental study, we used 9 PCs running RedHat Linux 5.0, kernel version 2.1.103. In each case, the sender was a Pentium 166MHz with 64 MB of RAM.
Tl eight machines listed in Table 5.1 served as receivers. All machines were connecte using 10 Mbps Ethernet. Each experiment was repeated five times, and the minimu: and maximum values were thrown out. Averages of the remaining three measuremen are reported here. To study the effect of different numbers of receivers in a local environment, v conducted tests for 1, 2, 4 and 8 receivers. In the first three cases all receivers we: Pentium II 300MHz machines. In the last case, the receivers were those listed 5 Figure 5.1. For each set of receivers we collected statistics on 10 MB and a 40 M transfers, using both memory-to-memory and disk-to—disk tests. In the memory-ti memory tests, the sender sent data from memory and each of the receivers receive data into a memory buffer. In the disk-to-disk tests, the sender sent a file which read from disk, and each of the receivers stored the received data to a file on disk. E performing both the memory and the disk tests, we were able to observe the protocol behavior when the applications were always ready, as well as when they were slow by I/O operations. 91 5.1.2 Experimental Results The throughput results for each of the experimental tests are shown in Figure 5.1. Each plot shows the average throughput over five tests for the given kernel buffer size. The resulting throughput ranges from 4.4 Mbps for 64 KB buffers to 8.4 Mbps for 1 MB buffers. .— O 10 - -% on on 1 i O v G‘ r Throughput (mm) A Throughput (Wire) A l Receiver -°— lReoeiver *— 2 Receivers *— 2 Receivers *— 2 4Receivers ~— ' 2 ' 4Receivers —-- 8 Receivers *— 8 Receivers -— 0 ‘ ‘ ‘ ‘ 0 ' ' ' ‘ 64 128 256 512 1024 64 128 256 512 1024 Kernel Memory (KB) Kernel Memory (KB) (a) Memory-to—memory, 10MB (b) Memory-to—memory, 40MB 1 l l 1 O I O Throughput (Mbps) 1% moushput (Mbps) & 1 Receiver *— 2Reoeivcrs -°— 2Recervers -°— 2 4Recervers -— ‘ 2 4Reccrvers —~— 8 Receivers -— 8Rccctvers -— 0 I 41 A J J 0 A A A A 64 128 256 512 1024 64 128 256 512 1024 Kernel Memory (KB) Kernel Memory (KB) (c) Disk-to—disk, 10MB ((1) Disk-to—disk, 40MB Figure 5.1: Throughput of RMC in a local environment. From the results shown in Figure 5.1, we can make two general observations about RMC in a local environment. The first observation is that the kernel buffer size affects throughput for tests run with both 64 KB and 128 KB kernel buffers, but has a minimal effect on tests where the buffer size was 256 KB or greater. The limit 92 in throughput for smaller buffers is most likely due to the lack of buffer space at the sender. The sender is required to buffer sent packets for at least 10 round-trip times. Because the Linux clock ticks once every 10 ms and the estimated round-trip time must be greater than zero, the minimum estimated round-trip time is 10 ms. As a result, the minimum buffering time is 100 ms. When the buffer size is less than 256 KB, the sender completely fills its buffer space and must wait for the buffer space to be freed before sending more data. The second observation is that throughput varies only slightly with the number of receivers, especially when there is sufficient buffer space allocated to the protocol. For example, in the disk-to—disk 40 MB test, throughput values are less for 8 receivers than for lesser number of receivers. However, given a sufficient amount of memory, the throughput values for 8 receivers are close to those for lesser number of receivers. This indicates that RMC is scalable up to 8 receivers in a local environment. 
To further understand the behavior of RMC, we monitored the NAK and rate request feedback activity that occurred during each test. We report this activity in terms of the total number of NAKs and rate requests that arrive at the sender during each test. In the memory-to-memory tests for both 10 MB and 40 MB there was no data loss, and receivers' buffers did not fill past the thresholds, so no feedback was generated. Thus, as is evident in Figures 5.1 (a) and (b), the throughput values are nearly identical for these cases.

For the disk-to-disk 10 MB test there was also no data loss and therefore no NAK activity. Receivers did, however, utilize enough of their buffer space to generate rate requests. As shown in Figure 5.2, the number of rate requests for this test is proportional to the number of receivers for both the 64 KB and the 128 KB buffer cases. Rate requests for 256 KB buffers and larger are minimal. Comparing against Figures 5.1 (a) and (c), we see that the difference in throughput between the 10 MB disk tests and the 10 MB memory tests (in which there was no feedback) is small, indicating that the rate requests had little effect on the throughput.

Figure 5.2: Total rate requests for the 10 MB disk test.

Figures 5.3 (a) and (b) show the NAK and rate request feedback statistics for the 40 MB disk-to-disk tests. The number of NAKs for the 40 MB disk-to-disk transfer is less than 2 in most cases. In the one case where a larger number of total NAKs occurred, a large portion of the total is due to NAK retransmissions. In this case, a receiver may have had an incorrect estimate of round-trip time, causing it to retransmit NAKs too early. While the number of NAKs is quite small, the number of rate requests for these tests is noticeable and seemingly unpredictable. Because the protocol is running at kernel level, it is difficult to assess the cause of this unpredictability. A number of different activities in the operating system or delays in the I/O may have caused the application to slow, allowing the kernel buffers to quickly fill.

(a) Total NAKs. (b) Total rate requests.
Figure 5.3: Feedback for the 40 MB disk test.

As shown in Figures 5.1 (b) and (d), there is a decrease in throughput from the 10 MB file test to the 40 MB file test. However, in most cases, the decrease in throughput is less than 1 Mbps. In Figure 5.4 we break down the experimental results by the number of receivers and give the minimum and maximum throughput values for each case. In all cases, the performance results for the memory tests are very consistent. The disk-to-disk tests vary more than the memory tests, with the most variance occurring in the 40 MB disk tests. This variance is again difficult to assess, and may be caused by a number of different operating system or I/O operations.

(a) 1 receiver. (b) 2 receivers. (c) 4 receivers. (d) 8 receivers.
Figure 5.4: Average throughput with minimum and maximum throughput values.

5.2 Simulation Study

In order to evaluate the performance of RMC in environments other than a local area network, we used the CSIM simulation package to implement a simulated network environment in which we could test the RMC protocol. The simulation was implemented using portions of the data structures from the Linux kernel, so that we could easily plug the RMC implementation into the simulation with only minor changes.
5.2.1 Simulation Environment

The architecture of the simulated environment is shown in Figure 5.5. In this figure, each box represents a CSIM process, of which there are three types: host processes, network interface processes, and router processes. The host processes are used to simulate each participating host. A host process controls the operation of the RMC protocol and underlying operating system on the host, as well as the sending or receiving application. Each host process is coupled with a network interface process, which handles incoming packets for the host and simulates the network delay associated with each packet. The router processes are used to simulate a simple network topology through which packets are routed before being delivered to the network interfaces. Each router is assigned a network speed, a queue size, and a loss rate.

Figure 5.5: Simulation Architecture.

The network simulation works as follows. At a given host, outgoing packets are constructed with a full RMC header and a partial IP header, and then passed to the closest router. Within a router, the packets are taken from the local queue, assigned a delay according to the network speed, and passed on to the next router or to the appropriate network interface, as dictated by the IP destination. Multicast packets are duplicated within a router as necessary. At the network interface, packets are received one at a time, held for the assigned delay, and then passed to the host. At the host, incoming packets are passed to the RMC protocol, where normal processing continues. To simulate network loss, each router and network interface may be assigned a loss rate, which it uses to randomly drop packets.

To simulate the protocol, we imported the RMC protocol code directly from the Linux kernel into the simulation. We then brought in portions of data structures and functions from the Linux kernel that were used by the protocol. In order to simulate processing time at a host, we measured the time to complete RMC processing and lower layer network processing in the Linux kernel, and then introduced the appropriate delays in the host process. For sending and receiving data of length l, the RMC delay was (10 + 0.025l) µs and the lower layer delay was 150 µs.

In order to understand how RMC performs when hosts have different capabilities, we used the router processes to divide receivers into characteristic groups, where a characteristic group is defined by its network delay and loss properties. We divided the loss rate between the router process and each receiver's network interface process to simulate both correlated and uncorrelated loss. As reported in [33], most losses take place in the tail links of the network; the network backbone and the individual sites are mostly loss free. Thus, we set the loss parameters such that 90% of the loss was correlated and occurred at the router process and 10% of the loss was uncorrelated and occurred at the network interface process. Definitions of the characteristic groups that we used are listed in Table 5.2. Group A is intended to simulate a local environment, group C is intended to simulate a wide area environment, and group B simulates something between the two. These figures are similar to those reported in [34].

Table 5.2: Characteristic groups.
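To make the simulator's structure concrete, a router process can be sketched as follows (an illustration built on CSIM's create() and hold() primitives; the queueing, loss, and forwarding helpers are hypothetical, not the actual simulator code):

#include "csim.h"

/* Sketch: a router process takes a packet from its local queue, drops
 * it with the configured probability, holds for the transmission delay
 * implied by the network speed, and forwards it. */
void router_process(double net_speed_bps, double loss_rate)
{
    struct sk_buff *skb;

    create("router");                    /* become a CSIM process */
    for (;;) {
        skb = next_queued_packet();      /* hypothetical local queue */
        if (uniform(0.0, 1.0) < loss_rate)
            continue;                    /* simulate (correlated) loss */
        hold((skb->len * 8.0) / net_speed_bps);  /* transmission delay */
        forward_packet(skb);             /* next router or interface */
    }
}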
5.2.2 Simulation Results for “Fast” Receivers

We conducted two sets of simulation tests for 10, 25, and 100 receivers. In our first set of tests we assumed that the receivers had “fast” 300 MHz processors similar to those in the experimental tests and that the connecting network was 10 Mbps Ethernet. For this set of tests we ran five different tests using different combinations of the three characteristic groups. As shown in Table 5.3, for tests 1, 2, and 3, the receiver group consisted of the single characteristic groups A, B, and C, while tests 4 and 5 used receivers from characteristic groups B and C in opposite proportions.

Table 5.3: Simulation tests.

Figure 5.6 shows the throughput results for tests 1, 2, and 3. In these tests we see that kernel buffer size becomes an important factor in determining throughput. Because the network delay is larger than in a local area network, the sender is required to buffer data for a longer time. This means that once the send window is full, the sender must wait for longer periods of time before buffer space is freed. An increase in the number of resent packets further slows packet release, since packets must be held for the minimum time from the last time they are sent.

We also observe a definite decrease in throughput from 10 to 25 receivers and from 25 to 100 receivers. This decrease is most likely due to the NAK activity. Unlike the experimental study, in the simulation study we introduce a constant network loss. Because we do not use global NAK suppression, the number of NAKs that arrive at the sender increases linearly with the number of receivers and with the loss rate. The sender does a slow start on the receipt of each NAK, which greatly affects the throughput, especially with high loss rates. Despite the high loss rates, RMC is still able to achieve over 6 Mbps for up to 100 receivers in group A and up to 25 receivers in group B. In addition, the efficiency that is gained by using reliable multicast over reliable unicast is large, even when the throughput is low.

(a) Test 1. (b) Test 2. (c) Test 3.
Figure 5.6: Simulation Throughput.

In Figures 5.7, 5.8, and 5.9, we present the throughput results for all 5 test cases for 10, 25, and 100 receivers, respectively. In each figure, the (a) subfigure gives the average throughput results and the (b) subfigure shows the minimum and maximum throughputs along with the average. In addition, the (c) subfigure shows the total number of NAKs that arrive at the sender and the (d) subfigure shows the total number of rate requests that are generated. In each test the throughput generally increases with kernel buffer size.
Further, for each receiver group size, the best performance occurs in test 1, which represents a local environment. Test 1 is followed in throughput by test 2, which simulates a metropolitan area network. Test 3, which is the wide area network, yields the lowest throughput. For test 4, where the receiver group is primarily composed of medium area receivers, and test 5, where the receiver group is primarily composed of wide area receivers, the throughput values are close to those of the wide area environment. This shows that RMC adapts to the least capable receivers in the multicast group.

(a) Average throughput. (b) Minimum and maximum throughput. (c) Total NAKs. (d) Total rate requests.
Figure 5.7: Results for 10 fast receivers.

Looking at subfigures (c) and (d) from Figures 5.7, 5.8, and 5.9, we can analyze the feedback behavior from this set of simulation tests. We see that the number of NAKs remains constant across different buffer sizes, and is proportional to the number of receivers and the loss rate of the test. The number of rate requests, on the other hand, is large for 64 KB and 128 KB buffers, but is small or insignificant for buffers of size 256 KB and larger.

(a) Average throughput. (b) Minimum and maximum throughput. (c) Total NAKs. (d) Total rate requests.
Figure 5.8: Results for 25 fast receivers.

In these tests NAKs most likely had the effect of lowering an entire curve, while rate requests lessened throughput only for small buffers. In all cases, the number of NAKs is much larger than the number of rate requests, suggesting that the overall effect of rate requests was minimal.

5.2.3 Simulation Results for “Slow” Receivers

In the second set of simulation tests, we introduced “slow” 75 MHz processors while still using a 10 Mbps Ethernet network. Slow processors were defined to be two generations old and four times as slow as the fast processors used in the previous set of simulation tests.

(a) Average throughput. (b) Minimum and maximum throughput. (c) Total NAKs. (d) Total rate requests.
Figure 5.9: Results for 100 fast receivers.

For this set of tests, we focused on characteristic groups B and C, making a subset of the processors slow. We performed three different tests for groups of 10, 25, and 100 receivers. The three tests are summarized in Table 5.4. Test 6 is similar to test 2 in the previous set of tests, in that all the processors belong to group B. In test 2 all of the processors are fast, while in test 6 all of the processors are slow. In test 7, the processors are all still in group B, but 8 of them are fast and only 2 are slow. Test 8 is similar to test 4, where 8 processors belong to group B and 2 processors belong to group C. In test 8, however, the group B processors are fast and the 2 group C processors are slow.
Test 6 is similar to test 2 in the previous set of tests, in that all the processors belong to group B. In test 2 all of the processors are fast, while in test 6 all of the processors are slow. In test 7, the processors are all still in group B, but 8 of them are fast and only 2 are slow. Test 8 is similar to test 4, where 8 processors belong to group B and 2 processors belong to group C. In test 8, however, the group B processors are fast and the 2 group C processors are slow.

Table 5.4: Simulation tests.

            10 receivers               25 receivers               100 receivers
    Test 6  10 slow in B               25 slow in B               100 slow in B
    Test 7  8 fast and 2 slow in B     20 fast and 5 slow in B    90 fast and 10 slow in B
    Test 8  8 fast in B, 2 slow in C   20 fast in B, 5 slow in C  90 fast in B, 10 slow in C

The test results for 10, 25, and 100 receivers are shown in Figures 5.10, 5.11, and 5.12, respectively. In each figure, the (a) and (b) subfigures show the throughput, the (c) subfigure shows the NAK feedback, and the (d) subfigure shows the rate request feedback.

For the 10 receiver case, shown in Figure 5.10, the results for tests 6 and 7, where all processors are from group B, are very close. Further, the results for tests 6 and 7 are similar to those for test 2, shown in Figure 5.7. We see the same similarities between tests 8 and 4, which mix the receivers between groups B and C. These results suggest that simply introducing slower processors into the group does not significantly affect the protocol performance.

[Figure 5.10: Results for 10 slow receivers. Subfigures: (a) average throughput, (b) minimum and maximum throughput, (c) total NAKs, (d) total rate requests; each plotted against kernel memory (KB).]

The same observations that were made for the 10 receiver case can be made for the 25 and 100 receiver cases. As shown in Figure 5.11(a), the difference in the results between tests 6 and 7 for the 25 receiver case is slightly larger, but still not significant. Similarly, the feedback activity for tests 6 and 7 is similar to the feedback activity for test 2, and the feedback activity for test 8 is similar to the feedback for test 4. Again, introducing slow receivers has little effect on the protocol.

We see a larger effect of introducing slow processors in the 100 receiver tests, shown in Figure 5.12. While the throughput is limited by the high packet loss, the difference in throughput between tests 6 and 7 is greater than in the 10 and 25 receiver cases. In addition, as shown in Figure 5.12(d), the number of rate requests generated in test 6 is much larger than the number generated in test 2, shown in Figure 5.9(d). Thus, the effect of slow processors is clearly greater when the number of slow receivers is greater. In addition, in test 6 with 128 KB of memory, there is a significantly larger number of rate requests than for the other cases. It is difficult to determine the cause of the high number of rate requests in this case. The capabilities of the receivers may have been negatively affected in some way by the increase in buffer space at both the sender and the receivers, causing the surge in the number of rate requests.
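The rate requests discussed in this section originate at the receivers. The exact trigger is not restated here, so the sketch below is only a plausible reconstruction under assumed names and an assumed three-quarters-full threshold: a receiver whose buffer backlog grows, because a slow processor cannot drain packets as fast as the network delivers them, asks the sender for a lower rate.

    /* Hypothetical sketch of receiver-side rate-request generation;
     * the names and threshold are assumptions, not the RMC source. */
    struct rmc_receiver {
        unsigned int buf_size; /* receive-buffer capacity, bytes       */
        unsigned int buf_used; /* bytes queued but not yet read by app */
        unsigned int cur_rate; /* last rate requested, packets/sec     */
    };

    /* Send a control packet asking the sender to slow down (stub). */
    static void rmc_send_rate_request(unsigned int rate) { (void)rate; }

    /* Called after each received data packet is queued for the
     * application. */
    void rmc_check_backlog(struct rmc_receiver *r)
    {
        /* Once the backlog passes three-quarters of capacity,
         * request half the current rate. */
        if (r->buf_used > r->buf_size / 4 * 3) {
            unsigned int slower = r->cur_rate / 2 > 1 ? r->cur_rate / 2 : 1;
            rmc_send_rate_request(slower);
            r->cur_rate = slower;
        }
    }

Under such a scheme, larger receive buffers absorb short-term backlog and suppress rate requests, which matches the small counts observed for buffers of 256 KB and larger; the 128 KB anomaly in test 6 resists this simple explanation.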
[Figure 5.11: Results for 25 slow receivers. Subfigures: (a) average throughput, (b) minimum and maximum throughput, (c) total NAKs, (d) total rate requests; each plotted against kernel memory (KB).]

[Figure 5.12: Results for 100 slow receivers. Subfigures: (a) average throughput, (b) minimum and maximum throughput, (c) total NAKs, (d) total rate requests; each plotted against kernel memory (KB).]

Chapter 6

Conclusions and Future Work

In this research project we explored the design and implementation of the RMC reliable multicast protocol. The goal of the project was to design a protocol that is both scalable and fits well into an operating system kernel. We considered three issues in the RMC design: reliability, connection management, and flow control. In each aspect of the design, the interaction between the receivers and the sender is minimized. RMC uses NAK-based reliability and supports anonymous group membership. To allow for limited buffer space in combination with pure NAK-based reliability, RMC uses a combination of window-based and rate-based flow control. The window-based control manages the finite buffer space, while the rate-based control manages how quickly data within the window is transmitted.
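As a concrete illustration of how the two mechanisms compose, the following sketch shows a sender-side transmit check under both constraints. The names and structure are simplifying assumptions rather than the kernel source: the window test bounds how many packets may occupy finite buffer space, while the pacing test bounds how quickly packets within that window may leave the host.

    #include <stdbool.h>

    /* Illustrative sketch of combined window-based and rate-based
     * flow control at the sender; not the actual RMC source. */
    struct rmc_send_state {
        unsigned int window_pkts;   /* packets buffered, not yet released */
        unsigned int window_limit;  /* window size fixed by buffer space  */
        unsigned long long next_us; /* earliest time next packet may go   */
        unsigned long long gap_us;  /* inter-packet gap set by the rate   */
    };

    static unsigned long long clock_us; /* stand-in for a kernel clock */
    static unsigned long long now_us(void) { return clock_us; }

    /* Hand one packet from the send buffer to IP multicast (stub). */
    static void rmc_transmit_next(struct rmc_send_state *s) { (void)s; }

    /* Attempt to send one packet; returns false when flow control
     * blocks transmission. */
    bool rmc_try_send(struct rmc_send_state *s)
    {
        if (s->window_pkts >= s->window_limit)
            return false; /* window full: wait for the release timer */
        if (now_us() < s->next_us)
            return false; /* pacing: too early for the next packet   */
        rmc_transmit_next(s);
        s->window_pkts++;
        s->next_us = now_us() + s->gap_us;
        return true;
    }

In this view, NAK-triggered slow start and receiver rate requests act only on gap_us, while buffer exhaustion and the packet-release timer act on window_pkts and window_limit; the measured throughput reflects whichever constraint binds first.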
We have implemented the RMC protocol in the Linux kernel network stack. The implementation is based on IP multicast and provides a BSD socket interface to user-level applications.

We studied the performance of the RMC protocol through both experimentation and simulation. Performance tests in the Linux kernel revealed that RMC performs well for up to 8 receivers in a local area network, where packet loss is minimal. We used a network simulation to test the performance of RMC in both local area and wide area network environments for up to 100 receivers. Results of the simulation tests show that RMC is scalable up to 100 receivers in a local area network environment, where loss rates and network delay are low. Although throughputs are lower in a wide area network environment with higher loss rates, the RMC protocol still provides good performance and is able to accommodate a large number of NAKs, even without global suppression. Moreover, the efficiency that is gained by using reliable multicast over reliable unicast is very large.

There are several aspects of RMC that may be improved. First, we may want to design a mechanism by which the sender could estimate the available network speed across the connection. The sender could then use this estimate as an absolute limit on the transmission rate. Second, the protocol could be better adapted to fairly handle large variations in the capabilities of the receiver group. Currently, RMC adapts the transmission to the least capable receiver. To make the transmission more fair for all receivers, the sender could classify the receivers according to their capabilities, and then transmit the data at a different rate for each class. Other improvements to RMC include adding global NAK suppression, to make the protocol more scalable in networks with higher loss rates, and adding an option for maintaining explicit group membership.

Bibliography

[1] Hugh W. Holbrook, Sandeep K. Singhal, and David R. Cheriton. Log-Based Receiver-Reliable Multicast for Distributed Interactive Simulation. In Proceedings of SIGCOMM '95, August 1995.

[2] Peter Danzig. Flow Control for Limited Buffer Multicast. IEEE Transactions on Software Engineering, 20(1):1-12, January 1994.

[3] Thomas E. Anderson, David E. Culler, and David A. Patterson. A Case for NOW (Networks of Workstations). IEEE Micro, pages 54-64, February 1995.

[4] Gregory Pfister. In Search of Clusters. Prentice Hall, 2nd edition, 1997.

[5] Andrew S. Tanenbaum. Experiences with the Amoeba Distributed Operating System. Communications of the ACM, 33:46-63, December 1990.

[6] Sally Floyd, Van Jacobson, Ching-Gung Liu, Steven McCanne, and Lixia Zhang. A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing. IEEE/ACM Transactions on Networking, December 1996.

[7] Robel Barrios and P. K. McKinley. WebClass: A multimedia web-based instructional tool. Technical Report MSU-CPS-98-26, Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, August 1998.

[8] I. M. I. Habbab, M. Kavehrad, and C. W. Sundberg. Protocols for very high-speed optical fiber local area networks using a passive star topology. Journal of Lightwave Technology, LT-5(12):1782-1794, December 1987.

[9] M. Loper. Distributed interactive simulation (DIS) requirement for multicast. Multipeer/Multicast Workshop, Compendium of Solicited Papers, sponsored by U.S. Army Program Manager for Training Devices and the Institute for Advanced Simulation and Training, August 1991. Draft presented at Orlando, Florida.

[10] Donald P. Brutzman, Michael R. Macedonia, and Michael J. Zyda. Internetwork infrastructure requirements for virtual environments. In Annual Symposium of the Virtual Reality Modeling Language (VRML '95), December 1995.

[11] IEEE. 802.3: Carrier sense multiple access with collision detection, 1985.

[12] Vicki Johnson and Marjory Johnson. Introduction to IP Multicast Routing. IP Multicast Initiative White Paper, 1997. Available at http://www.ipmulticast.com/community/whitepapers/introrouting.html.

[13] S. Deering, C. Partridge, and D. Waitzman. Distance Vector Multicast Routing Protocol. Internet RFC 1075, November 1988.

[14] J. Moy. Multicast Extensions to OSPF. Internet RFC 1584, March 1994.

[15] Steven Deering et al. Protocol independent multicast version 2, dense mode specification. IETF Draft, May 1997.

[16] T. Ballardie. Core Based Trees (CBT) Multicast Routing Architecture. Internet RFC 2201, September 1997.

[17] S. Deering, D. Estrin, D. Farinacci, A. Helmy, D. Thaler, M. Handley, V. Jacobson, C. Liu, P. Sharma, and L. Wei. Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol Specification. Internet RFC 2362, June 1998.

[18] Michael R. Macedonia and Donald P. Brutzman. MBone provides audio and video across the Internet. IEEE Computer, April 1994.

[19] B. Fenner. Internet Group Management Protocol, Version 2. Internet RFC 2236, November 1997.
[20] W. T. Strayer, editor. Xpress Transport Protocol Specification, Revision 4.0. XTP Forum, March 1995.

[21] Rajesh Talpade and Mostafa H. Ammar. Single Connection Emulation (SCE): An Architecture for Providing a Reliable Multicast Transport Service. In Proceedings of the IEEE International Conference on Distributed Computing Systems, Vancouver, BC, Canada, June 1995.

[22] Alex Koifman and Stephen Zabele. RAMP: A Reliable Adaptive Multicast Protocol. In Proceedings of IEEE INFOCOM, pages 1442-1451, March 1996.

[23] Sanjoy Paul, Krishan K. Sabnani, John C. Lin, and Supratik Bhattacharyya. Reliable Multicast Transport Protocol (RMTP). IEEE Journal on Selected Areas in Communications, 15(3):407-421, April 1997.

[24] Rajendra Yavatkar, James Griffioen, and Madhu Sudan. A Reliable Dissemination Protocol for Interactive Collaborative Applications. In Proceedings of the ACM Multimedia '95 Conference, November 1995.

[25] Tetsuo Sano, Teruji Shiroshita, Osamu Takahashi, and Masahide Yamashita. Monitoring-based Flow Control for Reliable Multicast Protocols and its Evaluation. In Proceedings of the IEEE International Performance, Computing, and Communications Conference, pages 403-409, February 1997.

[26] Brian N. Levine and J. J. Garcia-Luna-Aceves. A Comparison of Reliable Multicast Protocols. Accepted to appear in ACM Multimedia Systems Journal, 1998.

[27] University of Southern California Information Sciences Institute. Transmission Control Protocol. Internet RFC 793, September 1981.

[28] P. Karn and C. Partridge. Estimating Round-Trip Times in Reliable Transport Protocols. In Proceedings of SIGCOMM '87, August 1987.

[29] Van Jacobson and Michael J. Karels. Congestion Avoidance and Control. In Proceedings of SIGCOMM '88, August 1988.

[30] Michael Beck, Harald Bohme, Mirko Dziadzka, Ulrich Kunitz, Robert Magnus, and Dirk Verworner. Linux Kernel Internals. Addison-Wesley, 1996.

[31] David A. Rusling. The Linux Kernel. 1997. Available at ftp://sunsite.unc.edu/pub/Linux/docs/linux-doc-project/linux-kernel/tlk-0.8-2.ps.gz.

[32] Douglas E. Comer and David L. Stevens. Internetworking with TCP/IP, volume III. Prentice Hall, Englewood Cliffs, NJ, 1993.

[33] Don Towsley, Jim Kurose, and Sridhar Pingali. A Comparison of Sender-Initiated and Receiver-Initiated Reliable Multicast Protocols. IEEE Journal on Selected Areas in Communications, 15(3):398-406, April 1997.

[34] Michael S. Borella and Gregory B. Brewster. Measurement and Analysis of Long-Range Dependent Behavior of Internet Packet Delay. In Proceedings of IEEE INFOCOM 1998, pages 497-504, April 1998.