CHANNEL BALANCING STRATEGIES TO OPTIMIZE UPLINK UTILIZATION

By

Ashok Nalkund

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Computer Science and Engineering

2003

ABSTRACT

CHANNEL BALANCING STRATEGIES TO OPTIMIZE UPLINK UTILIZATION

By Ashok Nalkund

Linux®1 provides a rich set of features for networking, including routing, firewalling, traffic shaping and support for LAN and WAN interfaces. However, the support for balancing generic outgoing traffic across multiple uplinks is very basic: it is limited to balancing routes over multiple equal-cost links. In this thesis, we study, implement and compare different strategies to distribute outgoing traffic across multiple links. The strategies presented in this thesis are: per-packet distribution, per-route distribution, per-session distribution and bandwidth-usage-based distribution. The networking code in the Linux operating system was modified to implement these features. Experimental results indicate that the per-session and per-packet distribution strategies perform better than the other two strategies with respect to efficient utilization of bandwidth on the multiple links.

1Linux® is a registered trademark of Linus Torvalds.

COPYRIGHT

Copyright by ASHOK NALKUND (nalkunda@cse.msu.edu), 2003

DEDICATION

This work is dedicated to my brother, Sriharsha.

ACKNOWLEDGMENTS

First and foremost, I would like to express my heartfelt thanks to my adviser, Prof. Ni. None of this would have been possible if not for his guidance and help. I would like to take this opportunity to thank all the wonderful people who have contributed to making Linux such a unique operating system. Without the efforts of the open-source community, Linux might not have been the robust operating system it is. I would also like to thank my friends and fellow laboratory members. I would like to particularly mention my friends (in no order) Abhishek Patil, Pradeep Padala, Ravi Parimi and Smitha Kommareddi. My thanks to the faculty and staff of the Computer Science department for their support. Last but not least, I would like to thank my parents, my brother and my sisters for their support and confidence in me.

Contents

List of Figures

Chapters:

1. Introduction

2. Related Work
2.1 Linux Virtual Server Project
2.2 Round-Robin DNS
2.3 Policy Based Routing
2.3.1 Source Policy Routing
2.3.2 Routing with multiple uplinks
2.3.3 Load Balancing
2.4 EQL Driver: Serial IP Load Balancing
2.5 TEQL: "True" (or "trivial") link equalizer
2.6 Linux Ethernet Bonding Driver
2.7 Queuing Disciplines For Linux
2.7.1 Classless Queues
2.7.2 Classful Queues
2.7.3 Hierarchical Token Bucket (HTB)
2.8 Limitations of Related Work

3. Linux Networking Internals
3.1 Data structures
3.1.1 sk_buff
3.1.2 socket
3.1.3 sock
3.1.4 net_device
3.1.5 Qdisc
3.1.6 dst_entry
3.1.7 neighbour
3.1.8 rtable
3.1.9 fib_table
3.1.10 fn_hash
3.1.11 fn_zone
3.1.12 fib_node
3.1.13 fib_info
3.1.14 fib_nh
3.1.15 rt_key
3.1.16 fib_result
3.1.17 nf_conntrack
3.1.18 nf_ct_info
3.1.19 ip_conntrack_tuple_hash
3.1.20 ip_conntrack
3.1.21 ip_conntrack_info
3.1.22 ip_conntrack_tuple
3.1.23 ip_conntrack_protocol
3.2 Packet Reception
3.3 Packet Transmission
3.4 Packet Forwarding
3.5 Netfilter Framework

4. Channel Balancing Algorithms
4.1 Strategies
4.1.1 Route Based Channel Balancing
4.1.2 Per-packet Based Channel Balancing
4.1.3 Session Based Channel Balancing
4.1.4 Bandwidth Usage Based Channel Balancing
4.2 Implementation
4.2.1 Route Based Channel Balancing
4.2.2 Per-packet Based Channel Balancing
4.2.3 Session Based Channel Balancing
4.2.4 Bandwidth Usage Based Channel Balancing
4.2.5 Discussion of the strategies

5. Experimentation Details

6. Results

7. Conclusions and Future Work

Bibliography

List of Figures

2.1 Linux Virtual Server
2.2 Hierarchy of qdiscs
3.1 Neighbor Table Structure
3.2 FIB Table concepts
3.3 FIB Table Details
3.4 Netfilter Hooks
4.1 Multiple Routes between hosts
4.2 Nexthop Selection Logic in Multipath Route
4.3 Route Based Channel Balancing
4.4 Per-Packet Based Channel Balancing
4.5 Session Based Channel Balancing
4.6 Bandwidth Usage Based Channel Balancing
5.1 Testbed Setup
6.1 Bandwidth usage for File Size 8KB
6.2 Bandwidth usage for File Size 16KB
6.3 Bandwidth usage for File Size 64KB
6.4 Bandwidth usage for File Size 128KB
6.5 Bandwidth usage for File Size 512KB
6.6 Bandwidth usage for File Size 1M
6.7 Bandwidth usage for File Size 2M
6.8 Bandwidth usage for File Size 4M
6.9 Bandwidth usage for File Size 8M
6.10 Bandwidth usage for File Size 16M
6.11 Bandwidth usage for Various (mixed) File Sizes
6.12 Transfer Times

Chapter 1

Introduction

The Internet has revolutionized the way people live, work, entertain and communicate. A large number of homes have network connectivity these days. Network connectivity has become cheap and affordable, and it is also critical to the performance of many small and medium businesses. Having redundant network connections is one way to ensure increased connectivity and hence increased network reachability. One would like to use the redundant network connections efficiently under normal circumstances and fall back on the other connections only in emergency situations, such as one of them going down. Under normal circumstances, efficient balancing of the traffic across the redundant network connections is important to keep utilization high and costs low1.

In this thesis, we study different strategies to balance the outgoing traffic across multiple network connections. Various commercial networking equipment manufacturers provide advanced routing features in their equipment. However, such equipment is very expensive for organizations with small budgets. Linux is a free operating system which provides a rich set of networking features, but the features available for balancing outgoing traffic are not powerful enough, and hence only very basic outgoing traffic balancing can be achieved. There are several projects related to load balancing implemented under Linux, but these projects have certain limitations which restrict their application (see chapter 2).

1Network connections can be billed differently: bandwidth allocated, bytes transferred, etc.

In this research, we intend to study, implement and test different strategies for balancing outgoing traffic across multiple network connections. The goal is to effectively utilize the available network connections without too much performance degradation of the applications depending on the network.

In chapter 2 we describe some of the related work. In chapter 3 we delve into the internal details of the networking code in Linux. The design and implementation of the different strategies are discussed in chapter 4. Chapter 5 discusses the experimentation setup, configuration and testing strategies. Results of the experimentation are discussed in chapter 6. Finally, we present the conclusions and some ideas for future work related to this research in chapter 7.

Chapter 2

Related Work

One often confuses load balancing of incoming traffic with load balancing of outgoing traffic. While there are many technologies for implementing incoming load balancing, techniques to implement outgoing load balancing are very few. In this chapter we describe some of the load-balancing techniques for both incoming as well as outgoing traffic. This will give the reader a better understanding of the differences between the two scenarios and also an understanding of how load balancing is achieved.
2.1 Linux Virtual Server Project Quoting the official website of the Linux Virtual Server Project(LVS)[2]: The Linux Virtual Server is a highly scalable and highly available server built on a cluster of real servers, with the load balancer running on the Linux operating system. The architecture of the cluster is transparent to end users. End users only see a single virtual server. This is an incoming-load-balancer. As mentioned above, the LVS project provides a load-balancer which distributes the incoming requests among many internal servers which are invisible to the end users. Figure 2.11 illustrates a typical Linux Virtual Server environment. The load-balancer running the LVS accepts incoming requests 1Source: Linux Virtual Server Project. http:/,lwww.linuxvirtualserver.org. Internet ' * Load Balancer Linux Box Figure 2.1: Linux Virtual Server and then routes them to the real servers according to different configured schedulers: round-robin, weighted-round-robin, least connection, weighted-least-connection and persistent client connection. Load-balancing can be configured for individual proto- col and port combination. The real servers can be added or removed transparently making the architecture very scalable. The Linux Virtual Server is implemented in three ways: NATZ, IP Tunneling and Direct Routing. In NAT, the load-balancer has a public IP address and the real-servers have private IP addresses. The clients connect to the load-balancer which then performs DNAT3 on the packets and routes 2Network Address Translation 3Destination Network Address 'Iiranslation them to the internal servers. The disadvantage of this implementation is that every request and response packet has to be rewritten by the load-balancer which becomes a bottleneck. With IP tunneling, the load-balancer schedules requests to the different real servers, and the real servers return replies directly to the clients. This allows the load-balancer to handle huge amounts of requests, scaling to over 100 real servers. Due to the very low overhead involved at the load-balancer, IP Tunneling allows a load-balancer to serve as a very high-performance virtual server. The third technique is to use Direct Routing. Here also the load-balancer only processes the client-to- server half of the connection“ and the server responses can follow any network path to the clients. This method does not have the overhead of tunneling and hence can scale very well. 2.2 Round-Robin DNS Round-Robin DNS [12] is another incoming-load-balancing technique used for balancing the incoming requests among a set of servers. Here a single name is mapped to different IP addresses in a round-robin fashion, thereby controlling which server a client connects to. For every client that requests a FQDN5-to-IP mapping, in other words a DNS lookup, a different server IP address will be chosen from the list in a round-robin manner. However, one must remember that these mappings are cached on the client system as well as on the intermediate DNS servers which might have run the query at the clients’ request. Thus clients which happen to send the queries to the same intermediate DNS servers will be routed to the same IP address and hence the load-balancing is not perfect. Also the Time To Live (T T L) for the DNS record 4Often the requests are very short followed by long responses from the server. 5Fully Qualified Domain Name is difficult to choose in these settings. 
If a large value is chosen, then the same IP address will be provided for a longer time which skews the load-balancing towards this IP address. On the other hand, if the TTL value is small, the load-balancing will be more efficient but the DNS queries themselves will be a bottleneck. Also depending on the amount of traffic caused by different clients, the load-balancing will be different. Thus even with the IP address rotation, the servers might be loaded differently due to the difference in the clients’ requests. Further, once a server fails, the client which receives the IP address of the failed server will be unable to connect to the server even if it tries to reconnect. Only after the TTL expires and the entry for the dead server is removed from the list of IP address at the DNS server6 will the client be able to connect to the server. 2.3 Policy Based Routing Policy Based Routing [4, 3] uses policies or ‘rules’ to make a routing decision for a packet. This feature is available in Linux by compiling the kernel with the “IP: advanced router” (CONFIGJPADVANCEDJZOUTER = yes) and “IP: policy routing” (CONFIGJP_MULTIPLE_TABLES = yes) features. The routing policy database allows multiple routing tables on a Linux router. The appropriate table is looked up as specified by the ‘rules’. The iproute27 package is available for manipu- lating the routing policy database and the routing tables. By default, the kernel has three tables: local, main and default (table ID 253 denotes the default table). Fol- lowing is the output produced by running the ‘ip rule’ command on a Linux system: # ip rule 6It might so happen that by the time the TTL expires, other client requests have caused the dead server’s IP addresses to be the one to be given to the next request. 7ftp://ftp.inr.ac.ru/ip-routing/ 0: from all lookup local 32766: from all lockup main 32767: from all lockup 253 # As shown above, by default all the rules apply to all the packets. We can generate new rules and override the default routing of packets. Following are the contents of the three tables on a machine which has two interfaces (ethO is down, ethl is up)8: # ip route show table local broadcast 67.167.130.128 dev ethl proto kernel \ scope link src 67.167.130.153 broadcast 127.255.255.255 dev lo proto kernel \ scope link src 127.0.0.1 local 67.167.130.153 dev ethl proto kernel \ scope host src 67.167.130.153 broadcast 67.167.130.255 dev eth1 proto kernel \ scope link src 67.167.130.153 broadcast 127.0.0.0 dev lo proto kernel \ scope link src 127.0.0.1 local 127.0 0.1 dev lo proto kernel \ scope host src 127.0.0.1 local 127.0.0 0/8 dev lo proto kernel \ scope host src 127 0.0.1 # # ip route show table main 8The lines have been broken for readability 67.167.130.128/25 dev eth1 proto kernel \ scope link src 67.167.130.153 127.0 0.0/8 dev lo scope link default via 67.167.130.129 dev ethl # # ip route show table 253 # 2.3.1 Source Policy Routing Source Policy Routing [3] balances the traffic based on the source IP address. We create new tables and add routes to the tables. Policy rules are created that specify the table to lookup for particular source IP address. 
# echo 201 Test >> /etc/iproute2/rt_tables # ip rule add from 192.168.1.100 table Test # ip rule 0: from all lookup local 32765: from 192.168.1.100 lookup Test 32766: from all lookup main 32767: from all lookup 253 # Now we add the appropriate routing entries in the new table: # ip route add default via 192.168.1 1 dev ethO table Test # ip route show table Test default via 192.168.1.1 dev ethO # 2.3.2 Routing with multiple uplinks If multiple uplinks are available, then one would like to make sure that packets arriving on an interface are answered on the same interface. This can be achieved by setting up split access [3] on these multiple interfaces. We create routing tables for each interface and create routes for network and default route in each table. Then we add rules specifying that packets arriving from any of the connected networks should be answered on the same interface by causing the corresponding table to be looked up. Routing entries for table T1: # ip route add $P1_NET dev $IF1 src $IP1 table T1 # ip route add default via $P1 table T1 Routing entries for table T2: # ip route add $P2_NET dev $IF2 src $IP2 table T2 # ip route add default via $P2 table T2 Reply on the same interface: # ip route add $P1_NET dev $IF1 src $IP1 Reply on the same interface: # ip route add $P2_NET dev $IF2 src $IP2 Default route: # ip route add default via $P1 2.3.3 Load Balancing With the above mentioned split access in place, one can setup crude load-balancing by specifying a multipath route as the default route. With a multipath route, the kernel will balance the route lookup over all the nexthops specified in the multipath route in accordance with the weights assigned to each. # ip route add default nexthop via 10.0.1.2 dev ethO weight 1\ nexthop via 192 168 1.2 dev eth1 weight 1 The above multipath route will equalize the routes over ethO and ethl equally. 2.4 EQL Driver: Serial IP Load Balancing EQL is a device driver available in Linux which provides a software device to load-balance IP serial links (SLIP or uncompressed PPP) to increase the bandwidth. This is useful in cases where we have two or more modems and want to increase the bandwidth by binding the modems together. We compile the kernel with the eql patch9 and configure an eql interface with an IP address. Then the default route is set to point to the eql device. The devices are then “enslaved” with the eql-enslave command. Devices are freed using the eql_emancipate command. 2.5 TEQL: “True” (or “trivial”) link equalizer TEQL [3] is a new virtual device available in Linux which equalizes the traffic going out on the physical slave interfaces. For this virtual device to be used, the slave devices must be active, i.e be able to raise the busy signal. This device is capable of equalizing physical interfaces which have varying bandwidths but it is advisable 9ftp: / / slaughter.ncm.com / pub / Linux / LOADBALAN CING / eql-1.1.tar.gz 10 that the bandwidths should be comparable to avoid reordering problems. To use this feature, first we insert the sch_teql module which creates a new device teqlN and a new qdisc (see section 2.7) with the same name. The physical interfaces can now be enslaved to this device and the desired routing setup. # tc qdisc add dev ethO root tequ # tc qdisc add dev eth1 root tequ # ip link set dev tequ up The slave interfaces are configured as they would normally be and the default route is set to point to the teql interface. 
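For completeness, a minimal sketch of that last step is shown below. The teql0 name refers to the virtual device created by the sch_teql module above; a "via" nexthop address reachable over the equalized links could also be given, but any such address would be site-specific.

# ip route add default dev teql0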
The most restricting requirement in this technique is that both ends of the links are required to participate for it to work properly. This is sometimes not possible if the ISP is not very forthcoming with accommodating customers’ requests. 2.6 Linux Ethernet Bonding Driver The Ethernet bonding driver provides for bonding Ethernet channels together. Its called ‘Etherchannel’ by Cisco, ‘Trunking’ by Sun and ‘Bonding’ [5] in Linux. Mul- tiple Ethernet connections can be ‘bonded’ together to act as a single channel with increased bandwidth. This requires the other end of the connection also to support bonding. Although this seems similar to the EQL driver (section 2.4), this driver manages Ethernet segments while the EQL driver manages serial lines. The kernel needs to be compiled with the ‘Bonding Driver Support’ (CONFIGBONDING). A bonding network interface is defined in the system configuration files similar to the other network interfaces”. The IP information is provided for the bonding interface. The physical Ethernet devices are not configured with any IP information and are 10For RedHat Linux systems, this can be in /etc/sysconfig/network-scripts/ifcfg-bondO file. 11 made slaves of th bonding device. The MAC address for the bonding interface is cho- sen from the first enslaved Ethernet interface. The bonding driver can be configured to take care of failing interfaces by monitoring their MII link status. 2.7 Queuing Disciplines For Linux Linux provides a very rich set of features for managing bandwidth by means of queues. With queuing we can control and shape the outgoing traffic from a machine. There are different queuing disciplines available in Linux. Some of them are classless which just accept data and only reschedule, delay or drop it. Classful queues allow different kinds of traffic to be treated differently. The term qdisc is used to refer to a queuing discipline. 2.7.1 Classless Queues These queues are used to shape traffic on the entire interface, without any subdi- visions. All packets are treated the same. Following are some of the classless queues available on Linux. pfifo_fast pfifofast [3] is a queue which performs a ‘First In, First Out’ scheduling of the arriving packets. There are three bands which are prioritized, band 0 receiving higher priority than band 1, which receives more priority than band 2. Band 0 contains packets which have the ‘minimum delay’ TOSll flag set. As long as there are packets in band 0, band 1 is not processed (similarly for band 1 and band 2). The TOS value in the packet is used to map the packet to one of the bands. The mapping can itself be specified with priomap which is set by the kernel.The queue length can be 1 1 Type of Service 12 configured with ifconfig or tc command. This is the default qdisc for any interface. Adding any other qdisc to the interface causes this qdisc to simply return without any action. Token Bucket Filter (TBF) The T BF [3] qdisc ensures that packets are sent out at some administratively set rate while allowing for short bursts of traffic which may exceed the set rate. TBF consists of a buffer (bucket) which is filled with tokens at a specific rate called the token rate. The token rate is the administratively set rate mentioned above. Each token dequeues a packet from the data queue and sends it out. If the token rate is greater than the data rate, tokens simply accumulate up to the bucket size. Excess tokens are then discarded. 
If the data rate is greater than the token rate, then some packets will be dropped until tokens are available. The accumulation of the tokens when the data rate is smaller than the token rate allows for short bursts of data exceeding the token rate. The bucket size (burst size), token rate and various other parameters are configurable. Note that in actual implementation, the token corresponds to bytes and not to packets. This qdisc is very convenient to slow down an interface to match the actual available bandwidth as in case of cable modems and DSL lines. 2.7.2 Classful Queues In Classful queues [3], classes and filters are used to manage the bandwidth. Filters added to a Classful queue look at the packet and decide what to do with the packet and return this decision to the Classful queue. The Classful queue then enqueues the packet in the appropriate class. The filters and classes can be nested to achieve better control over the traffic as shown in Figure 2.2. The class at the bottom of this nesting l3 1: root qdisc /1:l child class child classes /1\H2\ leaf class 12: qdisc 10:1 10:2 12:1 12:2 leafclass Figure 2.2: Hierarchy of qdiscs enqueues the packet to the qdisc it contains. When a packet arrives for transmission, the root qdisc enqueues it to the appropriate child class by filtering the packet. The child class can in turn apply filtering to enqueue the packet to one of its child classes. When a packet needs to be dequeued by the kernel to send to the interface, the root qdisc gets a dequeue request which is passed on to the child class. The class with the lowest handle containing a packet returns the packet to the kernel. The handle for a class is specified when the class or qdisc is added to the interface. Due to this nesting, the child class cannot dequeue a packet faster than its parent will allow. Class Based Queuing (CBQ) CBQ [3] is a very widely used and complicated Classful qdisc which also works as a shaper. However the shaping algorithm used in CBQ is not very precise and hence causes CBQ to behave unexpectedly in some situations like not achieving the desired traffic rate. CBQ uses the idle time between requests for packets by the hardware layer to shape the traffic. Determining the idle time is a very tricky issue due to various reasons like the driver implementation for the hardware, hardware details 14 (bus speed), etc. Even with these limitations, CBQ works well in many circumstances. The parameters which configure the shaping include the average packet size, physical bandwidth, required rate of traffic, etc. The queuing properties of CBQ are configured by specifying the weight, priority and allocated amount for each inner class. The priority property makes CBQ a priority queue where packets are dequeued from the highest priority inner queue before checking lower priority inner queues. The amount specifies how much data an inner queue can send out when requested. 2.7.3 Hierarchical Token Bucket (HTB) Hierarchical Token Bucket (HTB) [3] is similar to CBQ but instead of working with idle time calculations to shape traffic, it works as a Classful Token Bucket Filter. HT B can be used to specify how to divide the available bandwidth for different purposes and also how to share the bandwidth among them, by lending and borrowing, if required. This is a scalable queuing discipline unlike CBQ which gets complex even for a simple setup. 
If a class requests bandwidth less than the amount assigned, the remaining bandwidth is distributed to the other classes that request for bandwidth. Thus excess bandwidth can be lent to other classes. First we add a HTB qdisc as root qdisc to the interface. Then we add a root class”. This is required because bandwidth cannot be shared among the root classes. Hence we create this root class and then make the actual classes children of this root class. Then we add the classes specifying the bandwidth guaranteed for the class. Further we add filters to classify the packets into these classes. 12A root class is a class with the root qdisc as its parents 15 2.8 Limitations of Related Work The projects discussed in this chapter are unsuitable for achieving generic outgoing traffic balancing. Some of these works are applicable only to incoming traffic, for example Linux Virtual Server Project and Round-Robin DNS (see sections 2.1 and 2.2 respectively). Others like policy routing (see section 2.3) have restrictions over what traffic they balance. The rules in policy routing specify what kind of traffic will be balanced, for example packets with a specific source IP address. To balance the generic outgoing traffic, enough rules would have to be created to cover all kinds of packets in the generic traffic. EQL and Ethernet bonding drivers (see sections 2.4 and 2.6 respectively) apply only to specific types of interfaces (IP serial links and Ethernet interfaces respectively). Further, EQL, TEQL and Ethernet bonding, also require the cooperation of the ISP providing the connections which may not be possible in some case. Finally queuing disciplines (section 2.7) can only be used to shape the traffic which is already queued for an interface. This does not provide for balancing the traffic across multiple interfaces. Thus we see that balancing the generic outgoing traffic is very difficult if not impossible with the available tools. 16 Chapter 3 Linux Networking Internals The networking code in Linux is one of the most complex and largest piece of code in the kernel, in fact, the networking code makes up for around 20% of the entire Linux kernel code. In this chapter we discuss the details of the networking code in the Linux kernel and trace the path of a packet received, sent out and forwarded by the Linux kernel. The Linux kernel supports different network architectures including IPv4 Internet protocols (PF .INET), IPv6 Internet Protocols (PFJNET 6), Appletalk (PFAPPLETALK), AX.25 (PF _X25), IPX - Novell Protocols (PFJPX), etc. It also allows for different scheduling algorithms like Packet First-In-First-Out (pfifo), Byte First-In—First-Out (bfifo), Token Buffer Filter (TBF), Stochastic Fairness Queuing (SF Q), etc. The networking code in Linux is based upon the Swansea University Computer Society’s N ET 3.039 code. The current version of the networking code in Linux 2.4 kernel is NET4.0. The code has been divided into family of protocols and further into layers. Each layer has a well-defined interface with the adjacent layers. To make the code efficient, copying of data is avoided between the layers, instead, enough space is reserved for every packet for the header / trailer for the different layers. In this chapter, we concentrate on the IPv4 Internet Protocols family. 17 When an application generates traffic, it sends it to the transport layer (TCP or UDP) through the socket interface. The transport layer then passes it on to the network layer (IP). 
This layer looks at the destination of the packet and decides whether the destination is the local host or a different host. If the packet is for the same host, it passes it back to the upper layer for passing it on to the local destination application. If the packet is for another host, the kernel searches the route cache, and if required the Forwarding Information Base (FIB), for a route to the destination. It prepares the packet for transmission to the destination and hands the prepared packet to the interface to be sent out on the physical medium. On the other hand, when a packet arrives at an interface and if the hardware address belongs to the host or is a broadcast, it queues the packet for processing by the kernel. The IP layer looks at the destination and if the packet is for the host, it passes it on to the transport layer for further processing. The transport layer then de-multiplexes and sends the data to the appropriate program. If the packet is for another host and forwarding is enabled on the host, the routing cache and FIB tables are consulted to route the packet correctly. If forwarding is disabled, the packet is dropped. In Linux, the work to be done when an interrupt is received, is divided into two parts: top—half and bottom-half. Different tasks in the system have different bottom- half handlers, like networking bottom-half handler, timer bottom-half handler, SCSI bottom-half handler etc. In the top-half, only the most essential work, like moving data from the hardware buffer to kernel buffer, is performed. The top-half sets a flag to indicate that the kernel needs to complete the work for the interrupt by running the bottom-half handler. When the kernel scheduler runs, it checks these flags and runs the corresponding bottom-half handlers. These bottom-half handlers complete the 18 work related to the interrupt. For example, the networking bottom-half handler gets a packet from the queue, checks for the protocol type and hands it to the protocol- specific receive function. In 2.4 versions of the Linux kernel, the bottom-half handler for networking is implemented as a softirq. Bottom-halfs are very demanding on the system: only one bottom-half can run on a system irrespective of the number of CPUS the system has. Softirqs can run on multiple CPUS at the same time and are used for very very high frequency threaded jobs. The bottom-half handler for networking net-bh() has now been replaced with the net_rx-action () softirq in the 2.4 versions. Now that you have a basic understanding of networking in Linux, you may skip to the next chapter if you do not wish to get entangled in the intricacies of the Linux networking code. However for those who decide to continue, having the Linux kernel code handy is strongly advised. This will let you browse the actual code as you read about various functions and data structures. The source and header filenames in this document are relative to the kernel source location, usually /usr/src/linux. The following sections are based on Linux kernel version 2.4.18 which can be obtained from the Linux Kernel website: http://www.kernel.org. We should note that in the following discussions, a number of issues like fast-routing, multiprocess systems, etc are not discussed. The discussion of these issues would complicate the discussion and the actual control path of the packet would not be clearly understood. Hence, only the control paths which would be taken in the most general cases are discussed. 
We also assume that all the interfaces involved are Ethernet interfaces. 19 3.1 Data structures At the highest level, BSD sockets are used to represent a network connection. These structures hide the implementation details of the different protocol families. INET sockets are specific to IPv4 protocol family. The generic operations on the BSD sockets invoke functions specific to the protocol family such as the INET socket Operations. In the following sections we discuss some of the important data structures in the Linux networking code. 3.1.1 sk_buff sk_buff1 is one of the most important data structure in the Linux networking code. It represents a buffer which provides control information and data to be transmitted or received. It reserves enough storage so that at each layer of the networking stack, the headers or trailers can be added without any additional overhead. The data is copied only twice through its entire journey through the networking layers: once from user-space to the kernel-space and once from the kernel-space to the output medium. Important members of this data structure include pointers to the head of the buffer list, next and previous pointers, pointer to the sock (see section 3.1.3) structure which owns this buffer, net_device (see section 3.1.4) pointer to the device from which this packet arrived or on which this packet will be sent out, pointers to the different network layer headers and trailers, checksum, length of the data currently in the buffer, pointer to the connection tracking structure nf_ct_info (see section 3.1.18) and control information for the upper layer protocol. 1 include / linux / skbuff.h 20 3.1.2 socket socket2 represents a BSD socket in the networking code. Some of its members include the state of the socket, flags, pointer to the inode associated with the socket, pointer to the file associated with the socket, a pointer to the low-level“ sock associated with the socket and a pointer to the proto-ops structure which contains function pointers to handle different operations on the socket. For a TCP socket, it points to the inet_stream.ops3 which contains functions to handle a TCP socket bind, connect, listen and other operations. Most often, the sock is an INET socket. 3.1.3 sock The sock4 structure represents a messed up collection of network layer and bits of transport layer information for a socket. Some of the most important members of this structure are pointers to the next and previous sock structures (next, prev), local and foreign addresses (rcv_saddr, daddr), source and destination port numbers (sport, dport), address family (family), state of the socket (state), pointer to the destination cache entry (dst..cache), receive and send/write queues (receive-queue, write_queue), pointer to a struct proto structure (prot), which has pointers to the transport layer functions to operate on the socket, transport layer options structure (tp_pinfo), pointer to the parent BSD socket (socket). The prot points to tcp.prot5 for TCP sockets and udp_port6 for UDP sockets. 2include/linux/net.h 3net/ipv4/af_inet.c 4include / net / sock.h 5net/ipv4/tcp.c 6net/ipv4/udp.c 21 3.1 dar Ida llt’l lf’f' 3. 1.4 net_device net_device7 represents a network device in the Linux networking code. It includes data related to higher levels of networking as well as basic I / O. 
The important fields related to networking include the name of the interface (name), pointer to the next net-device (next), unique device identifier (ifindex), pointer to functions to get in- terface statistics (getstatus for wired and get-wireless_stats for wireless interfaces), interface address (dev_addr), pointers to the queuing disciplines attached to the in- terface (qdisc), pointers to specific higher level protocols including AppleTalk, IPv4, IPv6 etc, transmit queue length ( tx-queue_len), pointers to functions to perform differ- ent actions on the interface like start transmit (hardstaermitO), set MAC address (set.mac_addr()). 3. 1.5 Qdisc Qdisc8 represents a queuing discipline. It has pointers to the queuing and de- queuing functions (enqueue, dequeue), pointer to the next Qdisc structure (next), pointer to a Qdisc_ops structure (Ops), which has pointers to functions for different operations of specific disciplines, pointer to the interface to which this qdisc is bound (dev), a statistics structure (struct tc_stats stats)9, which tracks the bytes, packets, drops and other information of the qdisc. ops points to tbf_qdisc.ops10 for a Token Bucket Filter (TBF) qdisc and pfifo.qdisc_opsll for a Packet First-In-First-Out qdisc. 7include/linux / netdeviceh 8 include / net / pkt _sched.h 9include / net / pkt _sched.h lonet / sched / sch _tbf .c 1 1 net /sched/sch_fifo.c 22 3.] 3.1.6 dst_entry This represents a destination cache entry. dst.entry12 contains pointers to input and output functions and information about a route. Its members include device pointer (dev), pointer to the neighbor (neighbour), hardware header pointer (hh), input and output functions (input, output), various other information like flags, last used time, path MT U, etc. It also holds a pointer to a neigh_ops13 structure (Ops) which contains family, protocol and pointers to functions for rerouting, destroying the route, etc. 3.1.7 neighbour Neighbor Table (struct neigh_table) Parameters \ (struct neigh_params) \\ \ \. \ Neighbors (struct neighbour) Pneigh (struct pneigh_entry) (struct neigh_table) a O.) l is) 3 G M Q) v (struct g neigh_table) V Figure 3.1: Neighbor Table Structure l2include / net / dst.h 13include / net / dst.h 23 This represents information about the neighbour connected to the system one hop away. Like other Objects in the Linux kernel, it has pointers to the next neigh- bour structure (next). Other information in the structure include pointer to the neigh_table”, pointer to the device (dev), flags, hardware address (ha), pointer to the output function (output), pointer to the neigh_ops15 structure (Ops) which has pointers to the functions for diflerent operations like destroying the neighbour struc- ture, transmitting a packet from the queue (queue_xmit). 3.1.8 rtable An entry in the routing table is represented by a rtable16 structure. This contains a union (u) of destination cache entry (dst) and the next rtable structure (next). Thus the union allows the same member to be used as a pointer to the next rtable or as a pointer to the destination cache entry for the route. Other information in rtable are the flags and type of the route (rtJlags, rt-type), source and destination addresses (rt_src, rt-dst), interface index (rtjif), gateway address (rt_gateway) and key for route lookup (key). 3.1.9 fib_table The use of multiple routing tables in Linux is made possible by the use of i‘ib.tablel7 structures (see Figures 3.1.918 and 3.1.1019). 
This structure contains a table identifier (tb_id), table manipulation functions like table lookup (thookup), 14include / net / neighbour.h 1 5 include / net / neighbour.h 16include / net / route.h l7include / net / ip_fib.h 1 8 Source: Linux 1P Networking, http:/ /www.cs.unh.edu/cnrg/gherrin 19Source: Linux IP Networking, http://www.cs.unh.edu/cnrg/gherrin 24 Netmask Table one entry for each Netrnask Table one zone for each Network Information routing information for split into specific network information (addresses, TOS etc) and protocols (IP, ATM, etc) potential netrnask known subnet mask each known network 0000 N Default 0.0.0.0 8000 zone node info C000 ll em 1 E000 ( p y) 10.0.0.0 zone / node info F000 / (F000) F800 . 17216.0. (empty) node info FEOO zone / (FFOO) \ 172.2900 FFOO node info FF80 (empty) FFFF Figure 3.2: FIB Table concepts insertion (thnsert), deletion (tb_delete) and pointer to the associated FIB hash table (tb.data). 3.1.10 fn_hash fn_hash2° is the FIB hash table which contains pointers (fn_zones) to the individual zones representing netmask. The number of zones is 33 which represents one zone per bit in the mask. The fnJoneJist points to the first non-empty zone in the table. The 20net/ipv4/fib_hash.c 25 Netmask Table (struct fn_hash) I I [ Network Zones Network Information I (struct fn_zone) ’[struct fib_node) (struct fib_info) fn_zones[O] : fz_next : fn_key fib_protocol tb__data fn_zones[l] I fz_hash < ' fib_metrics : fZ-maSk : fib_nhfl fn_zones[2] ' fn_key l I I l fn_key (struct l l fb tabl ' . fn k I _ e) / l fz_next / g - ey fib_PTOI9COI I = a fib_metrics ' fZ—haSh \ : fn_key fib_nh[] I fz_mask l fn_zones[32] : . fn_zone_list i E fn_key Main/Local Tables I 5 I I Figure 3.3: FIB Table Details zones are organized from most specific (longest prefix) to the most general (shortest prefix), thus the fn_zone_list points to the most specific zone in the table. 3.1.11 fn_zone Each fn_zone21 represents routing data for one netmask. Thus for the 32-bit IPv4 addresses, there are 33 such zones. It contains a pointer to the next non- empty zone (fz_next), pointer tO the fib_node hash table (fz_hash), number of entries (fz-ent), number of buckets in the hash table (fz_divisor), mask to generate the hash (fz_hashmask), mask for the zone (fz_mask) and order of the zone which is the index of the zone in fn_hash. 2’ net / ipv4 / fib_hash.c 26 3.1.12 fib_node fib_node22 denotes information specific to a route. Its fields include pointer to the next node (fnmext), pointer to fib_info structure (fn_info), the search key (fn_key), T OS (fn-tos), type of route (fn_type), scope of the route (fn_scope) and state (fn_state). The key, a 32 bit unsigned number, is formed by ANDing the destination address with the netmask of the zone. Thus it encodes both the destination address and the mask of the route. 3.1.13 fib_info fib_info23 contains information that can be shared between many routes. This includes the route metrics (fib_metrics), number Of nexthops available for the route (fib_nhs), array Of fib_nh structures (fib_nh), network protocol (fib.protocol), status information (dead) and pointers to the next and previous fib-info structures (fib_next, fib_prev). 3.1.14 fib_nh fib_nh24 represents the nexthop information in a route. It contains a pointer tO the device (nh-dev), flags, scope, weight of the nexthop (nh_weight), outgoing interface index (nh_oif) and the address of the gateway (nh-gw) 3.1.15 rt_key rt_key25 contains information which form the key for route lookup. 
This includes the source and destination addresses (src, dst), incoming and outgoing interface in- 22net/ipv4/fib_hash.c 23include / net / ip_fib.h 24include / net / ip_fib.h 25include / net / route.h 27 dices (iif, oif), firewall mark if configured (fwmark), TOS (tos) and scope of the route (scope). 3.1 .16 fib_result A fib_result26 structure is returned when a route is found in the routing table. This structure contains the prefix length (prefixlen), the index Of the next hop selected (nh.sel), type and scope Of the route (type, scope) and the routing rule (fib_rule) if multiple tables support is configured. 3.1.17 nf_conntrack m".conntrack27 represents a general netfilter connection. This is used only for usage count (use) and deletion through the destroy function pointer. Other specific connection structures, for example ip_conntrack contain this structure within them. 3.1.18 nf_ct_info nf_ct_info contains a pointer to a nf_conntrack structure (master). 3.1.19 ip_conntrack_tuple-hash ip_conntrack_tuple_hash is an entry in the connection hash table. It contains the ip_conntrack_tuple (tuple) and the pointer to the IP connection (ctrack). 3.1.20 ip-conntrack ip_conntrack28 represents an IP connection in the connection tracking mechanism. It contains an nf_conntrack structure (ct_general) for usage count, a couple of 26include / net / ip_fib.h 27include/linux/skbuff.h 28include/linux/skbuff.h 28 ip-conntrack_tuple_hash structures (one each for original and reply direction, tuple- hash), status tO indicate whether we have seen packets in both the directions (status), pointer to a ip_conntrack.helper if registered. If this connection is expected by another connection, that connection is stored in a nf.ct_info structure (master). An array of nf_ct_inf0 structures which indicate the relation Of the packet with the connection is available as infos. 3.1.21 ip_conntrack_info ip_conntrack_info29 is an enumeration type which indicates the status of the con- nection. The different values Of this type include IP_CTESTABLISHED to indicate an established connection, IP-CT_NEW to indicate a new connection etc. 3.1.22 ip_conntrack_tuple ip_conntrack_tuple30 contains the information related to an IP connection. The fields in this structure hold the destination information (IP address, port for TCP and UDP, and type and code for ICMP), source information (IP address, port for TCP and UDP, and type and code for ICMP) and transport layer protocol identifier. The source related information is manipulable and is encapsulated in an ip_conntrack_manip31 structure. 3.1.23 ip_conntrack_protocol ip_conntrack.protocol holds information and function pointers for specific proto- cols. Its fields include the protocol identifier (proto), protocol name (name), function pointer to convert packet to tuple (pkt_to_tup1e), function pointer to invert the tuple 29include/linux/netfilter_ipv4/ip_conntrack.h 3°include/linux/netfilter_ipv4/ip.conntrack.tuple.h 31include/linux/netfilter_ipv4/ip_conntrack.tuple.h 29 for the packet (invert_tuple), function pointer to handle the packet (packet), function pointer to create a new connection of this protocol (new), etc. This is represented by ip.conntrack.protocol_tcp32 for TCP connections, ip.conntrack_protocol_udp33 for UDP connections and by ip_conntrack_protocol_icmp34 for ICMP connections. 3.2 Packet Reception The journey Of a packet in the kernel begins with the hardware interrupt Of the interface receiving the packet from the physical medium. 
The interrupt service routine (ISR) for the interface checks for the length Of the queue (ring buffer overflow), any transmission errors like checksum, frame errors, etc. Then it allocates space (an sk_buff structure, usually called skb). It checks for the packet’s protocol ID and whether the packet is a hardware broadcast, multicast or destined for other host”. It then calls the netif_rx()36. netif_rx() enqueues the passed skb at the end of the receive queue for the CPU and raises the networking receive softirq (NET.RX-SOFTIRQ) flag. The NET_RX_SOFTIRQ is mapped to the net_rx_action()37. When the kernel finds that the NET_RX_SOF T IRQ has been raised, it calls net_rx-action(). In this function, packets (skb) are de-queued from the CPU’s receive queue. For every packet type, i.e. protocol family, that has been registered, the interface registering the protocol family is checked to match the incoming interface. If the interfaces match or 32 net / ipv4 / netfilter / ip.conntrack.proto_tcp.c 33 net /ipv4 / netfilter /ip.conntrack.proto.udp.c 3" net /ipv4/netfllter/ip.conntrack_proto.icmp.c 35When an Ethernet interface is in promiscuous mode, packets destined for other MAC addresses also will be processed. 36net / core / dev.c 37net/core/dev.c 30 if the protocol applies to all the interfaces, the function (packet_type.func) to handle packet Of the particular protocol family is called. For IN ET family, this function is ip_rcv( 38. As long as the queue has packets and there is time, packets are de—queued and processed. ip_rcv() is the main IP receive function. This function gets a pointer to the IP header Of the packet and performs checks like IP version, checksum, minimum packet size, etc. If these checks pass, it invokes the NFJP_PRE_ROUTING hook in the netfilter framework. If the hook returns a success, then ip_rcv.finish( 39 is called. ip.rcv_finish() invokes ip.route.input()4° if the skb->dst, where skb if a sk_buH structure discussed in 3.1.1, is not already set correctly. The skb->dst will be set correctly if the source address of the packet is a IOOpback interface, otherwise it is not set If ip_route.input() returns an error, the packet is dropped, i.e the buffer (skb) is freed and an error code NET_RX_DROP is returned. If ip.route.input() returns success, the packet is processed further. If the packet contains any TP options, they are processed. Finally, skb->dst->input() is called to process the packet. In INET family, this function maps to ipforward( )41 for packets destined to other host and to ip_local_deliver()42 for local packets. The journey of the packet along ipforwardO is described in 3.4. ip_route.input() is the function where the route lookup starts. First a hash is generated from the source address, destination address, incoming interface index and Type of Service (TOS) of the packet. This is used for lookup in the route cache hash 38net/ipv4/ip_output .c 39net/ipv4/ip.input.c 4°net/ipv4/route.c ‘1 net / ipv4 / ip.forward.c 4‘"net/ipv4/ip.input.c 31 table. The corresponding bucket in the route cache hash table is traversed searching for a route that matches the destination address, source address, incoming interface, outgoing interface index set to 0 and the TOS. If firewall marking is enabled in the kernel (CONFIGJPROUTEI‘WMARK = yes), then firewall mark is also matched with the route cache entry. 
If an entry is found, then the entry is updated with the last used time, number of times used and a pointer to the entry is made available in the packet’s dst, which is a dst_entry structure discussed in 3.1.6, and the function returns. The route cache entry contains dst-entry as its member. This structure, as discussed in 3.1.6, holds all the information required for the packet to be sent along the selected route which includes the neighbour information, the device, hardware header cache, the output function to be used to send the packet along the route, etc. If a route cache entry is not found, then ip_route_input.mc()43 is invoked if it is a multicast packet, otherwise ip_route_input_slow()44 is invoked. In this work, we do not discuss ip_route.input_mc() and only concentrate on ip.route_input_slow(). ip_route_input_slow() constructs a key (rt_key) based on the destination address, source address, TOS, firewall mark if configured, incoming interface index, route scope set to RT_SCOPE_UNIVERSE45. If the source address is multicast address or the source address has a bad address class or the source address is a loopback address“, then an error is returned which causes the packet to be dropped. Other checks like source and destination addresses being 0, source address having a zero network address are also checked and the packet is dropped if any such test succeeds. 43net/ipv4/route.c 4‘lnet/ipv4/route.c 45Route scopes are used to indicate the type of route: site, link, host, universe. 46A packet with a loopback source address will have the skb->dst already set by the output routine which would have already been checked in ip_rcv_finish () before invoking ip.route.input(). Hence a packet with loopback source address will never reach this function. 32 If the packet has a valid source and destination address, then the actual route lookup is performed. If multiple routing tables support (CONFIGJPJVIULTIPLE-TABLES = no) is not enabled in the kernel, fibJookup( 47 is called to lookup the route in the Forwarding Information Base (FIB) tables. Otherwise, if the multiple routing tables support is enabled, then the table lookup routine fibJookup( 48 is invoked. If the route lookup results in a local route, then the packet is destined for the local machine. In this case, a route cache entry is created and its fields including the search key, function to process the input packet (which is ip_local_deliver()), incoming interface index, etc are set. If it is not a local packet and if IP forwarding is not enabled on the interface, then the packet is dropped. At this point, if multipath is configured in the kernel and if there are more than one nexthops available for the route, fib_select_multipath( 49 is called which selects one of the multiple nexthops based on the route configuration. After further checks like no outgoing interface available or invalid source (should not be a broadcast or local address), a route cache entry is created and filled in with the information about the route selected. The input and output functions for the route cache entry are set to ip_forward() and ip.output( 5" respectively. Finally the route cache entry created above is inserted into the route cache by calling rtjntern_hash()51. One should note that only the most important functionalities of ip_route_input_slow() have been discussed here. 
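To summarize the input path just described, the sketch below condenses the decision that ip_route_input_slow() encodes into the new cache entry: a local route gets ip_local_deliver() as its input handler, a forwarded route gets ip_forward() with ip_output() for the outgoing side, and anything else is dropped. This is a minimal user-space illustration, not the kernel code; the *_sketch and *_stub names are invented stand-ins for the real structures and functions.

/*
 * Simplified, user-space sketch of the decision made at the end of
 * ip_route_input_slow(): pick the input/output handlers for the new
 * route cache entry based on the FIB lookup result.  The structures
 * and stub handlers are illustrative stand-ins, not kernel code.
 */
#include <stdio.h>

enum route_type { ROUTE_LOCAL, ROUTE_UNICAST, ROUTE_UNREACHABLE };

struct dst_entry_sketch {
    int (*input)(const char *pkt);   /* what to do with received packets  */
    int (*output)(const char *pkt);  /* what to do with outgoing packets  */
};

static int ip_local_deliver_stub(const char *pkt)
{ printf("deliver locally: %s\n", pkt); return 0; }

static int ip_forward_stub(const char *pkt)
{ printf("forward: %s\n", pkt); return 0; }

static int ip_output_stub(const char *pkt)
{ printf("transmit on selected interface: %s\n", pkt); return 0; }

/* Mirror of the branch taken after fib_lookup()/fib_select_multipath(). */
static int setup_input_route(struct dst_entry_sketch *dst,
                             enum route_type type, int forwarding_enabled)
{
    if (type == ROUTE_LOCAL) {
        dst->input  = ip_local_deliver_stub;  /* packet is for this host */
        dst->output = ip_output_stub;
        return 0;
    }
    if (type == ROUTE_UNICAST && forwarding_enabled) {
        dst->input  = ip_forward_stub;        /* packet is routed onwards */
        dst->output = ip_output_stub;
        return 0;
    }
    return -1;                                /* drop: no route or forwarding off */
}

int main(void)
{
    struct dst_entry_sketch dst;

    if (setup_input_route(&dst, ROUTE_UNICAST, 1) == 0)
        dst.input("packet for another host");
    if (setup_input_route(&dst, ROUTE_LOCAL, 1) == 0)
        dst.input("packet for this host");
    return 0;
}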
47include / net / ip_fib.h 48net/ipv4/fib.rules.c 49net/ipv4/fib_semantics.c 50net/ipv4/ip.output.c 5 1 net / ipv4 / route.c 33 fib_lookupO52 calls the table lookup routine, thookupO, on the local-table and main_table. If they fail an error is returned else success is returned. The structure Of these tables is discussed in 3.1.9. In fib_lookup()53, the table Of routing rules is searched to find the correct routing table to use for lookup. The search is based on the source address, destination address, TOS, firewall mark and incoming interface index. Once a matching rule is found, the action specified for the rule specifies whether to proceed with the route lookup or to return an error. The lookup proceeds if the action is RTN.UNICAST or RT N_NAT. The matching rule specifies the table to use for the lookup and the thookupO is invoked on that table. If thookup() returns an error, the error is propagated to the caller and the packet is finally dropped. tb-lookup() maps to fn_hash.lookup()54. In this function, the FIB tables are searched for longest matching route entry. As discussed in 3.1.11, each routing table has 33 zones (one for each bit in the network mask). The search starts with the zone pointed to by fn.zone_list which is the first non-empty zone in the table. The zones are linked together with the most restrictive zone (32 bit netmask) being at the head and relaxing the restriction with each zone. In each zone, fz_key( 55 is called to generate the search key by ANDing the destination address and the zone netmask. Then fz_chainO56 is called which invokes fn_hash()57 to generate a hash to quickly locate the potential fiinode in the hash table of fib_node for the zone, which might 52include / net / ip_fib.h 53 net /ipv4/fib.rules.c 5" net / ipv4 / fib_hash.c 55net/ipv4/fib_hash.c 56net/ipv4/fib_hash.c 57net/ipv4/fib_hash.c 34 have an entry for the destination. The search key is compared with the fib_node key. If the search key is less than that of the node key, then the search moves to the next zone. If the search key is greater than the node key, the search continues with the next fib_node entry. If there is a perfect match, then we have a route which is the longest match for the destination. Now, the fib_node entry is checked for its sc0pe and state. If the node is in a zombie state or if the scope of the fib.node is less than the sc0pe of the search key, then the search continues with the next fib_node. Else fib_semantic_match()58 is invoked to setup the fib_result structure with the results Of the lookup. If fib_semantic_match() returns success, then the type, scope and prefixlen of the fib_result is set to the values in the fib_node. In fib_semanticmatch 0, one of the parameters passed is the fib_info of the fib.node which was selected during the search earlier. As discussed in 3.1.13, fib_info contains a list of nexthops available for this route. If the multipath routing is not configured in the kernel (CONFIGJPJIO UTE_M ULTIPAT H = no), then the number Of nexthOps is at the most 1, which is selected if it is not dead. Otherwise, the first nexthop which is not dead is selected to be the nexthop for the packet. This is indicated by setting res->nh_sel = nhsel where res is a fib_result structure and nhsel and nhsel are indices into the array Of nexthops. 
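The default selection just described, taking the first nexthop that is not dead and recording its index in res->nh_sel, can be pictured with the short stand-alone sketch below. The *_sketch types are simplified stand-ins for struct fib_info and struct fib_nh, not the kernel definitions; the weighted selection used when multipath routing is configured is discussed next.

/*
 * User-space sketch of the default nexthop selection performed in
 * fib_semantic_match() when multipath routing is not compiled in:
 * the first nexthop that is not marked dead is chosen and its index
 * recorded in the result (res->nh_sel in the kernel).
 */
#include <stdio.h>

struct nexthop_sketch {
    const char *gateway;   /* nh_gw in the kernel                 */
    int         oif;       /* nh_oif                              */
    int         dead;      /* nexthop unusable?                   */
    int         weight;    /* nh_weight, used only by multipath   */
};

struct fib_info_sketch {
    int                    nh_count;  /* fib_nhs  */
    struct nexthop_sketch *nh;        /* fib_nh[] */
};

/* Return the index of the first usable nexthop, or -1 if all are dead. */
static int select_default_nexthop(const struct fib_info_sketch *fi)
{
    for (int i = 0; i < fi->nh_count; i++)
        if (!fi->nh[i].dead)
            return i;                 /* becomes res->nh_sel */
    return -1;
}

int main(void)
{
    struct nexthop_sketch hops[] = {
        { "10.0.1.2",    1, 1, 1 },   /* dead uplink */
        { "192.168.1.2", 2, 0, 1 },   /* usable      */
    };
    struct fib_info_sketch fi = { 2, hops };

    int sel = select_default_nexthop(&fi);
    if (sel >= 0)
        printf("selected nexthop %d via %s\n", sel, hops[sel].gateway);
    return 0;
}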
In fib_selectJnultipath () which is invoked if multipath is enabled in the kernel and if the number of nexthops for the selected route is greater than 1, a weighted-round- robin algorithm is used to select the nexthop for the packet. Each nexthop specified in the route is given a weight. The algorithm selects the nexthop according to the weights. 58net/ipv4/fib_semantics.c 35 ip_locaLdeliverO is the function which finally delivers the packet destined for the local host up to the higher layers. If the packet is fragment, ip.defrag()59 is invoked which reassembles the packet. Then the NF .IP.LOCAL.IN netfilter hook is invoked which if successful calls ip_local.deliver_finish()60. ip_local_deliver_finish() gets the higher layer protocol information from the IP header of the packet. If the protocol is IPPROTOJZAW indicating that the packet is destined for a raw socket, raw-v4_input( 6’ is invoked for further processing. If the protocol is not IPPROTOJIAW, then corresponding protocol handler is invoked as ipproto—>handler(). For TCP, this resolves to tcp_v4_rcv()62, for UDP, this resolves to udp.rcv()63 and for ICMP, it resolves to icmp_rcv()64 3.3 Packet Transmission A. packet transmission starts with a call to the the sock_write() which first gets the socket associated the file descriptor being written to and then sets up the message header and calls socksendmsg065. The socket is discussed in 3.1.2. socksendmng calls sobk->ops->sendmsg() of the socket which maps to different functions for different types Of sockets. For IN ET sockets, it maps to inetsendmng“. 59 net / ipv4 / ip_fragment .c 6Onet/ipv4/ip_input.c 61 net / ipv4 / raw.c 62net/ipv4/tcp_ipv4.c 63 net / ipv4 / udp.c 6“ net / ipv4 / icmp.c 65net /socket.c 66net/ipv4/af.inet.c 36 inetsendmng first binds an unbound socket and then invokes sk->prot->sendmsg() where sk is the INET socket associated with the BSD socket. INET sockets are discussed in 3.1.3. sk->prot->sendmsg() maps to tcp_sendmsg()67 for TCP sockets and udp_sendmsg()68 for UDP sockets. We take up tcpsendmng and trace the packet as it travels down the network stack. tcpsendmng first waits for the connection setup to complete. Then it forms packets from the message. If there is space in the last awaiting packet (content length < MSSSQ), the data is added to it. Finally tcp_send_skb()70 is invoked. tcp_send_skb() first enqueues the packet to the TCP write queue which holds the packets to be transmitted. Then tcp.transmit_skb()71 is invoked to send packets im- mediately if required, for example for packets which have the FIN bit set. Otherwise, the function is invoked later by other functions to transmit the packets. When a packet needs to be transmitted tcp_transmit_skb() is invoked. tcp-transmit_skb() first builds the TCP header for the packet. Then it invokes tp- >af_specific->send-check() where tp is a tcp_0pt structure and afspecific is the pointer to the INET family specific functions. afspecific is set to ipv4_specific72. send_check() maps to tcp_v4.send_check( 73 which computes the checksum if required. Finally tp—>af_specific->queue_xmit() is invoked which maps to ip-queue_xmit()7". 67net/ipv4/tcp.c 68net/ipv4/udp.c 69 Maximum Segment Size 70net/ipv4/tcp_output.c 71 net / ipv4 / tcp_output.c 72net/ipv4/tcp_ipv4.c 73 net / ipv4 / tcp_ipv4.c 7“ net / ipv4 / ip.output.c 37 In ip-queue_xmit(), route lookup is performed if the packet is not already routed, by calling ip_route-output()75. If the route lookup fails, the packet is freed and an error is returned. 
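Returning to the nexthop selection at the beginning of this section, the weighted choice that fib_select_multipath() makes among the live nexthops can be sketched in user space as below. This is a model of the idea, not the kernel's implementation: the nexthop structure is simplified, dead interfaces are represented by a plain flag, and rand() is used to draw a point in the cumulative weight space.

    #include <stdio.h>
    #include <stdlib.h>

    struct nexthop {
        const char *dev;
        int weight;
        int dead;                    /* non-zero if the interface is down */
    };

    /* Pick a nexthop with probability proportional to its weight, skipping
       dead entries; returns the index or -1 if no nexthop is usable. */
    static int select_multipath(const struct nexthop *nh, int n)
    {
        int total = 0;
        for (int i = 0; i < n; i++)
            if (!nh[i].dead)
                total += nh[i].weight;
        if (total == 0)
            return -1;

        int r = rand() % total;      /* random point in the cumulative weight space */
        for (int i = 0; i < n; i++) {
            if (nh[i].dead)
                continue;
            if (r < nh[i].weight)
                return i;
            r -= nh[i].weight;
        }
        return -1;                   /* not reached */
    }

    int main(void)
    {
        struct nexthop nh[] = { { "eth0", 2, 0 }, { "eth1", 1, 0 }, { "eth2", 1, 0 } };
        int hits[3] = { 0, 0, 0 };

        for (int i = 0; i < 40000; i++)
            hits[select_multipath(nh, 3)]++;

        /* eth0 should be chosen roughly twice as often as eth1 or eth2. */
        printf("eth0=%d eth1=%d eth2=%d\n", hits[0], hits[1], hits[2]);
        return 0;
    }

Over many selections each nexthop is chosen in proportion to its weight, which is the behaviour the strategies in Chapter 4 rely on.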
The transport layer mechanism will take care of the retrans— mission in this case. If a route is found, then a pointer to the dst structure in the route is set up in the packet. Now the IP header is built and added to the packet. The NF _IP.LOCAL-OUT netfilter hook is called which if successful invokes ip-queue_xmit2( 76. In ip_route-output() a key is formed from the destination address, source address, outgoing interface index and the TOS. The outgoing interface index is set if the socket is bound to a particular interface. Then ip_route_output_key()77 is invoked with the key. In ip_route_output_key() the route cache hash table is searched for a route. First a hash is generated from the source address, destination address, outgoing interface index and TOS of the packet. This is used for lookup in the route cache hash table. The corresponding bucket in the route cache hash table is traversed looking for a route that matches the destination address, source address, outgoing interface, incoming interface index being 0 and type of service flag. If firewall marking is enabled in the kernel (CONFIGJP_ROUTE_FWMARK = yes), then firewall mark is also matched with the route cache entry. If an entry is found, then the entry is updated with the last used time, number of times used and a pointer to the entry is made available in the struct rtable **rp passed as argument to the function and the function returns. If a route cache entry is not found, then ip_route_output_slow()78 is invoked. 75include / net / route.h 76 net / ipv4 / ip.output.c 77net/ipv4/route.c 78 net /ipv4/route.c 38 ip.route-output.slow() first checks if the packet has a source address bound to it, in which case it finds the device having that source address. Else it checks if the outgoing interface is specified for the packet, in this case, it selects an address bound to the interface. Then a check is made to see if the packet has a destination address. If it does not have a destination address, the source and destination addresses are set to the loopback address and the loopback device is chosen as the device to send out the packet on. If the previous tests fail, then fib_lookup() is invoked to find the route for the destination. This function has been discussed earlier in 3.2. If fib_lookup() returns an error but the outgoing interface is specified for the packet, the destination is assumed to be on the link and a link-scope source address is selected. If the route lookup fails and the outgoing interface is not specified for the packet, then the destination is unreachable and an error is returned. If the route lookup yields a local route, then the loopback device is selected for the outgoing interface. Else, if multipath is configured in the kernel and there are multiple nexthops available for the route, fib_select.multipath() is called to select a nexthop from the multiple nexthops available. If multipath is not enabled or if there is only one nexthop available for the route, the default nexthop is selected. At this point, the outgoing interface and the nexthop have been selected for the packet. A route cache entry is created and filled in with the destination address, source address, TOS, outgoing interface index, etc. The rth->u.dst.output which specifies the function used for outgoing packets on this route is set to ip-output(). If the route is a local route, then rth- >u.dst.input, which specifies the function used for incoming packets on this route, is set to ip_local-deliver(). 
ip_output() and ip_local-deliver() have been discussed in 3.2. Multicast packets and multicast routing are handled here which might change 39 the input and output functions setup above. Finally the route cache entry is inserted into the route cache with a call to rtjnternhash 0, which has been discussed in 3.2. In ip-queue_xmit2(), if the packet is larger than the PMTU79 of the route, then the ip_frag'mentO80 is called. Else, the IP checksum is computed with a call to ip_send-check( 81 and then skb->dst->output(skb) is called which maps to ip_output(). In ip_output(), ip-do.nat( 82 is invoked if NAT is configured in the kernel, else ip_finish_output()83 is invoked. ip_finish_output() invokes the NF _IP_POST _ROUTING netfilter hook which if successful calls ip_finish_output2( 84. In ip_finish_output2(), if a hardware header cache is available for the packet’s outgoing interface, the hardware header is added to the packet and then hh_output() function of the hardware header cache is invoked. If no hardware header cache is available for the packet and if the neighbour data is available for the route, then the output() of the neighbour data is invoked. If neither is available, the packet is dropped and an error is returned. In general, hh_output() function maps to dev_queue_xmit( 85. When a neighbour is deleted, hh_output() maps to neigh_blackhole()86 which just frees the packet buffer and returns an error. 79 Path Maximum Transmission Unit 80net /ipv4/ip_output.c 81 net / ipv4 / ip_output.c 8"’net/ipv4/ip.nat.dumb.c 83net/ipv4/ ip.output.c 84 net / ipv4 / ip..output.c 85net / core / dev.c 86net/core/dev.c 40 output() of the neighbour maps to different functions depending on the state of the neighbour. These functions provide address resolution functionality. If the neighbour is dead, it maps to neigh-blackhole(). If the neighbour is in a good, then it maps to neigh->ops->connected-output. For Ethernet devices, the connected-output() maps to neigh_resolve_output( 87. dev_queue_xmit() checks for the device queue and if the device has a queue, en- queues the packet for transmission by the device. It then runs the qdisc (qdisc_run ()88) attached to the device and returns either success or failure depending on whether the queuing was successful or not. Some types of devices, for example loopback, tunnels etc, do not have an associated queue. In such cases, the hardstaermitO associated with the device is invoked. This function maps to the routine which takes care Of the actual transmission of the packet from the interface and hence depends on the device driver. It is interesting to note that in'the case Of a loopback interface, the hard-start_xmit() invokes the netif_rx() simulating a packet reception by the loopback interface. It is in qdisc_run() that the queuing disciplines come into picture. This calls the qdisc_restart()89 which de—queues a packet from the device’s queue and calls the hardstaermitO to transmit the de-queued packet. The packet to be de—queued is decided by the qdisc installed on the interface and any traffic shaping setup on the interface. 87net / core / neighbour.c 88include / net / pkt _sched .h 89 net / sched / sch .generic . c 41 3.4 Packet Forwarding All packets which are being forwarded by the system follow the path taken by the incoming packets till they are passed to ip_forward(). This is the point where the incoming packets destined for the localhost and those being forwarded go their separate paths. 
At this point, the outgoing interface for the packet has been selected. The rout- ing table entry for the route chosen is available in (struct rtable*)skb->dst and the corresponding interface is available in rt->u.dst.dev. In ip_forward(), the packet is checked to make sure it is destined for a host and not a broadcast or other type of packet which should not be forwarded. The packet is dropped if it is not destined for a host. The IP Time-To—Live (TTL) is checked and if it is less than or equal to 1, the packet is dropped and an ICMP time exceeded error message is sent to the origin Of the packet. Other Options like strict source routing (SR), don’t fragment (DF) etc are also checked. ApprOpriate ICMP error messages are sent back and the packet is dropped if the packet fails any of the checks. If network address translation (NAT) is enabled on the outgoing interface, ip.do_nat()90 is invoked on the packet . If ip_do_nat() returns an error, the packet is dropped. Finally the NF _IP_F ORWARD netfilter hook is invoked on the packet. On success, this hook invokes ip_forwardjinish( 9’ on the packet. 90 net /ipv4/ip_nat -dumb.c 9’ net /ipv4 /ip_forward .c 42 If there are no IP Options, ip.forward.finish() invokes ip_send( 92 on the packet. If there are Options, then ip_forward_options()93 is invoked to process the Options and then ip_send() is invoked to send Off the packet. ip_send() calls ip_fragment()94 if the packet is larger than the Path Maximum Transmission Unit (PMTU) of the outgoing route which in turn fragments the packet and then calls ip_finish_output( 95 else calls ip_finish-output(). ip_finish-output() calls the NF _IP_POST -ROUTIN G netfilter hook and if the hook is successful it calls ipfinish_output2( 96. From here, the packet follows the same path as taken by a packet being transmitted from the system, which has been discussed in 3.3. 3.5 Netfilter Framework From the Linux netfilter Hacking HOWTOW: Netfilter is merely a series of hooks in various points in a protocol stack... The framework provides a number Of hooks which are called at different times during a packet’s journey through the protocol stack on the system as depicted in Figure 3.498. After checking the incoming packets for IP version, checksum, etc, ip_rcv() passes them through the NF_IP_PRE_ROUTING hook ([1] in Figure 3.4). After success- 92include / net / ip.h 93net/ipv4/ip.options.c 94 net / ipv4 / ip.output.c 95net/ipv4/ip_output.c 96 net / ipv4 / ip_output.c 9“’http: / / netfilter.org / documentation / HOWTO / / netfilter-hacking—HOWTO.html 98 Source: Linux netfilter Hacking HOWTO, http: / / www.netfilter.org / documentation / HOWTO / netfilter-hacking—HOWTO-3.html 43 lncomin g ——_>l1l '_"|ROUTE'I ——*[3I——>[4|—>Forwardcd/ Packets Outgoing Packets 3] “ [ROUTE] Local—bound Packets [[5] Locally—generated Packets Figure 3.4: Netfilter Hooks fully passing through the hook, the routing code decides whether the packet is destined for the local system or it has to be forwarded to another host. If the packet is for the local system, the NFJP_LOCAL_IN ([2] in the above Figure 3.4) is invoked from ipJocaLdeliverO. If on the other hand the packet is for a differ- ent system, ip_forward() calls the NFJPIORWARD ([3] in Figure 3.4). Packets which are generated by the local system first pass through NF _IP_LOCAL-OUT ([5] in Figure 3.4) netfilter hook called from ip_queue_xmit(). 
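The per-packet checks that ip_forward() applies before a packet reaches ip_forward_finish() can be modelled with a short userspace sketch. This is a simplified illustration of the logic described above, not the kernel code: the ip_pkt structure, the verdict names and the printf() calls standing in for ICMP error generation are all assumptions made for the example, and only the TTL and Don't Fragment/MTU checks are shown (strict source routing and the other option checks are omitted).

    #include <stdint.h>
    #include <stdio.h>

    /* Minimal view of the fields the forwarding checks look at (invented names). */
    struct ip_pkt {
        uint8_t  ttl;
        int      df;                 /* 1 if the Don't Fragment bit is set */
        unsigned len;
    };

    enum fwd_verdict { FWD_OK, FWD_DROP_TTL, FWD_DROP_MTU };

    /* Decide whether a packet may be forwarded on a route with the given MTU:
       a TTL of 1 or less triggers an ICMP Time Exceeded, and an oversized
       packet with DF set triggers an ICMP Fragmentation Needed. */
    static enum fwd_verdict forward_check(struct ip_pkt *p, unsigned route_mtu)
    {
        if (p->ttl <= 1) {
            printf("send ICMP time exceeded\n");
            return FWD_DROP_TTL;
        }
        if (p->df && p->len > route_mtu) {
            printf("send ICMP fragmentation needed (mtu %u)\n", route_mtu);
            return FWD_DROP_MTU;
        }
        p->ttl--;                    /* decrement the TTL before output */
        return FWD_OK;
    }

    int main(void)
    {
        struct ip_pkt ok = { 64, 0, 1400 }, expired = { 1, 0, 100 }, big = { 10, 1, 1600 };
        printf("ok:      %d\n", forward_check(&ok, 1500));
        printf("expired: %d\n", forward_check(&expired, 1500));
        printf("big+DF:  %d\n", forward_check(&big, 1500));
        return 0;
    }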
Locally generated pack— ets destined for other system as well as forwarded packets finally pass through the NF _IP_POST_ROUTING hook ([4] in Figure 3.4). The hooks can return one Of the following results: accept, drop, stolen, queue or repeat. If accepted, the packet continues its journey through the protocol stack. If the verdict returned is drop, then the packet is dropped and the traversal is aborted. Stolen indicates that the hook has taken control over the packet and that the protocol stack should not continue processing the packet. Queue provides a mechanism to 44 queue the packet for user-level handling Of the packet. Repeat causes the hook tO be called again on the packet. Any module can register itself at any of the hooks. During the registration, a function is specified which will be invoked when the hook is called. This function will be given a pointer to the packet and should return one Of accept, drop or queue verdicts. During the registration, the priority of the function for that book is spec- ified. When the hook is invoked, the registered functions are called in the order Of their priority. For example, the following modules register their functions with the NF_IP.LOCAL-OUT (in the order of priority/call): Conntrack, Mangle, NAT (desti- nation NAT), F ilter99. Conntrack module handles the connection tracking mechanism in Linux while the Filter module is responsible for filtering packets based on differ- ent criteria like source and destination address, source and destination port number, ICMP type, etc. The filter module is used to setup rules to configure firewall on the system. Now we discuss the connection tracking mechanism in the netfilter framework. This mechanism is later used in our work to identify the connection to which a packet might belong. ip-conntrack_in( ’00 is registered at NF _IP_PRE_ROUTING hook. A packet after passing the basic IP checks comes here where it is checked if a connection is already associated with this packet (probably coming on a loopback device), in which case NF_ACCEPT is returned. If the packet is a fragment, then ip_ct_gather_frags( “’1 is invoked to gather the fragments. Then the protocol of the transport layer is iden- tified from the packet. If the packet is an ICMP error message, it is dealt with 99include/linux/netfilter_ipv4.h 100net/ipv4/netfilter/ip_conntrack.core.c ‘01 net / ipv4/netfilter/ip-conntrack.core.c 45 here and NF _ACCEPT is returned. Then resolvemormal_ct( 102 is invoked to get a connection associated with the packet. If resolvemormaLctO fails, NF _ACCEPT is returned. The proto->packet(), where proto is a ip_conntrack..protocol (discussed in 3.1.23) structure, is invoked. This function maps to different functions for dif- ferent protocols, for example, for TCP it maps to tcp_packet( )103 and for UDP it maps to udp_packet()1°4. After some more checks, if there is a helper function reg- istered for the packet’s protocol, it is invoked. Finally, if resolve_normal.ct() had indicated that the packets have been seen in both the directions for the connection, the IPS_SEEN.REPLY.BIT105 bit is set in the connection status and the function returns the result of the call to proto—>packet(). resolveJIormaLctO first forms the IP connection tracking tuple (ip_conntrack_tuple as discussed in 3.1.22) for the packet. If the tuple cannot be formed an error in the form of NULL is returned. Else ip-conntrack.find-get( )106 is invoked to find any existing connection corresponding to the tuple. 
If a connection is not found for the tuple, then init_conntrack() (net/ipv4/netfilter/ip_conntrack_core.c) is called to create a new connection for the tuple. If there is any problem creating a new connection, then an error is returned. Once we have a connection corresponding to the tuple (either a new connection or an existing one), various checks are performed to set the status of the connection: for example, whether this packet is in the original direction or is a reply packet, and, if a reply has been seen, the connection is set to the established status. The function then returns the connection associated with the tuple.

102 net/ipv4/netfilter/ip_conntrack_core.c
103 net/ipv4/netfilter/ip_conntrack_proto_tcp.c
104 net/ipv4/netfilter/ip_conntrack_proto_udp.c
105 include/linux/netfilter_ipv4/ip_conntrack.h
106 net/ipv4/netfilter/ip_conntrack_core.c

init_conntrack() is the function where a new connection is created for a packet. The tuple (ip_conntrack_tuple, discussed in 3.1.22) passed to this function is used to generate a hash for placing the connection in the connection hash table. If the number of connections exceeds the maximum number of connections possible, a connection is dropped, either randomly from the table or from the chain into which the new connection will be put. Next, the tuple for the reverse direction of the packet is generated by calling invert_tuple() (net/ipv4/netfilter/ip_conntrack_core.c). If the inverse tuple generation fails, then an error is returned without putting the new connection into the connection table. Otherwise a new connection is allocated and its members filled in. The tuple and the connection information are filled in for both the original and the reverse direction packets. The new() function of the packet's transport layer protocol is invoked to set up protocol-specific information in the connection structure. After filling in other members of the connection, such as timeout information, the connection's tuplehash (discussed in 3.1.19) is returned.

As remarked earlier, the above discussion might be overwhelming to some readers; that is to be expected. However, section 4.1 discusses the strategies at a higher level and should be easier to follow than the above discussion.

Chapter 4

Channel Balancing Algorithms

In this chapter, we look at four different strategies to balance the outgoing traffic on an equal cost multipath route. First we discuss the strategies and then their implementation details. All the strategies require the kernel to have the multipath feature (CONFIG_IP_ROUTE_MULTIPATH = yes) enabled.

4.1 Strategies

For the discussion of the strategies, we assume that some source hosts (srcA, srcB, srcC, etc.) have more than one route to some destination hosts (dst1, dst2, dst3, etc.) using multiple interfaces (eth0, eth1, eth2, etc.) on the gateway/router. We assume that these routes have equal metrics, i.e. equal cost. If the cost of the routes were unequal, then the best route would be used and the others discarded (some router vendors also provide a feature to balance traffic across unequal routes). Therefore, for the following discussions, we assume more than one equal cost route to the destination hosts (see Figure 4.1).

[Figure 4.1: Multiple Routes between hosts — the source hosts reach the destination hosts through the router/load balancer, which has multiple uplinks via intermediate routers into the Internet cloud.]

4.1.1 Route Based Channel Balancing

In this strategy, load balancing is considered when the route for the destination is looked up. For example, when srcA causes a route lookup on the router/load balancer for dst1, the first route, via eth0, might be used. As long as the route cache entry for dst1 is valid, all data for dst1 will be routed over the same path.
When a new route is looked up, say for dst2, the load balancing algorithm will select the next equal cost path, via eth1. Now all the data for dst2 will be routed along eth1 as long as its route cache entry is valid.

The multipath route can be configured to have different weights for the different equal cost routes. For example, the route via eth0 might be given a weight of 2, while the rest are given a weight of 1. This will cause eth0 to be used twice as often as the others during route lookup. Further, with equal weights, if there are connections to dst1, dst2, dst3 and dst4, the connections to dst1 and dst4 will use eth0 while dst2 and dst3 will use eth1 and eth2 respectively.

4.1.2 Per-packet Based Channel Balancing

In this strategy, load balancing is applied for every packet. When a packet needs to be transmitted or forwarded to a destination for which there are multiple equal cost routes available, the load balancer will select the appropriate route. For example, if we have 3 packets to be forwarded to dst1 and we have 3 equal cost routes to dst1, then one packet will be sent over each route, thus balancing the traffic. The balancing is not restricted to a specific destination. If there were 3 packets, one destined for dst1 and two destined for dst2, then the first packet would be sent over the first route, the second packet over the second route, and the third packet over the third route. Thus one packet would be sent over each route.

Per-packet based channel balancing therefore does not consider the destination or the source information when balancing the traffic. It considers each packet individually and picks a route for it. This may cause packets belonging to the same connection (say a web upload session) to be sent out over different routes. In this strategy also, the administrator has the choice to assign different weights to each participating route and thereby control how much traffic is sent over each route.
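The per-packet behaviour described above can be illustrated with a small userspace model. This is only a sketch of the idea, not the kernel implementation: it uses a simple deterministic counter over a weighted link list, whereas the kernel's selection in fib_select_multipath() is randomized, and the link and pick_per_packet names are invented for the example.

    #include <stdio.h>

    struct link { const char *dev; int weight; };

    /* A running counter walks the weighted link list one slot per packet, so
       consecutive packets go out on different links regardless of destination. */
    static int pick_per_packet(const struct link *links, int n, unsigned *counter)
    {
        int total = 0;
        for (int i = 0; i < n; i++)
            total += links[i].weight;

        int slot = (int)((*counter)++ % (unsigned)total);
        for (int i = 0; i < n; i++) {
            if (slot < links[i].weight)
                return i;
            slot -= links[i].weight;
        }
        return 0;                    /* not reached */
    }

    int main(void)
    {
        struct link links[] = { { "eth0", 1 }, { "eth1", 1 }, { "eth2", 1 } };
        const char *dst[] = { "dst1", "dst2", "dst2", "dst1", "dst1" };
        unsigned counter = 0;

        for (int i = 0; i < 5; i++) {
            int l = pick_per_packet(links, 3, &counter);
            printf("packet %d to %s -> %s\n", i + 1, dst[i], links[l].dev);
        }
        return 0;
    }

Running the demo shows consecutive packets cycling over eth0, eth1 and eth2 regardless of whether they are destined for dst1 or dst2.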
4.1.3 Session Based Channel Balancing

We define a session as a collection of related packets transferred between two hosts, with the source initiating the connection to the destination. In other words, a TCP connection and a UDP connection are examples of a session. In the IPv4 family, the source and destination IP addresses, together with the source and destination ports for TCP and UDP or the type and code for ICMP, define a connection.

In this strategy, the load balancing is done based on sessions. When a session is initiated (either on the load balancer or on a host which sends data to the load balancer for forwarding) and comes in for routing, the load balancer selects a route from the multiple equal cost routes. From then on, all arriving packets which belong to that session are routed along the same route. When another session comes in for routing, the load balancer selects another route. We use the connection tracking mechanism of the netfilter framework (http://www.netfilter.org/) to determine whether a packet belongs to an existing connection or is a new connection.

For example, consider a TCP session initiated from srcA to dst1. The port numbers involved are 8011 (source port) and 22 (destination port). The connection can be specified by the tuple (srcA, 8011, dst1, 22). When this session is initiated, the load balancer will select a route, say via eth0. All packets which belong to this session will then be routed along eth0. If another session, defined by a different tuple, comes in for routing, the load balancer will now select another route, say via eth1, and all packets of that session will then be routed along eth1.

4.1.4 Bandwidth Usage Based Channel Balancing

In this strategy, the current bandwidth usage on the routes is used to decide the route for the current packet. The bandwidth usage on each route (interface) is measured periodically and weight-averaged over a period of time. When a packet comes in for routing, the route (interface) that has the maximum unused bandwidth available is used. Similar to the strategy in section 4.1.2, this strategy also considers each packet individually. This may cause the packets belonging to the same connection to be routed over different routes, which may result in out-of-order delivery at the destination.

For example, if there are two 10Mbps routes to a destination, say through eth0 and eth1, and their bandwidth usage is 3Mbps and 4Mbps respectively, then the next incoming packet will be sent out on eth0 because it has more unused bandwidth, 7Mbps.
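The session based strategy of section 4.1.3 can be modelled in user space as follows. This is a self-contained sketch, not the thesis implementation: the kernel version described later in section 4.2.3 stores the chosen outgoing interface in the conntrack tuple rather than in a separate table, and the session and route_session names, the fixed-size table and the plain round-robin choice are assumptions made for the example.

    #include <stdio.h>
    #include <string.h>

    #define MAX_SESSIONS 16
    #define N_LINKS      3

    /* A session is identified by the tuple described in section 4.1.3; oif
       remembers the link chosen when the session was first seen. */
    struct session {
        char src[16], dst[16];
        int sport, dport;
        int oif;
    };

    static struct session table[MAX_SESSIONS];
    static int n_sessions;
    static int rr;                              /* round-robin over the links */

    static int route_session(const char *src, int sport, const char *dst, int dport)
    {
        for (int i = 0; i < n_sessions; i++) {
            struct session *s = &table[i];
            if (!strcmp(s->src, src) && !strcmp(s->dst, dst) &&
                s->sport == sport && s->dport == dport)
                return s->oif;                  /* existing session: reuse its link */
        }
        if (n_sessions == MAX_SESSIONS)
            return rr++ % N_LINKS;              /* table full: plain round-robin */

        /* New session: pick the next link and remember the choice. */
        struct session *s = &table[n_sessions++];
        snprintf(s->src, sizeof s->src, "%s", src);
        snprintf(s->dst, sizeof s->dst, "%s", dst);
        s->sport = sport;
        s->dport = dport;
        s->oif = rr++ % N_LINKS;
        return s->oif;
    }

    int main(void)
    {
        const char *dev[N_LINKS] = { "eth0", "eth1", "eth2" };
        printf("%s\n", dev[route_session("srcA", 8011, "dst1", 22)]);   /* new session  */
        printf("%s\n", dev[route_session("srcA", 8012, "dst1", 22)]);   /* new session  */
        printf("%s\n", dev[route_session("srcA", 8011, "dst1", 22)]);   /* same session */
        return 0;
    }

The first two calls represent new sessions and receive different links; the third call belongs to the first session and reuses its link.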
If the result Of the search is a multipath route, then the fib_select.multipath() is invoked which selects the nexthop based on the weights assigned to the nexthops. The chosen nexthop is added to the route cache so that packets destined to the same destination can now be routed along the chosen route without causing a route lookup. 4.2.2 Per-packet Based Channel Balancing This strategy is not available in the stock Linux kernel and the nano5 patch should be applied to the kernel and the kernel rebuilt. When rebuilding the kernel, the multipath feature (CONF IG_IP_RO U TE..M ULT IPATH = yes) should be enabled. The following command will setup per-packet based load balancing with equal weights to each participating route (notice the use of the equalize keyword in the command): # ip route add equalize 20.0.0.0/24 \ nexthop via 10.0.1.2 dev ethO weight 1 \ nexthop via 10.0.2.2 dev eth1 weight 1 \ nexthop via 10.0.3.2 dev eth2 weight 1 The nano patch changes the route lookup code. When a packet arrives destined for a host for which there is no route in the route cache, the FIB tables are searched as in section 4.2.2. If a multipath route is found, then before adding the route entry into the 5http: / /trash.net / kaber / equalize / equalize_2.4.18.patch 54 Packet arrival for routing _No Search FIB l tables Route found in FIB tables? Route found in route cache? Discard packet Select nexthop (weighted—round -robin) Equalize‘ flag set? Multipath Route? / Send packet on / No the selected route W 1 Insert route into route cache End Figure 4.4: Per-Packet Based Channel Balancing route cache, it is marked with RTCFEQUALIZE flag. Later when this route cache entry is selected to route another packet, the route cache entry is deleted and the FIB tables are searched again. This makes sure that all packets using the multipath route cause a search of the FIB tables and the packets are distributed over the multiple nexthOps according to the weights. The route cache lookup routine for forwarded packets ip_route_inputO, ip.route.input.slow() which sets up the route cache entry for forwarded packets, ip_route_output_810w( ) which sets up the route cache entry for 55 Route found in route cache? Equalize flag set? Packet arrival for routing I Yes No —l Search FIB tables Send packet on the selected route Select ‘ previously nexthop session_debug set? No NO Select nexthop e1 ghted—roun —rlobin) acket belongs to existing sessmn? End / Insert route into / route cache Figure 4.5: Session Based Channel Balancing outgoing packets, ip_route_output_key() which searches the route cache for outgoing packets are modified to implement this strategy. 4.2.3 Session Based Channel Balancing This implementation requires the application of the nano patch to the stock Linux kernel. 56 This strategy requires the connection tracking mechanism (CON- FIGNETFILTER, CONFIGJPJVRCONNTRACK and CONFIGJPNPLFTP) to be configured in the kernel and not as separate modules. The ip-conntrack_tuple structure was modified to record the last chosen outgoing interface for the connection (oif). oif is initialized to -1 in resolve_normal-ct() before calling init_conntrack() for a new session. When the new connection is created in init_conntrack(), off of the reply packet’s tuple is initialized to -1. A flag, session-debug, is added to the kernel to indicate that session based channel balancing is configured. This flag can be set by inserting the module session_debug.o which was written separately. 
In ip_route_input_slow(), if the FIB table search for the packet's destination yields a multipath route, then we check whether session_debug is set. If set, we check whether the packet is associated with a connection. If a connection is found for the packet, we get the IP connection structure (ip_conntrack) of the packet. We now have information about the packet's direction as well as the outgoing interface used by the packet's connection. If the outgoing interface is not set (oif == -1), this indicates that the packet is part of a new connection; here we call fib_select_multipath_session() (net/ipv4/fib_semantics.c). If the outgoing interface is set, indicating that the packet is part of an existing connection, we call fib_select_multipath_previous(). If fib_select_multipath_previous() returns an error, due to a dead interface or other reasons, we call fib_select_multipath_session() just as for a packet belonging to a new connection. If the packet did not have a connection associated with it, we call fib_select_multipath_session() to select the nexthop. fib_select_multipath_session() invokes fib_select_multipath() to select the nexthop according to the weights of the nexthops.

If session_debug was not set, we call fib_select_multipath() to select the nexthop, similar to the section 4.1.2 strategy.

We implement the new function fib_select_multipath_previous() (net/ipv4/fib_semantics.c), which searches the multipath route for the nexthop last used by the connection. It takes the outgoing interface index of the connection as one of its parameters. If the interface of the last used nexthop is still alive, then nh_sel of the route cache entry is set to indicate the selection of that nexthop and success is returned. If the nexthop corresponding to the last used interface is not found or if it is dead, then an error is returned. Finally, in ip_route_input_slow(), just before inserting the route cache entry into the route cache, the outgoing interface for the packet's connection is set to the interface on which the packet is being sent out.

To use this strategy, the administrator should insert the module session_debug.o to set the session_debug flag and then set up the multipath route:

    # insmod session_debug.o
    # ip route add equalize 20.0.0.0/24 \
        nexthop via 10.0.1.2 dev eth0 weight 1 \
        nexthop via 10.0.2.2 dev eth1 weight 1 \
        nexthop via 10.0.3.2 dev eth2 weight 1
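A minimal userspace sketch of the lookup that fib_select_multipath_previous() performs is shown below. It is a model of the logic just described, not the patched kernel code: the nexthop structure and its fields are simplified assumptions, whereas the real function works on the kernel's fib_info/fib_nh structures and records the selection in nh_sel of the fib_result.

    #include <stdio.h>

    struct nexthop {
        int oif;                     /* outgoing interface index */
        int dead;                    /* non-zero if the interface is down */
    };

    /* Return the index of the nexthop whose interface matches the one the
       connection used before, provided it is still alive; -1 tells the caller
       to fall back to a fresh (session based) selection. */
    static int select_multipath_previous(const struct nexthop *nh, int n, int conn_oif)
    {
        if (conn_oif < 0)
            return -1;               /* the connection has no interface recorded yet */
        for (int i = 0; i < n; i++)
            if (nh[i].oif == conn_oif && !nh[i].dead)
                return i;
        return -1;                   /* interface gone or dead: reselect */
    }

    int main(void)
    {
        struct nexthop nh[] = { { 2, 0 }, { 3, 0 }, { 4, 1 } };
        printf("oif 3  -> nexthop %d\n", select_multipath_previous(nh, 3, 3));
        printf("oif 4  -> nexthop %d\n", select_multipath_previous(nh, 3, 4));  /* dead */
        printf("oif -1 -> nexthop %d\n", select_multipath_previous(nh, 3, -1)); /* new  */
        return 0;
    }

A return value of -1 corresponds to the error case above, in which the caller falls back to fib_select_multipath_session().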
4.2.4 Bandwidth Usage Based Channel Balancing

To implement this strategy, in addition to the kernel modifications, a kernel module was implemented to measure and estimate the current bandwidth usage on all the interfaces. This strategy also requires the application of the nano patch to the stock Linux kernel. The net_device structure was modified to implement the bandwidth estimator: a bwusage structure (include/net/bwestimator.h) was added to net_device. bwusage holds various statistical information about the interface: its name (name), the transmission rate in bits/sec and packets/sec (tx_bps and tx_pps respectively), the average transmission rate in bits/sec and packets/sec (tx_avbps and tx_avpps respectively), the total bytes and packets sent out (tx_total_bytes and tx_total_packets respectively) and the bytes and packets sent out during the last measurement interval (tx_bytes and tx_packets respectively). The measurement interval is set to 1 second and can be configured in the bandwidth estimation module.

[Figure 4.6: Bandwidth Usage Based Channel Balancing — flowchart: as in Figure 4.4, but for every packet using the multipath route the nexthop with the maximum unused bandwidth is selected instead of the weighted round-robin choice.]

In ip_route_input_slow(), if the FIB table search for the packet's destination yields a multipath route, then we check whether bw_debug is set. This flag indicates the selection of the bandwidth usage based channel balancing strategy and is set by inserting the bw_debug.o module into the kernel; bw_debug.o is a kernel module written for this thesis. If the flag is set, we call fib_select_multipath_bandwidth() (net/ipv4/fib_semantics.c) to select the nexthop for the packet. If bw_debug is not set, we call fib_select_multipath() to select the nexthop, similar to the section 4.1.2 strategy.

In fib_select_multipath_bandwidth(), we iterate over all the available nexthops and identify the one with the maximum unused bandwidth. The bandwidth information is available from the bwusage structure of the interface associated with the nexthop. The nexthop with the maximum unused bandwidth is then selected to send the packet, and nh_sel of the fib_result is set to indicate the selection.

To use this strategy, the administrator should insert the module bw_debug.o to set the bw_debug flag and then set up the multipath route:

    # insmod bw_debug.o
    # ip route add equalize 20.0.0.0/24 \
        nexthop via 10.0.1.2 dev eth0 weight 1 \
        nexthop via 10.0.2.2 dev eth1 weight 1 \
        nexthop via 10.0.3.2 dev eth2 weight 1
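The bandwidth estimator and the selection in fib_select_multipath_bandwidth() can be approximated by the userspace sketch below. This is illustrative only: the thesis module samples every second and averages over an 8-second window, while the sketch uses an exponentially weighted average as one simple form of weighted averaging; the capacity_bps field, the alpha parameter and the bw_sample and select_max_unused names are assumptions made for the example (the real bwusage fields are listed above).

    #include <stdio.h>

    /* Per-interface transmit statistics, loosely following the bwusage fields
       listed above (only the bit rate is modelled here). */
    struct bwusage {
        const char   *name;
        unsigned long capacity_bps;  /* configured link capacity (assumed field) */
        unsigned long tx_bytes;      /* bytes sent since the last sample */
        double        tx_avbps;      /* weighted average transmit rate */
    };

    /* Called once per 1-second measurement interval: convert the byte counter
       to bits/sec and fold it into an exponentially weighted average. */
    static void bw_sample(struct bwusage *bw, double alpha)
    {
        double inst_bps = bw->tx_bytes * 8.0;
        bw->tx_avbps = alpha * inst_bps + (1.0 - alpha) * bw->tx_avbps;
        bw->tx_bytes = 0;
    }

    /* Pick the interface with the most unused bandwidth (capacity minus average use). */
    static int select_max_unused(const struct bwusage *bw, int n)
    {
        int best = 0;
        double best_unused = (double)bw[0].capacity_bps - bw[0].tx_avbps;
        for (int i = 1; i < n; i++) {
            double unused = (double)bw[i].capacity_bps - bw[i].tx_avbps;
            if (unused > best_unused) {
                best_unused = unused;
                best = i;
            }
        }
        return best;
    }

    int main(void)
    {
        struct bwusage bw[] = {
            { "eth0", 10000000UL, 0, 3000000.0 },   /* 3 Mbps of 10 Mbps in use */
            { "eth1", 10000000UL, 0, 4000000.0 },   /* 4 Mbps of 10 Mbps in use */
        };
        bw[0].tx_bytes = 500000;                    /* 4 Mbit sent this interval */
        bw_sample(&bw[0], 0.25);

        printf("eth0 average: %.0f bps\n", bw[0].tx_avbps);
        printf("next packet -> %s\n", bw[select_max_unused(bw, 2)].name);
        return 0;
    }

In the demo, eth0 still has the most unused bandwidth after the sample, so it is selected for the next packet, mirroring the example in section 4.1.4.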
4.2.5 Discussion of the strategies

Each of the strategies discussed above has certain points of interest. The route based strategy and the session based strategy ensure that the packets of a connection passing through the route will be forwarded in order, while the per-packet based and bandwidth based strategies do not ensure this; in those strategies the packets of a connection might be routed over different routes. While this might seem to affect TCP's timing mechanism, which controls the TCP window size (according to the sliding window protocol), it should be noted that the IP protocol in itself does not guarantee in-order delivery of packets. Thus, even though the packets passing through the load balancer might be routed in order in the route based and session based strategies, there is no guarantee that they will reach the destination host in order.

Another issue to note is that, except for the route based strategy, all the other strategies invoke a FIB table search for every packet. This might cause scalability issues when routing large amounts of data. But experimental results on a LAN show that these strategies can achieve the same routing rate as the route based strategy for small sized data transfers. We should note here that this implementation is not intended for large networks where the traffic is flowing at gigabit-per-second (Gbps) rates. In such situations, one must consider the effects of the FIB lookups for every packet in the per-packet based, session based and bandwidth based strategies.

Chapter 5

Experimentation Details

In this chapter we describe the testbed setup, the traffic pattern and type, and the statistics collected.

The testbed setup is shown in Figure 5.1. As shown in the figure, two servers (srcA and srcB) serve as the source of data and two hosts (dst1 and dst2) serve as the sink. These hosts (dst1 and dst2) have been configured with two additional loopback addresses (lo:1 and lo:2) each, to which the data is sent; for readability, we use the name of the interface configured with an IP address to denote that IP address. This gives us 4 different destination addresses. The data is sent from srcA and srcB to dst1:lo1, dst1:lo2, dst2:lo1 and dst2:lo2. The default route on srcA and srcB is configured via the channel balancer router (balancer). On balancer, the default route is a multipath route consisting of routes via the three intermediate routers (testa, testb and testc). Depending on the channel balancing strategy being tested, we either set or unset the EQUALIZE flag on the multipath route. On the intermediate routers (testa, testb and testc), routes are set up to route packets destined for the destination addresses dst1:lo1 and dst1:lo2 via dst1:eth0 of dst1. Similarly, packets destined for dst2:lo1 and dst2:lo2 are routed on the intermediate routers via dst2:eth0 of dst2. To be able to route any acknowledgment packets or other error packets, the intermediate routers are configured to route data destined for srcA and srcB over the link connecting them to balancer. Finally, dst1 is configured to send data destined for the source servers via testa and dst2 is configured to send data destined for the source servers via testb. This setup enables the routing of data between the source and destination hosts.

[Figure 5.1: Testbed Setup — network diagram showing the source servers srcA and srcB, the load balancer, the three intermediate routers testa, testb and testc, and the destination hosts dst1 and dst2 with their additional loopback addresses lo:1 and lo:2, together with the interface addressing of the test subnets.]

We use the Secure Copy program (scp) to transfer the data. The transfer is initiated from the source servers. The session based strategy requires that the connections be outgoing on the multipath route: if a connection had been initiated from the destination hosts, then the connection tracking mechanism would remember the incoming interface and mark that as the interface to use for packets going in the reverse direction. Hence, to balance the sessions on the multipath route, the connections must be initiated from inside the network and go out on the multipath route. This is achieved by using the scp command on the source servers similar to the following:

    ashoknn-src1:~> scp file1.dat nnashok@ashoknn-dst1:data/

The above command copies file1.dat from the ashoknn-src1 machine to the ashoknn-dst1 machine. Files of different sizes (8KB, 16KB, 64KB, 128KB, 512KB, 1MB, 2MB, 4MB, 8MB and 16MB) were transferred. The tests were conducted by sending 20 files each of the sizes 8KB, 16KB, 64KB, 128KB and 512KB. For each of the sizes 1MB, 2MB, 4MB, 8MB and 16MB, 5 files were transferred. Each source would transmit the files of a given size (say 8KB) to each of the destination addresses (dst1:lo1, dst1:lo2, dst2:lo1 and dst2:lo2). Thus, during the test for the 8KB file size, srcA and srcB would each send 20 files of 8KB size to each of dst1:lo1, dst1:lo2, dst2:lo1 and dst2:lo2.
We used iptables rules to determine how much data was transferred on each interface for each source-destination combination. In addition, for the bandwidth usage based strategy, a bandwidth usage estimator module was inserted into the kernel, which measured the bandwidth usage on each interface every second and averaged it over a period of 8 seconds.

Chapter 6

Results

Figures 6.1 through 6.11 show the bandwidth utilization on each interface during file transfers of different sizes under the different strategies.

[Figure 6.1: Bandwidth usage for File Size 8KB — bar chart of bandwidth utilization (Kbps) on each interface for the Route, Per-packet, Session and Bandwidth strategies.]

[Figure 6.2: Bandwidth usage for File Size 16KB — bar chart of bandwidth utilization (Kbps) on each interface for the four strategies.]

In route based channel balancing, we observe that two of the routes participating in the multipath route are used almost equally. However, we see an anomaly during the transfer of files of 2MB size. This might be due to the timing of the arriving connections, which caused the same route to be taken for more connections.

We observe that the bandwidth utilization on each interface is approximately the same when the per-packet based channel balancing strategy is used. This verifies the design of the strategy. Although we expected performance degradation due to possible out-of-order delivery of packets at the destination, it is not reflected in the experimental results. This could be due to the small number of hops involved (3 hops) in reaching the destination. If there were more hops, we might have been able to notice the effects of the out-of-order delivery of packets.

The session based strategy also shows equal utilization of each route. But this would be because, during each test, the same amount of data was transferred by each connection, thus equalizing the bandwidth utilization on the routes. The difference in the bandwidth utilization of the routes is visible in Figure 6.11, where the connections transferred different amounts of data. In this case, the timing of the incoming connections would also matter in the utilization of the routes. For example, consider three transfers of 1MB, 1KB and 16KB respectively. Balancing these sessions, let us say that the 1MB session is routed over eth1, the 1KB session over eth2 and the 16KB session over eth4. Now if an 8MB session came in, it would be routed over the first route, eth1, thereby raising the bandwidth utilization of eth1 drastically. Therefore, in this strategy, the bandwidth utilization of each session will have an impact on the bandwidth utilization of each route.

We notice that in all the transfers with the bandwidth based strategy, eth1 is not utilized. This is because the simulated bandwidth available for eth1 was only 1Mbps, whereas the other interfaces were configured to simulate 10Mbps. Thus the available bandwidth on eth1 was less than the available bandwidth on the other interfaces. If the bandwidth of the routes were comparable, then we could expect behavior similar to the per-packet based channel balancing strategy.

The transfer times involved in these tests (see Figure 6.12) indicate that the different strategies perform equally well when the size of the files transferred is small (up to 512KB). As the size of the file increases, the per-packet based and session based strategies show better performance.
We should remember that in all the cases except the mixed file size case, the file size in each connection was the same. This would cause the data transferred by every connection to be the same and hence the bandwidth utilization of the routes in the session based tests to be approximately the same.

[Figure 6.3: Bandwidth usage for File Size 64KB]
[Figure 6.4: Bandwidth usage for File Size 128KB]
[Figure 6.5: Bandwidth usage for File Size 512KB]
[Figure 6.6: Bandwidth usage for File Size 1M]
[Figure 6.7: Bandwidth usage for File Size 2M]
[Figure 6.8: Bandwidth usage for File Size 4M]
[Figure 6.9: Bandwidth usage for File Size 8M]
[Figure 6.10: Bandwidth usage for File Size 16M]
[Figure 6.11: Bandwidth usage for Various (mixed) File Sizes]
(Figures 6.3 through 6.11 are bar charts of bandwidth utilization (Kbps) on each interface for the Route, Per-packet, Session and Bandwidth strategies.)

[Figure 6.12: Transfer Times — grouped bar chart of transfer time (sec) versus file size for each channel balancing strategy.]

Chapter 7

Conclusions and Future Work

The experimental results indicate that the per-packet based and session based channel balancing perform better than the route based and bandwidth based channel balancing in the experimental setup. The bandwidth based channel balancing strategy tends to push more data onto the higher capacity links rather than trying to push data onto all the available links. Balancing traffic based only on the unused bandwidth might cause data to be pushed onto an interface at a rate above the threshold at which queuing delays on the interface begin to decrease performance. This in itself can be studied further as another channel balancing strategy, in which the percentage of unused bandwidth is sought to be equalized across the routes.

Due to the topology of the testbed, the effect of out-of-order delivery on the performance of the strategies is not clear in the tests conducted. Experiments simulating out-of-order delivery of packets can be conducted to study these effects. Looking at the testbed, we observe that the destination hosts were only three hops away from the source. The effect of varying the number of hops between the source and destination can also be taken up for future research.

Ideally we would like to see the performance of the different strategies in a real-world scenario. It is in the real-world scenario (or a well simulated environment) that the effectiveness of the strategies can be truly evaluated.
The variations in connection times, lengths, data transferred, destination addresses, number of hops, latency, out-of-order delivery of packets, etc., may cause the strategies to behave in a manner not brought out by the laboratory experiments. Also, experiments can be conducted to study the effect of the packet size on the performance of the strategies.

Other strategies can also be looked into as part of future work. One such strategy would be to look at the retransmission rates on the routes and pick the one with the least retransmission rate. The history of past connections can also be used to choose the route: if connections to a particular destination perform well on one of the routes, then future connections to that destination should be routed over the same route. As mentioned above, the bandwidth usage based strategy can be modified to use the percentage of unused bandwidth as an equalizing measure.

Bibliography

[1] Glenn Herrin. Linux IP Networking: A Guide to the Implementation and Modification of the Linux Protocol Stack. http://www.cs.unh.edu/cnrg/gherrin/linux-net.html, 2000.

[2] The Linux Virtual Server Project. http://www.linuxvirtualserver.org

[3] Bert Hubert. Linux Advanced Routing & Traffic Control HOWTO. http://lartc.org/lartc.html, 2002.

[4] Alexey Kuznetsov. Policy Routing. /usr/src/linux/Documentation/networking/policy-routing.txt, 1999.

[5] Thomas Davis. Linux Ethernet Bonding Driver mini-howto. /usr/src/linux/Documentation/networking/bonding.txt.

[6] Alavoor Vasudevan. The Linux Kernel HOWTO. http://www.tldp.org/HOWTO/Kernel-HOWTO.html, 2003.

[7] Rusty Russell and Harald Welte. Linux netfilter Hacking HOWTO. http://www.netfilter.org/documentation/HOWTO//netfilter-hacking-HOWTO.html, 2002.

[8] Christoph Simon. Nano-Howto to use more than one independent Internet connection. http://www.ssi.bg/~ja/nano.txt, 2001.

[9] Gianluca Insolvibile. Inside the Linux Packet Filter. http://www.linuxjournal.com/article.php?sid=4852, 2002.

[10] Gianluca Insolvibile. Inside the Linux Packet Filter, Part II. http://www.linuxjournal.com/article.php?sid=5617, 2002.

[11] Wensong Zhang. Linux Virtual Server for Scalable Network Services. http://www.linuxvirtualserver.org/ols/lvs.ps.gz, 2000.

[12] T. Brisco. RFC 1794 - DNS Support for Load Balancing. http://www.faqs.org/rfcs/rfc1794.html, 1995.

[13] Alessandro Rubini and Jonathan Corbet. Linux Device Drivers. O'Reilly and Associates, Inc., 2001.

[14] Daniel P. Bovet and Marco Cesati. Understanding the Linux Kernel. O'Reilly and Associates, Inc., 2002.