CHANNEL BALANCING STRATEGIES TO OPTIMIZE UPLINK UTILIZATION

By

Ashok Nalkund

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Computer Science and Engineering

2003

ABSTRACT

CHANNEL BALANCING STRATEGIES TO OPTIMIZE UPLINK UTILIZATION

By Ashok Nalkund

Linux®1 provides a rich set of features for networking, including routing, firewalling, traffic shaping and support for LAN and WAN interfaces. However, the support for balancing generic outgoing traffic across multiple uplinks is very basic: it is limited to balancing routes over multiple equal-cost links. In this thesis, we study, implement and compare different strategies to distribute outgoing traffic across multiple links. The strategies presented in this thesis are: per-packet distribution, per-route distribution, per-session distribution and bandwidth-usage-based distribution. The networking code in the Linux operating system was modified to implement these features. Experimental results indicate that the per-session and per-packet distribution strategies perform better than the other two strategies with respect to efficient utilization of bandwidth on the multiple links.

1Linux® is a registered trademark of Linus Torvalds.

COPYRIGHT

Copyright by ASHOK NALKUND (nalkunda@cse.msu.edu), 2003

DEDICATION

This work is dedicated to my brother, Sriharsha.

ACKNOWLEDGMENTS

First and foremost, I would like to express my heartfelt thanks to my adviser, Prof. Ni. None of this would have been possible if not for his guidance and help. I would like to take this opportunity to thank all the wonderful people who have contributed to making Linux such a unique operating system. Without the efforts of the open-source community, Linux might not have been the robust operating system it is. I would also like to thank my friends and fellow laboratory members. I would like to particularly mention my friends (in no order) Abhishek Patil, Pradeep Padala, Ravi Parimi and Smitha Kommareddi. My thanks to the faculty and staff of the Computer Science department for their support. Last but not least, I would like to thank my parents, my brother and my sisters for their support and confidence in me.

Contents

List of Figures

Chapters:

1. Introduction

2. Related Work
2.1 Linux Virtual Server Project
2.2 Round-Robin DNS
2.3 Policy Based Routing
2.3.1 Source Policy Routing
2.3.2 Routing with multiple uplinks
2.3.3 Load Balancing
2.4 EQL Driver: Serial IP Load Balancing
2.5 TEQL: "True" (or "trivial") link equalizer
2.6 Linux Ethernet Bonding Driver
2.7 Queuing Disciplines For Linux
2.7.1 Classless Queues
2.7.2 Classful Queues
2.7.3 Hierarchical Token Bucket (HTB)
2.8 Limitations of Related Work

3. Linux Networking Internals
3.1 Data structures
3.1.1 sk_buff
3.1.2 socket
3.1.3 sock
3.1.4 net_device
3.1.5 Qdisc
3.1.6 dst_entry
3.1.7 neighbour
3.1.8 rtable
3.1.9 fib_table
3.1.10 fn_hash
3.1.11 fn_zone
3.1.12 fib_node
3.1.13 fib_info
3.1.14 fib_nh
3.1.15 rt_key
3.1.16 fib_result
3.1.17 nf_conntrack
3.1.18 nf_ct_info
3.1.19 ip_conntrack_tuple_hash
3.1.20 ip_conntrack
3.1.21 ip_conntrack_info
3.1.22 ip_conntrack_tuple
3.1.23 ip_conntrack_protocol
3.2 Packet Reception
3.3 Packet Transmission
3.4 Packet Forwarding
3.5 Netfilter Framework

4. Channel Balancing Algorithms
4.1 Strategies
4.1.1 Route Based Channel Balancing
4.1.2 Per-packet Based Channel Balancing
4.1.3 Session Based Channel Balancing
4.1.4 Bandwidth Usage Based Channel Balancing
4.2 Implementation
4.2.1 Route Based Channel Balancing
4.2.2 Per-packet Based Channel Balancing
4.2.3 Session Based Channel Balancing
4.2.4 Bandwidth Usage Based Channel Balancing
4.2.5 Discussion of the strategies

5. Experimentation Details

6. Results

7. Conclusions and Future Work

Bibliography

List of Figures

2.1 Linux Virtual Server
2.2 Hierarchy of qdiscs
3.1 Neighbor Table Structure
3.2 FIB Table concepts
3.3 FIB Table Details
3.4 Netfilter Hooks
4.1 Multiple Routes between hosts
4.2 Nexthop Selection Logic in Multipath Route
4.3 Route Based Channel Balancing
4.4 Per-Packet Based Channel Balancing
4.5 Session Based Channel Balancing
4.6 Bandwidth Usage Based Channel Balancing
5.1 Testbed Setup
6.1 Bandwidth usage for File Size 8KB
6.2 Bandwidth usage for File Size 16KB
6.3 Bandwidth usage for File Size 64KB
6.4 Bandwidth usage for File Size 128KB
6.5 Bandwidth usage for File Size 512KB
6.6 Bandwidth usage for File Size 1M
6.7 Bandwidth usage for File Size 2M
6.8 Bandwidth usage for File Size 4M
6.9 Bandwidth usage for File Size 8M
6.10 Bandwidth usage for File Size 16M
6.11 Bandwidth usage for Various (mixed) File Sizes
6.12 Transfer Times

Chapter 1

Introduction

The Internet has revolutionized the way people live, work, entertain and communicate. A large number of homes have network connectivity these days. Network connectivity has become cheap and affordable, and it is also critical to the performance of many small and medium businesses. Having redundant network connections is one way to ensure increased connectivity and hence increased network reachability. One would like to use the redundant network connections efficiently under normal circumstances and fall back on the other connections only in emergency situations, such as one of them going down. Under normal circumstances, efficient balancing of the traffic across the redundant network connections is important to keep utilization high and costs low1.

In this thesis, we study different strategies to balance the outgoing traffic across multiple network connections. Various commercial networking equipment manufacturers provide advanced routing features in their equipment. However, such equipment is very expensive for organizations with small budgets. Linux is a free operating system which provides a rich set of networking features, but the features available for balancing outgoing traffic are not powerful enough, and hence only very basic outgoing traffic balancing can be achieved. There are several projects related to load balancing implemented under Linux, but these projects have certain limitations which restrict their application (see chapter 2).

1Network connections can be billed differently: bandwidth allocated, bytes transferred, etc.

In this research, we intend to study, implement and test different strategies for balancing outgoing traffic across multiple network connections. The goal is to effectively utilize the available network connections without too much performance degradation of the applications depending on the network.

In chapter 2 we describe some of the related work. In chapter 3 we delve into the internal details of the networking code in Linux. The design and implementation of the different strategies are discussed in chapter 4. Chapter 5 discusses the experimentation setup, configuration and testing strategies. Results of the experimentation are discussed in chapter 6. Finally, we present the conclusions and some ideas for future work related to this research in chapter 7.

Chapter 2

Related Work

One often confuses load balancing of incoming traffic with load balancing of outgoing traffic. While there are many technologies for implementing incoming load balancing, techniques to implement outgoing load balancing are very few. In this chapter we describe some of the load-balancing techniques for both incoming as well as outgoing traffic. This will give the reader a better understanding of the differences between the two scenarios and also an understanding of how load balancing is achieved.
2.1 Linux Virtual Server Project Quoting the official website of the Linux Virtual Server Project(LVS)[2]: The Linux Virtual Server is a highly scalable and highly available server built on a cluster of real servers, with the load balancer running on the Linux operating system. The architecture of the cluster is transparent to end users. End users only see a single virtual server. This is an incoming-load-balancer. As mentioned above, the LVS project provides a load-balancer which distributes the incoming requests among many internal servers which are invisible to the end users. Figure 2.11 illustrates a typical Linux Virtual Server environment. The load-balancer running the LVS accepts incoming requests 1Source: Linux Virtual Server Project. http:/,lwww.linuxvirtualserver.org. Internet ' * Load Balancer Linux Box Figure 2.1: Linux Virtual Server and then routes them to the real servers according to different configured schedulers: round-robin, weighted-round-robin, least connection, weighted-least-connection and persistent client connection. Load-balancing can be configured for individual proto- col and port combination. The real servers can be added or removed transparently making the architecture very scalable. The Linux Virtual Server is implemented in three ways: NATZ, IP Tunneling and Direct Routing. In NAT, the load-balancer has a public IP address and the real-servers have private IP addresses. The clients connect to the load-balancer which then performs DNAT3 on the packets and routes 2Network Address Translation 3Destination Network Address 'Iiranslation them to the internal servers. The disadvantage of this implementation is that every request and response packet has to be rewritten by the load-balancer which becomes a bottleneck. With IP tunneling, the load-balancer schedules requests to the different real servers, and the real servers return replies directly to the clients. This allows the load-balancer to handle huge amounts of requests, scaling to over 100 real servers. Due to the very low overhead involved at the load-balancer, IP Tunneling allows a load-balancer to serve as a very high-performance virtual server. The third technique is to use Direct Routing. Here also the load-balancer only processes the client-to- server half of the connection“ and the server responses can follow any network path to the clients. This method does not have the overhead of tunneling and hence can scale very well. 2.2 Round-Robin DNS Round-Robin DNS [12] is another incoming-load-balancing technique used for balancing the incoming requests among a set of servers. Here a single name is mapped to different IP addresses in a round-robin fashion, thereby controlling which server a client connects to. For every client that requests a FQDN5-to-IP mapping, in other words a DNS lookup, a different server IP address will be chosen from the list in a round-robin manner. However, one must remember that these mappings are cached on the client system as well as on the intermediate DNS servers which might have run the query at the clients’ request. Thus clients which happen to send the queries to the same intermediate DNS servers will be routed to the same IP address and hence the load-balancing is not perfect. Also the Time To Live (T T L) for the DNS record 4Often the requests are very short followed by long responses from the server. 5Fully Qualified Domain Name is difficult to choose in these settings. 
If a large value is chosen, then the same IP address will be provided for a longer time which skews the load-balancing towards this IP address. On the other hand, if the TTL value is small, the load-balancing will be more efficient but the DNS queries themselves will be a bottleneck. Also depending on the amount of traffic caused by different clients, the load-balancing will be different. Thus even with the IP address rotation, the servers might be loaded differently due to the difference in the clients’ requests. Further, once a server fails, the client which receives the IP address of the failed server will be unable to connect to the server even if it tries to reconnect. Only after the TTL expires and the entry for the dead server is removed from the list of IP address at the DNS server6 will the client be able to connect to the server. 2.3 Policy Based Routing Policy Based Routing [4, 3] uses policies or ‘rules’ to make a routing decision for a packet. This feature is available in Linux by compiling the kernel with the “IP: advanced router” (CONFIGJPADVANCEDJZOUTER = yes) and “IP: policy routing” (CONFIGJP_MULTIPLE_TABLES = yes) features. The routing policy database allows multiple routing tables on a Linux router. The appropriate table is looked up as specified by the ‘rules’. The iproute27 package is available for manipu- lating the routing policy database and the routing tables. By default, the kernel has three tables: local, main and default (table ID 253 denotes the default table). Fol- lowing is the output produced by running the ‘ip rule’ command on a Linux system: # ip rule 6It might so happen that by the time the TTL expires, other client requests have caused the dead server’s IP addresses to be the one to be given to the next request. 7ftp://ftp.inr.ac.ru/ip-routing/ 0: from all lookup local 32766: from all lockup main 32767: from all lockup 253 # As shown above, by default all the rules apply to all the packets. We can generate new rules and override the default routing of packets. Following are the contents of the three tables on a machine which has two interfaces (ethO is down, ethl is up)8: # ip route show table local broadcast 67.167.130.128 dev ethl proto kernel \ scope link src 67.167.130.153 broadcast 127.255.255.255 dev lo proto kernel \ scope link src 127.0.0.1 local 67.167.130.153 dev ethl proto kernel \ scope host src 67.167.130.153 broadcast 67.167.130.255 dev eth1 proto kernel \ scope link src 67.167.130.153 broadcast 127.0.0.0 dev lo proto kernel \ scope link src 127.0.0.1 local 127.0 0.1 dev lo proto kernel \ scope host src 127.0.0.1 local 127.0.0 0/8 dev lo proto kernel \ scope host src 127 0.0.1 # # ip route show table main 8The lines have been broken for readability 67.167.130.128/25 dev eth1 proto kernel \ scope link src 67.167.130.153 127.0 0.0/8 dev lo scope link default via 67.167.130.129 dev ethl # # ip route show table 253 # 2.3.1 Source Policy Routing Source Policy Routing [3] balances the traffic based on the source IP address. We create new tables and add routes to the tables. Policy rules are created that specify the table to lookup for particular source IP address. 
# echo 201 Test >> /etc/iproute2/rt_tables # ip rule add from 192.168.1.100 table Test # ip rule 0: from all lookup local 32765: from 192.168.1.100 lookup Test 32766: from all lookup main 32767: from all lookup 253 # Now we add the appropriate routing entries in the new table: # ip route add default via 192.168.1 1 dev ethO table Test # ip route show table Test default via 192.168.1.1 dev ethO # 2.3.2 Routing with multiple uplinks If multiple uplinks are available, then one would like to make sure that packets arriving on an interface are answered on the same interface. This can be achieved by setting up split access [3] on these multiple interfaces. We create routing tables for each interface and create routes for network and default route in each table. Then we add rules specifying that packets arriving from any of the connected networks should be answered on the same interface by causing the corresponding table to be looked up. Routing entries for table T1: # ip route add $P1_NET dev $IF1 src $IP1 table T1 # ip route add default via $P1 table T1 Routing entries for table T2: # ip route add $P2_NET dev $IF2 src $IP2 table T2 # ip route add default via $P2 table T2 Reply on the same interface: # ip route add $P1_NET dev $IF1 src $IP1 Reply on the same interface: # ip route add $P2_NET dev $IF2 src $IP2 Default route: # ip route add default via $P1 2.3.3 Load Balancing With the above mentioned split access in place, one can setup crude load-balancing by specifying a multipath route as the default route. With a multipath route, the kernel will balance the route lookup over all the nexthops specified in the multipath route in accordance with the weights assigned to each. # ip route add default nexthop via 10.0.1.2 dev ethO weight 1\ nexthop via 192 168 1.2 dev eth1 weight 1 The above multipath route will equalize the routes over ethO and ethl equally. 2.4 EQL Driver: Serial IP Load Balancing EQL is a device driver available in Linux which provides a software device to load-balance IP serial links (SLIP or uncompressed PPP) to increase the bandwidth. This is useful in cases where we have two or more modems and want to increase the bandwidth by binding the modems together. We compile the kernel with the eql patch9 and configure an eql interface with an IP address. Then the default route is set to point to the eql device. The devices are then “enslaved” with the eql-enslave command. Devices are freed using the eql_emancipate command. 2.5 TEQL: “True” (or “trivial”) link equalizer TEQL [3] is a new virtual device available in Linux which equalizes the traffic going out on the physical slave interfaces. For this virtual device to be used, the slave devices must be active, i.e be able to raise the busy signal. This device is capable of equalizing physical interfaces which have varying bandwidths but it is advisable 9ftp: / / slaughter.ncm.com / pub / Linux / LOADBALAN CING / eql-1.1.tar.gz 10 that the bandwidths should be comparable to avoid reordering problems. To use this feature, first we insert the sch_teql module which creates a new device teqlN and a new qdisc (see section 2.7) with the same name. The physical interfaces can now be enslaved to this device and the desired routing setup. # tc qdisc add dev ethO root tequ # tc qdisc add dev eth1 root tequ # ip link set dev tequ up The slave interfaces are configured as they would normally be and the default route is set to point to the teql interface. 
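For completeness, a minimal sketch of that last step is shown below. The teql0 name refers to the virtual device created by the sch_teql module above; a "via" nexthop address reachable over the equalized links could also be given, but any such address would be site-specific.

# ip route add default dev teql0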
The most restricting requirement in this technique is that both ends of the links are required to participate for it to work properly. This is sometimes not possible if the ISP is not very forthcoming with accommodating customers’ requests. 2.6 Linux Ethernet Bonding Driver The Ethernet bonding driver provides for bonding Ethernet channels together. Its called ‘Etherchannel’ by Cisco, ‘Trunking’ by Sun and ‘Bonding’ [5] in Linux. Mul- tiple Ethernet connections can be ‘bonded’ together to act as a single channel with increased bandwidth. This requires the other end of the connection also to support bonding. Although this seems similar to the EQL driver (section 2.4), this driver manages Ethernet segments while the EQL driver manages serial lines. The kernel needs to be compiled with the ‘Bonding Driver Support’ (CONFIGBONDING). A bonding network interface is defined in the system configuration files similar to the other network interfaces”. The IP information is provided for the bonding interface. The physical Ethernet devices are not configured with any IP information and are 10For RedHat Linux systems, this can be in /etc/sysconfig/network-scripts/ifcfg-bondO file. 11 made slaves of th bonding device. The MAC address for the bonding interface is cho- sen from the first enslaved Ethernet interface. The bonding driver can be configured to take care of failing interfaces by monitoring their MII link status. 2.7 Queuing Disciplines For Linux Linux provides a very rich set of features for managing bandwidth by means of queues. With queuing we can control and shape the outgoing traffic from a machine. There are different queuing disciplines available in Linux. Some of them are classless which just accept data and only reschedule, delay or drop it. Classful queues allow different kinds of traffic to be treated differently. The term qdisc is used to refer to a queuing discipline. 2.7.1 Classless Queues These queues are used to shape traffic on the entire interface, without any subdi- visions. All packets are treated the same. Following are some of the classless queues available on Linux. pfifo_fast pfifofast [3] is a queue which performs a ‘First In, First Out’ scheduling of the arriving packets. There are three bands which are prioritized, band 0 receiving higher priority than band 1, which receives more priority than band 2. Band 0 contains packets which have the ‘minimum delay’ TOSll flag set. As long as there are packets in band 0, band 1 is not processed (similarly for band 1 and band 2). The TOS value in the packet is used to map the packet to one of the bands. The mapping can itself be specified with priomap which is set by the kernel.The queue length can be 1 1 Type of Service 12 configured with ifconfig or tc command. This is the default qdisc for any interface. Adding any other qdisc to the interface causes this qdisc to simply return without any action. Token Bucket Filter (TBF) The T BF [3] qdisc ensures that packets are sent out at some administratively set rate while allowing for short bursts of traffic which may exceed the set rate. TBF consists of a buffer (bucket) which is filled with tokens at a specific rate called the token rate. The token rate is the administratively set rate mentioned above. Each token dequeues a packet from the data queue and sends it out. If the token rate is greater than the data rate, tokens simply accumulate up to the bucket size. Excess tokens are then discarded. 
If the data rate is greater than the token rate, then some packets will be dropped until tokens are available. The accumulation of the tokens when the data rate is smaller than the token rate allows for short bursts of data exceeding the token rate. The bucket size (burst size), token rate and various other parameters are configurable. Note that in actual implementation, the token corresponds to bytes and not to packets. This qdisc is very convenient to slow down an interface to match the actual available bandwidth as in case of cable modems and DSL lines. 2.7.2 Classful Queues In Classful queues [3], classes and filters are used to manage the bandwidth. Filters added to a Classful queue look at the packet and decide what to do with the packet and return this decision to the Classful queue. The Classful queue then enqueues the packet in the appropriate class. The filters and classes can be nested to achieve better control over the traffic as shown in Figure 2.2. The class at the bottom of this nesting l3 1: root qdisc /1:l child class child classes /1\H2\ leaf class 12: qdisc 10:1 10:2 12:1 12:2 leafclass Figure 2.2: Hierarchy of qdiscs enqueues the packet to the qdisc it contains. When a packet arrives for transmission, the root qdisc enqueues it to the appropriate child class by filtering the packet. The child class can in turn apply filtering to enqueue the packet to one of its child classes. When a packet needs to be dequeued by the kernel to send to the interface, the root qdisc gets a dequeue request which is passed on to the child class. The class with the lowest handle containing a packet returns the packet to the kernel. The handle for a class is specified when the class or qdisc is added to the interface. Due to this nesting, the child class cannot dequeue a packet faster than its parent will allow. Class Based Queuing (CBQ) CBQ [3] is a very widely used and complicated Classful qdisc which also works as a shaper. However the shaping algorithm used in CBQ is not very precise and hence causes CBQ to behave unexpectedly in some situations like not achieving the desired traffic rate. CBQ uses the idle time between requests for packets by the hardware layer to shape the traffic. Determining the idle time is a very tricky issue due to various reasons like the driver implementation for the hardware, hardware details 14 (bus speed), etc. Even with these limitations, CBQ works well in many circumstances. The parameters which configure the shaping include the average packet size, physical bandwidth, required rate of traffic, etc. The queuing properties of CBQ are configured by specifying the weight, priority and allocated amount for each inner class. The priority property makes CBQ a priority queue where packets are dequeued from the highest priority inner queue before checking lower priority inner queues. The amount specifies how much data an inner queue can send out when requested. 2.7.3 Hierarchical Token Bucket (HTB) Hierarchical Token Bucket (HTB) [3] is similar to CBQ but instead of working with idle time calculations to shape traffic, it works as a Classful Token Bucket Filter. HT B can be used to specify how to divide the available bandwidth for different purposes and also how to share the bandwidth among them, by lending and borrowing, if required. This is a scalable queuing discipline unlike CBQ which gets complex even for a simple setup. 
If a class requests bandwidth less than the amount assigned, the remaining bandwidth is distributed to the other classes that request for bandwidth. Thus excess bandwidth can be lent to other classes. First we add a HTB qdisc as root qdisc to the interface. Then we add a root class”. This is required because bandwidth cannot be shared among the root classes. Hence we create this root class and then make the actual classes children of this root class. Then we add the classes specifying the bandwidth guaranteed for the class. Further we add filters to classify the packets into these classes. 12A root class is a class with the root qdisc as its parents 15 2.8 Limitations of Related Work The projects discussed in this chapter are unsuitable for achieving generic outgoing traffic balancing. Some of these works are applicable only to incoming traffic, for example Linux Virtual Server Project and Round-Robin DNS (see sections 2.1 and 2.2 respectively). Others like policy routing (see section 2.3) have restrictions over what traffic they balance. The rules in policy routing specify what kind of traffic will be balanced, for example packets with a specific source IP address. To balance the generic outgoing traffic, enough rules would have to be created to cover all kinds of packets in the generic traffic. EQL and Ethernet bonding drivers (see sections 2.4 and 2.6 respectively) apply only to specific types of interfaces (IP serial links and Ethernet interfaces respectively). Further, EQL, TEQL and Ethernet bonding, also require the cooperation of the ISP providing the connections which may not be possible in some case. Finally queuing disciplines (section 2.7) can only be used to shape the traffic which is already queued for an interface. This does not provide for balancing the traffic across multiple interfaces. Thus we see that balancing the generic outgoing traffic is very difficult if not impossible with the available tools. 16 Chapter 3 Linux Networking Internals The networking code in Linux is one of the most complex and largest piece of code in the kernel, in fact, the networking code makes up for around 20% of the entire Linux kernel code. In this chapter we discuss the details of the networking code in the Linux kernel and trace the path of a packet received, sent out and forwarded by the Linux kernel. The Linux kernel supports different network architectures including IPv4 Internet protocols (PF .INET), IPv6 Internet Protocols (PFJNET 6), Appletalk (PFAPPLETALK), AX.25 (PF _X25), IPX - Novell Protocols (PFJPX), etc. It also allows for different scheduling algorithms like Packet First-In-First-Out (pfifo), Byte First-In—First-Out (bfifo), Token Buffer Filter (TBF), Stochastic Fairness Queuing (SF Q), etc. The networking code in Linux is based upon the Swansea University Computer Society’s N ET 3.039 code. The current version of the networking code in Linux 2.4 kernel is NET4.0. The code has been divided into family of protocols and further into layers. Each layer has a well-defined interface with the adjacent layers. To make the code efficient, copying of data is avoided between the layers, instead, enough space is reserved for every packet for the header / trailer for the different layers. In this chapter, we concentrate on the IPv4 Internet Protocols family. 17 When an application generates traffic, it sends it to the transport layer (TCP or UDP) through the socket interface. The transport layer then passes it on to the network layer (IP). 
This layer looks at the destination of the packet and decides whether the destination is the local host or a different host. If the packet is for the same host, it passes it back to the upper layer for passing it on to the local destination application. If the packet is for another host, the kernel searches the route cache, and if required the Forwarding Information Base (FIB), for a route to the destination. It prepares the packet for transmission to the destination and hands the prepared packet to the interface to be sent out on the physical medium. On the other hand, when a packet arrives at an interface and if the hardware address belongs to the host or is a broadcast, it queues the packet for processing by the kernel. The IP layer looks at the destination and if the packet is for the host, it passes it on to the transport layer for further processing. The transport layer then de-multiplexes and sends the data to the appropriate program. If the packet is for another host and forwarding is enabled on the host, the routing cache and FIB tables are consulted to route the packet correctly. If forwarding is disabled, the packet is dropped. In Linux, the work to be done when an interrupt is received, is divided into two parts: top—half and bottom-half. Different tasks in the system have different bottom- half handlers, like networking bottom-half handler, timer bottom-half handler, SCSI bottom-half handler etc. In the top-half, only the most essential work, like moving data from the hardware buffer to kernel buffer, is performed. The top-half sets a flag to indicate that the kernel needs to complete the work for the interrupt by running the bottom-half handler. When the kernel scheduler runs, it checks these flags and runs the corresponding bottom-half handlers. These bottom-half handlers complete the 18 work related to the interrupt. For example, the networking bottom-half handler gets a packet from the queue, checks for the protocol type and hands it to the protocol- specific receive function. In 2.4 versions of the Linux kernel, the bottom-half handler for networking is implemented as a softirq. Bottom-halfs are very demanding on the system: only one bottom-half can run on a system irrespective of the number of CPUS the system has. Softirqs can run on multiple CPUS at the same time and are used for very very high frequency threaded jobs. The bottom-half handler for networking net-bh() has now been replaced with the net_rx-action () softirq in the 2.4 versions. Now that you have a basic understanding of networking in Linux, you may skip to the next chapter if you do not wish to get entangled in the intricacies of the Linux networking code. However for those who decide to continue, having the Linux kernel code handy is strongly advised. This will let you browse the actual code as you read about various functions and data structures. The source and header filenames in this document are relative to the kernel source location, usually /usr/src/linux. The following sections are based on Linux kernel version 2.4.18 which can be obtained from the Linux Kernel website: http://www.kernel.org. We should note that in the following discussions, a number of issues like fast-routing, multiprocess systems, etc are not discussed. The discussion of these issues would complicate the discussion and the actual control path of the packet would not be clearly understood. Hence, only the control paths which would be taken in the most general cases are discussed. 
We also assume that all the interfaces involved are Ethernet interfaces. 19 3.1 Data structures At the highest level, BSD sockets are used to represent a network connection. These structures hide the implementation details of the different protocol families. INET sockets are specific to IPv4 protocol family. The generic operations on the BSD sockets invoke functions specific to the protocol family such as the INET socket Operations. In the following sections we discuss some of the important data structures in the Linux networking code. 3.1.1 sk_buff sk_buff1 is one of the most important data structure in the Linux networking code. It represents a buffer which provides control information and data to be transmitted or received. It reserves enough storage so that at each layer of the networking stack, the headers or trailers can be added without any additional overhead. The data is copied only twice through its entire journey through the networking layers: once from user-space to the kernel-space and once from the kernel-space to the output medium. Important members of this data structure include pointers to the head of the buffer list, next and previous pointers, pointer to the sock (see section 3.1.3) structure which owns this buffer, net_device (see section 3.1.4) pointer to the device from which this packet arrived or on which this packet will be sent out, pointers to the different network layer headers and trailers, checksum, length of the data currently in the buffer, pointer to the connection tracking structure nf_ct_info (see section 3.1.18) and control information for the upper layer protocol. 1 include / linux / skbuff.h 20 3.1.2 socket socket2 represents a BSD socket in the networking code. Some of its members include the state of the socket, flags, pointer to the inode associated with the socket, pointer to the file associated with the socket, a pointer to the low-level“ sock associated with the socket and a pointer to the proto-ops structure which contains function pointers to handle different operations on the socket. For a TCP socket, it points to the inet_stream.ops3 which contains functions to handle a TCP socket bind, connect, listen and other operations. Most often, the sock is an INET socket. 3.1.3 sock The sock4 structure represents a messed up collection of network layer and bits of transport layer information for a socket. Some of the most important members of this structure are pointers to the next and previous sock structures (next, prev), local and foreign addresses (rcv_saddr, daddr), source and destination port numbers (sport, dport), address family (family), state of the socket (state), pointer to the destination cache entry (dst..cache), receive and send/write queues (receive-queue, write_queue), pointer to a struct proto structure (prot), which has pointers to the transport layer functions to operate on the socket, transport layer options structure (tp_pinfo), pointer to the parent BSD socket (socket). The prot points to tcp.prot5 for TCP sockets and udp_port6 for UDP sockets. 2include/linux/net.h 3net/ipv4/af_inet.c 4include / net / sock.h 5net/ipv4/tcp.c 6net/ipv4/udp.c 21 3.1 dar Ida llt’l lf’f' 3. 1.4 net_device net_device7 represents a network device in the Linux networking code. It includes data related to higher levels of networking as well as basic I / O. 
The important fields related to networking include the name of the interface (name), pointer to the next net-device (next), unique device identifier (ifindex), pointer to functions to get in- terface statistics (getstatus for wired and get-wireless_stats for wireless interfaces), interface address (dev_addr), pointers to the queuing disciplines attached to the in- terface (qdisc), pointers to specific higher level protocols including AppleTalk, IPv4, IPv6 etc, transmit queue length ( tx-queue_len), pointers to functions to perform differ- ent actions on the interface like start transmit (hardstaermitO), set MAC address (set.mac_addr()). 3. 1.5 Qdisc Qdisc8 represents a queuing discipline. It has pointers to the queuing and de- queuing functions (enqueue, dequeue), pointer to the next Qdisc structure (next), pointer to a Qdisc_ops structure (Ops), which has pointers to functions for different operations of specific disciplines, pointer to the interface to which this qdisc is bound (dev), a statistics structure (struct tc_stats stats)9, which tracks the bytes, packets, drops and other information of the qdisc. ops points to tbf_qdisc.ops10 for a Token Bucket Filter (TBF) qdisc and pfifo.qdisc_opsll for a Packet First-In-First-Out qdisc. 7include/linux / netdeviceh 8 include / net / pkt _sched.h 9include / net / pkt _sched.h lonet / sched / sch _tbf .c 1 1 net /sched/sch_fifo.c 22 3.] 3.1.6 dst_entry This represents a destination cache entry. dst.entry12 contains pointers to input and output functions and information about a route. Its members include device pointer (dev), pointer to the neighbor (neighbour), hardware header pointer (hh), input and output functions (input, output), various other information like flags, last used time, path MT U, etc. It also holds a pointer to a neigh_ops13 structure (Ops) which contains family, protocol and pointers to functions for rerouting, destroying the route, etc. 3.1.7 neighbour Neighbor Table (struct neigh_table) Parameters \ (struct neigh_params) \\ \ \. \ Neighbors (struct neighbour) Pneigh (struct pneigh_entry) (struct neigh_table) a O.) l is) 3 G M Q) v (struct g neigh_table) V Figure 3.1: Neighbor Table Structure l2include / net / dst.h 13include / net / dst.h 23 This represents information about the neighbour connected to the system one hop away. Like other Objects in the Linux kernel, it has pointers to the next neigh- bour structure (next). Other information in the structure include pointer to the neigh_table”, pointer to the device (dev), flags, hardware address (ha), pointer to the output function (output), pointer to the neigh_ops15 structure (Ops) which has pointers to the functions for diflerent operations like destroying the neighbour struc- ture, transmitting a packet from the queue (queue_xmit). 3.1.8 rtable An entry in the routing table is represented by a rtable16 structure. This contains a union (u) of destination cache entry (dst) and the next rtable structure (next). Thus the union allows the same member to be used as a pointer to the next rtable or as a pointer to the destination cache entry for the route. Other information in rtable are the flags and type of the route (rtJlags, rt-type), source and destination addresses (rt_src, rt-dst), interface index (rtjif), gateway address (rt_gateway) and key for route lookup (key). 3.1.9 fib_table The use of multiple routing tables in Linux is made possible by the use of i‘ib.tablel7 structures (see Figures 3.1.918 and 3.1.1019). 
This structure contains a table identifier (tb_id), table manipulation functions like table lookup (thookup), 14include / net / neighbour.h 1 5 include / net / neighbour.h 16include / net / route.h l7include / net / ip_fib.h 1 8 Source: Linux 1P Networking, http:/ /www.cs.unh.edu/cnrg/gherrin 19Source: Linux IP Networking, http://www.cs.unh.edu/cnrg/gherrin 24 Netmask Table one entry for each Netrnask Table one zone for each Network Information routing information for split into specific network information (addresses, TOS etc) and protocols (IP, ATM, etc) potential netrnask known subnet mask each known network 0000 N Default 0.0.0.0 8000 zone node info C000 ll em 1 E000 ( p y) 10.0.0.0 zone / node info F000 / (F000) F800 . 17216.0. (empty) node info FEOO zone / (FFOO) \ 172.2900 FFOO node info FF80 (empty) FFFF Figure 3.2: FIB Table concepts insertion (thnsert), deletion (tb_delete) and pointer to the associated FIB hash table (tb.data). 3.1.10 fn_hash fn_hash2° is the FIB hash table which contains pointers (fn_zones) to the individual zones representing netmask. The number of zones is 33 which represents one zone per bit in the mask. The fnJoneJist points to the first non-empty zone in the table. The 20net/ipv4/fib_hash.c 25 Netmask Table (struct fn_hash) I I [ Network Zones Network Information I (struct fn_zone) ’[struct fib_node) (struct fib_info) fn_zones[O] : fz_next : fn_key fib_protocol tb__data fn_zones[l] I fz_hash < ' fib_metrics : fZ-maSk : fib_nhfl fn_zones[2] ' fn_key l I I l fn_key (struct l l fb tabl ' . fn k I _ e) / l fz_next / g - ey fib_PTOI9COI I = a fib_metrics ' fZ—haSh \ : fn_key fib_nh[] I fz_mask l fn_zones[32] : . fn_zone_list i E fn_key Main/Local Tables I 5 I I Figure 3.3: FIB Table Details zones are organized from most specific (longest prefix) to the most general (shortest prefix), thus the fn_zone_list points to the most specific zone in the table. 3.1.11 fn_zone Each fn_zone21 represents routing data for one netmask. Thus for the 32-bit IPv4 addresses, there are 33 such zones. It contains a pointer to the next non- empty zone (fz_next), pointer tO the fib_node hash table (fz_hash), number of entries (fz-ent), number of buckets in the hash table (fz_divisor), mask to generate the hash (fz_hashmask), mask for the zone (fz_mask) and order of the zone which is the index of the zone in fn_hash. 2’ net / ipv4 / fib_hash.c 26 3.1.12 fib_node fib_node22 denotes information specific to a route. Its fields include pointer to the next node (fnmext), pointer to fib_info structure (fn_info), the search key (fn_key), T OS (fn-tos), type of route (fn_type), scope of the route (fn_scope) and state (fn_state). The key, a 32 bit unsigned number, is formed by ANDing the destination address with the netmask of the zone. Thus it encodes both the destination address and the mask of the route. 3.1.13 fib_info fib_info23 contains information that can be shared between many routes. This includes the route metrics (fib_metrics), number Of nexthops available for the route (fib_nhs), array Of fib_nh structures (fib_nh), network protocol (fib.protocol), status information (dead) and pointers to the next and previous fib-info structures (fib_next, fib_prev). 3.1.14 fib_nh fib_nh24 represents the nexthop information in a route. It contains a pointer tO the device (nh-dev), flags, scope, weight of the nexthop (nh_weight), outgoing interface index (nh_oif) and the address of the gateway (nh-gw) 3.1.15 rt_key rt_key25 contains information which form the key for route lookup. 
This includes the source and destination addresses (src, dst), incoming and outgoing interface in- 22net/ipv4/fib_hash.c 23include / net / ip_fib.h 24include / net / ip_fib.h 25include / net / route.h 27 dices (iif, oif), firewall mark if configured (fwmark), TOS (tos) and scope of the route (scope). 3.1 .16 fib_result A fib_result26 structure is returned when a route is found in the routing table. This structure contains the prefix length (prefixlen), the index Of the next hop selected (nh.sel), type and scope Of the route (type, scope) and the routing rule (fib_rule) if multiple tables support is configured. 3.1.17 nf_conntrack m".conntrack27 represents a general netfilter connection. This is used only for usage count (use) and deletion through the destroy function pointer. Other specific connection structures, for example ip_conntrack contain this structure within them. 3.1.18 nf_ct_info nf_ct_info contains a pointer to a nf_conntrack structure (master). 3.1.19 ip_conntrack_tuple-hash ip_conntrack_tuple_hash is an entry in the connection hash table. It contains the ip_conntrack_tuple (tuple) and the pointer to the IP connection (ctrack). 3.1.20 ip-conntrack ip_conntrack28 represents an IP connection in the connection tracking mechanism. It contains an nf_conntrack structure (ct_general) for usage count, a couple of 26include / net / ip_fib.h 27include/linux/skbuff.h 28include/linux/skbuff.h 28 ip-conntrack_tuple_hash structures (one each for original and reply direction, tuple- hash), status tO indicate whether we have seen packets in both the directions (status), pointer to a ip_conntrack.helper if registered. If this connection is expected by another connection, that connection is stored in a nf.ct_info structure (master). An array of nf_ct_inf0 structures which indicate the relation Of the packet with the connection is available as infos. 3.1.21 ip_conntrack_info ip_conntrack_info29 is an enumeration type which indicates the status of the con- nection. The different values Of this type include IP_CTESTABLISHED to indicate an established connection, IP-CT_NEW to indicate a new connection etc. 3.1.22 ip_conntrack_tuple ip_conntrack_tuple30 contains the information related to an IP connection. The fields in this structure hold the destination information (IP address, port for TCP and UDP, and type and code for ICMP), source information (IP address, port for TCP and UDP, and type and code for ICMP) and transport layer protocol identifier. The source related information is manipulable and is encapsulated in an ip_conntrack_manip31 structure. 3.1.23 ip_conntrack_protocol ip_conntrack.protocol holds information and function pointers for specific proto- cols. Its fields include the protocol identifier (proto), protocol name (name), function pointer to convert packet to tuple (pkt_to_tup1e), function pointer to invert the tuple 29include/linux/netfilter_ipv4/ip_conntrack.h 3°include/linux/netfilter_ipv4/ip.conntrack.tuple.h 31include/linux/netfilter_ipv4/ip_conntrack.tuple.h 29 for the packet (invert_tuple), function pointer to handle the packet (packet), function pointer to create a new connection of this protocol (new), etc. This is represented by ip.conntrack.protocol_tcp32 for TCP connections, ip.conntrack_protocol_udp33 for UDP connections and by ip_conntrack_protocol_icmp34 for ICMP connections. 3.2 Packet Reception The journey Of a packet in the kernel begins with the hardware interrupt Of the interface receiving the packet from the physical medium. 
The interrupt service routine (ISR) for the interface checks for the length Of the queue (ring buffer overflow), any transmission errors like checksum, frame errors, etc. Then it allocates space (an sk_buff structure, usually called skb). It checks for the packet’s protocol ID and whether the packet is a hardware broadcast, multicast or destined for other host”. It then calls the netif_rx()36. netif_rx() enqueues the passed skb at the end of the receive queue for the CPU and raises the networking receive softirq (NET.RX-SOFTIRQ) flag. The NET_RX_SOFTIRQ is mapped to the net_rx_action()37. When the kernel finds that the NET_RX_SOF T IRQ has been raised, it calls net_rx-action(). In this function, packets (skb) are de-queued from the CPU’s receive queue. For every packet type, i.e. protocol family, that has been registered, the interface registering the protocol family is checked to match the incoming interface. If the interfaces match or 32 net / ipv4 / netfilter / ip.conntrack.proto_tcp.c 33 net /ipv4 / netfilter /ip.conntrack.proto.udp.c 3" net /ipv4/netfllter/ip.conntrack_proto.icmp.c 35When an Ethernet interface is in promiscuous mode, packets destined for other MAC addresses also will be processed. 36net / core / dev.c 37net/core/dev.c 30 if the protocol applies to all the interfaces, the function (packet_type.func) to handle packet Of the particular protocol family is called. For IN ET family, this function is ip_rcv( 38. As long as the queue has packets and there is time, packets are de—queued and processed. ip_rcv() is the main IP receive function. This function gets a pointer to the IP header Of the packet and performs checks like IP version, checksum, minimum packet size, etc. If these checks pass, it invokes the NFJP_PRE_ROUTING hook in the netfilter framework. If the hook returns a success, then ip_rcv.finish( 39 is called. ip.rcv_finish() invokes ip.route.input()4° if the skb->dst, where skb if a sk_buH structure discussed in 3.1.1, is not already set correctly. The skb->dst will be set correctly if the source address of the packet is a IOOpback interface, otherwise it is not set If ip_route.input() returns an error, the packet is dropped, i.e the buffer (skb) is freed and an error code NET_RX_DROP is returned. If ip.route.input() returns success, the packet is processed further. If the packet contains any TP options, they are processed. Finally, skb->dst->input() is called to process the packet. In INET family, this function maps to ipforward( )41 for packets destined to other host and to ip_local_deliver()42 for local packets. The journey of the packet along ipforwardO is described in 3.4. ip_route.input() is the function where the route lookup starts. First a hash is generated from the source address, destination address, incoming interface index and Type of Service (TOS) of the packet. This is used for lookup in the route cache hash 38net/ipv4/ip_output .c 39net/ipv4/ip.input.c 4°net/ipv4/route.c ‘1 net / ipv4 / ip.forward.c 4‘"net/ipv4/ip.input.c 31 table. The corresponding bucket in the route cache hash table is traversed searching for a route that matches the destination address, source address, incoming interface, outgoing interface index set to 0 and the TOS. If firewall marking is enabled in the kernel (CONFIGJPROUTEI‘WMARK = yes), then firewall mark is also matched with the route cache entry. 
If an entry is found, then the entry is updated with the last used time, number of times used and a pointer to the entry is made available in the packet’s dst, which is a dst_entry structure discussed in 3.1.6, and the function returns. The route cache entry contains dst-entry as its member. This structure, as discussed in 3.1.6, holds all the information required for the packet to be sent along the selected route which includes the neighbour information, the device, hardware header cache, the output function to be used to send the packet along the route, etc. If a route cache entry is not found, then ip_route_input.mc()43 is invoked if it is a multicast packet, otherwise ip_route_input_slow()44 is invoked. In this work, we do not discuss ip_route.input_mc() and only concentrate on ip.route_input_slow(). ip_route_input_slow() constructs a key (rt_key) based on the destination address, source address, TOS, firewall mark if configured, incoming interface index, route scope set to RT_SCOPE_UNIVERSE45. If the source address is multicast address or the source address has a bad address class or the source address is a loopback address“, then an error is returned which causes the packet to be dropped. Other checks like source and destination addresses being 0, source address having a zero network address are also checked and the packet is dropped if any such test succeeds. 43net/ipv4/route.c 4‘lnet/ipv4/route.c 45Route scopes are used to indicate the type of route: site, link, host, universe. 46A packet with a loopback source address will have the skb->dst already set by the output routine which would have already been checked in ip_rcv_finish () before invoking ip.route.input(). Hence a packet with loopback source address will never reach this function. 32 If the packet has a valid source and destination address, then the actual route lookup is performed. If multiple routing tables support (CONFIGJPJVIULTIPLE-TABLES = no) is not enabled in the kernel, fibJookup( 47 is called to lookup the route in the Forwarding Information Base (FIB) tables. Otherwise, if the multiple routing tables support is enabled, then the table lookup routine fibJookup( 48 is invoked. If the route lookup results in a local route, then the packet is destined for the local machine. In this case, a route cache entry is created and its fields including the search key, function to process the input packet (which is ip_local_deliver()), incoming interface index, etc are set. If it is not a local packet and if IP forwarding is not enabled on the interface, then the packet is dropped. At this point, if multipath is configured in the kernel and if there are more than one nexthops available for the route, fib_select_multipath( 49 is called which selects one of the multiple nexthops based on the route configuration. After further checks like no outgoing interface available or invalid source (should not be a broadcast or local address), a route cache entry is created and filled in with the information about the route selected. The input and output functions for the route cache entry are set to ip_forward() and ip.output( 5" respectively. Finally the route cache entry created above is inserted into the route cache by calling rtjntern_hash()51. One should note that only the most important functionalities of ip_route_input_slow() have been discussed here. 
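To summarize the input path just described, the sketch below condenses the decision that ip_route_input_slow() encodes into the new cache entry: a local route gets ip_local_deliver() as its input handler, a forwarded route gets ip_forward() with ip_output() for the outgoing side, and anything else is dropped. This is a minimal user-space illustration, not the kernel code; the *_sketch and *_stub names are invented stand-ins for the real structures and functions.

/*
 * Simplified, user-space sketch of the decision made at the end of
 * ip_route_input_slow(): pick the input/output handlers for the new
 * route cache entry based on the FIB lookup result.  The structures
 * and stub handlers are illustrative stand-ins, not kernel code.
 */
#include <stdio.h>

enum route_type { ROUTE_LOCAL, ROUTE_UNICAST, ROUTE_UNREACHABLE };

struct dst_entry_sketch {
    int (*input)(const char *pkt);   /* what to do with received packets  */
    int (*output)(const char *pkt);  /* what to do with outgoing packets  */
};

static int ip_local_deliver_stub(const char *pkt)
{ printf("deliver locally: %s\n", pkt); return 0; }

static int ip_forward_stub(const char *pkt)
{ printf("forward: %s\n", pkt); return 0; }

static int ip_output_stub(const char *pkt)
{ printf("transmit on selected interface: %s\n", pkt); return 0; }

/* Mirror of the branch taken after fib_lookup()/fib_select_multipath(). */
static int setup_input_route(struct dst_entry_sketch *dst,
                             enum route_type type, int forwarding_enabled)
{
    if (type == ROUTE_LOCAL) {
        dst->input  = ip_local_deliver_stub;  /* packet is for this host */
        dst->output = ip_output_stub;
        return 0;
    }
    if (type == ROUTE_UNICAST && forwarding_enabled) {
        dst->input  = ip_forward_stub;        /* packet is routed onwards */
        dst->output = ip_output_stub;
        return 0;
    }
    return -1;                                /* drop: no route or forwarding off */
}

int main(void)
{
    struct dst_entry_sketch dst;

    if (setup_input_route(&dst, ROUTE_UNICAST, 1) == 0)
        dst.input("packet for another host");
    if (setup_input_route(&dst, ROUTE_LOCAL, 1) == 0)
        dst.input("packet for this host");
    return 0;
}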
47include / net / ip_fib.h 48net/ipv4/fib.rules.c 49net/ipv4/fib_semantics.c 50net/ipv4/ip.output.c 5 1 net / ipv4 / route.c 33 fib_lookupO52 calls the table lookup routine, thookupO, on the local-table and main_table. If they fail an error is returned else success is returned. The structure Of these tables is discussed in 3.1.9. In fib_lookup()53, the table Of routing rules is searched to find the correct routing table to use for lookup. The search is based on the source address, destination address, TOS, firewall mark and incoming interface index. Once a matching rule is found, the action specified for the rule specifies whether to proceed with the route lookup or to return an error. The lookup proceeds if the action is RTN.UNICAST or RT N_NAT. The matching rule specifies the table to use for the lookup and the thookupO is invoked on that table. If thookup() returns an error, the error is propagated to the caller and the packet is finally dropped. tb-lookup() maps to fn_hash.lookup()54. In this function, the FIB tables are searched for longest matching route entry. As discussed in 3.1.11, each routing table has 33 zones (one for each bit in the network mask). The search starts with the zone pointed to by fn.zone_list which is the first non-empty zone in the table. The zones are linked together with the most restrictive zone (32 bit netmask) being at the head and relaxing the restriction with each zone. In each zone, fz_key( 55 is called to generate the search key by ANDing the destination address and the zone netmask. Then fz_chainO56 is called which invokes fn_hash()57 to generate a hash to quickly locate the potential fiinode in the hash table of fib_node for the zone, which might 52include / net / ip_fib.h 53 net /ipv4/fib.rules.c 5" net / ipv4 / fib_hash.c 55net/ipv4/fib_hash.c 56net/ipv4/fib_hash.c 57net/ipv4/fib_hash.c 34 have an entry for the destination. The search key is compared with the fib_node key. If the search key is less than that of the node key, then the search moves to the next zone. If the search key is greater than the node key, the search continues with the next fib_node entry. If there is a perfect match, then we have a route which is the longest match for the destination. Now, the fib_node entry is checked for its sc0pe and state. If the node is in a zombie state or if the scope of the fib.node is less than the sc0pe of the search key, then the search continues with the next fib_node. Else fib_semantic_match()58 is invoked to setup the fib_result structure with the results Of the lookup. If fib_semantic_match() returns success, then the type, scope and prefixlen of the fib_result is set to the values in the fib_node. In fib_semanticmatch 0, one of the parameters passed is the fib_info of the fib.node which was selected during the search earlier. As discussed in 3.1.13, fib_info contains a list of nexthops available for this route. If the multipath routing is not configured in the kernel (CONFIGJPJIO UTE_M ULTIPAT H = no), then the number Of nexthOps is at the most 1, which is selected if it is not dead. Otherwise, the first nexthop which is not dead is selected to be the nexthop for the packet. This is indicated by setting res->nh_sel = nhsel where res is a fib_result structure and nhsel and nhsel are indices into the array Of nexthops. 
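The default selection just described, taking the first nexthop that is not dead and recording its index in res->nh_sel, can be pictured with the short stand-alone sketch below. The *_sketch types are simplified stand-ins for struct fib_info and struct fib_nh, not the kernel definitions; the weighted selection used when multipath routing is configured is discussed next.

/*
 * User-space sketch of the default nexthop selection performed in
 * fib_semantic_match() when multipath routing is not compiled in:
 * the first nexthop that is not marked dead is chosen and its index
 * recorded in the result (res->nh_sel in the kernel).
 */
#include <stdio.h>

struct nexthop_sketch {
    const char *gateway;   /* nh_gw in the kernel                 */
    int         oif;       /* nh_oif                              */
    int         dead;      /* nexthop unusable?                   */
    int         weight;    /* nh_weight, used only by multipath   */
};

struct fib_info_sketch {
    int                    nh_count;  /* fib_nhs  */
    struct nexthop_sketch *nh;        /* fib_nh[] */
};

/* Return the index of the first usable nexthop, or -1 if all are dead. */
static int select_default_nexthop(const struct fib_info_sketch *fi)
{
    for (int i = 0; i < fi->nh_count; i++)
        if (!fi->nh[i].dead)
            return i;                 /* becomes res->nh_sel */
    return -1;
}

int main(void)
{
    struct nexthop_sketch hops[] = {
        { "10.0.1.2",    1, 1, 1 },   /* dead uplink */
        { "192.168.1.2", 2, 0, 1 },   /* usable      */
    };
    struct fib_info_sketch fi = { 2, hops };

    int sel = select_default_nexthop(&fi);
    if (sel >= 0)
        printf("selected nexthop %d via %s\n", sel, hops[sel].gateway);
    return 0;
}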
In fib_selectJnultipath () which is invoked if multipath is enabled in the kernel and if the number of nexthops for the selected route is greater than 1, a weighted-round- robin algorithm is used to select the nexthop for the packet. Each nexthop specified in the route is given a weight. The algorithm selects the nexthop according to the weights. 58net/ipv4/fib_semantics.c 35 ip_locaLdeliverO is the function which finally delivers the packet destined for the local host up to the higher layers. If the packet is fragment, ip.defrag()59 is invoked which reassembles the packet. Then the NF .IP.LOCAL.IN netfilter hook is invoked which if successful calls ip_local.deliver_finish()60. ip_local_deliver_finish() gets the higher layer protocol information from the IP header of the packet. If the protocol is IPPROTOJZAW indicating that the packet is destined for a raw socket, raw-v4_input( 6’ is invoked for further processing. If the protocol is not IPPROTOJIAW, then corresponding protocol handler is invoked as ipproto—>handler(). For TCP, this resolves to tcp_v4_rcv()62, for UDP, this resolves to udp.rcv()63 and for ICMP, it resolves to icmp_rcv()64 3.3 Packet Transmission A. packet transmission starts with a call to the the sock_write() which first gets the socket associated the file descriptor being written to and then sets up the message header and calls socksendmsg065. The socket is discussed in 3.1.2. socksendmng calls sobk->ops->sendmsg() of the socket which maps to different functions for different types Of sockets. For IN ET sockets, it maps to inetsendmng“. 59 net / ipv4 / ip_fragment .c 6Onet/ipv4/ip_input.c 61 net / ipv4 / raw.c 62net/ipv4/tcp_ipv4.c 63 net / ipv4 / udp.c 6“ net / ipv4 / icmp.c 65net /socket.c 66net/ipv4/af.inet.c 36 inetsendmng first binds an unbound socket and then invokes sk->prot->sendmsg() where sk is the INET socket associated with the BSD socket. INET sockets are discussed in 3.1.3. sk->prot->sendmsg() maps to tcp_sendmsg()67 for TCP sockets and udp_sendmsg()68 for UDP sockets. We take up tcpsendmng and trace the packet as it travels down the network stack. tcpsendmng first waits for the connection setup to complete. Then it forms packets from the message. If there is space in the last awaiting packet (content length < MSSSQ), the data is added to it. Finally tcp_send_skb()70 is invoked. tcp_send_skb() first enqueues the packet to the TCP write queue which holds the packets to be transmitted. Then tcp.transmit_skb()71 is invoked to send packets im- mediately if required, for example for packets which have the FIN bit set. Otherwise, the function is invoked later by other functions to transmit the packets. When a packet needs to be transmitted tcp_transmit_skb() is invoked. tcp-transmit_skb() first builds the TCP header for the packet. Then it invokes tp- >af_specific->send-check() where tp is a tcp_0pt structure and afspecific is the pointer to the INET family specific functions. afspecific is set to ipv4_specific72. send_check() maps to tcp_v4.send_check( 73 which computes the checksum if required. Finally tp—>af_specific->queue_xmit() is invoked which maps to ip-queue_xmit()7". 67net/ipv4/tcp.c 68net/ipv4/udp.c 69 Maximum Segment Size 70net/ipv4/tcp_output.c 71 net / ipv4 / tcp_output.c 72net/ipv4/tcp_ipv4.c 73 net / ipv4 / tcp_ipv4.c 7“ net / ipv4 / ip.output.c 37 In ip-queue_xmit(), route lookup is performed if the packet is not already routed, by calling ip_route-output()75. If the route lookup fails, the packet is freed and an error is returned. 
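Returning to the nexthop selection at the beginning of this section, the weighted choice that fib_select_multipath() makes among the live nexthops can be sketched in user space as below. This is a model of the idea, not the kernel's implementation: the nexthop structure is simplified, dead interfaces are represented by a plain flag, and rand() is used to draw a point in the cumulative weight space.

    #include <stdio.h>
    #include <stdlib.h>

    struct nexthop {
        const char *dev;
        int weight;
        int dead;                    /* non-zero if the interface is down */
    };

    /* Pick a nexthop with probability proportional to its weight, skipping
       dead entries; returns the index or -1 if no nexthop is usable. */
    static int select_multipath(const struct nexthop *nh, int n)
    {
        int total = 0;
        for (int i = 0; i < n; i++)
            if (!nh[i].dead)
                total += nh[i].weight;
        if (total == 0)
            return -1;

        int r = rand() % total;      /* random point in the cumulative weight space */
        for (int i = 0; i < n; i++) {
            if (nh[i].dead)
                continue;
            if (r < nh[i].weight)
                return i;
            r -= nh[i].weight;
        }
        return -1;                   /* not reached */
    }

    int main(void)
    {
        struct nexthop nh[] = { { "eth0", 2, 0 }, { "eth1", 1, 0 }, { "eth2", 1, 0 } };
        int hits[3] = { 0, 0, 0 };

        for (int i = 0; i < 40000; i++)
            hits[select_multipath(nh, 3)]++;

        /* eth0 should be chosen roughly twice as often as eth1 or eth2. */
        printf("eth0=%d eth1=%d eth2=%d\n", hits[0], hits[1], hits[2]);
        return 0;
    }

Over many selections each nexthop is chosen in proportion to its weight, which is the behaviour the strategies in Chapter 4 rely on.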
The transport layer mechanism will take care of the retrans— mission in this case. If a route is found, then a pointer to the dst structure in the route is set up in the packet. Now the IP header is built and added to the packet. The NF _IP.LOCAL-OUT netfilter hook is called which if successful invokes ip-queue_xmit2( 76. In ip_route-output() a key is formed from the destination address, source address, outgoing interface index and the TOS. The outgoing interface index is set if the socket is bound to a particular interface. Then ip_route_output_key()77 is invoked with the key. In ip_route_output_key() the route cache hash table is searched for a route. First a hash is generated from the source address, destination address, outgoing interface index and TOS of the packet. This is used for lookup in the route cache hash table. The corresponding bucket in the route cache hash table is traversed looking for a route that matches the destination address, source address, outgoing interface, incoming interface index being 0 and type of service flag. If firewall marking is enabled in the kernel (CONFIGJP_ROUTE_FWMARK = yes), then firewall mark is also matched with the route cache entry. If an entry is found, then the entry is updated with the last used time, number of times used and a pointer to the entry is made available in the struct rtable **rp passed as argument to the function and the function returns. If a route cache entry is not found, then ip_route_output_slow()78 is invoked. 75include / net / route.h 76 net / ipv4 / ip.output.c 77net/ipv4/route.c 78 net /ipv4/route.c 38 ip.route-output.slow() first checks if the packet has a source address bound to it, in which case it finds the device having that source address. Else it checks if the outgoing interface is specified for the packet, in this case, it selects an address bound to the interface. Then a check is made to see if the packet has a destination address. If it does not have a destination address, the source and destination addresses are set to the loopback address and the loopback device is chosen as the device to send out the packet on. If the previous tests fail, then fib_lookup() is invoked to find the route for the destination. This function has been discussed earlier in 3.2. If fib_lookup() returns an error but the outgoing interface is specified for the packet, the destination is assumed to be on the link and a link-scope source address is selected. If the route lookup fails and the outgoing interface is not specified for the packet, then the destination is unreachable and an error is returned. If the route lookup yields a local route, then the loopback device is selected for the outgoing interface. Else, if multipath is configured in the kernel and there are multiple nexthops available for the route, fib_select.multipath() is called to select a nexthop from the multiple nexthops available. If multipath is not enabled or if there is only one nexthop available for the route, the default nexthop is selected. At this point, the outgoing interface and the nexthop have been selected for the packet. A route cache entry is created and filled in with the destination address, source address, TOS, outgoing interface index, etc. The rth->u.dst.output which specifies the function used for outgoing packets on this route is set to ip-output(). If the route is a local route, then rth- >u.dst.input, which specifies the function used for incoming packets on this route, is set to ip_local-deliver(). 
ip_output() and ip_local-deliver() have been discussed in 3.2. Multicast packets and multicast routing are handled here which might change 39 the input and output functions setup above. Finally the route cache entry is inserted into the route cache with a call to rtjnternhash 0, which has been discussed in 3.2. In ip-queue_xmit2(), if the packet is larger than the PMTU79 of the route, then the ip_frag'mentO80 is called. Else, the IP checksum is computed with a call to ip_send-check( 81 and then skb->dst->output(skb) is called which maps to ip_output(). In ip_output(), ip-do.nat( 82 is invoked if NAT is configured in the kernel, else ip_finish_output()83 is invoked. ip_finish_output() invokes the NF _IP_POST _ROUTING netfilter hook which if successful calls ip_finish_output2( 84. In ip_finish_output2(), if a hardware header cache is available for the packet’s outgoing interface, the hardware header is added to the packet and then hh_output() function of the hardware header cache is invoked. If no hardware header cache is available for the packet and if the neighbour data is available for the route, then the output() of the neighbour data is invoked. If neither is available, the packet is dropped and an error is returned. In general, hh_output() function maps to dev_queue_xmit( 85. When a neighbour is deleted, hh_output() maps to neigh_blackhole()86 which just frees the packet buffer and returns an error. 79 Path Maximum Transmission Unit 80net /ipv4/ip_output.c 81 net / ipv4 / ip_output.c 8"’net/ipv4/ip.nat.dumb.c 83net/ipv4/ ip.output.c 84 net / ipv4 / ip..output.c 85net / core / dev.c 86net/core/dev.c 40 output() of the neighbour maps to different functions depending on the state of the neighbour. These functions provide address resolution functionality. If the neighbour is dead, it maps to neigh-blackhole(). If the neighbour is in a good, then it maps to neigh->ops->connected-output. For Ethernet devices, the connected-output() maps to neigh_resolve_output( 87. dev_queue_xmit() checks for the device queue and if the device has a queue, en- queues the packet for transmission by the device. It then runs the qdisc (qdisc_run ()88) attached to the device and returns either success or failure depending on whether the queuing was successful or not. Some types of devices, for example loopback, tunnels etc, do not have an associated queue. In such cases, the hardstaermitO associated with the device is invoked. This function maps to the routine which takes care Of the actual transmission of the packet from the interface and hence depends on the device driver. It is interesting to note that in'the case Of a loopback interface, the hard-start_xmit() invokes the netif_rx() simulating a packet reception by the loopback interface. It is in qdisc_run() that the queuing disciplines come into picture. This calls the qdisc_restart()89 which de—queues a packet from the device’s queue and calls the hardstaermitO to transmit the de-queued packet. The packet to be de—queued is decided by the qdisc installed on the interface and any traffic shaping setup on the interface. 87net / core / neighbour.c 88include / net / pkt _sched .h 89 net / sched / sch .generic . c 41 3.4 Packet Forwarding All packets which are being forwarded by the system follow the path taken by the incoming packets till they are passed to ip_forward(). This is the point where the incoming packets destined for the localhost and those being forwarded go their separate paths. 
At this point, the outgoing interface for the packet has been selected. The rout- ing table entry for the route chosen is available in (struct rtable*)skb->dst and the corresponding interface is available in rt->u.dst.dev. In ip_forward(), the packet is checked to make sure it is destined for a host and not a broadcast or other type of packet which should not be forwarded. The packet is dropped if it is not destined for a host. The IP Time-To—Live (TTL) is checked and if it is less than or equal to 1, the packet is dropped and an ICMP time exceeded error message is sent to the origin Of the packet. Other Options like strict source routing (SR), don’t fragment (DF) etc are also checked. ApprOpriate ICMP error messages are sent back and the packet is dropped if the packet fails any of the checks. If network address translation (NAT) is enabled on the outgoing interface, ip.do_nat()90 is invoked on the packet . If ip_do_nat() returns an error, the packet is dropped. Finally the NF _IP_F ORWARD netfilter hook is invoked on the packet. On success, this hook invokes ip_forwardjinish( 9’ on the packet. 90 net /ipv4/ip_nat -dumb.c 9’ net /ipv4 /ip_forward .c 42 If there are no IP Options, ip.forward.finish() invokes ip_send( 92 on the packet. If there are Options, then ip_forward_options()93 is invoked to process the Options and then ip_send() is invoked to send Off the packet. ip_send() calls ip_fragment()94 if the packet is larger than the Path Maximum Transmission Unit (PMTU) of the outgoing route which in turn fragments the packet and then calls ip_finish_output( 95 else calls ip_finish-output(). ip_finish-output() calls the NF _IP_POST -ROUTIN G netfilter hook and if the hook is successful it calls ipfinish_output2( 96. From here, the packet follows the same path as taken by a packet being transmitted from the system, which has been discussed in 3.3. 3.5 Netfilter Framework From the Linux netfilter Hacking HOWTOW: Netfilter is merely a series of hooks in various points in a protocol stack... The framework provides a number Of hooks which are called at different times during a packet’s journey through the protocol stack on the system as depicted in Figure 3.498. After checking the incoming packets for IP version, checksum, etc, ip_rcv() passes them through the NF_IP_PRE_ROUTING hook ([1] in Figure 3.4). After success- 92include / net / ip.h 93net/ipv4/ip.options.c 94 net / ipv4 / ip.output.c 95net/ipv4/ip_output.c 96 net / ipv4 / ip_output.c 9“’http: / / netfilter.org / documentation / HOWTO / / netfilter-hacking—HOWTO.html 98 Source: Linux netfilter Hacking HOWTO, http: / / www.netfilter.org / documentation / HOWTO / netfilter-hacking—HOWTO-3.html 43 lncomin g ——_>l1l '_"|ROUTE'I ——*[3I——>[4|—>Forwardcd/ Packets Outgoing Packets 3] “ [ROUTE] Local—bound Packets [[5] Locally—generated Packets Figure 3.4: Netfilter Hooks fully passing through the hook, the routing code decides whether the packet is destined for the local system or it has to be forwarded to another host. If the packet is for the local system, the NFJP_LOCAL_IN ([2] in the above Figure 3.4) is invoked from ipJocaLdeliverO. If on the other hand the packet is for a differ- ent system, ip_forward() calls the NFJPIORWARD ([3] in Figure 3.4). Packets which are generated by the local system first pass through NF _IP_LOCAL-OUT ([5] in Figure 3.4) netfilter hook called from ip_queue_xmit(). 
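The per-packet checks that ip_forward() applies before a packet reaches ip_forward_finish() can be modelled with a short userspace sketch. This is a simplified illustration of the logic described above, not the kernel code: the ip_pkt structure, the verdict names and the printf() calls standing in for ICMP error generation are all assumptions made for the example, and only the TTL and Don't Fragment/MTU checks are shown (strict source routing and the other option checks are omitted).

    #include <stdint.h>
    #include <stdio.h>

    /* Minimal view of the fields the forwarding checks look at (invented names). */
    struct ip_pkt {
        uint8_t  ttl;
        int      df;                 /* 1 if the Don't Fragment bit is set */
        unsigned len;
    };

    enum fwd_verdict { FWD_OK, FWD_DROP_TTL, FWD_DROP_MTU };

    /* Decide whether a packet may be forwarded on a route with the given MTU:
       a TTL of 1 or less triggers an ICMP Time Exceeded, and an oversized
       packet with DF set triggers an ICMP Fragmentation Needed. */
    static enum fwd_verdict forward_check(struct ip_pkt *p, unsigned route_mtu)
    {
        if (p->ttl <= 1) {
            printf("send ICMP time exceeded\n");
            return FWD_DROP_TTL;
        }
        if (p->df && p->len > route_mtu) {
            printf("send ICMP fragmentation needed (mtu %u)\n", route_mtu);
            return FWD_DROP_MTU;
        }
        p->ttl--;                    /* decrement the TTL before output */
        return FWD_OK;
    }

    int main(void)
    {
        struct ip_pkt ok = { 64, 0, 1400 }, expired = { 1, 0, 100 }, big = { 10, 1, 1600 };
        printf("ok:      %d\n", forward_check(&ok, 1500));
        printf("expired: %d\n", forward_check(&expired, 1500));
        printf("big+DF:  %d\n", forward_check(&big, 1500));
        return 0;
    }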
Locally generated pack— ets destined for other system as well as forwarded packets finally pass through the NF _IP_POST_ROUTING hook ([4] in Figure 3.4). The hooks can return one Of the following results: accept, drop, stolen, queue or repeat. If accepted, the packet continues its journey through the protocol stack. If the verdict returned is drop, then the packet is dropped and the traversal is aborted. Stolen indicates that the hook has taken control over the packet and that the protocol stack should not continue processing the packet. Queue provides a mechanism to 44 queue the packet for user-level handling Of the packet. Repeat causes the hook tO be called again on the packet. Any module can register itself at any of the hooks. During the registration, a function is specified which will be invoked when the hook is called. This function will be given a pointer to the packet and should return one Of accept, drop or queue verdicts. During the registration, the priority of the function for that book is spec- ified. When the hook is invoked, the registered functions are called in the order Of their priority. For example, the following modules register their functions with the NF_IP.LOCAL-OUT (in the order of priority/call): Conntrack, Mangle, NAT (desti- nation NAT), F ilter99. Conntrack module handles the connection tracking mechanism in Linux while the Filter module is responsible for filtering packets based on differ- ent criteria like source and destination address, source and destination port number, ICMP type, etc. The filter module is used to setup rules to configure firewall on the system. Now we discuss the connection tracking mechanism in the netfilter framework. This mechanism is later used in our work to identify the connection to which a packet might belong. ip-conntrack_in( ’00 is registered at NF _IP_PRE_ROUTING hook. A packet after passing the basic IP checks comes here where it is checked if a connection is already associated with this packet (probably coming on a loopback device), in which case NF_ACCEPT is returned. If the packet is a fragment, then ip_ct_gather_frags( “’1 is invoked to gather the fragments. Then the protocol of the transport layer is iden- tified from the packet. If the packet is an ICMP error message, it is dealt with 99include/linux/netfilter_ipv4.h 100net/ipv4/netfilter/ip_conntrack.core.c ‘01 net / ipv4/netfilter/ip-conntrack.core.c 45 here and NF _ACCEPT is returned. Then resolvemormal_ct( 102 is invoked to get a connection associated with the packet. If resolvemormaLctO fails, NF _ACCEPT is returned. The proto->packet(), where proto is a ip_conntrack..protocol (discussed in 3.1.23) structure, is invoked. This function maps to different functions for dif- ferent protocols, for example, for TCP it maps to tcp_packet( )103 and for UDP it maps to udp_packet()1°4. After some more checks, if there is a helper function reg- istered for the packet’s protocol, it is invoked. Finally, if resolve_normal.ct() had indicated that the packets have been seen in both the directions for the connection, the IPS_SEEN.REPLY.BIT105 bit is set in the connection status and the function returns the result of the call to proto—>packet(). resolveJIormaLctO first forms the IP connection tracking tuple (ip_conntrack_tuple as discussed in 3.1.22) for the packet. If the tuple cannot be formed an error in the form of NULL is returned. Else ip-conntrack.find-get( )106 is invoked to find any existing connection corresponding to the tuple. 
If a connection is not found for the tuple, then init_conntrack() (net/ipv4/netfilter/ip_conntrack_core.c) is called to create a new connection for the tuple. If there is any problem creating a new connection, then an error is returned. Once we have a connection corresponding to the tuple (either a new connection or an existing one), various checks are performed to set the status of the connection: for example, whether this packet is in the original direction or is a reply packet, and, if a reply has been seen, the connection is set to the established status. The function then returns the connection associated with the tuple.

102 net/ipv4/netfilter/ip_conntrack_core.c
103 net/ipv4/netfilter/ip_conntrack_proto_tcp.c
104 net/ipv4/netfilter/ip_conntrack_proto_udp.c
105 include/linux/netfilter_ipv4/ip_conntrack.h
106 net/ipv4/netfilter/ip_conntrack_core.c

init_conntrack() is the function where a new connection is created for a packet. The tuple (ip_conntrack_tuple, discussed in 3.1.22) passed to this function is used to generate a hash for placing the connection in the connection hash table. If the number of connections exceeds the maximum number of connections possible, a connection is dropped, either randomly from the table or from the chain into which the new connection will be put. Next, the tuple for the reverse direction of the packet is generated by calling invert_tuple() (net/ipv4/netfilter/ip_conntrack_core.c). If the inverse tuple generation fails, then an error is returned without putting the new connection into the connection table. Otherwise a new connection is allocated and its members filled in. The tuple and the connection information are filled in for both the original and the reverse direction packets. The new() function of the packet's transport layer protocol is invoked to set up protocol-specific information in the connection structure. After filling in other members of the connection, such as timeout information, the connection's tuplehash (discussed in 3.1.19) is returned.

As remarked earlier, the above discussion might be overwhelming to some readers; that is to be expected. However, section 4.1 discusses the strategies at a higher level and should be easier to follow than the above discussion.

Chapter 4

Channel Balancing Algorithms

In this chapter, we look at four different strategies to balance the outgoing traffic on an equal cost multipath route. First we discuss the strategies and then their implementation details. All the strategies require the kernel to have the multipath feature (CONFIG_IP_ROUTE_MULTIPATH = yes) enabled.

4.1 Strategies

For the discussion of the strategies, we assume that some source hosts (srcA, srcB, srcC, etc.) have more than one route to some destination hosts (dst1, dst2, dst3, etc.) using multiple interfaces (eth0, eth1, eth2, etc.) on the gateway/router. We assume that these routes have equal metrics, i.e. equal cost. If the cost of the routes were unequal, then the best route would be used and the others discarded (some router vendors also provide a feature to balance traffic across unequal routes). Therefore, for the following discussions, we assume more than one equal cost route to the destination hosts (see Figure 4.1).

[Figure 4.1: Multiple Routes between hosts — the source hosts reach the destination hosts through the router/load balancer, which has multiple uplinks via intermediate routers into the Internet cloud.]

4.1.1 Route Based Channel Balancing

In this strategy, load balancing is considered when the route for the destination is looked up. For example, when srcA causes a route lookup on the router/load balancer for dst1, the first route, via eth0, might be used. As long as the route cache entry for dst1 is valid, all data for dst1 will be routed over the same path.
When a new route is looked up, say for dst2, the load balancing algorithm will select the next equal cost path, via eth1. Now all the data for dst2 will be routed along eth1 as long as its route cache entry is valid.

The multipath route can be configured to have different weights for the different equal cost routes. For example, the route via eth0 might be given a weight of 2, while the rest are given a weight of 1. This will cause eth0 to be used twice as often as the others during route lookup. Further, with equal weights, if there are connections to dst1, dst2, dst3 and dst4, the connections to dst1 and dst4 will use eth0 while dst2 and dst3 will use eth1 and eth2 respectively.

4.1.2 Per-packet Based Channel Balancing

In this strategy, load balancing is applied for every packet. When a packet needs to be transmitted or forwarded to a destination for which there are multiple equal cost routes available, the load balancer will select the appropriate route. For example, if we have 3 packets to be forwarded to dst1 and we have 3 equal cost routes to dst1, then one packet will be sent over each route, thus balancing the traffic. The balancing is not restricted to a specific destination. If there were 3 packets, one destined for dst1 and two destined for dst2, then the first packet would be sent over the first route, the second packet over the second route, and the third packet over the third route. Thus one packet would be sent over each route.

Per-packet based channel balancing therefore does not consider the destination or the source information when balancing the traffic. It considers each packet individually and picks a route for it. This may cause packets belonging to the same connection (say a web upload session) to be sent out over different routes. In this strategy also, the administrator has the choice to assign different weights to each participating route and thereby control how much traffic is sent over each route.
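The per-packet behaviour described above can be illustrated with a small userspace model. This is only a sketch of the idea, not the kernel implementation: it uses a simple deterministic counter over a weighted link list, whereas the kernel's selection in fib_select_multipath() is randomized, and the link and pick_per_packet names are invented for the example.

    #include <stdio.h>

    struct link { const char *dev; int weight; };

    /* A running counter walks the weighted link list one slot per packet, so
       consecutive packets go out on different links regardless of destination. */
    static int pick_per_packet(const struct link *links, int n, unsigned *counter)
    {
        int total = 0;
        for (int i = 0; i < n; i++)
            total += links[i].weight;

        int slot = (int)((*counter)++ % (unsigned)total);
        for (int i = 0; i < n; i++) {
            if (slot < links[i].weight)
                return i;
            slot -= links[i].weight;
        }
        return 0;                    /* not reached */
    }

    int main(void)
    {
        struct link links[] = { { "eth0", 1 }, { "eth1", 1 }, { "eth2", 1 } };
        const char *dst[] = { "dst1", "dst2", "dst2", "dst1", "dst1" };
        unsigned counter = 0;

        for (int i = 0; i < 5; i++) {
            int l = pick_per_packet(links, 3, &counter);
            printf("packet %d to %s -> %s\n", i + 1, dst[i], links[l].dev);
        }
        return 0;
    }

Running the demo shows consecutive packets cycling over eth0, eth1 and eth2 regardless of whether they are destined for dst1 or dst2.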
4.1.3 Session Based Channel Balancing

We define a session as a collection of related packets transferred between two hosts, with the source initiating the connection to the destination. In other words, a TCP connection and a UDP connection are examples of a session. In the IPv4 family, the source and destination IP addresses, together with the source and destination ports for TCP and UDP or the type and code for ICMP, define a connection.

In this strategy, the load balancing is done based on sessions. When a session is initiated (either on the load balancer or on a host which sends data to the load balancer for forwarding) and comes in for routing, the load balancer selects a route from the multiple equal cost routes. From then on, all arriving packets which belong to that session are routed along the same route. When another session comes in for routing, the load balancer selects another route. We use the connection tracking mechanism of the netfilter framework (http://www.netfilter.org/) to determine whether a packet belongs to an existing connection or is a new connection.

For example, consider a TCP session initiated from srcA to dst1. The port numbers involved are 8011 (source port) and 22 (destination port). The connection can be specified by the tuple (srcA, 8011, dst1, 22). When this session is initiated, the load balancer will select a route, say via eth0. All packets which belong to this session will then be routed along eth0. If another session, defined by a different tuple, comes in for routing, the load balancer will now select another route, say via eth1, and all packets of that session will then be routed along eth1.

4.1.4 Bandwidth Usage Based Channel Balancing

In this strategy, the current bandwidth usage on the routes is used to decide the route for the current packet. The bandwidth usage on each route (interface) is measured periodically and weight-averaged over a period of time. When a packet comes in for routing, the route (interface) that has the maximum unused bandwidth available is used. Similar to the strategy in section 4.1.2, this strategy also considers each packet individually. This may cause the packets belonging to the same connection to be routed over different routes, which may result in out-of-order delivery at the destination.

For example, if there are two 10Mbps routes to a destination, say through eth0 and eth1, and their bandwidth usage is 3Mbps and 4Mbps respectively, then the next incoming packet will be sent out on eth0 because it has more unused bandwidth, 7Mbps.
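The session based strategy of section 4.1.3 can be modelled in user space as follows. This is a self-contained sketch, not the thesis implementation: the kernel version described later in section 4.2.3 stores the chosen outgoing interface in the conntrack tuple rather than in a separate table, and the session and route_session names, the fixed-size table and the plain round-robin choice are assumptions made for the example.

    #include <stdio.h>
    #include <string.h>

    #define MAX_SESSIONS 16
    #define N_LINKS      3

    /* A session is identified by the tuple described in section 4.1.3; oif
       remembers the link chosen when the session was first seen. */
    struct session {
        char src[16], dst[16];
        int sport, dport;
        int oif;
    };

    static struct session table[MAX_SESSIONS];
    static int n_sessions;
    static int rr;                              /* round-robin over the links */

    static int route_session(const char *src, int sport, const char *dst, int dport)
    {
        for (int i = 0; i < n_sessions; i++) {
            struct session *s = &table[i];
            if (!strcmp(s->src, src) && !strcmp(s->dst, dst) &&
                s->sport == sport && s->dport == dport)
                return s->oif;                  /* existing session: reuse its link */
        }
        if (n_sessions == MAX_SESSIONS)
            return rr++ % N_LINKS;              /* table full: plain round-robin */

        /* New session: pick the next link and remember the choice. */
        struct session *s = &table[n_sessions++];
        snprintf(s->src, sizeof s->src, "%s", src);
        snprintf(s->dst, sizeof s->dst, "%s", dst);
        s->sport = sport;
        s->dport = dport;
        s->oif = rr++ % N_LINKS;
        return s->oif;
    }

    int main(void)
    {
        const char *dev[N_LINKS] = { "eth0", "eth1", "eth2" };
        printf("%s\n", dev[route_session("srcA", 8011, "dst1", 22)]);   /* new session  */
        printf("%s\n", dev[route_session("srcA", 8012, "dst1", 22)]);   /* new session  */
        printf("%s\n", dev[route_session("srcA", 8011, "dst1", 22)]);   /* same session */
        return 0;
    }

The first two calls represent new sessions and receive different links; the third call belongs to the first session and reuses its link.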
If the result Of the search is a multipath route, then the fib_select.multipath() is invoked which selects the nexthop based on the weights assigned to the nexthops. The chosen nexthop is added to the route cache so that packets destined to the same destination can now be routed along the chosen route without causing a route lookup. 4.2.2 Per-packet Based Channel Balancing This strategy is not available in the stock Linux kernel and the nano5 patch should be applied to the kernel and the kernel rebuilt. When rebuilding the kernel, the multipath feature (CONF IG_IP_RO U TE..M ULT IPATH = yes) should be enabled. The following command will setup per-packet based load balancing with equal weights to each participating route (notice the use of the equalize keyword in the command): # ip route add equalize 20.0.0.0/24 \ nexthop via 10.0.1.2 dev ethO weight 1 \ nexthop via 10.0.2.2 dev eth1 weight 1 \ nexthop via 10.0.3.2 dev eth2 weight 1 The nano patch changes the route lookup code. When a packet arrives destined for a host for which there is no route in the route cache, the FIB tables are searched as in section 4.2.2. If a multipath route is found, then before adding the route entry into the 5http: / /trash.net / kaber / equalize / equalize_2.4.18.patch 54 Packet arrival for routing _No Search FIB l tables Route found in FIB tables? Route found in route cache? Discard packet Select nexthop (weighted—round -robin) Equalize‘ flag set? Multipath Route? / Send packet on / No the selected route W 1 Insert route into route cache End Figure 4.4: Per-Packet Based Channel Balancing route cache, it is marked with RTCFEQUALIZE flag. Later when this route cache entry is selected to route another packet, the route cache entry is deleted and the FIB tables are searched again. This makes sure that all packets using the multipath route cause a search of the FIB tables and the packets are distributed over the multiple nexthOps according to the weights. The route cache lookup routine for forwarded packets ip_route_inputO, ip.route.input.slow() which sets up the route cache entry for forwarded packets, ip_route_output_810w( ) which sets up the route cache entry for 55 Route found in route cache? Equalize flag set? Packet arrival for routing I Yes No —l Search FIB tables Send packet on the selected route Select ‘ previously nexthop session_debug set? No NO Select nexthop e1 ghted—roun —rlobin) acket belongs to existing sessmn? End / Insert route into / route cache Figure 4.5: Session Based Channel Balancing outgoing packets, ip_route_output_key() which searches the route cache for outgoing packets are modified to implement this strategy. 4.2.3 Session Based Channel Balancing This implementation requires the application of the nano patch to the stock Linux kernel. 56 This strategy requires the connection tracking mechanism (CON- FIGNETFILTER, CONFIGJPJVRCONNTRACK and CONFIGJPNPLFTP) to be configured in the kernel and not as separate modules. The ip-conntrack_tuple structure was modified to record the last chosen outgoing interface for the connection (oif). oif is initialized to -1 in resolve_normal-ct() before calling init_conntrack() for a new session. When the new connection is created in init_conntrack(), off of the reply packet’s tuple is initialized to -1. A flag, session-debug, is added to the kernel to indicate that session based channel balancing is configured. This flag can be set by inserting the module session_debug.o which was written separately. 
In ip_route_input_slow(), if the FIB table search for the packet's destination yields a multipath route, then we check whether session_debug is set. If set, we check whether the packet is associated with a connection. If a connection is found for the packet, we get the IP connection structure (ip_conntrack) of the packet. We now have information about the packet's direction as well as the outgoing interface used by the packet's connection. If the outgoing interface is not set (oif == -1), this indicates that the packet is part of a new connection; here we call fib_select_multipath_session() (net/ipv4/fib_semantics.c). If the outgoing interface is set, indicating that the packet is part of an existing connection, we call fib_select_multipath_previous(). If fib_select_multipath_previous() returns an error, due to a dead interface or other reasons, we call fib_select_multipath_session() just as for a packet belonging to a new connection. If the packet did not have a connection associated with it, we call fib_select_multipath_session() to select the nexthop. fib_select_multipath_session() invokes fib_select_multipath() to select the nexthop according to the weights of the nexthops.

If session_debug was not set, we call fib_select_multipath() to select the nexthop, similar to the section 4.1.2 strategy.

We implement the new function fib_select_multipath_previous() (net/ipv4/fib_semantics.c), which searches the multipath route for the nexthop last used by the connection. It takes the outgoing interface index of the connection as one of its parameters. If the interface of the last used nexthop is still alive, then nh_sel of the route cache entry is set to indicate the selection of that nexthop and success is returned. If the nexthop corresponding to the last used interface is not found or if it is dead, then an error is returned. Finally, in ip_route_input_slow(), just before inserting the route cache entry into the route cache, the outgoing interface for the packet's connection is set to the interface on which the packet is being sent out.

To use this strategy, the administrator should insert the module session_debug.o to set the session_debug flag and then set up the multipath route:

    # insmod session_debug.o
    # ip route add equalize 20.0.0.0/24 \
        nexthop via 10.0.1.2 dev eth0 weight 1 \
        nexthop via 10.0.2.2 dev eth1 weight 1 \
        nexthop via 10.0.3.2 dev eth2 weight 1
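A minimal userspace sketch of the lookup that fib_select_multipath_previous() performs is shown below. It is a model of the logic just described, not the patched kernel code: the nexthop structure and its fields are simplified assumptions, whereas the real function works on the kernel's fib_info/fib_nh structures and records the selection in nh_sel of the fib_result.

    #include <stdio.h>

    struct nexthop {
        int oif;                     /* outgoing interface index */
        int dead;                    /* non-zero if the interface is down */
    };

    /* Return the index of the nexthop whose interface matches the one the
       connection used before, provided it is still alive; -1 tells the caller
       to fall back to a fresh (session based) selection. */
    static int select_multipath_previous(const struct nexthop *nh, int n, int conn_oif)
    {
        if (conn_oif < 0)
            return -1;               /* the connection has no interface recorded yet */
        for (int i = 0; i < n; i++)
            if (nh[i].oif == conn_oif && !nh[i].dead)
                return i;
        return -1;                   /* interface gone or dead: reselect */
    }

    int main(void)
    {
        struct nexthop nh[] = { { 2, 0 }, { 3, 0 }, { 4, 1 } };
        printf("oif 3  -> nexthop %d\n", select_multipath_previous(nh, 3, 3));
        printf("oif 4  -> nexthop %d\n", select_multipath_previous(nh, 3, 4));  /* dead */
        printf("oif -1 -> nexthop %d\n", select_multipath_previous(nh, 3, -1)); /* new  */
        return 0;
    }

A return value of -1 corresponds to the error case above, in which the caller falls back to fib_select_multipath_session().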
4.2.4 Bandwidth Usage Based Channel Balancing

To implement this strategy, in addition to the kernel modifications, a kernel module was implemented to measure and estimate the current bandwidth usage on all the interfaces. This strategy also requires the application of the nano patch to the stock Linux kernel. The net_device structure was modified to implement the bandwidth estimator: a bwusage structure (include/net/bwestimator.h) was added to net_device. bwusage holds various statistical information about the interface: its name (name), the transmission rate in bits/sec and packets/sec (tx_bps and tx_pps respectively), the average transmission rate in bits/sec and packets/sec (tx_avbps and tx_avpps respectively), the total bytes and packets sent out (tx_total_bytes and tx_total_packets respectively) and the bytes and packets sent out during the last measurement interval (tx_bytes and tx_packets respectively). The measurement interval is set to 1 second and can be configured in the bandwidth estimation module.

[Figure 4.6: Bandwidth Usage Based Channel Balancing — flowchart: as in Figure 4.4, but for every packet using the multipath route the nexthop with the maximum unused bandwidth is selected instead of the weighted round-robin choice.]

In ip_route_input_slow(), if the FIB table search for the packet's destination yields a multipath route, then we check whether bw_debug is set. This flag indicates the selection of the bandwidth usage based channel balancing strategy and is set by inserting the bw_debug.o module into the kernel; bw_debug.o is a kernel module written for this thesis. If the flag is set, we call fib_select_multipath_bandwidth() (net/ipv4/fib_semantics.c) to select the nexthop for the packet. If bw_debug is not set, we call fib_select_multipath() to select the nexthop, similar to the section 4.1.2 strategy.

In fib_select_multipath_bandwidth(), we iterate over all the available nexthops and identify the one with the maximum unused bandwidth. The bandwidth information is available from the bwusage structure of the interface associated with the nexthop. The nexthop with the maximum unused bandwidth is then selected to send the packet, and nh_sel of the fib_result is set to indicate the selection.

To use this strategy, the administrator should insert the module bw_debug.o to set the bw_debug flag and then set up the multipath route:

    # insmod bw_debug.o
    # ip route add equalize 20.0.0.0/24 \
        nexthop via 10.0.1.2 dev eth0 weight 1 \
        nexthop via 10.0.2.2 dev eth1 weight 1 \
        nexthop via 10.0.3.2 dev eth2 weight 1
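The bandwidth estimator and the selection in fib_select_multipath_bandwidth() can be approximated by the userspace sketch below. This is illustrative only: the thesis module samples every second and averages over an 8-second window, while the sketch uses an exponentially weighted average as one simple form of weighted averaging; the capacity_bps field, the alpha parameter and the bw_sample and select_max_unused names are assumptions made for the example (the real bwusage fields are listed above).

    #include <stdio.h>

    /* Per-interface transmit statistics, loosely following the bwusage fields
       listed above (only the bit rate is modelled here). */
    struct bwusage {
        const char   *name;
        unsigned long capacity_bps;  /* configured link capacity (assumed field) */
        unsigned long tx_bytes;      /* bytes sent since the last sample */
        double        tx_avbps;      /* weighted average transmit rate */
    };

    /* Called once per 1-second measurement interval: convert the byte counter
       to bits/sec and fold it into an exponentially weighted average. */
    static void bw_sample(struct bwusage *bw, double alpha)
    {
        double inst_bps = bw->tx_bytes * 8.0;
        bw->tx_avbps = alpha * inst_bps + (1.0 - alpha) * bw->tx_avbps;
        bw->tx_bytes = 0;
    }

    /* Pick the interface with the most unused bandwidth (capacity minus average use). */
    static int select_max_unused(const struct bwusage *bw, int n)
    {
        int best = 0;
        double best_unused = (double)bw[0].capacity_bps - bw[0].tx_avbps;
        for (int i = 1; i < n; i++) {
            double unused = (double)bw[i].capacity_bps - bw[i].tx_avbps;
            if (unused > best_unused) {
                best_unused = unused;
                best = i;
            }
        }
        return best;
    }

    int main(void)
    {
        struct bwusage bw[] = {
            { "eth0", 10000000UL, 0, 3000000.0 },   /* 3 Mbps of 10 Mbps in use */
            { "eth1", 10000000UL, 0, 4000000.0 },   /* 4 Mbps of 10 Mbps in use */
        };
        bw[0].tx_bytes = 500000;                    /* 4 Mbit sent this interval */
        bw_sample(&bw[0], 0.25);

        printf("eth0 average: %.0f bps\n", bw[0].tx_avbps);
        printf("next packet -> %s\n", bw[select_max_unused(bw, 2)].name);
        return 0;
    }

In the demo, eth0 still has the most unused bandwidth after the sample, so it is selected for the next packet, mirroring the example in section 4.1.4.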
4.2.5 Discussion of the strategies

Each of the strategies discussed above has certain points of interest. The route based strategy and the session based strategy ensure that the packets of a connection passing through the route will be forwarded in order, while the per-packet based and bandwidth based strategies do not ensure this; in those strategies the packets of a connection might be routed over different routes. While this might seem to affect TCP's timing mechanism, which controls the TCP window size (according to the sliding window protocol), it should be noted that the IP protocol in itself does not guarantee in-order delivery of packets. Thus, even though the packets passing through the load balancer might be routed in order in the route based and session based strategies, there is no guarantee that they will reach the destination host in order.

Another issue to note is that, except for the route based strategy, all the other strategies invoke a FIB table search for every packet. This might cause scalability issues when routing large amounts of data. But experimental results on a LAN show that these strategies can achieve the same routing rate as the route based strategy for small sized data transfers. We should note here that this implementation is not intended for large networks where the traffic is flowing at gigabit-per-second (Gbps) rates. In such situations, one must consider the effects of the FIB lookups for every packet in the per-packet based, session based and bandwidth based strategies.

Chapter 5

Experimentation Details

In this chapter we describe the testbed setup, the traffic pattern and type, and the statistics collected.

The testbed setup is shown in Figure 5.1. As shown in the figure, two servers (srcA and srcB) serve as the source of data and two hosts (dst1 and dst2) serve as the sink. These hosts (dst1 and dst2) have been configured with two additional loopback addresses (lo:1 and lo:2) each, to which the data is sent; for readability, we use the name of the interface configured with an IP address to denote that IP address. This gives us 4 different destination addresses. The data is sent from srcA and srcB to dst1:lo1, dst1:lo2, dst2:lo1 and dst2:lo2. The default route on srcA and srcB is configured via the channel balancer router (balancer). On balancer, the default route is a multipath route consisting of routes via the three intermediate routers (testa, testb and testc). Depending on the channel balancing strategy being tested, we either set or unset the EQUALIZE flag on the multipath route. On the intermediate routers (testa, testb and testc), routes are set up to route packets destined for the destination addresses dst1:lo1 and dst1:lo2 via dst1:eth0 of dst1. Similarly, packets destined for dst2:lo1 and dst2:lo2 are routed on the intermediate routers via dst2:eth0 of dst2. To be able to route any acknowledgment packets or other error packets, the intermediate routers are configured to route data destined for srcA and srcB over the link connecting them to balancer. Finally, dst1 is configured to send data destined for the source servers via testa and dst2 is configured to send data destined for the source servers via testb. This setup enables the routing of data between the source and destination hosts.

[Figure 5.1: Testbed Setup — network diagram showing the source servers srcA and srcB, the load balancer, the three intermediate routers testa, testb and testc, and the destination hosts dst1 and dst2 with their additional loopback addresses lo:1 and lo:2, together with the interface addressing of the test subnets.]

We use the Secure Copy program (scp) to transfer the data. The transfer is initiated from the source servers. The session based strategy requires that the connections be outgoing on the multipath route: if a connection had been initiated from the destination hosts, then the connection tracking mechanism would remember the incoming interface and mark that as the interface to use for packets going in the reverse direction. Hence, to balance the sessions on the multipath route, the connections must be initiated from inside the network and go out on the multipath route. This is achieved by using the scp command on the source servers similar to the following:

    ashoknn-src1:~> scp file1.dat nnashok@ashoknn-dst1:data/

The above command copies file1.dat from the ashoknn-src1 machine to the ashoknn-dst1 machine. Files of different sizes (8KB, 16KB, 64KB, 128KB, 512KB, 1MB, 2MB, 4MB, 8MB and 16MB) were transferred. The tests were conducted by sending 20 files each of the sizes 8KB, 16KB, 64KB, 128KB and 512KB. For each of the sizes 1MB, 2MB, 4MB, 8MB and 16MB, 5 files were transferred. Each source would transmit the files of a given size (say 8KB) to each of the destination addresses (dst1:lo1, dst1:lo2, dst2:lo1 and dst2:lo2). Thus, during the test for the 8KB file size, srcA and srcB would each send 20 files of 8KB size to each of dst1:lo1, dst1:lo2, dst2:lo1 and dst2:lo2.
We used iptables rules to determine how much data was transferred on each interface for each source-destination combination. In addition, for the bandwidth usage based strategy, a bandwidth usage estimator module was inserted into the kernel, which measured the bandwidth usage on each interface every second and averaged it over a period of 8 seconds.

Chapter 6

Results

Figures 6.1 through 6.11 show the bandwidth utilization on each interface during file transfers of different sizes under the different strategies.

[Figure 6.1: Bandwidth usage for File Size 8KB — bar chart of bandwidth utilization (Kbps) on each interface for the Route, Per-packet, Session and Bandwidth strategies.]

[Figure 6.2: Bandwidth usage for File Size 16KB — bar chart of bandwidth utilization (Kbps) on each interface for the four strategies.]

In route based channel balancing, we observe that two of the routes participating in the multipath route are used almost equally. However, we see an anomaly during the transfer of files of 2MB size. This might be due to the timing of the arriving connections, which caused the same route to be taken for more connections.

We observe that the bandwidth utilization on each interface is approximately the same when the per-packet based channel balancing strategy is used. This verifies the design of the strategy. Although we expected performance degradation due to possible out-of-order delivery of packets at the destination, it is not reflected in the experimental results. This could be due to the small number of hops involved (3 hops) in reaching the destination. If there were more hops, we might have been able to notice the effects of the out-of-order delivery of packets.

The session based strategy also shows equal utilization of each route. But this would be because, during each test, the same amount of data was transferred by each connection, thus equalizing the bandwidth utilization on the routes. The difference in the bandwidth utilization of the routes is visible in Figure 6.11, where the connections transferred different amounts of data. In this case, the timing of the incoming connections would also matter in the utilization of the routes. For example, consider three transfers of 1MB, 1KB and 16KB respectively. Balancing these sessions, let us say that the 1MB session is routed over eth1, the 1KB session over eth2 and the 16KB session over eth4. Now if an 8MB session came in, it would be routed over the first route, eth1, thereby raising the bandwidth utilization of eth1 drastically. Therefore, in this strategy, the bandwidth utilization of each session will have an impact on the bandwidth utilization of each route.

We notice that in all the transfers with the bandwidth based strategy, eth1 is not utilized. This is because the simulated bandwidth available for eth1 was only 1Mbps, whereas the other interfaces were configured to simulate 10Mbps. Thus the available bandwidth on eth1 was less than the available bandwidth on the other interfaces. If the bandwidth of the routes were comparable, then we could expect behavior similar to the per-packet based channel balancing strategy.

The transfer times involved in these tests (see Figure 6.12) indicate that the different strategies perform equally well when the size of the files transferred is small (up to 512KB). As the size of the file increases, the per-packet based and session based strategies show better performance.
We should remember that in all the cases except the mixed file size case, the file size in each connection was the same. This would cause the data transferred by every connection to be the same and hence the bandwidth utilization of the routes in the session based tests to be approximately the same.

[Figure 6.3: Bandwidth usage for File Size 64KB]
[Figure 6.4: Bandwidth usage for File Size 128KB]
[Figure 6.5: Bandwidth usage for File Size 512KB]
[Figure 6.6: Bandwidth usage for File Size 1M]
[Figure 6.7: Bandwidth usage for File Size 2M]
[Figure 6.8: Bandwidth usage for File Size 4M]
[Figure 6.9: Bandwidth usage for File Size 8M]
[Figure 6.10: Bandwidth usage for File Size 16M]
[Figure 6.11: Bandwidth usage for Various (mixed) File Sizes]
(Figures 6.3 through 6.11 are bar charts of bandwidth utilization (Kbps) on each interface for the Route, Per-packet, Session and Bandwidth strategies.)

[Figure 6.12: Transfer Times — grouped bar chart of transfer time (sec) versus file size for each channel balancing strategy.]

Chapter 7

Conclusions and Future Work

The experimental results indicate that the per-packet based and session based channel balancing perform better than the route based and bandwidth based channel balancing in the experimental setup. The bandwidth based channel balancing strategy tends to push more data onto the higher capacity links rather than trying to push data onto all the available links. Balancing traffic based only on the unused bandwidth might cause data to be pushed onto an interface at a rate above the threshold at which queuing delays on the interface begin to decrease performance. This in itself can be studied further as another channel balancing strategy, in which the percentage of unused bandwidth is sought to be equalized across the routes.

Due to the topology of the testbed, the effect of out-of-order delivery on the performance of the strategies is not clear in the tests conducted. Experiments simulating out-of-order delivery of packets can be conducted to study these effects. Looking at the testbed, we observe that the destination hosts were only three hops away from the source. The effect of varying the number of hops between the source and destination can also be taken up for future research.

Ideally we would like to see the performance of the different strategies in a real-world scenario. It is in the real-world scenario (or a well simulated environment) that the effectiveness of the strategies can be truly evaluated.
The variations in connection times, lengths, data transferred, destination addresses, number of hops, latency, out-of-order delivery of packets, etc., may cause the strategies to behave in a manner not brought out by the laboratory experiments. Also, experiments can be conducted to study the effect of the packet size on the performance of the strategies.

Other strategies can also be looked into as part of future work. One such strategy would be to look at the retransmission rates on the routes and pick the one with the least retransmission rate. The history of past connections can also be used to choose the route: if connections to a particular destination perform well on one of the routes, then future connections to that destination should be routed over the same route. As mentioned above, the bandwidth usage based strategy can be modified to use the percentage of unused bandwidth as an equalizing measure.

Bibliography

[1] Glenn Herrin. Linux IP Networking: A Guide to the Implementation and Modification of the Linux Protocol Stack. http://www.cs.unh.edu/cnrg/gherrin/linux-net.html, 2000.

[2] The Linux Virtual Server Project. http://www.linuxvirtualserver.org

[3] Bert Hubert. Linux Advanced Routing & Traffic Control HOWTO. http://lartc.org/lartc.html, 2002.

[4] Alexey Kuznetsov. Policy Routing. /usr/src/linux/Documentation/networking/policy-routing.txt, 1999.

[5] Thomas Davis. Linux Ethernet Bonding Driver mini-howto. /usr/src/linux/Documentation/networking/bonding.txt.

[6] Alavoor Vasudevan. The Linux Kernel HOWTO. http://www.tldp.org/HOWTO/Kernel-HOWTO.html, 2003.

[7] Rusty Russell and Harald Welte. Linux netfilter Hacking HOWTO. http://www.netfilter.org/documentation/HOWTO//netfilter-hacking-HOWTO.html, 2002.

[8] Christoph Simon. Nano-Howto to use more than one independent Internet connection. http://www.ssi.bg/~ja/nano.txt, 2001.

[9] Gianluca Insolvibile. Inside the Linux Packet Filter. http://www.linuxjournal.com/article.php?sid=4852, 2002.

[10] Gianluca Insolvibile. Inside the Linux Packet Filter, Part II. http://www.linuxjournal.com/article.php?sid=5617, 2002.

[11] Wensong Zhang. Linux Virtual Server for Scalable Network Services. http://www.linuxvirtualserver.org/ols/lvs.ps.gz, 2000.

[12] T. Brisco. RFC 1794 - DNS Support for Load Balancing. http://www.faqs.org/rfcs/rfc1794.html, 1995.

[13] Alessandro Rubini and Jonathan Corbet. Linux Device Drivers. O'Reilly and Associates, Inc., 2001.

[14] Daniel P. Bovet and Marco Cesati. Understanding the Linux Kernel. O'Reilly and Associates, Inc., 2002.