HIGH PERFORMANCE, SCALABLE WEB SERVER SYSTEMS

By
Wenting Tang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

2001

This is to certify that the dissertation entitled "High Performance, Scalable Web Server Systems," presented by Wenting Tang, has been accepted towards fulfillment of the requirements for the Doctor of Philosophy degree in Computer Science & Engineering.

ABSTRACT

HIGH PERFORMANCE, SCALABLE WEB SERVER SYSTEMS

By Wenting Tang

The web has experienced phenomenal growth during the last couple of years. During the commercialization of the Internet, the web has become a "killer app," and millions of browsers now access a comparatively small number of web sites. As more applications are built around the web architecture, web sites have to deal with a potentially unlimited number of users, and peak request rates for a web service might be as high as 10 times the average. Therefore, the performance and scalability of such large sites are important issues.

In this dissertation, we address problems related to the development of high performance, scalable web server systems. We make contributions in two aspects of web server performance: the first relates to the performance of a single web server, and the second relates to the performance of multiple replicated web sites.

For a single web server, Browser Initiated Pushing (BIP) is proposed to improve performance, based on the observation that today's typical web page has one or more embedded images. Measurement shows that BIP is an important technology for improving a single web server's throughput.

We propose two approaches and develop a framework to address the scalability of replicated web sites. A non-dispatcher approach, Static Scheduling and Client Redirection (SSCR), shares load between replicated web sites and is mainly targeted at global-scale web sites. Smart Server Selection (S3) addresses the server selection problem when multiple replicated sites exist. In S3, a client DNS server is extended to prioritize a pool of IP addresses based on routing metric information collected from routers and other information it gathers (the geographical location of servers and clients). An efficient scheme to collect routing-metric information from routers is proposed. A framework to support content-aware request distribution in a STREAMS-based TCP/IP implementation is developed and prototyped.
Content-Aware request dis- tribution provides the ability to support partial replication, flexible web site arrange- ments, Web Quality of Service, and security. Our framework is based on the TCP handoff mechanism. The T CP handoff mechanism is designed as STREAMS mod- ule(s) in the protocol stack. Three different designs are reported according to work- load characteristics. The differentiated web service in the STREAMS-based TCP/IP implementation is discussed. © Copyright 2001 by Wenting Tang All Rights Reserved ACKNOWLEDGMENTS First, I would like to take this Opportunity to thank my academic advisor, professor Matt W. Mutka. During my PhD study at Michigan State University, professor Mutka showed his care and his support both in academia and everyday life. He spent enomorous hours to discuss problems with me. He always provided the insight on the problems and encourages me when I was frustrated. I also would like tO thank Dr. Mutka for his great support of my internship at HP Labs. I am grateful that professor Mutka corrected my grammar and spelling errors in several papers and this thesis so patiently and nicely. I would like to thank my committee members, Dr. Esfahanian, Dr. Ni and Dr. Stapleton, for their helpful suggestions and insights in the research problems and all they did for me during my period Of study. I am indebted a lot to my colleagues in HP Labs, Ludmila Cherkasova, Magnus Karlsson, Todd and others, for their support and encouragement. I want to express my great gratitude to my wife, Yujuan Cheng. Yujuan sacrified all her available time to do housework and Showed her unselfish support for my study. She has contributed a great deal Of her time to raise my two healthy kids so I may focus on my research. Thanks also to my parents and my grand parents who supported me during my academic years financially and spiritually. Without their support, I could not finish my studies. vi TABLE OF CONTENTS LIST OF TABLES x LIST OF FIGURES xi 1 Introduction 1 1.1 Identified Problems ............................. 5 1.1.1 Efficient Delivery of image-rich web pages ............... 5 1.1.2 Load distribution for replicated web servers .............. 6 1.1.3 Server Selection for Global Replicated Web Server Systems ...... 7 1.1.4 Differentiated / predictive Web Quality of Services .......... 8 1.2 Our Contribution .............................. 9 1.3 Structure Of the Content .......................... 10 2 Background and Related Work 12 2.1 Latency Reduction ............................. 13 2.1.1 Web Server Pushing ........................... 13 2.1.2 Client Prefetching ............................. 14 2.1.3 Proxy ................................... 15 2.2 Server Selection ................................ 19 2.3 Load Sharing for Replicated Web Servers .................. 21 2.3.1 DN S based load distribution vs front-end load distribution ...... 22 2.3.2 Front-end request distribution ...................... 22 2.3.3 DNS based distribution .......................... 26 2.4 Differentiated / predictive web QOS ...................... 27 3 Intelligent Browser Initiated Server Pushing (BIP) 29 3.1 Motivation Of BIP .............................. 32 3.1.1 How BIP works .............................. 33 3.1.2 Benefits of BIP .............................. 34 3.2 Enhanced BIP ................................ 37 3.2.1 Three Approaches to the Client Cache Problem ............. 37 3.2.2 Bloom filters ................................ 39 3.2.3 Bloom filters and BIP-Hash ........................ 
40 3.2.4 BIP-Hash and a Proxy ........................... 41 3.3 Performance Evaluation ........................... 42 3.3.1 Performance Metrics ............................ 42 3.3.2 Simulation Setup .............................. 43 3.3.3 Performance of BIP, BIP-Ref, and BIP-Hist ............... 44 vii 3.3.4 BIP-Hash Results .............................. 45 3.3.5 Memory requirements for maintaining the link structure information . 47 3.4 Benchmarking BIP approaches ....................... 48 3.4.1 Benchmarking Tool ............................. 48 3.4.2 Experiment Setup ............................. 49 3.4.3 Measurement Results ............................ 52 3.5 Implementation and Possible Limitations of BIPS ............. 62 3.6 Summary ................................... 64 4 S3 - Smart Server Selection 66 4.1 Selection Metric ............................... 67 4.2 Introduction of the 83 approach ...................... 69 4.3 83 - Smart Server Selection ........................ 70 4.3.1 Possible Approaches to Collect Routing Metrics ............ 71 4.3.2 DNS Extensions .............................. 74 4.4 Performance Evaluation ........................... 75 4.4.1 Performance Metric ............................ 75 4.4.2 Simulation Setup .............................. 75 4.4.3 Simulation Results ............................. 77 4.4.4 Overhead Analysis ............................ 88 4.5 Deployment Issues .............................. 89 4.6 Discussion ................................... 90 4.7 Summary ................................... 90 5 Static Scheduling and Client Redirection 92 5.1 Introduction .................................. 92 5.2 Our Proposed Approach: Static Scheduling and Client Redirection (SSCR) 94 5.2.1 Future Load Prediction .......................... 95 5.2.2 Load Distribution ............................. 96 5.2.3 Client Redirection ............................. 98 5.2.4 Wide Area Extensions ........................... 99 5.2.5 Fault-Tolerance and Scalability ...................... 101 5.2.6 Advantages of SSCR ..................... . ..... 101 5.2.7 Implementation of Redirection ...................... 102 5.3 Performance Evaluation ........................... 103 5.3.1 Performance metrics ............................ 103 5.3.2 Traces used in our study .......................... 103 5.3.3 Simulation Setup .............................. 104 5.3.4 Simulation Results ............................. 105 5.4 Summary ................................... 108 6 A Framework for Content-Aware Request Distribution 109 6.1 Introduction ................................. 110 6.1.1 TCP Splicing ............................... 112 6.1.2 TCP handoff ............................... 112 6.1.3 Content Aware Request Distribution in the STREAMS environment . 113 6.2 STREAMS-Based TCP/IP Implementation ................ 115 6.3 Content-Aware Request Distribution: Cluster Design and Application Specific Issues ............................... 118 6.4 Rare-TCP Handoff Design ......................... 126 6.5 Hequent-TCP Handoff Design ....................... 132 6.6 Always-TCP Handoff Design ........................ 136 6.7 Performance evaluation ........................... 138 6.7.1 Measurement testbed ........................... 139 6.7.2 Performance Measurement ........................ 139 6.8 Summary .................................. 141 7 Web Quality of Service and Differentiated Service 143 7.1 Differentiated Web Service at Web server systems ............ 
144 7.1.1 Mechanisms ................................ 145 7.1.2 Content-Aware Request Processing and Differentiation ........ 151 7.2 Network Quality of Service ......................... 154 7.3 End-to—End Web QoS ............................ 155 7.4 Content Delivery Networks ......................... 156 7.4.1 How Does A CDN Work? ........................ 157 7.4.2 Benefits of CDN ............................. 158 7.4.3 Content Delivery Network and End-to-End QoS ............ 159 7.5 Summary ................................... 160 8 Summary 161 9 Further Work 164 Bibliography 170 ix 3.1 3.2 3.3 4.1 5.1 5.2 6.1 6.2 6.3 LIST OF TABLES Characteristics of the Workload ....................... 33 Performance Result of BIP,BIP-Ref, BIP-Hist-1,BIP-Hist ......... 44 Simulation Parameters and their Default Values .............. 52 Route metrics for Yahoo replicated servers ................. 76 Web Access Traces Used for Trace-Driven Simulation ........... 104 Load distribution policies studied ....................... 105 Configuration measured ........................... 139 Latency of TCP handoff (ms) ........................ 140 Throughput of TCP handoff ( req/s ) ................... 141 1.1 2.1 2.2 2.3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 3.10 3.11 3.12 3.13 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.1 6.1 6.2 6.3 6.4 6.5 LIST OF FIGURES Subsystems of web system .......................... 3 Components along the path of a HTTP request .............. 12 L4/ L3: Frontend forwarding incoming and outgoing traffic ........ 23 L4/ L2: Response go directly from server to the client ........... 24 Bloom Filter ................................. 39 Latency Comparison of BIP-NV, BIP-V and NBIP (Default) ...... 54 Throughput Comparison of BIP-NV, BIP—V and NBIP (Default) . . . . 55 Concurrent Connections Comparison of BIP—NV, BIP-V and NBIP (Default) 56 Latency Comparison of BIP-NV and N BIP (Size) ............ 57 Throughput Comparison of BIP-NV and N BIP (Size) ........... 58 Latency Comparison of BIP-NV and N BIP (Number) ........... 59 Throughput Comparison of BIP-NV and NBIP (Number) ........ 60 Latency Comparison of BIP-NV and NBIP (Wide Area) ......... 61 Throughput Comparison of BIP-NV and NBIP (Wide Area) ....... 62 Concurrent Connections Comparison of BIP-NV and NBIP (Wide Area) 63 Latency Comparison of BIP-NV and NBIP (CGI) ............. 64 Throughput Comparison of BIP-NV and NBIP (CGI) ........... 65 One Day’s Access History From CSE workload ............... 68 Hops vs Simulation Length ......................... 78 HOpS VS TTLDR ( /\H =1/60) ....................... 79 HOps vs TTLH ( AH =1/60) ........................ 81 Latency vs Simulation Length ....................... 82 Latency vs TTLDR (W: 1,Gamma,)\H = 1/60 ) ............. 84 Latency vs TTLDR (W: 0.6,mixed,/\H = 1/60 ) ............. 85 Latency vs TTLDR (W: 0.3,mixed,AH = 1/60 ) ............. 86 Latency vs TTLDR (W: 0,Uniform,/\H = 1/60 ) ............. 87 Performance of Different Approaches under CSE workloads ........ 106 Traffic flow with TCP splicing mechanism ................. 111 Traffic flow with TCP handoff mechanism ................. 111 STREAMS .................................. 116 STREAMS-Based TCP/1P Implementation ................ 117 New Plug-in Modules for rare-TCP Handoff in STREAMS-Based TCP/1P Implementation .............................. 118 xi 6.6 Web server cluster configurations with content-aware request distribution to support a multi-language web site design. .............. 
6.7 Partition-based cooperative web proxies design with content-aware request distribution ................................ 122
6.8 Web server cluster configurations to support session integrity and differentiated services for e-commerce site design .............. 125
6.9 Remote Request Processing Flow During rare-TCP Handoff ....... 127
6.10 Remote Request Processing for Frequent-TCP Handoff .......... 133
6.11 Request Processing for always-TCP Handoff ................ 137

Chapter 1

Introduction

The World Wide Web (WWW) has experienced phenomenal growth, both in terms of the number of users and the amount of information available. It has been reported that the world wide web has become the killer application of the Internet and that 75% of traffic on the Internet backbone is web (HTTP) traffic [1]. The web has evolved into the major platform for information delivery and retrieval, distance learning, and application services. More and more people depend on the WWW to access information, purchase products, etc. Companies and organizations depend on the WWW for fulfillment of daily activities to save costs and maintain a lead over the competition.

As more users access the world wide web, we expect that more companies and organizations will move their daily activities onto the web. This trend in turn triggers more people to access the web to obtain information and to conduct business activities. While the information available through the web is enormous and millions of users access the Internet, many issues must be addressed to meet users' expectations: delivering the right information at the right time to the right people at the desired place.

Much effort has been put into improving web users' browsing experience. Such efforts include, but are not limited to:

1. performance improvements, which deliver a document as quickly as possible to a web user when it is requested. This information includes static web pages, dynamic web pages, and audio and video streams.

2. web search engines and document ranking, which crawl through web pages to find the web documents a user is looking for and rank the returned documents in order of relevance.

3. web information personalization, which customizes and organizes information according to web user profiles, to provide a better web tour experience.

4. wireless access, enabling web users to access the WWW through wireless devices.

5. security, which provides solutions to make sure that information is only available to the right people at the proper time.

In this dissertation, we focus on the performance of web systems, that is, on delivering documents as quickly as possible to a web user. Technically speaking, a web delivery system can be divided into three subsystems: web client systems, intermediate web systems, and web server systems, classified primarily according to management boundaries as shown in Figure 1.1. The web server system refers to the subsystem that is related to web services and maintained by the service provider. The intermediate web system refers to the services and equipment deployed specifically for web traffic and primarily maintained and managed by the ISP (Internet Service Provider). The web client system refers to the system where web-related services are managed by the client side. Such a classification facilitates the discussion of the technologies.
Figure 1.1: Subsystems of web system

Generally, a web server system and an intermediate web system tend to be bottlenecks because many users access the system simultaneously. Client web systems seldom become the bottleneck, for the following two reasons:

1. the desktop computer is powerful. Due to advances in microprocessors and computer systems, the performance of desktop systems doubles every 18 months.

2. the desktop handles much less communication and computation than web servers and intermediate networks, in terms of the number of users served.

There is a hot debate about whether the network system or the end system is the bottleneck. However, network speed has moved to Gigabit rates in the local area network, and switching technology greatly improves the bandwidth available between two nodes. As the Internet backbone moves to optical networking, we expect that more bandwidth will become available. Under this assumption, we put our focus on technology that can improve the performance of web server systems.

Chapter 2

Background and Related Work

Figure 2.1: Components along the path of a HTTP request

In this chapter, we first discuss related work on latency reduction techniques, especially web proxies. Next we discuss some of the related work on selection algorithms for replicated web servers. After that, related work and background on load distribution for replicated web servers are presented. Lastly, we briefly present some of the existing work on differentiated web services.

2.1 Latency Reduction

Latency reduction techniques currently proposed include proxying, prefetching, and web server pushing.

2.1.1 Web Server Pushing

In this strategy, the web server pushes documents to a place along the network path near the client. When a client requests those documents, the network latency is reduced. Bandwidth may also be saved because the client no longer needs to access the original server.

Web hosting and web co-location are two easy ways to achieve this goal. Web hosting refers to an organization or a person providing web content through its ISP: the ISP manages the web servers and hosts the content to serve web clients. Web co-location refers to an organization moving its own web servers and equipment to its ISP. In both cases, web traffic from web users goes to the web servers directly, without entering the organization's network or the link between the organization and its ISP. Generally, the link connecting the organization to the ISP has relatively limited bandwidth, so such an approach reduces the amount of WAN traffic and also reduces the download latency.

Some companies, such as Akamai [2], Inktomi [3] and Digital Island [4], recently started to build Content Delivery Networks (CDNs), a web pushing infrastructure. Companies may push their web documents to these CDNs, and the CDNs are responsible for shipping the documents around the Internet to match client access patterns. Since a CDN ships the documents to a place much nearer to the client, the client access latency is reduced and the load on the original web server is reduced as well.

Some research has been done in this area. The WebOS [5] project at Berkeley proposes a "rent-a-server" idea by which a server can be temporarily rented when needed to serve client requests. Research conducted at Boston University [6] uses traces to investigate where to push documents from the server's point of view.
RaDaR [7] describes a scalable architecture for web hosting service providers in which documents may be pushed to caches near the clients so that bandwidth may be saved.

2.1.2 Client Prefetching

In a client prefetching strategy, the client (either a proxy or a browser) prefetches a document before the user requests it. The main idea of prefetching is to overlap the download time with the user's thinking time. A proxy may prefetch some popular files, and a browser may prefetch some documents according to the user's access history. Client prefetching does not reduce the load on the original server.

Mogul [8] evaluates the performance of client prefetching. Several researchers from Boston University [9] discuss the effect of prefetching on network traffic patterns and propose an algorithm to smooth it. Prefetching also depends on accurate prediction of the user's next move. It has been reported that the accuracy of predicting the next access from a client can be as high as 85% [10].

2.1.3 Proxy

Web access patterns demonstrate temporal locality and spatial locality. Temporal locality refers to the fact that the same user tends to revisit the same page within a certain period of time. Spatial locality refers to the fact that different users behind the same proxy tend to access the same document. A proxy is a cache device that takes advantage of the temporal and spatial locality of web accesses. It caches the most frequently accessed files.

A proxy sits between the client and the web server, as shown in Figure 2.1. The client first establishes a TCP connection with the proxy, then the proxy checks the requested URL. If the requested URL is in the cache, the document is served from the cache. Otherwise, the proxy establishes a connection to the original web server, gets the document, and relays the response to the client. It may also cache the document according to a caching policy (a minimal sketch of this flow is given later in this subsection). It has been reported that the proxy hit rate is about 50% [11].

Benefits

Since the proxy may supply the document locally, the response time will be much shorter. A client proxy also reduces the amount of WAN traffic sent to its ISP. In addition to the performance improvement, a proxy offers other benefits. These benefits include:

1. Security. Since a proxy is the only point directly connected to the Internet for an organization, only authorized users may access the external world from the intranet. Also, only authorized users may access the intranet from outside.

2. Privacy. Since only the proxy's IP address is publicized, the internal network information and IP addresses remain anonymous.

3. IP address saving. The network behind the proxy may use private IP addresses and access the Internet through the proxy.

Different management domains use proxies for their own purposes. A proxy may be classified into three categories according to the management domain: client proxy, network proxy, and reverse proxy (server-side proxy). A client proxy sits in the client network domain and is managed by the client site. A reverse proxy sits in front of the web servers in the server domain. Network proxies are managed by ISPs. Among the purposes of a network proxy are:

- To reduce network traffic inside the ISP and traffic to other ISPs.

- Ease of management. While it is possible for each web browser to be configured with a client proxy in a client domain, this is inconvenient for ISPs, which serve thousands of users.
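The proxy request-handling flow described above (check the cache, serve on a hit, fetch from the origin server on a miss, keep a copy subject to an expiry policy) can be sketched in a few lines. The following Python fragment is an illustration only, under simple assumptions: the class name, the fixed per-entry expiry (the Time-To-Live consistency scheme is discussed under Cache Document Consistency below), and the use of urllib to reach the origin server are all choices made for the example, not part of any proxy discussed in this chapter.

    import time
    import urllib.request

    class ProxyCache:
        """Minimal forward-proxy cache sketch: serve from cache on a hit,
        fetch from the origin server on a miss, expire entries by TTL."""

        def __init__(self, default_ttl=300):
            self.default_ttl = default_ttl      # seconds a cached copy stays fresh
            self.store = {}                     # url -> (body, expiry_time)

        def get(self, url):
            entry = self.store.get(url)
            if entry is not None:
                body, expiry = entry
                if time.time() < expiry:        # cache hit: serve locally
                    return body, "HIT"
                del self.store[url]             # stale copy: treat as a miss
            body = self._fetch_from_origin(url) # cache miss: contact origin server
            self.store[url] = (body, time.time() + self.default_ttl)
            return body, "MISS"

        def _fetch_from_origin(self, url):
            with urllib.request.urlopen(url) as resp:
                return resp.read()

    # Example (hypothetical URL):
    # cache = ProxyCache(default_ttl=60)
    # body, status = cache.get("http://www.example.com/index.html")

A production proxy would also need a replacement policy for bounded storage, which is exactly the cache replacement problem discussed next.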
In the network proxy case, the client does not know of the presence of the proxy at all. When web traffic comes into the ISP network, an edge router automatically forwards the web traffic to a nearby proxy (cache engine). The network proxy supplies the document to the client if it is in the cache, and relays the request to the original server otherwise. Because it is transparent to the clients, it is sometimes referred to as a transparent proxy.

Cache Replacement Policy

When a proxy is full and a new document must be cached, the proxy must decide which document is to be evicted from the current cache in order to make room for the new document. This is referred to as a cache replacement policy. The work reported in [12, 13, 11] represents some of the recent results on this topic. It has been reported that LRU is not the best cache replacement policy, and many algorithms have been proposed and evaluated.

Cache Document Consistency

When a proxy caches a document and later serves the document locally, it is possible that the proxy returns a stale document to the client. Currently a Time-To-Live (TTL) scheme is used to ensure consistency: a document may be cached by any proxy for at most the TTL period. It has been shown that a stronger consistency model based on updates is possible and that its overhead is comparable to the current scheme [14]. An approach that proposes a piggyback scheme to carry document invalidation information within the current response has also been reported [15].

Scalability: Cooperative Proxies

Since the proxy is the only gateway for web traffic, it must relay web requests from the client to the web server and relay responses from the web server to the client. For a large client site, a single proxy server can easily become a bottleneck. Proxy clusters, where a number of proxies are interconnected by high-speed networks, provide the necessary scalability and availability. In such a configuration, it is desirable either that the contents of the different caches in the cluster be shared, or that the caches in the cluster hold disjoint sets of documents. The first case is referred to as signature-based cooperative proxies; the second case is referred to as partition-based cooperative proxies.

1. Signature-based cooperative proxies. In this configuration, each web proxy may independently cache any document. However, when a proxy does not find a requested document in its cache, it will try to get the document from neighboring proxies before sending a request to the original server. In such a configuration, a mechanism is needed to exchange what is in each proxy's cache. In [16], a cache sharing protocol based on multicast is proposed: when a proxy cannot find a document in its cache, the proxy sends a multicast message and then sends the request to the proxy that has the document. The Internet Cache Protocol (ICP) is based on this multicast approach. It has been reported [17] that the multicast packet exchange overhead is too large, and a new protocol is proposed. In the new protocol, each proxy keeps information, called a signature, about which documents are cached in the other cooperative proxies. A proxy sends updates when enough changes have been made to its cache contents.

2. Partition-based cooperative proxies. In this configuration, each proxy caches only a subset of the documents, and at any given time a document will be cached by only one proxy.
Since there is no replication among cached documents, the total cache size available for unique documents is large and the cache hit ratio is improved. CARP [18] and Consistent Hashing [19] are two examples in this direction. In these approaches, when a client accesses a document, the client must know which proxy to contact, so a mechanism must be introduced to address this mapping problem.

2.2 Server Selection

The problem that server selection tries to address is: when multiple replicated servers exist on a global scale, which server is the best in terms of expected performance? A number of methods for the server selection problem have been proposed. They can be divided into two categories depending on where the server selection is performed: the client side or the server side. In client side approaches, client applications or network services generally collect information from the network and the replicated servers, and make a selection. In server side approaches, server side network services gather the information and make the decision.

Many client side approaches have been proposed. They differ in where and how the information is collected. SmartClient [20], the probabilistic model proposed in [21], and automatic selection [22] provide application-level implementations to collect the metrics and make the server selection. The SONAR service [23] is a special server that prioritizes a list of IP addresses for a client according to the information it collects.

Cisco DistributedDirector [24] is a server side approach. The advantage of approaches such as DistributedDirector is that they do not need to make changes at the client network; all work is done by the service provider. The disadvantage is that the service provider must purchase and maintain special devices. Also, such an approach generally suffers a linear increase in cost as the number of replicated servers grows. In the long run, this problem (finding nearby replicated servers) should be addressed at the client side, which will make support of globally replicated services much easier for content providers.

Some researchers address different aspects of the problem: which metrics are independent and what the correlation among existing metrics is. Metrics include network hops, AS hops, network latency, and application-level latency. Examples include [25, 21]. Some network companies [26, 27, 28] propose approaches based on a pure network metric. Pure network-metric server selection may be divided into two categories: all the servers reside in the same AS, or the servers are located in different ASes.

All servers located in the same AS

If all the servers are located in the same AS, the solutions provided by these companies are similar: a device is put in front of each server cluster distributed over the wide area. In addition to balancing load for its local cluster, this device also has some routing functionality. In such a configuration, all the web server clusters have the same virtual IP address, and all the devices in front of the clusters advertise a route to this virtual IP address as a host route. The client network router thus sees multiple routes to the same IP address (the virtual IP address), each with possibly different network metrics, and the router automatically selects the shortest path or applies some other metric. In this way, each client network automatically selects the nearest server.

All servers located in different ASes

The above approaches do not apply for server clusters in different ASes.
The problem with different ASes is that the devices in front of each cluster can no longer advertise routes to the same IP address, because different ASes should not originate routes to the same IP address or network. In this case, some variation of a DNS server is used; Cisco DistributedDirector is one example.

2.3 Load Sharing for Replicated Web Servers

Load sharing is very important for achieving high throughput from a cluster. A lot of research has been conducted both in academia and in industry. Basically, all of the approaches fall into one of two categories: front-end distribution and DNS-based distribution. In front-end distribution, a special device, called a front-end or load balancer, is introduced and placed in front of the web farm. The load balancer accepts all inbound web traffic and decides which web server (backend node) is going to serve each request. In DNS-based approaches, the server side DNS server maps the web site name to a list of IP addresses. The decision is made when the client DNS server queries the server side DNS server for the name-to-IP-address mapping: when a client DNS request arrives, the server DNS can decide which IP addresses and TTL value to return.

2.3.1 DNS based load distribution vs front-end load distribution

Both DNS based request distribution and front-end based request distribution have their advantages and disadvantages. The front end accepts all incoming traffic, so it has the following advantages:

1. better load balance.
2. better security.
3. easy management.

It has the following disadvantages:

1. possible bottleneck.
2. single point of failure.

A DNS based policy addresses the problems of the front-end based approaches, namely the bottleneck and the single point of failure. However, since these approaches expose each web server's IP address to the clients, they have some disadvantages compared to front-end based approaches. For example, it is not easy to deal with server failure. In addition, sometimes it is not desirable to expose the web servers' IP addresses directly to the clients.

2.3.2 Front-end request distribution

Bryhni et al. provide a good classification of front-end approaches. Front-end approaches can be classified according to decision mechanisms and delivery mechanisms. Decision mechanisms are classified according to the protocol layer of the ISO reference model in which the decision is made regarding which server is to serve the request. Delivery mechanisms are classified according to the layer in which the server is uniquely identified in the cluster; packets are delivered to the server from the front-end accordingly. With this classification, front-end approaches fall into the following categories:

Figure 2.2: L4/L3: Front-end forwarding incoming and outgoing traffic

1. NAT (Network Address Translation) (L4/L3). In these approaches, each web server is identified by a unique IP address, and the front-end has a virtual IP address. When the client establishes a TCP connection, it sends the connection request to the virtual IP address. The front-end makes the decision based on Layer 4 information, for example IP address and port information. After the decision is made, the flow (connection) is delivered to the selected server: the destination IP address of the flow's packets is rewritten to the selected server's IP address (L3) by the front-end, and the packets are forwarded accordingly.
The outgoing packets' IP addresses of the flow are changed back by the front-end to the virtual IP address (Figure 2.2). Because these approaches translate the IP address back and forth between the virtual IP address and the selected server's IP address, this is commonly referred to as the Network Address Translation approach.

2. NAT-Backend (L4/L3). The above approach has considerable overhead because the front-end must update the IP address and checksum of each packet. In NAT-backend approaches, modification of the outgoing packets' IP address and checksum is performed by the backend node instead of the front end. Since the number of outgoing packets is larger than the number of incoming packets due to asymmetric web traffic, the load on the front-end is reduced. Also, if there is another path to the client, the server may send traffic to the client directly without passing through the front-end. This improves the scalability of the front-end. Magic Router [29], Cisco LocalDirector [30] and IBM Network Dispatcher [31] belong to this category.

3. L4/L2. In this approach, each web server in the cluster is aliased to the virtual IP address of the cluster, and each web server has a unique MAC address. When the client sends a connection request to the front-end, the front-end makes the decision and forwards the incoming packets to the selected server using its MAC address. The selected server responds to the client directly without passing through the front end. The IP address and checksum updates are avoided, so the load on both the front-end and the backend is reduced. The shortcoming of this approach is that, since each web server is identified by its MAC address, it has to be on the same LAN. The ONE-IP approach [32] belongs to this category (Figure 2.3).

Figure 2.3: L4/L2: Responses go directly from the server to the client

4. L7/L3. Recently, request distribution has moved from L4 to L7 in order to address some of the problems of the L4 approaches, for example the inability to read the user's cookies. L7 load distribution also provides a number of benefits, such as partial replication, ease of management, improved security, and better overall throughput (efficient memory usage). Two mechanisms are employed to implement L7 load distribution: TCP handoff and TCP splicing. The main complexity is that, in order to examine the URL to make the decision, the front-end must establish the connection with the client first. Then, when the front-end engages the selected server to serve the request, complicated manipulation is needed to maintain the necessary transparency and high throughput. Approaches from Resonate, Ipivot, Alteon and ArrowPoint [33, 28, 27, 34] provide solutions based on L7 load distribution. Academic efforts include [35, 36, 37].

2.3.3 DNS based distribution

DNS based approaches may be further divided into two categories:

1. pure DNS round-robin.
2. DNS round-robin combined with redirection.

Pure DNS round-robin approaches schedule the Internet domain and influence the load on each web server by the way the IP address and the TTL value of the name resolution are assigned. The approaches proposed in [38, 39] are such efforts. The DNS round-robin approach is widely used because of its simplicity, but it has been reported that DNS round-robin causes some imbalance in server load and does not always provide satisfactory performance. The algorithms of the second category come from two sources: DNS based policy supporters that try to provide better performance, and front-end approaches that try to avoid the central point.
The approach proposed in [40] belongs to the first source, while the approach proposed in [41] belongs to the second source.

2.4 Differentiated / predictive web QoS

Differentiated web QoS has received some attention in recent years. In the last two years, IETF working groups and network vendors have put much effort into providing differentiated network QoS on the Internet, and many standards and drafts have been proposed. For some applications, network delay is the most troublesome factor; for example, VoIP and streaming media require preferred treatment of their flows. For web traffic, however, network delay is only one part of the latency the web client experiences. The latency also depends on how fast the end system can deliver the information to the network. It is therefore preferable to provide end-to-end QoS: when a client needs QoS, not only does the network give the traffic preferential treatment, but the end system does as well.

There are two categories of differentiated web QoS. One is user-based, the other is request-based. Different sites can choose one of them or combine them together. A user-based policy provides differentiated web services based on user identity: for a particular user, no matter which URL he requests, he always gets the predefined level of service. Web sites which have different groups of subscribed users might consider this option. Request-based policies provide differentiation according to which document the user is requesting; some URLs have higher priority than others. For some online shopping sites, for example, the checkout URL has the highest priority in order to complete purchase transactions. Another application of a request-based policy is shared web hosting, where a single computer hosts multiple virtual web sites and different sites would like to have different levels of service.

Much research has been done in this area. In [42], the authors evaluate the possibility of supporting a request-based policy on top of a commodity operating system. WebQoS [43] is a commercial system that supports both user-based and request-based differentiation policies. Researchers from HP Labs [44] discuss what the end user perceives as web QoS and what that means for web site developers.

Chapter 3

Intelligent Browser Initiated Server Pushing (BIP)

As we mentioned in Chapter 1, web pages tend to include a large number of embedded images for various reasons. These image requests create a burden on web server resources and cause longer delays. Experience indicates that people browsing the web tend to click "stop" to terminate a slow download and access another, similar site. Reducing download latency is one factor that encourages a web user to continue browsing at the current server.

Presently, three popular techniques are used to decrease web download latency: web proxies, web server pushing, and client prefetching, as described in Chapter 2. All of the existing approaches have their own drawbacks. The proxy approach is not helpful if the document is not in the cache. Furthermore, it is often unsatisfactory due to dynamic documents and other factors (administrative overhead, single-point bottleneck, etc.). The web server pushing approach is not popular because infrastructures for receiving large amounts of documents and placing them closer to clients are just emerging [2, 3, 4]. These commercial infrastructures are not sufficiently flexible for most web sites due to inconvenience and high cost.
Techniques to decide whether it is appropriate to exploit these technologies are not available yet. It is also not clear whether these approaches can be applied in an intranet environment, where latency is mainly caused by an overloaded server and network latency plays only a limited role. For the client prefetching approach, it is very difficult to predict a user's next access.

Browser Initiated Pushing (BIP) addresses these problems. It tries to minimize the download latency upon a cache miss when a proxy or client prefetching is used. It may be used for dynamically generated HTML pages, such as active pages and CGI pages.[1] It also provides a benefit to users even if no proxy is installed. It reduces the number of HTTP requests received by the web server, which may help to reduce the web server's load.

[1] Embedded links of dynamic HTML web pages may be generated as needed.

Presently, most HTML pages include embedded content. Since an image is the most common embedded content, we focus our discussion and evaluation on images. For simplicity, we refer to HTML pages with embedded images as I-HTMLs, and refer to those I-HTML pages whose embedded images all reside on the same web server as the I-HTML page as L-HTMLs. Those I-HTMLs that have at least one embedded image that is not on the same server as the I-HTML page are referred to as R-HTMLs. Under existing approaches, two or more RTTs are required for such an I-HTML, depending on the browser implementation. BIP tries to expedite the download of such an I-HTML: under BIP, such an I-HTML may be downloaded using one request.

One benefit is that BIP reduces the download latency of an L-HTML to one RTT. Another benefit of BIP is that it improves server resource utilization. There are two reasons for this. First, it reduces the number of HTTP requests received by the web server by about 20% in our simulations. This frees network and processor resources to satisfy other requests. Second, it may reduce the connection-hold time by at least one RTT, which leads to improved server connection management under HTTP 1.1.

BIP may be used to reduce the download latency without prefetching or proxies. It may also be used by proxies or prefetching schemes to expedite the download of I-HTMLs and may serve as a quick recovery mechanism upon a cache miss. Based upon workloads found in our department and college, potentially more than 60% of all successful HTML requests may benefit from BIP.

The rest of the chapter is organized as follows. We describe the motivation, mechanism, and benefits of BIP in Section 3.1. The inefficiencies of BIP and some enhanced approaches are presented in Section 3.2. In Section 3.3, we describe our simulation model and performance results. In Section 3.4, we show our benchmarking results. Implementation and possible limitations are given in Section 3.5. Finally, we summarize results and discuss possible future work in Section 3.6.

3.1 Motivation of BIP

As we mentioned before, BIP works only for I-htmls, which have embedded images. Before we describe the BIP mechanism, we need to answer the following two questions:

1. What percentage of the html pages accessed are I-htmls?

2. Of those I-htmls, how many can be downloaded in 1 RTT? That is, what percentage of I-htmls are L-htmls?

In order to answer these two questions, we use two workloads. One is the CSE workload, roughly one day's trace file from our department's web server. The other is the EGR workload, roughly one day's trace file from our college web server (a sketch of the kind of log classification involved is shown below).
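To illustrate the kind of log analysis behind these workload characterizations, the sketch below classifies an HTML page as a non-I-html, an L-html, or an R-html given its parsed embedded-image links. The link_structure table, the host names, and the list of pages standing in for log entries are hypothetical examples; the actual traces were processed with the tools described in Section 3.3.2.

    from urllib.parse import urlparse

    # Hypothetical link structure: for each local HTML path, the embedded image
    # URLs found by parsing the page (relative URLs are treated as local).
    link_structure = {
        "/index.html": ["/img/logo.gif", "/img/banner.jpg"],
        "/research.html": ["/img/logo.gif", "http://other.example.org/pic.png"],
        "/plain.html": [],
    }

    def classify(html_path, server_host="www.example.edu"):
        """Classify one HTML page as 'non-I-html', 'L-html', or 'R-html'."""
        images = link_structure.get(html_path, [])
        if not images:
            return "non-I-html"                  # no embedded images
        for img in images:
            host = urlparse(img).netloc
            if host and host != server_host:     # embedded image on another server
                return "R-html"
        return "L-html"                          # all images local: 1-RTT candidate

    counts = {"non-I-html": 0, "L-html": 0, "R-html": 0}
    for page in ["/index.html", "/research.html", "/plain.html"]:  # stand-in for log entries
        counts[classify(page)] += 1
    print(counts)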
The other is the EGR workload, which is about one-day’s trace file from our college web server. In table 3.1, some statistics of the two workloads is presented. For embedded content, only images are considered. In our study, only static htmls are processed. In the CSE workload, among the total of 13,433 successful html accesses, there are 4,828 non-I-htmls(36%) and 8,605 I-htmls ( 64%). Among these I-htmls, 7,949 ( 92% ) I-htmls are L-htmls, which can be downloaded by 1 RTT. 2. In the EGR workload, in the total of 13,910 html accesses, there are 4,366 non-I-htmls (31%) and 9,544 I-html pages (69%). Among these I-htmls, 9284(97%) I-htmls are L-htmls. In summary, we can see that there are 64% to 69% of all the requested html pages are I—htmls and more than 90% of those I-htmls can be downloaded in 1 RTT. 2BIP can not reduce latency of R-htmls as much as that of L-htmls because a browser has to make at least one new request to another web server. However, it does reduce the number of requests and downloads those local images in 1 RTT. 32 Another characteristics is that on average, there are about 5 embedded images in each I-html(weighted), obtained by total accesses pushed divided by total I-htmls. We will analyze the performance of BIP based on these characteristics. Table 3.1: Characteristics of the Workload Trace CSE Workload EGR Workload Duration (in days) 1 1 Total number of accesses 116,636 90,095 total successful html accesses 13,433 13,910 total image accesses 89,352 64,883 total misslinks html 239 235 total non-I-htmls 4,828 4,366 total I-htmls 8,605 9,544 total L-htmls 7,949 9,284 total accesses pushed 42,767 44,170 3.1.1 How BIP works In the BIP approach, the web server maintains the link structure of all the I-htmls it serves. For the static I-htmls, such information can be retrieved by an html editor or a special tool that runs periodically to update the link structure of all the updated I-htmls. For all the dynamic I-htmls, this can be parsed on the fly. When a browser sends a request, it explicitly tells the web server that it allows embedded images to be pushed. After the server receives the request, it retrieves the link structure of the requested I-html and sends these images in addition to the I-html page. During this process, repeated images will be suppressed and pushed only once for each request. In the server response, the information of which embedded image(s) will be pushed for this request is provided. When the browser receives this response, 33 it parses the I-html as before. The browser will make requests to those embedded images which are not pushed. The main advantage of browser initiated embedded content pushing instead of server automatic pushing is flexibility. For example, when a user turns off the automatic image download option in his browser, the browser does not initiate the push for this request. However, server side pushing does not have such flexibility. BIP also makes the browser have full control on the server’s pushing behavior, which will prevent the ambitious server from flooding the browser with too many images. Furthermore, the browser can decide to turn off BIP due to too much overhead or it can selectively turn on BIP for some sites while turning it off for others. 3.1.2 Benefits of BIP In this section, we analyze the performance benefits of BIP: latency reduction and server resource utilization improvement. 
Latency Reduction

In this section, we use the numbers obtained from our workload statistics to compare the download latency of an L-html page under three different configurations: HTTP 1.0, HTTP 1.1, and BIP. The sample page is an L-html A with 5 embedded images.

1. HTTP 1.0 (without persistent connections). Under the HTTP 1.0 specification, the browser establishes a new connection for each request. If we assume that at most 4 parallel connections can be established by the browser to the server at the same time, the browser first asks for A. After receiving A, the browser parses it and finds that it must download 5 additional embedded images. It establishes 5 connections, with at most 4 in parallel at a time, so the browser needs 2 RTTs to download those embedded images. In total, the browser needs 3 RTTs to display html A with all of its embedded images.

2. HTTP 1.1 (persistent connections). In this scenario, the browser first gets A and, by parsing A, finds that it needs to download 5 embedded images. The browser reuses the existing connection and sends 5 consecutive HTTP requests. Due to HTTP 1.1 pipelining, the browser downloads those embedded images in approximately 1 RTT. In total, the browser needs 2 RTTs to display A with all of its embedded images.

3. BIP with HTTP 1.0/1.1. In this scenario, the browser sends an HTTP request with the server push option. The server supplies all the embedded images along with A. After the browser parses A, it finds that all the needed embedded images have been pushed by the server, so it takes just 1 RTT to display A.

Server Resource Utilization Improvement

BIP improves server resource utilization in two ways. First, it reduces the number of HTTP requests. Simulations show that it can reduce the number of HTTP requests by about 20%-26% in the two workloads. Reducing the number of HTTP requests relieves the pressure on the CPU and reduces the number of messages processed by the web server, improving web server throughput.

Second, it helps to manage connections under HTTP 1.1. The general pattern of web access is that a user makes a request, takes some time to read the page, makes another request, reads that page for a while, and so on. Without BIP, connection management is difficult because the requests for embedded images interfere with the requests for html pages. To further complicate the problem, the cache effect of the browser makes it difficult to predict the arrival time of the next request to fetch those embedded images. These two factors make it more difficult for the web server to predict the next access on an established connection. With BIP, these two problems are addressed: server pushing ensures that all the embedded images are delivered, and filtering out these embedded image requests makes predicting the next arrival time easier and more accurate. The web server connection management predictor only needs to predict the next html request arrival, so the predictor accuracy is improved. In [45], with the embedded images in the log, the connection hold time is reduced by only 25%; however, if the embedded images are filtered from the log, the connection hold time is reduced by 50%. Combining a highly precise html request predictor with BIP, we can expect a much better connection management scheme.

3.2 Enhanced BIP

BIP works fine if there are no shared embedded images between I-HTMLs or there is no client cache.
However, I-HTMLs from the same person or the same group generally share some embedded images. In this case, downloading the first I-HTML causes the shared embedded images to be cached at the client. Downloading the second I-HTML then does not require downloading these embedded images again. This is a problem for our simple BIP scheme because the server does not know what is in the client cache. Simulations show that without considering the client's cache, the web server pushes many more images than needed. This wastes system bandwidth (network bandwidth, web server throughput, and client throughput).

The main problem we must address is how to convey the contents of the client cache to the web server so that the web server does not push embedded images already in the client cache. The main obstacle is that too much system bandwidth would be consumed if a browser transmitted information about the contents of the client cache to the server directly, either by full URLs or by MD5 digests.

3.2.1 Three Approaches to the Client Cache Problem

1. BIP-Ref. BIP-Ref relies on the HTTP referrer field [46]. When a browser sends a request, it includes the last URL the browser visited, carried in the referrer field (the referrer URL is not necessarily a URL from this web server). The referrer field has been implemented in the Netscape Navigator and Microsoft Internet Explorer browsers. The web server may exploit this information and speculate on what might be in the client cache: when the web server receives a request from the client, the server will not push the embedded images of the referred HTML document.

2. BIP-Hist. BIP-Hist maintains the browser's access history information, so that when a request arrives, the web server checks the last N accessed pages and does not push embedded images that were embedded in those pages. Because maintaining access history information is quite expensive, the depth of the access history should be kept as small as possible.

3. BIP-Hash. In BIP-Hash, the browser uses a Bloom filter to transmit information about its cache contents to the server in the request. Before the server pushes an embedded image, it first checks the Bloom filter to see if the embedded image is in the browser cache. If it is not, the server pushes the image; otherwise, the server does not push the image to the browser.

BIP-Ref and BIP-Hist are rather straightforward techniques. BIP-Hash merits further discussion of the Bloom filter mechanism.

3.2.2 Bloom filters

A Bloom filter is a method for representing a set V = {v1, v2, ..., vn} of n elements (also called keys) to support membership queries. It was invented by Burton Bloom in 1970 [47] and was proposed recently for use in the proxy context by Cao and Fan [17] as a mechanism for efficiently identifying pages in cooperative proxies. The idea of a Bloom filter is to allocate a vector of m bits initialized to 0 (see Figure 3.1). The Bloom filter has k hash functions h1, h2, ..., hk, each with range {1, ..., m}. For each element v in V, the bits at positions h1(v), h2(v), ..., hk(v) are set to 1 (a bit may be set to 1 multiple times). Given an element u, we check the bits h1(u), h2(u), ..., hk(u). If all of these bits are set to 1, then we assume u is in the set. However, there is a possibility that we are wrong, which is called a "false positive".
Figure 3.1: Bloom Filter

The "false positive" probability is given by the expression (1 - (1 - 1/m)^{kn})^k ≈ (1 - e^{-kn/m})^k. A nice feature of the Bloom filter is that there is a clear tradeoff between the probability of a "false positive" and the length m of the vector. For details about the Bloom filter, its tradeoff between m and the false positive probability, and possible hash functions, see [17].

Applying a Bloom filter in BIP-Hash saves overhead and makes it possible for a client to transmit what is in its cache to the server with little extra cost. If a browser transmits full URLs, each URL might need 100 bytes. If the browser transmits MD5 signatures, each embedded image in the cache needs 16 bytes. If we assume 99% Bloom filter accuracy, each embedded image in the cache needs about 1.3 bytes.

3.2.3 Bloom filters and BIP-Hash

In the BIP-Hash approach, the Bloom filter is maintained by the browser on a per-site basis. The full URL is treated as the key in the Bloom filter. When a page is downloaded from a particular site, the browser updates the Bloom filter of that site to include the page's embedded images. Similarly, when an embedded image is removed from the cache, the Bloom filter of that site is updated accordingly. When a browser establishes a connection with the web server, it sends the Bloom filter with the HTTP request. Each subsequent HTTP request over the same connection sends only an update to the Bloom filter; the server caches the old Bloom filter and updates it on a per-connection basis.

Two nice features of the Bloom filter make BIP-Hash a very promising approach. First, it will never push an embedded image that exists in the client cache. This is because if the embedded image is in the client cache, all k of its bits in the filter are set, and the web server will know definitively that the image is cached. This feature saves system bandwidth, except for the Bloom filter overhead itself.
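To make the per-site filter concrete, here is a minimal Python sketch of the two halves of BIP-Hash described above: the browser inserts the URLs of its cached embedded images, and the server consults the filter before pushing. It is an illustration only, not the implementation used in this dissertation; the class interface and the salted-MD5 construction of the k bit positions are assumptions made for the example. With m = 5000 and k = 5 (the values used in the sizing analysis of Section 3.3.4), the filter occupies 625 bytes.

    import hashlib

    class BloomFilter:
        """Minimal per-site Bloom filter sketch for BIP-Hash: the browser adds
        the URLs of cached embedded images; the server queries the filter
        before pushing. No false negatives, tunable false positive rate."""

        def __init__(self, m=5000, k=5):
            self.m = m                     # number of bits in the vector
            self.k = k                     # number of hash functions
            self.bits = bytearray((m + 7) // 8)

        def _positions(self, key):
            # k positions derived from salted MD5 digests (an illustrative
            # choice, not the hash family assumed in the text).
            for i in range(self.k):
                digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    # Browser side: record cached embedded images for one site.
    site_filter = BloomFilter(m=5000, k=5)
    site_filter.add("/img/logo.gif")
    site_filter.add("/img/banner.jpg")

    # Server side: push only images the filter does not report as cached.
    for image in ["/img/logo.gif", "/img/new-photo.jpg"]:
        if image not in site_filter:
            print("push", image)   # logo.gif is never pushed: no false negatives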
When the proxy receives this request from the browser, it attaches its own Bloom filter to the request header and forwards the request to the web server. The web server will check both Bloom filters before pushing an embedded image.

3.3 Performance Evaluation

We evaluate the performance of BIP, BIP-Ref, BIP-Hist, and BIP-Hash in this section. We simulate the performance of BIP, BIP-Ref and BIP-Hist by trace-driven simulation and show the performance of BIP-Hash by analysis. We also analyze the memory requirement for maintaining the link structure information.

3.3.1 Performance Metrics

BIP performs well if we only consider the RTT for I-HTMLs because it tries to push all the embedded images in an I-HTML to the browser. However, it introduces overhead. Other approaches have similarly tried to reduce overhead while providing the benefits of BIP. For evaluating performance we determine the following measures:

1. What is the percentage of HTMLs that will have latency saved (L-HTMLs / total HTML accesses)? This is "HTML saving latency" in the following tables.

2. How much overhead does the approach introduce? There are two ways to measure the overhead. One is to measure the overhead as the number of pushed image misses (a pushed image miss means that an embedded image would be pushed by the web server but is not requested by the web browser in the trace file) divided by the total number of images pushed. We call this overhead Overhead-L. Another way to measure overhead is the number of pushed image misses divided by the total number of image accesses in the trace file. We refer to this overhead as Overhead-G.

3.3.2 Simulation Setup

We evaluated the performance of these approaches by using trace-driven simulation. First, we obtained the trace file from the web server for one day. Then we downloaded the HTMLs accessed by the browsers and recorded in the logs from the server. We could not download all pages because some HTMLs are password-protected. Next, we analyzed each HTML to generate its embedded link structure information. We also analyzed each embedded image to see whether it refers to a local file or to a remote resource. For our simulation, only images were considered as embedded documents. We also assumed that each host ran only one instance of a web browser.

Because we use the actual trace file from the web server instead of the browser, we need to determine which embedded images are actually accessed by the browsers. Some browsers may not download the images. Others may access another page without waiting until all the images are retrieved. The browsers also have "warm" caches; however, we only simulated "cold" browser caches. We first obtained the number of embedded pushes of all HTMLs accessed (this is the maximum number of embedded images that may be pushed), then we kept track of all the embedded images accessed by each host and put those images into the host's infinite client cache. Web servers will only push those embedded images not in the client cache. In this way, we obtained the minimum number of image accesses from all the HTML accesses, if we assume that all the browsers turn on the automatic image download option and have infinite caches. The difference of these two numbers is the accesses that were never made by the browsers.

Next, the simulator examined the log file. For each successful HTML access requested, the simulator checked the embedded link structure information and performed the corresponding action.
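A condensed sketch of this simulation loop, under the assumptions just stated (infinite per-host caches, images as the only embedded documents), might look as follows. The function and data-structure names are our own illustration; the actual simulator is not reproduced here.

    from collections import defaultdict

    def simulate_bip(html_accesses, image_accesses, embedded_links):
        """Trace-driven BIP simulation sketch.

        html_accesses:  list of (host, html_url) pairs in trace order.
        image_accesses: set of (host, image_url) pairs seen in the trace.
        embedded_links: dict mapping an HTML URL to its embedded image URLs.
        """
        cache = defaultdict(set)          # infinite per-host client cache
        pushed = misses = 0

        for host, html_url in html_accesses:
            for image in embedded_links.get(html_url, []):
                if image in cache[host]:
                    continue              # already cached, so the server does not push it
                pushed += 1
                cache[host].add(image)
                if (host, image) not in image_accesses:
                    misses += 1           # pushed, but never requested in the trace

        overhead_l = misses / pushed if pushed else 0.0
        overhead_g = misses / len(image_accesses) if image_accesses else 0.0
        return overhead_l, overhead_g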
3.3.3 Performance of BIP, BIP-Ref, and BIP-Hist

We used the two web access logs described in Table 3.1 to evaluate the performance of BIP, BIP-Ref and BIP-Hist. The result is presented in Table 3.2. BIP-Hist-1 means the server keeps the access history at depth 1. BIP-Hist means the server keeps access history with infinite depth and an infinite client cache. A hit means that an image pushed to a browser by the web server was actually requested by that browser.

Table 3.2: Performance Result of BIP, BIP-Ref, BIP-Hist-1, BIP-Hist

Table 3.2 indicates that BIP introduces significant overhead. From the table, we see that about 30% of the total pushed images are not used. Overhead-G is up to 17%.

Table 3.2 shows that the referrer field helps with web server pushing. Compared to BIP, Overhead-L of BIP-Ref is reduced by 6% and 7%, Overhead-G is reduced by 3% and 5%, and the number of HTMLs saving latency is reduced by a small amount.

Table 3.2 also shows the performance of BIP-Hist-1. Overhead-G is reduced by 2% and 4%, compared to BIP-Ref. At the same time, the number of HTMLs saving latency is reduced by 2%. This is a slight improvement over BIP-Ref.

The result of BIP-Hist indicates that with the server maintaining the history of the client, BIP-Hist reduces Overhead-G without too much reduction of HTMLs saving latency. However, the overhead for a web server to maintain the access history for each browser is large. It is our observation that maintaining a limited (as small as 1) access history is enough to obtain benefits.

3.3.4 BIP-Hash Results

The performance of BIP-Hash depends on what is in the client caches and what kind of hash functions are used. A simulation of BIP-Hash would involve the cache replacement policy and cache size. The following provides an analysis of BIP-Hash.

Let us assume there are n unique embedded images in a browser cache from a particular site, and the "false positive" rate of the Bloom filter for this site is p. Then there are p * n images not in the cache that the server will assume are in the cache. If we assume that all embedded images in the cache are embedded in an HTML document with equal probability, and that a particular page has k unique embedded images, then the probability that the page will have an embedded image that is not in the browser cache but that the server will assume is in the cache is given by:

1 - C((1 - p) * n, k) / C(n, k)

When n is quite large and p is quite small,

C((1 - p) * n, k) / C(n, k) ≈ (1 - p)^k

If we set k = 5 and we would like to have 95% of I-HTMLs latency-saved, then

(1 - p)^5 >= 0.95  =>  p <= 0.01

In order to satisfy this condition, m/n >= 10 and k >= 5. If we suppose n = 500, then m = 5000 bits = 625 bytes. This does not require much bandwidth for high-bandwidth networks, especially since this Bloom filter is sent only when the connection is established (HTTP 1.1). Consider the overhead without any type of BIP employed: the browser must make several requests. Each request might require several hundreds of bytes of request header. These multiple small messages require more CPU and network bandwidth than one large message as in the BIP-Hash approach.

We further compare the performance of BIP-Hash and BIP-Hist-1 by using the CSE workload. If we assume the above configuration, the total number of the HTMLs saving latency is 56% (59% x 95%), which is the same as for BIP-Hist-1.
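As a quick numeric check of the analysis above, the following short calculation evaluates the per-page probability 1 - (1 - p)^k together with the standard Bloom filter false-positive estimate (1 - e^(-kn/m))^k for the suggested configuration (m/n = 10, five hash functions, n = 500). It is an added illustration, not part of the original evaluation.

    import math

    def page_miss_probability(p, k):
        """Probability that a page with k embedded images has at least one
        cached image hidden from the server by a false positive."""
        return 1 - (1 - p) ** k

    def bloom_false_positive(m, n, k_hash):
        """Standard estimate (1 - e^(-k n / m))^k for an m-bit filter holding
        n keys with k_hash hash functions."""
        return (1 - math.exp(-k_hash * n / m)) ** k_hash

    # Target: 95% of I-HTMLs latency-saved for pages with k = 5 embedded images.
    p_required = 1 - 0.95 ** (1 / 5)                            # about 0.0102
    # Suggested configuration: m/n = 10, five hash functions, n = 500 cached images.
    p_filter = bloom_false_positive(m=5000, n=500, k_hash=5)    # about 0.0094
    print(p_required, p_filter, 5000 / 8)                       # 625 bytes on the wire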
We further compare the overhead of the two approaches. In BIP-Hist-1, 4,039 pushed images are wasted. If we assume that one embedded image on average is about 2 KBytes, then a total of 8,078 KBytes of bandwidth is wasted. If we consider BIP-Hash and assume each HTML request carries one Bloom filter, then since there are a total of 13,433 HTML requests in the CSE workload, each Bloom filter will have 601 bytes, which means the client cache caches about 462 embedded images on average. In practice, we expect the bandwidth overhead of BIP-Hash to be much smaller than that of BIP-Hist-1. So under a typical configuration, we may expect BIP-Hash to have much less overhead than BIP-Hist-1.

3.3.5 Memory requirements for maintaining the link structure information

In order to reduce the memory requirements for BIP, the web server may maintain two data structures. One is a mapping array, which maps each embedded image (URL) to a unique sequential number. The other is an embedded link structure list. In this list, for each HTML the web server maintains a list of those sequential numbers corresponding to each embedded image in the HTML page. We assume the size of the mapping array is m and each URL takes p bytes. We further assume the total number of HTML pages is n and on average each HTML page has q embedded links. If we assume a sequential number can be represented by a 4-byte integer, then the total amount of memory required by the two data structures may be expressed as: mp + 4nq. If we assume m = 20000, p = 100, n = 4000, q = 5, then the total amount of memory required is about 2 megabytes (the CSE workload contains about 4,000 unique URLs). For a typical web site, the memory requirement for maintaining the link structure information is not a problem.

3.4 Benchmarking BIP approaches

The benefits of BIP, as we discussed in Section 3.1.2, can be summarized as follows: latency reduction, HTTP request reduction, and TCP connection keep-alive time reduction. The three benefits reinforce each other. For example, BIP may avoid some overhead introduced by embedded image requests and therefore throughput may be improved. For the same reason, BIP improves the web server overload behavior and latency may be further reduced. Even though we analyzed the benefits in Section 3.1.2, it is difficult to analyze the interactions among them. Measurements provide a direct method to evaluate the impact of BIP on web server performance.

3.4.1 Benchmarking Tool

The tool we use to benchmark the performance of the Apache Web Server is called httperf [48], which was developed by Hewlett-Packard Laboratories. Httperf is a comprehensive tool with many features. Httperf may provide sustained load and is easily extensible. It reports network I/O throughput, reply rates, session rates, session time and connection time. Features related to the measurements conducted in this paper are:

1. Support for persistent connections and request pipelines. Multiple requests may be sent over one persistent connection in a manner controlled either by the trace file or at a fixed rate.

2. Tracefile support. A trace file may specify which URLs should be accessed and in which order.

3. Session support. Sessions may be defined by a tracefile to simulate session behavior. For example, a session may simulate the behavior of a browser accessing a web page with embedded images. It may also support CGI requests.

3.4.2 Experiment Setup

The measurements we conducted were done mainly within the HSNP (High Speed Network and Performance) laboratory of the Computer Science and Engineering Department at Michigan State University.
In our experimental environment, a network of SUN Ultra 1 machines is connected by switched 100BaseT Ethernet. The raw bandwidth between machines is 100 Mbps. This is an easily controlled environment in which the network and the machines are lightly used. The benchmark tool we described in Section 3.4.1 runs on these SUN Ultra 1 machines. One machine is used as the web server, which runs the most popular web server, Apache 1.3.9. Other machines are used as clients to generate a sustained load on the web server. Each Ultra 1 workstation has 64 Megabytes of memory and 1.8 Gigabytes of local hard drive. The operating system running on each workstation is Solaris 2.7.

For our measurements, we use a single large HTML file to simulate image pushing. For example, if document A has a size of 1.5 KBytes and 5 embedded images of 2 KBytes each, we simulate this by a single file of 11.5 KBytes. Only one sample static page with a number of embedded images is accessed repeatedly, except for the case of our CGI performance measurement. For the CGI measurement, the request is sent and the sample static page is generated by a PERL script. In BIP, the CGI script generates the HTML page with the designated size. Trace files are used to generate sessions. For non-BIP cases, each session is comprised of a retrieval of the HTML page, followed by the retrievals of the embedded images in the HTML page. For BIP, after the large HTML file is received, the connection is closed.

No trace file workload is introduced in our measurements because web traffic, access patterns, and other characteristics change, and it is very difficult to collect a representative trace file workload. Furthermore, such a measurement result would soon become obsolete. It is our intention to show the potential benefits of BIP instead of BIP's performance under a specific workload. Image pushing generally has some overhead [49]. We simulate pushing overhead by increasing the size of the single file by a percentage of BIP overhead. In our example, we select 10% BIP overhead [49].

We do not measure the link structure maintenance overhead, in order to avoid an implementation limitation of Apache. Because the current version of Apache is implemented based on a process pool, link maintenance has to be provided by a separate process and all Apache processes access the shared link structure through shared memory. This overhead might not exist in a thread-based implementation, which Apache 2.0 plans to provide. Using a single large file to simulate pushing, without considering the opening and closing of embedded image files, avoids excluding possible optimizations of web server implementations for file access. For example, some proxies place frequently accessed documents in memory all the time. Such an optimization does not introduce disk access for frequently accessed documents. Therefore, we essentially measure the upper bound of BIP performance.

The system we measured has three parts: clients, the network, and the web server. In order to make sure that we are measuring the web server performance, the network and client machines are tested to ensure that these two components are not the bottlenecks. Network bandwidth is measured by netperf (http://www.netperf.org/). With a 99% +/- 2.5% confidence interval, the measured bandwidth for one TCP connection is 90.44 Mbps. When we use httperf on the Apache Web Server, the largest effective network bandwidth (payload) consumed by one machine is only 40 Mbps.
Since one machine is sufficient to generate a sustained load on the web server, even if we consider the protocol overhead, the network is not the bottleneck in our measurement. Next, we use two machines to conduct the measurements to ensure that the client machine is not the bottleneck. Both machines have the same configuration. We determined that the throughput of the web server does not increase compared to the throughput when only one client machine is used. Therefore, one machine is enough to generate load on the web server.

We also conducted our measurements across the Internet in order to obtain a better understanding of BIP. We conducted a limited number of measurements since such an environment is difficult to control. For such measurements, the web server runs in the Chemistry Department at the University of New Mexico. The machine hosting the web server is an Intel Celeron 466 with 128 Megabytes of memory and 8 GB of disk space. The operating system running at the web server is Linux Red Hat 6.2 and the web server software is Apache 1.3.9. The local area network is 10 Mbps Ethernet. The network round trip latency is about 71 milliseconds. The effective bandwidth measured by httperf is 4.6 Mbps.

3.4.3 Measurement Results

This section provides measurement results of the web server's performance under various metrics. It also provides analysis of these results. In all the figures, BIP-NV refers to BIP without pushing overhead, BIP-V refers to BIP with 10% pushing overhead, and non-BIP is referred to as NBIP. Table 3.3 provides the measurement parameters and their default values.

Table 3.3: Simulation Parameters and their Default Values

  Parameter                       Default value
  Number of embedded images       5
  Size of the sample HTML page    1.5 KBytes
  Size of embedded images         2 KBytes
  Network bandwidth               100 Mbps
  Pushing overhead                0
  Static page                     Yes
  Persistent connections          Yes

Default Configuration

In this section, we measured the web server performance of BIP-NV, BIP-V and NBIP under the default parameter configuration. Figures 3.2-3.4 show the comparison of latency, throughput and concurrent connections, respectively. In Figure 3.2, when the session request rate is below 60 requests/sec, all approaches have reasonable performance. BIP's latency is about one-half of NBIP's. When the session request rate goes above 60 requests/sec, the web server starts to become overloaded under NBIP and the performance of NBIP degrades very quickly. Nevertheless, BIP continues to provide very good performance. When the session request rate reaches 160 requests/sec, where BIP starts to become overloaded, the latency of NBIP is 2 seconds while the latency of BIP is 0.05 seconds. BIP handles a maximum of 160 session requests/sec, while NBIP only handles 60 session requests/sec. This indicates that, potentially, BIP may improve the throughput by 150%. When both NBIP and BIP operate at their maximum throughput (60 requests/sec and 160 requests/sec, respectively), the latencies are 85 ms for NBIP and 51 ms for BIP. This means that BIP may improve the throughput by 150% without sacrificing latency.

The number of concurrent connections is the average number of open connections at any given time. This reflects connection resource consumption. As we can see, when the server is not overloaded, the number of concurrent connections is quite low. Once the server reaches the knee point, the number of concurrent connections increases dramatically.
In our measurements, all approaches finished their session over one persistent connection. At a request rate of 80 requests/sec, even though the latency of NBIP increases significantly, httperf still takes the same amount of time to connect to the web server as it does at lower rates. This indicates that connection establishment is not the bottleneck for this default configuration. The increase in the number of concurrent connections occurs because the web server has the power to accept all the connections but cannot process the requests soon enough, so the connections stay in an established state longer. The benefit of BIP for connection management is not demonstrated here.

In all three figures, BIP-V has similar performance to BIP-NV, and is much better than NBIP. Therefore, we will not consider BIP-V further. The three figures show that BIP is very effective in a LAN environment and boosts web server performance dramatically for static pages with embedded images.

Figure 3.2: Latency Comparison of BIP-NV, BIP-V and NBIP (Default)

Figure 3.3: Throughput Comparison of BIP-NV, BIP-V and NBIP (Default)

Figure 3.4: Concurrent Connections Comparison of BIP-NV, BIP-V and NBIP (Default)

Varying the Size of Embedded Images

In this section, we show measurements of how the size of the embedded images affects the performance of BIP. We assume in our experiments that a sample page has 5 embedded images. We further assume that all embedded images have the same size. We set image sizes to 0.5 KBytes, 1 KBytes, 2 KBytes, and 4 KBytes, respectively, to measure the performance of both approaches.

Figure 3.5 and Figure 3.6 show the latency and throughput of both approaches using different image sizes. In Figure 3.5, the upper group of curves shows the latency of NBIP and the bottom group of curves shows the latency of BIP. The curves in each group from bottom to top correspond to image sizes of 0.5 KBytes, 1 KBytes, 2 KBytes, and 4 KBytes. In Figure 3.6, the curve sequence is in reverse order since as the sizes of the images increase, the throughput decreases. We may clearly see from the two figures that the size of the embedded image is not as important as it first appears. All NBIP configurations have similar throughput with limited difference in latency. For BIP, increasing the size of the embedded image gradually degrades the throughput, from 180 requests/sec for 0.5 KBytes to 140 requests/sec for 4 KBytes. Surprisingly, we did not see any overlap between the two groups of curves in our measurements. Therefore, when the size of the embedded image changes, BIP continues to offer good performance.
Figure 3.5: Latency Comparison of BIP-NV and NBIP (Size)

Figure 3.6: Throughput Comparison of BIP-NV and NBIP (Size)

Varying the Number of Embedded Images

In this section, we show measurements of how the number of embedded images in each HTML page affects the performance of the web server. In our experiments, we use a sample page with 5, 10, 20, and 40 embedded images, respectively. The latencies shown in Figure 3.7 are the latencies when the maximum throughput under each configuration is achieved. The throughputs shown in Figure 3.8 are the maximum throughputs under each configuration. If we examine Figure 3.8, we see that as the number of embedded images increases, the throughput suffers tremendously. While the throughput of NBIP is 60 session requests/sec if a page has 5 embedded images, it drops to less than 11 session requests/sec when 40 images are embedded. At the same time, BIP throughput drops from 180 to 140. The latency increases while the gap between the two approaches becomes larger. NBIP needs several connections to finish a session when a larger number of images is embedded in a page. If a page has 20 embedded images, on average, the browser needs 2 connections to serve the session. When the page has 40 embedded images, the browser needs 3 connections to serve the session. When the size of an image increases, the only added overhead is the transmission time, which is low in our measurements. When the number of embedded images increases, a separate request must be sent for each embedded image and the overhead increases dramatically. This is the situation where BIP may work very well.

Figure 3.7: Latency Comparison of BIP-NV and NBIP (Number)

BIP over Wide Area

We describe in this section some measurements we conducted over a wide area, as detailed in Section 3.4.2. The default configuration is used. Figures 3.9, 3.10 and 3.11 show the latency, throughput and number of concurrent connections for both approaches over the wide area. From our measurements, the throughput for BIP is about 50% more than NBIP, and latency is about 500 ms less than NBIP. Since it takes much longer for the web server to process transfers, the throughput benefit drops, because the overhead introduced by processing embedded images, compared to the time to serve the session, is much smaller than in the LAN case.

CGI

In this section, we show measurements of the performance when CGI scripts (PERL) are executed. Figures 3.12 and 3.13 show the latency and throughput of both approaches.
Our Apache Web Server does not include a built-in PERL module. Therefore, each time a PERL script is invoked, a process is created. Due to this prohibitive overhead, the throughput drops dramatically. Since BIP tries to decrease the overhead of embedded image processing, BIP does not provide much benefit in this case, because the overhead to process the embedded images is much smaller, relative to the total request time, than in the case where a static page is served. For example, in the CGI case, it takes 400 ms to receive the first response, while in the static page case the first byte is received after about 8 ms. BIP has only a 20% throughput improvement. Nevertheless, BIP provides the benefit of reducing latency because it still takes time for the server to process the embedded image requests. If the PERL module is built into the Apache binary, the overhead of processing a PERL script is much smaller. Therefore, a better throughput improvement would be expected.

Figure 3.8: Throughput Comparison of BIP-NV and NBIP (Number)

Figure 3.9: Latency Comparison of BIP-NV and NBIP (Wide Area)

Figure 3.10: Throughput Comparison of BIP-NV and NBIP (Wide Area)

Figure 3.11: Concurrent Connections Comparison of BIP-NV and NBIP (Wide Area)

Summary

From the above measurements, we see that in every case, BIP reduces latency in comparison to NBIP. The improvement in throughput of BIP depends on the overhead introduced to process the embedded images compared to the time required to serve the web requests in both BIP and NBIP. We summarize our results:

1. BIP is very effective for boosting web server performance when serving static pages.

2. The size of the embedded image is not as important as would be intuitively expected within a LAN environment.

3. BIP works very well when the number of embedded images in the HTML page is large.

4. BIP improves the latency and throughput reasonably well when the relative overhead (the overhead compared to the total time to service the request) introduced by processing embedded images is much less than the relative overhead of serving static web pages over a LAN. This conclusion is valid for the WAN and CGI cases.

3.5 Implementation and Possible Limitations of BIPs

It is important to consider how our mechanism might fit with the current HTTP 1.1 specification. Within the HTTP 1.1 standard, the HTTP request header may have optional fields which may be interpreted by the web server and by the proxy. The addition of a "push request" optional field and a "push response" optional field may be easily fitted into the HTTP 1.1 standard. For deployment, BIP approaches may be implemented by the web browser and supported by the web server.
Because it is implemented in the browser and web server, no special hardware or software needs to be installed or administered. Since web users tend to upgrade their browsers often, quick deployment could be expected. Possible limitations include:

1. BIP has not been extended to download embedded images in one I-HTML in parallel using different connections originating from the same browser. It is not clear whether this is a desired behavior, because one of the benefits of HTTP 1.1 is to pipeline the requests and to provide a type of fairness among clients by limiting the connections to the server.

2. BIP does not work well over low-bandwidth links, such as those of modem users. This is because in such an environment the dominant latency is the transmission delay.

Figure 3.12: Latency Comparison of BIP-NV and NBIP (CGI)

Figure 3.13: Throughput Comparison of BIP-NV and NBIP (CGI)

3.6 Summary

In this chapter we proposed a browser initiated approach (BIP) to web document pushing. BIP reduces the download latency of those HTML pages having embedded contents and improves server resource utilization. Simulations show that BIP can download up to 56% of all the HTML pages in one RTT and saves up to 20%-26% of total web accesses. BIP-Hash is best for modest caches with a limited number of proxy levels. Furthermore, BIP-Hist with history depth 1 and BIP-Ref offer reasonable performance with acceptable overhead. Benchmarking shows that in all cases, BIP improves the throughput and reduces latency. BIP impressively boosts the throughput of the web server when serving static web pages with embedded images. BIP improves the throughput of a web server reasonably over a wide area and improves the throughput of web servers that serve CGI requests. The size of the embedded image is not as important as the number of embedded images in an HTML page. BIP offers reasonable latency reduction in all configurations. Our work provides an upper bound for the benefits of BIP. Future work includes implementing an Apache Web Server module to determine a lower bound of the benefits of BIP.

Chapter 4

S3 - Smart Server Selection

Replicated servers are widely used to meet the increasing traffic demand of busy web sites such as Yahoo (www.yahoo.com) and Altavista (www.altavista.com). These web sites have replicated server groups on each continent. With multiple replicated servers running simultaneously, the number of requests serviced by each server is reduced and a reasonable response time may be achieved. The replication of servers increases fault-tolerance, which is essential to these popular sites. By putting replicated servers at appropriate places, replicated servers also increase network proximity and thus significantly reduce the access latency perceived by users.
A fundamental issue for a global-scale replicated service is how the best server is selected based on the user's preference. Most replicated server sites now have a menu on the home page of the site. Users may manually select one server according to their knowledge of geographical proximity. However, geographic proximity does not necessarily reflect network proximity and thus cannot guarantee the best server selection. It is desirable to automate the server selection so that it is transparent to users. In this chapter we propose an approach, called Smart Server Selection (S3) [50], to perform the server selection based on multiple network metrics efficiently collected from routers.

The rest of the chapter is organized as follows. In Section 4.1, we argue that network metrics should be used in server selection. Based on this assumption, Section 4.3 presents an overview of our approaches and extensions to current routers and the DNS. Performance evaluations are presented in Section 4.4. Deployment issues are discussed in Section 4.5. A brief discussion is given in Section 4.6. Section 4.7 provides conclusions.

4.1 Selection Metric

The very first question we need to answer is the selection metric. Some approaches consider the server's load and some do not. Some researchers argue that the purpose of replicating the web site is to provide a better web user navigation experience; the most important issue is the user-perceived latency, so when server selection is performed, the server's load should be considered. Other approaches take only network metrics into consideration.

Research has shown very dynamic server load behavior. In Figure 4.1, the number of accesses per minute computed from a trace file of the web server of the Department of Computer Science and Engineering at Michigan State University is shown. As we can see, even during peak hours, the web server load exhibits dramatic changes within several minutes. Accurately estimating a server's load on time is a challenge. Studies on server selection traditionally have tried to address server selection and load sharing together. In our approach, we separate the server selection problem from the load sharing problem. Server selection is proposed as one of the network services. In our approach, only network metric information is considered. The client is always directed to the nearest site according to network metrics. It is the service provider's responsibility to provide load sharing and resource scheduling to address the resource requirement.

Figure 4.1: One Day's Access History From CSE workload

The principles of S3 are:

1. The server selection criterion should select the server based on network metrics.

2. Temporary server overload should be addressed by efficient server-side load balancing approaches such as [40, 41].

3. Persistent severe overload should be addressed by capacity planning and monitoring of the server system, given that today's hardware is inexpensive.

We argue that the principles of S3 make better use of network resources. Load balancing may redirect a user to a server halfway around the world simply because that server has lighter usage. By doing this, valuable network resources are wasted. We believe overloaded servers should be avoided by adding new servers, moving the locations of the servers, or choosing server locations that better match users' access patterns.
4.2 Introduction of the S3 approach

A promising method to solve the server selection problem is to use DNS servers at the client side. All replicated servers may be aliased to the same DNS name. When a client accesses the service, the client DNS may select from a pool of IP addresses corresponding to the replicated servers and return the IP address of the server best matching the user's preference, such as shortest path or shortest latency.

In order to select a server based on shortest path or latency, the DNS server must know metrics regarding the routes leading to the different servers. However, such information is only available from the routers. Routers know the metrics of the routes corresponding to each IP address, but they have no idea which IP addresses are providing the same service; this is only known by the DNS. To solve the server selection problem, a mechanism is needed to enable DNS servers and routers to communicate with each other. Router extensions are proposed in this paper to support a route metric query for IP addresses. Extensions to the current DNS are proposed to allow it to collect and cache routing metrics and select the best server. We named the mechanism Smart Server Selection (S3).

S3 takes advantage of the DNS's role in the activity of accessing the replicated service. Due to the popularity of replicated services, it is very likely that multiple hosts that share a common DNS access the same service within a short period of time. When that happens, the routing metric information for the service cached in the DNS may be shared by multiple DNS queries, thereby reducing the number of routing metric queries and the DNS server response time. Our simulation results show that S3 provides substantial performance improvement over the DNS round-robin approach, which is the default method of server selection, both in terms of the number of hops used and the total latency experienced. The overhead analysis shows that the overhead introduced by our routing metric collection scheme is negligible.

4.3 S3 - Smart Server Selection

We begin by giving an overview of S3. S3 operates as follows:

1. The server-side DNS maps a DNS name to all replicated servers' IP addresses.

2. When a client DNS asks for the name-to-IP-address mapping for the replicated service (DNS name), the server DNS sends all available IP addresses to the client DNS.

3. When a host sends a domain name query to the client DNS, the client DNS examines whether multiple addresses for this service are available. If multiple addresses are available and the metric is not cached, the client DNS sends a query to collect the necessary routing metrics.

4. The DNS selects a server according to the route metrics it collected from the routers and other information available to DNS servers (such as the geographical distance between the client and the replicated servers) and returns the selected IP address to the host (a small illustration of this selection step follows the list).
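As an illustration of step 4 above, the fragment below shows how a client DNS might choose among the replicated servers' addresses once route metrics are cached. The data layout, the refresh_metrics placeholder and the sample addresses are our own simplification, not the actual DNS extension.

    import time

    TTL_DR = 1800   # seconds between routing-metric refreshes at the client DNS

    # Cached route metrics for one replicated service (placeholder addresses and
    # values in the spirit of Table 4.1).
    metric_cache = {
        "198.51.100.10": {"hops": 15, "latency_ms": 66,  "fetched": time.time()},
        "198.51.100.20": {"hops": 18, "latency_ms": 154, "fetched": time.time()},
        "198.51.100.30": {"hops": 20, "latency_ms": 425, "fetched": time.time()},
    }

    def refresh_metrics(ips, cache):
        # Placeholder for the router query described in Section 4.3.1.
        for ip in ips:
            cache[ip]["fetched"] = time.time()

    def select_server(cache, criterion="hops"):
        """Return the address with the smallest cached route metric."""
        stale = [ip for ip, m in cache.items()
                 if time.time() - m["fetched"] > TTL_DR]
        if stale:
            refresh_metrics(stale, cache)
        return min(cache, key=lambda ip: cache[ip][criterion])

    print(select_server(metric_cache))    # -> "198.51.100.10"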
4.3.1 Possible Approaches to Collect Routing Metrics

The DNS must query a router that has full knowledge of Internet routes, which is usually a BGP (Border Gateway Protocol) router [51]. We propose two different approaches. The approach to use depends on the conditions of the local network.

Query the gateway router

In this first approach, we assume that the client DNS knows at least one gateway router when it is configured. Armed with such information, the DNS sends queries directly to this gateway router. The gateway router searches its routing table and supplies the necessary information to the DNS. The DNS must find the gateway router through other means, and is probably configured manually. A daemon runs on both the DNS and the gateway router, which allows them to communicate with each other. The daemon on the DNS enables the sending of routing information queries and the daemon on the gateway router accepts such queries and sends responses back to the DNS.

The main advantage of this approach is ease of deployment. Only the gateway router needs to be upgraded to support the routing metric query; all other routers need not be upgraded. A technical issue arises if one replicated server is in the same Autonomous System (AS) and may be reached without passing through the gateway router. In this case, the gateway router may not know the routing metric from the DNS to the replicated server. This is the case where OSPF (Open Shortest Path First) [52] is used as the Interior Gateway Protocol (IGP) and the gateway router is in a different area than the DNS. The gateway router provides routing metrics based on the routes from itself to the replicated servers, which may be different from what is perceived by the DNS and the actual clients. Fortunately, since the difference is only for routes inside the same AS, the inaccuracy is not likely to cause noticeable performance degradation.

Query the directly attached router

In this second approach, the DNS may send a query to the router to which it is directly attached (its default router) to obtain the routing metric information. This approach removes the requirement that the DNS know the gateway router.

We propose extending ICMP (Internet Control Message Protocol) [53] to support such a mechanism instead of developing a brand new protocol, because ICMP is implemented on every router and such a mechanism is a natural extension of ICMP. The ICMP packet header has a "Packet Type" field. We extend ICMP to make use of two currently obsolete packet types: type 15 ("information request") and type 16 ("information reply"). We specify that routing queries from the DNS use ICMP packet type 15 and responses from the router use type 16. The scenario of collecting routing information is as follows (a sketch of the corresponding query-handling logic is given after the list):

1. In a query packet, the DNS sets the packet type to 15 and specifies in the packet body the IP addresses and the routing metrics for which it is looking (hops, latency or others). It then sends the packet to its default router.

2. Upon receiving such a query, a router looks up its own routing table. If it can provide all the required information, it replies to the DNS. Otherwise, it fills in the information available in the ICMP packet and forwards the packet along the default path to its own default router, if there is one.

3. The upper level router tries to find the missing information. It will not overwrite or repeat the information that is supplied by lower level routers.

4. This process continues until a router supplies the last piece of information. This router creates and sends a reply packet (ICMP type 16) to the DNS.

The query packet eventually reaches an EGP router if some routing information about destinations outside the AS is requested. If there is no default path on a router and the information is still not complete, then it means some IP addresses are not reachable from that network. This router may safely send a reply to the DNS.
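The sketch below illustrates the query-handling logic a router would apply to such an extended ICMP request. The message is modeled as a plain dictionary rather than a real ICMP packet, and ROUTING_TABLE, DEFAULT_ROUTER and send_to are hypothetical placeholders for the router's routing table and packet transmission.

    # Hypothetical routing state of one router; a real implementation would read
    # the router's actual routing table.
    ROUTING_TABLE = {"198.51.100.10": {"hops": 12, "latency_ms": 66}}
    DEFAULT_ROUTER = "upstream-router"      # None on a router with no default path

    def send_to(destination, message):
        print("->", destination, message)   # stand-in for emitting a packet

    def handle_metric_query(query):
        """query: {'type': 15, 'dns': <addr>, 'metrics': {ip: None or filled dict}}"""
        for ip, metric in query["metrics"].items():
            if metric is None and ip in ROUTING_TABLE:
                # Fill in only what is still missing; never overwrite a value
                # supplied by a lower level router.
                query["metrics"][ip] = dict(ROUTING_TABLE[ip])

        complete = all(m is not None for m in query["metrics"].values())
        if complete or DEFAULT_ROUTER is None:
            # All metrics supplied, or nothing more can be learned: reply (type 16).
            send_to(query["dns"], dict(query, type=16))
        else:
            # Otherwise forward the query along the default path.
            send_to(DEFAULT_ROUTER, query)

    handle_metric_query({"type": 15, "dns": "client-dns",
                         "metrics": {"198.51.100.10": None, "203.0.113.9": None}})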
It is completely transparent to the network and DNS administrators, and it works independently of the routing protocol used. Second, it results in highly accurate routing metrics because ICMP packets go through the same path as the data packets to those IP addresses. Third, because it queries IGP routers, the problem of the first approach is addressed. Last, because ICMP extension provides a way to collect route metrics for multiple IP addresses, this service can be exploited by other servers or hosts too. Compared to the first approach, the second one involves all the routers along the default path up to a router where all routing metric can be supplied. All these routers need to be upgraded to support the new ICMP functions. This may result in longer time to deployment, and more investment. These two approaches are complementary to each other. The first approach may be deployed quickly, while the second one should be used when enough new ICMP function enabled routers are available. 4.3.2 DNS Extensions DNS servers will be extended to handle routing metric queries. The DNS cache entries will also be extended to store routing metrics. Under the current approach, the selection criteria based on the available metrics is configured in a central place: 74 the DNS server. Technically, it is easy to let the host or application specify the selection criteria (by applying different weights to each criteria). However, this will lead to the extension of the DNS protocol. At this point, it is not clear if such an extension is necessary. 4.4 Performance Evaluation In this section, we present the results of several simulations conducted to evaluate S3’s performance. 4.4.1 Performance Metric In our simulations, the following two metrics are used to evaluate the performance. 1. Total hops. This is the total number of hops all the packets traverse. 2. Total network latency in milliseconds. This is the total network latency experi- enced by all the packets. 4.4.2 Simulation Setup In our simulations we measured web accessing performance from a network to a site that provides the service globally. The client network modeled in the simulations is the public computer laboratory connected by a 100 Mbps Fast Ethernet in the Department of Computer Science and Engineering at Michigan State University. Ya- hoo (www.yahoo.com) is chosen to be the site that provides the global replicated 75 services. The web site of Yahoo is replicated at over 20 places, and each place has at least two different IP addresses. To be more realistic, only servers providing the En- glish language are selected. The 5 places chosen were: yahoo.com and ca.yahoo.com (North America), sg.yahoo.com (Asia), uk.yahoo.com (Europe), and yahoo.com.au (Australia). Each group is on a different continent. The hops and latency from each place were measured by traceroute [54], and the collected data is shown in Table 4.1. Table 4.1: Route metrics for Yahoo replicated servers Location Hops Latency(ms) United States 15 66 Canada 15 69 United Kingdom 18 154 Australia 20 425 Singapore 15 340 Our client network is modeled as a cluster of N H hosts. In our simulation, this number is set to 100, which is roughly the number of machines in the public instructional laboratories in our department. Each host generates requests to replicated servers independently according to a Poisson process with arrival rate of AH. Different arrival rates are used to model different usage levels of the labs in our department. 
Usually the labs are heavily used during the daytime and lightly used during the night. Only the number of requests generated by the hosts is considered, without simulating the network conditions of the labs. A change in the route metric occurs as another Poisson process with rate λ_R. In our simulation, this parameter is fixed to approximate the real situation between our local network and the five targeted Yahoo sites.

Based on the data we observed, the hop number change is modeled as a weighted discrete uniform distribution. The latency change is modeled as a weighted linear combination of a Gamma distribution Gamma(n, λ) and a uniform distribution, and occurs as a Poisson process with rate λ_L. For some sites, the Gamma distribution has been reported as a model of latency change (based on work done by the ACS lab, Department of Computer Science and Engineering, Michigan State University). But for many other routes, the latency change is so vigorous that it cannot fit a Gamma distribution. To simulate such a condition, a uniform distribution is introduced into the model. The parameters n and λ are found by manually fitting the data we collected.

The client DNS caches the domain-name-to-IP-address mapping during the simulation period. It refreshes the route metric information at a constant rate of TTL_DR. The resolver at each host refreshes its knowledge of the best replicated server at a constant rate of TTL_H.
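To illustrate how such a latency-change process can be sampled in a simulation, the sketch below draws change events from a Poisson process and change magnitudes from the weighted Gamma/uniform mixture described above. All numerical parameters shown are placeholders, not the values fitted from our measurements.

    import random

    def latency_change_events(duration_s, rate_l, w, gamma_shape, gamma_scale, lo, hi):
        """Yield (time, latency_change) events for one route.

        Events arrive as a Poisson process with rate rate_l; each change is drawn
        from  w * Gamma(shape, scale) + (1 - w) * Uniform(lo, hi).
        """
        t = 0.0
        while True:
            t += random.expovariate(rate_l)          # Poisson inter-arrival times
            if t > duration_s:
                return
            if random.random() < w:
                change = random.gammavariate(gamma_shape, gamma_scale)
            else:
                change = random.uniform(lo, hi)
            yield t, change

    # Placeholder parameters: one change per 10 minutes on average, over one day.
    for t, change in latency_change_events(86400, 1 / 600, w=0.6,
                                           gamma_shape=2.0, gamma_scale=30.0,
                                           lo=50.0, hi=400.0):
        pass  # a simulator would update the route's latency here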
4.4.3 Simulation Results

Figure 4.2 shows the hops traversed by all the packets during simulations of different lengths and using different host request arrival rates. The route selection criterion is shortest path first. The figure shows that the total number of hops increases linearly as the simulation time increases. In all cases, the number of hops experienced by the S3 method is 12% less than that generated by the RR method. We have 3 different sites in these experiments with the same mean path length (15). If the difference between routes becomes larger, the savings due to S3 would be greater.

Figure 4.2: Hops vs Simulation Length

Figure 4.3: Hops vs TTL_DR (λ_H = 1/60)

Figure 4.3 shows the hops versus TTL_DR (the frequency at which the DNS refreshes the routing metric from the routers). It should be noted that the experiments were performed under different TTL_H values. Each dot in the graph represents the results of 6 experiments. The experiments use the same DNS method, host request rate and TTL_DR value, but with different TTL_H values of 300, 600, 900, 1800, 2700 and 3600 seconds. Since each set of 6 numbers is similar (less than 1% difference), the average is used to represent them. For the round robin method, there is no concept of server selection based on the routing metric; it is not affected by the value of TTL_DR. For S3, the hops increase when TTL_DR becomes larger. This suggests that in order to save hops, the DNS needs to refresh the routing metric information more frequently. The slope of the line has a major change when TTL_DR changes from 3600 seconds to 1800 seconds, which marks the critical point where the DNS changes from being able to update routing information promptly to no longer being able to do so.

Figure 4.4 shows hops versus TTL_H. Similar to Figure 4.3, this is an aggregate representation of a set of experiments. Each dot in this figure corresponds to different TTL_DR values of 1800, 3600, 7200, 14400, and 28800 seconds. The graph shows that a smaller TTL_H results in a larger hop saving. A large TTL_H means that the host machine keeps a domain-name-to-IP-address mapping for a long period. In practice, it may also be the result of a longer session. In a long session, once a host establishes a connection with a replicated server (such as TCP), all packets of this session are forwarded to this particular server no matter how the route metric changes at the DNS.

Figure 4.4: Hops vs TTL_H (λ_H = 1/60)

Figure 4.5: Latency vs Simulation Length

Figure 4.5 shows the simulation results of finding the best latency routes. Overall, they are very similar to the results based on hops, but latency changes may occur frequently due to network congestion, while the number of hops between two networks is relatively stable. This requires the DNS to collect the route metric more frequently. For stable routes, the latency changes according to a Gamma distribution. For unstable routes, the distribution is difficult to find and we approximate it using uniform distributions. We further mix the two types of distributions to reflect different situations:

W * Gamma(n, λ) + (1 - W) * Uniform(a, b), where 0 <= W <= 1

The weight W represents how the two distributions are mixed together. Figures 4.6-4.9 are the results of running S3 for different latency change distributions. In Figure 4.6, W is 1, so the distribution is Gamma. In Figure 4.9, W is 0, corresponding to a uniform distribution. Figures 4.7 and 4.8 use W of 0.6 and 0.3, respectively.

Let us temporarily call the latency change distribution D. If TTL_H is reasonably small, such as 30 seconds, we see that the closer D is to Gamma, the more insensitive the latency is to the TTL_DR change; the closer D is to uniform, the more sensitive the latency is to the TTL_DR change. In other words, TTL_DR changes cause large latency variations if D is similar to uniform, but can only cause small latency variations if D is close to Gamma.

Figure 4.6: Latency vs TTL_DR (W = 1, Gamma, λ_H = 1/60)

Figure 4.7: Latency vs TTL_DR (W = 0.6, mixed, λ_H = 1/60)

Figure 6.9: Remote Request Processing Flow During rare-TCP Handoff

These two modules provide a wrapper around the current TCP module. In order to explain the proposed modular TCP handoff design and its implementation details, we consider typical client request processing. There are two basic cases: remote request processing, i.e.
when the front-end node accepting the request must hand off the request to a different back-end node assigned to process this request; and local request processing, i.e. when the front-end node accepting the request is the node assigned to process this request.

First, we consider remote request processing. There are six logical steps to perform the TCP handoff of an HTTP request in rare-TCP handoff:

1. Finish the 3-way TCP handshake (connection establishment) and get the requested URL.

2. Make the routing decision: decide which back-end node is assigned to process the request.

3. Initiate the TCP handoff process with the assigned BE node.

4. Migrate the TCP state from the FE to the BE node.

5. Forward the data packets.

6. Terminate the forwarding mode and release the related resources on the FE after the connection is closed.

Now, we describe in detail how these steps are implemented by the newly added UTCP and BTCP modules and the original TCP/IP modules in the operating system.

1. 3-way TCP handshake. Before the requested URL is sent to make a routing decision, the connection has to be established between the client and the server. The proposed design depends on the original TCP/IP modules in the current operating system to finish the 3-way handshake. In this stage, BTCP_FE allocates a connection structure corresponding to each connection request upon receiving a TCP SYN packet from the client. After that, BTCP_FE sends the SYN packet upstream. Upon receiving a downstream TCP SYN/ACK packet from the TCP_FE module, BTCP_FE records the initial sequence number associated with the connection and sends the packet downstream. After BTCP_FE receives an ACK packet from the client, it sends the packet upstream to TCP_FE. During this process, BTCP_FE emulates the TCP state transitions and changes its state accordingly. In addition to monitoring the 3-way TCP handshake, BTCP_FE keeps a copy of the incoming packets for connection establishment (the SYN packet and the client's ACK to the SYN/ACK packet) and the URL (Figure 6.9), for TCP state migration purposes, which are discussed later. Also, because the TCP handoff should be transparent to server applications, the connection must not be exposed to the user-level application before the routing decision is made. UTCP_FE intercepts the T_CONN_IND message sent by TCP_FE. TCP_FE continues the 3-way handshake without waiting for explicit messages from the modules on top of TCP.

2. URL parsing. BTCP_FE parses the first data packet from the client, retrieves the URL and makes the distribution decision.

3. TCP handoff initiation. A special communication channel is needed to initiate the TCP handoff between the FE and the BE. A control connection between UTCP_FE and UTCP_BE is used for this purpose, as shown in Figure 6.9. This control connection is a pre-established persistent connection set up during cluster initialization. Each node is connected to all other nodes in the cluster. The TCP handoff request is sent over the control connection to initiate the handoff process. Any communication between the BTCP_FE and BTCP_BE modules goes through the control connection by sending the message to the UTCP module first (see Figure 6.9). After BTCP_FE decides to hand off the connection, it sends a handoff request to BTCP_BE (Figure 6.9, step 1). The SYN and ACK packets from the client and the TCP initial sequence number returned by TCP_FE are included in the message.
BTCP_BE uses the information in the handoff request to migrate the associated TCP state (steps 2-4 in Figure 6.9, which are discussed next). If BTCP_BE successfully migrates the state, an acknowledgement is returned (Figure 6.9, step 5). BTCP_FE frees the half-open TCP connection upon receiving the acknowledgement by sending a RST packet upstream to TCP_FE, and enters forwarding mode. UTCP_FE discards the corresponding T_CONN_IND message when the T_DISCON_IND is received from TCP_FE.

4. TCP state migration. In the STREAMS environment it is not easy to get the current state of a connection at TCP_FE, to transfer it, and to replicate this state at TCP_BE. First, it is difficult to obtain the state out of the black box of the TCP module. Even if this could be done, it is difficult to replicate the state at the BE: TPI does not support schemes by which a new half-open TCP connection with a predefined state may be opened. In the proposed design, the half-open TCP connection is created by BTCP_BE replaying the packets to TCP_BE. In this case, BTCP_BE acts as a client (Figure 6.9). BTCP_BE takes the packets from BTCP_FE, updates the destination IP address of the SYN packet to the BE and sends it upstream (Figure 6.9, step 2). TCP_BE responds with a SYN/ACK (Figure 6.9, step 3). BTCP_BE records the initial sequence number of the BE, discards the SYN/ACK, updates the ACK packet header properly, and sends it upstream (Figure 6.9, step 4).

5. Packet forwarding. After the handoff is processed successfully, BTCP_FE enters forwarding mode. It forwards all the pending data in BTCP_FE, which includes the first data packet (containing the requested URL) (Figure 6.9, step 6). It continues to forward any packets on this connection until the forwarding session is closed. During the packet forwarding step, BTCP_FE updates (corrects) the following fields in each packet: 1) the destination IP address (to the BE's IP address); 2) the sequence number of the TCP packet; 3) the TCP checksum. For packets that are sent directly from the BE to the client, the BTCP_BE module updates (corrects): 1) the source IP address (to the FE's IP address); 2) the sequence number; 3) the TCP checksum. After that, BTCP_BE sends the packet downstream.

6. Handoff connection termination. Connection termination frees state at the BE and the FE. The data structures at the BE are closed by the STREAMS mechanism. BTCP_BE monitors the status of the handed-off connection and notifies BTCP_FE upon the close of the handed-off connection in TCP_BE (Figure 6.9, step 7). BTCP_FE releases the resources related to the forwarding mechanism after receiving such a notification.

Local request processing is performed in the following way. After BTCP_FE finds out that the request should be served locally, BTCP_FE notifies UTCP_FE to release the correct T_CONN_IND message to the upper STREAMS modules, and sends the data packet (containing the requested URL) to the original TCP module (TCP_FE). BTCP_FE discards all the packets kept for this connection and frees the data structures associated with this connection. After this, BTCP_FE and UTCP_FE send packets upstream as quickly as possible, without any extra processing overhead.

6.5 Frequent-TCP Handoff Design

Partition-based web proxy clusters demonstrate a different usage pattern of TCP handoff: HTTP requests are more likely to be handed off than in rare-TCP handoff, as we pointed out in Section 6.3. This difference leads to a different TCP handoff design, in which the overhead of remote request processing should be minimized. The flow of remote request processing is illustrated in Figure 6.10. The additional modules are introduced at the same positions in the protocol stack as in the previous design, and are referred to as the BFTCP and UFTCP modules, to indicate that these modules have different functionalities.

Figure 6.10: Remote Request Processing for Frequent-TCP Handoff

1. Connection setup. Under the rare-TCP handoff design, the connection-related resources in the TCP module are released by the RST message when the handoff
The flow of remote request processing is illustrated in Figure 6.10. The additional modules are introduced at the same positions in the protocol stack as in the previous design, and are referred to as the BFTCP and UFTCP modules, to indicate that these modules have different functionality.

[Figure 6.10: Remote Request Processing for Frequent-TCP Handoff]

1. Connection setup. Under the rare-TCP handoff design, the connection-related resources in the TCP module are released by the RST message when the handoff is successful. In the case of frequent TCP handoff, it is inefficient to establish a TCP connection with the TCP module at the front-end node and then free the connection most of the time. Connection setup (the original 3-way handshake) is therefore reimplemented by the BFTCP_FE module to trigger the client to send the URL. BFTCP_FE also has better control over TCP options. After BFTCP_FE receives the URL and makes the decision, it may initiate the handoff through the control connection as before (Figure 6.10, step 1). However, no packet is shipped along the persistent connection in the handoff request at this time. BFTCP_FE may gather the necessary information (for example, the Initial Sequence Number (ISN)) from the connection establishment packets; BFTCP_BE may then reconstruct these packets from the information provided in the handoff request and replay them locally at the back-end node (Figure 6.10, steps 2-4). An acknowledgement is returned by the BE to the FE through the control connection to indicate whether the handoff was successful (Figure 6.10, step 5). If the request is processed locally at the front-end, the kept connection establishment packets are replayed to the local TCP module to create the necessary state.

2. Packet forwarding. Packets may be forwarded to the selected server on top of the IP layer, in the IP layer, or under the IP layer, depending on the cluster configuration and the ratio between local traffic and forwarded traffic. While the BFTCP module may forward the packets on top of the IP layer, similar functionality can be achieved by inserting a module on top of the device driver. When BFTCP is implemented on top of the device driver, and all the back-end nodes are located on the same LAN (as in the described partition-based proxy application), it is possible for the cluster to have a virtual IP address; each back-end node is then uniquely identified by its MAC address, and a packet is forwarded by filling in the right MAC address. This avoids Network Address Translation (NAT) on the front-end and on the back-end node for outgoing traffic. Upon receiving the DLPI message from the device driver, BFTCP_FE changes the DLPI message format and destination MAC address and sends the message downstream.

When the forwarded packets must traverse a router or a WAN, the packet's destination may be changed to the selected server's IP address, and a proprietary protocol may be developed to carry the packet's original IP address to the selected server so that the response packet's source IP address may be updated accordingly. BFTCP_FE updates the packet's destination IP address to the selected server's IP address and sends the packet upstream. IP_FE then forwards the packet according to its routing tables to the back-end node. The per-packet header rewriting involved in forwarding is sketched below.
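To make the forwarding step concrete, the following user-space sketch (not the STREAMS module code) shows the kind of per-packet rewrite described above for a client-to-server packet: the destination address is pointed at the back-end, the acknowledgment number is shifted by the difference between the back-end's and the front-end's initial sequence numbers, and the checksums are patched incrementally in the style of RFC 1624. The function, its parameters, and the choice of which sequence field to shift are illustrative assumptions; the opposite adjustments apply to packets travelling from the back-end toward the client.

```c
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>

/* Incremental checksum update (RFC 1624 style) for replacing one 32-bit
 * field, old_val -> new_val.  Values are the raw words as stored in the
 * packet. */
static uint16_t cksum_adjust32(uint16_t sum, uint32_t old_val, uint32_t new_val)
{
    uint32_t acc = (uint16_t)~sum;
    acc += (uint16_t)~(old_val >> 16);
    acc += (uint16_t)(new_val >> 16);
    acc += (uint16_t)~(old_val & 0xffff);
    acc += (uint16_t)(new_val & 0xffff);
    while (acc >> 16)                      /* fold carries */
        acc = (acc & 0xffff) + (acc >> 16);
    return (uint16_t)~acc;
}

/* Hypothetical FE-side rewrite of one client-to-server packet being
 * forwarded to the chosen back-end.  be_addr is the back-end's address in
 * network byte order; isn_delta is (BE ISN - FE ISN). */
void rewrite_forwarded_packet(struct ip *iph, struct tcphdr *th,
                              uint32_t be_addr, uint32_t isn_delta)
{
    uint32_t old_dst = iph->ip_dst.s_addr;
    uint32_t old_ack = th->th_ack;
    uint32_t new_ack = htonl(ntohl(old_ack) + isn_delta);

    /* 1) Redirect the packet to the back-end node. */
    iph->ip_dst.s_addr = be_addr;
    iph->ip_sum = cksum_adjust32(iph->ip_sum, old_dst, be_addr);

    /* 2) Shift the acknowledgment number into the back-end's sequence space
     *    (on this direction it is the ack field that references the
     *    server-chosen initial sequence number). */
    th->th_ack = new_ack;

    /* 3) Patch the TCP checksum, which covers both the pseudo-header
     *    (destination address) and the acknowledgment field. */
    th->th_sum = cksum_adjust32(th->th_sum, old_dst, be_addr);
    th->th_sum = cksum_adjust32(th->th_sum, old_ack, new_ack);
}
```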
BFTCP_BE has to manipulate the TCP header in any case, updating the initial sequence number and the TCP checksum.

The original TCP connection establishment can be done in one of two modules: either in the operating system TCP module or in the BFTCP module. When the TCP module is used to establish the connection with the client, the initial sequence number is already correct, so responses to local requests may be sent to the client directly without any packet header manipulation. If the BFTCP module is used, then because of the local handoff the initial sequence numbers used by the TCP module and the BFTCP module might differ, and therefore BFTCP has to update the initial sequence number and TCP checksum of every outgoing packet for local requests. In this design, we improve remote request processing at the price of a small additional penalty for local request processing.

3. Handoff connection termination. In order to successfully and robustly free the front-end forwarding session, either the back-end or the front-end node has to observe the two-way TCP control traffic. In the rare-TCP handoff design, BTCP_BE sees the two-way traffic, knows the status of the TCP connection, and sends a notification to the front-end upon termination of the connection. In the frequent-TCP handoff design, connection termination depends on where the modules are inserted. If the modules are on top of the IP level, the rare-TCP handoff termination approach is more elegant. If the functionality of BFTCP is implemented by a module inserted on top of the device driver, the termination approach described in the next section for always-TCP handoff is better. (A module on top of the device driver may not see two-way traffic when multiple network interface cards exist and incoming and outgoing traffic go through different interfaces.)

6.6 Always-TCP Handoff Design

In always-TCP handoff, there are two kinds of nodes: the dedicated front-end node and the back-end web servers. The purpose of the front-end node is to trigger the client to send the URL and then hand off the connection to the selected server. The request flow of always-TCP handoff is shown in Figure 6.11. In this configuration, only one module, the BATCP module, is introduced, at both the front-end and the back-end nodes.

[Figure 6.11: Request Processing for Always-TCP Handoff]

The differences between always-TCP handoff and the previously described rare- and frequent-TCP handoff designs are as follows.

1. Connection setup. BATCP_FE implements the connection setup function as in the BFTCP_FE module. Packets are exchanged to establish the connection with the client and to get the URL. State migration is done by replaying the packets between the front-end and the back-end node. Since all web traffic is handoff traffic, a TCP SYN packet arriving at the web server listening port indicates that a handoff has been initiated (Figure 6.11, step 1), and BATCP_BE sends the SYN packet upstream (Figure 6.11, step 2). A persistent TCP control connection may not be needed, because there is no need to tell the back-end node that a particular flow is a handoff connection. The SYN/ACK packet from TCP_BE is intercepted by BATCP_BE and redirected to the front-end by changing the IP address (Figure 6.11, step 3). BATCP_BE receives the ACK and records the initial sequence number returned by BATCP_FE (Figure 6.11, step 4). No acknowledgement is needed. The URL is forwarded as before (step 6).
Packet forwarding should be done as quickly as possible. In this configuration, it might be better to forward the packet on top of the device driver. A virtual IP address should also be used to avoid network address translation at the front-end, because the front-end can easily become the bottleneck for the whole cluster.

2. Handoff connection termination. The handoff connection is closed in the following fashion. BATCP_BE intercepts the TCP control packets (packets with control flags set, for example RST or FIN) and sends them to BATCP_FE (step 7). BATCP_FE records the connection progress and relays the packets to the client. Data traffic goes directly to the client. The front-end sees the two-way control traffic, so it can keep track of the handoff connection status and close the connection in a timely manner.

6.7 Performance Evaluation

The rare-TCP handoff design is prototyped on the HP-UX platform. The HP-UX operating system supports Dynamically Loadable Kernel Modules (DLKM). As described before, two new modules are developed: BTCP and UTCP. In this section, we report the performance and the overhead introduced by the modular rare-TCP handoff design.

6.7.1 Measurement Testbed

A small testbed is set up to evaluate the modular TCP handoff mechanism. A 100 Mbps Ethernet switch connects four HP-UX workstations. Two machines (HP J6000 workstations) are used as web servers. Each HP J6000 workstation has the following configuration: dual 553 MHz CPUs, 1 GB of memory, a 120 MHz central bus, 512 KB/1 MB cache, and more than 16 GB of disk space. The tool httperf is used to benchmark web server performance and latency. Two client machines are used to drive the measurement.

6.7.2 Performance Measurement

This section shows the TCP handoff measurements. The four configurations we measured are described in Table 6.1. We measured the latency and throughput with different document sizes: 120 bytes, 1200 bytes, 6000 bytes and 12000 bytes.

Table 6.1: Configurations measured

  Configuration    Description
  No modules       Standard, original operating system protocol stack.
  Dummy modules    Two dummy modules that forward the packets without any further processing.
  Handoff-local    Local handoff. After the decision is made, the packets are forwarded without further processing.
  Handoff-remote   Remote handoff. Requests are handed off to another server.

Table 6.2 shows the TCP handoff latency. As we may see, the two dummy modules introduce a negligible delay in all configurations and for all file sizes. TCP handoff introduces a little extra delay; for example, at a document size of 120 bytes the latency increases from 0.9 ms to 1.1 ms. For local handoff, the extra latency does not depend significantly on the file size, because after the decision is made the packets are passed along without further processing. In the case of remote handoff, the extra delay increases with the document size, because a larger document means more forwarded packets (ACK packets).

Table 6.2: Latency of TCP handoff (ms)

  Configuration     120    1200   6000   12000
  No modules        0.8    1.2    1.4    2.8
  Dummy modules     0.9    1.2    1.4    2.8
  Handoff-local     1.1    1.3    1.5    2.9
  Handoff-remote    1.6    2.1    2.8    4.4

Table 6.3 shows the TCP handoff throughput. As we may see, there is overhead associated with the insertion of the two dummy modules; this is the overhead of the STREAMS mechanism. In our measurements, the two dummy modules degraded performance by 5.6% for a document size of 120 bytes. This is a per-packet overhead.
Because this overhead is proportional to the size of the document, the throughput degradation increases as the document size becomes larger. In the case of local handoff and a document size of 120 bytes, the throughput degraded by about 9%. When the document size reaches 6000 bytes, the network becomes saturated (about 78 Mbps of web server throughput) and becomes the bottleneck, so we were not able to measure the overhead under such configurations. In the case of remote TCP handoff, performance degraded by about 16% for a document size of 120 bytes. The network is saturated again when the document size is 6000 bytes; the throughput is 770 requests/second.

Table 6.3: Throughput of TCP handoff (req/s)

  Configuration     120    1200   6000   12000
  No modules        1430   1385   850    800
  Dummy modules     1350   1300   800    800
  Handoff-local     1300   1260   800    800
  Handoff-remote    1200   1130   770    770

6.8 Summary

Research on request distribution and request differentiation receives much attention from both industry and academia. Providing scalable and predictable service is essential for future Internet web sites. Content-aware request processing enables intelligent routing and request processing inside a web cluster, both to support the quality-of-service requirements of different types of content and to improve overall cluster performance. A STREAMS-based TCP/IP implementation, which is available in leading commercial operating systems, offers a framework to implement the TCP handoff mechanism as plug-in modules in the TCP/IP stack.

In this chapter, we used three different applications to discuss the specifics of content-aware request routing and the related architectural design issues. The differences in usage patterns lead to different trade-off decisions in the modular implementation of the TCP handoff mechanism. We discussed these trade-offs and proposed a library of STREAMS modules implementing the TCP handoff functionality that addresses different cluster architectures and optimizes the TCP handoff mechanism for specific usage patterns. The library of STREAMS modules proposed in this chapter offers a set of attractive benefits, namely portability, flexibility, transparency, and efficiency, in support of scalable web server cluster design and smart, specially tailored request routing inside the cluster. More importantly, these modules allow much easier deployment of the solution in commercial systems.

Chapter 7

Web Quality of Service and Differentiated Service

Providing a positive user navigation experience is very important for web site operators, and especially for business web sites. It has been reported that eight seconds is the practical critical point for a web site: a healthy web site should deliver the content of a web page within 8 seconds. As we discussed in chapter 1, the web request serving process involves a web server system, a network system and a client system, and the web server system and the network system are the most likely bottlenecks.

In this chapter, we first discuss some of the approaches available to provide web differentiated service in the web server system, and outline the design and implementation of such approaches in both a FreeBSD-like TCP/IP stack and a STREAMS-based TCP/IP stack. We then briefly discuss network QoS and differentiated service, followed by end-to-end QoS. Finally, we discuss an interesting approach, content delivery networks, and its relationship to end-to-end web QoS.
7.1 Differentiated Web Service at Web Server Systems

Differentiated web service at web server systems refers to the situation where certain requests receive preferential treatment, in terms of resource allocation, over other web requests when the site capacity cannot satisfy all the requests with acceptable performance. Differentiated service assumes that accesses to a web site may be divided into several classes according to some criteria. In chapter 2, we outlined two such criteria. One criterion is to classify the users into several classes, which we refer to as client-oriented differentiated service; in this case, users in different classes might receive different levels of service when the resources at the site are limited, no matter which URL the users access. The other criterion is to classify the URLs into several classes, so that accesses to different URLs receive different levels of service; we refer to this as service-oriented differentiated service. In practice, the two criteria may be combined, and other criteria may be specified when such a requirement arises.

Before we discuss the mechanisms that provide various levels of differentiated service, we take a look at the main steps involved in serving a web request. Web request processing may be logically divided into four sequential stages:

1. TCP connection setup with the web site. Web access is based on the HTTP protocol, which uses the TCP protocol. In this stage, the client initiates the TCP connection to the web server site.
2. Request scheduling. During this stage, the web server site schedules the request according to a predetermined policy.
3. Web request processing by the web server. During this stage, the request is received by a web server and processed.
4. Response transmission. During this stage, the response data is transmitted through the network to the client.

7.1.1 Mechanisms

In this section, we discuss some of the possible mechanisms and policies that provide differentiated service at each of the four stages outlined in the previous section.

- TCP connection setup. During this stage, the client initiates the TCP connection setup by sending a SYN packet. The web server site returns a SYN/ACK packet, and the client then sends the ACK packet, possibly together with the URL. From the web site's point of view, unwanted traffic should be rejected as early as possible. A web site may decide to reject SYN traffic based on layer-4 information, such as the IP address and/or port number, according to a predefined policy. However, it is not clear whether silently rejecting the SYN packet is beneficial for the server's load; it has been reported that the client will simply try to reconnect to the server again soon.

When the web site is overloaded, it has also been observed that the web server is slower to respond to a SYN packet. The reason is that when the web server has a lot of data to transfer, it receives a lot of ACK packets from clients acknowledging the data packets sent by the web server. If SYN packets are processed in the order they are received, connection establishment will be significantly delayed. To make the situation worse, when a client waits too long for connection establishment, it starts to resend SYN packets, which further loads the server. One way to overcome this problem is to give connection-establishment-related packets a higher priority, as in the sketch below; the supporting kernel changes are described next.
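The following user-space sketch illustrates only the classification idea: incoming TCP segments whose SYN flag is set (new connection attempts) are steered to a higher-priority input queue than ordinary data and ACK traffic. The queue structure and helper functions are hypothetical; the in-kernel realization of this idea in a FreeBSD-style stack is discussed next.

```c
/* Minimal sketch, under stated assumptions: give connection-establishment
 * packets (pure SYNs) a higher-priority input queue than other traffic.
 * The queue type is hypothetical; the real mechanism would live in the
 * kernel's IP input path. */
#include <stdint.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>

enum prio { PRIO_HIGH = 0, PRIO_LOW = 1, PRIO_LEVELS = 2 };

struct pkt_queue {                         /* hypothetical per-priority FIFO */
    const struct ip *slots[256];
    unsigned head, tail;
};

static void enqueue(struct pkt_queue *q, const struct ip *iph)
{
    q->slots[q->tail++ % 256] = iph;       /* illustrative ring buffer, no overflow check */
}

/* Classify one IP datagram into a priority level. */
static enum prio classify(const struct ip *iph)
{
    if (iph->ip_p == IPPROTO_TCP) {
        const struct tcphdr *th =
            (const struct tcphdr *)((const uint8_t *)iph + iph->ip_hl * 4);
        /* A SYN without ACK is a new connection attempt: serve it first. */
        if ((th->th_flags & TH_SYN) && !(th->th_flags & TH_ACK))
            return PRIO_HIGH;
    }
    return PRIO_LOW;
}

void ip_demux(struct pkt_queue queues[PRIO_LEVELS], const struct ip *iph)
{
    enqueue(&queues[classify(iph)], iph);
}
```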
Before discussing multiple-priority traffic support in the protocol stack, we describe how packets are received in the protocol stack, using the FreeBSD TCP/IP implementation as an example. When a network interface card (NIC) receives a frame, the interrupt handler for this device is invoked. The interrupt handler checks the packet and puts it into an appropriate queue; in the case of an IP packet, the packet is put into the IP queue. A software interrupt handler (the IP packet handler) is invoked to process the IP packets from the IP queue once the system interrupt priority drops below the software interrupt's priority. The IP packet handler (the ip_input() function) gets an IP packet from the IP queue and processes the IP header, then calls the TCP protocol handler function (tcp_input()) to process the TCP-layer information. The TCP layer puts the data in the socket buffer. The IP packet handler gets the next packet on the IP queue and repeats the same process until either the queue is empty or the process is interrupted by higher-priority interrupts. The application issues a receive() or read() system call to obtain the data from the socket buffer; after the system call copies the data from the socket buffer to the buffer supplied by the application, the data is removed from the socket buffer.

Multiple queues with different priorities are supported by replacing the single IP queue with multiple IP queues of different priorities. When the IP packet handler starts to process IP packets, it selects a packet from a particular queue according to the queue policy and processes it as described before. Since TCP connection-setup-related packets have a higher priority, connection setup progresses much more quickly than before. This solves the previously mentioned problem.

- Request scheduling. After the connection is established, the client sends the URL. The request scheduler has the ability to examine the URL (more generally, the HTTP header) and schedule the request based on the content requested by the user; that is, it controls the order in which web requests are seen by the web server software. Request scheduling is supported by ordering the ready TCP connections in the listen queue. In the FreeBSD-based TCP/IP implementation, the listen queue is a queue of sockets. Each socket represents either a complete connection (a connection that has finished the 3-way handshake with the client and is ready to be accepted by the application) or an incomplete connection (a connection that has been initiated by the client, but for which the 3-way handshake has not yet completed). In FreeBSD, two separate queues are maintained by the kernel: so->q is the queue of complete connections, and so->q0 is the queue of incomplete connections. The application may only accept connections from the complete queue. When the kernel receives a SYN packet, it creates an incomplete socket for this connection, if the particular listen queue is not full, and puts it on the incomplete queue so->q0. After the 3-way handshake finishes, the socket is moved from so->q0 to so->q in preparation for acceptance by the application. By default, the Apache web server sets the listen queue length to 511 connections. Completed connections are accepted in their order on the listen queue; a sketch of reordering this queue by priority follows.
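The sketch below uses a plain C list rather than the kernel's actual socket queues and assumes a hypothetical priority field on each connection. It only illustrates the reordering idea discussed next: when a completed connection is added to the accept queue, it is inserted ahead of all lower-priority entries instead of strictly at the tail.

```c
/* Illustrative sketch (not FreeBSD kernel code): keep the queue of
 * completed connections sorted so that higher-priority connections are
 * accepted first.  The connection structure and priority field are
 * hypothetical. */
#include <stddef.h>

struct conn {
    int prio;                  /* smaller value = higher priority (assumed) */
    struct conn *next;
    /* ... connection state ... */
};

struct accept_queue {
    struct conn *head;
};

/* Insert a newly completed connection before the first entry whose
 * priority value is larger (i.e., lower priority), preserving FIFO
 * order within a priority class. */
void accept_queue_insert(struct accept_queue *q, struct conn *c)
{
    struct conn **pp = &q->head;

    while (*pp != NULL && (*pp)->prio <= c->prio)
        pp = &(*pp)->next;

    c->next = *pp;
    *pp = c;
}
```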
Web differentiated service may be supported by ordering the connections in such a way that high priority connections are placed in front of low priority connections. Different scheduling policies might be used here, for example FIFO (first in, first out), fair queueing, weighted fair queueing, etc. More advanced policies may be supported, too. For example, one proposed policy serves the shortest requests first: smaller web objects are served earlier than larger web objects, and static pages earlier than dynamic pages.

The previous discussion applies only to HTTP 1.0 connections, because the scheduling happens at the granularity of a connection rather than a request; the scheduling decision is made based on the first HTTP header or URL. For example, WebQoS only supports connection scheduling. Support for request scheduling over persistent connections (HTTP 1.1) is not as easy as connection scheduling.

- Request processing. After the web server software accepts the connection, it starts to process the request. At this stage, depending on the web server implementation, the process or thread handling a high priority request may be given a high priority. For a typical e-business application, there are four tiers: web clients (browsers), web servers, application servers, and database servers. The browsers are used by clients to access the web sites, the web server provides the interface to the server-side system, the application server performs the processing, and the database server provides the data to be manipulated. In this chain, differentiated service at the web server contributes only partially to the overall process. Ideally, the priority could be maintained across all the activities in all tiers at the server side. However, this is much more complex and involves all the components along the processing path.

- Response transmission. During this stage, the kernel might implement several policies regarding packet transmission. As we may imagine, the kernel may transmit the response data of high priority requests earlier than that of low priority requests. One such policy is shortest request first; that is, the response data of the shortest job receives the highest priority.

Some of the kernel mechanisms described above may also be implemented at the application level. Request scheduling may be supported either at user level or at kernel level. At user level, request differentiation is typically implemented by a distribution process, which accepts all the incoming HTTP requests and classifies them into one of several classes (queues). The worker processes take requests from the queues and process them. The distribution process may be implemented by the web server software itself, or may be supported by a separate process that feeds requests to the web server software transparently. Kernel-based request differentiation has the following advantages:

No extra bottlenecks. A user-level implementation introduces the distribution process, which controls all the incoming traffic and classifies it into different priority queues. That introduces another central control point, which might become a bottleneck, and the number of distribution processes sufficient to handle the incoming requests efficiently is highly workload dependent. A kernel-level implementation does not introduce additional central control points beyond those already existing in the kernel.

Less resource consumption.
When the server reaches the overload point, admission control has to take place. If the server decides to reject a request (or a new session), it is strongly desirable that a simple rejection message be returned, so that the user does not try to submit the same request again and again; processing such requests as quickly and efficiently as possible is critical. A user-level implementation has to accept the connection, read the request, and return a message. A kernel-level implementation performs similar actions much more efficiently: it introduces no extra context switches and no socket buffers.

Efficient measurements. A kernel-level implementation may obtain accurate information more quickly than a user-level implementation. These measurements may be important to the decision process, such as the number of connections open at a given moment, the activity on each open connection, the average response time, and the network bandwidth and round-trip time of each connection. The kernel implementation may take advantage of these measurements and make more intelligent decisions.

7.1.2 Content-Aware Request Processing and Differentiation

In this section, we outline a modular design for web differentiation in a STREAMS-based TCP/IP implementation. In order to provide web differentiated service, two new modules, UTCP (Upper TCP) and BTCP (Bottom TCP), are introduced. We discuss the request classification, request scheduling, admission control, and transmission design needed to support web differentiated service.

- Request Classification. Classification identifies a request and assigns it to a class. Information within the request, or provided by the application, is applied against the classification policies to determine the request class. The client sets up the TCP connection with the TCP module in the operating system. UTCP maintains the necessary number of priority queues and a partially-finished connection queue. BTCP creates a simple structure for each connection and sends the connection establishment packets upstream. UTCP holds the corresponding T_CONN_IND message in the partially-finished connection queue. After the TCP module finishes the three-way handshake with the client, the client sends the URL to the server. BTCP retrieves the URL after receiving the packet and classifies the request according to the policy specified by the administrator. BTCP sends the classification of the URL to UTCP, and UTCP moves the corresponding T_CONN_IND message from the partially-finished connection queue to one of the supported class queues.

This design may also support persistent connections. Since for persistent connections the connection has already been established and is exposed to the application, UTCP intercepts the subsequent requests, classifies them into one of the classes, and places the message (T_DATA_IND) in one of the queues.

- Admission Control. The admission control mechanism prevents server overload. The classes of requests, the admission control policies, and the system or service load measurements may be used to determine whether to accept a request (or a new session). The admission control mechanism can be deployed using the UTCP module. First, BTCP checks the URL (or cookie). If the request cannot be served at this time, BTCP notifies the UTCP module. The UTCP module then sends a message to the client and releases the connection by sending T_DISCON_REQ.
For subsequent requests on a persistent connection, UTCP checks the URL (or cookie) from the T_DATA_IND message and makes the decision. If such a request cannot be served at this time, UTCP sends a customized message to the TCP module and releases the connection. The returned message may be a redirection to another server, or a customized message stating that the server is busy at the moment.

- Request Scheduling. Request scheduling is used to provide performance isolation or differentiation, depending on the scheduling and classification policies. The UTCP module may support a set of request scheduling strategies, for example FIFO, fair sharing, weighted fair sharing, strict priority queueing, etc.

- Transmission Bandwidth Management. STREAMS already supports flow control, which may be used to implement a rate-based transmission policy by default.

7.2 Network Quality of Service

In the last section, web differentiated service mechanisms were discussed. In this section, we briefly discuss network QoS. Traditionally, the Internet provides only best-effort service: packets are delivered by the underlying network as fast as possible, but they might be delayed or dropped, and upper-level protocols are responsible for recovering from packet loss. In the last three years, much effort has been devoted to providing Quality of Service over the Internet. Network QoS may be classified into two categories: integrated service and differentiated service.

- Integrated Service. In an integrated service environment, the client reserves resources for a flow according to its usage along the path. RSVP may be used as the resource reservation protocol. All the routers along the path have to support resource reservation, and per-flow state has to be kept by routers in order to deliver the desired Quality of Service. This per-flow state greatly impairs the scalability of network core routers. Given that today's "killer app", the web, is built on top of TCP and its connections are short-lived, it would be interesting to see whether the overhead of such an approach can be justified.

- Differentiated Services. In differentiated service, traffic priority is supported at each router, and no hard QoS is guaranteed. Client traffic is classified and marked in the IP header at the network edge before it enters the network. The backbone routers are only responsible for forwarding packets as quickly as possible according to the traffic priority, with higher priority traffic forwarded first. No per-flow state is kept, so backbone router design is greatly simplified. Differentiated service seems easier for ISPs to support and might be deployed incrementally. It may also be applied to web applications.

7.3 End-to-End Web QoS

From the user's point of view, the ultimate measurement is the response time: the time between initiation of the web request and arrival of the response data. In a typical web request, many components are involved. The client needs to access the DNS system first; then the network system is involved; and the web server system is a big part. Providing end-to-end QoS for web applications is a big challenge. In order to provide a performance guarantee, the different components involved have to provide some level of guarantee, too. The DNS and the network are operated by different administrative domains and are difficult to modify to provide guaranteed service without extensive changes to the infrastructure. Providing QoS at the web system side is not an easy task either.
As we described before, today's typical web application involves four tiers: clients, web servers, application servers, and database servers. The client uses a web browser to access the web site. The web server works as a gateway to the web application's logic. The application server executes the application logic and generates the response data for the web servers. The database server provides the necessary data to the application servers. Providing QoS means that the QoS requirement has to be propagated through all the components, including web servers, application servers and database servers. Moreover, the processing time of every component has to be bounded. This is not an easy task.

Providing differentiated service in the web server system is not as challenging as providing guaranteed service. Also, it is much easier to integrate network differentiated service with web system differentiated service; WebQoS supports the integration of network differentiated service and web server system differentiated service.

7.4 Content Delivery Networks

As we mentioned in chapter 2, web hosting and content delivery networks are two complementary ways to improve web site performance and the user-perceived latency of web documents. In web hosting, the content publisher moves the content to the ISPs so that web traffic does not enter the content publisher's network. As we know, Internet routing has an inherent hierarchy similar to a tree. That means that when a web site is accessed, the client (a leaf) first sends the packet to the local ISP (an intermediate node), and the ISP forwards the traffic to a regional ISP. Then the regional ISP forwards the packet to the backbone ISP (the root of the tree). After that, traffic is forwarded back down to another regional ISP, and forwarded again to another ISP where the web site is connected. In short, the packet is sent upward until it reaches the root of the tree and is then sent down to another leaf of the tree. As we may imagine, a good place for a web hosting service is at the root of the tree, that is, the backbone network. Statistically, this cuts the traffic path in half.

What a CDN does is to provide reverse proxies for the web sites near the client locations. A CDN does this by putting servers at the intermediate nodes of the tree described above, and it tries to cache documents from web sites so that documents accessed by clients are in a cache much closer to the clients. A CDN provides the content delivery service so that web site operators can focus their effort on publishing without worrying too much about the performance and management of the web sites.

7.4.1 How Does a CDN Work?

Different CDNs have different offerings. Here we show how Akamai (one of the major CDNs) provides its content delivery service. Akamai claims it has put about 8,000 servers near the POPs (Points of Presence) of ISPs, where the client networks are connected to the Internet.

When a web site decides to take advantage of the Akamai service, the site uses the Free Flow Launcher program provided by Akamai to customize the URLs that are to be served by Akamai servers. As we mentioned before, typical web pages contain many embedded images, which constitute about 80% of web traffic. One observation is that although the HTML document containing the embedded images changes frequently, the embedded images themselves change less frequently. Typically, a web site akamaizes the embedded images and still serves the static HTML or dynamically generated files itself.
This partially mitigates the consistency problem. Any document served by the Akamai CDN has a URL in the following format:

http://a388.g.akamaitech.net/7/388/21/fc35ed7f236388/cnn.com/images/hub2000/ad.info.gif

In this URL, a388 refers to a customer, CNN. The URL of the document on the original server is also included (cnn.com/images/hub2000/ad.info.gif). In the case where the embedded images are akamaized and the HTML documents are still served by the original site, the client first accesses the HTML document. The browser parses this document and obtains URLs like the one above. The client-side DNS server consults the Akamai DNS server to get the nearest available Akamai servers (a list of IP addresses) that most likely have the document or are appropriate to serve it. The selection process is done by proprietary algorithms. If an Akamai server does not have the document, the server retrieves it from the original server and caches it in the Akamai network.

7.4.2 Benefits of CDNs

The benefits of content delivery networks may be summarized as follows:

- Latency reduction. Because a CDN may reduce the load on the original servers and avoid many long network round trips, it can greatly reduce the latency perceived by end users.

- Scalability. CDNs also provide better scalability. It has been reported that a web site's peak request rate may be as much as 10 times the average rate. A typical practice is that web sites do capacity planning and try to provide enough capacity to accommodate the peak rate of web requests. This obviously wastes computing resources and increases web site operation costs, both in hardware and in management. Another observation is that since different web sites may have peak rates at different times, the aggregate resource requirement is much smaller. In an extreme case, a CDN might allocate its resources to serve only one site. Thus CDNs provide almost unlimited scalability for today's typical web applications.

- Availability. A CDN may also be used to enhance the availability of documents pushed onto the CDN. CDN servers may cache the documents at multiple locations and serve them directly from the content servers even when the original sites fail.

- Maintenance cost. Under current CDN practice, the embedded images are pushed to the content delivery network, which greatly reduces the burden on the publisher who operates the web site. This greatly reduces the hardware and management overhead of the web sites.

7.4.3 Content Delivery Networks and End-to-End QoS

Streaming media is believed to be the next big thing in the industry. Streaming media is much more sensitive to network conditions: a stream has a fixed bandwidth requirement and is very sensitive to variation in network latency. Network QoS, both integrated service and differentiated service, is still a dream in some sense, because the technology requires many changes in the Internet infrastructure. A typical Internet path usually involves several ISPs, and it is difficult for application developers to benefit from the technology if only one or a few ISPs deploy it. The deployment of such technologies will take some time, and application developers may not be able to take advantage of these solutions in the near future. By placing the content servers at POPs, where the client networks connect to the Internet, a CDN bypasses most of the Internet path.
Because clients are only a few hops away from the content servers, network latency is much more stable and predictable. CDNs will therefore be an interesting technology for delivering streaming media documents.

7.5 Summary

In this chapter, end-to-end web QoS was discussed. We described some of the kernel and user-level mechanisms that achieve web differentiated service at the web system side, and how these mechanisms may be supported both on the FreeBSD network stack and on a STREAMS-based protocol stack. Network QoS was briefly described, and Content Delivery Networks (CDNs) were presented. We showed that CDNs address some of the needs of today's web services, namely latency reduction, scalability, availability and maintenance cost. We also showed that because CDN servers are deployed at the network edge, CDNs are an interesting technology for the emerging area of streaming media delivery.

Chapter 8

Summary

The web has experienced phenomenal growth during the last couple of years. As the Internet evolves into client/server based computing, where millions of clients (browsers) access a comparatively small number of popular sites, web site performance and scalability are important issues to be addressed. In this dissertation, we address some of the problems related to the development of high performance, scalable web server systems. Our contributions may be divided into two parts: technologies that target a single web server, and technologies for multiple replicated web sites.

For a single web server, Browser Initiated Pushing (BIP) is proposed to improve performance based on the observation that today's typical web page has one or more embedded images. Measurement shows that, for a typical file in a LAN environment, BIP reduces the user-perceived response time by 40% under normal load and improves response time 3-4 times under heavy load. BIP increases the server's throughput by 150% under our measurement configuration in a LAN environment. Trace-driven simulation shows that enhanced BIP reduces the overhead of traditional pushing from 32% to 10%. BIP is an important tool for improving single-server performance.

For replicated web sites, we proposed two approaches and developed a framework to address the scalability of such systems. A non-dispatcher approach, Static Allocation and Client Redirection (SSCR), which shares load between replicated web servers, is developed. SSCR is mainly targeted at web server replication on a global scale. Trace-file simulation shows that SSCR offers performance competitive with a dispatcher-based approach, without the single point of bottleneck and failure of the cluster environment. Smart Server Selection (S3) addresses the server selection problem. In S3, the client DNS server is extended to prioritize a pool of IP addresses based on routing metric information collected from routers and other information it collects (the geographical location of servers and clients). An efficient scheme to collect routing-metric information from routers is proposed. The overhead of this scheme is independent of the number of replicated servers; this is a significant improvement over existing approaches, whose overhead grows linearly as the number of replicated web servers increases.

A framework to support content-aware request distribution in a STREAMS-based TCP/IP implementation is developed and prototyped.
Content-aware request distribution provides the ability to support partial replication, flexible web site arrangements, web Quality of Service, and security. Our framework is based on the TCP handoff mechanism, which is designed as STREAMS module(s) in the protocol stack. Three different designs are reported, according to workload characteristics. Differentiated web service support in the STREAMS-based TCP/IP implementation is also discussed.

Chapter 9

Further Work

There are several opportunities to extend and improve the work that we have already accomplished. This section outlines some areas of further research.

- Performance evaluation of scalable proxy servers based on TCP handoff mechanisms. In chapter 6, we outlined an approach that uses a content-aware protocol stack to support request resolution in a partition-based cooperative proxy environment. It would be valuable to have a performance evaluation of this approach.

- Prototype web differentiated service in a STREAMS-based TCP/IP environment. We outlined the design in chapter 7. A performance evaluation would strongly augment this work.

- For the BIP approaches, we plan to develop new Apache modules to give a tighter bound on the benefits of BIP, and possibly to measure the performance of BIP under realistic workloads.

- For our S3 approach, a special tool may be developed that prioritizes the IP address list. Such a tool can measure latency and hop-count information from the network, because we do not have direct access to a BGP routing table. However, we may use dumped BGP routing tables to identify the number of groups of servers, so that the number of entries processed by this tool is limited to the size of an Internet backbone router's table, which has about 400,000 entries.

- For the SSCR approach, the evaluation was done for local clusters. It would be worthwhile to further validate the ideas using wide-area replicated web site logs.