This is to certify that the dissertation entitled

A FRAMEWORK FOR DISTRIBUTED WEB SERVICES

presented by Yew-Huey Liu has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

Major Professor
Date: Sept. 16, 1996

A FRAMEWORK FOR DISTRIBUTED WEB SERVICES

By

Yew-Huey Liu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1996

ABSTRACT

A FRAMEWORK FOR DISTRIBUTED WEB SERVICES

By Yew-Huey Liu

In essence, the World Wide Web is a worldwide string of computer databases using a common information retrieval architecture. With the increasing popularity of the World Wide Web, more and more functions have been added to retrieve not only documents written in HTML (Hypertext Markup Language), but also those in other forms through the Common Gateway Interface (CGI), by constructing HTML documents dynamically.

One of the most exciting promises of the Web is the Digital Library. Large-scale Digital Libraries will make huge collections available to thousands of people over wide geographical distances using wide-area computer networks. These libraries will be some of the largest distributed systems ever built. Combining the Web with Digital Libraries will require solutions to the problems of efficiently generating HTML pages containing digital images from huge digital collections and indices, and of efficiently navigating through them. Dynamic construction of HTML documents for handling information such as digital libraries is slow and requires much more computer power. A significant performance bottleneck is the initialization and setup phase for a CGI process to gain access to the system containing the data.

In this thesis, we present the design and implementation of a framework for distributed Web services. Combining the connection manager daemon, the cache manager daemon, and a number of client processes (i.e., cliette processes), we address the performance issue of generating dynamic HTML pages with the existing CGI interface. In a joint study between IBM and the Florida Center for Library Automation (FCLA), this framework has been used as a gateway between the FCLA state-of-the-art Bibliography Search server and the IBM Visual Info system. Using the Extended Unified Trace Environment (UTE) tools for program visualization and performance analysis, we show that the framework for distributed Web services provides an efficient gateway solution to address CGI performance issues.

© Copyright 1996 by Yew-Huey Liu. All Rights Reserved.

To my husband Eric and my lovely daughters Michelle and Alicia

Acknowledgments

I would like to take this opportunity to express my appreciation to a number of people, without whom this dissertation could not have been completed. First, my sincerest thanks go to my advisor, Lionel M. Ni. He has taught me many things about computer science, and has been my mentor, my colleague, and my adviser since my enrollment at Michigan State.
He provided valuable guidance during my graduate study, and his positive influence on my personal and technical development will carry forward into my future endeavors. I am also grateful to Professors Abdol-Hossein Esfahanian, Philip K. McKinley, Matt W. Mutka, and Richard Brandenburg for their valuable advice and encouragement, and for serving as members of my dissertation committee.

The support provided by IBM through IBM's Graduate Work Study Program is highly appreciated. I would like to thank my former manager, Sigmund Handelman, for giving me the opportunity to pursue my degree. My sincere thanks go to my manager, Paul Dantzig. His constant encouragement and support have helped me finally finish this long journey.

This thesis could not have been written without the help and understanding of my family. My very special thanks go to my husband, C. Eric Wu, for his everlasting encouragement, support, patience, and love. I proudly share this accomplishment with him and my two lovely daughters, Michelle and Alicia. Last, but not least, I would like to thank my mother for her love, encouragement, and helping hands through the years.

Table of Contents

LIST OF TABLES
LIST OF FIGURES

1 Introduction
1.1 Background
1.2 Motivation
1.3 Related Work
1.4 Organization

2 Distributed Web Services
2.1 The HTTP Daemon
2.1.1 The HTTP protocol
2.1.2 The high-performance HTTP server
2.2 Software Architectures for Distributed Web Services
2.2.1 Connection Manager Daemon
2.2.2 Cliette processes
2.2.3 CGI processes
2.2.4 Cache Manager Daemon

3 Unified Trace Environment and Its Extension for Distributed Web Services
3.1 Unified Trace Environment
3.1.1 Distributed parallel systems
3.1.2 UTE trace generation and libraries
3.1.3 Tools and visualization
3.2 UTE Extensions for Distributed Web Services
3.2.1 Existing benchmarking tools and open issues
3.2.2 New trace events - IP_Send, IP_Recv
3.2.3 Dynamic trace generation
3.2.4 Multiple trace channels
3.2.5 Unique IDs for trace generation
3.2.6 Clock synchronization
3.2.7 On-line timing routines for run-time timing data and statistics
3.2.8 Enhancement to the utility command - ute2ups
3.2.9 Enhancement to the NUPSHOT program
3.2.10 User markers - secCGI and secCache

4 Performance Evaluation of the Framework
4.1 Prototype System Setup
4.1.1 Performance of the traditional design
4.1.2 Our framework solution
4.2 Design Considerations
4.2.1 Connection Manager
4.2.2 Number of cliette processes
4.2.3 Number of SP nodes for cliette processes
4.3 Performance Results
4.3.1 Workload
4.3.2 Influence of the number of cliette processes
4.3.3 Influence of the number of SP nodes
4.4 Observation

5 A Digital Library Using the Framework
5.1 Overall View of the FCLA Digital Library Web Services
5.2 The Internal Design of the Visual Info Cliette Process
5.2.1 Generating the HTML page when the value of DisplayMethod is 2
5.2.2 Generating the HTML page when the value of DisplayMethod is 1
5.3 The Internal Design of the Primary CGI Interface - CGIscript
5.4 The Internal Design of the CGI Interface - GetGif
5.5 Performance of Distributed Web Services

6 Digital Library Performance Analysis and Visualization
6.1 FCLA Digital Library Trace Environment Setup
6.2 Standard HTTP Setting Without Using CMD/Cache Support
6.3 Running on a Single Workstation
6.4 Running on a Single IBM SP2 Node
6.5 Running on a Single Workstation with Cache Manager
6.6 Running on a Single IBM SP Node with Cache Manager
6.7 Running on a Cluster of Workstations
6.8 Running on a Cluster of Workstations with Cache Manager
6.9 Running on Three IBM SP Nodes
6.10 Running on an IBM SP System with Cache Manager
6.11 Running on a Cluster of Four Workstations with One Workstation Dedicated to the HTTP Daemon
6.12 Observations and Lessons Learned

7 Conclusion and Future Research
7.1 Research Contributions
7.2 Directions for Future Research

APPENDICES

A Sample CMD Configuration File and CGI API Calls
A.1 Sample CMD Configuration File
A.2 General Purpose Request Block
A.3 Connection Manager Daemon API
A.4 Cliette Process API
A.4.1 The HTML Request Block
A.5 CGI Process API

B CMDadmin Manual Page and Its Usage Example
B.1 CMD Administration Command Manual Page
B.2 CMDadmin -b Command Example

C Sample Cache Manager Configuration File and Its API Calls
C.1 Sample Configuration File
C.2 Cache Manager API

D Ute2ups Output File

E One Set of the CGI Performance Trace Results

F Attributes for the FCLA_1 Index Class

List of Tables

3.1 A time partition table
3.2 A histogram of MPL events and user markers
4.1 Detailed initialization time in the Digital Library environment
4.2 Overhead in our framework, assuming 2,049 bytes per HTML page
4.3 Average Wait/Work time for 1 CGI process versus 1 cliette process
6.1 Elapsed secCGI time statistics (using Web standard components without CMD support)
6.2 Elapsed IP_Send and IP_Recv time statistics on a single workstation
6.3 Elapsed secCGI time statistics on a single workstation
6.4 Elapsed IP_Send and IP_Recv time statistics on a single IBM SP node
6.5 Elapsed secCGI time statistics on a single IBM SP node
6.6 Elapsed IP_Send and IP_Recv time statistics on a single workstation with Cache Manager support
6.7 Elapsed secCGI time statistics on a single workstation with Cache Manager support
6.8 Cache Manager activity statistics on a single workstation
6.9 Elapsed IP_Send and IP_Recv time statistics on an IBM SP node with Cache Manager support
6.10 Elapsed secCGI time statistics on a single IBM SP node with Cache Manager support
6.11 Cache Manager activities on a single IBM SP node
6.12 Elapsed IP_Send and IP_Recv time statistics on a cluster of workstations
6.13 Elapsed secCGI time statistics running on a cluster of workstations
6.14 Elapsed IP_Send and IP_Recv time statistics on a cluster of workstations with Cache Manager support
6.15 Elapsed secCGI time statistics on a cluster of workstations with Cache Manager support
6.16 Cache Manager activities on a cluster of workstations
6.17 Elapsed IP_Send and IP_Recv time statistics on three IBM SP nodes
6.18 Elapsed secCGI time statistics running on three IBM SP nodes
6.19 Elapsed IP_Send and IP_Recv time statistics on three IBM SP nodes with Cache Manager support
6.20 Elapsed secCGI time statistics on three IBM SP nodes with Cache Manager support
6.21 Cache Manager activities on an IBM SP system
6.22 Elapsed IP_Send and IP_Recv time statistics on a cluster of four workstations with one workstation dedicated to the Web server
6.23 Elapsed secCGI time statistics on a cluster of four workstations with one dedicated to the HTTP Daemon
6.24 Summary of performance results

List of Figures

2.1 How service is established between a cliette and a CGI process
2.2 Connection Manager Daemon being used by an HTTP (Web) server
2.3 State Transition Diagram for a cliette process
2.4 CGI usage of the cache to avoid cliette connections
2.5 Cliette usage of the cache to avoid database connections
2.6 Overview of Cache Manager
3.1 Unified Trace Environment for IBM SP systems
3.2 MPL_Send in the UTE/MPI trace library
3.3 A NUPSHOT visualization of matched sends and receives
3.4 NUPSHOT visualization: with and without states
3.5 Visualization of user markers
3.6 File browser for source code association
3.7 Unified Trace Environment for IBM SP systems
4.1 Traditional Web server design
4.2 Our framework solution
4.3 Elapsed time using traditional Web server design
4.4 Elapsed time using our framework solution
4.5 Assuming infinitely fast back-end server
4.6 Assuming cliette processes' Busy ratio is 20%
4.7 Assuming cliette processes' Busy ratio is 50%
4.8 Forty-eight processes versus 48 cliette processes
4.9 Ninety-six processes versus 96 cliette processes
5.1 FCLA Web services
5.2 Graphic representation of documentation stored in the Visual Info
5.3 Flowchart of FCLA cliette processes
5.4 A dynamic HTML page contains a page GIF image
5.5 A dynamic HTML page contains a pick list
5.6 Flowchart of nph-CGIscript process
5.7 Flowchart of GetGif process
6.1 One workstation/SP node without the Cache Manager
6.2 One workstation/SP node with the Cache Manager
6.3 Three workstations/SP nodes without the Cache Manager
6.4 Three workstations/SP nodes with the Cache Manager
6.5 Distributed Web services on a single workstation
6.6 Distributed Web services on a single IBM SP2 node
6.7 Distributed Web services on a single workstation with Cache Manager support
6.8 High-performance Web server on an IBM SP2 system with caching support
6.9 Distributed Web services on a cluster of three workstations
6.10 Distributed Web services on a cluster of three workstations with Cache Manager support
6.11 Distributed Web services on three IBM SP nodes
6.12 Distributed Web services on an IBM SP2 system with caching support
6.13 Distributed Web services on a cluster of four workstations

CHAPTER 1

Introduction

Computers have long been at center stage in providing high-speed computation power and transaction processing services. Mainframe computers dominated the 1960s and 1970s after IBM introduced the System/360 in 1964. Minicomputers later became popular as departmental systems after DEC introduced the VAX-11 in 1977. In these environments, users employed dumb terminals to instruct these host systems. Gradually, it became common to interconnect mainframes and minicomputers to create distributed systems over computer networks.

The 1980s brought PCs, workstations, and LANs. PCs were quickly linked to the large-scale distributed systems, but primarily to emulate the dumb terminal, so the local PC computing capability was not exploited. From their beginning, powerful workstations running engineering/scientific applications used LANs to share data among users. Little by little, users began installing LANs to interconnect a few PCs so they could share printers and exchange data files and messages. PC LAN programs and entirely new LAN operating systems were developed. They made LAN resources, such as server disk storage and printers, appear as if they were part of the user's PC. More significantly, performance of these network-attached devices could be made to appear almost as fast as resident devices! This was the birth of client/server computing. The breakthrough in technology that made client/server computing practical was the introduction of very high-speed LAN networks.

Today, Network-Centric Computing has been described as the next computing wave after host-centric and client/server computing. It represents a form of distributed computing in which the network of computing resources is viewed as the supplier of services. A fundamental trend for servers in Network-Centric Computing is to evolve from traditional database and transaction servers into information distribution and handling systems.
This trend is driven by the rapid growth of wide-area internetworks and the availability of inexpensive microprocessors, which fuel exponential growth in the workstation and personal computer markets. Lower costs in processing and storage have also stimulated the use of many varieties of information, including news sources, document images, video training materials, and financial services. This trend has accelerated because of the World Wide Web (WWW) [1] and the Internet. These networks themselves reflect changes in the economy: corporations need to be in closer electronic communication with their customers, suppliers, mobile workers, and contracting organizations, and with public sources of information and services.

1.1 Background

Internet applications such as file transfer (FTP), remote login (telnet), search (gopher, Veronica), and locate (finger) have been the traditional users of the Internet Protocol (IP). The emergence of the World Wide Web, a collection of distributed hypermedia server systems accessible through Web browsers at client systems, has accelerated the growth of Internet sites and usage. In fact, the term "World Wide Web" is now often used synonymously with the Internet. The World Wide Web is becoming the interface of choice for electronic commerce and for distributing information. Its key attributes are that the protocol is open and that it separates how information is presented from the information content. Many emerging applications, such as virtual shopping malls and information retrieval services, are enabled through the Web with the use of Web browsers.

The recent explosion of interest in the World Wide Web can be traced to the distribution of the CERN (European Laboratory for Particle Physics in Geneva, Switzerland) and NCSA (National Center for Supercomputing Applications) servers and Web client browsers. In particular, NCSA Mosaic, a graphical user interface based on distributed multimedia hypertext for Web browsing, has spawned several commercial variants and has made the Internet readily accessible to a much larger population.

Shortly after NCSA's Web server was established, it became clear that the volume of Web traffic would stress operating systems and network implementations in ways not originally envisioned by their designers. For example, the NCSA server receives 30 to 40 new Web requests per second [2] at peak times. Because the Hypertext Transfer Protocol (HTTP) [1, 3] is connectionless, each such request appears to the server as a separate network connection. Given the exponential growth of the Internet in recent years, and the Web in particular, it is increasingly difficult for organizations and information system staffs to properly anticipate future Web service needs, both in human resources and in hardware requirements. Not only were most implementations of the TCP/IP network protocol not designed to accept connections at this sustained rate; even conservative projections of request rate growth showed that no single processor system could serve all requests. Network statistics from Merit, the NSFNet backbone management group, show that Web traffic is the largest and by far the fastest growing segment of the Internet, and growing numbers of government and commercial groups are making hundreds of gigabytes of data available via Web servers.
At the same time, the Web servers at NCSA have experienced explosive growth in traffic, from 1 million requests per week in February 1994, to 2 million per week in June 1994, 3 million per week in September 1994, nearly 4 million per week in December 1994 [4], and even larger numbers in 1995.

To support continued growth, Web servers must manage a multi-gigabyte (in some instances a multi-terabyte) database of multimedia information while concurrently serving multiple request streams. This places demands on the servers' underlying operating systems and file systems that lie far outside today's normal operating regime. Simply put, Web servers must become more adaptive and intelligent. The first step on this path is understanding exact access patterns and responses. On the basis of this understanding, one can then develop more efficient and intelligent server and system file-caching and prefetching strategies.

A scalable Web service can grow with a seemingly endless increase in the number of user requests by adding more capacity, such as another Web server or more disk or memory, to the Web service [5]. Adding capacity to a scalable Web service should be easily managed; extensive reconfiguring of the entire Web service should not be required. The scalability of the current World Wide Web [6] is mostly accomplished through the distribution of files across a series of decentralized servers. This form of load distribution is both costly and resource intensive. The virtual server concept [5] was introduced to add scalability by dynamically rotating through a pool of Web servers. It uses a round-robin dynamic name server (DNS) to distribute Web requests across a cluster of identically configured Web servers. This type of design reduces single points of failure and increases availability.

A scalable Web server technique is used at the National Center for Supercomputing Applications [5]. The key elements are the use of the Andrew File System to provide a platform-independent distributed file system and Round-Robin Distributed Name service to allow any of several server platforms to respond to queries to the same URL. The result of this architecture is that any number of servers can be added to the available pool, dynamically increasing the load capacity of the virtual server.

The basic approach for a Web server (a.k.a. HTTP daemon or HTTPD) to retrieve an existing Hypertext Markup Language (HTML) [7] document is simply to get it from the documentation tree. With the growth rate of most Web sites, it is almost impossible to manage this hypertext documentation without the aid of a back-end server. Adding new information and deleting outdated information becomes an everyday challenge.

One of the most exciting promises of the Web is the Digital Library. The Web, while it probably contains more information than any single traditional library, is arguably not as useful as a traditional library because it lacks services such as organization and sophisticated search support. The Digital Library has been identified as a "National Challenge" in the Information Infrastructure Technology Application component of the U.S. High Performance Computing and Communications Committee (HPCC). Within the past decade, the number and kinds of digital information sources have proliferated. Computing system advances and the continuing networking and communications revolution have resulted in a remarkable expansion in the ability to generate, process, and disseminate digital information.
Together, these developments have made new forms of knowledge repositories and information delivery mechanisms feasible and economical. Combining the Web and the Digital Library efforts broadens the scope of the information retrieval system. Eventually, large-scale Digital Libraries will make huge collections available to millions of people over wide geographical distances using wide-area computer networks.

1.2 Motivation

As the Web becomes more popular, additional functions have been added to retrieve data more efficiently [8, 9]. In addition, data stored in other forms can be retrieved through the Common Gateway Interface (CGI) [10]. The CGI mechanism is a simple, general-purpose interface that is easy to use. When the CGI mechanism launches a script, the HTTP daemon forks a process and executes it, passing arguments in environment variables. This is a very low-tech interface, but it works on all Unix platforms and for every Web server. Any programming language can be used for the gateway script.

The CGI provides an interface to dynamically construct HTML documents for a Web server, so that data and documents need not be stored in the documentation tree. This allows the use of various tools such as relational databases to provide easy maintenance and manipulation of the documents and data. For example, by storing data in a relational database, one can generate query language scripts in CGI processes to retrieve data from the database and construct HTML documents "on the fly".

The World Wide Web is gaining popularity partially because it gives quick and easy access to a tremendous variety of information in remote locations. Users do not like to wait for their results. Typical users tend to avoid and/or complain about Web pages that take a long time to retrieve. That is, users care about Web latency. Perceived latency comes from several sources. For example, it might take a long time for a Web server to process a request, especially if it is overloaded with CGI processes. Common CGI scripts include those that perform searches on behalf of client requests. Web clients may also cause additional delay if they cannot quickly parse the retrieved data and display it for the users. Latency caused by client or server slowness can be solved simply by buying a faster computer, faster disks, more memory, or a combination of these.

Another cause for delay is that a CGI process may need to gain access to its back-end server upon receiving a Web request. The initialization and setup phase is usually a very time-consuming step, especially for simple requests. Unfortunately, this initialization step has to be repeated many times for many CGI requests, since each CGI process is created to serve just one Web request.

The realization of the dream of combining the Web with the Digital Library will require solutions to the problems of efficiently generating HTML pages containing digital images from huge digital collections and indexes, and of efficiently navigating through them. These huge digital collections and indexes cannot be stored statically as HTML pages, since many require constant updates, and doing so would make managing and creating Web pages in a timely fashion impossible. Various Digital Library projects have used different back-end servers to solve the problems of management and searching, such as the Harvest Information Discovery and Access System [11]. However, no good gateway exists between these back-end servers and the standard Web servers.
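The repeated initialization cost is easy to see in the shape of a typical gateway script. The following minimal sketch shows the traditional per-request CGI pattern; the back-end client library used here (backend_login, backend_query, backend_logout) is purely hypothetical, but the structure is the same for any stateful back-end: the expensive login and session setup is paid again on every single Web request, because the process exists only for the lifetime of one request.

    /* A minimal sketch of the traditional per-request CGI pattern.
     * The back-end calls below are hypothetical placeholders for a real
     * client library (e.g., a database or document-management API). */
    #include <stdio.h>
    #include <stdlib.h>

    extern void *backend_login(const char *user, const char *passwd); /* slow */
    extern char *backend_query(void *session, const char *query);
    extern void  backend_logout(void *session);

    int main(void)
    {
        /* The HTTP daemon passes the request through environment variables. */
        const char *query = getenv("QUERY_STRING");
        if (query == NULL)
            query = "";

        /* Initialization and session setup, paid again on every request. */
        void *session = backend_login("webuser", "webpasswd");
        if (session == NULL) {
            printf("Content-type: text/html\r\n\r\n<H1>Server busy</H1>\r\n");
            return 1;
        }

        /* The actual work is often much cheaper than the setup above. */
        char *result = backend_query(session, query);
        printf("Content-type: text/html\r\n\r\n");
        printf("<HTML><BODY><PRE>%s</PRE></BODY></HTML>\r\n",
               result ? result : "");

        /* The session is torn down; nothing is reused by the next request. */
        backend_logout(session);
        return 0;
    }

In the framework described next, this setup portion is performed once by a long-lived cliette process and amortized over many requests.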
This motivates us to design and implement a framework for distributed Web services. The framework includes the Connection Manager Daemon (CMD), the Cache Manager Daemon, and a set of gateway program APIs (Application Programming Interfaces). Using these APIs, a CGI process can talk to a prestarted back-end client process and get services immediately from the back-end server without spending time on initialization and session setup. These prestarted client processes (referred to as cliette processes) are virtual users who log on to various back-end server applications such as Visual Info [12] and DB2 [13]. Multiple cliette processes can be allocated for each back-end server to better serve Web requests. They can also be invoked on a remote machine to evenly distribute the Web server's load. The main function of the Connection Manager Daemon is to schedule these cliette processes to serve requests coming from various Web gateway processes. Cliette processes can be dynamically started or terminated by the Connection Manager Daemon according to the load of the Web server as it attempts to increase its scalability. By combining the load distribution of cliette processes with a round-robin dynamic name server, we can provide faster Web services with much greater flexibility and scalability.

Repeat requests from Web clients may be served quickly, saving system resources, if the retrieved information is stored and searched. This motivates us to develop a Cache Manager in the distributed framework. The Cache Manager manages information generated by cliette processes. It provides different threads for different types of cached information. The Cache Manager is independent of the Connection Manager Daemon. It stores cached information in memory or on disk, and is configurable to fit different server environments.

A Digital Library is a good vehicle to illustrate the set of gateway program APIs and to help us acquire first-hand experience. This motivates us to develop a Digital Library prototype using the distributed framework with a back-end Visual Info server on IBM SP2 systems. An IBM SP system is a general-purpose scalable parallel system based on the message-passing programming model. It provides a high-performance switch network for message passing and interprocess communication. Such scalable parallel systems are increasingly being used to address existing and emerging application areas that require performance levels significantly beyond what symmetric multiprocessors are capable of providing.

In addition to the distributed framework and a Digital Library prototype, we need a good, flexible tool to monitor Web performance and to better understand the communication patterns among the various components of the framework. Most benchmarking tools available today only provide information about the Web server httpd. While monitoring httpd is sufficient for a general-purpose Web server, it cannot provide any information about how our distributed framework performs. To understand the performance issues, we need tools to trace the framework with minimal overhead. This requirement prompts us to add a tracing facility to our interface.

Running Digital Library servers on scalable parallel systems such as the IBM Scalable Parallel (SP) system provides better expandability to satisfy future growth. The heart of an IBM SP system is a high-performance switch network, a low-latency, high-bandwidth network that binds together hundreds or thousands of IBM RS/6000 processors [14].
Since the high-performance switch network supports IP as well as other message-passing interfaces, an IBM SP system can be viewed as a cluster of RS/6000 workstations with fast IP connections that provide a migration path for Web services. It is a challenging task to develop performance tools for such a scalable parallel system. The trace facility should be able to generate both message-passing and system events (e.g., process dispatch and page fault) with minimal overhead and source code modification. Other issues, such as the clock synchronization problem [15] and support for client/server applications, need to be addressed, too. This motivates us to develop a Unified Trace Environment (UTE) for IBM SP systems and its extensions for the distributed framework.

1.3 Related Work

Surfing the Web has become more and more popular among the general public in recent years. As the number of requests for Web sites throughout the world continues to grow, the scalability of the server architecture, the efficiency of the HTTP protocol, and the effectiveness of caching strategies become increasingly critical research and implementation issues.

One of the most important features of the World Wide Web is that it delivers many kinds of information, including text, images, sound, and movies. Requests for nontextual material tend to be more resource-demanding. Nontextual information uses much more storage space, and requires much more bandwidth or time to transfer. The type of information transmitted is not especially important to the server (it is extremely relevant to the client, of course), but the amount of data to be served is very important. From the server's point of view, the large amounts of data in images, audio, and movies present similar problems.

Web servers have a simple view of the data they serve: the client usually names a file, and that file is delivered in its entirety from disk to the client through the network. The files are never written, and they are always read from beginning to end. The access patterns among different files are not necessarily predictable, because they depend on the needs of remote clients. Also, each request is a separate transaction, and there is minimal opportunity to guess the next request. Although requests from a single client might be correlated, the stream of requests from multiple clients to the server may be completely unpredictable.

Large files pose performance problems for Web servers because they may incur significant latency when read from disk and as they are transferred over the network, especially to bandwidth-limited parts of the Internet. Furthermore, large files will overfill I/O caches, reducing performance even for smaller files. Measuring I/O performance remains a complicated task because the problem space is large. I/O performance depends on the storage devices, buses, and architecture of the platform, as well as the design and implementation of the operating system and file system. The system-wide performance of any given system also depends on the pattern of I/O traffic. Usually, a system can be tuned to be effective for some traffic patterns, at the expense of additional cost for other access patterns.

Related work on performance issues of Web servers includes a study of NCSA Web server traffic in [2, 16]. Performance of the Web server on a specific platform (an HP 735 workstation) can be found in [17].
A comparison of the response time and throughput of Web servers on several UNIX systems, including the HP 735 (HP-UX 9.05), Sun IPC (SunOS 4.1.2), SGI Indy 2 (Irix 5.3), and Cray CS 6400 (a 10-SPARC multiprocessor running Solaris 2.3), was presented in [18]. These performance evaluations were all done prior to the development of generally recognized benchmarks for Web servers. They typically used a series of trials in which a pinger program emitted a series of HTTP requests (one of the test loads) to the server. The pinger program recorded the round-trip time of each request for performance measurement. The WebStone [19] benchmark has become the de facto standard for comparing Web servers. In addition, SPEC is producing the webperf [20] benchmark, which will likely be the future standard of comparison (see [21] for more information on the SPEC benchmark).

Several papers have been published on techniques for improving the performance of Web servers [22, 23, 24]. Pre-forking is one of the popular techniques to improve performance. For example, the Netscape Commerce Server uses multiple pre-forked processes to handle incoming requests. Instead of forking a new process for every HTTP request, a configurable number of processes that reside in memory are pre-forked and are waiting to fulfill HTTP requests. This improves system performance by eliminating the unnecessary overhead of creating and deleting processes to fulfill every HTTP request. Unfortunately, these pre-forked processes cannot perform CGI requests, since running a CGI program requires overlaying the original program image. The Web server still needs to fork a new process to perform any CGI request. Among these improvements, several findings are discussed in [21]. First, delivering large files is dominated by the network transfer time. Second, using CGI scripts incurs significant overhead in all cases, and Perl has more overhead than compiled C. Third, both multithreading and the dispatcher/pre-forking model perform significantly better than a Web server that forks for each request. Fourth, using inetd causes a significant performance loss.

Another alternative to improve CGI performance is to use a special-purpose API, such as Netscape's NSAPI [25]. It allows CGI programmers to rewrite CGI scripts in C so that they can be dynamically loaded into the Web server daemon. Dynamic loading is much faster than fork/exec: in a study done by Haynes & Company [26], the Netscape server with NSAPI outperformed the other Web servers using CGI. The downside is that CGI programs must be rewritten specifically for NSAPI and can only be loaded into a Netscape server, whereas CGI scripts are more portable and can be written in Perl or other interpreted languages. Moreover, the design of NSAPI does not address the problem of the initialization and setup time required to gain access to the back-end server.

Sun's Java [27, 28] is a simple, object-oriented language that operates on the user's computer. It is capable of making connections to server computers and acting as an interactive program. The server's TCP/IP resources are consumed while the user stays connected. In contrast, CGI programs start up, execute, and terminate at the server side in a traditional client/server computing environment; the system memory and resources are freed as soon as the page is downloaded. The two approaches can thus be complementary.
Digital Libraries can implement software "wrappers" (also known as "middleware") to allow diverse systems to interoperate. For example, the Stanford Digital Library project is creating a software "virtual bus" that seamlessly connects a variety of online services [29]. Digital Libraries may also use "software agents," autonomous programs capable of negotiating complex access methods and terms. For instance, the University of Michigan Digital Library project is exploring a complex architecture of collaborating agents capable of customized searching of many repositories [30]. These designs all center around the internal design of the Digital Library, and do not address the gateway between the standard Web server and the Digital Library.

A key to performance in any distributed system is effective caching: transparently replicating and locating information for faster access. The development of large-scale Digital Libraries will need effective information caching, not only for quick access, but also to conserve network bandwidth and reduce the load on servers. The state of caching for the WWW is examined in [31]. Web caching is usually a relatively simple "flat" scheme, consisting of a single cache between the client and the servers of the Web. If the document is not in the cache, it is fetched from the original source. The experimental Harvest system [11, 32] provides a hierarchy of caching servers, which can be accessed by Web browsers. The Harvest caching servers can store information from many sources, including the Web, and are integrated into the Harvest indexing and searching mechanisms. The caches communicate with each other and transfer data using their own protocol, which is more efficient and flexible than HTTP.

The framework for distributed Web services provides an efficient gateway solution to address the performance issue, especially for the Digital Library.

1.4 Organization

In this thesis, we describe our design and implementation of the Connection Manager Daemon and the Cache Manager interfaces, and show program visualization and performance analysis of their cliette processes to demonstrate the efficiency of our framework for distributed Web services. We have implemented a set of libraries and tools, the Extended Unified Trace Environment, which is built on top of the Unified Trace Environment [33] to generate trace events for the Connection Manager Daemon and cliette processes. Our original UTE, developed on IBM SP systems for scientific applications, facilitates the tracing of various events, such as message passing, process dispatch, page fault, and I/O.

The Extended UTE trace library adds more functions to analyze and visualize new events associated specifically with the Connection Manager Daemon interface and the Cache Manager interface. We also modify UTE to extend its scope to client/server applications. All events can be visualized, not only to understand the communication behavior of the Connection Manager Daemon, the Cache Manager, and cliette processes, but also to understand system responses and to pinpoint the bottlenecks of the application.

In Chapter 2 we explain the design and implementation of the framework for distributed Web services. The original UTE and its extensions for client/server applications are described in Chapter 3. We first examine the overhead and performance of our gateway design in Chapter 4. A Digital Library using the framework is presented in Chapter 5.
In Chapter 6, we show performance results and trace visualization of the distributed framework on multiple platforms for the Digital Library. Concluding remarks are given in Chapter 7.

CHAPTER 2

Distributed Web Services

The World Wide Web is a client-server system that integrates diverse types of information on the global Internet and on enterprise Internet Protocol (IP) networks. Clients and servers on the Web communicate using the HyperText Transfer Protocol (HTTP). The HTTP protocol is layered on the TCP/IP protocol, so that it runs on any IP network. The typical Web client is a browser: an interactive application, most often with a graphical user interface. A Web browser can display a number of built-in data types, including formatted documents, images, data entry forms, and hyperlinks leading from one document to another. Web servers, also called HTTP servers, send data (documents, images, etc.) to clients in response to their requests.

Information integration is the key to the power of the Web. The Web provides three distinct forms of integration.

The first form of integration is the Web's ability to link data provided by different servers. Each data item in the Web is addressed by a Uniform Resource Locator (URL). Web documents, expressed in HyperText Markup Language (HTML), can contain the URLs of other documents. Browsers typically display these references, called hyperlinks, as special regions called anchors. An anchor can be a section of highlighted text or an icon. When the user clicks on an anchor, the browser retrieves the document referenced by the underlying URL. The newly retrieved document can come from a server located across the globe from both the client and the server that provided the document containing the anchor.

The second form of integration is the Web's ability to provide clients with data from diverse sources via the Common Gateway Interface (CGI). The CGI interface provides a simple, easy-to-use method of executing programs from within a Web server. Web servers integrate diverse sources of data by allowing CGI programs to run in response to client requests; CGI programs perform general computations, including accepting data, communicating with other computers, and creating dynamic pages. In this way, for instance, a Web server can provide clients with data obtained by running transactions on a legacy mainframe system. In such a scenario, the Web server acts as a gateway, translating into the new standard for interactive information access (HTTP) from a previous one (3270 terminal protocols). The CGI has many advantages, including portability between server software, and a large base of public domain programs and development tools designed for its use. The biggest limitation of the CGI interface is its inability to share data and communication resources. When a CGI program is accessed by a client, a new copy of the CGI program is invoked for each remote client. If a program must access an external resource (such as an IPC pipe to another resource or a database to retrieve documents), it must continually close and reopen that resource.

The third form of integration is the Web's ability to encompass new types of data. The HTTP protocol borrows a design for extensible data typing and type negotiation from the Multipurpose Internet Mail Extensions (MIME) [34, 35] standard. Web browsers integrate diverse sources of data by supporting several Internet data access protocols in addition to HTTP.
For instance, a URL can specify the File Transfer Protocol (FTP) for retrieving data from a file server. Thus, a Web browser is an FTP client as well as an HTTP client. Many Web browsers also support the Gopher (document browsing and indexing) and NNTP (bulletin-board access) protocols. Browsers can also support new data types via helper applications that a user can add to the browser. This is how Web browsers deliver audio, video, and PostScript data to users today. The Web is prepared for whatever new data types become important in the future.

In this chapter, the framework for distributed Web services is presented to address performance issues associated with the second form of integration, namely the delivery of diverse data provided by CGI processes. Our design provides application programming interfaces so that users can generate dynamic HTML pages through the framework to request information from back-end resources without constantly closing and reopening those resources.

Several high-performance HTTP daemon designs have recently appeared to address the inefficiency of HTTP daemons in general when handling static HTML pages. To fully understand the features provided by the framework, we explain the basic HTTP protocol and discuss various high-performance HTTP daemon designs in Section 2.1. The detailed design and implementation of our framework for distributed Web services is discussed in Section 2.2.

2.1 The HTTP Daemon

2.1.1 The HTTP protocol

A URL for Web server content has the form

protocol://server/path

where

- protocol is the protocol to be used for retrieving the content. For Web server content the protocol is HTTP, S-HTTP (Secure HTTP), or HTTPS (HTTP with Secure Sockets Layer).

- server is the Internet host name, e.g., "www.ibm.com". The server part may include a port number, e.g., "www.ibm.com:8003". This allows one host to run multiple Web servers, each bound to a different port.

- path is a UNIX-style path name, e.g., "software/products/kidriffs.html".

So "http://www.ibm.com/software/products/kidriffs.html" is a possible URL.

The HTTP protocol is extremely simple. A typical interaction, in which a client retrieves some data given a URL, involves the following steps (a client-side sketch follows the list):

1. The client opens a TCP/IP connection to the server mentioned in the URL, using the default port 80 if the URL does not specify a port.

2. The client composes a request message containing the method (GET to retrieve data), the path, and other information, such as a list of data types that the browser knows how to handle. The client formats the request message as a series of name-value pairs, encoded using the long-established RFC 822 [36] conventions for Internet electronic mail headers.

3. The client sends the request message to the server over the TCP/IP connection. The server reads and interprets the request.

4. The server composes a response. The response begins with a status code line that summarizes the result: OK, Bad Request, Unauthorized, Not Found, and so forth. The server may include other information, such as the content type, encoded in RFC 822 format after the status code line. Finally, the server formats any data as a MIME message body.

5. The server sends the response message to the client over the TCP/IP connection. The client reads the response.

6. The server closes the connection; the interaction is complete.
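These six steps can be exercised with nothing more than the ordinary sockets interface. The following minimal client sketch (host name and path are the examples from above; error handling is abbreviated) performs steps 1 through 3, then reads the response of steps 4 and 5 until the server closes the connection in step 6:

    /* A minimal HTTP/1.0 client sketch: connect, send a GET request,
     * print the raw response (status line, headers, and MIME body). */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        struct hostent *hp = gethostbyname("www.ibm.com");    /* step 1 */
        if (hp == NULL)
            return 1;

        int sock = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(80);                  /* default HTTP port */
        memcpy(&addr.sin_addr, hp->h_addr, hp->h_length);
        if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return 1;

        /* Steps 2 and 3: method, path, RFC 822 style headers, blank line. */
        const char *req = "GET /software/products/kidriffs.html HTTP/1.0\r\n"
                          "Accept: text/html\r\n"
                          "\r\n";
        write(sock, req, strlen(req));

        /* Step 5 (client side): read until the server closes (step 6). */
        char buf[4096];
        ssize_t n;
        while ((n = read(sock, buf, sizeof(buf))) > 0)
            fwrite(buf, 1, (size_t)n, stdout);

        close(sock);
        return 0;
    }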
A browser makes several connections (requests) per typical Web page, because an HTML document may contain both text and graphical images. The document's text is stored within the document's HTML file, but the images are not: each image has its own URL embedded in the file. To display an HTML document, as the browser reaches each embedded image, it must perform another HTTP request to retrieve the image.

The size of a typical request message is relatively small, a few hundred bytes. But responses have a bimodal distribution. A typical HTML file is a few thousand bytes long, and can be transferred over a 14.4 kilobit per second dial-up link in a few seconds. But images are often much larger, occupying tens or hundreds of thousands of bytes, and audio or video data are typically even larger; it may take minutes to communicate these large data objects. Thus, a server that receives frequent requests for multimedia data must service many concurrent HTTP connections.

The HTTP protocol is designed for stateless servers, meaning that servers retain no information about clients between connections. Because an HTTP server is stateless, it can restart and clients will notice nothing more than a delay. This stateless design improves the user-perceived reliability of the Web.

2.1.2 The high-performance HTTP server

An HTTP daemon is a concurrent program: it generally has several client requests in progress at the same time. Any server that processed one client's request to completion before beginning to process the next client's request would be very inefficient, for two reasons:

- Request processing begins when the server accepts a connection and starts reading the client's request, and does not end until the server has written the final byte of the response back to the client. Thus, the time it takes the server to process a request depends upon the speed of the server (software, operating system, and hardware) and the complexity of the request (size of file retrieved, or amount of computing done and size of result produced by CGI programs), and also depends upon how quickly the client is able to send the request and receive the response. The client may be slow, or the client's network connection may be slow. In either case, dedicating the entire server to work on a single request would force the server to be idle while waiting for the client to send the next request packet or the next acknowledgment of a response packet. The faster the server, compared to its clients and to the network paths to its clients, the greater the performance benefit of concurrent processing.

- The server's processor, disk, and network interface can each do useful work at the same time. While the processor is parsing one request, the disk can be reading a file to satisfy a second request, and the network interface can be sending a packet in response to a third request. The more processors, disks, and network interfaces a server has, the greater the performance benefit of concurrent processing.

Most HTTP daemons achieve concurrent processing by simply forking a new process for each connection as it arrives. The process works on the request until completion, then terminates. The drawback of this forking HTTP server design is the large overhead per connection. Each process occupies considerable RAM and swap space, and creating and destroying a process consumes many processor cycles. The forking design results in a low-performance server.

More advanced HTTP servers that have appeared recently are based on a "pool of processes" or "pre-forking" design. The HTTP server creates a set of identical processes during initialization. As each connection arrives, an idle process is removed from the pool and assigned to handle the request. The process works on the request until completion, then returns to the pool.
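A minimal sketch of this pre-forking design is shown below. The listening port, pool size, and the handle_request() routine are placeholders; the point is that the fork() cost is paid POOL_SIZE times at start-up rather than once per connection:

    /* A minimal sketch of the "pool of processes" HTTP server design:
     * N workers are forked once at initialization and then take turns
     * accepting connections; no per-connection fork occurs. */
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #define POOL_SIZE 16

    extern void handle_request(int conn);  /* parse request, send response */

    int main(void)
    {
        int listener = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(80);
        bind(listener, (struct sockaddr *)&addr, sizeof(addr));
        listen(listener, 128);

        /* Create the identical worker processes during initialization. */
        for (int i = 0; i < POOL_SIZE; i++) {
            if (fork() == 0) {
                for (;;) {                    /* each worker loops forever */
                    int conn = accept(listener, NULL, NULL);
                    if (conn < 0)
                        continue;
                    handle_request(conn);     /* work until completion */
                    close(conn);              /* then return to the pool */
                }
            }
        }
        for (;;)
            pause();        /* the parent only keeps the pool alive */
    }

Since every worker blocks in accept() on the same listening socket, the operating system hands each arriving connection to exactly one idle worker.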
This pool-of-processes HTTP server design gives better performance than the forking design, because it avoids the overhead of per-connection process creation and destruction. Yet the design is still limited to handling a moderate number of concurrent connections, largely because it makes inefficient use of memory. Each process still occupies a lot of RAM and swap space, and when these resources are used up, so is the server's connection capacity.

This inefficiency motivates designs using a multithreaded environment. Multithreading means that one process can work on many concurrent requests. Far fewer processes are needed, and so each process is busy a greater fraction of the time, making more efficient use of system resources. The Netscape Commerce Server is one variation of the multithreaded design. Note that whether the server is pre-forked or multithreaded, it creates a new process for each CGI program that needs to run; this process dies when the CGI program finishes. The definition of CGI programs requires the server to create a new process for each use.

2.2 Software Architectures for Distributed Web Services

Much of the data a Web provider wants to put out on the Web is managed by existing commercial applications and their supporting resource managers (such as transaction processing and database systems). The ability to view and optionally update these data and run these applications across public networks like the Internet can be of great benefit to enterprise Web providers. Web clients will be major points of entry for commercial applications. The ability to use the Web browser to execute transactions provides a powerful capability in the integration of the public presence system with the enterprise business management system. This can take several forms, including:

- Web browser access to the existing commercial application resource managers, by using a gateway at the HTTP server to provide access to those applications, including:
  - providing impedance matching between the stateless Web and stateful application servers;
  - converting application presentation methods to generate HTML;
  - handling a serial stream of requests from the Web and initiating multiple concurrent threads and processes.

- Using the Web browser for new and existing applications, and incorporating multimedia for use over the Web.

The CGI is a standard for interfacing these application resource managers with information servers, such as HTTP or Web servers. A plain HTML document that the Web daemon retrieves is static, which means it exists in a constant state: a text file that does not change. CGI programs, which are executed in real time, go beyond the static model of a client issuing one HTML request after another. Instead of passively reading server data content one pre-written screen at a time, the CGI specification allows the information provider to serve up different documents depending on the client's request. The CGI specification also allows the gateway program to create new documents on the fly, that is, at the time the client makes the request. For example, a current Table of Contents HTML document, listing all HTML documents in a directory, can easily be composed by a CGI program. CGI programming really expands the horizon of the Web.
The simple concept of passing data to a gateway program instantly opens up all sorts of options for a Web developer and changes the nature of the World Wide Web. Now a Web developer can enhance his or her content with applications that involve the end user in producing output.

However, using CGI has serious performance drawbacks. First, in the current HTTP server design, each CGI request received by the HTTP server forces the daemon to spawn a process to handle the CGI request. Forking a process is costly and imposes a heavy burden on the underlying operating system that affects the entire HTTP server's performance. As we pointed out in the previous section, using a pre-forked or a multithreaded HTTP server design still requires creating a new process for each CGI program. Second, to connect with the back-end server, each CGI process has to open the resource and close the resource when it finishes. Constantly closing and re-opening back-end resources not only wastes system resources, it also makes sharing the retrieved information impossible.

To provide a flexible solution instead of providing yet another HTTP server, we decided to implement a framework providing distributed Web services that can be used with any HTTP server design (including the high-performance HTTP server) through standard CGI interfaces. There are four different components in our design: the Connection Manager, cliette processes, the Cache Manager, and CGI processes. The Connection Manager manages cliette processes and listens to requests from CGI processes. Cliette processes send requests to and retrieve data from back-end servers on behalf of Web clients. The Cache Manager handles information constructed by cliette processes, and CGI processes are gateway processes between the HTTP server and cliette processes. We also provide application program interfaces (APIs) to help users write their own CGI and cliette programs to communicate with our daemon processes.

The main function of the Connection Manager Daemon is to schedule cliette processes to serve CGI requests. After setting up its well-known socket, the Connection Manager starts up a number of processes to serve as cliette processes. The number of cliette processes and their identities are defined in a configuration file. A cliette process can be created on a remote machine. It can be created on a different platform with a different operating system as long as the TCP/IP socket interface is available. Initialization steps of the cliette process include setting up its own socket for CGI script processes to talk to and informing the connection manager of its socket number and process ID. The connection manager keeps the returned values (i.e., socket number and process ID) in a queue. This queue is used by the connection manager to choose a free cliette process and to return the cliette's socket number to the requesting CGI process. After initialization, a cliette process opens up a connection with its back-end server using information passed from the connection manager. A cliette process stays connected to its back-end resources until the connection manager either reinitializes or terminates it.

Figure 2.1 shows the event sequence of serving Web requests through CGI processes. A CGI process forwards Web requests to a cliette process and receives results from a cliette process on behalf of Web clients. The shaded area indicates steps that are repeated for each request.

[Figure 2.1: How service is established between a cliette and a CGI process]

When a Web request comes in, a CGI process is started; it sends a request to the connection manager asking for a cliette process through the daemon's well-known socket. The daemon chooses a free cliette process and forwards the cliette's public socket number to the CGI process. The CGI process then forwards the request (i.e., the Uniform Resource Locator string) to the corresponding cliette process. If no cliette process is available, the CGI process goes into a wait state. In this case, the connection manager is responsible for waking up the waiting CGI process when a cliette process becomes available. The CGI process then forwards the request and goes to sleep until the response is ready. Two time-out values are used by a CGI process during this interaction: one is set when it starts waiting for a free cliette, and another is set when it starts waiting for the HTML document.
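From the CGI side, the handshake just described reduces to a few calls. The sketch below is hypothetical: the real interface is defined in Appendix A, and the names and time-out values used here (CMD_GetCliette, CMD_SendRequest, CMD_RecvHTML) are illustrative only:

    /* A hypothetical sketch of the CGI side of Figure 2.1; the actual
     * API is given in Appendix A.  All names here are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>

    extern int   CMD_GetCliette(const char *cmd_host, int wait_timeout);
    extern int   CMD_SendRequest(int cliette_sock, const char *url);
    extern char *CMD_RecvHTML(int cliette_sock, int reply_timeout);

    int main(void)
    {
        const char *url = getenv("QUERY_STRING");   /* passed in by httpd */

        /* Ask the Connection Manager, via its well-known socket, for a
         * free cliette; the first time-out bounds the wait for one. */
        int cliette = CMD_GetCliette("cmd.host.example", 30 /* seconds */);
        if (cliette < 0) {
            printf("Content-type: text/html\r\n\r\n<H1>Service busy</H1>\r\n");
            return 1;
        }

        /* Forward the URL string to the assigned cliette, then sleep until
         * the dynamically constructed page arrives; the second time-out
         * bounds the wait for the HTML document. */
        CMD_SendRequest(cliette, url ? url : "");
        char *page = CMD_RecvHTML(cliette, 60 /* seconds */);

        printf("Content-type: text/html\r\n\r\n%s", page ? page : "");
        return 0;
    }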
When a Web request comes in, a CGI process is started, and it sends a request to the connection manager asking for a cliette process through the daemon's well-known socket. The daemon chooses a free cliette process and forwards the cliette's public socket number to the CGI process. The CGI process then forwards the request (i.e., the Uniform Resource Locator string) to the corresponding cliette process. If no cliette process is available, the CGI process goes into a wait state. In this case, the connection manager is responsible for waking up the waiting CGI process when a cliette process becomes available. The CGI process then forwards the request and goes to sleep until the response is ready. Two time-out values are used by a CGI process during the interaction between the CGI and cliette processes. The CGI process sets one time-out value when it starts waiting for a free cliette. After receiving a free cliette, the CGI process sets another time-out value when it starts waiting for the HTML document. Note that the connection manager and CGI processes can be executed on different machines or nodes.

In addition to requesting services from its back-end server, a cliette process is responsible for constructing HTML documents dynamically and returning them to the corresponding CGI process. Figure 2.2 shows a snapshot of a running system.

Figure 2.2: Connection Manager Daemon being used by an HTTP (Web) server

The Cache Manager is designed to further improve CGI performance by allowing cliette processes or CGI processes to share information. If dynamic HTML pages are cached, caching may prevent a cliette process from requesting the same information and constructing the same HTML page over and over.

We will discuss each component in detail in the following sections. In Chapter 5, we will demonstrate the framework for distributed Web services used in the Digital Library environment. A detailed explanation of the Connection Manager configuration file format and its API can be found in Appendix A. The Cache Manager configuration file format and its API are given in Appendix C.

2.2.1 Connection Manager Daemon

The Connection Manager Daemon reads a start-up configuration file (the default filename is /etc/CMDaemon.conf) for information about starting a cliette process. Depending on the hostname of the cliette process, the system routine exec() or rexec() is used: exec() is used to start a local cliette process, while rexec() is used to start a remote cliette process. The Connection Manager Daemon maintains a base queue to keep information about newly created cliette processes. As soon as the Connection Manager receives confirmation from a successfully initialized cliette process, it adds this new cliette process into a second queue, the AvailQueue. The Connection Manager Daemon manages cliette processes in a first-in first-out fashion. It does not distinguish a local cliette from a remote cliette when assigning CGI requests. A third queue, the UnAvail queue, is used to keep a list of busy cliette processes.
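One plausible data layout for these queues is sketched below; the structure and field names are invented for illustration and are not taken from the dissertation.

    /* One descriptor per cliette, filled in when the cliette reports
     * its socket number and process ID back to the Connection Manager. */
    typedef struct cliette {
        int id;                 /* cliette identity from the config file   */
        int pid;                /* process ID returned at initialization   */
        int sock;               /* public socket number handed out to CGIs */
        struct cliette *next;   /* link within one of the three queues     */
    } cliette_t;

    cliette_t *base_queue;      /* every cliette started from the config   */
    cliette_t *avail_queue;     /* initialized, idle cliettes (FIFO order) */
    cliette_t *unavail_queue;   /* cliettes currently serving a request    */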
The Connection Manager Daemon uses a well-known socket port for interaction with CGI and cliette processes. This well-known socket is used:

• for a cliette process to initiate the first connection after being started by the connection manager;

• for a CGI script process to request a cliette process;

• for a system administrator to issue commands, such as starting a new cliette process, stopping a cliette process, or debugging a cliette process.

In addition to listening to the well-known socket port, the Connection Manager Daemon also listens to several private socket ports between itself and cliette processes. These socket ports remain open as long as cliette processes are active. This open-socket connection supports the following activities:

• The Connection Manager sends maintenance requests to cliette processes.

• The Connection Manager gets an interrupt when a cliette process dies. In this case, the Connection Manager Daemon can restart the cliette process without user intervention.

• A cliette process announces that it is free to serve more requests.

• A cliette process announces that it has performed certain maintenance steps.

• Dynamic trace generation and termination. This will be discussed in more detail in Section 3.2.3.

A system utility, CMDadmin, is provided for various maintenance commands. A UNIX-style manual page for CMDadmin is included in Appendix B.

2.2.2 Cliette processes

Cliette processes can be dynamically created on machines/nodes either locally or remotely, depending on current system loads. A remote cliette is created by the Connection Manager using rexec() on the remote host, which is defined in the configuration file. A cliette process requests a dedicated socket connection with the connection manager after successful initialization. If a cliette process resides on a host outside a firewall in the open Internet environment and the Connection Manager resides inside the firewall, the connection between the Connection Manager and the cliette is made through the firewall host using the SOCKS [37] service. In this case, the cliette process will not initiate the socket connection with the Connection Manager, because the firewall will block any connection request from outside. Instead, the Connection Manager initiates the connection with the remote cliette process through the cliette's pre-defined socket port. If the Connection Manager fails to make the connection with a remote cliette process, no automatic retry will be made. However, a system administrator can use the CMDadmin utility to force the Connection Manager to retry the connection request at a later time.

During the cliette initialization phase, a dedicated connection is made between the cliette process and the Connection Manager, and the information listed in the configuration file is passed to the cliette process as environment variables through the dedicated socket. A cliette process can retrieve these environment variables using the getenv() subroutine call. The environment variables passed to the cliette process are used by the cliette's login subroutine to open the connection with a back-end server such as Visual Info or DB2. The following example shows an entry in a configuration file for a cliette process A; the environment variables available to the cliette are CLIETTE_NAME, CLIETTE_PASSWD, and CLIETTE_EXEC_PATH, whose uses are self-explanatory.

    cliette:{CLIETTE_NAME=userA:CLIETTE_PASSWD=passwdA:CLIETTE_EXEC_PATH=/etc/xxx_cliette}
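Inside the cliette, the login routine can pick these variables up with ordinary getenv() calls. The sketch below is illustrative only: connect_backend() is a hypothetical stand-in for the product-specific login call (Visual Info, DB2, and so on) that a real cliette would make.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical back-end login; a real cliette would call the
     * Visual Info or DB2 client library here. */
    extern int connect_backend(const char *user, const char *passwd);

    int cliette_login(void)
    {
        const char *name   = getenv("CLIETTE_NAME");
        const char *passwd = getenv("CLIETTE_PASSWD");

        if (name == NULL || passwd == NULL) {
            fprintf(stderr, "cliette: credentials not received\n");
            return -1;
        }
        /* Open the long-lived back-end connection; it stays up until
         * the Connection Manager reinitializes or stops this cliette. */
        return connect_backend(name, passwd);
    }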
After initialization, a cliette process listens to the Connection Manager through its dedicated socket connection for commands. The following list shows the possible commands from the Connection Manager and the corresponding actions.

Cliette_dojob: The Connection Manager Daemon is telling the cliette process to get ready for a new CGI process.

Cliette_reinit: Cliette re-initialization. If the cliette is currently serving a CGI request, the cliette ignores the command and returns an error to the Connection Manager.

Cliette_stop: Cliette termination. Again, the cliette ignores the command and returns an error to the daemon if it is currently serving a CGI request.

Cliette_ayt: An "Are you there?" request. The Connection Manager expects a response from the cliette within a time-out period if the cliette is still active.

Cliette_kill: Kill a hanging cliette process. This request is useful to stop a runaway cliette.

Cliette_debug/debugend: The cliette process starts/stops the debugging procedure.

Cliette_traceon/traceoff: The cliette process turns tracing on/off.

An active cliette can be in one of the following states:

• Cliette_STARTUP: The cliette process is being initialized.

• Cliette_ZOMBIE: The cliette process has encountered an error and is waiting for the Connection Manager daemon's action.

• Cliette_AVAIL: The cliette process is available for service.

• Cliette_BUSY: The cliette process is currently serving a request.

Figure 2.3 shows the state transition diagram of a cliette process. A cliette is in its Cliette_STARTUP state while it is being initialized by the Connection Manager. Subsequently, it enters the Cliette_AVAIL state. Upon receiving a request from a CGI process, the Connection Manager finds a free cliette and tells the CGI process what the cliette's public socket is. The Connection Manager sends a Cliette_dojob request, along with the CGI process's process ID, to the selected cliette process. The cliette process changes state to Cliette_BUSY when a Cliette_dojob request is received from the Connection Manager. The request tells the cliette process to wait for a URL string coming from a CGI process whose ID is specified in the request block. The cliette process then sets up a time-out value while waiting for the CGI process. If the time expires before the cliette process receives any connection request on its public socket port, the cliette process changes its status back to Cliette_AVAIL and ignores any connection requests coming from its public socket port. The cliette process will likewise ignore a URL request and return to the Cliette_AVAIL state if the connection request is from an unexpected CGI process. A cliette process will not leave the Cliette_BUSY state after it has received the CGI request until the request is served, unless it is terminated by the Connection Manager Daemon's Cliette_kill command.

Figure 2.3: State transition diagram for a cliette process

Each cliette process needs to provide a login routine and a logout routine for service initialization and termination. The login subroutine is called automatically by the initialization routine InitSelf() or when the cliette process receives a Cliette_reinit command from the connection manager. The logout subroutine is called when the cliette receives a Cliette_stop request from the connection manager to terminate its service.
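Taken together, the command handling amounts to a dispatch loop on the dedicated socket. The following sketch is schematic: the helper routines are hypothetical stand-ins for the framework's real protocol code, the actual wire encoding of commands is not shown, and the refusal of Cliette_reinit/Cliette_stop while busy is omitted for brevity.

    /* Schematic cliette command loop; all helpers are hypothetical. */
    enum cmd { CLIETTE_DOJOB, CLIETTE_REINIT, CLIETTE_STOP,
               CLIETTE_AYT, CLIETTE_TRACEON, CLIETTE_TRACEOFF };

    extern enum cmd read_command(int sock);  /* blocks on dedicated socket */
    extern void serve_cgi(void);             /* wait for URL, build HTML   */
    extern void announce_free(int sock);     /* report Cliette_AVAIL again */
    extern void send_ack(int sock);
    extern void trace_on(void), trace_off(void);
    extern int  cliette_login(void);
    extern void cliette_logout(void);

    void cliette_loop(int cmd_sock)
    {
        for (;;) {
            switch (read_command(cmd_sock)) {
            case CLIETTE_DOJOB:           /* a CGI process was assigned     */
                serve_cgi();              /* Cliette_BUSY while serving     */
                announce_free(cmd_sock);  /* back to Cliette_AVAIL          */
                break;
            case CLIETTE_REINIT:          /* re-run login (refused if busy) */
                cliette_logout();
                cliette_login();
                break;
            case CLIETTE_STOP:            /* orderly termination            */
                cliette_logout();
                return;
            case CLIETTE_AYT:             /* "are you there?" heartbeat     */
                send_ack(cmd_sock);
                break;
            case CLIETTE_TRACEON:  trace_on();  break;
            case CLIETTE_TRACEOFF: trace_off(); break;
            }
        }
    }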
The first message, an identification packet sent by the requesting CGI process to the cliette process, contains the CGI process ID, a subset of the CGI environment variables, and some URL-related information (such as the size of the entire URL string). These environment variables and URL-related information are set by the receiving cliette process dynamically using the putenv() subroutine, and these values are cleared once the cliette process finishes serving the current CGI request.

In our design, additional Web services can be added by simply adding a machine/node to run additional cliette processes, provided the application back-end server can keep up with the requests. This significantly improves the scalability and flexibility of the Web services.

2.2.3 CGI processes

An HTTP server provides dynamic HTML document generation through CGI processes. Upon receiving a URL string containing a CGI request from a Web client, the HTTP server forks a child process to run the CGI script. The newly created CGI script inherits a number of environment variables [6], which indicate how the HTTP server is set up and how to communicate with the Web client. The CGI process terminates after serving the URL request. To communicate with the cliette processes efficiently, we designed a set of application programming interfaces (APIs) for a CGI process to use:

• GetCliette(): to request a free cliette from the Connection Manager Daemon.

• ConnectCliette(): to make a connection to the assigned cliette and send the cliette process its identification packet.

• PutURL(): to forward its URL string or other information to the cliette process. This subroutine is used to respond to the GetURL() issued by the cliette.

• WaitForHTML(): to wait until the dynamically generated HTML document is ready and then start receiving it from the cliette process.

When a CGI process is ready to request a cliette process, it uses the GetCliette() subroutine to ask the Connection Manager for a free cliette process. The subroutine returns the cliette's public socket address and the cliette's machine name. If no free cliette process is available, the CGI process is forced to wait. The CGI process can also tell the Connection Manager how long it intends to wait for a free cliette. If the time-out occurs before a cliette process becomes available, GetCliette() returns a -1 value to the CGI process. If a free cliette process is available, the CGI process uses ConnectCliette() to connect to the public socket port of the free cliette. After the connection is made, the CGI process forwards an identification packet to the cliette process. This identification packet also contains a subset of the CGI environment variables, including request_method, content_type, content_length, script_name, path_info, path_translated, query_string, remote_host, remote_addr, server_name, and server_port. These environment variables are set by the cliette process dynamically for each CGI request. The cliette process then sends an acknowledgment packet and issues GetURL() to get the URL string. After receiving the acknowledgment packet along with the request for the URL string, the CGI process issues PutURL() to send the URL string to the cliette process. Note that the CGI process is passive: it will not send any URL string until the cliette requests it. The CGI process then issues WaitForHTML() and starts reading the dynamically generated HTML document, which is generated and then sent by the cliette using SendHTML().
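With these routines, the body of a typical CGI gateway program reduces to a few calls. The prototypes below are our assumptions (the dissertation names the routines but does not reproduce their signatures), and the buffer sizes are arbitrary.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Assumed prototypes for the framework's CGI-side API (Appendix A). */
    extern int GetCliette(char *host, int *port, int timeout);
    extern int ConnectCliette(const char *host, int port);
    extern int PutURL(int sock, const char *url, int len);
    extern int WaitForHTML(int sock, char *buf, int buflen);

    int main(void)
    {
        char host[256], buf[8192];
        int  port, sock, n;
        const char *url = getenv("QUERY_STRING");  /* standard CGI variable */

        if (url == NULL)
            url = "";
        if (GetCliette(host, &port, 30) < 0)  /* wait up to 30 s for a    */
            return 1;                         /* cliette; -1 on time-out  */
        sock = ConnectCliette(host, port);    /* sends the ID packet      */
        PutURL(sock, url, strlen(url));       /* answers the GetURL()     */

        /* Relay the dynamically generated HTML back to the Web client. */
        while ((n = WaitForHTML(sock, buf, sizeof(buf))) > 0)
            fwrite(buf, 1, n, stdout);
        return 0;
    }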
Both the PutURL()/GetURL() and WaitForHTML()/SendHTML() pairs are implemented in a way similar to stream file I/O; i.e., the reader waits until the writer puts something in the channel. Both the reader and the writer can specify how much information they intend to read or write. The CGI process sends the requested information back to the Web client as soon as it receives the dynamically generated HTML document. Depending on their design, some Web clients may start processing the HTML information before the entire document is received.

In the current design of the framework for distributed Web services, these CGI processes are individual processes forked by the HTTP server. Each Web CGI request creates a new CGI process. Although we wish to make CGI processes as small as possible, the forking of a CGI process is still costly. Because our goal is to provide a generic CGI interface without modifying existing Web servers, we have little control over how CGI processes are created. Two approaches may be combined to reduce the impact of forking a new CGI process. One is to use a high-performance HTTP server to reduce the overhead associated with retrieving static HTML pages; the other is to place the HTTP server, Connection Manager Daemon, and cliette processes on different machines/nodes to distribute the load. Newer approaches, such as Netscape's NSAPI, could also be used to improve CGI performance. They present no conflict with our design.

2.2.4 Cache Manager Daemon

A Cache Manager interface permits flexible caching policies to further streamline connections by reducing the need to contact even the cliette for data that have been recently fetched. A Cache Manager listens on its own well-known port, which is used by both CGI processes and cliette processes. The Cache Manager is multithreaded and capable of managing multiple caches in a single process. Each cache is configurable to use disk, memory, or both to store cached data. Each cache is also configurable to use passive (client-controlled) management policies, aggressive (cache manager-controlled) policies, or a combination of both. If a cache manager has been configured, the CGI process may choose to look in the local cache, possibly eliminating the need to contact a cliette at all. The cliette may also use the cache to avoid back-end server requests or to store previously returned information. Figure 2.4 and Figure 2.5 demonstrate these sequences. A typical cache could contain information that is used to form a dynamic HTML page. For example, it could contain a complete list of database query results while only the first X items are shown to the client. If a Web client requests the next X items, the CGI process could retrieve them from the cache instead of issuing another SQL query.

The Cache Manager API provides a set of primitive functions usable by other processes to provide caching of data. Very little policy is implemented directly in the cache manager, and what is implemented can be overridden by the processes that access it. That is, each type of process may use the Cache Manager API to implement a policy appropriate to the application. Because the Cache Manager runs as an independent process, it provides a common cache usable by multiple processes on multiple machines. Different applications may use the same Cache Manager, but they should be aware of each other to avoid key conflicts.

The cache API uses a socket interface to the Cache Manager to resolve requests. The Cache Manager listens on a well-known port, defined in /etc/services as the service "ibm-cachemgr". If no port is given in /etc/services, the port specified in the configuration file is used.
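This lookup order maps directly onto the standard getservbyname() call, with a fallback to the configuration file. A minimal sketch, in which config_port() is a hypothetical routine that reads the Cache Manager configuration file:

    #include <netdb.h>
    #include <netinet/in.h>

    extern int config_port(void);  /* hypothetical: port from config file */

    /* Resolve the Cache Manager port: /etc/services first, then the
     * configuration file (either may be overridden by -p at startup). */
    int cachemgr_port(void)
    {
        struct servent *se = getservbyname("ibm-cachemgr", "tcp");
        if (se != NULL)
            return ntohs(se->s_port);
        return config_port();
    }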
Both port values may be overridden by using the -p command line parameter when starting the Cache Manager. Note that if /etc/services is not used to establish the port, any process attempting to use the Cache Manager must be informed of the correct port to use. If the Cache Manager is used in conjunction with the Connection Manager, cliettes are automatically informed of the port by the Connection Manager. CGI processes, however, may need to be started with the correct port as a parameter if /etc/services is not used to define the port.

The Cache Manager runs as a multithreaded process that manages one or more cache objects. Each cache object may be configured to enforce differing policies. That is, a Cache Manager manages multiple, independent caches. Each cache is identified with a unique character string assigned by the configuration file during initialization. The cache may be configured to keep cached data in memory, on disk, or both. Each data object consists of a key (token) and data pair. Tokens are mapped into the file system when caching to disk. The policy under which data are purged is configurable:

• Purge items older than some threshold on a regular basis.

• Purge items only when requested by the client.

Purging of old items always occurs to make room for new items when the cache capacity is exceeded.

Figure 2.6 shows two Cache Managers configured for different purposes. The CGI Cache Manager is configured as an "in-memory" cache because it is known that the data it caches are always small, that the data change approximately every 20 to 30 minutes, and that slightly out-of-date results are acceptable. The cost of contacting the cliette and retrieving new results is potentially expensive. This cache is configured to maintain items for a maximum of 20 minutes. When a CGI is spawned, if the item is in the cache, it can be returned without the need to contact a cliette. If the item is not found, the cliette is contacted and the item is sent both to the Cache Manager and to the client. The cliette Cache Manager is used to cache large images retrieved from a library database. These data are static, changing rarely if at all. The cache is configured to maintain its data on disk, purging items only if capacity is exceeded or if the cliette requests it. If capacity is never exceeded, the cliette alone determines whether a cache entry is stale by the date returned when the cache is queried.

Figure 2.4: CGI usage of the cache to avoid cliette connections

Figure 2.5: Cliette usage of the cache to avoid database connections

Figure 2.6: Overview of Cache Manager

CHAPTER 3

Unified Trace Environment and Its Extension for Distributed Web Services

3.1 Unified Trace Environment

In this section, we describe the design and implementation of the Unified Trace Environment (UTE), which will be used as the base to capture CMD and cliette events. Parallel programs differ from sequential programs in a significant way: whereas one can often predict the behavior of a sequential program by understanding the algorithm employed, the behavior of parallel programs is notoriously difficult to predict. Even more than sequential programs, parallel programs are subject to "performance bugs," in which the program computes the correct answer, but more slowly than anticipated.
What is needed, then, is instrumentation to collect data that leads to an understanding of the program's behavior with minimal overhead. UTE was developed on IBM Scalable Parallel systems for tracing message passing parallel applications. It provides trace libraries, utilities, and visualization tools for application programmers to understand not only the communication patterns of the application, but also the system responses to the user program. We first describe the problems of trace analysis for distributed parallel systems in Section 3.1.1. Two libraries, UTE/MPI and UTE/MPL, were developed for MPI (the standard Message Passing Interface) and MPL (the Message Passing Library) applications, respectively; both are described in Section 3.1.2. In Section 3.1.3 we discuss UTE tools for analyzing and visualizing trace events. Using these tools, we are able to pinpoint the source code (if compiled with -g) corresponding to each message passing and user event, and to interleave system events such as process dispatch with message passing/user events in the same time-space diagram.

3.1.1 Distributed parallel systems

Distributed parallel processing is a way to increase system computing power beyond the limit of current uniprocessor technology. Distributed parallel systems promise higher computing power than sequential or vector computers, and are more scalable than shared-memory multiprocessors. On the other hand, programming such a system based on the message passing programming model is much more complex than writing sequential programs. To take advantage of the underlying hardware, understanding the communication behavior and load balancing issues of parallel programs is extremely critical.

One common way of monitoring the behavior of a program is to generate trace events while executing the program. The events generated can then be used for trace-driven analysis [38], program visualization [39, 40], and debugging [41]. In a distributed parallel system, an ideal trace facility should be able to generate user-controllable message passing and system events with minimal overhead and source code modification. If the trace overhead is large, the timestamp associated with each event may be altered significantly, and the statistics and data obtained in performance analysis may be meaningless.

Most user-level trace systems for message passing systems require source code modification to generate message passing events. More advanced tools such as the Paradyn system require no source code modification, because the code for performance instrumentation is inserted into an application program during execution, at the expense of substantial overhead caused by instrumentation daemons.

The capability to collect system events is as important as the capability to collect message passing events. System and I/O events such as process dispatch and page faults reveal crucial information on system responses to user applications. In addition, a trace facility should be easily expandable to trace activities from other software layers, such as parallel I/O file systems and high-level parallel languages, so that the same trace facility may be used to trace multiple software events.

One of the most serious problems in trace analysis for distributed parallel systems is the clock synchronization problem [15]. In a distributed system, each processor (or node) has its own local memory and local clock, and processors communicate with one another by exchanging messages.
In such a system, trace records are generated by multiple processors, and it is often the case that separate streams are produced independently in multiple nodes. The logical order of events may not be guaranteed in the trace due to discrepancies among local clocks. As a result, many trace facilities in distributed systems are forced to do additional work to ensure consistent timestamps, at the expense of increased trace overhead. For example, Lamport [15] developed a distributed algorithm for resource-sharing systems that extends the partial ordering to a consistent total ordering of all events by creating additional messages among processors: a form of barrier synchronization. Since barrier synchronization may take a long time, activating such a trace generation facility may have an adverse impact on the total elapsed time and other timing-sensitive program behavior, and ultimately alter the very program behavior to be analyzed.

UTE, a Unified Trace Environment for IBM SP systems, has been developed to attack all of the above problems. The user-level UTE trace libraries require only re-linking to generate message passing and system events. If application source code is available, additional user markers can be inserted into the source code. This allows a user to generate message passing events with minimum overhead, and to have the choice of marking specific portions of the program, such as various phases, loops, and routines, for performance analysis and visualization.

3.1.2 UTE trace generation and libraries

The main parallel programming model supported by IBM SP systems is message passing. A set of tasks, each executing in its own address space, communicates via calls to message passing libraries. This allows parallel applications to exploit the performance characteristics of the communication hardware. The IBM SP multicomputers connect hundreds of RISC System/6000 processors via a communication network called the High-Performance Switch [14], or simply the "Switch." In each Switch element is a counter called the Absolute Time Counter (ATC). The primary function of the ATC is to enable the Switch to synchronously cycle between its two primary operation modes, called the run mode (for normal data transfer) and the service mode (for servicing the network). The ATC in each element is synchronized within one clock cycle (25 ns) of one or more of its immediate neighbors' ATCs. The ATCs facilitate a closely synchronized, non-drifting global time reference available to all the processor nodes, and thus simplify the well-known clock synchronization problem encountered in distributed systems.

In the presence of a global clock provided by the ATC facility, the clock synchronization problem could be completely avoided if all events used the global clock instead of the local clock. However, this approach is infeasible, as it would require changes to the AIX tracing facility for generating system events. In addition, our experience shows that it is much more expensive to access the global clock than the local one. This is because the local clock register resides inside the processor and can be accessed in tens of nanoseconds, while the global clock register is on the adapter, and several microseconds, including software overhead, are required to access it. We have monitored the drift of the system clocks in an IBM SP1 machine over 3 months. The maximum drift observed was 40 msec/hour. Hence, just cutting a clock adjustment trace event at the beginning of the program execution is not sufficient.
As a result, we access the ATC in the switch adapter periodically in each node to collect global clock events. Each global event contains a global timestamp as well as a local timestamp. The periodic access (once every 400 msec) is implemented through a piggyback function invoked when a low-level communication timer fires, to minimize trace overhead. An alternative implementation, without a piggyback function, uses a local timer and a signal handler for SIGALRM. These global clock events are then used to guarantee that the maximum drift between two timestamps can be adjusted to an amount well below the message passing latency.

UTE trace generation

The AIX trace facility, as part of the IBM AIX operating system, is capable of capturing a sequential flow of time-stamped events to provide a fine or coarse level of detail on system and user activities. The AIX operating system is instrumented to provide general visibility of system events. Possible system events include process dispatch, page faults, system calls, and I/O events such as read and write. Built on top of the AIX trace facility, the UTE trace libraries instrument message passing routines to provide detailed information on message passing activities. The choice to build the UTE libraries on the AIX trace facility provides a unified and easily expandable trace environment for performance analysis. Without such a unified trace environment, multiple trace facilities would be required to trace the various software layers, such as MPI, MPL, PIOFS (a parallel file system), and HPF (High Performance Fortran). That would not only make trace generation more intrusive, but also make performance analysis tedious and difficult.

The UTE trace libraries inherit efficient trace data collection, so that system performance and flow are minimally altered by activating trace generation. For example, the trace facility pins the data collection buffer in main memory to reduce trace overhead, and the size of the data collection buffer can be specified by the user at the time trace generation is activated. This avoids tracing side effects such as page faults, which would ultimately lead to nondeterministic overhead in the tracing itself. The cost of cutting a trace record is broken into two parts: the cost of testing whether the event is enabled and then calling the trace buffer insertion routine, and the cost of the trace buffer insertion routine itself. If a typical trace record has 3 words of data in addition to a one-word event header (a so-called hookword that identifies the event type and record length) and a one-word timestamp, the average cost of cutting a trace record is around 110 machine instructions. Thus, the trace generation facility is efficient and adds only a few μsec to the elapsed time for each trace event.

In UTE, trace generation is controlled by an environment variable, TRACEOPT, which defines trace options such as the system and message passing events the user is interested in, the size of the data collection buffer pinned in main memory, and the file name prefix for trace files. This allows a user to selectively enable generation of events (system or message passing events) at execution time. If the environment variable is not defined, the application will run without generating any trace events.
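The run-without-tracing default falls out of a single getenv() test at library initialization. A minimal sketch, assuming a hypothetical parse routine (the actual TRACEOPT syntax is not reproduced here):

    #include <stdlib.h>

    static int trace_enabled;  /* zero: cut_event() calls become no-ops */

    /* Called once at library initialization (e.g., from MPI_Init). */
    void ute_trace_init(void)
    {
        const char *opt = getenv("TRACEOPT");
        if (opt == NULL)
            return;  /* variable undefined: run without any trace events */
        trace_enabled = 1;
        /* A parse_traceopt(opt) routine would select the event classes,
         * set the pinned buffer size, and record the file name prefix. */
    }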
If a user is only interested in message passing and process dispatch events, other system events, such as page fault and I/O events, will not be generated as long as the user does not explicitly ask for them in the environment variable TRACEOPT.

There are two major message passing APIs supported on IBM SP systems: the IBM Message Passing Library (MPL) [42, 43] and the Message Passing Interface (MPI) [44, 45]. MPL was first developed for IBM SP systems as the primary message passing API. Later, the MPI standard was developed jointly by national laboratories, universities, and computer companies to leverage application development costs across multiple distributed parallel computer platforms. Figure 3.1 illustrates the UTE framework. In addition to these two UTE libraries, hooks have been inserted in MPLp (EUI-H) [46], PIOFS [47], Vesta [48, 49], and HPF [50] using the same framework, thus making it possible to generate message passing events along with system activities, parallel I/O, and high-level parallel language events. UTE supports on-line merging. However, in most cases we collect trace events in the nodes where an application is executing and merge them afterwards, due to the limited LAN bandwidth and the volume of trace events. The merged trace stream is then fed into analysis tools, or converted into formats suitable for visualization.

Figure 3.1: Unified Trace Environment for IBM SP systems

UTE/MPI trace library

To facilitate the building of program instrumentation, MPI provides a profiling interface in which all of the MPI-defined functions may be accessed with a name shift. That is, all of the MPI functions that normally start with the prefix MPI_ are also accessible with the prefix PMPI_. Thus, the profiling interface provides a simple mechanism to "wrap" original MPI functions with any code (e.g., tracing, graphics, printfs, etc.) and export them as official MPI functions. Typically this can be achieved by instructing the linker to support each MPI function also under the name shift. Providing such a general mechanism has several advantages for building profiling libraries:

1. The overhead of generating traces is only present in the profiling library and is not part of the base communication library.

2. Different tracing and profiling facilities can be utilized with the same base communication library.

3. The profiling library can be partial, e.g., only certain functions may be "wrapped."

4. Application code does not have to be changed.

A no-op routine, MPI_Pcontrol(), is also provided in the MPI library for the purpose of enabling and disabling profiling in an MPI profiling library. Thus, we use this MPI profiling interface to build the UTE/MPI trace library on top of the AIX trace facility for IBM SP systems. We capture the begin and end events for each MPI routine along with its arguments and return value. Figure 3.2 illustrates how the UTE/MPI trace library is written. Note that the same approach is used to construct the UTE/MPL trace library.

    #define ev_send_start (0x10)
    #define ev_send_end   (0x11)

    int MPI_Send(void *buf, int cnt, MPI_Datatype type,
                 int dst, int tag, MPI_Comm comm)
    {
        int rc;
        cut_event(ev_send_start, cnt, type, dst, tag, comm);
        rc = PMPI_Send(buf, cnt, type, dst, tag, comm);
        cut_event(ev_send_end);
        return rc;
    }

Figure 3.2: MPI_Send in the UTE/MPI trace library
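From the application's point of view, selective tracing through this interface looks as follows. MPI_Pcontrol() is part of the MPI standard; the convention that level 1 enables and level 0 disables profiling is the common one, although the standard leaves the meaning of the levels to the profiling library.

    #include <mpi.h>

    extern void solver_phase(void);  /* an application routine of interest */

    void traced_region(void)
    {
        MPI_Pcontrol(1);   /* profiling library turns tracing on  */
        solver_phase();    /* only this phase is traced in detail */
        MPI_Pcontrol(0);   /* tracing off for the rest of the run */
    }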
The profiling interface as defined has certain drawbacks. Without access to the MPI internal data structures, it can be difficult to trace all functions efficiently. For instance, for visualization, ranks (i.e., node IDs) are most likely to be displayed as global ranks rather than the local ranks specified in an argument list for a specific communicator. This local-to-global information is readily available in the MPI internal data structures but, if not accessible, must be obtained through a series of MPI function invocations, ultimately increasing the tracing overhead. To reduce overhead, we dump the global rank list for each communicator when it is created, thus providing an easy way to convert from local rank to global rank.

By default, tracing is turned on by the UTE/MPI trace library when the call to MPI_Init() is encountered, and is terminated at the exit of the application. Additional UTE routines are provided through the use of MPI_Pcontrol() to turn tracing on or off at any time. Thus, a user can trace only part of an application in detail while other parts of the application are not traced.

UTE/MPL trace library

Similar to the UTE/MPI trace library, we capture the begin and end events for each MPL routine along with its arguments and return value. The MPL message passing library does not provide name shifting as the MPI library does. However, it does have hooks for collecting trace data for the VT visualization tool. Therefore, we replace all VT trace routines with UTE trace routines to take advantage of the existing trace collection hooks and to generate AIX trace events. The UTE/MPL trace library starts tracing right before the application begins to run, and terminates tracing at the exit of the application. Additional UTE library routines are provided so that users may turn tracing on or off at any time.

The global clock is accessed in both the UTE/MPI and UTE/MPL trace libraries once every 400 msec. In the UTE/MPI trace library, this is implemented through a piggyback function invoked when the low-level communication timer fires. In the UTE/MPL trace library, it is implemented through a local timer and a signal handler for SIGALRM. Our experience shows that both approaches work equally well.

3.1.3 Tools and visualization

A utility, utemerge, is used to merge multiple trace streams based on global timestamps. The merged trace stream is then passed to other tools for trace listing, performance analysis, or visualization. Another utility, lsute, is used to list and analyze UTE/AIX trace files. With no option set, the lsute utility lists each event, including node ID, timestamp, event name, and associated data words. The tool can also generate a histogram for MPI or MPL routines to report the number of times each routine is called, and the total and average elapsed times for each routine called in the application. Since each node in an IBM SP system may be shared by other processes, information on how the total elapsed time was partitioned may be very useful. For the main process, the utility shows both the time when the CPU is running it and the time when the main process is in its compute mode (i.e., not running in any MPI routine). Table 3.1 shows an example of a time partition table for a set of four trace files.
Table 3.1: A time partition table

    Node               0        1        2        3
    Main pid           15076    15183    18901    11172
    Elapsed time       27.155   27.696   27.799   27.702
    Other processes    0.689    0.210    0.127    0.111
    Idle time          0.293    0.259    0.310    0.282
    Main process       26.171   27.226   27.360   27.308
    Compute time       14.649   14.726   14.609   14.559

The analysis of parallel program tracing typically involves matching events in one stream with related events in another stream. For example, in message passing systems it is important to provide users with run-time data such as the observed message passing time. Detailed descriptions of the analysis techniques can be found in [46].

Table 3.2 shows a histogram of all MPL events and user markers of a two-node program. User markers can be inserted in pairs anywhere in a program to collect information about various phases, loops, and routines. The visualization of these events can be found in Figure 3.5. It can be seen in Table 3.2 that it takes little time to execute some MPL routines, such as MP_Task_query and MP_Environ. For example, the total elapsed time for executing an MP_Task_query and generating trace events can be as little as 3.5 μsec. This shows that the trace overhead is indeed very small.

Table 3.2: A histogram of MPL events and user markers

    Event (count)         Node   Total_time    Count   Average
    S_Phase (20)          0      0.453592064   10      0.045359206
                          1      0.468927034   10      0.046892703
    MP_Brecv (20)         0      0.472896768   10      0.047289677
                          1      0.457345280   10      0.045734528
    MP_Bsend (20)         0      0.302807808   10      0.030280781
                          1      0.301400832   10      0.030140083
    Init_Phase (2)        0      0.121678080    1      0.121678080
                          1      0.121463296    1      0.121463296
    MP_Sync (2)           0      0.000151808    1      0.000151808
                          1      0.008803072    1      0.008803072
    MP_Task_query (2)     0      0.000003840    1      0.000003840
                          1      0.000003584    1      0.000003584
    MP_Environ (2)        0      0.000004608    1      0.000004608
                          1      0.000005120    1      0.000005120

The purpose of program visualization systems is to gain insight into the dynamic behavior of programs. UTE provides multiple conversion utilities to convert a merged trace file into formats suitable for visualization, including the SDDF format [51] for Pablo and the ALOG format for UPSHOT/NUPSHOT [52, 53]. For visualization, we are interested in several aspects of the parallel application trace that has been captured by the UTE tracing library.
Displaying user markers along with source code association provides a simple way to extend the tool for better understanding the structure and/or dynamics of the application. We chose NUPSHOT, a public domain visualization tool developed at Argonne Na- tional Laboratory, and modified it to suit our needs. NUPSHOT provides a graphical interface to display timelines of process state information. The trace information is pro- vided in either ALOG or PICL format, two popular trace file formats. Along with other conversion tools, we developed a conversion tool, called ute2ups, that transforms the UTE output into the ALOG file format. Providing such a transformation tool allows one to easily port to other visualization systems without changing other UTE analysis tools. Figure 3.3 shows a snapshot of a N UPSHOT visualization for a three-node program in which a l-MByte message is circulating among all nodes. Matched sends and receives can be displayed by arrows, from the begin event of a send (such as MPI_Send) to the end event of a corresponding receive (such as MPLRecv). Because process interference events (e.g., context switches, etc.) are captured by UTE, the conversion tool simply registers these as special state events, thus not requiring 54 changes for NUPSHOT. A small program with small circulating messages was run, and process dispatch events were traced along with message passing events. Figure 3.4 shows two different views of the same MPI events with and without the state events. An state indicates a period of time stolen by other processes, including the idle process. It can be seen in Figure 3.4 that a big chunk of time was stolen by other processes, especially at node 0. For instance, shortly after all nodes were synchronized at time 0.1215 sec, all other nodes had to wait (in MPI.Recv state) because node 0 had a context switch and was running something else. Note that the idle process may be dispatched if the application is waiting for the completion of an I/O operation, such as page fault. Because both MP1 and MPL message passing libraries are shared libraries and usually loaded at run time, page faults may occur and result in dispatching the idle process. User markers, if used in pairs, can be analyzed and visualized in UTE. They provide an easy way to mark various phases, loops, and routines in application programs. Fig- ure 3.5 shows an MPL program visualization, including two user states, Ini t_Phase and S_Phase. In addition to process dispatch events, other system events such as system calls and I/O activities can be captured as well. Thus, the framework provides an environment not only for end users but also for system software developers to calculate path lengths, understand system behaviors, and eliminate program bottlenecks. To capture source code to process state associations, we do the following. When a trace event is generated, we store the link register from the execution stack of the event generation procedure, such as the profiling MPI call and the routine to generate user markers, in the event itself. The link register holds the address to branch to after the subroutine is completed. This is the instruction immediately after the subroutine invocation in the application program. Although this requires an extra function call and the extension of each event by an extra word, the overhead is negligible in terms of execution time. We extended the ALOG file format to hold an optional instruction address with each event. 
NUPSHOT itself had to be extended as well, to store this instruction address in its internal state database.

Figure 3.4: NUPSHOT visualization: with and without states

Figure 3.5: Visualization of user markers

Figure 3.6: File browser for source code association

In case the application has been compiled with debug information enabled (i.e., with the -g option), line information is available in the executable. Therefore, NUPSHOT was extended with a module that loads the executable and obtains the line information. When a process state is graphically selected and instruction address information is available for this state, this module is queried with the address and returns (similar to the operation of a debugger) the source filename and the line number associated with that address. This information is then provided to a file browser, which highlights the event-generating location in the application's source code. Figure 3.6 shows the file browser that is presented by clicking on an event. As long as the code is compiled with -g, the feature of source code association is available for all message passing events and user markers. Therefore, a user can easily visualize the most time-consuming states on screen, and find out where (which line) in the source code the responsibility for them lies by clicking on the state area.

3.2 UTE Extensions for Distributed Web Services

UTE was originally developed for scientific applications. A scientific application runs on a number of processors (or nodes), which communicate through messages to jointly solve a problem. UTE relies upon the AIX Parallel Operating Environment (POE) to assign a unique node ID to each node. Trace generation in UTE typically starts when the application begins to run, and stops when the application exits. Thus it is able to capture all message passing events along with user markers and system activities.

In addition to scientific applications, many emerging applications follow the client/server model. In a client/server computing environment, a client requests an operation that another program, the server, provides. Upon receiving a client request, the server performs the requested service and returns any result. A client interface specifies the individual services or operations supported by the server. Clients can only request services that conform to the client interface provided by the given server. A client/server application is very different from a scientific application, in that a server may be idle (while ready for client requests) for a long time between incoming requests. For example, a Web server may be very active during the prime shift and close to idle at night. Obviously, tracing the entire session of the Web server would cause unnecessary events to be generated, and terminating the Web server merely to collect trace events does not make sense in a real-world application. Thus, a trace facility that does not rely on POE and is capable of dynamic trace generation is needed to trace distributed client/server applications.

3.2.1 Existing benchmarking tools and open issues

Several approaches have been developed for performance measurement of Web servers.
Information stored in the access log may include the document being requested, the size of the requested document, the time it was requested, and the Internet address from which it was requested. The information stored in the access log is then analyzed for performance. The WebStone [ 19], a Web server benchmark, was developed in an attempt to bet- ter understand the performance characteristics of Web services. In particular, it allows performance measurement of the server in terms of the average and maximum response time, average and maximum connect time, data throughput rate, number of pages retrieved, and number of files retrieved. It was developed by Silicon Graphics, and is available on the SGI Web server. It is the generally accepted industry standard for measuring Web server performance. WebStone runs exclusively for clients (i.e., the Web browser), makes all measurements from the point of view of the clients, and is independent of the server software. WebStone is suitable for testing the performance of any and all Web servers, regardless of architecture, and all combinations of Web server, operating system, network operating system, and hardware. Each WebStone client workstation is able to launch a number of children (called Webchildren), depending on how the system load is configured. Each of the Webchildren simulates a Web client and requests information from the server based on a configured file load. A program called WebMaster controlled the starting and stopping of WebStone and the collection of data at the end of each test run. It ran on one of the workstations but used no network or processing resources while the test was running. In addition, system monitoring, such as “vmstate” and “netstat” traces [4], were also developed to store CPU, VM, and network usage information. These trace events are kept for a long period and consume much of disk space. However, the lack of efficient analysis and visualization tools limits the scope of these tools. It is difficult for a human being to inspect these trace events efficiently, and allowing the Web server to use these traces and to adjust the server performance would be even more difficult. The Webperf benchmark is a product of SPEC (Standard Performance Evaluation Committee), a nonprofit organization that develops standard benchmarks and publishes official results [54]. The Webperf benchmark is similar to the SGI WebStone in style 59 and intent, but was developed completely independently. Webperf is based on the SPEC LADDIS benchmark for NFS file servers [55] and has a Web browser interface that was adopted from the SATAN (Security Administrator’s Tool for Analyzing Networks) security tool [56]. The developers of Webperf sought to keep the best features of WebStone while improving its portability, applicability, and validity. Like the WebStone, the Webperf is a "black box" test, generating a workload with one or more client processes on one or more workstations. The response time and throughput are measured by the clients, and the results automatically summarized into a standard report. The Webperf can be configured to use different workloads, and has a Web browser interface. These benchmarking tools provide mechanisms to examine and compare the perfor- mance of Web servers as they work today; they are a firm foundation for evaluating Web server performance. However, important open issues not yet addressed by any of the benchmarks remain. 
One of the issues is the lack of techniques to measure the performance of dynamic documents and scripts. The Web is rapidly evolving from the retrieval of static files to more interactive applications such as image maps, database queries, and Java appliets. These requests will make new and different demands on servers. The existing benchmark tools do not handle these kinds of requests conveniently, if at all. Furthermore, these types of workloads have yet to defined. The framework for distributed Web services provides a gateway API into the back-end server, but measuring the Web performance does not provide enough performance information for our entire Web service. The lack of appropriate monitoring tools had prompted us to extend the UTE trace library to support tracing in the distributed framework for Web services. To support this new client/server computing model, we modify UTE in multiple ways. 3.2.2 New trace events - IP_Send, IP_Recv To support communications through firewall or proxy servers, we implement communica- tion connections through sockets and use SOCKS interface to go through firewalls. This 60 works well in a single workstation, a cluster of workstations, and an IBM SP system using IP through the High-Performance Switch. The connection manager can automatically detect the existence of the High-Performance Switch and take advantage of it. The same socket send/receive interface is thus used regardless of the platform. Since our connection manager is built on the UNIX socket library, we need trace events for UNIX socket send/receive operations. Thus, we add two new events - IP_Send and IP_Recv. With these two new events, we can trace interactions between the Connection Manager Daemon and the cliette process, the CGI process and the Connection Manager Daemon, and the CGI process and the cliette process. The Connection Manger Daemon when running on an IBM SPx machine, detects the existence of the High-Performance Switch automatically and instructs cliette processes to use the High-Performance Switch to take advantage of the speed the switch provides. The CGI process could be instructed to use the High-Performance Switch by setting up an environment variable - CMD-HOST. Communication through IP is used by our Web server design. Send and receive operation through UNIX socket ports are used exclusively for communication among cliette processes, CGI processes, and the Connection Manager Daemon. These operations are captured in the trace file as begin and end events for IP_Send and IP-ReCV. 3.2.3 Dynamic trace generation A dynamic tracing interface, which allows a Web service administrator to turn on/off trace whenever necessary, is provided through the use of the CMDadmin utility. Unlike many other projects in which performance analysis and trace generation is often an afterthought process, the Connection Manager Daemon has a built-in interface to accept trace requests coming from CMDadmin. Upon receiving a Trace.start request, the Connection Manager Daemon calls the TraceCliette routine, which asks each active cliette process to turn on its trace. An ac- knowledgment will be sent back to the Connection Manager Daemon after the cliette 61 process turns on its trace. If a cliette process is busy serving a CGI request, a flag is posted on the cliette queue. When a busy cliette process returns to the clietteAVA/L state, this flag is checked and corresponding trace action will be performed. 
3.2.3 Dynamic trace generation

A dynamic tracing interface, which allows a Web service administrator to turn tracing on or off whenever necessary, is provided through the use of the CMDadmin utility. Unlike many other projects, in which performance analysis and trace generation are often an afterthought, the Connection Manager Daemon has a built-in interface to accept trace requests coming from CMDadmin.

Upon receiving a Trace_start request, the Connection Manager Daemon calls the TraceCliette routine, which asks each active cliette process to turn on its trace. An acknowledgment is sent back to the Connection Manager Daemon after the cliette process turns on its trace. If a cliette process is busy serving a CGI request, a flag is posted on the cliette queue. When the busy cliette process returns to the Cliette_AVAIL state, this flag is checked and the corresponding trace action is performed. This delay ensures that cliette trace events are generated on a CGI-request basis, and prevents trace initialization while a CGI request is being served. The Connection Manager Daemon turns on its own trace only after it has received confirmation messages from all cliette processes. Figure 3.7 shows how the trace request flows. If a cliette process fails to turn on its trace, a tracing error is sent back to the Connection Manager Daemon. In this case, the Connection Manager Daemon may abort trace generation by sending a Trace_stop command to those cliette processes whose traces are already on.

Figure 3.7: Flow of a trace-on request between CMDadmin, the Connection Manager, and the cliette processes

One possible option is to turn on some cliette processes' traces while others run without tracing. This allows a system administrator to monitor problematic cliette processes. The tracing of the Connection Manager Daemon is always on as long as some cliette processes are being traced. Thus, dynamic trace generation of selected cliettes provides a way of measuring and debugging interactions between the Connection Manager Daemon and a newly developed cliette. Trace generation is terminated when the traced process exits or when the system administrator issues a stop-tracing request using the CMDadmin utility.

3.2.4 Multiple trace channels

UTE was originally developed under the control of the AIX Parallel Operating Environment (POE), which dispatches jobs to various SP nodes. The Connection Manager Daemon has task ID zero, and assigns a unique, positive task ID to each cliette process. A trace file is generated for each traced process, with the unique task ID as the file name extension. This saves one word per trace record in raw trace files, because a trace record does not need a field to indicate which task it was generated from. Special precautions are taken to avoid the accidental overwriting of existing trace files by repeated requests to turn on tracing. Each trace on/off request pair generates a set of trace files, and multiple trace on/off requests can be issued during the entire course of the distributed Web services. After trace files are generated, UTE utilities are used to merge and analyze the trace events.

CGI processes are child processes of the Web server, and they reside on the same machine as the Web server. A cliette process can also be on the same machine/node as the Connection Manager Daemon. Even a threaded Web server, such as the Netscape Commerce Server, has its pre-started CGI threads on the same machine/node as the server process. The trace facility must therefore be able to collect events from multiple processes on the same workstation. We therefore added the ability to support multiple trace channels in the UTE+ trace library. This allows the Connection Manager Daemon and its cliette processes to run on the same node, and also allows the capturing of trace events of more than one cliette on each node. An available trace channel is chosen when tracing is turned on for each process. The process using trace channel zero (the primary channel) in each node is capable of generating system events as well as message passing events.
3.2.5 Unique IDs for trace generation

Each traced process needs a unique ID to generate a unique trace file. Previous UTE trace libraries required POE on an SP machine to schedule tasks and assign a unique ID to each collaborating process in scientific applications. In our distributed Web services design, each cliette process has a unique cliette ID. This ID is used mainly by the Connection Manager Daemon to control individual cliette processes. It is also used as a token to distinguish a CGI/cliette pair. A CGI process, after receiving the assigned cliette ID from the Connection Manager Daemon, passes this ID to the waiting cliette process. If this ID does not match the receiving cliette's ID, the request is denied. This prevents tampering with the cliette process without permission from the Connection Manager Daemon. If tracing of CGI processes is needed, the Connection Manager Daemon is responsible for assigning each CGI process a unique ID. To prevent duplicate IDs from being used, the Connection Manager Daemon is the only component that can assign IDs. Because the distributed Web service does not require POE, it is portable across multiple platforms and environments.

3.2.6 Clock synchronization

As discussed in Section 3.1.3, a common time reference makes it easier to merge multiple trace files collected on multiple nodes. In a cluster of workstations, clocks on different systems will drift apart over time if not periodically synchronized. The drift of a clock is the frequency error of the clock relative to a reference clock. Oscillator manufacturers quote frequency errors typically on the order of 1 part per million (1 usec/sec), which represents a drift rate of 1 usec per second, or 3.6 msec per hour. Clock synchronization in a cluster of workstations can be achieved either by manual adjustment or by time-synchronizing daemons such as the Network Time Protocol (NTP) [57] or the timed daemon of 4.3BSD UNIX [58]. Time-synchronizing daemons are specialized software for distributing and receiving time over local area networks. A time server, or a hierarchy of time servers, periodically distributes the current time to client nodes, which can then adjust their clocks accordingly. The NTP daemon [59] keeps clocks synchronized to within 1 to 3 msec of each other, and timed to within 5 msec.

3.2.7 On-line timing routines for run-time timing data and statistics

The CM Daemon uses a FIFO queue to choose the first available cliette process to serve an incoming CGI request. To balance the load of each machine/node, other factors need to be considered. First, a cliette process on a lightly loaded node should be chosen before a cliette process on a heavily loaded node. Second, a cliette process running on a powerful machine/node should be chosen before one running on a low-end machine. Other factors also affect a cliette's performance, such as system memory, system swap space, and disk space. Various history statistics, such as paging statistics and process activity, also help make the decision. We provide on-line timing routines in the UTE+ trace library to collect run-time timing data and statistics. These on-line timing routines provide valuable run-time information for performance steering. They are especially useful because the distributed Web services may run on multiple platforms, including a single-node workstation, a cluster of workstations, and an IBM SP2 machine.
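The text does not give a selection formula, but a load-aware ranking of available cliettes could look like the following sketch, which folds the listed factors into a single score (the weights and field names are ours for illustration; the actual CM Daemon uses a plain FIFO queue):

    /* Run-time statistics for the node hosting a cliette process, as
       collected by the on-line timing routines described above. */
    typedef struct {
        double cpu_load;      /* recent load average of the node */
        double node_speed;    /* relative CPU speed of the machine/node */
        double free_mem_mb;   /* available system memory, in Mbytes */
        double paging_rate;   /* recent paging activity */
    } cliette_stats_t;

    /* Lower score = better candidate for the next CGI request.
       The weights are arbitrary placeholders for tuning. */
    double cliette_score(const cliette_stats_t *s)
    {
        return s->cpu_load / s->node_speed
             + 0.01  * s->paging_rate
             - 0.001 * s->free_mem_mb;
    }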
3.2.8 Enhancement to the utility command - ute2ups

During a tracing period, various CGI processes are invoked to request dynamically generated HTML documents from cliette processes. Tracing all these CGI processes would waste system resources, such as CPU time and disk space. But without the trace data from these CGI processes, many IP_Send/IP_Recv events could not be paired; in other words, an IP_Send event from a cliette process to a CGI process would not have a matching IP_Recv event in the final trace files. This causes problems for the original ute2ups utility. We modified ute2ups to pair only events with positive task IDs, while keeping statistical information for all the unpaired ones. With this change, we can assign task ID -1 to each untraced CGI process when recording an IP_Send or IP_Recv event, to avoid confusion when pairing events. In Appendix D, we show a listing of ute2ups results for both Connection Manager tracing and cliette process tracing.

3.2.9 Enhancement to the NUPSHOT program

NUPSHOT is modified to display additional information in the dynamic pop-up information window, such as the message size for IP_Send and IP_Recv states, and the process name and ID for other processes.

3.2.10 User markers - serv_CGI and serv_Cache

Although there are various types of actions performed by a single cliette process, the most important task is serving a CGI request. By knowing the time taken to serve a CGI request, a system administrator can tune the system. Furthermore, this information can be fed back in real time to the Connection Manager Daemon for dynamic load balancing. A pair of phase markers, b_serv_CGI and e_serv_CGI, is used to mark the beginning and end of a cliette serving a CGI request. This user marker is built into the API library; when users compile their cliette program with the -DUTE_TRACE flag, the marker is automatically included. Another useful user marker is serv_Cache. It is used by the cache manager to indicate the beginning and end of serving a cache item. Depending on how many threads of control a cache manager has, it can generate a user marker for each of its cache threads. In Chapter 6, we discuss how this Cache Manager marker is used by two different threads to generate two different user markers, PickCache and GifCache.
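A plausible shape for such compile-time markers is sketched below; ute_marker() is a hypothetical name for the marker-recording routine in the API library:

    /* When the cliette program is compiled with -DUTE_TRACE, the phase
       markers expand to trace calls; otherwise they compile away. */
    #ifdef UTE_TRACE
    extern void ute_marker(const char *name);      /* hypothetical API entry */
    #  define B_SERV_CGI()  ute_marker("b_serv_CGI")  /* begin serving a CGI request */
    #  define E_SERV_CGI()  ute_marker("e_serv_CGI")  /* end serving a CGI request   */
    #else
    #  define B_SERV_CGI()  ((void) 0)
    #  define E_SERV_CGI()  ((void) 0)
    #endif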
CHAPTER 4

Performance Evaluation of the Framework

In this chapter, we evaluate the performance of the proposed framework. First, we describe the prototype system setup for performance measurement. As a basis for comparison, the performance of the traditional design is measured. We then show that the proposed framework is scalable by examining the influence of the number of CGI requests, the number of cliette processes, and the number of servers on the system.

4.1 Prototype System Setup

In a traditional design, each CGI process needs to establish a connection with its back-end server. As pointed out in Chapter 2, each CGI process needs to perform an initialization/negotiation step before it actually forwards its requests to the back-end server. The time can be significant if there are many CGI processes. If each CGI process performs a relatively simple task, the time required for initialization is significant, wastes system resources, and becomes the bottleneck. And if each of these CGI processes needs to perform some complicated job when constructing HTML pages, it not only adds load to the system running the Web server, but also adds load to the back-end server, resulting in slow response times for all CGI processes. Figure 4.1 shows a high-level block diagram of the traditional method, in which each CGI process establishes its own connection with the back-end server. It also shows that when using multiple HTTP servers to allow more CGI processes, there is no way for the HTTP servers to evenly distribute the CGI connections to the back-end server without knowing the load of the system in advance.

Figure 4.1: Traditional Web server design

In Figure 4.2, we show the high-level diagram using our framework design. In this example, there are at most 8 CGI processes that can be served by cliette processes simultaneously, while others wait for a free cliette. But since these 8 cliette processes are evenly connected to two back-end servers, each can perform a CGI request in a reasonably short time and return to serve the next CGI request. Cliette processes can be created on a remote system to distribute the system load. Even the Connection Manager Daemon can be duplicated on a different machine to manage another set of cliette processes. There can be any number and any type of back-end server, from a database management system to a document management system. Different types of cliette processes can be managed by a single Connection Manager. In Chapter 5, we demonstrate the building of the Digital Library using the IBM Visual Info product as our back-end server.

Figure 4.2: Our framework solution

4.1.1 Performance of the traditional design

Figure 4.3 illustrates the time required for a Web server to access a back-end server when using the traditional design. It details the time required for each step to complete a CGI request.

Figure 4.3: Elapsed time using traditional Web server design (total time = fork CGI process time + initialization/negotiation time + issue requests to back-end server time + wait for results from back-end server time + construct HTML page time + return to Web browser time)

In the traditional design, each CGI has to perform the negotiation/initialization before actually sending requests to be processed by the back-end server. In addition, each CGI needs to log out and terminate its connection with the back-end server when it finishes. For our Digital Library design (to be discussed in Chapter 5), the initialization/negotiation time needed for each CGI process to establish the connection with the back-end server is detailed in Table 4.1, including the network delay. These data were gathered while running 6 CGI processes concurrently on an IBM SP node (SP2 Thin-Node model 390 with 128M memory). The back-end servers are IBM Visual Info servers.
    Steps                     Minimum time (sec)   Maximum time (sec)
    Login                     0.045                0.088
    Session setup             0.125                0.210
    Access Index class        0.452                1.009
    Access Attribute class    0.278                0.404
    Access Linkage class      0.139                0.250
    Setup Cache               0.009                0.012
    Total                     1.048                1.973

Table 4.1: Detailed initialization time in the Digital Library environment

The main operations in this initialization/negotiation phase are login ID/password verification; setting up the session ID and handler; initializing and setting up the connection with the Cache Manager; and accessing and arranging the corresponding index, attribute, and linkage classes. These steps generate about 10 to 15 query statements. The CGI process in the traditional design cannot be reused for any subsequent CGI request. The HTTP server forks a child process when a CGI request is received. After the standard output of this child process is redirected to the open socket port to the HTTP server process, the child process image is overwritten by the CGI script. The only way this CGI process can forward a dynamic HTML page is to write it to standard output, which is then redirected to the HTTP server for return to the Web client. Because this CGI process has no concept of the HTTP server and no other connection with the HTTP server, it cannot be reused. Adding more HTTP servers on an extra machine/node cannot solve this problem efficiently, as illustrated in Figure 4.1.

Figure 4.4: Elapsed time using our framework solution (total time = cliette issues request to back-end server time + cliette waits for results from back-end server time + cliette constructs HTML page time + CGI returns HTML page to Web browser time)

4.1.2 Our framework solution

In our design, the negotiation/initialization is done only once, at the start-up of the cliette process (as shown in Figure 4.4). The differences between our design and the traditional one are highlighted in the shaded areas. These cliette processes stay connected and are ready to forward requests to back-end servers. The added overhead when using our framework is the negotiation/connection time between the CGI process and the cliette process. If this time is considerably less than the initialization time in the traditional design, using our framework reduces the overall time. Our framework overhead measures, on average, around 15 msec (Table 4.2), from when the CGI process asks the Connection Manager for a free cliette process until the connection between the CGI and the free cliette has been established. Instead of spending at least 1 sec for each CGI process during initialization, we add only about 15 msec of overhead. Table 4.2 presents the overhead of our framework, as indicated by the shaded areas in Figure 4.4. The final HTML page is assumed to be 2,048 bytes, which is the size of one network send packet.

    Steps                      Minimum time (msec)   Average time (msec)   Maximum time (msec)
    Between CGI and CMD        7.4                   10.5                  32.2
    Between CGI and Cliette    3.8                   5.2                   16.6

Table 4.2: Overhead in our framework, assuming 2,048 bytes per HTML page

As mentioned in Chapter 1, the forking time in either case could be eliminated by using a dynamic load library, such as Netscape's NSAPI or Microsoft's ISAPI. Because this architecture is not yet used by the general public, we use the standard HTTP server, which forks to start a CGI process, in our environment.
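As a back-of-the-envelope check (our arithmetic from Tables 4.1 and 4.2, not a separately measured result), the per-request saving from replacing the per-CGI initialization with the framework handshake is roughly

    \begin{align*}
    t_{\text{saved}} &\approx t_{\text{init}} - (t_{\text{CGI-CMD}} + t_{\text{CGI-cliette}}) \\
                     &\approx 1048~\text{ms} - (10.5 + 5.2)~\text{ms} \approx 1032~\text{ms (minimum initialization)} \\
                     &\approx 1973~\text{ms} - 15.7~\text{ms} \approx 1957~\text{ms (maximum initialization)},
    \end{align*}

that is, the avoided initialization outweighs the added overhead by roughly a factor of 65 to 125 per request.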
4.2 Design Considerations

To achieve scalability, we had to design our framework with minimum overhead. There are several components that determine how the system should be designed to achieve scalability. We will discuss each component in detail in this section.

4.2.1 Connection Manager

There are several potential drawbacks to our framework design. The Connection Manager Daemon is a potential bottleneck; it must respond to every CGI request. The entire server cannot proceed any faster than the Connection Manager Daemon. For this reason, the Connection Manager code must be carefully written to be as fast and robust as possible. In this section, we examine the influence of this potential bottleneck. Instead of using centralized control, where one Connection Manager manages local and remote cliette processes, multiple Connection Manager Daemons could be used, each managing only local cliette processes, to reduce this potential bottleneck. To balance the number of CGI requests across multiple Connection Managers, however, the Connection Managers must constantly exchange load information, which adds load to each Connection Manager and complicates its design.

4.2.2 Number of cliette processes

The number of cliette processes must be managed. While cliette processes wait for requests, they use up memory and operating system resources. For this reason, it is wise not to create more cliette processes than necessary. On the other hand, if all the cliette processes are busy and more CGI requests arrive, the Connection Manager must either reject the overflow requests, queue them up until a cliette process is free, or start more cliette processes. Which of these three choices is best depends on the server and the situation, although rejecting a connection should be avoided if possible. Starting more cliette processes allows the system to adjust dynamically to the number of CGI requests, but creating and starting the extra cliette processes slows the system at the worst possible time: peak load. Worse, the extra cliette processes may soon become idle if the load drops, and will continue to hang around unless action is taken to retire them. Queuing the requests is better than rejecting them, but if the wait is more than a few seconds, many clients will think the server has crashed and will close the connection anyway. Keeping track of all the pending requests may not be easy to do efficiently, either; the last thing a busy Web system should do at the peak of activity is a lot of bookkeeping. The best strategy, then, is to have enough cliette processes to meet all but the very highest loads, while not being an excessive burden when the system is less busy. The best number of cliette processes for a given server must be determined through experience.
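The three options can be summarized in a short dispatch sketch (ours, for illustration; the queue bound and spawn threshold are assumptions, and the cliette pool is reduced to counters):

    #include <stdio.h>

    #define MAX_QUEUED  64   /* assumed bound on queued CGI requests   */
    #define SPAWN_DEPTH 32   /* assumed backlog that triggers spawning */

    static int free_cliettes = 6;   /* cliettes currently available */
    static int queued        = 0;   /* CGI requests waiting         */

    /* Returns 0 if the request was served or queued, -1 if rejected. */
    int dispatch_cgi_request(void)
    {
        if (free_cliettes > 0) {        /* normal path: hand to a cliette */
            free_cliettes--;
            return 0;
        }
        if (queued < MAX_QUEUED) {      /* queue rather than reject */
            queued++;
            if (queued > SPAWN_DEPTH)   /* start more cliettes, at a cost */
                printf("starting an extra cliette at peak load\n");
            return 0;
        }
        return -1;                      /* reject the overflow request */
    }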
In this section, we examine the average CGI response time versus the number of cliette processes, assuming that the average CGI request costs the back-end server from 500 msec to 1 sec to process. In addition, two different cliette busy rates are used. Normally, a cliette process may not be busy all the time. It may stay idle while waiting for results from the back-end server in order to determine what the next query should be. While one cliette process is idle, it releases the processor and allows other cliette processes to proceed. If a cliette spends more time idling, the system can handle more cliette processes while maintaining reasonable response time. We use two busy rates, 50% and 20%, in our demonstration. These rates are only rough measurements; they do not include the network communication overhead between the cliette processes and the CGI process, or between the cliette process and the back-end server.

4.2.3 Number of SP nodes for cliette processes

Once a node is overloaded with cliette processes, adding cliette processes will only add load to the system, which results in longer CGI response times. At that point, the only solution is to scale the system up from one node to two or more nodes. Our framework design is flexible; it can be set to automatically start up cliette processes on a remote node if the number of waiting CGI requests reaches a predefined number. Because cliette processes are dispatched to serve CGI requests from the free queue in a first-come, first-served fashion, each cliette is kept equally busy. In addition to automatically starting new or remote cliette processes to serve more CGI requests, the system administrator can use a maintenance command to start new cliette processes. The number of cliette processes per node is defined in the Connection Manager configuration file. Distribution of CGI requests to either a local or a remote cliette process is maintained by the centralized Connection Manager.

4.3 Performance Results

A user browsing the Web does not like to wait for an HTML page; users tend to abandon their requests if it takes too long for a page to show up. The performance of a Web server is judged by how fast an HTML page can be returned to the Web client, given that the back-end server performance is not affected by the number of cliette processes connected to it. The same measurement applies to our performance monitoring. The less time it takes a cliette process to generate a dynamic HTML page, the better our framework is. The faster a CGI process can receive and return this dynamic HTML page, the better the overall Web performance is. We use the average CGI request time as the performance metric, assuming each cliette process takes an average of 0.8 sec (as explained in Section 4.3.1) to get a response from the back-end server. Our extended UTE trace library and tools are used to capture and analyze event traces. Customized UTE markers are used to mark the beginning and end of CGI requests; this set of markers helps us identify the total elapsed time for each CGI request.

4.3.1 Workload

We use a modified HTTP server to fork the initial CGI processes in our environment. After the initial stage, in which CGI processes are forked consecutively, these CGI processes remain active and issue CGI requests continuously while we gather the data. The motivation is to control the number of busy cliette processes at all times, to guarantee that all the available cliette processes are busy if the number of CGI requests is equal to or greater than the number of cliette processes. There are two reasons why we choose this model. First, forking a process takes time. For example, if forking a process takes 20 msec, there will be only one busy cliette even with continuous forking, because it takes only 15 msec for a CGI process to finish its requests. We would not be able to see all the cliette processes in the busy state, especially when measuring the overhead of our framework. Second, there is a limit on how many child processes can be outstanding for a single process. If we chose to fork one child process every X msec, we would eventually reach this limit, depending on how fast a cliette process serves a CGI request under the current load.
Table 4.3 lists the minimum average times when using a single CGI process with one cliette process. The work stage is implemented by using a loop to keep the CPU busy; the wait stage is implemented by using the usleep() system call.

    Percentage of   Average time in     Average time in     Average total time to
    Busy time       Work state (sec)    Wait state (sec)    complete one CGI (sec)
    50% Busy        0.39                0.40                0.822
    20% Busy        0.125               0.63                0.773

Table 4.3: Average Wait/Work time for 1 CGI process versus 1 cliette process

In some cases, we assume that the back-end server is infinitely fast. This means the cliette process responds with the final HTML page as soon as it receives the CGI request, which allows us to examine the overhead of our framework. In addition to different numbers of cliette processes (ranging from 1 to 32), we also use different numbers of CGI processes, ranging from 1 to 32.

4.3.2 Influence of the number of cliette processes

In Figure 4.5, we assume that the back-end server is infinitely fast. We use only one SP node for this setting to show the impact of the number of cliette processes on a single node. In the next section, we will scale the number of cliette processes up to at most 16 SP nodes. Figure 4.5 shows that when the number of busy cliettes exceeds a certain number, the average CGI response time begins to increase. This occurs because the cliette processes begin to overload the system resources, especially when they stay busy all the time. If cliette processes are not busy all the time, the system resources can be shared among cliette processes, and more cliette processes can be run on the same system without overloading it. We use a 20% busy ratio and a 50% busy ratio to present the threshold of the system.

Figure 4.5: Assuming infinitely fast back-end server (average time per CGI request versus number of cliette processes per SP node, for 1 to 32 concurrent CGI processes)

Figures 4.6 and 4.7 clearly show that the rate of increase of the average CGI response time depends on how busy the cliette processes are. In Figure 4.7, we show that the best average CGI response time when serving 32 concurrent CGI processes on a single SP node is about 11 sec. In the next section, we will show how to scale the system up to lower the average CGI response time.

4.3.3 Influence of the number of SP nodes

Once a node is overloaded with cliette processes, adding cliette processes will only add load to the system, which results in longer CGI response times, as shown in Figures 4.6 and 4.7. When the cliette process busy ratio is 20%, more cliette processes can respond to equal numbers of CGI processes without additional noticeable load on the system; the threshold can be up to 32 cliettes per node. When the cliette process busy ratio is 50%, adding more cliette processes to handle equal numbers or more CGI processes only lengthens the average CGI response time.

Figures 4.6 and 4.7: Average CGI response time per request versus number of cliette processes (50% and 20% busy ratios)
Figure 5.6: Flowchart of the nph-CGIscript process (the request is routed by cache type, Picklist or PageItem; cached information is returned directly to the Web client, and for page images the FCLA_GifCache is consulted, with a miss forwarded to the cliette process)

Figure 5.7: Flowchart of the GetGif process (if the image information is in the cache, return it to the Web client; otherwise contact the Visual Info cliette for the image using an IMAGE_ONLY request, wait for the image from the cliette, and return it to the Web client)

5.4 The Internal Design of the CGI interface - GetGif

The main purpose of the CGI process GetGif is to retrieve the actual page image from the GifCache cache thread, convert it to an image format acceptable to the Web browser, and add the MIME header information. If there is a cache miss, this CGI process requests the file through the cliette process. Figure 5.7 shows a simple flowchart of the GetGif process.
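A minimal sketch of this control flow follows; all helper names are ours, standing in for the actual cache and cliette interfaces, which are not listed in the text:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical external interfaces (not the actual API): */
    extern char *gifcache_lookup(const char *item_id, size_t *len);
    extern char *cliette_request_image(const char *item_id, size_t *len);

    /* GetGif control flow per Figure 5.7: serve the page image from the
       GifCache thread on a hit; on a miss, fetch it through a cliette
       using an IMAGE_ONLY request, then return it to the Web client. */
    int get_gif(const char *item_id)
    {
        char  *image;
        size_t len;

        /* The MIME header must precede the image body. */
        printf("Content-type: image/gif\r\n\r\n");

        image = gifcache_lookup(item_id, &len);
        if (image == NULL) {                              /* cache miss */
            image = cliette_request_image(item_id, &len); /* IMAGE_ONLY */
            if (image == NULL)
                return -1;                                /* back-end failure */
        }
        fwrite(image, 1, len, stdout);   /* stdout is redirected to the client */
        free(image);
        return 0;
    }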
5.5 Performance of Distributed Web Services

The performance of client-server systems such as the Web depends on many factors: the client platform, the client software, the network, network protocols, the server software, and the server platform. Because many different clients interoperate on the Web, and there are many different types of platforms and networks in use, it would be difficult to characterize the entire Web. A trace generation facility may help a system administrator of the distributed Web services understand its cliettes' access patterns. Our extended UTE tracing tools, especially the dynamic tracing facility, provide us with the ability to trace and visualize the performance. The dynamic tracing facility allows a Web service administrator to turn tracing on or off whenever necessary, through the CMDadmin utility. Upon receiving a Trace_start request, the Connection Manager Daemon calls the TraceCliette() routine, which asks each active cliette process to turn on its trace. An acknowledgment is sent back to the Connection Manager Daemon after the cliette process turns on its trace. If a cliette process is busy serving a CGI request, a flag is posted on the cliette queue. When the busy cliette process returns to the cliette_AVAIL state, this flag is checked and the corresponding trace action is performed. This guarantees that cliette trace events are generated on a CGI-request basis and prevents trace initialization while the cliette process is serving a CGI request. The Connection Manager Daemon turns on its own trace only after it has received confirmation messages from all cliette processes. If a cliette process fails to turn on its trace, an error indicating the failure is sent back to the Connection Manager Daemon. In this case, the Connection Manager Daemon may abort trace generation by sending a Trace_stop command to those cliette processes whose traces are already on. Another useful option is to turn on some cliette processes' traces while others run without tracing. This allows the system administrator to monitor problematic cliette processes. The tracing of the Connection Manager Daemon is always on as long as some cliette processes are being traced. Thus, dynamic trace generation of selected cliettes may be a powerful choice for debugging interactions between the Connection Manager Daemon and a newly added cliette process. Trace generation is terminated when the traced process exits or when the system administrator issues a stop-tracing request using the CMDadmin utility. The Connection Manager Daemon has a task ID of zero, and assigns a unique, positive task ID to each cliette process. A trace file is generated for each traced process, with the unique task ID as the file name extension. This saves one word per trace record in raw trace files, because a trace record does not need a field to indicate which task generated it. Special precautions are taken to avoid the accidental overwriting of existing trace files by repeated requests to turn on tracing. Each trace on/off request pair generates a set of trace files, and multiple trace on/off requests can be issued during the entire course of the distributed Web services. After the traces are successfully generated, UTE utilities are used to merge and analyze the trace events. To achieve higher performance, our distributed Web services allow users to use the Cache Manager selectively. If the Cache Manager is not used, the second CGI process, GetGif, is not called. Instead, the GIF file is saved as a normal file and can be retrieved by the Web client. Also, a complete picklist is displayed whenever a picklist is requested, instead of only the first 12 items. If the Cache Manager is used, its task ID is assigned by the Connection Manager at startup time.

CHAPTER 6

Digital Library Performance Analysis and Visualization

In this chapter, we present performance tracing results of our distributed Web server used in the FCLA Digital Library project. Our extended UTE trace library and tools are used to gather and analyze trace results. We try to keep the gateway impact to a minimum based on the results shown in Chapter 4. The flexibility of our Connection Manager design and the scalability of our Web services enable us to fully utilize the IBM SP system to gather trace data for different communication environments.

6.1 FCLA Digital Library Trace Environment Setup

The basic trace setting for our Digital Library uses a fully functional cliette process to conduct our tracing. As described in Chapter 5, this cliette process maintains a permanent connection with its back-end server, the IBM Visual Info system. To provide a reference point, we designed an experiment using all standard components without our gateway; its performance is discussed in Section 6.2. To analyze the performance and to compare possible configurations, we used a single workstation, a cluster of IBM RS/6000 workstations, and an 8-node IBM SP system to gather trace information. The SP system used in our tracing provides both the high-speed network and the token ring/Ethernet connections. All the nodes used are IBM SP2 Thin-Nodes (model 390) with 128M memory each. We used at most 6 cliette processes, because the back-end Visual Info Library server has only five concurrent processes to handle Visual Info requests. Also, based on the performance results shown in Chapter 4, the overall system performance begins to suffer when the total number of cliette processes exceeds a certain number. We run our tracing in three basic settings. The first is running the Web services all on one workstation or one SP node, with or without Cache Manager support. Figure 6.1 diagrams the setting of one workstation without the Cache Manager; Figure 6.2 diagrams one workstation with Cache Manager support.

Figure 6.1: One workstation/SP node without the Cache Manager
Figure 6.2: One workstation/SP node with the Cache Manager

The second basic setting is to run the Web server, Connection Manager Daemon, Cache Manager Daemon, and two cliette processes on one workstation (or one SP node), while the remaining four cliette processes run on two workstations (or two SP nodes). Figure 6.3 shows the setting using three workstations/nodes without Cache Manager support; Figure 6.4 shows the same setting with Cache Manager support. The third setting is to run the Web server on one workstation (or one SP node), while distributing six cliette processes evenly on three SP nodes. For each of the settings, we make comparisons with and without the Cache Manager. When running the tracing on SP nodes, only the high-speed switch is used for communication between servers.

Figure 6.3: Three workstations/SP nodes without the Cache Manager

Figure 6.4: Three workstations/SP nodes with the Cache Manager

In addition to providing information about the advantage of using our distributed Web services for back-end server support, these various settings allow us to compare the advantage of using the Cache Manager and the trade-off of using the IBM SP system with the high-speed switch. The processes to be monitored and traced are the Connection Manager Daemon, the Cache Manager, and six cliette processes, supporting up to six simultaneous requests. The number of cliettes is chosen to match the performance of the back-end server. Although CGI activities are not captured in these settings (to reduce the total number of trace events), send and receive operations are indeed captured in a cliette process's trace file for messages sent to or received from its CGI processes. There are a total of 180 CGI requests during each of our test runs. To closely simulate an actual Web server environment, each CGI request begins with a randomly chosen URL string from a pool of URL strings. These CGI requests are generated automatically by a control program (sketched below). A request is queued until it times out if there is no free cliette available at the time of the request. In addition to the serv_CGI trace marker provided by our gateway API, we add user trace markers to trace Cache Manager activities. Two extra user markers, PickCache and GifCache, are used to signal the beginning and end of the Cache Manager serving Picklist cache information and Gif page cache information. Our back-end service contains two basic components: the Library server and the Object server. In a Digital Library environment, the Library server is viewed as a library catalog containing indexes to various collections. The Object server is the actual book shelf holding books, journals, and newspapers. Some operations, such as requesting the content listing of a journal, can be served by the Library server alone. Getting an actual page image requires the Library server to coordinate with the Object server; the Object server then responds to the requesting client directly with the page image. Both servers use the IBM DB2/6000 database server to maintain their contents.
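The driver logic of such a control program might look like the following sketch; the pool file name, the invoked CGI binary, and the pacing are our assumptions, not details given in the text:

    /* Control program sketch: repeatedly pick a random URL from a pool
       and fork a CGI request, keeping the cliette pool saturated. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define POOL_SIZE  6453   /* size of the URL pool used in the tests */
    #define N_REQUESTS 180    /* CGI requests per test run              */

    int main(void)
    {
        static char pool[POOL_SIZE][256];
        int i, n = 0;
        FILE *fp = fopen("url_pool.txt", "r");   /* assumed pool file */

        if (fp == NULL)
            return 1;
        while (n < POOL_SIZE && fgets(pool[n], sizeof pool[n], fp)) {
            pool[n][strcspn(pool[n], "\n")] = '\0';
            n++;
        }
        fclose(fp);
        if (n == 0)
            return 1;

        srand((unsigned) time(NULL));
        for (i = 0; i < N_REQUESTS; i++) {
            const char *url = pool[rand() % n];
            if (fork() == 0) {
                /* child: run the CGI with the chosen URL string */
                execl("./nph-CGIscript", "nph-CGIscript", url, (char *) 0);
                _exit(1);
            }
        }
        while (wait(NULL) > 0)
            ;   /* reap all request processes */
        return 0;
    }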
A Visual Info system can have more than one Object server connected to one Library server. In our testing, we use only one Library server and one Object server. Each server resides either on an RS/6000 workstation or on one IBM SP2 node. Communication between servers can be either through the token ring or through the high-performance switch. The sample journal we used for our testing contains 500 articles. There are 6,453 different URL strings; among them are 5,111 URL strings that generate HTML pages with GIF images. These CGI processes are generated by a control program instead of a Web server process. Using a control program, we can generate CGI processes fast enough to saturate the pool of cliette processes. Thus, we can trace the average availability of each cliette process and the failure rate of these CGI requests.

6.2 Standard HTTP Setting Without Using CMD/Cache Support

To provide a reference point, we set up an experiment using all standard components. A complex CGI program is created to retrieve data directly from the Visual Info system and convert it to HTML format before sending the data back to the Web client. After receiving the URL string, a CGI process parses the incoming message and retrieves the Visual Info item ID. The CGI process then tries to log in to the Visual Info system. Upon successful login, the CGI process starts issuing several queries to get information on the index class FCLA_1. Another set of search queries is issued to get information for the item. Depending on the type of item being retrieved, another set of queries may be required to retrieve information for building navigation links. If the item contains a page image, a query is issued to retrieve the actual image. The CGI process eventually constructs an HTML page and passes it back to the Web client. The CGI program, which is quite complex, is about 1.5 Mbytes in size. Because information related to the item ID varies in size, the CGI process needs to issue several memory allocation subroutine calls. We measured on a system with a normal load of six simultaneous CGI requests; it takes about 3 to 5 sec from the time the Web server receives the request until the CGI process is ready to process the URL string. To capture the serv_CGI elapsed time, we begin tracing as soon as the process starts. We collected about 200 such requests. The average access time is about 28 sec when there are six simultaneous CGI requests; with only one active CGI request, the average access time is about 19 sec. In addition to the above setup, we also modified our cliette process to do a login whenever a CGI request is received. This setting uses the dynamic trace start/stop facility provided by the UTE+ trace package. Table 6.1 shows the elapsed serv_CGI times; we use six cliette processes to do the performance tracing. From Table 6.1, we calculate that the average is around 41 sec. This is higher than using the CGI process alone. Even if we add the time used to start a CGI process in the previous setting, this setting still shows a 6 to 8 sec higher access time. Several factors contribute to the higher number. First, the IP send and receive times contribute to the elapsed serv_CGI time. Second, in addition to six cliette processes, six CGI processes are running simultaneously as well. It also takes time to request a cliette process. This setting does not reflect the real performance of not using CMD/Cache services; it merely gives us some indication of the overhead that our CMD/Cache design might bring.
    Cliette ID   Total (sec)   Calls   Average (sec)
    Cliette 1    1114.312      38      29.324
    Cliette 2    1131.712      32      35.366
    Cliette 3    1217.125      32      38.035
    Cliette 4    1097.844      22      49.902
    Cliette 5    1187.430      30      39.581
    Cliette 6    1141.656      24      47.569
    Total                      168

Table 6.1: Elapsed serv_CGI time statistics (using standard Web components without CMD support)

6.3 Running on a Single Workstation

One workstation is used to run the Web server, including the HTTP Daemon, the Connection Manager Daemon, and six cliette processes. Communication between the Web server and the back-end server is established through a 16-Mbit/s token-ring network. A message is sent using a blocking send() and received using a nonblocking select() followed by a blocking recv() through an Internet stream socket. We use a pair of IP_Send (begin and end) events to indicate the elapsed time of a send(), and a pair of IP_Recv (begin and end) events to show the elapsed time of a recv(). Because CGI processes are not traced, send and receive operations for messages between a cliette and its CGI processes are captured only in the cliette's trace. Messages of two different sizes, 148 bytes and 2,032 bytes, are observed. Exchanging URL strings and HTML documents between a cliette process and its CGI process is done using the large messages (2,032 bytes); the small messages (148 bytes) are used for other purposes such as control, acknowledgment, or administration. Figure 6.5 shows the tracing of the interprocess communication traffic of the Connection Manager and cliette processes; it also shows the elapsed time distributions for IP_Send, IP_Recv, and serv_CGI events. The CMD has ID number 0, and cliettes have ID numbers 1 to 6. From the elapsed time distribution of serv_CGI events, we show that most CGI requests are served within 10 sec, while some are scattered up to 90 sec, depending on the Visual Info operation, the network/system load, and the size of the retrieved page images. As we mentioned in Chapter 5, the retrieved page images have to be written to disk for the Web client to read them. These page images also must be transmitted from the remote Object server to local cliette processes. Depending on the network load and the system load, the time for retrieving and writing a page image varies.

Figure 6.5: Distributed Web services on a single workstation

Tables 6.2 and 6.3 show the total elapsed time, number of calls, and average elapsed time of three different NUPSHOT states: IP_Send, IP_Recv, and serv_CGI. A "NUPSHOT state" is a period of time bounded by two events: a begin event and an end event.

    Task Type    IP_Send                                 IP_Recv
                 Total (sec)   Calls   Average (msec)    Total (sec)   Calls   Average (msec)
    CMD          0.308         218     1.413             0.139         355     0.391
    Cliette 1    10.198        218     46.782            0.131         195     0.673
    Cliette 2    11.319        237     47.759            0.211         211     1.004
    Cliette 3    12.601        290     43.454            0.195         259     0.754
    Cliette 4    8.591         200     42.959            0.134         179     0.753
    Cliette 5    12.688        254     49.955            0.239         227     1.055
    Cliette 6    14.017        301     46.569            0.524         268     1.955
    Total                      1,718                                   1,694

Table 6.2: Elapsed IP_Send and IP_Recv time statistics on a single workstation
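The receive pattern described above can be sketched as follows; it also explains why the traced IP_Recv intervals are much shorter than the IP_Send intervals, since recv() is issued only once select() reports that data has arrived (a sketch, not the actual source):

    #include <sys/types.h>
    #include <sys/select.h>
    #include <sys/socket.h>

    /* Poll the socket with select(), then do a blocking recv() only when
       data is known to be ready, so the recv() itself returns quickly. */
    ssize_t poll_then_recv(int sock, void *buf, size_t len, long usec)
    {
        fd_set rfds;
        struct timeval tv;

        FD_ZERO(&rfds);
        FD_SET(sock, &rfds);
        tv.tv_sec  = usec / 1000000;
        tv.tv_usec = usec % 1000000;

        if (select(sock + 1, &rfds, NULL, NULL, &tv) <= 0)
            return 0;                       /* nothing has arrived yet */
        return recv(sock, buf, len, 0);     /* data is ready: no long wait */
    }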
    Cliette ID   Total (sec)   Calls   Average (sec)
    Cliette 1    532.751       24      22.197
    Cliette 2    586.476       26      22.556
    Cliette 3    534.141       32      16.691
    Cliette 4    535.874       22      24.357
    Cliette 5    558.111       28      19.932
    Cliette 6    534.113       33      16.185
    Total                      165

Table 6.3: Elapsed serv_CGI time statistics on a single workstation

The average IP_Send time is considerably higher than the average receiving time. This is because recv() is called only when the message has already arrived. The Connection Manager Daemon has a lower average sending time because it handles only small messages. Note that there are only 165 serv_CGI states in the trace (Table 6.3). This means 15 CGI requests (180 - 165 = 15) failed to receive service due to time-out. The average serv_CGI elapsed time is about 20 sec, about 8 sec less than that in the all-standard-component experiment.

6.4 Running on a Single IBM SP2 Node

To take advantage of the high-performance switch on the IBM SP2 machine, we allocate one SP2 node for our Web services in this setting. The back-end servers, the Library server and the Object server, are also running on the same SP2 machine but on two different nodes. Communication between the Web servers and the back-end servers is through the high-performance switch. By allocating one dedicated SP node for the Connection Manager Daemon, we can schedule CGI requests to their cliette processes in a timely fashion. Tables 6.4 and 6.5 show the total elapsed time, number of calls, and average elapsed time of three different NUPSHOT states: IP_Send, IP_Recv, and serv_CGI.

    Task Type    IP_Send                                 IP_Recv
                 Total (sec)   Calls   Average (msec)    Total (sec)   Calls   Average (msec)
    CMD          0.076         189     0.403             0.036         367     0.100
    Cliette 1    0.157         272     0.580             0.045         242     0.189
    Cliette 2    0.134         271     0.495             0.047         241     0.197
    Cliette 3    0.443         272     1.629             0.019         242     0.079
    Cliette 4    0.200         272     0.736             0.052         242     0.216
    Cliette 5    0.347         265     1.311             0.024         236     0.105
    Cliette 6    0.243         280     0.871             0.046         249     0.188
    Total                      1,821                                   1,819

Table 6.4: Elapsed IP_Send and IP_Recv time statistics on a single IBM SP node

From the results shown in Tables 6.2 and 6.4, we can see that the average elapsed time for IP_Send improves from roughly 0.04 sec on a single workstation to around 1 msec on an IBM SP system. The average elapsed time for IP_Recv improves from roughly 1 msec
Dray WW 11""9" ‘0 51"?th hISlUWMIS VP'lIUllY Print ' Pnnl Figure 6.6: Distributed Web services on a single IBM SP2 nodc Thu: 1114600000: WHIFQPC m: we 110 Cl iette ID Total No. of Average (secs) calls (secs) Cliette 1 309.950 30 10.331 Cliette 2 311.666 30 10.388 Cliette 3 308.229 30 10.274 Cliette 4 307.765 30 10.258 Cliette 5 309.830 29 10.683 Cliette 6 309.945 31 9.998 Total 180 Table 6.5: Elapsed seerGI time statistics on a single IBM SP node to 0.2 msec. The average elapsed time for seerGI also improves from roughly 20 sec to 10 sec. By comparing Figures 6.5 and 6.6, we find that most of the seerGI operations are centered around 10 sec. This indicates the advantage of using the hi gh-performance switch. 6.5 Running on a Single Workstation with Cache Manager By examining the HTTP daemon log file, we find there are always some HTML pages being accessed frequently. These pages could be the Welcome page, the Int roduct ion page, or the News of the Day page. For example, in our Digital Library environment, a dynamic HTML page containing a professor’s recent class notes is accessed more often than older class notes. In the previous example, we find that on average it takes about 10 sec to request a page image, construct the HTML page, and pass it back to the requesting client browser process. To improve the throughput of heavily accessed pages, we add the Cache Manager support in this setting. There are two cache, the PickCache and the Gi f Cache, maintained by the Cache Manager. Figure 6.7 shows the trace visualization of the interprocesses communication traffic (IP_Send and IP_Recv) between the Connection Manager Daemon and cliette processes, along with elapsed time distributions for IP-Send, IP_Recv, and seerGI. Our Cache Manager uses two cache threads to maintain two different cache types. Due to the limitation lll lblrlrle lllsplav [mm Ptllrllur: 1915/8819 (In seconds) .9 0 RWIWEW H910 fill: Semi : lP_ Harv - seerlil lillillllllll Curstlr' 0 003910 : 01 states: : 0t states: llltl'.‘° 100'. i 111 llth: , 011mg, : ul lllllili 50 75 A ‘37 Retro: 10 I'll Resue to ill Resrze 10 ill hull Pttlll 0 00 0.2 0.3 0.l — 0W, ill 001 0.03 ll 0:: 0.0 _ Italy but llutlnn lb strrlrll out llrtt 5019 nl lllSlogr-‘ltllll Dray left button 10 stretch 001 lell sulv 01105109200 lirsyl Ural; lell buttrln tn stmlr. ll out left slur. rlt lllslngnm llfipl'lv Dray mlrlrlte button to sllde hlslmram lllsplav. Dng mdvitr button in slide histogram insptrv Draq numtle button to slide human 1 twin. Ul'dl] right button 10 such ll uul light side 0t llhlfltfllll Drall right button to stretch out right Still: of llstognlm (ll Ural] nlllll lmllurl lb stltlttl ulll nglll site 01 hiluqulll ltlslllzlv Dray anv lllllltlil to strut: ll luslnmms ventrallv [hag dlIV button to stretch histogram mm in [Iraq any button 10 slmlrtl lllstmlram: YPerRI 0090 Figure 6.7: Distributed Web services on a single workstation with Cache manager support __ Tine: 11:45 Dunno WHIFIZPC m WNZAC 112 of NUPSHOT when displaying nested tracing event, Figure 6.7 only shows the interaction between the connection manager daemon and cliette processes. For example, the nested tracing event can happen when the begin marker of a Gi fCache(P i ckCache) is recorded before the end of a PickCache(Gi fCache) marker, Table 6.6 shows the total elapsed time, number of calls, and average elapsed time of two different N UPSHOT states - I P-Send and IP_Recv. The statistics for the seerGI state are shown in Table 6.7. Task Type IP_Send IP_Recv Total No. of Average Total No. 
    Task Type    IP_Send                                 IP_Recv
                 Total (sec)   Calls   Average (msec)    Total (sec)   Calls   Average (msec)
    CMD          0.168         153     1.103             0.066         205     0.325
    Cliette 1    7.598         168     45.229            0.141         151     0.940
    Cliette 2    7.128         184     38.743            0.125         165     0.759
    Cliette 3    6.945         166     41.839            0.165         149     1.112
    Cliette 4    6.640         186     35.701            0.211         167     1.267
    Cliette 5    7.920         145     54.626            0.091         129     0.707
    Total                      1,002                                   966

Table 6.6: Elapsed IP_Send and IP_Recv time statistics on a single workstation with Cache Manager support

    Cliette ID   Total (sec)   Calls   Average (sec)
    Cliette 1    778.840       19      40.991
    Cliette 2    737.366       21      35.112
    Cliette 3    758.605       19      39.926
    Cliette 4    717.627       21      34.172
    Cliette 5    783.304       16      48.956
    Total                      96

Table 6.7: Elapsed serv_CGI time statistics on a single workstation with Cache Manager support

Due to the limited number of available trace channels, we invoked five cliette processes instead of six in this setting. This results in a longer average serv_CGI response time (an increase from 22 sec to around 40 sec). Another contribution to the longer response time is the fact that each cliette has to access a cache item before it requests anything from the back-end server; it must also update the cache item if it is missing from the cache. Note that the Web client receives its information before the cliette process updates the missing cache item. Thus, the longer serv_CGI response time does not necessarily indicate a longer response time for the Web client; it shows how long a cliette process takes to finish a CGI request and be ready for the next one. Table 6.8 shows the average response time of a cache thread from a CacheOpen call until the corresponding CacheClose call. These calls can be either Cache_Read or Cache_Write requests. The total number of cache accesses is the combined total of requests coming from either cliette or CGI processes. From Table 6.7, we know there are 96 serv_CGI requests from 180 CGI processes. This means 84 CGI requests are served by the Cache Manager instead of cliette processes. For a cache hit, a CGI process must send requests to both the PickCache and GifCache Cache Manager threads to construct the dynamic HTML page. This cache access, with an average return time of 2.51 (1.30 + 1.21) sec, indicates that accessing the Cache Manager is about 8 to 10 times faster than accessing the back-end servers. This cache access time also contributes to the longer average serv_CGI elapsed time shown in Table 6.7. Note that the serv_CGI trace marker is set when the cliette process begins to listen for a request coming from a particular CGI process assigned by the Connection Manager Daemon; the connection setup time between the cliette process and the CGI process is therefore included in the time marked by serv_CGI. The PickCache and GifCache trace markers do not include the original connection setup time between the requester and the Cache Manager.

    Cache Operation   Total (sec)   Calls   Average (sec)
    PickCache         474.682       364     1.304
    GifCache          359.783       297     1.211

Table 6.8: Cache Manager activity statistics on a single workstation

It is interesting to see in Figure 6.7 that there are two clusters of serv_CGI elapsed times: a cliette process may encounter a cache hit before it actually accesses the back-end servers.
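Based on the call names mentioned above (CacheOpen, Cache_Read, Cache_Write, CacheClose), a cliette-side cache access might look like this sketch; the signatures are our guesses, not the documented API:

    /* Hypothetical signatures for the Cache Manager interface: */
    extern int  CacheOpen(const char *cache_name, const char *key);
    extern long Cache_Read(int handle, void *buf, long len);
    extern long Cache_Write(int handle, const void *buf, long len);
    extern void CacheClose(int handle);

    /* Read an item from the PickCache thread.  The open..close span is
       what the Table 6.8 averages measure. */
    long read_picklist(const char *key, char *buf, long len)
    {
        int  h = CacheOpen("PickCache", key);
        long n;

        if (h < 0)
            return -1;            /* cache thread unavailable */
        n = Cache_Read(h, buf, len);
        CacheClose(h);
        return n;                 /* n <= 0 means a miss: the cliette then
                                     queries the back-end server and later
                                     updates the cache with Cache_Write() */
    }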
6.6 Running on a Single IBM SP Node with Cache Manager

In addition to using either the high-performance switch or the Cache Manager alone, we combine both in this test setting. Figure 6.8 shows the screen dump of the NUPSHOT results. Tables 6.9 and 6.10 show the results of all the trace markers.

    Task Type    IP_Send                                 IP_Recv
                 Total (sec)   Calls   Average (msec)    Total (sec)   Calls   Average (msec)
    CMD          0.113         157     0.719             0.020         211     0.096
    Cliette 1    0.264         140     1.887             0.018         125     0.146
    Cliette 2    0.331         172     1.925             0.023         153     0.151
    Cliette 3    0.493         217     2.275             0.028         193     0.149
    Cliette 4    0.412         187     2.206             0.021         166     0.131
    Cliette 5    0.379         199     1.907             0.029         177     0.165
    Total                      971                                     940

Table 6.9: Elapsed IP_Send and IP_Recv time statistics on an IBM SP node with Cache Manager support

There are 93 requests out of 180 actually served by the cliette processes. Comparing Tables 6.5 and 6.10, we can determine that the average time spent for a CGI request increases by about 3 sec; as in Section 6.5, this is due to the use of the Cache Manager. Comparing Tables 6.8 and 6.11, the average time for each cache access is about the same, because all the cliette processes and the Cache Manager Daemon are on the same machine/node in both settings. Among the 93 requests, the longest is less than 30 sec (Figure 6.8), compared to about 50 sec in Figure 6.7. Both settings use the Cache Manager, but one runs on a single IBM SP node and the other on a single workstation.

Figure 6.8: High-performance Web server on an IBM SP2 system with caching support

    Cliette ID   Total (sec)   Calls   Average (sec)
    Cliette 1    274.134       15      18.275
    Cliette 2    259.282       19      13.646
    Cliette 3    263.222       24      10.967
    Cliette 4    260.298       21      12.395
    Cliette 5    263.388       22      11.972
    Total                      93

Table 6.10: Elapsed serv_CGI time statistics on a single IBM SP node with Cache Manager support

    Cache Operation   Total (sec)   Calls   Average (sec)
    PickCache         491.912       365     1.347
    GifCache          555.710       301     1.846

Table 6.11: Cache Manager activities on a single IBM SP node

6.7 Running on a Cluster of Workstations

In this setting, we use three RS/6000 model 390 workstations to run our Web services. The HTTP Daemon runs on the same node as the CMD and the first two cliette processes. Communication between remote cliette processes and the CMD is established through a 16-Mbit/s token-ring network. Table 6.12 shows that the average elapsed time for IP_Recv in the last four cliettes is greater than that in the first two cliettes, because the communications between the Connection Manager and the last four cliettes go through the local area network. Table 6.13 shows that the average serv_CGI time is around 10 sec, a 200% improvement over that in Table 6.3. The success rate also increases from 165 to 178 out of the total of 180 requests. Figure 6.9 shows the NUPSHOT result.

Figure 6.9: Distributed Web services on a cluster of three workstations

    Task Type    IP_Send                                 IP_Recv
                 Total (sec)   Calls   Average (msec)    Total (sec)   Calls   Average (msec)
    CMD          0.038         191     0.201             0.028         367     0.077
    Cliette 1    0.489         289     1.694             0.023         257     0.091
    Cliette 2    0.290         244     1.188             0.033         217     0.153
    Cliette 3    0.032         244     0.135             0.144         217     0.668
    Cliette 4    0.036         263     0.138             0.137         234     0.587
    Cliette 5    0.037         281     0.133             0.147         250     0.526
    Cliette 6    0.038         289     0.133             0.159         257     0.622

Table 6.12: Elapsed IP_Send and IP_Recv time statistics on a cluster of workstations
    Cliette ID   Total (sec)   Calls   Average (sec)
    Cliette 1    320.325       32      10.010
    Cliette 2    317.928       27      11.775
    Cliette 3    318.441       27      11.794
    Cliette 4    323.710       29      11.162
    Cliette 5    322.085       31      10.389
    Cliette 6    316.773       32      9.899
    Total                      178

Table 6.13: Elapsed serv_CGI time statistics running on a cluster of workstations

6.8 Running on a Cluster of Workstations with Cache Manager

In addition to having a cluster of workstations, we also use the Cache Manager to improve the access rate for frequently accessed pages. Figure 6.10 shows the tracing of the Connection Manager and cliette processes. Because of the limitations of NUPSHOT, we cannot show the tracing of the Cache Manager Daemon. Tables 6.14 and 6.15 show the total elapsed time, number of calls, and average elapsed time of three different NUPSHOT states: IP_Send, IP_Recv, and serv_CGI. The total and average times of the two new user markers, PickCache and GifCache, are shown in Table 6.16.

    Task Type    IP_Send                                 IP_Recv
                 Total (sec)   Calls   Average (msec)    Total (sec)   Calls   Average (msec)
    CMD          0.092         203     0.454             0.021         105     0.206
    Cliette 1    0.018         159     0.114             0.314         176     1.784
    Cliette 2    0.020         162     0.129             0.362         182     1.990
    Cliette 3    0.061         101     0.604             0.013         112     0.119
    Cliette 4    0.059         89      0.664             0.011         100     0.113
    Cliette 5    0.075         108     0.696             0.012         115     0.110
    Cliette 6    0.070         115     0.612             0.014         128     0.112
    Total                      937                                     918

Table 6.14: Elapsed IP_Send and IP_Recv time statistics on a cluster of workstations with Cache Manager support

From Table 6.14, we find that the first two cliettes have much better response times than the remaining four cliette processes. The main reason is that those four cliette processes are on a different node than the Cache Manager Daemon; accessing the Cache Manager for either reading or writing goes through the local area network and adds to the total response time for a single CGI request. A total of 91 CGI requests out of 180 were actually served by cliette processes. The average time for either IP_Send or IP_Recv is about the same as in Table 6.12.

Figure 6.10: Distributed Web services on a cluster of three workstations with Cache Manager support
Cliette ID    Total (sec)   No. of calls   Average (sec)
Cliette 1       184.099          20            9.204
Cliette 2       179.513          20            8.975
Cliette 3       178.169          13           13.705
Cliette 4       190.737          11           17.339
Cliette 5       172.499          13           13.269
Cliette 6       189.344          14           13.524
Total                            91

Table 6.15: Elapsed seerGI time statistics on a cluster of workstations with Cache Manager support

Cache Operation   Total (sec)   No. of calls   Average (sec)
PickCache           529.728         356           1.488
GifCache            536.029         296           1.811

Table 6.16: Cache Manager activities on a cluster of workstations

6.9 Running on Three IBM SP Nodes

In this setting, we run our scalable Web services on three IBM SP2 nodes using the high-performance switch. Figure 6.11 shows the tracing of the interprocess communication traffic (IP_Send and IP_Recv) between the Connection Manager Daemon and the cliette processes. Tables 6.17 and 6.18 show the total elapsed time, number of calls, and average elapsed time of three different NUPSHOT states: IP_Send, IP_Recv, and seerGI. Comparing Tables 6.12 and 6.17, we see that the average IP_Send and IP_Recv times have improved slightly; recall from Section 6.7 that we use three SP nodes to simulate a cluster of workstations. Because these SP nodes are tightly coupled and communicate through a private Ethernet network, their network performance is expected to be much better than that of a cluster of workstations; the same expectation applies to the average seerGI request time.

Figure 6.11: Distributed Web services on three IBM SP nodes

              IP_Send                                   IP_Recv
Task Type     Total (sec)  No. of calls  Avg (msec)     Total (sec)  No. of calls  Avg (msec)
CMD              0.035         367         0.095           0.039         186         0.212
Cliette 1        0.049         249         0.198           0.101         280         0.361
Cliette 2        0.040         234         0.172           0.200         263         0.763
Cliette 3        0.044         233         0.189           0.027         262         0.105
Cliette 4        0.045         241         0.187           0.028         271         0.103
Cliette 5        0.044         241         0.186           0.031         271         0.115
Cliette 6        0.049         249         0.200           0.032         280         0.114
Total                         1814                                      1813

Table 6.17: Elapsed IP_Send and IP_Recv time statistics on three IBM SP nodes

Cliette ID    Total (sec)   No. of calls   Average (sec)
Cliette 1       308.529          31            9.952
Cliette 2       312.231          29           10.766
Cliette 3       310.899          29           10.720
Cliette 4       310.549          30           10.351
Cliette 5       307.281          30           10.242
Cliette 6       310.165          31           10.005
Total                           180

Table 6.18: Elapsed seerGI time statistics running on three IBM SP nodes
Yet there is not much difference in the average seerGI elapsed time between cliettes 1 and 2 and among cliettes 3 through 6, compared to the noticeable difference in Table 6.13. The distribution of seerGI elapsed times in Figure 6.11 shows that the elapsed time for seerGI is centered around 10 sec, with the highest being 15 sec. This indicates that using the high-performance switch gives a more compact elapsed-time distribution and provides Web browser clients with a more predictable response time. It is also the reason that no CGI requests fail: a cliette process can respond to a CGI request within 15 sec, which is our CGI processes' timeout value.

6.10 Running on an IBM SP System with Cache Manager

This section describes how we run the Web services on three IBM SP2 nodes. The httpd (our control program), the Connection Manager Daemon, the Cache Manager, and the first two cliette processes all run on the first node. The remaining two nodes run the other four cliette processes, two on each node. Communication between the remote cliette processes and the daemons is through the high-performance switch network. Each SP2 node has the same CPU model (RS/6000 model 390) as the workstations used in Section 6.7.

Figure 6.12 shows the trace visualization of the interprocess communication traffic (IP_Send and IP_Recv) between the Connection Manager Daemon and the cliette processes, along with the elapsed-time distributions for IP_Send, IP_Recv, and seerGI. Table 6.19 shows the total elapsed time, number of calls, and average elapsed time of the two NUPSHOT states IP_Send and IP_Recv. From Table 6.20, we find that the average time a cliette spends serving a CGI request increases slightly for the last four cliettes. This is due to the need to check with the Cache Manager, which is on the same node as the first two cliette processes. On the other hand, the total number of seerGI requests decreases dramatically to 97, indicating that many of the requests have been satisfied by the Cache Manager. The total and average times of the two major Cache Manager operations, PickCache and GIFCache, are shown in Table 6.21. The total number of cache accesses is the combined total of requests coming from both cliette and CGI processes.

Figure 6.12: Distributed Web services on an IBM SP2 system with caching support

              IP_Send                                   IP_Recv
Task Type     Total (sec)  No. of calls  Avg (msec)     Total (sec)  No. of calls  Avg (msec)
CMD              0.019         104         0.187           0.021         201         0.107
Cliette 1        0.420         181         2.321           0.020         161         0.129
Cliette 2        0.289         163         1.776           0.012         145         0.084
Cliette 3        0.013         129         0.108           0.023         115         0.200
Cliette 4        0.012         119         0.106           0.019         106         0.183
Cliette 5        0.015         146         0.104           0.030         130         0.233
Cliette 6        0.014         147         0.097           0.028         131         0.219
Total                          989                                       989

Table 6.19: Elapsed IP_Send and IP_Recv time statistics on three IBM SP nodes with Cache Manager support

Cliette ID    Total (sec)   No. of calls   Average (sec)
Cliette 1       170.291          20            8.514
Cliette 2       171.333          18            9.518
Cliette 3       184.673          14           13.190
Cliette 4       172.214          13           13.247
Cliette 5       185.187          16           11.574
Cliette 6       174.216          16           10.888
Total                            97

Table 6.20: Elapsed seerGI time statistics on three IBM SP nodes with Cache Manager support
The average elapsed time for cache accesses (1.74 and 2.04 sec for PickCache and GIFCache, respectively) indicates that accessing the Cache Manager is roughly four to five times faster than accessing the back-end server. From Tables 6.18 and 6.20, we find that the average time a cliette spends serving a CGI request increases slightly for the last four cliettes. This is due to the need to access the Cache Manager, which is on the same node as the first two cliette processes. But the difference is not as large as in Table 6.15 because, in the current setting, the cliette processes talk to the Cache Manager Daemon through the high-performance switch.

Cache Operation   Total (sec)    No. of calls   Average (sec)
PickCache          654.315907        374           1.749508
GifCache           631.876162        309           2.044907

Table 6.21: Cache Manager activities on an IBM SP system

6.11 Running on a Cluster of Four Workstations with One Workstation Dedicated to the HTTP Daemon

In this example, we dedicate one workstation to the HTTP daemon (our control program) and all of its dynamically created CGI processes. Six cliette processes run on the remaining three workstations, two on each workstation. Figure 6.13 shows the tracing of the interprocess communication traffic (IP_Send and IP_Recv) between the Connection Manager Daemon and the cliette processes. Tables 6.22 and 6.23 show the total elapsed time, number of calls, and average elapsed time of three different NUPSHOT states: IP_Send, IP_Recv, and seerGI.

In this setting, we find that the first two cliette processes have a slight performance improvement over those shown in Table 6.13. Because these nodes are capable of handling a large load, the improvement does not show clearly when the HTTP daemon is moved to a dedicated workstation. Also, in this setting we use the Ethernet connection for all connections, including those from the CGI processes to the cliette processes, which contributes some delay to the cliette response time.

re Path_Name}
# where
#   hostname -- optionally specifies the name of the host on which to
#               start the cliette. If not specified, the cliette is
#               started on the same host as the Connection Manager
#               Daemon.
#   id       -- is the userid the cliette runs under. If not specified
#               it is the same as the Connection Manager Daemon.
#   pw       -- is the password for --id-- if the Connection Manager
#               Daemon needs a password to log in as --id--
#
# For example:
#############################################

-cliette={CMD_EXEC_PATH=/etc/cli}

The optional keyword CMD_CACHE_MANAGER=Service_type is used to associate a cliette with a specific cache manager. The hostname and port of that cache manager are passed to the cliette in the environment variables CMD_CACHE_HOSTNAME and CMD_CACHE_PORT when it is started, if CMD_CACHE_MANAGER is specified. Cliette environment variables may be initialized by adding them to the cliette definition statement. Every "name"="value" pair in this statement is added to the environment.
In the above example, all of the environment variables CMD_EXEC_PATH, CMD_NAME, CMD_PASSWD, CMD_CACHE_MANAGER, and CLIETTE_DBNAME are placed in the environment of the cliette.

3) Cache definition statements

Specify

%service=Service_type Initial_Number Max_Number
-cache={CMD_EXEC_PATH=Path_Name; CMD_PARAMETERS=cache-startup-parameters}

The Initial_Number may be 0 or 1, indicating whether this cache manager is started during initialization. The Max_Number must be 1 for each cache statement. Both CMD_EXEC_PATH and CMD_PARAMETERS are required. CMD_EXEC_PATH specifies the full pathname of the cache manager executable and is used as described in the cliette definition statements.

CMD_STARTUP_PARAMETERS=/usr/cmd/cache.cfg 7175;
# CMD_STARTUP_PARAMETERS, if specified, specifies the command line
# parameters for starting the cache manager. These are:
#   config-file cache-mgr-port
# where
#   config-file    -- is the name of the cache manager configuration
#                     file to use.
#   cache-mgr-port -- optionally specifies the port number the cache
#                     manager should use if not specified in
#                     /etc/services
# Example:

# start and fully configure a local cache manager at initialization time
%service=LocalCacheManager 1 1
-cache={CMD_EXEC_PATH=/etc/www/cache_manager;
        CMD_PARAMETERS=/etc/www/cache.cfg 7175};

# start a remote cache manager at initialization time. Use all defaults.
%service=RemoteCacheManager 1 1
-cache={CMD_EXEC_PATH=/etc/VI_cache_manager;
        CMD_PARAMETERS=/etc/www/cache_manager/config 7176}

# start 4 cliettes at initialization time and allow a maximum of
# 6 cliettes
%service=VI 4 6
-cliette={CMD_NAME=user1;CMD_PASSWD=4C4fe868381597cc;
          CMD_EXEC_PATH=/etc/VI_cliette; CMD_CACHE_MANAGER=LocalCacheManager}
-cliette={CMD_NAME=user2;CMD_PASSWD=d25d89056eea41e4;
          CMD_EXEC_PATH=/etc/VI_cliette; CMD_CACHE_MANAGER=LocalCacheManager}
-cliette={CMD_NAME=user3;CMD_PASSWD=b2cc52cc08900f38;
          CMD_EXEC_PATH=/etc/VI_cliette; CMD_CACHE_MANAGER=LocalCacheManager}
-cliette={CMD_NAME=user4;CMD_PASSWD=ddaee0b0ea8c98a4;
          CMD_EXEC_PATH=/etc/VI_cliette; CMD_CACHE_MANAGER=LocalCacheManager}
-cliette={CMD_NAME=user5;CMD_PASSWD=d18caf6e6fb94ba2;
          CMD_EXEC_PATH=/etc/VI_cliette; CMD_CACHE_MANAGER=LocalCacheManager}
-cliette={CMD_NAME=user6;CMD_PASSWD=8bad271cd9bf623c;
          CMD_EXEC_PATH=/etc/VI_cliette; CMD_CACHE_MANAGER=LocalCacheManager}

# start no cliette at initialization time and allow a maximum
# of 2 cliettes. No definition string is required by the cliette
# program for initialization. These cliettes will not cache anything.
%service=XXX 0 2
-cliette={CMD_EXEC_PATH=/etc/XXX_cliette}
-cliette={CMD_EXEC_PATH=/etc/XXX_cliette}

# start a cliette at the remote host viobj.xxx.edu
%service=RemoteVI 1 1
-cliette={CMD_EXEC_PATH=/etc/VI_cliette; CMD_NAME=userF;
          CMD_PASSWD=6e8ea7c6a5c2971e; CMD_CACHE_MANAGER=RemoteCacheManager}

The number of cliette definition statements should be the same as the maximum number of cliette processes defined in the service type statement. Otherwise, no cliette process will be created. The Initial_Number in the service type statement defines how many cliette processes should be created at initialization time. If more cliette processes are needed, the system administrator uses the CMDadmin command to create cliette processes until the maximum number is reached. The system administrator can use kill -1 CMDaemon_process_id or CMDadmin -i to force the CM Daemon to re-initialize and re-read the configuration file. Re-reading the configuration file will cause all the cliette and cache processes to terminate.
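To picture the re-initialization path, the following is a minimal, hypothetical sketch of how a daemon can catch kill -1 (SIGHUP) and defer the configuration re-read to its main loop; it is not taken from the CM Daemon's actual source, and the helpers terminate_all_children() and read_config_file() are illustrative stand-ins.

#include <signal.h>

static volatile sig_atomic_t reinit_requested = 0;

/* Stand-in helpers: the real daemon would stop every cliette and cache
 * process and then parse /etc/CMDaemon.conf again. */
static void terminate_all_children(void) { /* ... */ }
static void read_config_file(void)       { /* ... */ }

/* kill -1 (SIGHUP) only records the request; the actual work happens
 * in the main loop, outside signal context. */
static void hup_handler(int sig)
{
    (void)sig;
    reinit_requested = 1;
}

int main(void)
{
    struct sigaction sa;
    sa.sa_handler = hup_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGHUP, &sa, NULL);

    for (;;) {
        if (reinit_requested) {
            reinit_requested = 0;
            terminate_all_children();   /* all cliette/cache processes end */
            read_config_file();         /* then the configuration is re-read */
        }
        /* ... accept and dispatch connection requests ... */
    }
}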
All values specified in the cliette definition statement are added to the cliette's environment. Some of these values are reserved and have specific uses. They are prefixed with "CMD_":

CMD_EXEC_PATH
CMD_NAME
CMD_PASSWD
CMD_CACHE_MANAGER
CMD_STARTUP_PARAMETERS

All others are ignored by the Connection Manager but are placed into the environment of the cliette. The CM Daemon can also connect to a remote cliette. The remote cliette is started by the local CM Daemon using rexec on the host machine specified in the configuration file. If the remote cliette process resides on a host in the open internet environment while the CM Daemon resides inside the firewall, the connection between the CM Daemon and the cliette is made through the firewall host using the SOCKS service. The CM Daemon will start cache managers if so configured. The Initial_Number should be 0 if the cache manager is not to be started during initialization, and 1 otherwise. The Max_Number for cache managers must be specified as "1". The "Service_Type" statement is used to satisfy requests from cliettes and CGI processes for the location of a specific cache manager.

A.2 General Purpose Request Block

The following data structures define the two types of network packets passed through the socket connection.

/* First type of network package: used to transfer requests */
struct HTTPRequest {
    int client;                        /* Possible type of Client_Type */
    union {
        int un1_http_service;          /* what kind of service */
        int un1_http_total_size;       /* size of the HTML */
        int un1_cgi_timeout;           /* from cliette/daemon to CGI */
        int un1_missing_sequence;      /* request missing sequence */
    } http_un1;
    int request;                       /* what kind of request */
    int sender_pid;                    /* who sent this request */
    union {
        int un2_cgi_pid;               /* used only when the Daemon sends a request */
                                       /* to a Cliette about the incoming CGI process */
        int un2_cliette_uid;           /* Cliette pid for administration uses */
    } http_un2;
    struct CliettePortInfo port_info;  /* cliette listening port info */
    int http_sender_state;
    union {
        struct Cliette_Queue un4_qe_contents;
        struct Cliette_Def un4_qe_defcontents;
        char un4_http_service_name[50];
    } http_un4;
};

/* Possible type of Client_Type */
#define CMDaemon    0
#define Client_CGI  1
#define Client_ADM  2
#define Cliette     3

/* Possible type of Request */
#define Connect_to_Cliette  1   /* CGI -> Daemon */
#define Cliette_initdone    2   /* Cliette finished init (Cliette->Daemon) */
#define Cliette_free        3   /* Cliette finished 1 job (Cliette->Daemon) */
#define Cliette_finish      4   /* Cliette terminates, in response to */
                                /* Cliette_stop (Cliette->Daemon) */
#define Cliette_done        5   /* Cliette finished req (Cliette->Daemon) */
#define Cliette_OK          6   /* Cliette response to Cliette_ayt (Cliette->Daemon) */
#define Cliette_reinit      7   /* re-initialization (Daemon->Cliette) */
#define Cliette_stop        8   /* terminate a cliette (Daemon->Cliette) */
#define Cliette_debug       9   /* start debugging mode (Daemon->Cliette) */
#define Cliette_debugend   10   /* end debugging mode (Daemon->Cliette) */
#define Cliette_ayt        11   /* are you there (Daemon->Cliette) */
#define Cliette_dojob      12   /* tell Cliette its CGI client process id; */
                                /* Cliette prepares for incoming CGI req (Daemon->Cliette) */
#define CGI_HTMLready      13   /* HTML is ready (Cliette -> CGI) */
#define CGI_settimeout     14   /* Daemon or Cliette asks CGI to change */
                                /* its default timeout value (Cliette -> CGI) */
#define Cliette_init       15   /* init a new cliette (Admin->Daemon) */
#define Cliette_list       16   /* list cliette status (Admin->Daemon) */
#define Cliette_kill       17   /* kill a hanging cliette (Admin->Daemon) */
#define Daemon_terminate   18   /* terminate the daemon (Admin->Daemon) */
#define Daemon_reinit      19   /* Daemon re-reads the configuration file */
#define Packet_ACK         20   /* used for general acknowledgement */
#define Packet_Resend      21   /* CGI asks Cliette to resend the last pkg */
#define Get_URL_String     22   /* cliette requests URL string from CGI */
#define URL_String_End     23   /* Cliette does not want any more URL */
#define Cache_initdone     24   /* Cache Mgr finished init (Cache->Daemon) */

/* Second type of network package: mainly for transferring URL and HTML */
#define DATASIZE 2020
#define HTMLSIZE DATASIZE
#define URLSIZE  DATASIZE

struct DATARequest {
    int data_size;
    int data_sequence_number;
    int end_block;
    char data_string[DATASIZE];
};
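As an illustration of how the two packet types work together, here is a hedged sketch, not taken from the actual source, of a CGI process announcing itself to the daemon with an HTTPRequest and then streaming a long URL in DATASIZE-byte DATARequest blocks; send_packet() is an assumed transport helper, and the exact hand-off between daemon and cliette follows the APIs in the next sections.

#include <string.h>
#include <unistd.h>

/* Assumed helper: write one packet to a connected socket. */
extern int send_packet(int sock, const void *pkt, int len);

void announce_and_send_url(int daemon_sock, int cliette_sock, const char *url)
{
    /* 1. Ask the Connection Manager Daemon for a cliette. */
    struct HTTPRequest req;
    memset(&req, 0, sizeof(req));
    req.client     = Client_CGI;         /* who is talking */
    req.request    = Connect_to_Cliette; /* CGI -> Daemon */
    req.sender_pid = getpid();           /* lets the daemon route the reply */
    send_packet(daemon_sock, &req, sizeof(req));

    /* 2. Stream the URL to the assigned cliette in fixed-size blocks;
     *    end_block marks the final one.  A real implementation might send
     *    only data_size payload bytes rather than the whole struct. */
    const char *p = url;
    int left = (int)strlen(url) + 1;     /* include the terminating NUL */
    int seq = 0;
    while (left > 0) {
        struct DATARequest blk;
        int n = (left > DATASIZE) ? DATASIZE : left;
        blk.data_size            = n;
        blk.data_sequence_number = seq++;
        blk.end_block            = (left == n);
        memcpy(blk.data_string, p, (size_t)n);
        send_packet(cliette_sock, &blk, sizeof(blk));
        p    += n;
        left -= n;
    }
}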
A.3 Connection Manager Daemon API

Function Call    Argument                  Return Value                 Action
InitCliette      struct Cliette_Def *      Process ID                   Initialize a cliette process.
StopCliette      struct Cliette_Def *      0: success; != 0: fail       Terminate a running cliette process.
KillCliette      struct Cliette_Queue *    struct Cliette_Def *;        Kill a hanging cliette process.
                                           NULL: fail
ReInitCliette    struct Cliette_Queue *    0: success; != 0: fail       Force the cliette to run its
                                                                        initialization step.
DebugCliette     struct Cliette_Queue *,   0: success; != 0: fail       Force a cliette process to generate
                 int, char *                                            debugging information.
GetNextCliette   struct HTTPRequest *      struct Cliette_Queue *;      Get the next available cliette to
                                           NULL: no free cliette        serve a request.
AreYouThere      struct Cliette_Queue *,   0: success; != 0: fail       Check if a cliette process is still
                 Time_out value                                         running. The Daemon waits for Time_out.
FindCliette      CGI socket number,        0: successful; -1: fail;     Find a free cliette process for the
                 struct Cliette_Queue *    1: resource unavailable      CGI process.

The structures Cliette_Def and Cliette_Queue are defined as follows:

struct Cliette_Def {
    char def_string[MAXCLIETTEDEF];
    char remote_host[MAXHOSTLEN];
    char remote_user[9];
    char remote_passwd[9];
    struct in_addr remote_addr;
    int remote_addr_length;
    int port_no;
    int unique_id;
    int Cliette_state;
    unsigned char adm_wait;
    char exec_path[MAXPATHLEN];
    struct Cliette_Queue *qe;    /* qe points to the corresponding queue  */
                                 /* element. If qe equals NULL, this      */
                                 /* definition is free to be used.        */
    struct Cliette_Def *next_def;
};

struct Cliette_Queue {
    struct Cliette_Def *cliette_defptr;
    int Cliette_uniqueid;
    int Cliette_processid;
    int serving_cgi_pid;
    int socket_num;
    int public_socket;
    int Cliette_state;           /* redundant info */
    int Cliette_type;
    unsigned char stop_byte;
    struct Cliette_Queue *next;
    union {
        struct Cliette_Queue *un2_next_avail;
        struct Cliette_Queue *un2_next_unavail;
    } Qun2;
};
#define next_avail   Qun2.un2_next_avail
#define next_unavail Qun2.un2_next_unavail
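To show how these calls compose, here is a minimal, hypothetical dispatch step assembled only from the API table above, not from the daemon's actual source; the C prototypes and the five-second liveness timeout are assumptions derived from the table.

/* Prototypes as implied by the A.3 table; exact types are assumed. */
extern struct Cliette_Queue *GetNextCliette(struct HTTPRequest *req);
extern int InitCliette(struct Cliette_Def *def);
extern int AreYouThere(struct Cliette_Queue *qe, int timeout);
extern struct Cliette_Def *KillCliette(struct Cliette_Queue *qe);

void dispatch_cgi_request(struct HTTPRequest *req, struct Cliette_Def *def)
{
    struct Cliette_Queue *qe = GetNextCliette(req);

    if (qe == NULL) {
        /* No free cliette: try to grow the pool, then retry once. */
        if (InitCliette(def) <= 0)
            return;                 /* could not start a new cliette */
        qe = GetNextCliette(req);
        if (qe == NULL)
            return;                 /* still nothing available */
    }

    /* Make sure the chosen cliette is alive before handing it work. */
    if (AreYouThere(qe, 5) != 0) {
        KillCliette(qe);            /* reap the hanging process */
        return;
    }

    /* ... forward the CGI request to the cliette over its socket ... */
}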
A.4 Cliette Process API

This section defines the cliette process API with both the CM Daemon and the CGI process.

Function Call   Argument                    Return Value                   Action
InitSelf        int service_type,           daemon socket number           Initialization step. The return value
                char *hostname,                                            contains the socket number to the
                char *defstring                                            daemon.
WaitForJob      struct HTTPRequest *        cliette socket:                Wait for the Daemon to assign a job
                                            > 0: socket to CGI;            to me. The return value is the
                                            = 0: request from Daemon;      connected socket to the CGI.
                                            < 0: socket error
DoAdmin         struct HTTPRequest *                                       Run an administration function.
IamFree                                                                    The cliette is free to serve more
                                                                           requests.
GetURL          pointer to URL string,      size of the URL string         Get the URL string from the CGI
                size of the URL string,     received                       process.
                cgi_socket
SendHTML        pointer to HTML,            size of the HTML being sent    Give the HTML to the CGI.
                size of the HTML,
                cgi_socket
TellCGI         cgi socket number,          0: successful;                 Tell the CGI to process the result
                total html size             != 0: CGI had died             of a URL search.
URLend                                                                     The cliette does not need any more
                                                                           URLs.

A.4.1 The HTML Request Block

This request block is used to pass the HTML document generated by the cliette in response to the URL request.

#define DATASIZE 2020
#define HTMLSIZE DATASIZE

struct HTMLblock {
    int data_size;               /* current block size      */
    int data_sequence_number;    /* current sequence number */
    char html_data[DATASIZE];    /* data                    */
};
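Putting the table together, the following is a hedged skeleton of a cliette's main loop; build_html_for() is a hypothetical stand-in for the real back-end search, the C prototypes are assumptions derived from the table, and the ordering of TellCGI before SendHTML is inferred from the total-size argument.

/* Prototypes as implied by the A.4 table; exact types are assumed. */
extern int  InitSelf(int service_type, char *hostname, char *defstring);
extern int  WaitForJob(struct HTTPRequest *req);
extern int  GetURL(char *url, int size, int cgi_socket);
extern int  SendHTML(char *html, int size, int cgi_socket);
extern int  TellCGI(int cgi_socket, int total_html_size);
extern void IamFree(void);

/* Hypothetical back-end helper: fills html and returns its length. */
extern int build_html_for(const char *url, char *html, int maxlen);

void cliette_main(int service_type, char *hostname, char *defstring)
{
    static char url[URLSIZE];    /* URLSIZE and HTMLSIZE from A.2 */
    static char html[HTMLSIZE];
    struct HTTPRequest req;

    int daemon_sock = InitSelf(service_type, hostname, defstring);
    if (daemon_sock < 0)
        return;

    for (;;) {
        int cgi_sock = WaitForJob(&req);
        if (cgi_sock < 0)
            break;               /* socket error */
        if (cgi_sock == 0)
            continue;            /* administrative request from the daemon */

        int n = GetURL(url, sizeof(url), cgi_sock);
        if (n > 0) {
            int len = build_html_for(url, html, sizeof(html));
            TellCGI(cgi_sock, len);          /* announce the total HTML size */
            SendHTML(html, len, cgi_sock);   /* then stream the document */
        }
        IamFree();               /* ready for the next request */
    }
}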
A.5 CGI Process API

This section defines the CGI process API with the Connection Manager Daemon and the cliette processes.

Function Call    Argument                     Return Value                   Action
GetCliette       char *servtype,              > 0: cliette socket no.;       Request a cliette.
                 Timeout value,               = 0: timeout;
                 struct CliettePortInfo *     < 0: Daemon died
GetCache         cache manager port number,   0: successful; -1: fail;       Find the cache manager for the
                 hostname                     1: resource unavailable        CGI process.
ConnectCliette   struct CliettePortInfo *                                    Make the initial connection to
                                                                             the cliette.
PutURL           cliette socket number        = sizeof URL: sent;            Give the URL to the cliette.
                                              < sizeof URL: fail
WaitForHTML      cliette socket number        0: successful;                 Wait for the cliette and output
                                              < 0: cliette failed            the HTML to stdout.
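As a quick illustration of the request path, here is a hypothetical CGI-side sequence built only from this table. The "VI" service name comes from the sample configuration file, the 15-second timeout is the CGI timeout quoted in Chapter 6, the C prototypes are assumed, and PutURL's URL argument, which the table elides, is left implicit.

/* Prototypes as implied by the A.5 table; exact types are assumed. */
extern int  GetCliette(char *servtype, int timeout,
                       struct CliettePortInfo *info);
extern void ConnectCliette(struct CliettePortInfo *info);
extern int  PutURL(int cliette_socket);
extern int  WaitForHTML(int cliette_socket);

void serve_one_request(void)
{
    struct CliettePortInfo info;

    /* Ask the CM Daemon for a free cliette of the "VI" service type. */
    int sk = GetCliette("VI", 15, &info);
    if (sk <= 0)
        return;                /* = 0: timeout, < 0: daemon died */

    ConnectCliette(&info);     /* initial connection to the cliette */

    /* Hand the request URL to the cliette; the URL itself travels with
     * the call in a way the table does not spell out. */
    if (PutURL(sk) <= 0)
        return;

    WaitForHTML(sk);           /* the cliette's HTML is copied to stdout */
}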
Appendix B

CMDadmin Manual Page and Its Usage Example

B.1 CMD Administration Command Manual Page

NAME
  CMDadmin - system administrator command to control the Connection
  Manager Daemon process and the various Cliette processes.

SYNOPSIS
  CMDadmin [-a Cliette_id] [-b] [-c type_of_service] [-d Cliette_id]
           [-e Cliette_id] [-i] [-k Cliette_id] [-r Cliette_id]
           [-s cliette_id/type_of_service] [-u cliette_id/type_of_service]
           [-v Cliette_id/type_of_service] [-t] [-x]

DESCRIPTION
  CMDadmin is used by the system administrator to maintain the Connection
  Manager Daemon and the Cliette processes. The cliette id used in the
  command line is the cliette's unique id number, which is different from
  its actual process id.

OPTIONS
  -a  Check if the cliette process is functional.

  -b  Build the connection manager daemon configuration file. This file
      is used by the connection manager daemon to start up cliette
      processes. If no configuration file name is given at the command
      line, the system will prompt you for the filename. If a file with
      the same filename already exists, the system will give you the
      option to save, overwrite, quit, or append. The save option will
      rename the original file and respond with "saving the original
      file as .....". It is the user's responsibility to maintain the
      backup files. The overwrite option will overwrite the original
      file with the new information. The quit option allows the user to
      stop. The append option allows the user to append additional
      cliette information to the configuration file.

  -c  Create a new type-of-service cliette process.

  -d  Start debugging the cliette process.

  -e  End debugging the cliette process with id equal to cliette id.

  -i  Force the connection manager daemon to run its re-initialization
      process. All the running cliette processes are terminated before
      the re-initialization process.

  -k  Kill a hanging cliette process.

  -r  Re-initialize a cliette process.

  -s  Terminate a cliette process, or all the cliettes with type equal
      to type of service.

  -u  Terminate all the cliette processes and the connection manager
      daemon.

  -t  Start UTE tracing.

  -x  Stop UTE tracing.

  -v  View the status of the cliettes of type equal to type of service,
      or of the cliette with id equal to cliette id.

B.2 CMDadmin -b Command Example

We show here how to use the CMDadmin command to build a configuration file.

Enter configuration file (/etc/CMDaemon.conf):
Enter the Cliette service type (such as DB2 or VI): VI
Enter the number of Cliette to start initially : 1
Enter the maximum number of Cliette allowed : 4

**** Please enter information for Cliette #1 ***
Remote host for cliette (Enter for local) :
CLIETTE program name (includes complete path): /usr/sys/CMD/vi_cli
Enter CLIETTE Login NAME (Enter for none): vi_user1
Enter Cliette password (Enter for NONE):
Enter Cliette password again for verification:
Enter extra Cliette global environment variable
Empty string to stop
Enter extra Cliette info (TAG=VALUE): VILIB=libserv1
Enter extra Cliette info (TAG=VALUE): VIOBJ=objserv1
Enter extra Cliette info (TAG=VALUE):

**** Please enter information for Cliette #2 ***
Remote host for cliette (Enter for local) : tivoli.watson.ibm.com
Enter remote machine login name: cmd
Enter remote login passwd (Enter for NONE):
Enter passwd again for verification :
CLIETTE program name (includes complete path): /usr/sys/CMD/vi_cli
Enter CLIETTE Login NAME (Enter for none): vi_user2
Enter Cliette password (Enter for NONE):
Enter Cliette password again for verification:
Enter extra Cliette global environment variable
Empty string to stop
Enter extra Cliette info (TAG=VALUE): VILIB=libserv1
Enter extra Cliette info (TAG=VALUE): VIOBJ=objserv1
Enter extra Cliette info (TAG=VALUE):

**** Please enter information for Cliette #3 ***
Remote host for cliette (Enter for local) :
CLIETTE program name (includes complete path): /usr/sys/CMD/vi_cli
Enter CLIETTE Login NAME (Enter for none): vi_user3
Enter Cliette password (Enter for NONE):
Enter Cliette password again for verification:
Enter extra Cliette global environment variable
Empty string to stop
Enter extra Cliette info (TAG=VALUE): VILIB=libserv1
Enter extra Cliette info (TAG=VALUE): VIOBJ=objserv1
Enter extra Cliette info (TAG=VALUE):

**** Please enter information for Cliette #4 ***
Remote host for cliette (Enter for local) :
CLIETTE program name (includes complete path): /usr/sys/CMD/vi_cli
Enter CLIETTE Login NAME (Enter for none): vi_user4
Enter Cliette password (Enter for NONE):
Enter Cliette password again for verification:
Enter extra Cliette global environment variable
Empty string to stop
Enter extra Cliette info (TAG=VALUE): VILIB=libserv1
Enter extra Cliette info (TAG=VALUE): VIOBJ=objserv1
Enter extra Cliette info (TAG=VALUE):

More Cliette type (y/n) ? n

The following is the configuration file.

# COMPONENT_NAME: Connection Manager Daemon startup file
%service=VI 1 4
-cliette={CMD_EXEC_PATH=/usr/sys/CMD/vi_cli;
          CMD_NAME=vi_user1;CMD_PASSWD=edae8ac3cdf88f2b;
          VILIB=libserv1; VIOBJ=objserv1}
-cliette={CMD_EXEC_PATH=/usr/sys/CMD/vi_cli;
          CMD_NAME=vi_user2;CMD_PASSWD=d5c8a9a6cb5e4b1e;
          VILIB=libserv1; VIOBJ=objserv1}
-cliette={CMD_EXEC_PATH=/usr/sys/CMD/vi_cli;
          CMD_NAME=vi_user3;CMD_PASSWD=cd46fc62bda5f9c3;
          VILIB=libserv1; VIOBJ=objserv1}
-cliette={CMD_EXEC_PATH=/usr/sys/CMD/vi_cli;
          CMD_NAME=vi_user4;CMD_PASSWD=0034bcb2713619f5;
          VILIB=libserv1; VIOBJ=objserv1}

Appendix C

Sample Cache Manager Configuration File and Its API Calls

C.1 Sample Configuration File

cache-manager {
    logging = on
    logfile = all_logs/cache.log
    port = 7175
    wrap-log = yes
    log-size = 64000
    connection-timeout = 300S
}

cache0 {
    root = /usr/www/cache0       # root for cache files
    caching = on                 # enable caching
    file-cache = 100MB
    memory-cache = 1000KB
    expiration = 60M
    check-expiration = 60S
    datum-memory-limit = 2KB
    datum-disk-limit = 4KB
}

cache1 : cache0 {
    root = /usr/www/cache1
    fs-size = 100MB
    mem-size = 0
}

C.2 Cache Manager API

The API and the cache manager communicate using a CacheToken structure. Depending on the operation, the programmer supplies information to, or receives information from, the cache manager via the CacheToken. The CacheToken structure and its supporting types are as follows:

typedef struct _CacheHandle {
    char * cache_host;
    int    port;
    char * cache_id;
    int    socket;
} CacheHandle;

enum CacheDisp {
    CacheRO,
    CacheWO,
    CacheNone
};

enum CacheRc {
    CacheNo,
    CacheModified,
    CacheFound,
    CacheExpired,
    CacheLocked
};

//
// CacheToken:
//   function - C version of the cache token as used in the API
//
typedef struct _CacheToken {
    void *data;                  // the data
    int len;                     // length of data
    int datum_len;               // length of the datum (-1 for N/A)
    time_t creation;             // when data was last written (open R/W)
    time_t expiration;           // expiration date
    time_t last_access;          // when item was last opened for read
    CacheHandle * connection;    // connection fd to daemon
    enum CacheRc return_code;    // return code from last operation
    enum CacheDisp disp;         // indicates whether datum RO, RW, NONE (not open),
                                 // or NULL (does not exist)
} CacheToken;

CacheInit routine

Syntax
    CacheHandle * CacheInit(char *cache_machine, int cache_port,
                            char *cache_service);

Description
    Initialize a connection to the cache manager.

Parameters
    cache_machine  This is the name of the machine running the cache
                   manager.
    cache_port     This is the connection port the cache manager is
                   listening on.
    cache_service  This is the name of the cache service to connect to.
                   This corresponds to the "cache-id" in the specified
                   cache.

Return value
    The CacheInit routine returns a CacheHandle which is used in
    subsequent cache operations. If the cache manager cannot be
    contacted, NULL is returned.

CacheClose routine

Syntax
    void CacheClose(CacheHandle * ch);

Description
    Close the connection to the cache manager.

Parameters
    ch  This is a CacheHandle returned by CacheInit.

Return value
    Nothing is returned.

CacheMakeToken

Syntax
    CacheToken * CacheMakeToken(void *data, int len);

Description
    Initialize a cache token.

Parameters
    data  This is a pointer to the data comprising the token. The data
          is understood to be an array of arbitrary bytes.
    len   This is the length of the token.

Return value
    A CacheToken is allocated from free storage, initialized, and
    returned. It must be freed eventually with CacheFreeToken. The
    application may modify the datum_len, expiration, and disp fields as
    appropriate. If not modified by the programmer, the expiration field
    is set by the cache manager to its default, and disp is set to RO.
CacheOpenData

Syntax
    int CacheOpenData(CacheToken *token, CacheHandle * cache_manager);

Description
    Look up an entry in the cache, and get RO or WO access to it.

Parameters
    token          This is a pointer to a CacheToken which has been
                   initialized by CacheMakeToken(). The "disp" field must
                   be set appropriately. If RW is specified, the
                   "datum_len" must also be set to the length of the
                   data. If "expiration" is not set, the cache manager
                   will use its default expiration; otherwise the
                   "expiration" in the token takes precedence.
    cache_manager  This is a handle to a cache manager which has been
                   initialized by CacheInit.

Return value
    A single integer return code is returned:
    CacheFound     This is returned if the data is in the cache and is
                   valid.
    CacheModified  This is returned if the data is in the cache but is
                   marked expired.
    CacheNo        This is returned if the data is not in the cache, or
                   the data cannot be accessed for some reason.

Usage Notes
    Set the "disp" field to CacheRO for read access or CacheWO to create
    or replace an entry. If CacheWO is specified and the item already
    exists, the item is discarded and re-allocated for fresh creation.
    The CacheHandle is placed in the token by this call, so the token may
    be used without a handle from this point on until it is closed. If
    the object is found in the cache, the token is updated to reflect the
    correct datum_len, creation, expiration, and last_access values.

CacheCloseData

Syntax
    void CacheCloseData(CacheToken *token);

Description
    Close the cache item.

Parameters
    token  This is a pointer to a CacheToken which has been initialized
           by CacheOpenData() specifying RO mode.

Return value
    None.

CacheRead

Syntax
    int CacheRead(CacheToken *token, void *buffer, int len);

Description
    Read data from the cache.

Parameters
    token   This is a pointer to a CacheToken which has been initialized
            by CacheOpenData() specifying RO mode.
    buffer  This is a buffer to receive the data.
    len     This is the number of bytes to read.

Return value
    The number of bytes actually read is returned. On read errors -1 is
    returned and errno reflects the cause of the error.
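To make the read path concrete, the following minimal sketch strings the calls above together. It is illustrative rather than taken from the implementation: the port and cache-id come from the sample configuration in C.1, "localhost" and the key bytes are assumed values, and the void CacheFreeToken(CacheToken *) signature is an assumption since the routine is only named, not specified, above.

#include <string.h>

/* Assumed signature: the text above only names CacheFreeToken. */
extern void CacheFreeToken(CacheToken *token);

int fetch_from_cache(const char *key, char *buf, int buflen)
{
    CacheHandle *ch = CacheInit("localhost", 7175, "cache0");
    if (ch == NULL)
        return -1;                        /* cache manager unreachable */

    CacheToken *tok = CacheMakeToken((void *)key, (int)strlen(key));
    tok->disp = CacheRO;                  /* read-only access */

    int n = -1;
    if (CacheOpenData(tok, ch) == CacheFound) {
        n = CacheRead(tok, buf, buflen);  /* bytes actually read, or -1 */
        CacheCloseData(tok);
    }
    CacheFreeToken(tok);
    CacheClose(ch);
    return n;
}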
CacheReadToFile

Syntax
    int CacheReadToFile(CacheToken *token, char *file);

Description
    Read data from the cache into the named file.

Parameters
    token  This is a pointer to a CacheToken which has been initialized
           by CacheOpenData() specifying RO mode.
    file   This is the name of the file to receive the data.

Return value
    The entire cache item is transferred to the named file. If the file
    already exists it is overwritten.

CacheWrite

Syntax
    int CacheWrite(CacheToken *token, void *buffer, int len);

Description
    Write data to the cache.

Parameters
    token   This is a pointer to a CacheToken which has been initialized
            by CacheOpenData() specifying RW mode and the total length of
            the data object.
    buffer  This is a buffer from which to send the data.
    len     This is the number of bytes to write.

Return value
    The number of bytes actually written is returned. On write errors -1
    is returned and errno reflects the cause of the error.

Usage Notes
    When a datum is opened with CacheOpenData, a seek pointer is set for
    the data object in the cache manager. This pointer is incremented
    with each read or write command and can be reset only by closing and
    reopening the item. The cache manager will accept data only up to the
    number of bytes specified in datum_len in the token when the item was
    opened.

CacheWriteFromFile

Syntax
    int CacheWriteFromFile(CacheToken *token, char *file);

Description
    Write data to the cache from a file.

Parameters
    token  This is a pointer to a CacheToken which has been initialized
           by CacheOpenData() specifying RW mode.
    file   This is the name of the file to be copied to the cache.

Return value
    The entire file is copied to the cache. Note that you must have first
    opened the cache entry in CacheRW mode, passing the correct length of
    the file in the open call. This is required to permit the cache to
    correctly allocate space (in memory or on disk).

CachePurge

Syntax
    int CachePurge(CacheToken *token)

Description
    Purge data from the cache.

Parameters
    token  This is a pointer to a CacheToken which has been initialized
           by CacheOpenData() specifying RW mode.

Return value
    If the item is purged, the value TRUE (1) is returned. Otherwise
    FALSE (0) is returned.

CacheClear

Note: is this call wise? There are probably security concerns.

Syntax
    int CacheClear(CacheHandle * handle);

Description
    Request the cache manager to invalidate all its data.

Parameters
    handle  This is a CacheHandle initialized by a call to CacheInit.

Return value
    If all caches are cleared, the value TRUE (1) is returned. Otherwise
    FALSE (0) is returned.

Usage Notes
    The operation is performed only if ALL entries can be invalidated;
    otherwise none of the entries are invalidated.

CacheSetParameters

This is a "futures" call.
Note: There are probably security concerns for this too.
Note: Details of this call are not clear at this point and will be
determined by experimentation.

Syntax
    int CacheSetParameters(CacheHandle * handle, ... );

Description
    Reset cache parameters without stopping and restarting the cache
    manager.

Parameters
    handle  This is a CacheHandle initialized by a call to CacheInit.
    ...     These are whatever we decide later.

Return value
    None.
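To round out the API, here is the corresponding write-path sketch under the same caveats as the read example: whether a successful write open returns something other than CacheNo is an assumption, as is the CacheFreeToken signature.

#include <string.h>

/* Assumed signature, as in the read example. */
extern void CacheFreeToken(CacheToken *token);

int store_in_cache(CacheHandle *ch, const char *key,
                   const void *data, int len)
{
    CacheToken *tok = CacheMakeToken((void *)key, (int)strlen(key));
    tok->disp      = CacheWO;   /* create or replace the entry */
    tok->datum_len = len;       /* must be set before the open so the */
                                /* cache can allocate space */

    if (CacheOpenData(tok, ch) == CacheNo) {   /* assumed failure code */
        CacheFreeToken(tok);
        return -1;
    }
    int written = CacheWrite(tok, (void *)data, len);
    CacheCloseData(tok);
    CacheFreeToken(tok);
    return written;             /* bytes written, or -1 on error */
}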
Appendix D

Ute2ups Output File

00000 0FE 0FE 0FE 0FE 0FE 0FE 2032 0.
000000000000000 0FD 0FD 0FE 0FE 0FE 0FE 0FE 0FE 0FE 0FE 0FE 0FD
00000000 0FE 0FE 0FE 000000 000000000000
13 13 13. .387514112 13 13. 13. 13. 13. 13. 13. 13. 13. 13. .581643008
13 13. 13. 13. 17. 17. .403901440 17 17.
.387242240 .387314176 387451136 388979712 389090560 389742592 389754368
390113536 390174208 390174208 581337088 581447424 581971456 582043648
582094592 403258880 403339008 403959808
b_IP_Recv e_IP_Recv b_IP_Recv e_IP_Recv b_IP_Recv e_IP_Recv
Define_Marker b_seerGI b_IP_Send e_IP_Send e_IP_Send b_IP_Recv e_IP_Recv
b_IP_Send e_IP_Send b_IP_Send e_IP_Send
NNNNNN 138 @10001594 2 12 NNNNNN 2
GCLOCK 287007.686391450 ADJAMT
b_IP_Send e_IP_Send b_IP_Recv
0 148 0 148 -1 148 12 148 -1 2032 1296389192 seerGI 12 148 148 12 148 12
2032 12 2032 12 148 12 148 12 148 12 148 12 148 12 148
0. 00000000000000000000 0FE 0FE 0FE 0FE 0FE 0FD 00000000
0FE 0FE 0FE 0FE 0FE 0FE 0FE 0FE 0FE 0FE 0FE 0FD 00000000
000000 000000000000
17. 17. 17. 17. 17. 19. 19. .590217472 19 19. 19. 19. 19. 19. 19. 19.
19. 19. .788507648 19
404042496 404656384 412805120 412879360 412956672 589584384 589664512
590289920 590373888 590631680 595549696 595615744 595690752 595762944
602458624 602458624
e_IP_Recv b_IP_Send e_IP_Send b_IP_Recv e_IP_Recv
NNNN 12 12 12 12 12
GCLOCK 287009.872619575 ADJAMT
b_IP_Send e_IP_Send b_IP_Recv e_IP_Recv b_IP_Send e_IP_Send
b_IP_Recv e_IP_Recv b_IP_Send e_IP_Send e_IP_Send
NNNNNNNNNN 2 12 12 12 12 12 12 12 12 12 18 12
GCLOCK 287010.071621925 ADJAMT
148 2032 2032 148 148 148 148 148 148 2032 2032 148 148 148 148

Appendix E

One Set of the CGI Performance Trace Results

This appendix shows an example of the CGI performance trace results. There are 32 cliette processes and 32 concurrent CGI processes. Trace markers prefixed with CLIonly- gather the communication overhead with the cliette processes. Trace markers prefixed with TotalCGI- gather the total overhead of the CGI requests. The results are generated using our extended UTE tools.

Markers       Total elapsed   # calls   Average per     Maximum     Minimum
              time (msecs)              call (msecs)    (msecs)     (msecs)
CLIonly-1         1.180          30       0.039365      0.100808    0.009687
CLIonly-2         1.406          30       0.046889      0.458509    0.004284
CLIonly-3         0.866          30       0.028874      0.131889    0.004897
CLIonly-4         2.264          30       0.075496      0.527412    0.005591
CLIonly-5         2.005          30       0.066835      0.353570    0.006677
CLIonly-6         1.400          30       0.046676      0.278117    0.007001
CLIonly-7         1.597          30       0.053241      0.392853    0.009950
CLIonly-8         1.113          30       0.037119      0.320832    0.006632
CLIonly-9         1.351          30       0.045048      0.237793    0.004432
CLIonly-10        1.207          30       0.040258      0.437840    0.004048
CLIonly-11        1.245          30       0.041502      0.176489    0.004313
CLIonly-12        0.581          30       0.019368      0.084759    0.004211
CLIonly-13        1.344          30       0.044829      0.156143    0.005445
CLIonly-14        2.272          30       0.075742      1.070509    0.004054
CLIonly-15        1.075          30       0.035842      0.139207    0.005370
CLIonly-16        1.223          30       0.040797      0.375009    0.005303
CLIonly-17        0.620          30       0.020693      0.101845    0.004476
CLIonly-18        0.806          30       0.026884      0.439822    0.003896
CLIonly-19        1.383          30       0.046115      0.167071    0.004154
CLIonly-20        1.729          30       0.057660      0.225426    0.006752
CLIonly-21        0.469          30       0.015662      0.229808    0.003834
CLIonly-22        1.032          30       0.034413      0.189010    0.004010
CLIonly-23        1.044          30       0.034821      0.163891    0.003958
CLIonly-24        1.081          30       0.036049      0.447261    0.004008
CLIonly-25        0.557          30       0.018577      0.081385    0.003886
CLIonly-26        0.667          30       0.022239      0.084272    0.004105
CLIonly-27        0.973          30       0.032434      0.208985    0.005090
CLIonly-28        0.981          30       0.032710      0.124501    0.005154
CLIonly-29        0.899          30       0.029980      0.109072    0.004023
CLIonly-30        0.932          30       0.031093      0.089812    0.004395
CLIonly-31        0.601          30       0.020049      0.217402    0.004091
CLIonly-32        0.958          30       0.031934      0.205561    0.003927
Markers       Total elapsed   # calls   Average per     Maximum     Minimum
              time (msecs)              call (msecs)    (msecs)     (msecs)
TotalCGI-1        3.619          30       0.120643      0.983301    0.022279
TotalCGI-2        3.393          30       0.113101      1.096251    0.012936
TotalCGI-3        1.843          30       0.061445      0.217051    0.018176
TotalCGI-4        4.774          30       0.159149      0.724803    0.014992
TotalCGI-5        4.217          30       0.140570      0.475607    0.020310
TotalCGI-6        2.796          30       0.093209      0.302459    0.016272
TotalCGI-7        3.965          30       0.132191      0.534225    0.022115
TotalCGI-8        7.944          30       0.264831      5.902418    0.015512
TotalCGI-9        4.347          30       0.144915      1.158108    0.016727
TotalCGI-10       8.813          30       0.293794      5.909424    0.012247
TotalCGI-11       8.251          30       0.275064      5.647499    0.013061
TotalCGI-12       7.217          30       0.240593      5.635181    0.014370
TotalCGI-13       3.046          30       0.101542      0.331563    0.017238
TotalCGI-14       3.464          30       0.115482      1.148947    0.013400
TotalCGI-15       2.645          30       0.088167      0.344756    0.015360
TotalCGI-16       2.814          30       0.093818      0.425163    0.019374
TotalCGI-17       8.447          30       0.281591      5.932014    0.016651
TotalCGI-18       7.119          30       0.237332      5.634085    0.011395
TotalCGI-19       3.345          30       0.111525      0.311600    0.015581
TotalCGI-20       3.439          30       0.114641      0.320407    0.024020
TotalCGI-21       7.424          30       0.247479      5.657050    0.011613
TotalCGI-22       2.861          30       0.095373      0.520424    0.015211
TotalCGI-23       8.027          30       0.267586      5.540569    0.012145
TotalCGI-24       7.988          30       0.266293      5.527653    0.011663
TotalCGI-25       7.535          30       0.251168      5.932544    0.011602
TotalCGI-26       1.733          30       0.057781      0.150115    0.012808
TotalCGI-27       2.510          30       0.083676      0.597024    0.014737
TotalCGI-28       2.362          30       0.078741      0.369441    0.022032
TotalCGI-29       2.404          30       0.080159      0.269753    0.021162
TotalCGI-30       2.197          30       0.073245      0.203909    0.013885
TotalCGI-31       6.868          30       0.228959      5.663116    0.012985
TotalCGI-32       1.771          30       0.059039      0.239831    0.011593

Appendix F

Attributes for the FCLA_1 Index Class

• CategoryName: Contains things like Journal, Issue, Article, Class Notes, etc. This field is used to describe how the information has been categorized.

• CategoryID: Could be either a name or some identifying aspect within the CategoryName. This could be things like the Journal Name, Issue Number, Article number/name, Course Number, etc. This attribute is used to identify the information within a category.

• ExtKey: This is a unique key used to identify a MARC record in the LUIS database.

• Identifier: This is some identifier that is outside of VI but may be unique in the context of the data. It will be used to construct an HTML title within an HTML document.

• URL: This is some hyperlink that is used to redirect the browser either back to the LUIS catalog, to some other Web server, or somewhere else within the VI catalog.

• NextSibling: This is the VI ITEM ID of the next logical ITEM that can be (or is) associated with this specific ITEM or FOLDER.

• PrevSibling: This is the VI ITEM ID of the previous logical ITEM that can be (or is) associated with this specific ITEM or FOLDER.

• DiscoveryMethod: This is an enumerated numeric attribute defining the type of display to be used for the ITEMS or FOLDERS that are contained by this FOLDER.

• NxtSibAnchor: This is a character field that specifies what to place as the anchor for the URL used to request the next sibling.

• PrvSibAnchor: This is a character field that specifies what to place as the anchor for the URL used to request the previous sibling.

• RemoteDeliveryServer: This is a human-readable character field that contains the protocol, address, and port of a remote server.

• TimeDuration: An integer value specifying an amount of time in minutes that a URL is valid. This field is used with the timestamp sent in the HTTP request to validate the current URL. ZERO indicates an indefinite duration.

• FirstSibAnchor: This is a character field that specifies what to place as the anchor for the URL used to request the first sibling.

• LastSibAnchor: This is a character field that specifies what to place as the anchor for the URL used to request the last sibling.

• ParentAnchor: This is a character field that specifies what to place as the anchor for the URL used to request the parent folder.

• SequenceNumber: This is the sequence number for this ITEM ID or FOLDER as it pertains to the FOLDER that contains this ITEM or FOLDER.

• SequenceTotal: This is the total number of contained ITEMS or FOLDERS as it pertains to the FOLDER that contains this ITEM or FOLDER.

• IntKey: The unique Visual-Info item ID, which is created every time the item is entered into the Visual-Info database.
This information is also saved in the LUIS 174 database to link the MARC record with Visual-Info information. This ITEM ID remained unchanged if a rebuild/restore is done on the Visual-Info database to avoid rebuilding the LUIS database because of the changes in the IntKey field.. Bibliography [1] T. Bemers-Lee, “The HTTP protocols as implemented in w3,” tech. rep., URL = http: //www.w3.org/ pub/ www/ doc/ http.txt, January 1992. [2] T. Kwan, R. McGrath, and D. Reed, “NCSA’s world wide web server: Design and performance,” IEEE Computer, vol. 28, November, 1995. [3] T. Bemers-Lee, R. Fielding, and H. Frystyk, “Hypertext transfer protocol - HTTP/ 1.0,” tech. rep., Internet Draft — http:// www.w3.org /pub [WWW /Protocols, 1995. [4] R. McGrath, “What we do and don’t know about the load on the NCSA WWW server,” September, 1994. [5] E. D. Katz, M. Butler, and R. McGrath, “A scalable HTTP server: The NCSA proto- type,” Proc. 1994 World Wide Web Conference, 1994. [6] A. Ford, Spinning The Web - How to Provide Information on the Internet. VNR Communications Library, 1995. [7] T. BernerS-Lee, “Hypertext markup language (HTML),” tech. rep., URL = ftp: //www.w3.org/ pub/ www/ doc/ html-spec.ps, March 1993. [8] S. Lewontin, “The DCE web toolkit: Enhancing www protocols with lower-layer services,” Third International World Wide Web Conference, 1995. [9] S. Lewontin and M. E. Zurko, “The DCE web: Providing authorization and other distributed services to the world wide web,” WWW Conference ’94, 1994. 175 176 [10] NCSA, “The common gateway interface,” tech. rep., http:// hoohoo. ncs.uiuc.edu/ docs-1.4l, 1994. [11] C. Bowman, P. Danzig, D. Hardy, U. Manber, M. Schwartz, and D. Wessels, “Har- vest: A scalable, customizable discovery and access system,” tech. rep., Dept. of CS, University of Colorado - Boulder, March, 1995. [12] IBM, ImagePlus Visuallnfo - An integrated desktop solution for document manage- ment and beyond. IBM Publication 6221-4011-00, 1995. [13] IBM, D32 Information and Concepts for common server. IBM Publication S20H- 4664-00, 1995. [14] C. Stunkel, D. Shea, B. Abali, M. Atkins, C. Bender, D. Grice, P. Hochschild, D. Joseph, B. Nathanson, R. Swetz, R. Stucke, T. Tsao, and P. Varker, “The SP2 high-performance switch,” IBM Systems Journal, vol. 34, no. 2, 1995. [15] L. Lamport, “Time, clocks, and the ordering of events in a distributed system,” Com- munications of ACM, vol. 21, no. 7, July 1978. [16] T. Kwan, D. Reed, and R. McGrath, “User access patterns to NCSA’s world wide web server,” 1995. [17] R. McGrath, “Performance of several HTTP daemons on an HP 735 workstation,” tech. rep., http://www. ncsa. uiuc. edu/InformationServers/ Performance/ V1.4/ re- porthtrnl, April, 1995. [18] R. McGrath, “Performance of several web server platforms,” tech. rep., httpzllwww. ncsa. uiuc. edu/InformationServers/ Performance/ Platforms/report.html, January 22, 1996. [19] G. Trent and M. Sake, “WebStone: The first generation in HTTP server benchmark- ing,” WWW Conference ’95, 1995. [20] “Webperf: The next generation in web server benchmarking,” 1996. 177 [21] R. McGrath, “Measuring the performance of HTTP daemons,” tech. rep., httpzllwww. ncsa. uiuc. edu/InformationServers/ Performance] Benchmarking/ bench.htrnl, 1996. [22] R. B. Denny, “WebSite performance analysis,” tech. rep., The Web Developer’s Vir- tual Library — http: //www. Stars. com/, 1995. [23] S. E. Spero, “Analysis of HTTP performance problems,” WWW Conference ’ 94, 1994. [24] V. N. Padmanabhan and J. C. 