This is to certify that the dissertation entitled MAINTAINING CONTEXTUAL LINK INTEGRITY IN DISTRIBUTED HYPERMEDIA SYSTEMS presented by Ryan L. McFall has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

Major Professor
Date: 12/8/2000

MAINTAINING CONTEXTUAL LINK INTEGRITY IN DISTRIBUTED HYPERMEDIA SYSTEMS

By

Ryan Lee McFall

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

December 8, 2000

ABSTRACT

MAINTAINING CONTEXTUAL LINK INTEGRITY IN DISTRIBUTED HYPERMEDIA SYSTEMS

By Ryan Lee McFall

Hypertext systems have proven to be an excellent method of organizing and distributing information. The main feature that distinguishes a hypertext system from other information systems is the inherent capability to describe relationships between various entities using links. The importance of links has been known for a long time, but the introduction of electronic hypertext systems has facilitated the increased role that linking holds.

The widespread availability of electronic information and the ease of publishing and modifying that information has led to a phenomenon referred to as broken links. This problem occurs when a previously created link can no longer be traversed due to a modification to or deletion of the referenced information resource. Broken links have been a major annoyance in the most widely used hypertext system to date, the World Wide Web (WWW). With the introduction to the WWW of more sophisticated hyperlinking primitives, the problem will be exacerbated. In particular, a new class of broken link we call a contextually broken link will arise. In this type of broken link, one or more endpoints of the link still exist, but the content located at the broken endpoint is no longer correct. This type of broken link is more subtle than the first type, since the system may not be able to notify the user that the link being followed is broken.

In this dissertation, we describe the problems associated with contextually broken links and describe algorithms and communications protocols that can solve these problems. Furthermore, we analyze the efficiency of these algorithms and protocols. We show that these algorithms are capable of detecting and repairing contextually broken hyperlinks automatically in the WWW without requiring significant changes to its established infrastructure.

© Copyright by RYAN LEE MCFALL 2000
All Rights Reserved

ACKNOWLEDGMENTS

This dissertation has been the product of much work, sweat, tears and prayer.
I would like to take a moment to thank all of the people who have been such a tremendous source of encouragement and strength for me. Without these people, this work would never have been completed.

First of all, I would like to thank Dr. Matt Mutka for overseeing my work as my advisor. I have appreciated his willingness to undertake this research even though my topic was neither of great personal interest to him, nor was it something that he had an extensive background in. I am grateful that Dr. Mutka was patient with me during the times that my teaching took up a significant portion of my time (often probably more than it should have).

A second word of thanks goes to the Graduate School at Michigan State University for providing the Dissertation Completion Fellowship that allowed me to complete this dissertation on time, even while staying home with my newborn daughter for the majority of my final semester on campus. Timely completion of this work without that funding would have been close to impossible.

My parents have provided me with financial and emotional support throughout my life. Without them, I could not have made it through 5 years of college, let alone another 6 years as a graduate student. I am grateful to you both for all the opportunities you provided me while I was growing up, and for your continued, unwavering support while I have been pursuing this dream.

Bernie Holmes listened to my complaining more than just about any other person besides my wife, I suspect. I am grateful to Bernie for his support and suggestions, and for keeping me sane in general. I am sure that he will not forget the many hours I spent muttering about things not working as expected. Thanks for sharing your office with me for so many years!

Dr. Mark Urban-Lurain, Dr. Don Weinshank and Gary McCuaig were also instrumental in completing this work. I thank you for the opportunity to work with you in designing and implementing CSE 101. Mark, you have influenced the way I think about developing software in many ways during the approximately 5 years that we have worked together. I also appreciate your support in all of the decisions I have had to make regarding my graduate school experience.

The faculty at Hope College have been wonderful in their encouragement. Thank you for the excellent preparation for graduate school that you provided me as an undergraduate, and for your sustaining words of advice as I have slowly made progress towards this point. I am particularly thankful to Herb Dershem for the way he has provided ideas and advice. I am in your debt!

There have been several fellow graduate students who have been crucial to the completion of this work. Stephen Wagner, where would I be without all of your help? I truly pray that you will be able to complete your dissertation in the upcoming year and can join me in finally being able to answer the question "Are you almost done?" with "No, I AM done!" Without Russ Hankey, Dave Hall and Dmitri Perkins, I would probably still be done, but I would be in much worse shape. Our tennis exploits and weightlifting were a source of both physical and mental renewal. Russ and Dmitri, I thank you for your spiritual companionship over the years.

To the members of my small group and other friends from East Lansing Trinity Church, I am truly in your debt. You have been an incredible source of encouragement for me and have stood by me throughout all of the struggles of this time.
I love you all and am so thankful that you have come alongside of me as I learned to trust in God rather than myself to complete this journey.

A very unlikely place to meet two people who would prove to be such a source of strength is Istanbul, Turkey. John Gillette, you have been a brother to me over the past years, and without your prayers and friendship I know that I could never have completed this work. Sara, I am grateful for your friendship and the joy and support you have brought to Leanne's life while I have been preoccupied with this work. We thank both of you from the bottom of our hearts for the support you have given us throughout this difficult time.

Finally, most of all I would like to thank my wife Leanne. It is unbelievable to me the way that you have managed to live with me and even still like me throughout these past years. Thank you for being so faithful to pray for me and for always providing an encouraging word when I wanted to give up. I am grateful for the way that you have sacrificially spent so much time doing housework and other tasks when I have been too busy to help. Thank you for graciously finding things to do on Saturdays without me while I trudged along trying to complete this work. One thing that we have learned throughout this time together is that God is definitely faithful to fulfill His promises! I will always love you, princess!

If you are reading this as you embark on this journey, I want to encourage you with these words that have been proven true to me countless times during these years: "Do not be anxious about anything, but in everything, by prayer and petition, with thanksgiving, present your requests to God. And the peace of God, which transcends all understanding, will guard your hearts and minds in Christ Jesus." (Philippians 4:6-8, the Bible)

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF ABBREVIATIONS
1 Introduction
2 Problem Statement
3 Background
3.1 Dexter Reference Model
3.2 Intermedia
3.3 HyTime
3.4 Hyper-G
4 Related Work
4.1 Maintaining Link Integrity
4.1.1 Existential Breaks
4.1.2 Contextual Breaks
4.2 Our Approach
5 Finding and Repairing Broken Links
5.1 Detecting Broken Links
5.1.1 Increasing Detection Efficiency
5.1.2 The Algorithm
5.2 Analysis
5.3 Repairing Broken Links
5.4 Evaluation and Testing
5.4.1 Test Documents
5.4.2 Generating Random Documents
5.4.3 The New Testament Document
5.4.4 Testing Environment
5.4.5 Performance on the New Testament Document
5.4.6 Summary
6 A Software Agent to Provide Link Integrity
6.1 Introduction
6.2 Overview of the Agent Proxy
6.2.1 The Agent in Detail
6.3 Protocol Description
6.3.1 Messages
6.3.2 Deciding When to Redirect
6.3.3 Operation in the Presence of Network Failures
6.4 Costs
6.4.1 Increasing Document Processing Efficiency
6.5 Prototype Implementation
6.6 Summary
7 Independent Operation
7.1 Overview
7.2 Methods
7.3 Results
7.4 Improving the Results
7.5 Conclusions
8 Conclusions and Future Work
8.1 Conclusions
8.2 Future Work
BIBLIOGRAPHY

LIST OF FIGURES

2.1 An example of a contextually broken link
2.2 An example of a contextually broken link (continued)
5.1 A document and its links before and after modification
5.2 Nodes examined by the brute-force algorithm
5.3 Types of document modifications that can and cannot be repaired
5.4 Element height distributions
6.1 Agent system architecture
6.2 Messages passed between agents
6.3 Possible ambiguity when sending a redirect message
6.4 Problems due to network latency and AddReference messages
6.5 The operation of the client proxy
6.6 Performance of modified parser using multiple threads
6.7 Distributed parsing architecture
7.1 Discovering an incorrect link
7.2 Number of times modified before first reference occurs
7.3 Number of times modified without being referenced
7.4 Results using modified algorithm
7.5 Average number of references per server in a document

LIST OF ABBREVIATIONS

DNS: Domain Name System
DOM: Document Object Model
HTML: Hypertext Markup Language
HTTP: Hypertext Transfer Protocol
ISO: International Standards Organization
NTP: Network Time Protocol
PURL: Persistent URL
SGML: Standard Generalized Markup Language
TCP: Transmission Control Protocol
TEI: the Text Encoding Initiative
URI: Uniform Resource Identifier
URL: Uniform Resource Locator
W3C: World Wide Web Consortium
WWW: World Wide Web
XML: the eXtensible Markup Language

Chapter 1

Introduction

The amount of data available through open, distributed hypertext systems, such as the WWW [1], has increased dramatically in the last 5 years. Hypertext, a term whose origin is credited to Ted Nelson [2], is defined by Woodhead in [3] as the "generic class of, or approach to, computer-based materials linked by non-linear structures." Nelson himself defined it as "a body of written or pictorial material interconnected in such a complex way that it could not conveniently be presented or represented on paper." In other words, a hypertext system is one in which the data being presented is linked to other data which are related in some way. The user of the hypertext system does not have to follow a pre-determined, sequential path through the documents making up the hypertext system, but can follow links from "node" to "node" at will.

While data availability is important, it has long been known that data itself does not necessarily comprise knowledge. It is discovering and understanding the relationships between pieces of information that constitute knowledge and understanding, which is where the importance of easy-to-use hyperlinking mechanisms comes into play.
The notion of the importance of linking relevant information goes back at least as far as Vannevar Bush's landmark paper "As We May Think" [4]. Bush describes a system where the user is able to create "trails" of related information, where a particular piece of data can be included in different trails in different ways.

Even with traditional knowledge collections, the importance of being able to identify consistently some portion of knowledge has been recognized. The ISBN system was created to provide a means that could uniquely identify a printed book, for example. Separating the Bible into chapter and verse is an example of the need for a consistent method of referring to a fine-grained portion of a work. Both of these mechanisms work very well due to the static nature of the objects being identified. The task of consistently identifying the same part of some resource becomes more challenging when the resource is frequently changing.

Many early hypertext systems were proposed and built in the early days of hypertext research. The main factors that held back widespread acceptance of these systems were their complexity and their dependence on proprietary data formats and software applications. It took the introduction of the WWW, a simple distributed hypermedia system, to exploit the potential of hypertext. Very simple tools are used to create the content on the web (including the links) and to distribute (serve) that content. In the early days of the WWW, the tool used to author content was a standard text editor, and web servers were simple extensions to the traditional concept of a distributed file system. Indeed, many authors still prefer to use these simple tools for their authoring, and web servers have still not evolved into much more than networked file servers.

The simple hyperlinking model employed by the WWW is another reason that it has been so quickly adopted. Many hypertext systems (for example, [5], [6]) have hyperlinking models that are much more sophisticated than what the WWW allows. Forcing users to learn and cope with the complexity of these models may have contributed to the marginal acceptance of these systems. When it is quite simple to create a link between two information resources, authors are much more likely to do so.

Another reason that the WWW has flourished while earlier hypertext systems have not may simply be that the other systems were implementations of ideas whose time was yet to come [7]. This is evidenced by the fact that as the WWW matures, its users are starting to demand some of the functionality that existed in these more sophisticated hypertext systems [8].

Unfortunately, the simplicity of the WWW has also led to some of the most serious problems facing the web today. In particular, the familiar problem of broken links is one that many experts agree must be solved in order to facilitate the continued expansion of the use of the WWW. According to Davis [9], "If the issue of maintaining integrity in hypertexts is not successfully addressed then open hypermedia will continue to be limited largely to relatively static information publishing applications." The goal of this dissertation is to address the problem of broken links while still retaining much of the simplicity that led to the widespread popularity of the WWW.

The remainder of this dissertation is structured as follows. In chapter 2, we give a more complete statement of the problem we will be solving.
Next, we explore other work that has been done in relation to hyperlinking in general (chapter 3) and specifically the issue of maintenance of link integrity (chapter 4). Chapter 5 describes the algorithms we have created to detect and repair contextually broken links. Chapter 6 describes the software agent we have developed to allow the current generation of WWW browsers and servers to provide contextual link integrity, including describing the communication protocol these agents use. In chapter 7 we describe a method of extending the agent which allows a server to maintain contextual link integrity most of the time even without cooperation from other servers. Finally, in chapter 8 we give conclusions and outline further work that remains to be done.

Chapter 2

Problem Statement

As user demands for more sophisticated linking are met, the problems of broken hyperlinks will increase in frequency and cost. Since link integrity is so crucial to the success of a hypertext system, it is imperative that algorithms and protocols be developed which can maintain link integrity on the WWW. This is the fundamental goal of this research.

One of the more important capabilities that more sophisticated hypertext systems provide is the ability to define links that point to arbitrary locations within a text (for example, the Text Encoding Initiative [10, 11, 12]). This capability allows an author to point a reader to a particular part (called a sub-resource) of the referenced resource; in the case where the "reader" is a software agent, the agent can be directed to process only a particular portion of the resource in question. This capability is noted by Engelbart in [13] as one of the more important development paths the WWW must follow.

To facilitate fine-grained referencing of a sub-resource, one must come up with an addressing mechanism to determine the sub-resource that is the endpoint of the link. One such addressing mechanism is structural addressing, in which a sub-resource is identified by specifying a path to traverse through the structure of the document. For example, documents written using the Hypertext Markup Language (HTML) [14, 15, 16] and the eXtensible Markup Language (XML) [17] can be viewed as a tree structure, and a structural address can specify a particular traversal of the tree from the root node to the desired node of the tree.

Introducing structural addressing as a means of identifying the endpoints of a link brings about a new notion of broken links. The traditional notion of a broken link is a link whose destination endpoint no longer exists. For example, an HTML link can break when the document it references is moved to a new location or is deleted. When this type of broken link occurs, the hypertext system is able to notify the user that the link is no longer valid, since the destination endpoint is not available. In the new type of broken link caused by structural addressing, one or more endpoints of the link still exist, but the content located at the broken endpoint is no longer correct. This type of broken link is more subtle than the traditional type, since the system may not be able to notify the user when the link being followed is broken. To distinguish between these two types of breakage, we will classify the first type of broken link as an existential break, while the second type will be called a contextual break, since it is only through examining the context of the link that we can determine whether it is broken.
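To make the idea of a structural address concrete, the following small sketch (our own illustration; the document, element names, and path representation are hypothetical and do not follow any particular addressing syntax) resolves an address expressed as a sequence of child indices against an XML document tree:

    # A minimal sketch of structural addressing: a sub-resource is located by
    # walking a path of child indices from the root of the document tree.
    import xml.etree.ElementTree as ET

    doc = ET.fromstring(
        "<book>"
        "<section>Introduction</section>"
        "<section>Methods</section>"
        "<section>Results</section>"
        "<section>Conclusions</section>"
        "</book>"
    )

    def resolve(root, path):
        """Follow a structural address, e.g. [2] = third child of the root."""
        node = root
        for index in path:
            node = list(node)[index]   # descend to the index-th child element
        return node

    # The address [2] names the third section ("Results") in this version of
    # the document; if an earlier section is deleted, the same address will
    # silently select different content.
    print(resolve(doc, [2]).text)      # -> Results

The address itself contains no information about the content it was intended to select, which is exactly why structural addresses are vulnerable to the kinds of breakage discussed next.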
Both types of broken links can exist in a hypertext system, whether or not it employs structural addressing. In a system that uses structural addressing, an existential break can occur if an endpoint of a link is deleted in such a way that the locator used to identify that endpoint is no longer valid. For example, if a link is created which addresses the fourth section of a four-section book, and the second section of the book is deleted, then an existential break has occurred: the endpoint addressed by the locator does not exist. A contextual break occurs when the structure of the document is changed so that the content located by the link is not what the link creator originally intended. In the previous example, if the endpoint of the link was the third section of a book, and the second section was deleted, a contextual break would occur.

For a further example, consider the situation shown in Figure 2.1. This figure shows two XML documents, and the corresponding document trees that result from these documents. The first document is a sports article written about a baseball game that took place the previous day between the Stars and the Moons. The second document is a collection of all box scores for the Stars for the current season. The box score in the article is actually a link to the first box score element in the box score document.

Now suppose that the Stars play their second game of the season, and the box score document is updated by placing the second game's box score above the first (perhaps to make it easy to find the most recent box score). Figure 2.2 shows the results of the modification to each of the documents. Because the link in the article document points to the first box score element in the box score document, its content will change due to the modification, and the link "breaks".

[Figure 2.1: The article (left) and box scores (right)]

[Figure 2.2: The documents after the update]

The possibility of existential breaks in a system that does not employ structural addressing is fairly obvious. It is also possible for a contextual break to occur. In the WWW, for example, it is possible to replace some information resource with a resource that has completely different content. A contextual break occurs if the new resource is given the same name as the original resource. Contextual breaks are much less likely to occur, however, if structural addressing is not employed.

The main focus of this work is to solve the problem of contextually broken links. When document modifications take place that cause a contextual break, our goal is to detect this fact automatically and repair the broken links. In particular, we will focus on finding and repairing broken links in documents written using XML in the context of the WWW.
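The following sketch is a rough reconstruction (ours, for illustration only; the markup vocabulary is invented) of the box score scenario from Figures 2.1 and 2.2. The article's link addresses "the first boxscore child of the root," and that structural address selects the wrong game once a newer box score is inserted ahead of the old one:

    import xml.etree.ElementTree as ET

    # Box score document as it existed when the link was created.
    before = ET.fromstring(
        "<boxscores>"
        "<boxscore game='1'><player name='Who' position='1st' hits='3'/></boxscore>"
        "</boxscores>"
    )

    # The same document after game 2's box score is placed above game 1's.
    after = ET.fromstring(
        "<boxscores>"
        "<boxscore game='2'><player name='Who' position='1st' hits='2'/></boxscore>"
        "<boxscore game='1'><player name='Who' position='1st' hits='3'/></boxscore>"
        "</boxscores>"
    )

    address = [0]   # structural address used by the article: first child of the root

    def resolve(root, path):
        node = root
        for i in path:
            node = list(node)[i]
        return node

    print(resolve(before, address).get("game"))  # -> 1  (the intended game)
    print(resolve(after, address).get("game"))   # -> 2  (contextual break: wrong game)

Both versions of the address resolve successfully, so nothing tells the reader (or the browser) that the second resolution no longer returns the content the author meant.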
Our work is not necessarily restricted to this context, as it applies in any system using structural addressing where the documents in question are tree-structured.

XML was created by the World Wide Web Consortium (W3C) as a successor to HTML, the original markup language used to describe resources on the WWW. There were many reasons that the decision was made to replace HTML with XML. One of the most compelling reasons guiding the development of the XML specification was the desire to eliminate the presentational nature of markup data, which would allow the markup to provide semantic information describing the nature of the data. In order to do so, one of the first changes that needed to be made was to eliminate the dependency on a fixed set of elements (commonly referred to as tags) that are available to mark up data. In this sense, XML is not a specific markup language, but rather a "meta" markup language that defines a common structure for an infinite family of markup languages. Rather than completely building a new standard for this meta markup language, the designers of XML leaned heavily on the Standard Generalized Markup Language (SGML) [18] standard. SGML is very complex, which led the designers of XML to eliminate many of the more esoteric features of the SGML language.

SGML provides a fairly simple means of creating links between elements in the same document [19]. Each element can have an attribute with a special type called ID, which must be unique within a document. Any other element that wishes to reference this element does so by having an attribute with a specific type called IDREF. The SGML processor enforces the restriction that all IDs used within a document are unique, and that any attribute of type IDREF contains a value that points to an existing element, within the same document, containing the referenced ID value. SGML does not provide any mechanism that can be used to provide inter-resource linking. In order to create an inter-resource link, an author must use a system that builds upon SGML, such as HyTime [20] or the Text Encoding Initiative (TEI).

The designers of XML realized that there is a need for a middle ground between the simplistic linking models of HTML and SGML and the complexity that is inherent in standards such as HyTime. Therefore, in addition to creating the XML standard describing the ways that documents would be marked up, they authorized the work on a standard means of describing links between XML resources (the XLink standard [21, 22]), and also a standard way of addressing sub-resources of those resources (the XPointer standard [23, 24]). XLink allows links to be created that are multi-ended, bi-directional, and out-of-line. In this way it was influenced by the linking model specified in the HyTime standard.

In order to specify links between documents, the standard Uniform Resource Locator (URL) reference means of locating a document on the WWW is used. If the endpoint of the link is the entire resource located at that particular URL, no further addressing is necessary. If, on the other hand, a sub-resource of the resource designated by the URL reference is to be the link's endpoint, a fragment identifier is appended to the URL, and the fragment identifier is then used to locate that sub-resource. If the resource named by the URL reference is an XML document, then the fragment identifier portion of the reference is an XPointer.
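For example, a link whose endpoint is a sub-resource might combine a URL and a fragment identifier along these lines (the host name and element names are hypothetical, and the traversal notation is written loosely in the style of the earlier XPointer working draft rather than quoting either specification exactly):

    http://www.example.com/boxscores.xml#root().child(1,boxscore)

The URL portion before the '#' locates the box score document, and the fragment identifier after it selects the first boxscore child of the document's root element. A fragment identifier that instead names a unique ID, something like #id(game1), is not affected by purely structural rearrangements in the way that the counting form is.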
For complete information on the ways that XPointers can be used to locate a sub-resource, see the specification. (Current practice is to refer to the standard as XPointer, while a particular instance is called an Xpointer.) Xpointers can be used in a way similar to an HTML link or an SGML IDREF reference to point to a uniquely named element within the target resource. Additionally, they can specify a traversal of the target document tree using locator terms such as ancestor, child, sibling, etc. In this way they are similar to the TEI and HyTime primitives. (Although the XPointer Candidate Recommendation [24] has changed the syntax used to specify tree traversal, the semantics remain unchanged. The version in [23] is somewhat easier to read, and so we will use its notation in this dissertation.)

In addition to allowing the specification of a particular node in the document tree, an Xpointer can also refer to a range of nodes, or to a specific portion of the content of a set of nodes. The addressing mechanism outlined in the XPointer specification essentially uses structural addressing, and thus is susceptible to the problem of contextual link breaks.

There are at least two major problems that must be solved in order to provide guaranteed contextual link integrity for a hypertext system.

The first problem is the necessity of algorithms that are able to locate the original endpoint of a link after the document containing the endpoint has been modified in a way that causes the original link to break. While simple string search algorithms would be sufficient to perform this task, they are not efficient enough to be practical when the referenced document is large and there are a large number of links referencing that document. The fact that the hypertext documents we are considering are very regularly structured allows other algorithms to be developed which realize considerable performance gains over a simple string searching technique. Furthermore, traditional string searching algorithms are meant to deal with plain text only, and would need to be modified to understand the difference between content and markup in a hypertext document.

The second problem that must be solved is to allow the introduction of link integrity guarantees into the WWW without requiring substantial changes in the protocols and document markup languages currently used to implement the WWW. As Bouvin states,

Adhering to standards has a large part of the success of the Web and the very size of the Web has enormous inertia, so an attempt to replace the Web with something perhaps more advanced in certain aspects is, if not doomed, then up against tremendous odds [25].

Protocols must be developed that allow information providers on the WWW to remain as independent as possible. We must not require all authors to use a specific authoring tool along with a specific web server, since systems with these restrictions are not likely to gain widespread acceptance. Instead, a mechanism must be developed where users are still free to use whatever tools they are accustomed to using to author and distribute documents.

The problem of maintaining link integrity while still allowing web systems to remain independent is a challenging one. In order for a system to successfully determine which links have been broken by a document modification, it must have knowledge of the existence of each of the links that point into the document in question.
Acquiring this information independently (that is, without explicit notification of link creation by the referring document's author) has been proven impossible [26], so we must come up with an unobtrusive means of communicating this information between servers.

Any system that attempts to prevent contextual link breakage will require frequent parsing of structured hypertext documents. For a server that handles heavy traffic, fast and efficient parsing is of paramount importance. The widespread existence of multiple processors in today's computing environment leads to the possibility of parsing documents in parallel. Parallel algorithms must be developed to take advantage of the processing power that is available.

After surveying the capabilities of several historical hypertext systems and their approach to solving the broken link problem in chapters 3 and 4, we present our solutions in chapters 5, 6 and 7.

Chapter 3

Background

In this chapter, we discuss several previous hypertext systems and the capabilities that they provide in terms of linking and link maintenance.

3.1 Dexter Reference Model

The Dexter Reference Model [8, 27] is not a specific hypertext system. Instead, it is intended to be an abstract model of a generic hypertext system. While the specific notion of document and link can be very different among various hypertext systems, it is helpful to specify an abstract model of the system so that reasoning can be done about the properties and behavior of the abstract model. This reasoning will then be applicable for any actual system for which a mapping can be found from that system's data and linking model to the Dexter model. This idea is similar, for example, to the motivation behind the OSI seven-layer network model [28].

A second goal of the Dexter model is to provide a common base of terminology for the hypertext community. Every hypertext system has its own set of terms and definitions that it uses, which can make comparison between systems difficult. The Dexter model was in part created so that hypertext researchers might be able to avoid communication difficulties due to a simple difference of vocabulary. As is the case with any abstract model, the Dexter model uses the terms it defines in a way that is separate from any specific hypertext system. For example, the term assigned to refer to data objects in the system is a component, while common hypertext systems often refer to these data objects as documents or nodes.

Parts of the Dexter Model

The Dexter model separates an abstract hypertext system into three layers: the runtime layer, the storage layer and the within-component layer. These layers are listed in decreasing order in terms of level of abstraction in the software design process. This means, for example, that the within-component layer is closest to the physical representation of the data in the system, while the runtime layer is the view that the user of the hypertext system sees. Interestingly enough, it is the middle layer of this model (the storage layer) that the Dexter model chooses to focus on. The within-component and runtime layers are deemed to be too diverse for the creation of a general model which can cover the range of possible implementations for these layers.

The purpose of the storage layer is to model how "nodes" and "links" are put together to create a network of information. As previously mentioned, the Dexter term for node is component.
A component is an abstract data object, with no restriction of the type of object (text, graphic, sound, etc.) or the format in which the object is stored. Along with defining the notion of component, the model also specifies two functions that can be used to identify and access components as part of the storage layer. The first function is the accessor function, which is responsible for taking a unique ID (UID), which all components have, and finding the component in question. In the WWW, for example, this corresponds to the idea of the GET method operating on a URL.

The resolver function is responsible for mapping a more general description of a component to the UID of the component. This function might be responsible for mapping a request such as "the component describing the life of Martin Luther King" to a specific UID of a component matching that description. The WWW does not have an analogue to this function; the URL of a document is generally the only way to retrieve its contents, and corresponds to the specific UID in the Dexter model.

To facilitate addressing parts of components, the Dexter model defines the concept of anchors. Anchors are also made up of two parts, similar in functionality to the roles of the accessor function and the resolver function. The first member of an anchor object is an anchor ID, which is a unique identifier within the scope of the component the anchor is associated with. The value of the anchor ID remains constant throughout the lifetime of the anchor. The second member of an anchor is the anchor value. This is an arbitrary value which can be used by the within-component layer to resolve to a specific part of a component. In order to maintain the separation between the within-component and storage layers, the anchor value has no meaning to the storage layer. For this dissertation, it is interesting to note that the model states that:

As a component changes over time (e.g., when it is edited within the runtime layer), the within-component application changes the anchor value to reflect changes to the internal structure of the component or to reflect within-component movement of the point, region, or items to which the anchor is conceptually attached. The anchor id, however, remains constant, providing a fixed referent that can be used to specify a given structure within a component [8].

It is the implementation of this behavior within the within-component layer into which the algorithms defined in chapter 5 fit.

Once we have the notion of components and anchors, it becomes possible to describe the mechanism used to describe an endpoint of a link. This mechanism is called a specifier, and its main feature is a (component specification, anchor ID) pair. A link in the Dexter model is simply a set of two or more specifiers, providing links that are multi-ended as well as bidirectional. (The direction of an anchor is also contained as part of a specifier.)

A further aspect of the Dexter model is a specification of a set of abstract functions that can be used to operate on the model contained in a hypertext. Some of these operations, sketched in code after this list, are of particular interest to this dissertation:

- The function LinksToAnchor takes an anchor and its containing component as input, and returns the set of links in the hypertext that refer to the anchor in question.
- The function LinksTo takes a hypertext (which is an enumerable set of components and links) and a component ID as input, and returns the set of links resolving to that component.
- The function ModifyComponent modifies a component, ensuring that the component remains consistent, and that the resulting hypertext remains link consistent. Here link consistent means that the component specifiers of every link specifier must resolve to existing components.
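The sketch below is our own rendering of these storage-layer notions, not part of the Dexter specification itself; the names follow the model's terminology, but the concrete representation (and the dictionary used to stand in for a hypertext) is assumed for illustration only.

    from dataclasses import dataclass, field

    @dataclass
    class Anchor:
        anchor_id: str        # fixed for the anchor's lifetime
        value: object         # opaque to the storage layer; interpreted within-component

    @dataclass
    class Component:
        uid: str
        content: object
        anchors: dict = field(default_factory=dict)   # anchor_id -> Anchor

    @dataclass(frozen=True)
    class Specifier:
        component_uid: str
        anchor_id: str
        direction: str = "TO"                          # e.g. FROM / TO / BIDIRECT

    @dataclass
    class Link:
        specifiers: tuple                              # two or more Specifier objects

    def links_to_anchor(hypertext, component, anchor):
        """Links whose specifiers name this (component, anchor) pair."""
        return [link for link in hypertext["links"]
                if any(s.component_uid == component.uid and
                       s.anchor_id == anchor.anchor_id
                       for s in link.specifiers)]

    def links_to(hypertext, component_uid):
        """Links that resolve to the given component, regardless of anchor."""
        return [link for link in hypertext["links"]
                if any(s.component_uid == component_uid for s in link.specifiers)]

ModifyComponent is the operation that must consult such link sets before committing a change, which is precisely the hook that the detection and repair algorithms of chapter 5 implement for XML documents.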
Although not the major focus of the Dexter model, we briefly outline the functionality of the runtime layer. This layer is responsible for presenting the nodes and links of the hypertext to the user, facilitating browsing and authoring of the hypertext. It is through this layer that the user instantiates components of the hypertext, making cached copies of those components currently in use during that particular session. In particular, the runtime layer is responsible for making the presence of anchors and links within the hypertext known to the user, and controlling traversal of those links. It is also responsible for handling session-related information, such as a history of the components visited, and providing runtime versions of the storage layer's resolver function. Finally, it provides the means through which components are instantiated (created from their storage layer models) and written back into the storage layer, a function known as RealizeEdits.

Applications of the Dexter Model

At the time of the development of the Dexter Model, the editors of the reference state that "The model as currently stated is far more powerful than any existing hypertext system. The provisions for n-ary links and for composite components, for example, are intended to accommodate the design of future hypertext systems." The design of the linking and addressing specifications for the second generation of the WWW, proposed in coordination with the XML specification, fulfills some of the advanced design goals of the Dexter system. However, some aspects of the Dexter model have been left unimplemented by the WWW. In particular, the link consistency requirements stated by the model are unfulfilled, a deficiency this dissertation addresses.

3.2 Intermedia

The Intermedia system [29] is significant because it is one of the first hypermedia systems to recognize the importance of two hyperlinking concepts. Of these concepts, of particular importance to this dissertation is the idea that link anchors should be constructed as "any entity that the user can select in that particular application." Of course, if link anchors are to be allowed to be specified in this general a way, there must also be some way of identifying the portion of the resource that the user selected. In the Intermedia system, it is the responsibility of each individual editing/browsing application to implement this function. Since the data needed to identify an anchor
The Intermedia system means of accomplishing this “universal linking” is through integration Of linking services into applications through the use Of an Object-oriented programming framework [30]. Link creation is performed through menu Operations “Start Link” and “Complete Link”, a metaphor quite similar to the “Edit/Paste” Operations users Of graphical user interfaces are quite familiar with. Most Of the details Of creating links and the initiation Of following links is handled by the application framework. This, Of course, limits the degree Of “Openness” related to linking present in the system, since it is only those applications that have been derived from the application framework that can participate in linking. The development Of the Intermedia project ended in 1990, and as a result a study [30] was performed which evaluated the usefulness Of various linking capabilities such as bi-directional and multi-ended links. This study established their importance as necessary for any hypertext system in which effective work can be performed. 21 3.3 Hytime The HyTime standard [5] is an International Standards Organization (ISO) docu- ment outlining a standard way Of describing the structure Of time-based hypermedia documents [31]. AS such, it can be used tO implement hypermedia systems that are made Of much more than simple hypertext documents. By using just a subset Of the facilities HyTime provides, however, the standard can be used to describe a hypertext facility. Linking in HyTime HyTime provides a linking model that includes two different kinds Of linking con- structs: contextual links (clinks)2 and independent links (ilinks). A contextual link is used to provide the type Of link that users Of the WWW are familiar with. In this type Of link, one of the endpoints Of the link is the actual linking element. The link asserts a relationship between the linking element itself and the element it references. A normal use for this type of link is as a cross-reference. For example, a link might be created between a term in a document and the definition Of that term in a dictionary document. An independent link is a link that connects more than two locations, is stored in a document other than the documents it is connecting, or both. One use Of this type of link might be tO connect an article tO a series Of reviews Of that article by different 2It is important to note that the use Of this term is quite different than the notion Of contextually broken links that we introduced in chapter 2 22 reviewers. Or, an ilink might be created tO connect a specific word in a document to entries in a dictionary, a thesaurus, and an encyclopedia. Links that are located in a resource other than the resources connected by the link are called out of line links. These types Of links are useful when the resources tO be connected are not owned by the entity wishing to create the link, and are therefore read—only to that entity. For example, suppose an instructor uses a particular on-line text for an Operating systems course. A student in the course wishes to create links from the textbook used in the course to several different online Operating systems texts, so that the student would be able to examine several different discussions of the same topic. Since the student most likely does not have write permission on the course textbook document, it would be impossible tO add the link directly into the document. However, using out-Of-line links the student could create a separate docu— ment containing the links. 
While reading the course textbook, the hypertext system is responsible for determining which Of the nodes in textbook have links pointing to other textbooks. Addressing in HyTime HyTime provides three different ways Of addressing sub-resources within a particular resource. The three types Of addressing include addressing a node by name (using an ID attribute), specifying a series Of counting terms that specify the location Of the sub-resource, and through a general-purpose querying facility. 23 In general, specifying a sub-resource by name is similar to ID/IDREF mechanism provided by SGML. It is also possible to use this type Of addressing to locate a named entity. An entity is a peice Of data that can be referenced using a name explicitly declared for it. The entity itself may be something as simple as an abbreviation for a literal string, or may represent an external data object such as a file Of SGML or non-SGML data [19]. When using counting to find a sub-resource, HyTime provides an extremely general counting mechanism. A counting locator in HyTime is called a dataloc. A dataloc divides information content into components (called quanta), and then counts the quanta [31]. Different means Of breaking the content into components are possible. For example, one quantity that might be counted would be the bytes Of the digital representation Of the document. Another quantum might be time divisions in a digital representation Of a video recording. Or, the divisions might be the markup tags indicating the elements Of an SGML document. A specific type Of dataloc called a treeloc is of particular interest. A treeloc is used tO Specify a traversal Of a tree representation Of a resource. HyTime does not specify that resources addressed using a treeloc must be SGML documents. Any resource that can be represented using a tree can be addressed using a treeloc locator. However, they are particularly useful for SGML documents since all SGML documents can be conveniently represented as a tree. The treeloc element specifies a traversal that starts at some location source, and follows a specific path down the tree. A second type Of tree location term is a relloc, which is a more general form of treeloc that also allows traversal up the tree, and also 24 across the tree. It does this through the specification of one of five types Of traversal as part Of the relloc locator. These five types Of traversal are anc, esib, ysib, parent and children, which specify the ancestors, elder Siblings, younger siblings, parent, and children Of the location source, respectively. A further type Of counting locator is the fcsloc, which is used to specify some location using a coordinate system. This type Of locator is useful for identifying portions of graphical images, for example. The final category Of locators that HyTime provides are query locators. This allows the specification Of a query, in some query language, that is executed tO find the desired sub-resource. The querying facility Of HyTime is perhaps the most robust, but coming up with a query to identify a particular sub-resource is difficult, since computer understanding of text is still not very reliable. 3.4 Hyper-G The designers Of the Hyper-G network information system see the WWW as a first generation attempt at a world-wide distributed information service [32]. 
They recognize several problems with the current state of affairs on the WWW that they intend to address [33]:

- Provide orientational and navigational aids
- Provide automatic structuring and maintenance
- Reduce fragmentation across servers
- Support user identification and access control
- Support multilinguality
- Maintain interoperability with existing systems

Hyper-G addresses each of the fundamental shortcomings of the WWW. Like the WWW, Hyper-G is a client-server architecture. One of the fundamental differences between the WWW and the Hyper-G system is that in Hyper-G, the underlying mechanism used to store information within a hypertext is an object-oriented database, as opposed to the WWW where the storage mechanism is (usually) a traditional file system. Placing all information available to the system under control of the Hyper-G server makes the maintenance of link integrity much simpler, since the server is able to determine which other information objects will be affected by a change, and can notify those objects of changes necessary to maintain consistency. Change notifications are made through a scalable flooding algorithm [34] so that all other servers are kept up to date. However, perhaps the major disadvantage of the Hyper-G approach is that

documents have to be imported into the environment from the development environment, they have to be translated into a format for which a Hyper-G viewer has been written, and more generally that they have to be owned by the database, i.e. brought into the author's domain of authority; in other words, you can't link things you don't own (emphasis mine) [35].

Chapter 4

Related Work

4.1 Maintaining Link Integrity

Davis [9] gives an overview of the various strategies that an open hypermedia system can take when attempting to solve the broken link problem. In the following paragraphs, we give an outline of these strategies along with an analysis of the strengths and weaknesses of each approach.

4.1.1 Existential Breaks

The simplest approach to preventing broken links is to put the burden entirely on the user to repair any links that might become broken. Although this approach sounds overly simplistic, it is in fact the method used to maintain link integrity in the largest hypertext system currently in use, the WWW. Indeed, the simplicity of this approach may well have led to the rapid widespread acceptance of the WWW. However, there are at least two serious problems with this approach. The first of these problems
Periodically a process called the Web—Walk process reads the Change Log Table and either automatically makes a change made necessary by the operation, or notifies the owner Of the referencing document if the appropriate action cannot be determined automatically. The entries in the Change Log Table are made by the authors themselves, which allows them to use whatever editing tOOls they normally use tO author their documents. Unfortunately, since the logging Of changes is a manual process, dependent upon the authors doing it (and more importantly, doing it correctly), this approach is quite susceptible tO error and very difficult to dO correctly. Furthermore, there is a period Of time during which broken links exist, dependent upon the length Of time between Web-Walk’s. Another approach to dealing with the problem Of broken links (either contextual or existential breaks) is to attempt to provide a permanent name, independent Of location and structure, for the Object that is to be identified. This type Of approach 28 can be further broken down based on whO/ what is expected to maintain the mapping between the current location Of the Object and the permament name. Generally, this can be either the human owner Of the Object or a software program acting on the owner’s behalf. Persistent URL (PURL)’S [38] are one example Of the former strategy. The relation- ship between an object’s PURL and its current, location dependent name (URL) is similar tO the relationship between a host name and an IP address in the Domain Name System (DNS) [39]. A PURL server provides functionality similar to a DNS server. It takes requests for a PURL and redirects those requests to the URL that is the current location of the Object identified by the PURL. If the owner of the Object decides tO alter the location Of the Object, it is up tO the owner to send a message to the PURL server(s) in charge Of mapping that PURL. The main advantage Of providing this level Of indirection is that it does not require a document to know about the documents containing references tO it. By providing the redirection, the notification Of a change in location for a resource only needs tO occur once, and the owner Of the resource that has changed location should have permission to make this change. There are two main disadvantages to this type Of approach. The first disadvantage is the requirement Of manual update Of the link resolution service (the PURL server), since we know that a repetitive manual task is likely to be forgotten or ignored by a human Operator. The second disadvantage Of this system is that it only provides link integrity for links that point to resources that have been deemed important enough tO have been assigned a name by the author 29 of the resource. This does not allow users tO create links to sub-resources Of the the resource that the author may not have anticipated. The PURL approach centralizes the management Of the names used to identify re- sources, but still allows those resources tO be maintained in any way the user desires. Another approach that can be taken is to centralize the management Of the informa- tion stored in the hypertext system in such a way that the system is able tO easily maintain the integrity Of links. TO do SO requires that users Of the system interact with the hypertext system through Specific tools such as specialized editors and servers. 
Through the use Of these tools, the system can be made aware of any changes that are taking place that may break existing links, and update any references to links that would be broken by the modification in question. Hyper-G [6], described in section 3.4, is one such system that takes this approach. The WBObjects architecture [40] is another system that adopts this approach. The main disadvantage with this approach is the closed nature Of the system. Users do not like to be told what tOOlS to use, and Often may not be able to use the tools due to their unavailability on the specific computing platform on which they are working. 4.1 .2 Contextual Breaks The approaches mentioned above work best to deal with existential breaks. Further approaches are necessary to deal with the problem Of contextual breaks. Some of the approaches that have been proposed include versioning, link aware editing tools, “just-in-time” link repairs and using queries to specify the endpoints Of links. 30 In the versioning approach, the problem Of broken links is eliminated because when a link is created, it points to a specific version Of the referenced resource. If that resource is subsequently changed, it is not replaced by the new version Of the resource; instead the new version and the Old version CO-exist. Since the link refers to the Old version Of the document, it will not be broken by the introduction Of the new version. Of course, in a versioning approach, the amount Of storage Space required for the resources stored in the hypertext system increases in proportion to the number of times the resources are changed. In addition, the risk Of users editing or referring tO Old versions Of the resources is present. Davis mentions the “diff” approach, which could be used instead Of maintaining separate versions Of the document. In this system, the difference between an original version Of a document and a modification Of the document is computed. Then a “just- in-time” repair method can be used which uses the difference file tO resolve a link that is known to be broken. Implementations Of diff exist that can be used to create a byte- by-byte difference file, which would be useful if links use byte addresses tO specify link anchors. Byte addresses are one Of the least robust means Of identifying link anchors, which means that a large fraction of the links might be broken by a minor edit to the document. Furthermore, they are very unintuitive and difficult for users tO work with [31]. When using structural addressing mechanisms, a difference algorithm that is able tO Operate on tree-based structures is needed. In [41], the authors give an algorithm that can do so in 0(nd) time. However, even with efficient algorithms that compute a useful difference, determining the difference between two documents involves more 31 work than really needs to be performed, since we are only interested in the changes to the parts Of the document that are currently the endpoints Of links. More recently, in [42], Phelps and Bilensky describe a system using what they call )7 “Robust Locations. A robust location consists Of a series Of location methods, in- cluding a core set Of methods made up of unique ID’s, tree-walks and context. Further sets Of location methods can be added by implementors if so desired. Robust locations are complex enough that the authors suggest they always be machine generated. This method requires that repair Of broken links be left up to the client, using the information encoded in the robust location. 
They propose a specific repair algorithm that attempts to take advantage of the tree-walk portion of the location and does a "spiralling" search of the content around the original location, attempting to find the originally intended content. If the tree-walk portion fails according to some predefined threshold value, a string-matching algorithm is used to try to match the context of the link.

A main point in which this dissertation differs from the Robust Location proposal is in where the computation providing link integrity is performed. Phelps and Wilensky argue that reattachment should be performed by the browser. This requires significant new implementation by the browser, and also means that it is impossible for an organization to guarantee that all users will be prevented from following a broken link. Instead, only those users that use a reattachment-capable browser will be guaranteed contextual link integrity. It is our opinion that allowing the server to perform link reattachment is a better option for information providers. Once the server implements link maintenance procedures, any user browsing that server's documents will be ensured that links pointing to that server will be correct.

Using queries to specify the anchors of links seems like a reasonable approach, and in some cases is a very good approach. For example, if one is creating a link to a list of items in a catalog that fall into a certain price range, a query mechanism seems perfectly reasonable. When links are created based on the content of the resource being linked to, rather than being based on attributes of that resource, however, querying is not really a possibility, since it is difficult to construct a query that uniquely identifies the desired content (and only the desired content).

Another useful classification of the various approaches to maintaining link integrity is one that determines how "open" the solution is. That is, to what degree is the user of such a system able to interact with the system in any way that he/she desires? It is our goal to provide a system that provides a high degree of correctness in an environment where the user is completely unaware of the existence of the link integrity problem. Furthermore, if we require the user to accept a small amount of responsibility, then we should be able to guarantee the integrity of links in all cases. Davis provides a categorization which classifies integrity maintenance approaches into the following classes:

• Don't Bother: In this approach, the user makes links, and if the links break it is determined not to be that important.

• Avoid the Problem: Links are expressed declaratively or in terms of an algorithm that will hopefully resolve to the intended anchor.

• Loosely Coupled: The system itself provides no means of guaranteeing link integrity, but it provides a set of tools that the user can use to do so. If the user chooses not to use the tools, link integrity is compromised.

• Automatic Link Repairs: The system knows that links can be broken, and takes responsibility for detecting and repairing broken links.

• Tightly Coupled: The most restrictive class; systems that adopt this approach integrate link management responsibility with authoring software and servers.

4.2 Our Approach

Our approach to maintaining link integrity combines many of the concepts mentioned in the above work, while attempting to provide solutions to many of their disadvantages.
In Davis' categorization scheme described above, our approach falls somewhere between "Loosely Coupled" and "Automatic Link Repairs." We hope to be able to provide automatic link repairs that are guaranteed to work if the user uses a (minimal) set of tools to notify the system about the existence of links and inform it of changes to the documents it manages. When the user neglects the responsibility of using the tools, the system should be able to deal gracefully with the situation and still provide link integrity with a high probability.

We have developed algorithms that can efficiently detect and repair broken links. These algorithms require knowledge of the links pointing into a document and prior notification when a document is about to be replaced with a new version; they are described in detail in chapter 5. We have additionally developed a communication protocol and WWW agent service that can be used to provide a "link mapping" service so that existing web servers can take advantage of the algorithms described above in order to provide link integrity.

One thing that we are not proposing to deal with is the possibility that the content specified by a specific link may have been modified in some way, or perhaps even deleted. Strategies for solving this problem would involve devising some sort of semantic distance metric that could be used to measure the distance between a portion of the content in a modified version of a document and the original content, and then deciding whether any portions of the modified document are "close enough" to be considered the original content. Indeed, the recent "robust locations" work cited previously falls into this category. While this is certainly possible, we do not consider it in this dissertation.

Recently, many systems and methods have been proposed for implementing "open hyperlinking", which allows the users of a hypermedia system to create links between information of widely varying types (for example, DLS [43]). Bouvin provides a good summary of these approaches in [25]. Providing guaranteed link integrity in systems where the endpoints of links can be arbitrary types of information is a more ambitious goal than this dissertation will address.

Chapter 5

Finding and Repairing Broken Links

This chapter outlines algorithms that efficiently detect which links will be broken by a document modification. Furthermore, once it is determined that a link is broken, we describe algorithms that can locate the referenced node in the modified document in some restricted cases.

5.1 Detecting Broken Links

Definitions

H(N): the height of the sub-tree of the document rooted at node N, where height is defined in the traditional sense.

ES(N): an encoding of the element structure of the sub-tree of the document rooted at node N.

Link target: a node in a document tree which is directly pointed to by a link.

Contained links: a link L1 is contained by a link L2 if the link target of L1 is a descendant of the link target of L2.

Document tree: the tree representation of an XML document. The representation we will use is the Document Object Model (DOM). We give a simple description of the DOM here; for a more complete description, see the standard [44]. In the DOM, a document tree is made up of several different types of nodes, with the most significant types for this discussion being the element and text nodes.
Element nodes represent the element markup within the document; text nodes encapsulate the document content contained by the elements and by definition are leaf nodes. Note that all document content is contained by some element.

Link tree: a tree whose nodes are the set of nodes that are targets of links for a particular document. The nodes of this tree are a subset of the nodes in the document tree representing that document. The link tree has an artificial root node as its root, which is not a node in the document tree. This is necessary in order to guarantee that the link tree is a tree.

Node equivalency: node comparison is an overloaded operation. Two element nodes are considered equal if their tag names are the same and they have the same number of children. Two text nodes are considered equal if a string comparison of the text contained in the text nodes shows them to be equal. We will use the symbol ≡ to denote the node equivalency test.

5.1.1 Increasing Detection Efficiency

The brute-force method of detecting broken links involves checking each of the links to the original document and deciding if the content of the node(s) referenced by the link has changed in the modified document. However, it may be the case that hundreds or even thousands of links to the document exist. A sequential walk through the link list, checking each link independently, could take a substantial amount of time. We can certainly do better than this due to the hierarchical structure of XML documents.

One way of reducing the amount of work required to detect broken links is structuring the order in which the links are examined. When there are a large number of links to a document, it is likely that some of the links will contain other links. Since a containing link should be considered broken if any of its contained links are broken, we can reduce the amount of work done by the link detection algorithm by checking broken links in a "bottom-up" fashion.

Consider the example shown in figure 5.1. In figure 5.1(a), we see a tree representing the original version of a document. Figure 5.1(b) shows the document after one node has been modified, and 3 new nodes have been inserted.

Figure 5.1: A document and its links before and after modification (part (a) shows the original document, part (b) the modified document; the key marks the modified node and the newly inserted nodes)

Figure 5.2 illustrates the nodes that would be checked by a "brute-force" algorithm that compares the content of each explicit target in the original tree to the corresponding nodes in the new document tree.

Figure 5.2: Nodes examined by the brute-force algorithm (the key distinguishes the nodes needed to determine whether links have been broken from the nodes compared by the brute-force algorithm)

We can see that this algorithm clearly does more work than is necessary, due to the hierarchical nature of XML document content. If the first comparison performed by a broken link detection algorithm was to determine if link 3 is broken, we could avoid comparing nodes 2, 5, 6, 7 and 12 altogether, since the target of link 3 is the second child of node 6, and that node is an implicit target of link 1.

In a large document, structuring the link database so that links are checked in a "bottom-up" fashion can eliminate large amounts of work that would otherwise be spent comparing nodes unnecessarily. This link structuring is the fundamental idea in our broken link detection algorithm.
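To make the node equivalency test concrete, the following minimal Java sketch applies the definition above to two DOM nodes. It is only an illustration, not the implementation evaluated later in this chapter (which is built on the SAX interface); the class and method names are ours.

    import org.w3c.dom.Node;

    public final class NodeEquivalence {

        // Returns true if a and b are equivalent under the definition in section 5.1:
        // element nodes must agree on tag name and number of children, text nodes on
        // their character content.
        public static boolean equivalent(Node a, Node b) {
            if (a.getNodeType() != b.getNodeType()) {
                return false;
            }
            switch (a.getNodeType()) {
                case Node.ELEMENT_NODE:
                    return a.getNodeName().equals(b.getNodeName())
                            && a.getChildNodes().getLength() == b.getChildNodes().getLength();
                case Node.TEXT_NODE:
                    return a.getNodeValue().equals(b.getNodeValue());
                default:
                    // Other node types (comments, processing instructions) play no role
                    // in the algorithms of this chapter.
                    return true;
            }
        }
    }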
5.1.2 The Algorithm

Let Tree_old and Tree_new be the tree structures representing the old version of a document and a new version of the document which is about to replace the old version, respectively. Let Tree_links be a link tree for Tree_old. The links in Tree_links are structured in such a way that a node child is a child of a node parent if parent contains child (we give an algorithm for creating Tree_links in algorithm 5.3). Our goal is to determine which of the links in Tree_links should be marked as modified, where a link is considered modified if the explicit target of the link or any of its descendants are not equivalent in the new tree structure. We wish to do this in such a way as to perform as few node comparisons as possible.

Algorithm 5.1 is used to decide if the target of a link has been modified. This algorithm is used by algorithm 5.2 to examine the targets of each link in Tree_links. At the conclusion of algorithm 5.2, all of the explicit targets have been correctly marked as either modified or unmodified.

Algorithm 5.1 Comparing a node
/* Input: node is the node to be compared.
   Returns: true if node has been modified, false otherwise.
   Tree_old(node) ≡ Tree_new(node) means to apply the equivalency operator to the
   nodes in the same positions in Tree_old and Tree_new. */
if node is already marked then
    return marking of node
end if
set is_modified = not (Tree_old(node) ≡ Tree_new(node))
if node is not modified then
    for all descendants of node do
        if not (Tree_old(current_descendant) ≡ Tree_new(current_descendant)) then
            set is_modified = true
            break out of FOR ALL loop
        end if
    end for
end if
return is_modified

Algorithm 5.2 Determining which links have been broken
for all nodes in a post-order traversal of Tree_links do
    if node_current is unmarked then
        use algorithm 5.1 to determine if node_current is modified
        if node_current is modified then
            set temp = node_current
            while parent(temp) exists do
                mark temp as modified
                set temp = parent(temp)
            end while
        else
            mark node_current as not modified
        end if
    end if
end for

Algorithm 5.3 place_node
/* Inserts a new node new_node into the subtree rooted at parent_node. */
for all children of parent_node do
    if new_node is a descendant of child_current then
        place_node(child_current, new_node)
        return
    end if
end for
insert new_node as first child of parent_node
for all siblings of new_node do
    if current_sibling is a descendant of new_node then
        remove current_sibling from parent_node's child list
        insert current_sibling into new_node's child list
    end if
end for

5.2 Analysis

We claim that algorithm 5.2 is an optimal solution to the problem of determining which links of a set will be broken by a document modification. Our definition of optimality consists of the following two criteria. First, an optimal broken link detection algorithm must compare any given node in the document tree at most once. Secondly, when determining whether a particular target is modified, an optimal broken link detection algorithm will compare at most one modified node in the subtree rooted at the target.

Algorithm 5.2 has the optimality properties described above. For the first property, we see that nodes are marked at the time they are compared. If a node is already marked, it will not be compared again (the initial test in algorithm 5.1). For the second property, the post-order traversal of Tree_links guarantees that any links contained by a link will have already been visited at the time that link is visited. If any descendants of those contained links are modified, then the link will have already been marked as modified as a result of the ancestor-marking loop in algorithm 5.2. If the link does not contain any other links, then the early exit from the descendant scan in algorithm 5.1 guarantees that no other descendants of the link will be visited after the first modified descendant is visited.
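The following Java sketch shows one way the two algorithms above could be realized together. It is a sketch only: it reuses the NodeEquivalence helper from section 5.1, it assumes the link tree (algorithm 5.3) has already been built by the caller, and it assumes a caller-supplied function that maps a node of Tree_old to the node in the same position of Tree_new; none of these names come from the dissertation's actual implementation.

    import org.w3c.dom.Node;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;

    public final class BrokenLinkDetector {

        // One node of the link tree: an explicit link target plus the links it contains.
        public static final class LinkNode {
            final Node target;                      // the target node in Tree_old
            final List<LinkNode> children = new ArrayList<>();
            LinkNode parent;                        // null for the artificial root
            Boolean modified;                       // null while the link is unmarked
            public LinkNode(Node target) { this.target = target; }
        }

        private final Function<Node, Node> samePositionInNewTree;
        private final Map<Node, Boolean> marks = new HashMap<>();

        public BrokenLinkDetector(Function<Node, Node> samePositionInNewTree) {
            this.samePositionInNewTree = samePositionInNewTree;
        }

        // Algorithm 5.2: post-order walk over the link tree, marking broken links and
        // propagating the mark to every containing link.
        public void markBrokenLinks(LinkNode artificialRoot) {
            for (LinkNode child : artificialRoot.children) {
                visit(child);
            }
        }

        private void visit(LinkNode link) {
            for (LinkNode contained : link.children) {
                visit(contained);                   // contained links are checked first
            }
            if (link.modified == null) {
                if (subtreeModified(link.target)) {
                    for (LinkNode l = link; l != null; l = l.parent) {
                        l.modified = Boolean.TRUE;  // a broken contained link breaks its containers
                    }
                } else {
                    link.modified = Boolean.FALSE;
                }
            }
        }

        // Algorithm 5.1: a node is modified if it, or any descendant, fails the
        // equivalency test against the node in the same position of the new tree.
        // Results are memoized so that no document node is ever compared twice.
        private boolean subtreeModified(Node oldNode) {
            Boolean known = marks.get(oldNode);
            if (known != null) {
                return known;
            }
            boolean changed =
                    !NodeEquivalence.equivalent(oldNode, samePositionInNewTree.apply(oldNode));
            for (Node c = oldNode.getFirstChild(); c != null && !changed; c = c.getNextSibling()) {
                changed = subtreeModified(c);       // stops at the first modified descendant
            }
            marks.put(oldNode, changed);
            return changed;
        }
    }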
5.3 Repairing Broken Links

For each of the links marked broken in algorithm 5.2, we must find a node in the modified document tree that corresponds to the original target node. This problem is difficult if arbitrary document modifications are allowed. For example, if changes to the element structure of a linked-to node or changes to the actual content of the resource are made, it will not be possible to find an exact match in the modified document. However, in certain restricted circumstances, the problem is not as hard, and we can devise efficient algorithms to solve it.

Suppose a link L points to a node N1 in document D1. Editing D1 in some way produces document D2. If the following three conditions are met, then it is a fairly trivial matter to produce a new link L' which points to a node N1' whose element structure and content are the same as the structure and content of N1:

1. The element structure of N1 is not changed.
2. The content of N1 is not changed.
3. The XPointer for link L is not a spanning XPointer.

One way of thinking of these restrictions is that the referenced sub-tree has been "cut" from its original location in the original document and "pasted" in its entirety into a new location in the modified document. Figure 5.3 shows examples of edits that do and do not satisfy these restrictions.

A simple brute-force algorithm for finding N1' would be to examine each node in D2, comparing its content and element structure to that of N1. However, given the assumption that the element structure has not changed, many of the nodes of D2 cannot be the node we are seeking. Only nodes N for which the sub-tree rooted at N has the same height as the node referenced by L could be the node we are looking for. Using this knowledge, we greatly reduce the number of nodes that are considered candidates for being the new target node. We can further prune this set by comparing the element structure of the sub-tree rooted at each node in the set to the structure of the original target N1. If the element structure is not the same, we can remove the node from the set. Once we have pruned the set as much as possible, we can compare the element content of each of the nodes in the set to the content of the referenced resource, and return the set of nodes that match.

To implement this algorithm, we maintain a couple of data structures as the modified document D2 is parsed. First, for each sub-tree height H we keep a list of nodes in D2 for which the sub-tree height of the node is H. Second, an encoding of the sub-tree structure for each element is maintained to allow us to quickly compare the structure of two elements. When we discover a broken link, we use algorithm 5.4 to search for a new target node in the modified document tree.

We can further optimize this algorithm by maintaining a pointer into each set of nodes indicating where in the set the last search completed. The next time this set is searched, the search will start at the pointer rather than at the beginning of the set. This is helpful if link targets that were close to each other in the original document tree remain close in the modified document tree. This will often be the case, such as when a common ancestor is moved from one location to another.
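As a rough illustration of these two data structures, the Java sketch below groups the elements of the modified document by sub-tree height and records a simple structure encoding for each element. The concrete encoding used here (nested tag names) is our own assumption; the dissertation leaves the exact form of ES(N) open, and text content is compared in a separate step.

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public final class CandidateIndex {

        private final Map<Integer, List<Element>> nodesByHeight = new HashMap<>();
        private final Map<Element, String> structureEncoding = new HashMap<>();

        // Indexes every element of the modified document D2; returns the root's height.
        public int index(Document d2) {
            return visit(d2.getDocumentElement());
        }

        private int visit(Element e) {
            int height = 0;
            StringBuilder encoding = new StringBuilder(e.getTagName()).append('(');
            for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() == Node.ELEMENT_NODE) {
                    Element child = (Element) c;
                    height = Math.max(height, visit(child) + 1);
                    encoding.append(structureEncoding.get(child));
                }
            }
            encoding.append(')');
            structureEncoding.put(e, encoding.toString());
            nodesByHeight.computeIfAbsent(height, h -> new ArrayList<>()).add(e);
            return height;
        }

        // Candidate new targets for a broken link whose original target had height h.
        public List<Element> candidatesOfHeight(int h) {
            return nodesByHeight.getOrDefault(h, Collections.emptyList());
        }

        public String encodingOf(Element e) {
            return structureEncoding.get(e);
        }
    }

Algorithm 5.4 below then needs to examine only candidatesOfHeight(H(T)), discarding any candidate whose encoding differs from ES(T) before comparing content.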
Figure 5.3: Types of document modifications that can and cannot be repaired (legal modifications versus an illegal modification; the key marks the link target)

Algorithm 5.4 Finding a new target node
let H = height of the original target node T
set C = set of nodes in the modified document tree of height H
for each node N in C do
    if ES(N) = ES(T) then
        if Content(N) = Content(T) then
            N is the new target node; return its locator
        end if
    end if
end for

5.4 Evaluation and Testing

We have shown in section 5.1.2 that the broken link detection algorithm described in algorithm 5.2 is optimal. We now wish to evaluate the link repair algorithm given in algorithm 5.4.

5.4.1 Test Documents

Finding good documents to use as test cases was difficult because the XML and XPointer specifications have existed for only a short time. We chose to use two sets of documents for our testing. The first document set consists of a series of randomly generated documents. For the second document, we have chosen to use the XML version of the New Testament provided by Jon Bosak [45].

A major factor in the success of our link repair algorithm is the distribution of the heights of nodes in the tree. Suppose we determine that a link L1 has indeed been broken and we wish to find the target of L1 in the modified document. If a large majority of the nodes have the same height, we will have to search most of the tree in order to determine a new link to replace L1. On the other hand, if the heights of the nodes in the tree are more evenly distributed, we will see considerable improvement in the performance of our algorithms.

5.4.2 Generating Random Documents

Since we decided to generate documents randomly, we needed to have some idea of what reasonable distributions for the variables described above are. Due to the lack of a large body of XML documents to use, we chose to examine HTML documents from our department and college's web sites and attempt to find a model that characterized the element height distribution from these sample documents. There were approximately 8800 documents in the collection that we looked at. For each of these documents, we calculated the percentage of nodes at each height, from 0 up to the height of the root node of the document. Figure 5.4 shows the distribution of element heights obtained with outliers discarded. The data suggests that the distribution of element heights follows a hyperexponential distribution. The hyperexponential distribution is similar to the exponential distribution except that there is more area in the tails of the distribution.

Since HTML does not provide a method of pointing to arbitrary elements within the document, we decided a uniform distribution of link targets would be appropriate. That is, every element in the tree is equally likely to be chosen as a link target. Given that we have no empirical evidence to support any other model, this model seems reasonable.

The final task in using random documents as test cases is modeling changes to the documents. We allow two basic types of document edits: insertions and deletions. We have chosen to follow the model of [41] in defining these operations.
Again, without any empirical evidence to guide us, we have chosen to treat each of these operations as equally likely, and also to consider each node in the tree equally likely to be the target of an editing operation.

Figure 5.4: Element height distributions (the measured element height distribution plotted against a fitted hyperexponential distribution)

5.4.3 The New Testament Document

This document is an XML marked-up copy of the New Testament. It is approximately 1.1 MB in size, and contains close to 25,000 elements. For this document, rather than choosing links randomly, we have chosen to use a set of 270 links taken from the Life Application Bible [46]. These links connect passages in the first four books of the New Testament that relate similar events. This set of links could very easily be found in a document whose goal is to show the commonalities between these first four books. In order to come up with a modified document, we chose to rearrange the books in alphabetical order. The content of the document itself has not been changed.

5.4.4 Testing Environment

We have implemented the link repair algorithm described in section 5.3 and tested it. Due to the availability of tools such as XML parsers, the algorithms were implemented in the Java programming language. The current work seeks only to evaluate the effectiveness of our pruning algorithms. The algorithms may be implemented in a more efficient programming language as part of future work. Our implementation uses the SAX parser interface [47], and so can be used with any SAX-compliant parser.

We performed our evaluations on a Sun Ultra-2 with two 300 MHz CPUs and 512 MB of RAM, running Solaris 2.6. Since measuring absolute performance is not our goal, the choice of hardware is not critical. This machine is used as a compute server, so it is difficult to know exactly what other activity was present at the time the tests were run. The machine usually has about 200 MB of free RAM at any given moment, however, so it is likely that our experiments had plenty of room in which to work.

5.4.5 Performance on the New Testament Document

After running some preliminary performance evaluations, we discovered that this set of links and documents is essentially a worst-case data set for input to our algorithms. This is due to the fact that the mark-up of the document follows a highly regular pattern. Every verse is marked up in exactly the same way, as is every chapter and every book. All of the links in the link set point at leaf nodes. Each verse is located at exactly the same height in the parse tree, and possesses exactly the same encoding of the element structure. Furthermore, a disproportionate percentage (approximately 85%) of the nodes are leaf nodes.

To more accurately evaluate our algorithms, we generated random test documents in accordance with the distributions described in section 5.4.2. We chose to generate documents of three different sizes in terms of the number of elements in the document. The small document category was made up of documents that contain approximately 100 elements; the medium document category was made up of documents containing approximately 1,000 elements, and the large document category was made up of documents containing approximately 5,000 elements. We randomly chose places in the generated documents to insert/delete elements, and varied the number of modifications made.
We also varied the percentage of elements in the document that were the target of links from 5% to 50%. Table 5.1 shows performance statistics for each of the test cases. These statistics are average values over all numbers of links. These results show that our algorithms achieve good performance for small documents, and become increasingly efficient as the size of the document increases.

Class     Number of Modifications     Average Elements Searched
Small              5                              8
Small             10                              9
Medium             5                             62
Medium            10                             64
Large              5                            188
Large             10                            213

Table 5.1: Performance of Algorithm 5.4

5.4.6 Summary

In this chapter, we have described algorithms that can be used to efficiently detect link breakage when document modifications take place. In some circumstances the algorithms are able to repair the broken links quickly by searching the modified document for the referenced content and calculating a new link. An advantage of the algorithms is that they do not require authors to provide explicit information about how a document has been modified. To our knowledge, these algorithms are the only solutions that can repair broken links automatically.

We have shown that the broken link detection algorithm is optimally efficient according to the conditions we set forth. These optimality conditions are reasonable for any implementable (i.e., non-oracle) algorithm. Our link repair algorithms have shown that by restricting the search for new target nodes based on the target sub-tree height and element structure, we can significantly reduce the number of nodes in the modified document that must be searched to find the target.

In chapter 6, we discuss the design and implementation of a software system that utilizes the algorithms described in this chapter. Using this system, we can achieve our goal of providing link integrity automatically.

Chapter 6

A Software Agent to Provide Link Integrity

6.1 Introduction

In chapter 5, we described algorithms that can be used to compute a new locator efficiently to repair a contextually broken link. In this chapter, we describe a web proxy service utilizing those algorithms to ensure that, at all times, a user agent receives the correct data as the result of following a hyperlink. In order to provide this guarantee, our system makes use of an agent, which is a particular implementation of the general WWW idea of a proxy server [48]. Each web server wishing to ensure referential integrity of links pointing into its document space must be serviced by an agent. Agents communicate through a protocol we have developed, allowing them to notify each other when links have been created, deleted or broken.

An outline of the chapter is as follows. First, we describe the requirements for the agent and how we have addressed them in our prototype. Next we describe the agent communication protocol in detail, and describe its operation in the presence of unreliable communication networks. We then analyze the resource requirements and performance effects of adding our system to the current web infrastructure. Finally, we summarize the chapter and give conclusions.

The necessity of inter-agent communication is motivated by Mendelzon and Milo [26], in which the authors give a proof of why it is not possible to compute the answer to the query "Are there any documents pointing to document D?" (or equivalently, "Find all the documents pointing to document D"). In order to guarantee 100% link consistency, it is necessary to require some amount of cooperation among agents.
6.2 Overview of the Agent Proxy

For clarity's sake, we define some terminology. Let S be a web server, addressable using the URL U. We say that S's document space is the set of all documents whose URL is of the form U/path, where path is a string locating a document on server S.

The agent has two distinct functions. The first is its role as a Hypertext Transfer Protocol (HTTP) proxy. In this role, it must accept incoming HTTP requests, forward them to the server for which they are destined and relay the response from the destination server back to the user agent (typically a browser) that initiated the request.

The agent's second role is to maintain a database with information about all links that point into the web server's document space, and to make sure that incoming requests following those links locate the originally designated part of the document. The agent is able to perform this role because, in its role as a proxy, it will receive and examine all requests made to servers it is responsible for, allowing it to redirect requests generated by following a link the agent knows to be broken.

In order to allow the agents to work with previously existing web servers, the agent must be configured to listen on the same IP address and Transmission Control Protocol (TCP) port the web server used. Of course, this means that the address and port number of the server must be changed. Since the agent is acting as a proxy, the change of addresses will be transparent to the outside world.

In our current implementation, there are actually two processes involved in implementing the functionality described above. One we call the lightweight agent, while the other is the agent proper. The lightweight agent fulfills the role of the HTTP proxy, and the agent is responsible for the link maintenance functions. Agents and lightweight agents communicate with each other through the use of the Java Remote Method Invocation (RMI) distributed processing architecture [49, 50]. While it is necessary to have one lightweight agent for each web server (since each server requires a unique IP address and TCP port), an agent can actually service multiple web servers.

The lightweight agent receives HTTP and document maintenance requests, and forwards the document maintenance requests to the agent process responsible for that server. In addition to the request, the lightweight agent includes information about which server the request pertains to so that the agent can act accordingly. Figure 6.1 gives an illustration of the system architecture.

Figure 6.1: System architecture for agent, lightweight agent and web servers (lightweight agents front the web servers, speaking HTTP to clients and RMI to the agent)
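The dissertation does not list the actual RMI interfaces, but a minimal sketch of the remote interface a lightweight agent might use to reach its agent could look as follows; the interface and method names are purely illustrative.

    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // Illustrative remote interface between a lightweight agent and the agent proper.
    public interface LinkAgent extends Remote {

        // Forward a document check-in (new or modified content) received from an author.
        void checkIn(String serverId, String documentUrl, byte[] documentData)
                throws RemoteException;

        // Ask whether a request, identified by its referring document and locator,
        // follows a link known to be broken; returns the corrected locator or null.
        String resolveLocator(String serverId, String refererUrl, String locator)
                throws RemoteException;
    }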
In order to guarantee the contextual integrity Of L2, Agent 3 must send a message to Agent A, indicating the URL of document D2, and the locator information of L2. Suppose document D1 is now updated. We assume the entity performing the update informs Agent A of its desire to perform the update and asks Agent A to perform the update on its behalf. AgentA must determine which of the links pointing into D1 will be broken by the update. Furthermore, for each link that is broken, Agent/1 must compute a corrected locator and notify AgentB of the new locator. We must not make any assumptions about the actual amount of time it takes Agent A to contact AgentB and notify it of the change.1 Therefore, there will be a period of time when D2 will contain the old (incorrect) locator. During this time, AgentA maintains a link map containing a mapping that indicates the corrected locator that should be mapped to the broken locator. This mapping is indexed by both the combination of the referring document’s URL and the locator. During the (perhaps unbounded) period of time until AgentB fixes the broken link in D2, AgentA will intercept all HTTP requests which have a referer2 header of D2 and the incorrect locator and cause them to use the new locator information. Agent A does this by sending an HTTP redirect response (status code 302) with the correct locator to the 1 In fact, it may be the case that AgentA chooses to delay notifying Agent}; for performance reasons. 2This unfortunate misspelling is in fact part of the HTTP standard. 58 Server 82 Server 81 F—fi Document Document = Add Reference In E Check In D2 from D2 to D] E : >@————>. E E‘ % L_5_J L—z Part (a): Adding a Reference from D2 to D1 Server 82 r—-——\ ' ‘ Docggient Modified _ . Change locator Check In D a cigar] ent E Modify D2 for link L2 E = +— AgcntB E E a, \____J l" g___t2_J Reroute requests from D2 for D1 using L2 Part (b): Checking In a New Version of D1 Figure 6.2: Messages passed between agents client that made the incorrect request. The client will then re—issue its request, using the correct locator, and will receive the correct information. The requirement of knowledge of the locator means that for our system to work with current generation browsers, we cannot use the traditional ’#’ character as the separator between the URL and the fragment identifier. This is due to the fact that browsers do not send along the fragment identifier with the request. This is unfortunate, but hopefully a case can be made that the browser should send the fragment identifier along, perhaps as a separate header, in extensions to HTTP. 59 As implied in the preceding description, part of the agent’s role is to serve as a gateway for document modifications. When a document author wishes to make modifications to that document, she must send a request to the agent containing the new content of the document. The agent will then make the change when it has determined which of the links pointing at the document will be broken and how to correct them. We are explicitly not requiring that the authoring software (or the author, for that matter) maintain a record of the changes that have been made to the document; it simply sends the modified document data to the agent. While this policy may generate more network traffic than sending only the change description, it places less burden on the authoring software. By utilizing a separate “check-in” program, any currently existing authoring software can be used. 
6.3 Protocol Description

In this section, we describe the protocol used by agents to communicate with each other. We first describe the set of messages that are used, and discuss the normal behavior of the agent when a broken link is followed. We then examine issues having to do with network failures and delays.

6.3.1 Messages

AddServer (LWAAddress, LWAPort, WebServerAddress, WebServerPort)
This message indicates to an agent that it should begin servicing the web server whose lightweight agent is located at LWAAddress and listening on LWAPort. It also informs the agent of the actual address and port of the web server that is being serviced. This is necessary so that the agent can communicate with the web server to update web content using the HTTP PUT method or some similar mechanism if the agent and the web server do not share a common file system.

RemoveServer (ServerAddress, Port)
Indicates to the agent that it should no longer handle protocol messages regarding the lightweight agent located at ServerAddress and listening on port Port.

AddReference (ReferencedURL, ReferringURL, Locator, [Checksum or LastModifiedDate])
This message indicates to the receiving agent that a link has been created that points into its document space. The document containing the link is located at ReferringURL and uses locator Locator. The purpose of the remaining parameters is described in section 6.3.3.

RemoveReference (ReferencedURL, ReferringURL, Locator)
Receipt of this message by a link agent means that a previously registered link has been removed and should be removed from the agent's link database.

CheckInURL (DocumentURL, SearchList, DocumentSize, DocumentData)
This message is sent whenever either a new document with URL DocumentURL has been created, or the document has been edited. SearchList is a list of document URLs indicating a set of documents that should be used by the algorithms described in chapter 5 to perform broken link detection and repair. DocumentURL is always an implicit member of SearchList. This feature is useful for the (fairly common) editing operation of splitting a large document into several component documents.

LinkUpdate (ReferencedURL, ReferrerList, Locator, NewURL, NewLocator)
This message tells the receiving agent that all links contained in documents listed in ReferrerList pointing to ReferencedURL using locator Locator should be updated to point to NewURL and use NewLocator. Allowing a list of referring documents to be passed reduces the amount of network traffic and processing that must occur if the receiving agent has several documents in its document space that use the same address for a link.

The majority of the messages generated by the protocol come about due to the receipt of a CheckInURL message by an agent. The agent must check the new version of the document to determine if any new links exist in its content, and if so send AddReference messages to the appropriate agents. If the document is a new version of an existing document, it is also possible that links have been deleted in the new version of the document, in which case a RemoveReference message should be generated. If the agent detects that some links have been broken, it generates the appropriate LinkUpdate messages and sends them to the agents responsible for the documents containing the broken references.
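For concreteness, the two messages that carry most of the protocol's work could be represented by simple value classes like the following Java sketch; the field names mirror the parameters listed above, while the wire encoding is left out.

    import java.util.List;

    // Illustrative value classes for two of the inter-agent protocol messages.
    public final class ProtocolMessages {

        public static final class AddReference {
            public final String referencedUrl;
            public final String referringUrl;
            public final String locator;
            public final String checksumOrLastModified;   // freshness evidence (section 6.3.3)

            public AddReference(String referencedUrl, String referringUrl,
                                String locator, String checksumOrLastModified) {
                this.referencedUrl = referencedUrl;
                this.referringUrl = referringUrl;
                this.locator = locator;
                this.checksumOrLastModified = checksumOrLastModified;
            }
        }

        public static final class LinkUpdate {
            public final String referencedUrl;
            public final List<String> referrerList;
            public final String locator;
            public final String newUrl;
            public final String newLocator;

            public LinkUpdate(String referencedUrl, List<String> referrerList,
                              String locator, String newUrl, String newLocator) {
                this.referencedUrl = referencedUrl;
                this.referrerList = referrerList;
                this.locator = locator;
                this.newUrl = newUrl;
                this.newLocator = newLocator;
            }
        }
    }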
6.3.2 Deciding When to Redirect

When the lightweight agent determines that a broken link has been followed, it sends an HTTP redirect message to the user agent that followed the link, telling it the correct locator to use. For correct operation in certain situations, the lightweight agent must be careful about how it goes about doing that. Take, for example, the situation shown in figure 6.3. Here we see a referring document that contains two links into the referenced document. After the document is modified, we see that the locator originally used by link L1 is now mapped to the locator originally used by L2.

Figure 6.3: Possible ambiguity when sending a redirect message (the referring document's original links into the referenced document, and the redirected links after the referenced document is modified)

When the agent receives a request following link L1, it will respond with a redirect asking the client to follow L2 instead. When the agent receives this second request, it should NOT be redirected. Unfortunately, the agent has no way to distinguish this request from a request that is actually following L2. We must be able to determine the relationship between two requests; that is, did the request to follow L2 come about as a result of a previous request following L1? Unfortunately, due to the stateless nature of the HTTP protocol, this is impossible. While HTTP 1.1 contains a mechanism for persistent connections (that is, multiple HTTP requests from a client to a server can be sent along the same connection), either party is free to terminate the connection at any time. Therefore we cannot force both requests to be made over the same connection, which would allow us to determine causality.

The Cookie [51] mechanism allows a server to introduce state into HTTP requests and responses. By utilizing cookies, we can solve the problem described above in the following manner. When the link agent receives the request following link L1, it includes a Set-Cookie header in its redirect response. The Path field of this cookie specifies the referenced document's URL, the name of the cookie is "NoRedirect" and the value of the cookie is set to the value of L2. Finally, the age of the cookie is set to be 1 second. Setting the age in this way makes it likely that the cookie is only valid for one request by the user agent. If the link agent receives a request containing a cookie of the form described above, it knows this request should not be redirected, and so forwards the request unmodified to the web server. The link agent forwards the response from the server to the user agent, including a Set-Cookie header that invalidates the cookie generated by the original request.

Creating an association between two HTTP requests using cookies is the preferred mechanism to ensure that redirected requests are not subsequently redirected, and is the only completely correct means by which we can do so. Unfortunately, due to the potential for misuse of the information conveyed using cookies [52], some users direct their user agents not to accept cookies. If we desire, we can still do quite well without using cookies. Let us assume that we have sent a Set-Cookie header to a browser, and it has rejected this cookie.[3] We still need some way to make sure the second request is not redirected.

[3] Determining if a cookie is rejected requires two redirections: first to a URI that is specifically set up to examine if a cookie is present, and then the actual redirection to the appropriate site, depending on whether the cookie is present.
The lightweight agent must redirect the original request in such a way that it can identify the subsequent request and refrain from performing redirection on that request. The only part of the second request the agent has control over is the Request-URI. The agent has a couple of options in constructing the redirected URL:

• redirect the request to a different port number
• redirect the request using a different HTTP Host header
• change the path to the requested resource

Unfortunately, any of these methods requires sending a different URL back to the user agent that made the original request. If the user then views the resulting document and decides to create a new link to a portion of it, the URL the user sees for the document will be in the form for which redirection is restricted from occurring. This would then prevent the system from maintaining the integrity of this new link.

We must deal with this problem at a higher level than HTTP, in the link agent communication protocol. If a link agent receives an AddReference message using the restricted-redirection form of a URL, it must nack (send a negative acknowledgment for) that message, including the correct version of the URL as part of the message. By sending this nack, we ensure that links using the restricted-redirection form of the URL are not created. This of course does not prevent incorrect links from being created without an AddReference message being sent.

6.3.3 Operation in the Presence of Network Failures

In the absence of network delays or failures, the protocol described in the previous section is sufficient to guarantee link integrity. However, communication networks are not perfect, and so we must address the issue of network failures and delays. The presence of caches along the path between web clients and servers further complicates the situation. In this section, we address problems caused by unreliable networks and the presence of caches.

Lost or Delayed AddReference Messages

Suppose an AddReference message is delayed for a significant period of time.[4] It is possible for the URL being referenced to be modified during the time the AddReference message is being transmitted. This modification could possibly change the content pointed to by the locator used by the referencing document. In this case, not only will the link be broken initially, the agent protocol will perpetuate the broken link as further changes to the document are made! Figure 6.4 illustrates this problem. In order to avoid this problem, it is necessary to know either the exact content the user wishes to link to, or the last modification date of the referenced document, at the time that the user downloaded it.
Figure 6.4: Problems due to network latency and AddReference messages

[4] In fact, since network delivery takes non-zero time, this situation can occur no matter how long it takes for the message to be delivered.

One possible way of avoiding this problem using knowledge of the intended content is to compute a checksum on the data the author intends the link to reference. This checksum is then included as part of the AddReference message. On receipt of the message, the receiving agent retrieves the current version of the document and makes sure the checksum on the intended content matches the one in the message. This might be implemented by providing a simple program into which the user pastes the portion of the referenced document and which calculates the checksum; the check-in program later prompts the user for the checksum. Unfortunately, this adds a significant amount of complexity to the check-in process, and would be very tedious when checking in a document with a large number of links.

Another way of attacking this problem involves the use of the last modification date of the referenced document. A somewhat naive approach is to timestamp the AddReference message with the current date and time when it is sent. The receiving agent compares the timestamp in the message with the last modification date of the document, and if the modification date is later than the timestamp in the message, a nack is sent back, warning the user that the link might now be incorrect.

There are two significant problems with this approach. First, using the time the AddReference message is sent as the reference time is not necessarily an accurate measurement of the time the user examined the content of the document and decided to create a link. Thus, it is possible that the protocol would accept a message containing a locator that references content other than that desired by the author. A second problem with the naive approach is its reliance on synchronization of separate clocks. If the clock on the author's machine and the agent's clock are not synchronized, it is again possible for the agent to accept a message containing an incorrect locator. Algorithms for determining temporal relationships between machines with unsynchronized clocks, such as Lamport's algorithm [53], are not helpful in this situation, since, as can be seen in figure 6.4, there is no communication between the machines sending the AddReference and the CheckInURL messages, and so no reasoning can be made about the order of events. While HTTP 1.1 urges using a method such as the Network Time Protocol (NTP) [54] to synchronize clocks, the standard does not require it.
If the last modification date of the referenced URL at the time the user decided to create the link could be known, then this could be included as part of the AddReference message. Since this date is generated by the server, and the Agent receiving the AddReference message can request this information from the server, we can use it as a basis for comparison.5 The question is, how does the check-in process know what the modification date of a referenced document was at the time that the user looked at it? We could simply require that the user supply this information. Most current browsers provide a way to retrieve the HTTP LastModified header when viewing a document. While this is less obtrusive than requiring the user to create a checksum, it still is user intervention we would like to avoid. Another possibility is illustrated in figure 6.5. In this situation, we have placed a proxy on the client side. The client is configured to use this proxy, and thus all outgoing HTTP requests are sent through the proxy. The proxy is then able to record the requests made by a client machine, recording the URL’S and last modification date 5For robustness, a web server is often replicated. In order for this scheme to work, the clocks of a web server and any machines replicating it would need to be kept synchronized. This is not as unrealistic a requirement as keeping the clocks of machines administered by different entities synchronized. 69 for each document the user requests. When the check in process is initiated, the agent communicates with the proxy6 to find out the last modification dates of all the documents being referenced by the new version of the document and includes these as part of the AddReference messages it sends. Author‘s Machine Proxy Server Checkout Document Turn on logging Hl IP Requests d———> ooonaogfiu Checkin Document nODflflnflflgo Turn off logging sass"! Figure 6.5: The operation of the client proxy Logging all requests by a client in the proxy uses a significant amount of storage, and it is diflicult to know when to flush this data. Therefore, we need some way of knowing when to turn the proxy recording mechanism on and off. One way of doing so would be to incorporate this action into the check-out and check—in procedures. When the user checks out a document, we turn the recording mechanism on, leaving it on until the time the user checks in the document. Logging of all requests is performed by the proxy during the time between check in and check out. Another option would be to provide “begin authoring” and “end authoring” programs that the user executes. This avoids the problem using check-in and check-out where the user might do some browsing between checking in Of a first document and checking out of a second document. 6in fact, the agent could very well be the proxy. 70 Lost or Delayed LinkUpdate Messages A second situation we must consider occurs when multiple LinkUpdate messages are sent to an agent, and the order the updates are received is not the same as the order in which they were sent. We cannot rely upon the in-order delivery guarantees of TCP here because the messages will be sent across different TCP connections, and TCP only guarantees in-order delivery of messages on the same connection. 
Lost or Delayed LinkUpdate Messages

A second situation we must consider occurs when multiple LinkUpdate messages are sent to an agent, and the order in which the updates are received is not the same as the order in which they were sent. We cannot rely upon the in-order delivery guarantees of TCP here, because the messages will be sent across different TCP connections, and TCP only guarantees in-order delivery of messages on the same connection. The example below outlines why receiving LinkUpdate messages out of order could be a significant problem:

1) Document D1 links to document D2 using locator L1.
2) D2 is modified by a check-in with Agent A2, transforming L1 into L2.
3) Agent A2 sends a message M1 to Agent A1, asking that it replace L1 with L2. This message is delayed indefinitely.
4) D2 is modified again, in a way that causes L2 to be replaced by L3.
5) A2 sends message M2 to A1, asking it to replace L2 with L3.

If M2 is received before M1, this would confuse A1; D1 may not hold any links using L2 as a locator. This situation could not occur if updates were not permitted while there are unacknowledged LinkUpdate messages. Then we would be sure that LinkUpdate messages are delivered in the correct order, since only one set of messages can be in transit at any one time. This solution is not practical, since we have no idea how long it might take for any given LinkUpdate message to be received and acknowledged. Instead, our solution buffers unacknowledged LinkUpdate messages at the sender. Every time a LinkUpdate message is sent, it includes ALL unacknowledged LinkUpdates intended for the receiver. This is not an extra buffering burden for the sender, since this data is already being maintained as part of the link map stored by the lightweight agent.

The sender must include some sort of sequencing mechanism in its messages, and the receiving agent must know which messages it has already received. For the sequencing mechanism, we use the date/time of the sending agent. Since the receiver will be using the date contained in the message as part of the acknowledgment, and specifically not the value of its own clock, there is no issue of coordinated clocks.[7] The receiver only needs to maintain in its memory the sender's timestamp for the last message successfully received from that sender. Using the date as the sequencing mechanism avoids the complexity involved in avoiding duplicate sequence numbers in the event of a machine failure.

[7] Again, replicated agents used for performance and robustness would need to keep their clocks synchronized.

If the receiver receives a message containing LinkUpdate information it has already processed, it disregards that portion of the message. When the receiver sends an acknowledgment to the sender, that acknowledgment acknowledges receipt of all outstanding LinkUpdate messages sent before the timestamp contained in the acknowledgment message.
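The following Java sketch captures the sender-side bookkeeping just described: every outgoing batch carries all updates not yet acknowledged by that receiver, keyed by the sender's own clock, and an acknowledgment echoing a sender timestamp clears everything sent at or before it. The types are illustrative; the dissertation's lightweight agent keeps this data in its link map rather than in a separate structure.

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Sketch of per-receiver buffering of unacknowledged LinkUpdate messages.
    public final class LinkUpdateBuffer {

        // Unacknowledged updates for one receiving agent, keyed by the send timestamp
        // taken from the sender's own clock.
        private final NavigableMap<Long, List<String>> pending = new TreeMap<>();

        // Queue an update and return the complete batch to transmit: every update that
        // has not yet been acknowledged by this receiver, oldest first.
        public synchronized List<String> send(long senderTimestamp, String update) {
            pending.computeIfAbsent(senderTimestamp, t -> new ArrayList<>()).add(update);
            List<String> batch = new ArrayList<>();
            for (Collection<String> updates : pending.values()) {
                batch.addAll(updates);
            }
            return batch;
        }

        // Process an acknowledgment: the receiver echoes the sender's timestamp, so no
        // comparison between different machines' clocks is ever required.
        public synchronized void acknowledge(long echoedSenderTimestamp) {
            pending.headMap(echoedSenderTimestamp, true).clear();
        }
    }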
Browser and Proxy Caches

The presence of caches in the network further complicates the agent communication protocol. Suppose a LinkUpdate message is sent to an agent. The agent receives the message, corrects the broken link indicated in the message, and sends an acknowledgment to the sending link agent. In the absence of caches, the lightweight agent for the referenced document could immediately remove the mapping of the old locator to the new locator from its link map, confident that no request using the old locator will be received in the future.[8] However, when the referring document can be retrieved from one of several caches, the situation becomes more complex. How does the receiving lightweight agent know whether or not the request should be redirected? And how does it know when the mapping can be removed from its link map?

[8] Unless, of course, a new link is created using the same locator, in which case requests using that locator should of course not be redirected.

To answer these questions, the link agent sets the "Expires" header field of any response that it sends to a user agent according to the following formula:

Expires := min(Max-Expires, Server's Expires value)

where Max-Expires is a constant set as part of the lightweight agent's configuration. When an agent acknowledges the receipt of a LinkUpdate message, it includes its value of the Max-Expires parameter as part of the acknowledgment message.

By setting the Expires header field in this way, a lightweight agent knows the maximum amount of time that a cached copy of a referring document containing a broken link can exist. It must keep the mapping for that broken link in its link map for that period of time. If during that time it receives a request following the broken link, it sends back an HTTP conflict response (response code 409) to the user agent, notifying the user that the link is potentially broken and that the document should be explicitly reloaded from the origin server before attempting to follow the link again. To make sure that the request made by the user after doing so does not again receive a 409 response, the link agent again makes use of a cookie.

6.4 Costs

As with any modification that adds functionality to a system, we must determine what additional resource requirements will be imposed in adding the new functionality. In this section, we first examine the resource requirements for the agent, and then discuss how the performance of the web is affected by the addition of the agent service.

Memory Requirements

There are two data structures the agent must maintain in memory. The first of these is the set of link databases for each document in the document spaces of the servers serviced by the agent. Although these database structures could be maintained in main memory, it is more likely they would be contained in some sort of mass storage system. These databases must be maintained in persistent storage so that the agent can recover if the agent process crashes for some reason. Also, they only need to be consulted during a document check-in, a relatively infrequent occurrence, so speed of access is not of particular interest.

The second data structure is the link map the lightweight agent uses to map broken links to their corrected counterparts. This structure is much smaller than the link databases, and will be referenced more frequently, since each time a link containing an XPointer is referenced, the agent must verify that the link is correct. Therefore, it seems likely that this structure would be stored in main memory. How large might this structure get, and what may be done to limit its size? There are several factors that contribute to its size:

• the frequency with which document modifications are made that cause links to break;

• the policy used to decide when to request that the agents for referring documents, whose links have been broken, update their documents. The immediate notification policy sends a LinkUpdate message as soon as the broken link is detected. Delayed notification policies such as periodic notification and notification on reference are also possible;

• if a delayed notification policy is chosen, further variables are present. If periodic notification is chosen, then the period must be chosen. If when-referenced notification is chosen, the amount of time between references to a link becomes important.
If when-referenced notification is chosen, the amount of time between references to a link becomes important.

Even with an immediate notification policy, we cannot determine a bound on the size of the link map, because it is possible that we may not be able to contact the agent that needs to be notified. Even so, it is likely that the amount of memory the link map requires will be reasonably small. If we assume the average length of a locator is 32 bytes, each entry in the link map structure will consist of 64 bytes (since each entry consists of a mapping from one locator to another). There may be some additional overhead involved in the link map (for example, one way of implementing it would be as a hashtable of hashtables, where the first level of hashtable is indexed by a referrer/referred pair and each hashtable in the second level contains the locator mappings), but this overhead should be fairly negligible. Allocating 1 MB of RAM for the link map structure then allows storage of mappings for about 16,000 broken links.

The notify-on-reference policy is desirable since it has the potential to reduce the network traffic generated by unnecessary update notifications. However, it is also likely to be the policy that allows the link map to become the largest. A simple modification to this policy, where notification messages are sent when the structure becomes too large, would be a reasonable compromise.

Effects on Web Performance

In order for link requests to be serviced correctly, we must funnel all HTTP requests through the lightweight agent responsible for the server that owns the document. If clients are allowed to communicate directly with the web server, the server will not be able to determine whether the link being requested is broken, since only the agent possesses this knowledge.

If the agent process is running on the same machine as the web server, then the cost of requiring the agent process will simply be that of communicating between the two processes. However, it is likely that the two processes will run on separate machines for performance reasons. In this case, every HTTP request to the server will require at least one extra network hop. This performance cost is typical when there is a proxy between a web server and the rest of the network.

If a broken link is followed, the browser will be redirected to the correct location, which essentially adds an extra round trip between the browser and the server. Again, this cost is due to the use of the HTTP redirect mechanism, a mechanism already in place and in general use. In general, the costs of adding the agent proxy are minor, and are already present to a great extent in the current web architecture. Given all the processing that goes on in a web server today, the cost of processing the extra network traffic generated will be negligible.

A side effect of requiring the agent is perhaps more costly: server replication is not possible if all of the requests must go through the agent. It is becoming more and more common to see a single logical web server serviced by multiple physical servers, with the content on these servers replicated. The load on these servers can be distributed through methods such as DNS Aliasing [55], Magic Routers [56], and HTTP redirect [48]. Forcing web clients to communicate with a single agent introduces a single point of failure into the system, and eliminates the load balancing advantages of having replicated servers.
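Returning briefly to the link map discussed under Memory Requirements above, the sketch below shows one possible realization of the hashtable-of-hashtables organization; the class and method names are illustrative assumptions rather than details of the prototype.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the link map: the outer hashtable is indexed by a referrer/referenced
// document pair, and each inner hashtable maps a broken (old) locator to its
// corrected counterpart. Names are hypothetical.
class LinkMap {
    private final Map<String, Map<String, String>> map = new HashMap<>();

    // With locators averaging roughly 32 bytes, each old->new entry costs on the
    // order of 64 bytes, so 1 MB holds mappings for roughly 16,000 broken links.
    private static String key(String referrerUrl, String referencedUrl) {
        return referrerUrl + "|" + referencedUrl;
    }

    void recordRepair(String referrerUrl, String referencedUrl,
                      String oldLocator, String newLocator) {
        map.computeIfAbsent(key(referrerUrl, referencedUrl), k -> new HashMap<>())
           .put(oldLocator, newLocator);
    }

    // Called when a request following a possibly broken link arrives: returns the
    // corrected locator if one is known, or null if no redirection is needed.
    String resolve(String referrerUrl, String referencedUrl, String locator) {
        Map<String, String> repairs = map.get(key(referrerUrl, referencedUrl));
        return repairs == null ? null : repairs.get(locator);
    }

    // Once no cached copy of the referring document can remain (its Expires period
    // has elapsed), the mapping can be dropped to bound the size of the structure.
    void expire(String referrerUrl, String referencedUrl, String oldLocator) {
        Map<String, String> repairs = map.get(key(referrerUrl, referencedUrl));
        if (repairs != null) {
            repairs.remove(oldLocator);
            if (repairs.isEmpty()) {
                map.remove(key(referrerUrl, referencedUrl));
            }
        }
    }
}
```

A call such as expire corresponds to the point at which no cached copy of the referring document can remain, and the same hook could be used to shed entries when the structure grows too large under the notify-on-reference policy.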
The simplest way to deal with this single point of failure is to replicate the agents. With this approach, the same methods used to distribute the load between conventional web servers can be used to distribute the load between the agent processes. If this approach is adopted, steps will need to be taken to ensure that the link databases and link maps are kept synchronized between the agent processes. We intend to explore this approach more carefully in future work.

6.4.1 Increasing Document Processing Efficiency

An agent servicing a frequently updated web server will need to process a large number of documents concurrently. Most implementations of XML processors are written in a single-threaded fashion. Since a good portion of the time spent processing an XML document is spent doing I/O (especially for large documents), a large percentage of the time it takes a single-threaded processor to process a document will be spent in a blocked state. With an appropriate threads library, a multi-threaded, single-process XML processor will be able to process documents much more efficiently than its single-threaded counterpart.

We have modified version 2.0.15 of the XML4J processor from IBM [57] to transform it from a single-threaded into a multi-threaded implementation. As can be seen from Figure 6.6, the time that it takes to process a set of large documents is considerably improved by adding more threads.

We have actually modified the XML4J processor in a much more significant way than simply adding multi-threading capabilities. The changes we have made allow the processor to be used in a distributed fashion. Multiple processes running on distributed machines are then used cooperatively to process the same document. Implementing the system in this way allows resources such as memory to be aggregated to facilitate processing of large documents that a single machine might not be able to handle. It is this implementation, running on a single machine with the number of processing threads varied, that the measurements in Figure 6.6 refer to.

[Figure 6.6: Performance of modified parser using multiple threads (processing time versus number of threads).]

The distributed implementation works in the following way. When a client wishes to have a document parsed, it submits a request to the processing cluster by making a remote procedure call to one of the members of the cluster, specifying the URL of the document to be parsed. This processor parses the root node of the document, storing the text nodes of the document tree locally. As it processes the root, it keeps track of the byte offsets of any children that it encounters. When the root node is completed, the processor picks another processor from the cluster and sends it a message, asking it to process the children of the root node (originally, the processor would allocate each child as it was encountered to another processor; although this allows a greater degree of parallelism, the relatively high cost of remote procedure calls made this approach impractical). The processors recursively continue this process, treating the node currently being parsed as the root node of the document, until the document has been completely parsed. Figure 6.7 illustrates this architecture.
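The sketch below outlines this request-and-delegation flow in a simplified, single-process form; in the actual system the calls between processors are remote procedure calls, and every class and method name used here is an illustrative assumption. Figure 6.7, which follows, depicts the same architecture.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the recursive delegation used by the distributed parser.
class ClusterParser {

    // Remote reference identifying which processor holds a given node.
    static class NodeRef {
        final int processorId;
        final int nodeId;
        NodeRef(int processorId, int nodeId) {
            this.processorId = processorId;
            this.nodeId = nodeId;
        }
    }

    private final List<ClusterParser> cluster;  // all processors in the cluster
    private final int id;
    private int nextNodeId = 0;

    ClusterParser(int id, List<ClusterParser> cluster) {
        this.id = id;
        this.cluster = cluster;
    }

    // Entry point: the client (the agent) asks any member of the cluster to parse
    // a document and receives a reference to its root node.
    NodeRef parse(String documentUrl) {
        return parseNode(documentUrl, 0L);
    }

    // Parse the element starting at byteOffset, keeping its text nodes locally,
    // then ask a single peer to process all of the element's children.
    NodeRef parseNode(String documentUrl, long byteOffset) {
        NodeRef ref = new NodeRef(id, nextNodeId++);
        List<Long> childOffsets = scanElement(documentUrl, byteOffset);
        if (!childOffsets.isEmpty()) {
            pickPeer().parseChildren(documentUrl, childOffsets);  // one call per node, not per child
        }
        return ref;
    }

    // The peer applies the same procedure recursively to each child it was given.
    void parseChildren(String documentUrl, List<Long> childOffsets) {
        for (long offset : childOffsets) {
            parseNode(documentUrl, offset);
        }
    }

    private ClusterParser pickPeer() {
        return cluster.get((id + 1) % cluster.size());  // simple round-robin choice
    }

    private List<Long> scanElement(String documentUrl, long byteOffset) {
        // Placeholder for parsing a single element: store its text locally and
        // return the byte offsets of its child elements.
        return new ArrayList<>();
    }
}
```

Delegating all children of a node in a single step, rather than one child at a time, reflects the cost consideration noted above for remote procedure calls.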
[Figure 6.7: Distributed parsing architecture, showing the client (agent), the initial parse request, parse requests carrying a URL and byte offset, and the per-processor node tables.]

The client interacts with the processor cluster through the standard DOM API [44]. Each time a request is made for data associated with a node, a remote procedure call is made to the machine that holds that node. It is possible that a request for information about a node could be made before that node has been processed. In that case, the processor waits to respond to the client until the requested data arrives. List structures such as child lists contain references to remote nodes, which are used to contact the processors that actually hold the data associated with each node.

For reasons we have been unable to determine, however, the performance of the distributed system is too poor, in comparison to the single-processor model, for it to be practically useful. We believe that the potential improvements of the distributed approach merit further study of the problems we have found in our particular implementation.

6.5 Prototype Implementation

We have implemented a prototype agent which works with current web browsers and servers without modifications to them. The only deviation from the standards that we have had to make is to separate the URL portion of a link element from the XPointer portion using a | character, rather than the standard # character. This is because current browsers do not send anything after the # sign as part of the request; for our agent to work, it must know the locator the requester is using.

The prototype agent and the link repair algorithms that it uses were both implemented in the Java programming language. One of the main benefits of using Java is that it allows the agent process to execute on any platform, so long as there is an implementation of the Java Virtual Machine for that platform.

Since there are not yet any browsers that implement the XLink and XPointer proposed standards, our implementation does all of the XPointer processing at the agent. When a link containing an XPointer is followed, the agent retrieves the portion of the document located by the link and returns only that portion to the requester, rather than returning the entire document and allowing the browser to process the XPointer.

We have created a simple message editor to facilitate the creation and communication of link notification and check-in messages. A simple check-in program that checks in a document located in a file has also been written as part of our prototype.

6.6 Summary

This chapter has described an agent acting as a proxy for the web server(s) that wish to guarantee contextual link integrity. For guaranteed correctness, the agents for each web server must communicate with each other when links are created or documents are modified. The costs of implementing such a system have been shown to be comparable to those of mechanisms that are already in use, such as firewalls and the use of HTTP redirect to keep traditional links from breaking.

A prototype implementation has been developed using current web tools, including web browsers and servers. The fact that we were able to implement the prototype without changes to current browser and server software demonstrates its potential for quick deployment.

In the future, we plan to explore the issues of fault tolerance and performance discussed in section 6.4. Allowing agents to be replicated would greatly increase the usefulness of the system.
We would also like to continue to reduce the amount of cooperation needed between agent processes, which we discuss in chapter 7.

Chapter 7

Independent Operation

7.1 Overview

We now turn to work which has as its goal eliminating the need for cooperation between authors of documents in web sites administered by different entities. In particular, we are interested in eliminating the link notification requirement. This requirement states that authors must notify the agent associated with a web server every time a link pointing to a document in that server's document space is created or deleted. Eliminating this requirement is desirable since it will expand the set of documents for which contextual link integrity can be maintained.

As we stated in section 6.1, the requirement for agents to communicate the creation of links with each other exists because it is not possible to enumerate the document space of the WWW. While we cannot exhaustively search the universe of web documents to discover the existence of links, information is available that can inform an agent of the existence of a link. This information is provided in the form of the HTTP referer header. By monitoring the receipt of HTTP requests containing referrer information, we can discover the existence of new links and record them in the agent's link database. We will call links that are recorded in an agent's link database due to this process discovered links.

This method of discovery is not foolproof, however. Consider the example shown in figure 7.1.

[Figure 7.1: Discovering an incorrect link. Timeline: a link from the referring document to the referenced document is created; the referenced document is modified, making the link incorrect; the link is then referenced, and the agent for the web server records a link pointing to an incorrect location.]

In this example, a document containing a link is created. The author of that document uses the current version of the referenced document to create the locator for the link. The author of the referring document does not notify the agent responsible for the referenced document of the link's existence. Subsequently, the referenced document's author makes a change to that document which contextually breaks the referring document's link. It is significant that this change occurs before the link in question is ever followed.

Some time later, a user browsing the referring document decides to follow the link. It is at this time that the agent for the referenced document learns of the existence of the link and records it in its link database. Unfortunately, the agent has no way of knowing whether the link is contextually broken, and therefore it assumes the link is correct. Not only will the user browsing the referring document be pointed to an incorrect portion of the referenced document, but the incorrect resolution of the link will also be perpetuated by the referenced document's agent after any significant subsequent changes to the referenced document. In other words, once the link is incorrectly recorded in the database, it will be "fixed" by the link maintenance process so that it stays broken. We call a link which is discovered by the link agent in this situation an invalid discovered link.

Adding an invalid link to an agent's database does not really cause much harm. When the user follows the link, the result will be that the target of the link is not the same as that originally intended by the document author. If the agent did not record the invalid link, then the target of the link will also not be the author's intended target.
The actual target the link resolves to will most likely be different in the two cases, but in both cases it will be incorrect. If at some time in the future the owner of the referencing document does start sending explicit AddReference notifications, then it is important at that time to verify the correctness of any discovered links held by the agent. This is the case because at that point authors at the referencing site begin to have an expectation that links they create will remain correct.

The rest of this chapter describes the methods and results of an investigation into how frequently invalid discovered links are encountered. The results of this investigation give us an idea of how well an agent can maintain contextual link integrity without cooperation from document authors.

7.2 Methods

Determining the frequency of discovering an invalid link is a difficult proposition. We must be able to answer the question "How frequently is a referenced page modified between the time a link is created and the first time that link is followed?" Ideally we would need to be notified immediately whenever any link pointing into an agent's document space is created. However, the impossibility of obtaining such information in an unrestricted environment is the very reason that the notion of discovered links came about in the first place.

In order to approximate the answer to the question, our search is restricted to the document spaces of a specific set of agents. We periodically poll the document space of this set of agents and look for links that point into the document space of other agents in the set. This requires the ability to enumerate the document space of each agent in the set; access to the configuration files of the server(s) associated with the agents is necessary to perform this enumeration.

In addition to tracking the creation of links, we must track two other activities. The first of these activities is modifications made to the documents in the selected agents' aggregate document space. Again, we would ideally like to be notified immediately when a document modification takes place. Without operating system support, however, this is impossible, and so we resort again to periodic polling (the Win32 API used by Microsoft operating systems provides something close to this ability; Unix operating systems do not generally provide such a facility, although one has recently been proposed in [58]). The second activity we must track is referencing of the links in question. This activity can be precisely monitored, since most web servers provide the ability to log referrer information at the time they encounter it. Of course, permission to monitor these log files is required for the set of agents in question.

We tracked link creation, page modifications and link references for a period of six weeks. Due to administrative difficulties, the set of servers used in our study contained a single server, our departmental web server. Studies have shown [59] that documents located on servers in the .edu domain are the least frequently modified documents, so the bias of studying this particular server must be taken into account when analyzing the results.

We have divided the links discovered during the data gathering process into two categories: links that point to pages that were generated by some program, and links that were not. Generated links were typically produced by the javadoc [60]
program, and their large number appears to be due to an instructor requiring students to use javadoc frequently as part of their assignments.

7.3 Results

During the six week tracking period, we observed the creation of 2678 links that pointed from within one document in the departmental server's web space to another document in that same document space. 526 of these links were referenced during the experimental period. For each link referenced at least once during the period of time during which data was gathered, the chart in figure 7.2 indicates how many times the referenced document was modified before the link was referenced.

[Figure 7.2: Number of times modified before first reference occurs. Number of links versus number of modifications (0 through 14), separated into generated and non-generated links.]

The entry in the chart where the x-axis is 0 indicates links that, if discovered by an agent during a reference, would be valid. There are 346 links in this category, indicating that approximately 2/3 of the links referenced at least once would be valid at the time of discovery. Pages that were modified one or more times before they were referenced may lead to invalid discovered links, depending on whether the modification made at that time caused the link to be contextually broken. At this time, we do not have any data on how likely it is that a document modification would contextually break a link. This depends on the types of modifications that are typically made to documents, and the types of locators users typically create.

A substantial number of the links created during the trial were not referenced during the trial. Figure 7.3 shows the distribution of modification counts for those pages during the trial period.

For the generated links, we can argue that invalid discovered links are unlikely to occur when one of these links is referenced. This is because both the referring and referenced documents are typically generated by the program responsible for generating the pages, and it is likely that the program would correctly maintain any links it creates. While the user is able to add additional links to the generated documentation, creating candidates for invalid discovery, this seems unlikely in the case of javadoc, since these links would not be maintained between versions of the documentation; users would soon learn not to create links in the referring pages.

[Figure 7.3: Number of times modified without being referenced. Number of links versus number of modifications, separated into generated and non-generated links.]

7.4 Improving the Results

The statistics in the previous section compare the time of the first modification of a document with the time of the first reference of a particular link pointing to that document. We can improve the performance of the algorithm by a slight modification to the discovery process. As we stated in section 7.1, when the agent receives a request for a document which has the referer header set, and the agent does not already know about the existence of the link, it "discovers" that link (a minimal sketch of this bookkeeping appears below).
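The sketch below shows, under assumed names, the bookkeeping an agent might perform for this discovery step; it records only the single link being followed, and the refinement described next extends discovery to every link in the referring document.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of referer-based link discovery; the HTTP plumbing is omitted and the
// class and method names are hypothetical.
class LinkDiscovery {

    // For each document in the agent's document space, the referring links already
    // known (whether explicitly registered or previously discovered).
    private final Map<String, Set<String>> linkDatabase = new HashMap<>();

    // Called for every incoming request the agent proxies.
    void onRequest(String requestedUrl, String locator, String refererHeader) {
        if (refererHeader == null || refererHeader.isEmpty()) {
            return;                    // nothing to learn from this request
        }
        Set<String> known = linkDatabase.computeIfAbsent(requestedUrl, k -> new HashSet<>());
        String link = refererHeader + " -> " + locator;
        if (known.add(link)) {
            // A previously unknown link: record it as a discovered link. The agent
            // cannot tell whether it is already contextually broken, so it is simply
            // assumed correct from this point on.
            markAsDiscovered(requestedUrl, refererHeader, locator);
        }
    }

    private void markAsDiscovered(String requestedUrl, String referer, String locator) {
        // Persist the discovered link so that later modifications to requestedUrl
        // can trigger link maintenance for the referring document.
    }
}
```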
At the time the agent receives the request containing a referer header, it can in fact discover the existence of all links in the referring document that point into its document space. In other words, the agent is really discovering the existence of the referring document, not of the link itself.

Making this modification to the link discovery process changes the question we must ask to determine how frequently discovered links will be invalid. The question becomes "How frequently is the referenced page modified between the time the link in question is created and the first time any link in the referring document pointing into the referenced agent's document space is referenced?" Figure 7.4 shows how the results change when using this modified discovery strategy.

[Figure 7.4: Results using the modified algorithm. Number of links versus number of modifications, separated into generated and non-generated links.]

These results are based on the data that we gathered for this study, which tracks only intrasite links. The improvement is realized because most pages contain multiple links (and, since they are intrasite links, they must point to the same server). The likelihood of multiple links in a specific page pointing to a single agent's aggregate document space is probably lower than what we have observed in this data.

We have attempted to estimate this likelihood by reexamining the data on our departmental web server. Each document in the server's document space was analyzed, and we recorded the web server associated with each link pointing to an offsite server. This data was summarized by computing, for each document, the number of links pointing to each unique server. Figure 7.5 shows the results of averaging these per-server link counts for each document.

[Figure 7.5: Average number of references per server in a document. Number of pages versus number of offsite links (ceiling function).]

It is possible that as the use of XPointers as locators becomes more prevalent, the likelihood of multiple links in a page pointing to an agent's aggregate document space will increase. We say this because users may create more links when they have the ability to specify the target of a link more accurately.

7.5 Conclusions

We have seen that discovering links through the use of the HTTP referer header can allow an agent, with a reasonably high degree of success, to maintain contextual link integrity for referring documents residing on a server that is not serviced by an agent. The possibility of discovering an invalid link is a risk that the agent must take if it decides to add discovered links to its database. We have seen, however, that discovering an invalid link does not necessarily result in behavior that differs significantly from not discovering the link.

Discovered links may be treated differently by the agent than links whose existence has been explicitly communicated by another agent. For example, they may be discarded at any time if some of the resources used by the agent need to be reclaimed (for example, disk space to hold the link database or memory needed to hold the link map described in section 6.4).

The evidence we have shows that discovering links is a worthwhile proposition for an agent that wishes to increase the probability that links pointing into its document space are resolved correctly. A more expansive study is required to further validate this evidence.
Chapter 8

Conclusions and Future Work

8.1 Conclusions

The problem of maintaining contextual link integrity is an important one. As the use of standards such as XLink and XPointer becomes more widespread, web users will increasingly encounter documents containing contextually broken links. Preventing this situation gives an information provider a definite advantage in terms of the value of the information being provided. Most previous approaches to maintaining WWW link integrity apply only to preventing existentially broken links. This dissertation describes a system we have implemented to maintain contextual link integrity in today's WWW environment.

Our system is open in the sense that specialized editors are not required. Users can continue to use whatever tools they are familiar with to create their documents. The only requirement imposed by the system is that a "check-in/check-out" process is used by the users when they wish to begin or end editing of a particular document. This process then takes care of all communication with the relevant agents.

We have described algorithms that allow web servers to automatically and efficiently maintain link integrity. These algorithms have been shown to be efficient compared to simplistic brute-force algorithms that might be applied. As far as we know, our algorithm for contextually broken link detection is the first algorithm proposed to detect broken links efficiently. We have shown that our algorithm is an optimal algorithm for solving this problem.

Furthermore, we have developed an agent proxy to provide contextual link integrity. The most significant aspect of this proxy architecture is its ability to interact with the current generation of web browsers and servers with little modification. We have argued that the communication protocol used by these agents is correct, and shown that it does not impose a significant processing or communication overhead on the servers and network infrastructure of the current WWW. We have shown that the agent performs correctly in the event of several common failure scenarios.

One of the biggest criticisms of the agent approach is the dependence upon communication between independently managed agents. Indeed, several researchers dismiss the possibility of solving the link integrity problem in this way primarily due to this issue. We have shown that inter-agent communication is unavoidable for agents wishing to guarantee 100% contextual link integrity. We do not fully agree that inter-agent communication is the determining factor in deciding whether our approach is feasible. Microsoft is basing a large part of its corporate strategy for the early 2000s on communication between web services [61], which indicates that at least the corporate world does not view such communication as prohibitive. Even if most information providers were to agree to inter-agent communication, however, there certainly could be those who would not wish to do so. Thus, allowing agents to operate effectively without inter-agent communication is certainly an issue that needs to be addressed. In chapter 7, we described an alternative approach through which agents could "discover" the existence of links into their document spaces independently, reducing the need for inter-agent communication. Through a study of document modification histories and link access data, we have shown that this discovery technique can effectively provide link agents with the information necessary to maintain contextual link integrity most of the time.
In order for our system to be widely deployed, it must not introduce significant time delays in the document development process. We have extensively researched the possibility of implementing a distributed processing architecture for resource-efficient, fast processing of XML documents, an operation that will become increasingly important as the use of XML as a data description language proliferates. We have seen evidence that a distributed architecture would be beneficial, based on the data gained from implementing a multi-threaded document processor. This multi-threaded processor has been shown to have significant performance benefits over a single-threaded document processor, particularly when running on a machine with multiple processors.

8.2 Future Work

Taken as a whole, the algorithms and systems described in this dissertation can be used by an organization to guarantee contextual link integrity for the documents in its web servers' document spaces. However, there are several opportunities to extend and improve on the work we have already accomplished.

Firstly, and perhaps most importantly, the correctness of the agent communication protocol described in chapter 6 is a crucial component of our system. While we have argued for its correctness, a formal verification of the correctness of these protocols is certainly desirable.

A second area where further research could be beneficial would be to extend the study of document modifications and web access patterns described in chapter 7. The results we have reported are based on data that is limited in scope and size. To further validate these results, a longer study that would hopefully involve multiple web sites should be conducted. Involving multiple sites from different types of domains (e.g., .com and .org) would further confirm the validity of the discovered link approach.

Although we were not able to successfully implement distributed document processing with reasonable performance, the idea still seems to be a good one. Distributing the processing load among multiple machines allows us to process large documents more efficiently. In addition, by aggregating the memory resources of a set of machines, it should be possible to process large documents that a single document processor would be incapable of handling.

Due to the relatively high cost of making remote procedure calls, we have found that pre-fetching and caching of portions of the tree at the client is a useful technique. Determining an appropriate pre-fetch and cache replacement strategy is important, since the performance of the distributed document processor can be significantly hampered if poor decisions are made for these strategies. In order to decide what the optimal cache replacement algorithm is, it is necessary to have a model of typical application access patterns to the DOM representation of a document. A study of several significant DOM applications could be used to develop a model of how document content is typically accessed, thus leading to more efficient caching.

Finally, as the WWW continues to evolve, it is very likely that the proportion of dynamic documents will continue to increase. A dynamic document is a document that is generated at the time it is requested, and is typically not stored in a single file. Our system is currently based on the "document = file" paradigm, which is still present in many WWW sites. Developing a system that could deal with dynamic documents is certainly a challenge.
Bibliography

[1] Tim Berners-Lee, Robert Cailliau, Ari Luotonen, Henrik Frystyk Nielsen, and Arthur Secret. The World-Wide Web. Communications of the ACM, 37(8):76-82, August 1994.
[2] Theodor Nelson. Literary Machines. Ted Nelson, 1987.
[3] Nigel Woodhead. Hypertext and Hypermedia: Theory and Applications. Sigma Press, Winslow, England, 1st edition, 1990.
[4] Vannevar Bush. As we may think. The Atlantic Monthly, July 1945.
[5] International Organization for Standardization, Geneva, Switzerland. Information Technology - Hypermedia/Time-based Structuring Language (HyTime). Second edition, 1997-08-01. International Standard ISO 10744:1997. Available online at http://www.ornl.gov/sgml/wg8/docs/n1920/html/n1920.html.
[6] Frank Kappe. Hyper-G: A Distributed Information System. In B. Leiner, editor, Proceedings of INET, San Francisco, California, 1993. Internet Society.
[7] Hugh C. Davis, Simon Knight, and Wendy Hall. Light hypermedia link services: A study of third party application integration. In Proceedings of the ACM Conference on Hypertext, ECHT '94, pages 41-50. Association for Computing Machinery, 1994.
[8] Frank Halasz and Mayer Schwartz. The Dexter hypertext reference model. Communications of the ACM, 37(2):30-39, February 1994.
[9] Hugh C. Davis. Referential integrity of links in open hypermedia systems. In Proceedings of Hypertext 98, pages 207-216, Pittsburgh, PA, USA, 1998. Association for Computing Machinery.
[10] Text Encoding Initiative, [Online] Available http://www.uic.edu/orgs/tei, published 1998.
[11] Steven J. DeRose and David Durand. The TEI Hypertext Guidelines. Computers and the Humanities, 29(3):181-190, 1995.
[12] S. Hockey. Developing access to electronic texts in the humanities. Computers in Libraries, 13(2):41-43, February 1993.
[13] Douglas C. Engelbart. Toward augmenting the human intellect and boosting our collective IQ. Communications of the ACM, 38(8):30-33, August 1995.
[14] Tim Berners-Lee. Hypertext Markup Language - 2.0, [Online] Available http://www.w3.org/MarkUp/html-spec/html-spec.ps, published November 1995.
[15] Dave Raggett. HTML 3.2 Reference Specification, [Online] Available http://www.w3.org/TR/REC-html32, published January 1997.
[16] HTML 4.01 Specification, [Online] Available http://www.w3.org/TR/html4/, published January 1999.
[17] Extensible Markup Language, [Online] Available http://www.w3.org/TR/REC-xml, published February 1998. This is a W3C Recommendation.
[18] International Organization for Standardization, Geneva, Switzerland. Information Processing, Text and Office Systems, Standard Generalized Markup Language (SGML) = Traitement de l'information, systèmes bureautiques, langage standard généralisé de balisage (SGML). First edition, 1986-10-15. International Standard ISO 8879-1986. Federal information processing standard; FIPS PUB 152.
[19] Steven J. DeRose. The SGML FAQ Book. Kluwer Academic Press, Boston, Dordrecht, London, 1st edition, 1997.
[20] ISO/IEC 10744:1997. Hypermedia/Time-based Structuring Language (HyTime), [Online] Available http://www.ornl.gov/sgml/wg4/docs/n1920, published 1997.
[21] XML Linking Language (XLink), [Online] Available http://www.w3.org/TR/1998/REC-xlink-19980210, published March 1998. This is a work in progress and subject to change at any time.
[22] XML Linking Language (XLink), [Online] Available http://www.w3.org/TR/xlink, published June 2000. This is a W3C Candidate Recommendation.
[23] XML Pointer Language (XPointer), [Online] Available http://www.w3.org/TR/1998/WD-xptr-19980303, published March 1998. This is a work in progress and subject to change at any time.
[24] XML Pointer Language (XPointer), [Online] Available http://www.w3.org/TR/xptr, published June 2000. This is a W3C Candidate Recommendation.
[25] Niels Olof Bouvin. Unifying strategies for web augmentation. In Proceedings of the ACM Conference on Hypertext, ECHT '99, pages 91-100. ACM, 1999.
[26] Alberto O. Mendelzon and Tova Milo. Formal models of web queries. In Proceedings of PODS 97, pages 134-143, 1997.
[27] Frank Halasz and Mayer Schwartz. The Dexter hypertext reference model. In J. Moline, D. Benigni, and J. Baronas, editors, Proceedings of The Hypertext Standardization Workshop, pages 95-133, Gaithersburg, MD, 1990. National Institute of Standards.
[28] U. Black. OSI: A Model for Computer Communications Standards. Prentice-Hall International, Englewood Cliffs, New Jersey, 1991.
[29] Norman Meyrowitz. The missing link: Why we're all doing hypertext wrong. In Edward Barrett, editor, The Society of Text: Hypertext, Hypermedia, and the Social Construction of Information, pages 107-114. MIT Press, Cambridge, Massachusetts, 1989.
[30] Bernard J. Haan, Paul Kahn, Victor A. Riley, James H. Coombs, and Norman K. Meyrowitz. IRIS hypermedia services. Communications of the ACM, 35(1):36-51, January 1992.
[31] Steven J. DeRose and David G. Durand. Making Hypermedia Work: A User's Guide to HyTime. Kluwer Academic Publishers, Norwell, Massachusetts 02061 USA, 1994.
[32] Udo Flohr. Hyper-G organizes the web. Byte Magazine, 20(11):59-64, November 1995.
[33] Keith Andrews, Frank Kappe, and Hermann Maurer. The Hyper-G network information system. Journal of Universal Computer Science, 1(4):206-220, April 1995.
[34] Frank Kappe. A scalable architecture for maintaining referential integrity in distributed information systems. Journal of Universal Computer Science, 1(2):84-104, February 1995.
[35] Leslie Carr, Hugh Davis, David De Roure, Wendy Hall, and Gary Hill. Open information services. In Proceedings of the Fifth International World Wide Web Conference, Paris, France, 1996. World Wide Web Consortium.
[36] Jim Pitkow. Web characterization overview, [Online] Available http://www.w3.org/WCA/Reports/1998-01-PDG-answers.htm, published March 1999.
[37] Michael L. Creech. Author-oriented link management. In Proceedings of the Fifth International World Wide Web Conference, Paris, France, 1996. World Wide Web Consortium. Available at http://www5conf.inria.fr/fich_html/papers/P11/Overview.html.
[38] PURL home page, [Online] Available http://purl.oclc.org.
[39] P. Mockapetris. Domain Names - Implementation and Specification. RFC 1035, 1987.
[40] S. Ingham, D. Caughey, and M. Little. Fixing the "broken link" problem: The W3Objects approach. Computer Networks and ISDN Systems, 28(7-11):1255, 1996.
[41] Andres S. Noetzel and Stanley M. Selkow. An Analysis of the General Tree-Editing Problem, pages 237-252. Addison Wesley, 1983.
[42] Thomas A. Phelps and Robert Wilensky. Robust intra-document locations. In Proceedings of the Ninth International World Wide Web Conference, Amsterdam, Netherlands, 2000. World Wide Web Consortium, Foretec Seminars, Inc.
[43] Leslie Carr, David De Roure, Wendy Hall, and Gary Hill. The distributed link service: A tool for publishers, authors and readers. World Wide Web Journal, 1(1):647-656, 1995.
[44] Document Object Model (DOM) Level 1 Specification, [Online] Available http://www.w3.org/TR/REC-DOM-Level-1, published October 1998. This is a W3C Recommendation.
[45] Jon Bosak. XML version of the Bible, [Online] Available http://sunsite.unc.edu/pub/sun-info/standards/xml/eg/rel200.zip, published October 1998.
[46] James C. Galvin. Harmony of the Gospels, pages 1931-1936. Tyndale House Publishers and Zondervan Publishing Company, Grand Rapids, Michigan, 1991.
[47] David Megginson. SAX 1.0: The Simple API for XML, [Online] Available http://www.megginson.com/SAX/index.html, published May 1998.
[48] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, T. Berners-Lee, L. Masinter, and P. Leach. HTTP 1.1 Specification. RFC 2616, [Online] Available http://www.w3.org/Protocols/rfc2068/rfc2068, published June 1999.
[49] Sun Microsystems. Java Remote Method Invocation Specification, [Online] Available http://www.javasoft.com/products/jdk/1.2/docs/guide/rmi/spec/rmiTOC.doc.html, published October 1998.
[50] Sun Microsystems. Getting Started Using RMI, [Online] Available http://www.javasoft.com/products/jdk/1.2/docs/guide/rmi/getstart.doc.html, published 1999.
[51] D. Kristol and L. Montulli. HTTP state management mechanism. RFC 2109, February 1997.
[52] Junkbusters Corporation. How web servers' cookies threaten your privacy, [Online] Available http://www.junkbusters.com/ht/en/cookies.html, downloaded August 2000.
[53] Leslie Lamport. Time, clocks and the ordering of events in a distributed system. Communications of the ACM, 21:558-564, July 1978.
[54] D. Mills. Network Time Protocol (version 3) specification. RFC 1305, March 1992.
[55] T. Brisco. DNS Support for Load Balancing. RFC 1794, April 1995.
[56] E. Anderson, D. Patterson, and E. Brewer. The Magicrouter, an Application of Fast Packet Interposing, [Online] Available http://www.cs.berkeley.edu/~anders/projects/magicrouter/osdi96-mr-submission.ps, published May 1996.
[57] IBM alphaWorks. XML parser for Java, [Online] Available http://www.alphaworks.ibm.com/tech/xml4j/, published February 1998.
[58] Gaurav Banga, Jeffrey C. Mogul, and Peter Druschel. A scalable and explicit event delivery mechanism. In Proceedings of the 1999 USENIX Technical Conference, Monterey, California, 1999. Advanced Computing Systems Association. Available at http://www.usenix.org/events/usenix01/cfp/banga/banga-html/.
[59] Xiangping Chen and Prasant Mohapatra. Lifetime behavior and its impact on web caching. In International Conference on Computer Communications and Networks, Estes Park, Colorado, 1999. Institute of Electrical and Electronics Engineers.
[60] Sun Microsystems. Javadoc 1.3, [Online] Available http://www.javasoft.com/j2se/1.3/docs/tooldocs/javadoc/index.html, downloaded August 2000.
[61] Microsoft Corporation. Microsoft .NET: Realizing the next generation internet, [Online] Available http://www.microsoft.com/net/whitepaper.asp, published June 2000.