This is to certify that the dissertation entitled MAINTAINING CONTEXTUAL LINK INTEGRITY IN DISTRIBUTED HYPERMEDIA SYSTEMS presented by Ryan L. McFall has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

Major Professor
Date: 12/8/2000

MAINTAINING CONTEXTUAL LINK INTEGRITY IN DISTRIBUTED HYPERMEDIA SYSTEMS

By

Ryan Lee McFall

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

December 8, 2000

ABSTRACT

MAINTAINING CONTEXTUAL LINK INTEGRITY IN DISTRIBUTED HYPERMEDIA SYSTEMS

By Ryan Lee McFall

Hypertext systems have proven to be an excellent method of organizing and distributing information. The main feature that distinguishes a hypertext system from other information systems is the inherent capability to describe relationships between various entities using links. The importance of links has been known for a long time, but the introduction of electronic hypertext systems has facilitated the increased role that linking holds.

The widespread availability of electronic information and the ease of publishing and modifying that information has led to a phenomenon referred to as broken links. This problem occurs when a previously created link can no longer be traversed due to a modification to or deletion of the referenced information resource. Broken links have been a major annoyance in the most widely used hypertext system to date, the World Wide Web (WWW). With the introduction to the WWW of more sophisticated hyperlinking primitives, the problem will be exacerbated. In particular, a new class of broken link we call a contextually broken link will arise. In this type of broken link, one or more endpoints of the link still exist, but the content located at the broken endpoint is no longer correct. This type of broken link is more subtle than the first type, since the system may not be able to notify the user that the link being followed is broken.

In this dissertation, we describe the problems associated with contextually broken links and describe algorithms and communications protocols that can solve these problems. Furthermore, we analyze the efficiency of these algorithms and protocols. We show that these algorithms are capable of detecting and repairing contextually broken hyperlinks automatically in the WWW without requiring significant changes to its established infrastructure.

© Copyright by RYAN LEE MCFALL 2000
All Rights Reserved

ACKNOWLEDGMENTS

This dissertation has been the product of much work, sweat, tears and prayer.
I would like to take a moment to thank all of the people who have been such a tremendous source of encouragement and strength for me. Without these people, this work would never have been completed.

First of all, I would like to thank Dr. Matt Mutka for overseeing my work as my advisor. I have appreciated his willingness to undertake this research even though my topic was neither of great personal interest to him, nor was it something that he had an extensive background in. I am grateful that Dr. Mutka was patient with me during the times that my teaching took up a significant portion of my time (often probably more than it should have).

A second word of thanks goes to the Graduate School at Michigan State University for providing the Dissertation Completion Fellowship that allowed me to complete this dissertation on time, even while staying home with my newborn daughter for the majority of my final semester on campus. Timely completion of this work without that funding would have been close to impossible.

My parents have provided me with financial and emotional support throughout my life. Without them, I could not have made it through 5 years of college, let alone another 6 years as a graduate student. I am grateful to you both for all the opportunities you provided me while I was growing up, and for your continued, unwavering support while I have been pursuing this dream.

Bernie Holmes listened to my complaining more than just about any other person besides my wife, I suspect. I am grateful to Bernie for his support and suggestions, and for keeping me sane in general. I am sure that he will not forget the many hours I spent muttering about things not working as expected. Thanks for sharing your office with me for so many years!

Dr. Mark Urban-Lurain, Dr. Don Weinshank and Gary McCuaig were also instrumental in completing this work. I thank you for the opportunity to work with you in designing and implementing CSE 101. Mark, you have influenced the way I think about developing software in many ways during the approximately 5 years that we have worked together. I also appreciate your support in all of the decisions I have had to make regarding my graduate school experience.

The faculty at Hope College have been wonderful in their encouragement. Thank you for the excellent preparation for graduate school that you provided me as an undergraduate, and for your sustaining words of advice as I have slowly made progress towards this point. I am particularly thankful to Herb Dershem for the way he has provided ideas and advice. I am in your debt!

There have been several fellow graduate students who have been crucial to the completion of this work. Stephen Wagner, where would I be without all of your help? I truly pray that you will be able to complete your dissertation in the upcoming year and can join me in finally being able to answer the question "Are you almost done?" with "No, I AM done!" Without Russ Hankey, Dave Hall and Dmitri Perkins, I would probably still be done, but I would be in much worse shape. Our tennis exploits and weightlifting were a source of both physical and mental renewal. Russ and Dmitri, I thank you for your spiritual companionship over the years.

To the members of my small group and other friends from East Lansing Trinity Church, I am truly in your debt. You have been an incredible source of encouragement for me and have stood by me throughout all of the struggles of this time.
I love you all and am so thankful that you have come alongside of me as I learned to trust in God rather than myself to complete this journey.

A very unlikely place to meet two people who would prove to be such a source of strength is Istanbul, Turkey. John Gillette, you have been a brother to me over the past years, and without your prayers and friendship I know that I could never have completed this work. Sara, I am grateful for your friendship and the joy and support you have brought to Leanne's life while I have been preoccupied with this work. We thank both of you from the bottom of our hearts for the support you have given us throughout this difficult time.

Finally, most of all I would like to thank my wife Leanne. It is unbelievable to me the way that you have managed to live with me and even still like me throughout these past years. Thank you for being so faithful to pray for me and for always providing an encouraging word when I wanted to give up. I am grateful for the way that you have sacrificially spent so much time doing housework and other tasks when I have been too busy to help. Thank you for graciously finding things to do on Saturdays without me while I trudged along trying to complete this work. One thing that we have learned throughout this time together is that God is definitely faithful to fulfill His promises! I will always love you, princess!

If you are reading this as you embark on this journey, I want to encourage you with these words that have been proven true to me countless times during these years: "Do not be anxious about anything, but in everything, by prayer and petition, with thanksgiving, present your requests to God. And the peace of God, which transcends all understanding, will guard your hearts and minds in Christ Jesus." (Philippians 4:6-8, the Bible)

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF ABBREVIATIONS
1 Introduction
2 Problem Statement
3 Background
3.1 Dexter Reference Model
3.2 Intermedia
3.3 HyTime
3.4 Hyper-G
4 Related Work
4.1 Maintaining Link Integrity
4.1.1 Existential Breaks
4.1.2 Contextual Breaks
4.2 Our Approach
5 Finding and Repairing Broken Links
5.1 Detecting Broken Links
5.1.1 Increasing Detection Efficiency
5.1.2 The Algorithm
5.2 Analysis
5.3 Repairing Broken Links
5.4 Evaluation and Testing
5.4.1 Test Documents
5.4.2 Generating Random Documents
5.4.3 The New Testament Document
5.4.4 Testing Environment
5.4.5 Performance on the New Testament Document
5.4.6 Summary
6 A Software Agent to Provide Link Integrity
6.1 Introduction
6.2 Overview of the Agent Proxy
6.2.1 The Agent in Detail
6.3 Protocol Description
6.3.1 Messages
6.3.2 Deciding When to Redirect
6.3.3 Operation in the Presence of Network Failures
6.4 Costs
6.4.1 Increasing Document Processing Efficiency
6.5 Prototype Implementation
6.6 Summary
7 Independent Operation
7.1 Overview
7.2 Methods
7.3 Results
7.4 Improving the Results
7.5 Conclusions
8 Conclusions and Future Work
8.1 Conclusions
8.2 Future Work
BIBLIOGRAPHY

LIST OF FIGURES

2.1 An example of a contextually broken link
2.2 An example of a contextually broken link (continued)
5.1 A document and its links before and after modification
5.2 Nodes examined by the brute-force algorithm
5.3 Types of document modifications that can and cannot be repaired
5.4 Element height distributions
6.1 Agent system architecture
6.2 Messages passed between agents
6.3 Possible ambiguity when sending a redirect message
6.4 Problems due to network latency and AddReference messages
6.5 The operation of the client proxy
6.6 Performance of modified parser using multiple threads
6.7 Distributed parsing architecture
7.1 Discovering an incorrect link
7.2 Number of times modified before first reference occurs
7.3 Number of times modified without being referenced
7.4 Results using modified algorithm
7.5 Average number of references per server in a document

LIST OF ABBREVIATIONS

DNS: Domain Name System
DOM: Document Object Model
HTML: Hypertext Markup Language
HTTP: Hypertext Transfer Protocol
ISO: International Standards Organization
NTP: Network Time Protocol
PURL: Persistent URL
SGML: Standard Generalized Markup Language
TCP: Transmission Control Protocol
TEI: the Text Encoding Initiative
URI: Uniform Resource Identifier
URL: Uniform Resource Locator
W3C: World Wide Web Consortium
WWW: World Wide Web
XML: the eXtensible Markup Language

Chapter 1

Introduction

The amount of data available through open, distributed hypertext systems, such as the WWW [1], has increased dramatically in the last 5 years. Hypertext, a term whose origin is credited to Ted Nelson [2], is defined by Woodhead in [3] as the "generic class of, or approach to, computer-based materials linked by non-linear structures." Nelson himself defined it as "a body of written or pictorial material interconnected in such a complex way that it could not conveniently be presented or represented on paper." In other words, a hypertext system is one in which the data being presented is linked to other data which are related in some way. The user of the hypertext system does not have to follow a pre-determined, sequential path through the documents making up the hypertext system, but can follow links from "node" to "node" at will.

While data availability is important, it has long been known that data itself does not necessarily comprise knowledge. It is discovering and understanding the relationships between pieces of information that constitute knowledge and understanding, which is where the importance of easy-to-use hyperlinking mechanisms comes into play.
The notion of the importance of linking relevant information goes back at least as far as Vannevar Bush's landmark paper "As We May Think" [4]. Bush describes a system where the user is able to create "trails" of related information, where a particular piece of data can be included in different trails in different ways.

Even with traditional knowledge collections, the importance of being able to identify consistently some portion of knowledge has been recognized. The ISBN system was created to provide a means that could uniquely identify a printed book, for example. Separating the Bible into chapter and verse is an example of the need for a consistent method of referring to a fine-grained portion of a work. Both of these mechanisms work very well due to the static nature of the objects being identified. The task of consistently identifying the same part of some resource becomes more challenging when the resource is frequently changing.

Many early hypertext systems were proposed and built in the early days of hypertext research. The main factors that held back widespread acceptance of these systems were their complexity and their dependence on proprietary data formats and software applications. It took the introduction of the WWW, a simple distributed hypermedia system, to exploit the potential of hypertext. Very simple tools are used to create the content on the web (including the links) and to distribute (serve) that content. In the early days of the WWW, the tool used to author content was a standard text editor, and web servers were simple extensions to the traditional concept of a distributed file system. Indeed, many authors still prefer to use these simple tools for their authoring, and web servers have still not evolved into much more than networked file servers.

The simple hyperlinking model employed by the WWW is another reason that it has been so quickly adopted. Many hypertext systems (for example, [5], [6]) have hyperlinking models that are much more sophisticated than what the WWW allows. Forcing users to learn and cope with the complexity of these models may have contributed to the marginal acceptance of these systems. When it is quite simple to create a link between two information resources, authors are much more likely to do so.

Another reason that the WWW has flourished while earlier hypertext systems have not may simply be that the other systems were implementations of ideas whose time was yet to come [7]. This is evidenced by the fact that as the WWW matures, its users are starting to demand some of the functionality that existed in these more sophisticated hypertext systems [8].

Unfortunately, the simplicity of the WWW has also led to some of the most serious problems facing the web today. In particular, the familiar problem of broken links is one that many experts agree must be solved in order to facilitate the continued expansion of the use of the WWW. According to Davis [9], "If the issue of maintaining integrity in hypertexts is not successfully addressed then open hypermedia will continue to be limited largely to relatively static information publishing applications." The goal of this dissertation is to address the problem of broken links while still retaining much of the simplicity that led to the widespread popularity of the WWW.

The remainder of this dissertation is structured as follows. In chapter 2, we give a more complete statement of the problem we will be solving.
Next, we explore other work that has been done in relation to hyperlinking in general (chapter 3) and specifically the issue of maintenance of link integrity (chapter 4). Chapter 5 describes the algorithms we have created to detect and repair contextually broken links. Chapter 6 describes the software agent we have developed to allow the current generation of WWW browsers and servers to provide contextual link integrity, including describing the communication protocol these agents use. In chapter 7 we describe a method of extending the agent which allows a server to maintain contextual link integrity most of the time even without cooperation from other servers. Finally, in chapter 8 we give conclusions and outline further work that remains to be done.

Chapter 2

Problem Statement

As user demands for more sophisticated linking are met, the problems of broken hyperlinks will increase in frequency and cost. Since link integrity is so crucial to the success of a hypertext system, it is imperative that algorithms and protocols be developed which can maintain link integrity on the WWW. This is the fundamental goal of this research.

One of the more important capabilities that more sophisticated hypertext systems provide is the ability to define links that point to arbitrary locations within a text (for example, the Text Encoding Initiative [10, 11, 12]). This capability allows an author to point a reader to a particular part (called a sub-resource) of the referenced resource; in the case where the "reader" is a software agent, the agent can be directed to process only a particular portion of the resource in question. This capability is noted by Engelbart in [13] as one of the more important development paths the WWW must follow.

To facilitate fine-grained referencing of a sub-resource, one must come up with an addressing mechanism to determine the sub-resource that is the endpoint of the link. One such addressing mechanism is structural addressing, in which a sub-resource is identified by specifying a path to traverse through the structure of the document. For example, documents written using the Hypertext Markup Language (HTML) [14, 15, 16] and the eXtensible Markup Language (XML) [17] can be viewed as a tree structure, and a structural address can specify a particular traversal of the tree from the root node to the desired node of the tree.

Introducing structural addressing as a means of identifying the endpoints of a link brings about a new notion of broken links. The traditional notion of a broken link is a link whose destination endpoint no longer exists. For example, an HTML link can break when the document it references is moved to a new location or is deleted. When this type of broken link occurs, the hypertext system is able to notify the user that the link is no longer valid, since the destination endpoint is not available. In the new type of broken link caused by structural addressing, one or more endpoints of the link still exist, but the content located at the broken endpoint is no longer correct. This type of broken link is more subtle than the traditional type, since the system may not be able to notify the user when the link being followed is broken. To distinguish between these two types of breakage, we will classify the first type of broken link as an existential break, while the second type will be called a contextual break, since it is only through examining the context of the link that we can determine whether it is broken.
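To make the idea of a structural address concrete, the following small sketch (our own illustration; the document, element names, and path representation are hypothetical and do not follow any particular addressing syntax) resolves an address expressed as a sequence of child indices against an XML document tree:

    # A minimal sketch of structural addressing: a sub-resource is located by
    # walking a path of child indices from the root of the document tree.
    import xml.etree.ElementTree as ET

    doc = ET.fromstring(
        "<book>"
        "<section>Introduction</section>"
        "<section>Methods</section>"
        "<section>Results</section>"
        "<section>Conclusions</section>"
        "</book>"
    )

    def resolve(root, path):
        """Follow a structural address, e.g. [2] = third child of the root."""
        node = root
        for index in path:
            node = list(node)[index]   # descend to the index-th child element
        return node

    # The address [2] names the third section ("Results") in this version of
    # the document; if an earlier section is deleted, the same address will
    # silently select different content.
    print(resolve(doc, [2]).text)      # -> Results

The address itself contains no information about the content it was intended to select, which is exactly why structural addresses are vulnerable to the kinds of breakage discussed next.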
Both types of broken links can exist in a hypertext system, whether or not it employs structural addressing. In a system that uses structural addressing, an existential break can occur if an endpoint of a link is deleted in such a way that the locator used to identify that endpoint is no longer valid. For example, if a link is created which addresses the fourth section of a four-section book, and the second section of the book is deleted, then an existential break has occurred: the endpoint addressed by the locator does not exist. A contextual break occurs when the structure of the document is changed so that the content located by the link is not what the link creator originally intended. In the previous example, if the endpoint of the link was the third section of a book, and the second section was deleted, a contextual break would occur.

For a further example, consider the situation shown in Figure 2.1. This figure shows two XML documents, and the corresponding document trees that result from these documents. The first document is a sports article written about a baseball game that took place the previous day between the Stars and the Moons. The second document is a collection of all box scores for the Stars for the current season. The box score in the article is actually a link to the first box score element in the box score document.

Now suppose that the Stars play their second game of the season, and the box score document is updated by placing the second game's box score above the first (perhaps to make it easy to find the most recent box score). Figure 2.2 shows the results of the modification to each of the documents. Because the link in the article document points to the first box score element in the box score document, its content will change due to the modification, and the link "breaks".

[Figure 2.1: The article (left) and box scores (right)]

[Figure 2.2: The documents after the update]

The possibility of existential breaks in a system that does not employ structural addressing is fairly obvious. It is also possible for a contextual break to occur. In the WWW, for example, it is possible to replace some information resource with a resource that has completely different content. A contextual break occurs if the new resource is given the same name as the original resource. Contextual breaks are much less likely to occur, however, if structural addressing is not employed.

The main focus of this work is to solve the problem of contextually broken links. When document modifications take place that cause a contextual break, our goal is to detect this fact automatically and repair the broken links. In particular, we will focus on finding and repairing broken links in documents written using XML in the context of the WWW.
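The following sketch is a rough reconstruction (ours, for illustration only; the markup vocabulary is invented) of the box score scenario from Figures 2.1 and 2.2. The article's link addresses "the first boxscore child of the root," and that structural address selects the wrong game once a newer box score is inserted ahead of the old one:

    import xml.etree.ElementTree as ET

    # Box score document as it existed when the link was created.
    before = ET.fromstring(
        "<boxscores>"
        "<boxscore game='1'><player name='Who' position='1st' hits='3'/></boxscore>"
        "</boxscores>"
    )

    # The same document after game 2's box score is placed above game 1's.
    after = ET.fromstring(
        "<boxscores>"
        "<boxscore game='2'><player name='Who' position='1st' hits='2'/></boxscore>"
        "<boxscore game='1'><player name='Who' position='1st' hits='3'/></boxscore>"
        "</boxscores>"
    )

    address = [0]   # structural address used by the article: first child of the root

    def resolve(root, path):
        node = root
        for i in path:
            node = list(node)[i]
        return node

    print(resolve(before, address).get("game"))  # -> 1  (the intended game)
    print(resolve(after, address).get("game"))   # -> 2  (contextual break: wrong game)

Both versions of the address resolve successfully, so nothing tells the reader (or the browser) that the second resolution no longer returns the content the author meant.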
Our work is not necessarily restricted to this context, as it applies in any system using structural addressing where the documents in question are tree-structured.

XML was created by the World Wide Web Consortium (W3C) as a successor to HTML, the original markup language used to describe resources on the WWW. There were many reasons that the decision was made to replace HTML with XML. One of the most compelling reasons guiding the development of the XML specification was the desire to eliminate the presentational nature of markup data, which would allow the markup to provide semantic information describing the nature of the data. In order to do so, one of the first changes that needed to be made was to eliminate the dependency on a fixed set of elements (commonly referred to as tags) that are available to mark up data. In this sense, XML is not a specific markup language, but rather a "meta" markup language that defines a common structure for an infinite family of markup languages. Rather than completely building a new standard for this meta markup language, the designers of XML leaned heavily on the Standard Generalized Markup Language (SGML) [18] standard. SGML is very complex, which led the designers of XML to eliminate many of the more esoteric features of the SGML language.

SGML provides a fairly simple means of creating links between elements in the same document [19]. Each element can have an attribute with a special type called ID, which must be unique within a document. Any other element that wishes to reference this element does so by having an attribute with a specific type called IDREF. The SGML processor enforces the restriction that all IDs used within a document are unique, and that any attribute of type IDREF contains a value that points to an existing element, within the same document, containing the referenced ID value. SGML does not provide any mechanism that can be used to provide inter-resource linking. In order to create an inter-resource link, an author must use a system that builds upon SGML, such as HyTime [20] or the Text Encoding Initiative (TEI).

The designers of XML realized that there is a need for a middle ground between the simplistic linking models of HTML and SGML and the complexity that is inherent in standards such as HyTime. Therefore, in addition to creating the XML standard describing the ways that documents would be marked up, they authorized the work on a standard means of describing links between XML resources (the XLink standard [21, 22]), and also a standard way of addressing sub-resources of those resources (the XPointer standard [23, 24]). XLink allows links to be created that are multi-ended, bi-directional, and out-of-line. In this way it was influenced by the linking model specified in the HyTime standard.

In order to specify links between documents, the standard Uniform Resource Locator (URL) reference means of locating a document on the WWW is used. If the endpoint of the link is the entire resource located at that particular URL, no further addressing is necessary. If, on the other hand, a sub-resource of the resource designated by the URL reference is to be the link's endpoint, a fragment identifier is appended to the URL, and the fragment identifier is then used to locate that sub-resource. If the resource named by the URL reference is an XML document, then the fragment identifier portion of the reference is an XPointer.
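For example, a link whose endpoint is a sub-resource might combine a URL and a fragment identifier along these lines (the host name and element names are hypothetical, and the traversal notation is written loosely in the style of the earlier XPointer working draft rather than quoting either specification exactly):

    http://www.example.com/boxscores.xml#root().child(1,boxscore)

The URL portion before the '#' locates the box score document, and the fragment identifier after it selects the first boxscore child of the document's root element. A fragment identifier that instead names a unique ID, something like #id(game1), is not affected by purely structural rearrangements in the way that the counting form is.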
For complete information on the ways that XPointers can be used to locate a sub-resource, see the specification. (Current practice is to refer to the standard as XPointer, while a particular instance is called an Xpointer.) Xpointers can be used in a way similar to an HTML link or an SGML IDREF reference to point to a uniquely named element within the target resource. Additionally, they can specify a traversal of the target document tree using locator terms such as ancestor, child, sibling, etc. In this way they are similar to the TEI and HyTime primitives. (Although the XPointer Candidate Recommendation [24] has changed the syntax used to specify tree traversal, the semantics remain unchanged. The version in [23] is somewhat easier to read, and so we will use its notation in this dissertation.)

In addition to allowing the specification of a particular node in the document tree, an Xpointer can also refer to a range of nodes, or to a specific portion of the content of a set of nodes. The addressing mechanism outlined in the XPointer specification essentially uses structural addressing, and thus is susceptible to the problem of contextual link breaks.

There are at least two major problems that must be solved in order to provide guaranteed contextual link integrity for a hypertext system.

The first problem is the necessity of algorithms that are able to locate the original endpoint of a link after the document containing the endpoint has been modified in a way that causes the original link to break. While simple string search algorithms would be sufficient to perform this task, they are not efficient enough to be practical when the referenced document is large and there are a large number of links referencing that document. The fact that the hypertext documents we are considering are very regularly structured allows other algorithms to be developed which realize considerable performance gains over a simple string searching technique. Furthermore, traditional string searching algorithms are meant to deal with plain text only, and would need to be modified to understand the difference between content and markup in a hypertext document.

The second problem that must be solved is to allow the introduction of link integrity guarantees into the WWW without requiring substantial changes in the protocols and document markup languages currently used to implement the WWW. As Bouvin states,

Adhering to standards has a large part of the success of the Web and the very size of the Web has enormous inertia, so an attempt to replace the Web with something perhaps more advanced in certain aspects is, if not doomed, then up against tremendous odds [25].

Protocols must be developed that allow information providers on the WWW to remain as independent as possible. We must not require all authors to use a specific authoring tool along with a specific web server, since systems with these restrictions are not likely to gain widespread acceptance. Instead, a mechanism must be developed where users are still free to use whatever tools they are accustomed to using to author and distribute documents.

The problem of maintaining link integrity while still allowing web systems to remain independent is a challenging one. In order for a system to successfully determine which links have been broken by a document modification, it must have knowledge of the existence of each of the links that point into the document in question.
Acquiring this information independently (that is, without explicit notification of link creation by the referring document's author) has been proven impossible [26], so we must come up with an unobtrusive means of communicating this information between servers.

Any system that attempts to prevent contextual link breakage will require frequent parsing of structured hypertext documents. For a server that handles heavy traffic, fast and efficient parsing is of paramount importance. The widespread existence of multiple processors in today's computing environment leads to the possibility of parsing documents in parallel. Parallel algorithms must be developed to take advantage of the processing power that is available.

After surveying the capabilities of several historical hypertext systems and their approach to solving the broken link problem in chapters 3 and 4, we present our solutions in chapters 5, 6 and 7.

Chapter 3

Background

In this chapter, we discuss several previous hypertext systems and the capabilities that they provide in terms of linking and link maintenance.

3.1 Dexter Reference Model

The Dexter Reference Model [8, 27] is not a specific hypertext system. Instead, it is intended to be an abstract model of a generic hypertext system. While the specific notion of document and link can be very different among various hypertext systems, it is helpful to specify an abstract model of the system so that reasoning can be done about the properties and behavior of the abstract model. This reasoning will then be applicable for any actual system for which a mapping can be found from that system's data and linking model to the Dexter model. This idea is similar, for example, to the motivation behind the OSI seven-layer network model [28].

A second goal of the Dexter model is to provide a common base of terminology for the hypertext community. Every hypertext system has its own set of terms and definitions that it uses, which can make comparison between systems difficult. The Dexter model was in part created so that hypertext researchers might be able to avoid communication difficulties due to a simple difference of vocabulary. As is the case with any abstract model, the Dexter model uses the terms it defines in a way that is separate from any specific hypertext system. For example, the term assigned to refer to data objects in the system is a component, while common hypertext systems often refer to these data objects as documents or nodes.

Parts of the Dexter Model

The Dexter model separates an abstract hypertext system into three layers: the runtime layer, the storage layer and the within-component layer. These layers are listed in decreasing order in terms of level of abstraction in the software design process. This means, for example, that the within-component layer is closest to the physical representation of the data in the system, while the runtime layer is the view that the user of the hypertext system sees. Interestingly enough, it is the middle layer of this model (the storage layer) that the Dexter model chooses to focus on. The within-component and runtime layers are deemed to be too diverse for the creation of a general model which can cover the range of possible implementations for these layers.

The purpose of the storage layer is to model how "nodes" and "links" are put together to create a network of information. As previously mentioned, the Dexter term for node is component.
A component is an abstract data object, with no restriction of the type of object (text, graphic, sound, etc.) or the format in which the object is stored. Along with defining the notion of component, the model also specifies two functions that can be used to identify and access components as part of the storage layer. The first function is the accessor function, which is responsible for taking a unique ID (UID), which all components have, and finding the component in question. In the WWW, for example, this corresponds to the idea of the GET method operating on a URL.

The resolver function is responsible for mapping a more general description of a component to the UID of the component. This function might be responsible for mapping a request such as "the component describing the life of Martin Luther King" to a specific UID of a component matching that description. The WWW does not have an analogue to this function; the URL of a document is generally the only way to retrieve its contents, and corresponds to the specific UID in the Dexter model.

To facilitate addressing parts of components, the Dexter model defines the concept of anchors. Anchors are also made up of two parts, similar in functionality to the roles of the accessor function and the resolver function. The first member of an anchor object is an anchor ID, which is a unique identifier within the scope of the component the anchor is associated with. The value of the anchor ID remains constant throughout the lifetime of the anchor. The second member of an anchor is the anchor value. This is an arbitrary value which can be used by the within-component layer to resolve to a specific part of a component. In order to maintain the separation between the within-component and storage layers, the anchor value has no meaning to the storage layer. For this dissertation, it is interesting to note that the model states that:

As a component changes over time (e.g., when it is edited within the runtime layer), the within-component application changes the anchor value to reflect changes to the internal structure of the component or to reflect within-component movement of the point, region, or items to which the anchor is conceptually attached. The anchor id, however, remains constant, providing a fixed referent that can be used to specify a given structure within a component [8].

It is the implementation of this behavior within the within-component layer into which the algorithms defined in chapter 5 fit.

Once we have the notion of components and anchors, it becomes possible to describe the mechanism used to describe an endpoint of a link. This mechanism is called a specifier, and its main feature is a (component specification, anchor ID) pair. A link in the Dexter model is simply a set of two or more specifiers, providing links that are multi-ended as well as bidirectional. (The direction of an anchor is also contained as part of a specifier.)

A further aspect of the Dexter model is a specification of a set of abstract functions that can be used to operate on the model contained in a hypertext. Some of these operations, sketched in code after this list, are of particular interest to this dissertation:

- The function LinksToAnchor takes an anchor and its containing component as input, and returns the set of links in the hypertext that refer to the anchor in question.
- The function LinksTo takes a hypertext (which is an enumerable set of components and links) and a component ID as input, and returns the set of links resolving to that component.
- The function ModifyComponent modifies a component, ensuring that the component remains consistent, and that the resulting hypertext remains link consistent. Here link consistent means that the component specifiers of every link specifier must resolve to existing components.
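The sketch below is our own rendering of these storage-layer notions, not part of the Dexter specification itself; the names follow the model's terminology, but the concrete representation (and the dictionary used to stand in for a hypertext) is assumed for illustration only.

    from dataclasses import dataclass, field

    @dataclass
    class Anchor:
        anchor_id: str        # fixed for the anchor's lifetime
        value: object         # opaque to the storage layer; interpreted within-component

    @dataclass
    class Component:
        uid: str
        content: object
        anchors: dict = field(default_factory=dict)   # anchor_id -> Anchor

    @dataclass(frozen=True)
    class Specifier:
        component_uid: str
        anchor_id: str
        direction: str = "TO"                          # e.g. FROM / TO / BIDIRECT

    @dataclass
    class Link:
        specifiers: tuple                              # two or more Specifier objects

    def links_to_anchor(hypertext, component, anchor):
        """Links whose specifiers name this (component, anchor) pair."""
        return [link for link in hypertext["links"]
                if any(s.component_uid == component.uid and
                       s.anchor_id == anchor.anchor_id
                       for s in link.specifiers)]

    def links_to(hypertext, component_uid):
        """Links that resolve to the given component, regardless of anchor."""
        return [link for link in hypertext["links"]
                if any(s.component_uid == component_uid for s in link.specifiers)]

ModifyComponent is the operation that must consult such link sets before committing a change, which is precisely the hook that the detection and repair algorithms of chapter 5 implement for XML documents.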
Although not the major focus of the Dexter model, we briefly outline the functionality of the runtime layer. This layer is responsible for presenting the nodes and links of the hypertext to the user, facilitating browsing and authoring of the hypertext. It is through this layer that the user instantiates components of the hypertext, making cached copies of those components currently in use during that particular session. In particular, the runtime layer is responsible for making the presence of anchors and links within the hypertext known to the user, and controlling traversal of those links. It is also responsible for handling session-related information, such as a history of the components visited, and providing runtime versions of the storage layer's resolver function. Finally, it provides the means through which components are instantiated (created from their storage layer models) and written back into the storage layer, a function known as RealizeEdits.

Applications of the Dexter Model

At the time of the development of the Dexter Model, the editors of the reference state that "The model as currently stated is far more powerful than any existing hypertext system. The provisions for n-ary links and for composite components, for example, are intended to accommodate the design of future hypertext systems." The design of the linking and addressing specifications for the second generation of the WWW, proposed in coordination with the XML specification, fulfills some of the advanced design goals of the Dexter system. However, some aspects of the Dexter model have been left unimplemented by the WWW. In particular, the link consistency requirements stated by the model are unfulfilled, a deficiency this dissertation addresses.

3.2 Intermedia

The Intermedia system [29] is significant because it is one of the first hypermedia systems to recognize the importance of two hyperlinking concepts. Of these concepts, of particular importance to this dissertation is the idea that link anchors should be constructed as "any entity that the user can select in that particular application." Of course, if link anchors are to be allowed to be specified in this general a way, there must also be some way of identifying the portion of the resource that the user selected. In the Intermedia system, it is the responsibility of each individual editing/browsing application to implement this function. Since the data needed to identify an anchor
The Intermedia system means of accomplishing this “universal linking” is through integration Of linking services into applications through the use Of an Object-oriented programming framework [30]. Link creation is performed through menu Operations “Start Link” and “Complete Link”, a metaphor quite similar to the “Edit/Paste” Operations users Of graphical user interfaces are quite familiar with. Most Of the details Of creating links and the initiation Of following links is handled by the application framework. This, Of course, limits the degree Of “Openness” related to linking present in the system, since it is only those applications that have been derived from the application framework that can participate in linking. The development Of the Intermedia project ended in 1990, and as a result a study [30] was performed which evaluated the usefulness Of various linking capabilities such as bi-directional and multi-ended links. This study established their importance as necessary for any hypertext system in which effective work can be performed. 21 3.3 Hytime The HyTime standard [5] is an International Standards Organization (ISO) docu- ment outlining a standard way Of describing the structure Of time-based hypermedia documents [31]. AS such, it can be used tO implement hypermedia systems that are made Of much more than simple hypertext documents. By using just a subset Of the facilities HyTime provides, however, the standard can be used to describe a hypertext facility. Linking in HyTime HyTime provides a linking model that includes two different kinds Of linking con- structs: contextual links (clinks)2 and independent links (ilinks). A contextual link is used to provide the type Of link that users Of the WWW are familiar with. In this type Of link, one of the endpoints Of the link is the actual linking element. The link asserts a relationship between the linking element itself and the element it references. A normal use for this type of link is as a cross-reference. For example, a link might be created between a term in a document and the definition Of that term in a dictionary document. An independent link is a link that connects more than two locations, is stored in a document other than the documents it is connecting, or both. One use Of this type of link might be tO connect an article tO a series Of reviews Of that article by different 2It is important to note that the use Of this term is quite different than the notion Of contextually broken links that we introduced in chapter 2 22 reviewers. Or, an ilink might be created tO connect a specific word in a document to entries in a dictionary, a thesaurus, and an encyclopedia. Links that are located in a resource other than the resources connected by the link are called out of line links. These types Of links are useful when the resources tO be connected are not owned by the entity wishing to create the link, and are therefore read—only to that entity. For example, suppose an instructor uses a particular on-line text for an Operating systems course. A student in the course wishes to create links from the textbook used in the course to several different online Operating systems texts, so that the student would be able to examine several different discussions of the same topic. Since the student most likely does not have write permission on the course textbook document, it would be impossible tO add the link directly into the document. However, using out-Of-line links the student could create a separate docu— ment containing the links. 
While reading the course textbook, the hypertext system is responsible for determining which Of the nodes in textbook have links pointing to other textbooks. Addressing in HyTime HyTime provides three different ways Of addressing sub-resources within a particular resource. The three types Of addressing include addressing a node by name (using an ID attribute), specifying a series Of counting terms that specify the location Of the sub-resource, and through a general-purpose querying facility. 23 In general, specifying a sub-resource by name is similar to ID/IDREF mechanism provided by SGML. It is also possible to use this type Of addressing to locate a named entity. An entity is a peice Of data that can be referenced using a name explicitly declared for it. The entity itself may be something as simple as an abbreviation for a literal string, or may represent an external data object such as a file Of SGML or non-SGML data [19]. When using counting to find a sub-resource, HyTime provides an extremely general counting mechanism. A counting locator in HyTime is called a dataloc. A dataloc divides information content into components (called quanta), and then counts the quanta [31]. Different means Of breaking the content into components are possible. For example, one quantity that might be counted would be the bytes Of the digital representation Of the document. Another quantum might be time divisions in a digital representation Of a video recording. Or, the divisions might be the markup tags indicating the elements Of an SGML document. A specific type Of dataloc called a treeloc is of particular interest. A treeloc is used tO Specify a traversal Of a tree representation Of a resource. HyTime does not specify that resources addressed using a treeloc must be SGML documents. Any resource that can be represented using a tree can be addressed using a treeloc locator. However, they are particularly useful for SGML documents since all SGML documents can be conveniently represented as a tree. The treeloc element specifies a traversal that starts at some location source, and follows a specific path down the tree. A second type Of tree location term is a relloc, which is a more general form of treeloc that also allows traversal up the tree, and also 24 across the tree. It does this through the specification of one of five types Of traversal as part Of the relloc locator. These five types Of traversal are anc, esib, ysib, parent and children, which specify the ancestors, elder Siblings, younger siblings, parent, and children Of the location source, respectively. A further type Of counting locator is the fcsloc, which is used to specify some location using a coordinate system. This type Of locator is useful for identifying portions of graphical images, for example. The final category Of locators that HyTime provides are query locators. This allows the specification Of a query, in some query language, that is executed tO find the desired sub-resource. The querying facility Of HyTime is perhaps the most robust, but coming up with a query to identify a particular sub-resource is difficult, since computer understanding of text is still not very reliable. 3.4 Hyper-G The designers Of the Hyper-G network information system see the WWW as a first generation attempt at a world-wide distributed information service [32]. 
They recognize several problems with the current state of affairs on the WWW that they intend to address [33]:

- Provide orientational and navigational aids
- Provide automatic structuring and maintenance
- Reduce fragmentation across servers
- Support user identification and access control
- Support multilinguality
- Maintain interoperability with existing systems

Hyper-G addresses each of the fundamental shortcomings of the WWW. Like the WWW, Hyper-G is a client-server architecture. One of the fundamental differences between the WWW and the Hyper-G system is that in Hyper-G, the underlying mechanism used to store information within a hypertext is an object-oriented database, as opposed to the WWW where the storage mechanism is (usually) a traditional file system. Placing all information available to the system under control of the Hyper-G server makes the maintenance of link integrity much simpler, since the server is able to determine which other information objects will be affected by a change, and can notify those objects of changes necessary to maintain consistency. Change notifications are made through a scalable flooding algorithm [34] so that all other servers are kept up to date. However, perhaps the major disadvantage of the Hyper-G approach is that

documents have to be imported into the environment from the development environment, they have to be translated into a format for which a Hyper-G viewer has been written, and more generally that they have to be owned by the database, i.e. brought into the author's domain of authority; in other words, you can't link things you don't own (emphasis mine) [35].

Chapter 4

Related Work

4.1 Maintaining Link Integrity

Davis [9] gives an overview of the various strategies that an open hypermedia system can take when attempting to solve the broken link problem. In the following paragraphs, we give an outline of these strategies along with an analysis of the strengths and weaknesses of each approach.

4.1.1 Existential Breaks

The simplest approach to preventing broken links is to put the burden entirely on the user to repair any links that might become broken. Although this approach sounds overly simplistic, it is in fact the method used to maintain link integrity in the largest hypertext system currently in use, the WWW. Indeed, the simplicity of this approach may well have led to the rapid widespread acceptance of the WWW. However, there are at least two serious problems with this approach. The first of these problems
Periodically a process called the Web—Walk process reads the Change Log Table and either automatically makes a change made necessary by the operation, or notifies the owner Of the referencing document if the appropriate action cannot be determined automatically. The entries in the Change Log Table are made by the authors themselves, which allows them to use whatever editing tOOls they normally use tO author their documents. Unfortunately, since the logging Of changes is a manual process, dependent upon the authors doing it (and more importantly, doing it correctly), this approach is quite susceptible tO error and very difficult to dO correctly. Furthermore, there is a period Of time during which broken links exist, dependent upon the length Of time between Web-Walk’s. Another approach to dealing with the problem Of broken links (either contextual or existential breaks) is to attempt to provide a permanent name, independent Of location and structure, for the Object that is to be identified. This type Of approach 28 can be further broken down based on whO/ what is expected to maintain the mapping between the current location Of the Object and the permament name. Generally, this can be either the human owner Of the Object or a software program acting on the owner’s behalf. Persistent URL (PURL)’S [38] are one example Of the former strategy. The relation- ship between an object’s PURL and its current, location dependent name (URL) is similar tO the relationship between a host name and an IP address in the Domain Name System (DNS) [39]. A PURL server provides functionality similar to a DNS server. It takes requests for a PURL and redirects those requests to the URL that is the current location of the Object identified by the PURL. If the owner of the Object decides tO alter the location Of the Object, it is up tO the owner to send a message to the PURL server(s) in charge Of mapping that PURL. The main advantage Of providing this level Of indirection is that it does not require a document to know about the documents containing references tO it. By providing the redirection, the notification Of a change in location for a resource only needs tO occur once, and the owner Of the resource that has changed location should have permission to make this change. There are two main disadvantages to this type Of approach. The first disadvantage is the requirement Of manual update Of the link resolution service (the PURL server), since we know that a repetitive manual task is likely to be forgotten or ignored by a human Operator. The second disadvantage Of this system is that it only provides link integrity for links that point to resources that have been deemed important enough tO have been assigned a name by the author 29 of the resource. This does not allow users tO create links to sub-resources Of the the resource that the author may not have anticipated. The PURL approach centralizes the management Of the names used to identify re- sources, but still allows those resources tO be maintained in any way the user desires. Another approach that can be taken is to centralize the management Of the informa- tion stored in the hypertext system in such a way that the system is able tO easily maintain the integrity Of links. TO do SO requires that users Of the system interact with the hypertext system through Specific tools such as specialized editors and servers. 
Through the use Of these tools, the system can be made aware of any changes that are taking place that may break existing links, and update any references to links that would be broken by the modification in question. Hyper-G [6], described in section 3.4, is one such system that takes this approach. The WBObjects architecture [40] is another system that adopts this approach. The main disadvantage with this approach is the closed nature Of the system. Users do not like to be told what tOOlS to use, and Often may not be able to use the tools due to their unavailability on the specific computing platform on which they are working. 4.1 .2 Contextual Breaks The approaches mentioned above work best to deal with existential breaks. Further approaches are necessary to deal with the problem Of contextual breaks. Some of the approaches that have been proposed include versioning, link aware editing tools, “just-in-time” link repairs and using queries to specify the endpoints Of links. 30 In the versioning approach, the problem Of broken links is eliminated because when a link is created, it points to a specific version Of the referenced resource. If that resource is subsequently changed, it is not replaced by the new version Of the resource; instead the new version and the Old version CO-exist. Since the link refers to the Old version Of the document, it will not be broken by the introduction Of the new version. Of course, in a versioning approach, the amount Of storage Space required for the resources stored in the hypertext system increases in proportion to the number of times the resources are changed. In addition, the risk Of users editing or referring tO Old versions Of the resources is present. Davis mentions the “diff” approach, which could be used instead Of maintaining separate versions Of the document. In this system, the difference between an original version Of a document and a modification Of the document is computed. Then a “just- in-time” repair method can be used which uses the difference file tO resolve a link that is known to be broken. Implementations Of diff exist that can be used to create a byte- by-byte difference file, which would be useful if links use byte addresses tO specify link anchors. Byte addresses are one Of the least robust means Of identifying link anchors, which means that a large fraction of the links might be broken by a minor edit to the document. Furthermore, they are very unintuitive and difficult for users tO work with [31]. When using structural addressing mechanisms, a difference algorithm that is able tO Operate on tree-based structures is needed. In [41], the authors give an algorithm that can do so in 0(nd) time. However, even with efficient algorithms that compute a useful difference, determining the difference between two documents involves more 31 work than really needs to be performed, since we are only interested in the changes to the parts Of the document that are currently the endpoints Of links. More recently, in [42], Phelps and Bilensky describe a system using what they call )7 “Robust Locations. A robust location consists Of a series Of location methods, in- cluding a core set Of methods made up of unique ID’s, tree-walks and context. Further sets Of location methods can be added by implementors if so desired. Robust locations are complex enough that the authors suggest they always be machine generated. This method requires that repair Of broken links be left up to the client, using the information encoded in the robust location. 
They propose a specific repair algorithm that attempts to take advantage of the tree-walk portion of the location and does a "spiralling" search of the content around the original location, attempting to find the originally intended content. If the tree-walk portion fails according to some predefined threshold value, a string-matching algorithm is used to try to match the context of the link.

A main point in which this dissertation differs from the Robust Location proposal is in where the computation providing link integrity is performed. Phelps and Wilensky argue that reattachment should be performed by the browser. This requires significant new implementation by the browser, and also means that it is impossible for an organization to guarantee that all users will be prevented from following a broken link. Instead, only those users that use a reattachment-capable browser will be guaranteed contextual link integrity. It is our opinion that allowing the server to perform link reattachment is a better option for information providers. Once the server implements link maintenance procedures, any user browsing that server's documents will be ensured that links pointing to that server will be correct.

Using queries to specify the anchors of links seems like a reasonable approach, and in some cases is a very good approach. For example, if one is creating a link to a list of items in a catalog that fall into a certain price range, a query mechanism seems perfectly reasonable. When links are created based on the content of the resource being linked to, rather than being based on attributes of that resource, however, querying is not really a possibility, since it is difficult to construct a query that uniquely identifies the desired content (and only the desired content).

Another useful classification of the various approaches to maintaining link integrity is one that determines how "open" the solution is. That is, to what degree is the user of such a system able to interact with the system in any way that he/she desires? It is our goal to provide a system that provides a high degree of correctness in an environment where the user is completely unaware of the existence of the link integrity problem. Furthermore, if we require the user to accept a small amount of responsibility, then we should be able to guarantee the integrity of links in all cases. Davis provides a categorization which classifies integrity maintenance approaches into the following classes:

• Don't Bother: In this approach, the user makes links, and if the links break it is determined not to be that important.

• Avoid the Problem: Links are expressed declaratively or in terms of an algorithm that will hopefully resolve to the intended anchor.

• Loosely Coupled: The system itself provides no means of guaranteeing link integrity, but it provides a set of tools that the user can use to do so. If the user chooses not to use the tools, link integrity is compromised.

• Automatic Link Repairs: The system knows that links can be broken, and takes responsibility for detecting and repairing broken links.

• Tightly Coupled: The most restrictive class; systems that adopt this approach integrate link management responsibility with authoring software and servers.

4.2 Our Approach

Our approach to maintaining link integrity combines many of the concepts mentioned in the above work, while attempting to provide solutions to many of their disadvantages.
In Davis' categorization scheme described above, our approach falls somewhere between "Loosely Coupled" and "Automatic Link Repairs." We hope to be able to provide automatic link repairs that are guaranteed to work if the user uses a (minimal) set of tools to notify the system about the existence of links and inform it of changes to the documents it manages. When the user neglects the responsibility of using the tools, the system should be able to deal gracefully with the situation and still provide link integrity with a high probability.

We have developed algorithms that can efficiently detect and repair broken links. These algorithms require knowledge of the links pointing into a document and prior notification when a document is about to be replaced with a new version; they are described in detail in chapter 5. We have additionally developed a communication protocol and WWW agent service that can be used to provide a "link mapping" service so that existing web servers can take advantage of the algorithms described above in order to provide link integrity.

One thing that we are not proposing to deal with is the possibility that the content specified by a specific link may have been modified in some way, or perhaps even deleted. Strategies for solving this problem would involve devising some sort of semantic distance metric that could be used to measure the distance between a portion of the content in a modified version of a document and the original content, and then deciding whether any portions of the modified document are "close enough" to be considered the original content. Indeed, the recent "robust locations" work cited previously falls into this category. While this is certainly possible, we do not consider it in this dissertation.

Recently, many systems and methods have been proposed for implementing "open hyperlinking", which allows the users of a hypermedia system to create links between information of widely varying types (for example, DLS [43]). Bouvin provides a good summary of these approaches in [25]. Providing guaranteed link integrity in systems where the endpoints of links can be arbitrary types of information is a more ambitious goal than this dissertation will address.

Chapter 5

Finding and Repairing Broken Links

This chapter outlines algorithms that efficiently detect which links will be broken by a document modification. Furthermore, once it is determined that a link is broken, we describe algorithms that can locate the referenced node in the modified document in some restricted cases.

5.1 Detecting Broken Links

Definitions

H(N): the height of the sub-tree of the document rooted at node N, where height is defined in the traditional sense.

ES(N): an encoding of the element structure of the sub-tree of the document rooted at node N.

Link target: a node in a document tree which is directly pointed to by a link.

Contained links: a link L1 is contained by a link L2 if the link target of L1 is a descendant of the link target of L2.

Document tree: the tree representation of an XML document. The representation we will use is the Document Object Model (DOM). We give a simple description of the DOM here; for a more complete description, see the standard [44]. In the DOM, a document tree is made up of several different types of nodes, with the most significant types for this discussion being the element and text nodes.
Element nodes represent the element markup within the document; text nodes encapsulate the document content contained by the elements and by definition are leaf nodes. Note that all document content is contained by some element.

Link tree: a tree whose nodes are the set of nodes that are targets of links for a particular document. The nodes of this tree are a subset of the nodes in the document tree representing that document. The link tree has an artificial root node as its root, which is not a node in the document tree. This is necessary in order to guarantee that the link tree is a tree.

Node equivalency: node comparison is an overloaded operation. Two element nodes are considered equal if their tag names are the same and they have the same number of children. Two text nodes are considered equal if a string comparison of the text contained in the text nodes shows them to be equal. We will use the symbol ≡ to denote the node equivalency test.

5.1.1 Increasing Detection Efficiency

The brute-force method of detecting broken links involves checking each of the links to the original document and deciding if the content of the node(s) referenced by the link has changed in the modified document. However, it may be the case that hundreds or even thousands of links to the document exist. A sequential walk through the link list, checking each link independently, could take a substantial amount of time. We can certainly do better than this due to the hierarchical structure of XML documents.

One way of reducing the amount of work required to detect broken links is structuring the order in which the links are examined. When there are a large number of links to a document, it is likely that some of the links will contain other links. Since a containing link should be considered broken if any of its contained links are broken, we can reduce the amount of work done by the link detection algorithm by checking broken links in a "bottom-up" fashion.

Consider the example shown in figure 5.1. In figure 5.1(a), we see a tree representing the original version of a document. Figure 5.1(b) shows the document after one node has been modified, and 3 new nodes have been inserted.

Figure 5.1: A document and its links before and after modification (part (a) shows the original document, part (b) the modified document; the key marks the modified node and the newly inserted nodes)

Figure 5.2 illustrates the nodes that would be checked by a "brute-force" algorithm that compares the content of each explicit target in the original tree to the corresponding nodes in the new document tree.

Figure 5.2: Nodes examined by the brute-force algorithm (the key distinguishes the nodes needed to determine whether links have been broken from the nodes compared by the brute-force algorithm)

We can see that this algorithm clearly does more work than is necessary, due to the hierarchical nature of XML document content. If the first comparison performed by a broken link detection algorithm was to determine if link 3 is broken, we could avoid comparing nodes 2, 5, 6, 7 and 12 altogether, since the target of link 3 is the second child of node 6, and that node is an implicit target of link 1.

In a large document, structuring the link database so that links are checked in a "bottom-up" fashion can eliminate large amounts of work that would otherwise be spent comparing nodes unnecessarily. This link structuring is the fundamental idea in our broken link detection algorithm.
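To make the node equivalency test concrete, the following minimal Java sketch applies the definition above to two DOM nodes. It is only an illustration, not the implementation evaluated later in this chapter (which is built on the SAX interface); the class and method names are ours.

    import org.w3c.dom.Node;

    public final class NodeEquivalence {

        // Returns true if a and b are equivalent under the definition in section 5.1:
        // element nodes must agree on tag name and number of children, text nodes on
        // their character content.
        public static boolean equivalent(Node a, Node b) {
            if (a.getNodeType() != b.getNodeType()) {
                return false;
            }
            switch (a.getNodeType()) {
                case Node.ELEMENT_NODE:
                    return a.getNodeName().equals(b.getNodeName())
                            && a.getChildNodes().getLength() == b.getChildNodes().getLength();
                case Node.TEXT_NODE:
                    return a.getNodeValue().equals(b.getNodeValue());
                default:
                    // Other node types (comments, processing instructions) play no role
                    // in the algorithms of this chapter.
                    return true;
            }
        }
    }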
5.1.2 The Algorithm

Let Tree_old and Tree_new be the tree structures representing the old version of a document and a new version of the document which is about to replace the old version, respectively. Let Tree_links be a link tree for Tree_old. The links in Tree_links are structured in such a way that a node child is a child of a node parent if parent contains child (we give an algorithm for creating Tree_links in algorithm 5.3). Our goal is to determine which of the links in Tree_links should be marked as modified, where a link is considered modified if the explicit target of the link or any of its descendants are not equivalent in the new tree structure. We wish to do this in such a way as to perform as few node comparisons as possible.

Algorithm 5.1 is used to decide if the target of a link has been modified. This algorithm is used by algorithm 5.2 to examine the targets of each link in Tree_links. At the conclusion of algorithm 5.2, all of the explicit targets have been correctly marked as either modified or unmodified.

Algorithm 5.1 Comparing a node
/* Input: node is the node to be compared.
   Returns: true if node has been modified, false otherwise.
   Tree_old(node) ≡ Tree_new(node) means to apply the equivalency operator to the
   nodes in the same positions in Tree_old and Tree_new. */
if node is already marked then
    return marking of node
end if
set is_modified = not (Tree_old(node) ≡ Tree_new(node))
if node is not modified then
    for all descendants of node do
        if not (Tree_old(current_descendant) ≡ Tree_new(current_descendant)) then
            set is_modified = true
            break out of FOR ALL loop
        end if
    end for
end if
return is_modified

Algorithm 5.2 Determining which links have been broken
for all nodes in a post-order traversal of Tree_links do
    if node_current is unmarked then
        use algorithm 5.1 to determine if node_current is modified
        if node_current is modified then
            set temp = node_current
            while parent(temp) exists do
                mark temp as modified
                set temp = parent(temp)
            end while
        else
            mark node_current as not modified
        end if
    end if
end for

Algorithm 5.3 place_node
/* Inserts a new node new_node into the subtree rooted at parent_node. */
for all children of parent_node do
    if new_node is a descendant of child_current then
        place_node(child_current, new_node)
        return
    end if
end for
insert new_node as first child of parent_node
for all siblings of new_node do
    if current_sibling is a descendant of new_node then
        remove current_sibling from parent_node's child list
        insert current_sibling into new_node's child list
    end if
end for

5.2 Analysis

We claim that algorithm 5.2 is an optimal solution to the problem of determining which links of a set will be broken by a document modification. Our definition of optimality consists of the following two criteria. First, an optimal broken link detection algorithm must compare any given node in the document tree at most once. Secondly, when determining whether a particular target is modified, an optimal broken link detection algorithm will compare at most one modified node in the subtree rooted at the target.

Algorithm 5.2 has the optimality properties described above. For the first property, we see that nodes are marked at the time they are compared. If a node is already marked, it will not be compared again (the initial test in algorithm 5.1). For the second property, the post-order traversal of Tree_links guarantees that any links contained by a link will have already been visited at the time that link is visited. If any descendants of those contained links are modified, then the link will have already been marked as modified as a result of the ancestor-marking loop in algorithm 5.2. If the link does not contain any other links, then the early exit from the descendant scan in algorithm 5.1 guarantees that no other descendants of the link will be visited after the first modified descendant is visited.
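The following Java sketch shows one way the two algorithms above could be realized together. It is a sketch only: it reuses the NodeEquivalence helper from section 5.1, it assumes the link tree (algorithm 5.3) has already been built by the caller, and it assumes a caller-supplied function that maps a node of Tree_old to the node in the same position of Tree_new; none of these names come from the dissertation's actual implementation.

    import org.w3c.dom.Node;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;

    public final class BrokenLinkDetector {

        // One node of the link tree: an explicit link target plus the links it contains.
        public static final class LinkNode {
            final Node target;                      // the target node in Tree_old
            final List<LinkNode> children = new ArrayList<>();
            LinkNode parent;                        // null for the artificial root
            Boolean modified;                       // null while the link is unmarked
            public LinkNode(Node target) { this.target = target; }
        }

        private final Function<Node, Node> samePositionInNewTree;
        private final Map<Node, Boolean> marks = new HashMap<>();

        public BrokenLinkDetector(Function<Node, Node> samePositionInNewTree) {
            this.samePositionInNewTree = samePositionInNewTree;
        }

        // Algorithm 5.2: post-order walk over the link tree, marking broken links and
        // propagating the mark to every containing link.
        public void markBrokenLinks(LinkNode artificialRoot) {
            for (LinkNode child : artificialRoot.children) {
                visit(child);
            }
        }

        private void visit(LinkNode link) {
            for (LinkNode contained : link.children) {
                visit(contained);                   // contained links are checked first
            }
            if (link.modified == null) {
                if (subtreeModified(link.target)) {
                    for (LinkNode l = link; l != null; l = l.parent) {
                        l.modified = Boolean.TRUE;  // a broken contained link breaks its containers
                    }
                } else {
                    link.modified = Boolean.FALSE;
                }
            }
        }

        // Algorithm 5.1: a node is modified if it, or any descendant, fails the
        // equivalency test against the node in the same position of the new tree.
        // Results are memoized so that no document node is ever compared twice.
        private boolean subtreeModified(Node oldNode) {
            Boolean known = marks.get(oldNode);
            if (known != null) {
                return known;
            }
            boolean changed =
                    !NodeEquivalence.equivalent(oldNode, samePositionInNewTree.apply(oldNode));
            for (Node c = oldNode.getFirstChild(); c != null && !changed; c = c.getNextSibling()) {
                changed = subtreeModified(c);       // stops at the first modified descendant
            }
            marks.put(oldNode, changed);
            return changed;
        }
    }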
5.3 Repairing Broken Links

For each of the links marked broken in algorithm 5.2, we must find a node in the modified document tree that corresponds to the original target node. This problem is difficult if arbitrary document modifications are allowed. For example, if changes to the element structure of a linked-to node or changes to the actual content of the resource are made, it will not be possible to find an exact match in the modified document. However, in certain restricted circumstances, the problem is not as hard, and we can devise efficient algorithms to solve it.

Suppose a link L points to a node N1 in document D1. Editing D1 in some way produces document D2. If the following three conditions are met, then it is a fairly trivial matter to produce a new link L' which points to a node N1' whose element structure and content are the same as the structure and content of N1:

1. The element structure of N1 is not changed.
2. The content of N1 is not changed.
3. The XPointer for link L is not a spanning XPointer.

One way of thinking of these restrictions is that the referenced sub-tree has been "cut" from its original location in the original document and "pasted" in its entirety into a new location in the modified document. Figure 5.3 shows examples of edits that do and do not satisfy these restrictions.

A simple brute-force algorithm for finding N1' would be to examine each node in D2, comparing its content and element structure to that of N1. However, given the assumption that the element structure has not changed, many of the nodes of D2 cannot be the node we are seeking. Only nodes N for which the sub-tree rooted at N has the same height as the node referenced by L could be the node we are looking for. Using this knowledge, we greatly reduce the number of nodes that are considered candidates for being the new target node. We can further prune this set by comparing the element structure of the sub-tree rooted at each node in the set to the structure of the original target N1. If the element structure is not the same, we can remove the node from the set. Once we have pruned the set as much as possible, we can compare the element content of each of the nodes in the set to the content of the referenced resource, and return the set of nodes that match.

To implement this algorithm, we maintain a couple of data structures as the modified document D2 is parsed. First, for each sub-tree height H we keep a list of nodes in D2 for which the sub-tree height of the node is H. Second, an encoding of the sub-tree structure for each element is maintained to allow us to quickly compare the structure of two elements. When we discover a broken link, we use algorithm 5.4 to search for a new target node in the modified document tree.

We can further optimize this algorithm by maintaining a pointer into each set of nodes indicating where in the set the last search completed. The next time this set is searched, the search will start at the pointer rather than at the beginning of the set. This is helpful if link targets that were close to each other in the original document tree remain close in the modified document tree. This will often be the case, such as when a common ancestor is moved from one location to another.
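As a rough illustration of these two data structures, the Java sketch below groups the elements of the modified document by sub-tree height and records a simple structure encoding for each element. The concrete encoding used here (nested tag names) is our own assumption; the dissertation leaves the exact form of ES(N) open, and text content is compared in a separate step.

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public final class CandidateIndex {

        private final Map<Integer, List<Element>> nodesByHeight = new HashMap<>();
        private final Map<Element, String> structureEncoding = new HashMap<>();

        // Indexes every element of the modified document D2; returns the root's height.
        public int index(Document d2) {
            return visit(d2.getDocumentElement());
        }

        private int visit(Element e) {
            int height = 0;
            StringBuilder encoding = new StringBuilder(e.getTagName()).append('(');
            for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() == Node.ELEMENT_NODE) {
                    Element child = (Element) c;
                    height = Math.max(height, visit(child) + 1);
                    encoding.append(structureEncoding.get(child));
                }
            }
            encoding.append(')');
            structureEncoding.put(e, encoding.toString());
            nodesByHeight.computeIfAbsent(height, h -> new ArrayList<>()).add(e);
            return height;
        }

        // Candidate new targets for a broken link whose original target had height h.
        public List<Element> candidatesOfHeight(int h) {
            return nodesByHeight.getOrDefault(h, Collections.emptyList());
        }

        public String encodingOf(Element e) {
            return structureEncoding.get(e);
        }
    }

Algorithm 5.4 below then needs to examine only candidatesOfHeight(H(T)), discarding any candidate whose encoding differs from ES(T) before comparing content.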
Figure 5.3: Types of document modifications that can and cannot be repaired (legal modifications versus an illegal modification; the key marks the link target)

Algorithm 5.4 Finding a new target node
let H = height of the original target node T
set C = set of nodes in the modified document tree of height H
for each node N in C do
    if ES(N) = ES(T) then
        if Content(N) = Content(T) then
            N is the new target node; return its locator
        end if
    end if
end for

5.4 Evaluation and Testing

We have shown in section 5.1.2 that the broken link detection algorithm described in algorithm 5.2 is optimal. We now wish to evaluate the link repair algorithm given in algorithm 5.4.

5.4.1 Test Documents

Finding good documents to use as test cases was difficult because the XML and XPointer specifications have existed for only a short time. We chose to use two sets of documents for our testing. The first document set consists of a series of randomly generated documents. For the second document, we have chosen to use the XML version of the New Testament provided by Jon Bosak [45].

A major factor in the success of our link repair algorithm is the distribution of the heights of nodes in the tree. Suppose we determine that a link L1 has indeed been broken and we wish to find the target of L1 in the modified document. If a large majority of the nodes have the same height, we will have to search most of the tree in order to determine a new link to replace L1. On the other hand, if the heights of the nodes in the tree are more evenly distributed, we will see considerable improvement in the performance of our algorithms.

5.4.2 Generating Random Documents

Since we decided to generate documents randomly, we needed to have some idea of what reasonable distributions for the variables described above are. Due to the lack of a large body of XML documents to use, we chose to examine HTML documents from our department and college's web sites and attempt to find a model that characterized the element height distribution from these sample documents. There were approximately 8800 documents in the collection that we looked at. For each of these documents, we calculated the percentage of nodes at each height, from 0 up to the height of the root node of the document. Figure 5.4 shows the distribution of element heights obtained with outliers discarded. The data suggests that the distribution of element heights follows a hyperexponential distribution. The hyperexponential distribution is similar to the exponential distribution except that there is more area in the tails of the distribution.

Since HTML does not provide a method of pointing to arbitrary elements within the document, we decided a uniform distribution of link targets would be appropriate. That is, every element in the tree is equally likely to be chosen as a link target. Given that we have no empirical evidence to support any other model, this model seems reasonable.

The final task in using random documents as test cases is modeling changes to the documents. We allow two basic types of document edits: insertions and deletions. We have chosen to follow the model of [41] in defining these operations.
Again, without any empirical evidence to guide us, we have chosen to treat each of these operations as equally likely, and also to consider each node in the tree equally likely to be the target of an editing operation.

Figure 5.4: Element height distributions (the measured element height distribution plotted against a fitted hyperexponential distribution)

5.4.3 The New Testament Document

This document is an XML marked-up copy of the New Testament. It is approximately 1.1 MB in size, and contains close to 25,000 elements. For this document, rather than choosing links randomly, we have chosen to use a set of 270 links taken from the Life Application Bible [46]. These links connect passages in the first four books of the New Testament that relate similar events. This set of links could very easily be found in a document whose goal is to show the commonalities between these first four books. In order to come up with a modified document, we chose to rearrange the books in alphabetical order. The content of the document itself has not been changed.

5.4.4 Testing Environment

We have implemented the link repair algorithm described in section 5.3 and tested it. Due to the availability of tools such as XML parsers, the algorithms were implemented in the Java programming language. The current work seeks only to evaluate the effectiveness of our pruning algorithms. The algorithms may be implemented in a more efficient programming language as part of future work. Our implementation uses the SAX parser interface [47], and so can be used with any SAX-compliant parser.

We performed our evaluations on a Sun Ultra-2 with two 300 MHz CPUs and 512 MB of RAM, running Solaris 2.6. Since measuring absolute performance is not our goal, the choice of hardware is not critical. This machine is used as a compute server, so it is difficult to know exactly what other activity was present at the time the tests were run. The machine usually has about 200 MB of free RAM at any given moment, however, so it is likely that our experiments had plenty of room in which to work.

5.4.5 Performance on the New Testament Document

After running some preliminary performance evaluations, we discovered that this set of links and documents is essentially a worst-case data set for input to our algorithms. This is due to the fact that the mark-up of the document follows a highly regular pattern. Every verse is marked up in exactly the same way, as is every chapter and every book. All of the links in the link set point at leaf nodes. Each verse is located at exactly the same height in the parse tree, and possesses exactly the same encoding of the element structure. Furthermore, a disproportionate percentage (approximately 85%) of the nodes are leaf nodes.

To more accurately evaluate our algorithms, we generated random test documents in accordance with the distributions described in section 5.4.2. We chose to generate documents of three different sizes in terms of the number of elements in the document. The small document category was made up of documents that contain approximately 100 elements; the medium document category was made up of documents containing approximately 1,000 elements, and the large document category was made up of documents containing approximately 5,000 elements. We randomly chose places in the generated documents to insert/delete elements, and varied the number of modifications made.
We also varied the percentage of elements in the document that were the target of links from 5% to 50%. Table 5.1 shows performance statistics for each of the test cases. These statistics are average values over all numbers of links. These results show that our algorithms achieve good performance for small documents, and become increasingly efficient as the size of the document increases.

Class     Number of Modifications     Average Elements Searched
Small              5                              8
Small             10                              9
Medium             5                             62
Medium            10                             64
Large              5                            188
Large             10                            213

Table 5.1: Performance of Algorithm 5.4

5.4.6 Summary

In this chapter, we have described algorithms that can be used to efficiently detect link breakage when document modifications take place. In some circumstances the algorithms are able to repair the broken links quickly by searching the modified document for the referenced content and calculating a new link. An advantage of the algorithms is that they do not require authors to provide explicit information about how a document has been modified. To our knowledge, these algorithms are the only solutions that can repair broken links automatically.

We have shown that the broken link detection algorithm is optimally efficient according to the conditions we set forth. These optimality conditions are reasonable for any implementable (i.e., non-oracle) algorithm. Our link repair algorithms have shown that by restricting the search for new target nodes based on the target sub-tree height and element structure, we can significantly reduce the number of nodes in the modified document that must be searched to find the target.

In chapter 6, we discuss the design and implementation of a software system that utilizes the algorithms described in this chapter. Using this system, we can achieve our goal of providing link integrity automatically.

Chapter 6

A Software Agent to Provide Link Integrity

6.1 Introduction

In chapter 5, we described algorithms that can be used to compute a new locator efficiently to repair a contextually broken link. In this chapter, we describe a web proxy service utilizing those algorithms to ensure that, at all times, a user agent receives the correct data as the result of following a hyperlink. In order to provide this guarantee, our system makes use of an agent, which is a particular implementation of the general WWW idea of a proxy server [48]. Each web server wishing to ensure referential integrity of links pointing into its document space must be serviced by an agent. Agents communicate through a protocol we have developed, allowing them to notify each other when links have been created, deleted or broken.

An outline of the chapter is as follows. First, we describe the requirements for the agent and how we have addressed them in our prototype. Next we describe the agent communication protocol in detail, and describe its operation in the presence of unreliable communication networks. We then analyze the resource requirements and performance effects of adding our system to the current web infrastructure. Finally, we summarize the chapter and give conclusions.

The necessity of inter-agent communication is motivated by Mendelzon and Milo [26], in which the authors give a proof of why it is not possible to compute the answer to the query "Are there any documents pointing to document D?" (or equivalently, "Find all the documents pointing to document D"). In order to guarantee 100% link consistency, it is necessary to require some amount of cooperation among agents.
6.2 Overview of the Agent Proxy

For clarity's sake, we define some terminology. Let S be a web server, addressable using the URL U. We say that S's document space is the set of all documents whose URL is of the form U/path, where path is a string locating a document on server S.

The agent has two distinct functions. The first is its role as a Hypertext Transfer Protocol (HTTP) proxy. In this role, it must accept incoming HTTP requests, forward them to the server for which they are destined and relay the response from the destination server back to the user agent (typically a browser) that initiated the request.

The agent's second role is to maintain a database with information about all links that point into the web server's document space, and to make sure that incoming requests following those links locate the originally designated part of the document. The agent is able to perform this role because, in its role as a proxy, it will receive and examine all requests made to servers it is responsible for, allowing it to redirect requests generated by following a link the agent knows to be broken.

In order to allow the agents to work with previously existing web servers, the agent must be configured to listen on the same IP address and Transmission Control Protocol (TCP) port the web server used. Of course, this means that the address and port number of the server must be changed. Since the agent is acting as a proxy, the change of addresses will be transparent to the outside world.

In our current implementation, there are actually two processes involved in implementing the functionality described above. One we call the lightweight agent, while the other is the agent proper. The lightweight agent fulfills the role of the HTTP proxy, and the agent is responsible for the link maintenance functions. Agents and lightweight agents communicate with each other through the use of the Java Remote Method Invocation (RMI) distributed processing architecture [49, 50]. While it is necessary to have one lightweight agent for each web server (since each server requires a unique IP address and TCP port), an agent can actually service multiple web servers.

The lightweight agent receives HTTP and document maintenance requests, and forwards the document maintenance requests to the agent process responsible for that server. In addition to the request, the lightweight agent includes information about which server the request pertains to so that the agent can act accordingly. Figure 6.1 gives an illustration of the system architecture.

Figure 6.1: System architecture for agent, lightweight agent and web servers (lightweight agents front the web servers, speaking HTTP to clients and RMI to the agent)
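The dissertation does not list the actual RMI interfaces, but a minimal sketch of the remote interface a lightweight agent might use to reach its agent could look as follows; the interface and method names are purely illustrative.

    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // Illustrative remote interface between a lightweight agent and the agent proper.
    public interface LinkAgent extends Remote {

        // Forward a document check-in (new or modified content) received from an author.
        void checkIn(String serverId, String documentUrl, byte[] documentData)
                throws RemoteException;

        // Ask whether a request, identified by its referring document and locator,
        // follows a link known to be broken; returns the corrected locator or null.
        String resolveLocator(String serverId, String refererUrl, String locator)
                throws RemoteException;
    }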
In order to guarantee the contextual integrity Of L2, Agent 3 must send a message to Agent A, indicating the URL of document D2, and the locator information of L2. Suppose document D1 is now updated. We assume the entity performing the update informs Agent A of its desire to perform the update and asks Agent A to perform the update on its behalf. AgentA must determine which of the links pointing into D1 will be broken by the update. Furthermore, for each link that is broken, Agent/1 must compute a corrected locator and notify AgentB of the new locator. We must not make any assumptions about the actual amount of time it takes Agent A to contact AgentB and notify it of the change.1 Therefore, there will be a period of time when D2 will contain the old (incorrect) locator. During this time, AgentA maintains a link map containing a mapping that indicates the corrected locator that should be mapped to the broken locator. This mapping is indexed by both the combination of the referring document’s URL and the locator. During the (perhaps unbounded) period of time until AgentB fixes the broken link in D2, AgentA will intercept all HTTP requests which have a referer2 header of D2 and the incorrect locator and cause them to use the new locator information. Agent A does this by sending an HTTP redirect response (status code 302) with the correct locator to the 1 In fact, it may be the case that AgentA chooses to delay notifying Agent}; for performance reasons. 2This unfortunate misspelling is in fact part of the HTTP standard. 58 Server 82 Server 81 F—fi Document Document = Add Reference In E Check In D2 from D2 to D] E : >@————>. E E‘ % L_5_J L—z Part (a): Adding a Reference from D2 to D1 Server 82 r—-——\ ' ‘ Docggient Modified _ . Change locator Check In D a cigar] ent E Modify D2 for link L2 E = +— AgcntB E E a, \____J l" g___t2_J Reroute requests from D2 for D1 using L2 Part (b): Checking In a New Version of D1 Figure 6.2: Messages passed between agents client that made the incorrect request. The client will then re—issue its request, using the correct locator, and will receive the correct information. The requirement of knowledge of the locator means that for our system to work with current generation browsers, we cannot use the traditional ’#’ character as the separator between the URL and the fragment identifier. This is due to the fact that browsers do not send along the fragment identifier with the request. This is unfortunate, but hopefully a case can be made that the browser should send the fragment identifier along, perhaps as a separate header, in extensions to HTTP. 59 As implied in the preceding description, part of the agent’s role is to serve as a gateway for document modifications. When a document author wishes to make modifications to that document, she must send a request to the agent containing the new content of the document. The agent will then make the change when it has determined which of the links pointing at the document will be broken and how to correct them. We are explicitly not requiring that the authoring software (or the author, for that matter) maintain a record of the changes that have been made to the document; it simply sends the modified document data to the agent. While this policy may generate more network traffic than sending only the change description, it places less burden on the authoring software. By utilizing a separate “check-in” program, any currently existing authoring software can be used. 
6.3 Protocol Description

In this section, we describe the protocol used by agents to communicate with each other. We first describe the set of messages that are used, and discuss the normal behavior of the agent when a broken link is followed. We then examine issues having to do with network failures and delays.

6.3.1 Messages

AddServer (LWAAddress, LWAPort, WebServerAddress, WebServerPort)
This message indicates to an agent that it should begin servicing the web server whose lightweight agent is located at LWAAddress and listening on LWAPort. It also informs the agent of the actual address and port of the web server that is being serviced. This is necessary so that the agent can communicate with the web server to update web content using the HTTP PUT method or some similar mechanism if the agent and the web server do not share a common file system.

RemoveServer (ServerAddress, Port)
Indicates to the agent that it should no longer handle protocol messages regarding the lightweight agent located at ServerAddress and listening on port Port.

AddReference (ReferencedURL, ReferringURL, Locator, [Checksum or LastModifiedDate])
This message indicates to the receiving agent that a link has been created that points into its document space. The document containing the link is located at ReferringURL and uses locator Locator. The purpose of the remaining parameters is described in section 6.3.3.

RemoveReference (ReferencedURL, ReferringURL, Locator)
Receipt of this message by a link agent means that a previously registered link has been removed and should be removed from the agent's link database.

CheckInURL (DocumentURL, SearchList, DocumentSize, DocumentData)
This message is sent whenever either a new document with URL DocumentURL has been created, or the document has been edited. SearchList is a list of document URLs indicating a set of documents that should be used by the algorithms described in chapter 5 to perform broken link detection and repair. DocumentURL is always an implicit member of SearchList. This feature is useful for the (fairly common) editing operation of splitting a large document into several component documents.

LinkUpdate (ReferencedURL, ReferrerList, Locator, NewURL, NewLocator)
This message tells the receiving agent that all links contained in documents listed in ReferrerList pointing to ReferencedURL using locator Locator should be updated to point to NewURL and use NewLocator. Allowing a list of referring documents to be passed reduces the amount of network traffic and processing that must occur if the receiving agent has several documents in its document space that use the same address for a link.

The majority of the messages generated by the protocol come about due to the receipt of a CheckInURL message by an agent. The agent must check the new version of the document to determine if any new links exist in its content, and if so send AddReference messages to the appropriate agents. If the document is a new version of an existing document, it is also possible that links have been deleted in the new version of the document, in which case a RemoveReference message should be generated. If the agent detects that some links have been broken, it generates the appropriate LinkUpdate messages and sends them to the agents responsible for the documents containing the broken references.
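For concreteness, the two messages that carry most of the protocol's work could be represented by simple value classes like the following Java sketch; the field names mirror the parameters listed above, while the wire encoding is left out.

    import java.util.List;

    // Illustrative value classes for two of the inter-agent protocol messages.
    public final class ProtocolMessages {

        public static final class AddReference {
            public final String referencedUrl;
            public final String referringUrl;
            public final String locator;
            public final String checksumOrLastModified;   // freshness evidence (section 6.3.3)

            public AddReference(String referencedUrl, String referringUrl,
                                String locator, String checksumOrLastModified) {
                this.referencedUrl = referencedUrl;
                this.referringUrl = referringUrl;
                this.locator = locator;
                this.checksumOrLastModified = checksumOrLastModified;
            }
        }

        public static final class LinkUpdate {
            public final String referencedUrl;
            public final List<String> referrerList;
            public final String locator;
            public final String newUrl;
            public final String newLocator;

            public LinkUpdate(String referencedUrl, List<String> referrerList,
                              String locator, String newUrl, String newLocator) {
                this.referencedUrl = referencedUrl;
                this.referrerList = referrerList;
                this.locator = locator;
                this.newUrl = newUrl;
                this.newLocator = newLocator;
            }
        }
    }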
6.3.2 Deciding When to Redirect

When the lightweight agent determines that a broken link has been followed, it sends an HTTP redirect message to the user agent that followed the link, telling it the correct locator to use. For correct operation in certain situations, the lightweight agent must be careful about how it goes about doing that. Take, for example, the situation shown in figure 6.3. Here we see a referring document that contains two links into the referenced document. After the document is modified, we see that the locator originally used by link L1 is now mapped to the locator originally used by L2.

Figure 6.3: Possible ambiguity when sending a redirect message (the referring document's original links into the referenced document, and the redirected links after the referenced document is modified)

When the agent receives a request following link L1, it will respond with a redirect asking the client to follow L2 instead. When the agent receives this second request, it should NOT be redirected. Unfortunately, the agent has no way to distinguish this request from a request that is actually following L2. We must be able to determine the relationship between two requests; that is, did the request to follow L2 come about as a result of a previous request following L1? Unfortunately, due to the stateless nature of the HTTP protocol, this is impossible. While HTTP 1.1 contains a mechanism for persistent connections (that is, multiple HTTP requests from a client to a server can be sent along the same connection), either party is free to terminate the connection at any time. Therefore we cannot force both requests to be made over the same connection, which would allow us to determine causality.

The Cookie [51] mechanism allows a server to introduce state into HTTP requests and responses. By utilizing cookies, we can solve the problem described above in the following manner. When the link agent receives the request following link L1, it includes a Set-Cookie header in its redirect response. The Path field of this cookie specifies the referenced document's URL, the name of the cookie is "NoRedirect" and the value of the cookie is set to the value of L2. Finally, the age of the cookie is set to be 1 second. Setting the age in this way makes it likely that the cookie is only valid for one request by the user agent. If the link agent receives a request containing a cookie of the form described above, it knows this request should not be redirected, and so forwards the request unmodified to the web server. The link agent forwards the response from the server to the user agent, including a Set-Cookie header that invalidates the cookie generated by the original request.

Creating an association between two HTTP requests using cookies is the preferred mechanism to ensure that redirected requests are not subsequently redirected, and is the only completely correct means by which we can do so. Unfortunately, due to the potential for misuse of the information conveyed using cookies [52], some users direct their user agents not to accept cookies. If we desire, we can still do quite well without using cookies. Let us assume that we have sent a Set-Cookie header to a browser, and it has rejected this cookie.[3] We still need some way to make sure the second request is not redirected.

[3] Determining if a cookie is rejected requires two redirections: first to a URI that is specifically set up to examine if a cookie is present, and then the actual redirection to the appropriate site, depending on whether the cookie is present.
The lightweight agent must redirect the original request in such a way that it can identify the subsequent request and refrain from performing redirection on that request. The only part of the second request the agent has control over is the Request-URI. The agent has a couple of options in constructing the redirected URL:

• redirect the request to a different port number
• redirect the request using a different HTTP Host header
• change the path to the requested resource

Unfortunately, any of these methods requires sending a different URL back to the user agent that made the original request. If the user then views the resulting document and decides to create a new link to a portion of it, the URL the user sees for the document will be in the form for which redirection is restricted from occurring. This would then prevent the system from maintaining the integrity of this new link.

We must deal with this problem at a higher level than HTTP, in the link agent communication protocol. If a link agent receives an AddReference message using the restricted-redirection form of a URL, it must nack (send a negative acknowledgment for) that message, including the correct version of the URL as part of the message. By sending this nack, we ensure that links using the restricted-redirection form of the URL are not created. This of course does not prevent incorrect links from being created without an AddReference message being sent.

6.3.3 Operation in the Presence of Network Failures

In the absence of network delays or failures, the protocol described in the previous section is sufficient to guarantee link integrity. However, communication networks are not perfect, and so we must address the issue of network failures and delays. The presence of caches along the path between web clients and servers further complicates the situation. In this section, we address problems caused by unreliable networks and the presence of caches.

Lost or Delayed AddReference Messages

Suppose an AddReference message is delayed for a significant period of time.[4] It is possible for the URL being referenced to be modified during the time the AddReference message is being transmitted. This modification could possibly change the content pointed to by the locator used by the referencing document. In this case, not only will the link be broken initially, the agent protocol will perpetuate the broken link as further changes to the document are made! Figure 6.4 illustrates this problem. In order to avoid this problem, it is necessary to know either the exact content the user wishes to link to, or the last modification date of the referenced document, at the time that the user downloaded it.
Figure 6.4: Problems due to network latency and AddReference messages

[4] In fact, since network delivery takes non-zero time, this situation can occur no matter how long it takes for the message to be delivered.

One possible way of avoiding this problem using knowledge of the intended content is to compute a checksum on the data the author intends the link to reference. This checksum is then included as part of the AddReference message. On receipt of the message, the receiving agent retrieves the current version of the document and makes sure the checksum on the intended content matches the one in the message. This might be implemented by providing a simple program into which the user pastes the portion of the referenced document and which calculates the checksum; the check-in program later prompts the user for the checksum. Unfortunately, this adds a significant amount of complexity to the check-in process, and would be very tedious when checking in a document with a large number of links.

Another way of attacking this problem involves the use of the last modification date of the referenced document. A somewhat naive approach is to timestamp the AddReference message with the current date and time when it is sent. The receiving agent compares the timestamp in the message with the last modification date of the document, and if the modification date is later than the timestamp in the message, a nack is sent back, warning the user that the link might now be incorrect.

There are two significant problems with this approach. First, using the time the AddReference message is sent as the reference time is not necessarily an accurate measurement of the time the user examined the content of the document and decided to create a link. Thus, it is possible that the protocol would accept a message containing a locator that references content other than that desired by the author. A second problem with the naive approach is its reliance on synchronization of separate clocks. If the clock on the author's machine and the agent's clock are not synchronized, it is again possible for the agent to accept a message containing an incorrect locator. Algorithms for determining temporal relationships between machines with unsynchronized clocks, such as Lamport's algorithm [53], are not helpful in this situation, since, as can be seen in figure 6.4, there is no communication between the machines sending the AddReference and the CheckInURL messages, and so no reasoning can be made about the order of events. While HTTP 1.1 urges using a method such as the Network Time Protocol (NTP) [54] to synchronize clocks, the standard does not require it.
If the last modification date of the referenced URL at the time the user decided to create the link could be known, then this could be included as part of the AddReference message. Since this date is generated by the server, and the Agent receiving the AddReference message can request this information from the server, we can use it as a basis for comparison.5 The question is, how does the check-in process know what the modification date of a referenced document was at the time that the user looked at it? We could simply require that the user supply this information. Most current browsers provide a way to retrieve the HTTP LastModified header when viewing a document. While this is less obtrusive than requiring the user to create a checksum, it still is user intervention we would like to avoid. Another possibility is illustrated in figure 6.5. In this situation, we have placed a proxy on the client side. The client is configured to use this proxy, and thus all outgoing HTTP requests are sent through the proxy. The proxy is then able to record the requests made by a client machine, recording the URL’S and last modification date 5For robustness, a web server is often replicated. In order for this scheme to work, the clocks of a web server and any machines replicating it would need to be kept synchronized. This is not as unrealistic a requirement as keeping the clocks of machines administered by different entities synchronized. 69 for each document the user requests. When the check in process is initiated, the agent communicates with the proxy6 to find out the last modification dates of all the documents being referenced by the new version of the document and includes these as part of the AddReference messages it sends. Author‘s Machine Proxy Server Checkout Document Turn on logging Hl IP Requests d———> ooonaogfiu Checkin Document nODflflnflflgo Turn off logging sass"! Figure 6.5: The operation of the client proxy Logging all requests by a client in the proxy uses a significant amount of storage, and it is diflicult to know when to flush this data. Therefore, we need some way of knowing when to turn the proxy recording mechanism on and off. One way of doing so would be to incorporate this action into the check-out and check—in procedures. When the user checks out a document, we turn the recording mechanism on, leaving it on until the time the user checks in the document. Logging of all requests is performed by the proxy during the time between check in and check out. Another option would be to provide “begin authoring” and “end authoring” programs that the user executes. This avoids the problem using check-in and check-out where the user might do some browsing between checking in Of a first document and checking out of a second document. 6in fact, the agent could very well be the proxy. 70 Lost or Delayed LinkUpdate Messages A second situation we must consider occurs when multiple LinkUpdate messages are sent to an agent, and the order the updates are received is not the same as the order in which they were sent. We cannot rely upon the in-order delivery guarantees of TCP here because the messages will be sent across different TCP connections, and TCP only guarantees in-order delivery of messages on the same connection. 
Lost or Delayed LinkUpdate Messages

A second situation we must consider occurs when multiple LinkUpdate messages are sent to an agent, and the order in which the updates are received is not the same as the order in which they were sent. We cannot rely upon the in-order delivery guarantees of TCP here, because the messages will be sent across different TCP connections, and TCP only guarantees in-order delivery of messages on the same connection. The example below outlines why receiving LinkUpdate messages out of order could be a significant problem:

1) Document D1 links to document D2 using locator L1.
2) D2 is modified by a check-in with Agent A2, transforming L1 into L2.
3) Agent A2 sends a message M1 to Agent A1, asking that it replace L1 with L2. This message is delayed indefinitely.
4) D2 is modified again, in a way that causes L2 to be replaced by L3.
5) A2 sends message M2 to A1, asking it to replace L2 with L3.

If M2 is received before M1, this would confuse A1; D1 may not hold any links using L2 as a locator. This situation could not occur if updates were not permitted while there are unacknowledged LinkUpdate messages. Then we would be sure that LinkUpdate messages are delivered in the correct order, since only one set of messages can be in transit at any one time. This solution is not practical, since we have no idea how long it might take for any given LinkUpdate message to be received and acknowledged. Instead, our solution buffers unacknowledged LinkUpdate messages at the sender. Every time a LinkUpdate message is sent, it includes ALL unacknowledged LinkUpdates intended for the receiver. This is not an extra buffering burden for the sender, since this data is already being maintained as part of the link map stored by the lightweight agent.

The sender must include some sort of sequencing mechanism in its messages, and the receiving agent must know which messages it has already received. For the sequencing mechanism, we use the date/time of the sending agent. Since the receiver will be using the date contained in the message as part of the acknowledgment, and specifically not the value of its own clock, there is no issue of coordinated clocks.[7] The receiver only needs to maintain in its memory the sender's timestamp for the last message successfully received from that sender. Using the date as the sequencing mechanism avoids the complexity involved in avoiding duplicate sequence numbers in the event of a machine failure.

[7] Again, replicated agents used for performance and robustness would need to keep their clocks synchronized.

If the receiver receives a message containing LinkUpdate information it has already processed, it disregards that portion of the message. When the receiver sends an acknowledgment to the sender, that acknowledgment acknowledges receipt of all outstanding LinkUpdate messages sent before the timestamp contained in the acknowledgment message.
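The following Java sketch captures the sender-side bookkeeping just described: every outgoing batch carries all updates not yet acknowledged by that receiver, keyed by the sender's own clock, and an acknowledgment echoing a sender timestamp clears everything sent at or before it. The types are illustrative; the dissertation's lightweight agent keeps this data in its link map rather than in a separate structure.

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Sketch of per-receiver buffering of unacknowledged LinkUpdate messages.
    public final class LinkUpdateBuffer {

        // Unacknowledged updates for one receiving agent, keyed by the send timestamp
        // taken from the sender's own clock.
        private final NavigableMap<Long, List<String>> pending = new TreeMap<>();

        // Queue an update and return the complete batch to transmit: every update that
        // has not yet been acknowledged by this receiver, oldest first.
        public synchronized List<String> send(long senderTimestamp, String update) {
            pending.computeIfAbsent(senderTimestamp, t -> new ArrayList<>()).add(update);
            List<String> batch = new ArrayList<>();
            for (Collection<String> updates : pending.values()) {
                batch.addAll(updates);
            }
            return batch;
        }

        // Process an acknowledgment: the receiver echoes the sender's timestamp, so no
        // comparison between different machines' clocks is ever required.
        public synchronized void acknowledge(long echoedSenderTimestamp) {
            pending.headMap(echoedSenderTimestamp, true).clear();
        }
    }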
Browser and Proxy Caches

The presence of caches in the network further complicates the agent communication protocol. Suppose a LinkUpdate message is sent to an agent. The agent receives the message, corrects the broken link indicated in the message, and sends an acknowledgment to the sending link agent. In the absence of caches, the lightweight agent for the referenced document could immediately remove the mapping of the old locator to the new locator from its link map, confident that no request using the old locator will be received in the future.[8] However, when the referring document can be retrieved from one of several caches, the situation becomes more complex. How does the receiving lightweight agent know whether or not the request should be redirected? And how does it know when the mapping can be removed from its link map?

[8] Unless, of course, a new link is created using the same locator, in which case requests using that locator should of course not be redirected.

To answer these questions, the link agent sets the "Expires" header field of any response that it sends to a user agent according to the following formula:

Expires := min(Max-Expires, Server's Expires value)

where Max-Expires is a constant set as part of the lightweight agent's configuration. When an agent acknowledges the receipt of a LinkUpdate message, it includes its value of the Max-Expires parameter as part of the acknowledgment message.

By setting the Expires header field in this way, a lightweight agent knows the maximum amount of time that a cached copy of a referring document containing a broken link can exist. It must keep the mapping for that broken link in its link map for that period of time. If during that time it receives a request following the broken link, it sends back an HTTP conflict response (response code 409) to the user agent, notifying the user that the link is potentially broken and that the document should be explicitly reloaded from the origin server before attempting to follow the link again. To make sure that the request made by the user after doing so does not again receive a 409 response, the link agent again makes use of a cookie.

6.4 Costs

As with any modification that adds functionality to a system, we must determine what additional resource requirements will be imposed in adding the new functionality. In this section, we first examine the resource requirements for the agent, and then discuss how the performance of the web is affected by the addition of the agent service.

Memory Requirements

There are two data structures the agent must maintain in memory. The first of these is the set of link databases for each document in the document spaces of the servers serviced by the agent. Although these database structures could be maintained in main memory, it is more likely they would be contained in some sort of mass storage system. These databases must be maintained in persistent storage so that the agent can recover if the agent process crashes for some reason. Also, they only need to be consulted during a document check-in, a relatively infrequent occurrence, so speed of access is not of particular interest.

The second data structure is the link map the lightweight agent uses to map broken links to their corrected counterparts. This structure is much smaller than the link databases, and will be referenced more frequently, since each time a link containing an XPointer is referenced, the agent must verify that the link is correct. Therefore, it seems likely that this structure would be stored in main memory. How large might this structure get, and what may be done to limit its size? There are several factors that contribute to its size:

• the frequency with which document modifications are made that cause links to break;

• the policy used to decide when to request that the agents for referring documents, whose links have been broken, update their documents. The immediate notification policy sends a LinkUpdate message as soon as the broken link is detected. Delayed notification policies such as periodic notification and notification on reference are also possible;

• if a delayed notification policy is chosen, further variables are present. If periodic notification is chosen, then the period must be chosen. If when-referenced notification is chosen, the amount of time between references to a link becomes important.
If when-referenced notification is chosen, the amount of time between references to a link becomes important.

Even with an immediate notification policy, we cannot determine a bound on the size of the link map, because it is possible that we may not be able to contact the agent that needs to be notified. Even so, it is likely that the amount of memory the link map requires will be reasonably small. If we assume the average length of a locator is 32 bytes, each entry in the link map structure will consist of 64 bytes (since each entry consists of a mapping from one locator to another). There may be some additional overhead involved in the link map (for example, one way of implementing it would be as a hashtable of hashtables, where the first level of hashtable is indexed by a referrer/referred pair and each hashtable in the second level contains the locator mappings), but this overhead should be fairly negligible. Allocating 1 MB of RAM for the link map structure then allows storage of mappings for about 16,000 broken links.

The notify-on-reference policy is desirable since it has the potential to reduce the network traffic generated by unnecessary update notifications. However, it is also likely to be the policy that allows the link map to become the largest. A simple modification to this policy, where notification messages are sent when the structure becomes too large, would be a reasonable compromise.

Effects on Web Performance

In order for link requests to be serviced correctly, we must funnel all HTTP requests through the lightweight agent responsible for the server that owns the document. If clients are allowed to communicate directly with the web server, the server will not be able to determine whether the link being requested is broken, since only the agent possesses this knowledge.

If the agent process is running on the same machine as the web server, then the cost of requiring the agent process will simply be that of communicating between the two processes. However, it is likely that the two processes will run on separate machines for performance reasons. In this case, every HTTP request to the server will require at least one extra network hop. This performance cost is typical when there is a proxy between a web server and the rest of the network.

If a broken link is followed, the browser will be redirected to the correct location, which essentially adds an extra round trip between the browser and the server. Again, this cost is due to the use of the HTTP redirect mechanism, a mechanism already in place and in general use. In general, the costs of adding the agent proxy are minor, and are already present to a great extent in the current web architecture. Given all the processing that goes on in a web server today, the cost of processing the extra network traffic generated will be negligible.

A side effect of requiring the agent is perhaps more costly: server replication is not possible if all of the requests must go through the agent. It is becoming more and more common to see a single logical web server serviced by multiple physical servers, with the content on these servers replicated. The load on these servers can be distributed through methods such as DNS Aliasing [55], Magic Routers [56], and HTTP redirect [48]. Forcing web clients to communicate with a single agent introduces a single point of failure into the system, and eliminates the load balancing advantages of having replicated servers.
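Returning briefly to the link map discussed under Memory Requirements above, the sketch below shows one possible realization of the hashtable-of-hashtables organization; the class and method names are illustrative assumptions rather than details of the prototype.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the link map: the outer hashtable is indexed by a referrer/referenced
// document pair, and each inner hashtable maps a broken (old) locator to its
// corrected counterpart. Names are hypothetical.
class LinkMap {
    private final Map<String, Map<String, String>> map = new HashMap<>();

    // With locators averaging roughly 32 bytes, each old->new entry costs on the
    // order of 64 bytes, so 1 MB holds mappings for roughly 16,000 broken links.
    private static String key(String referrerUrl, String referencedUrl) {
        return referrerUrl + "|" + referencedUrl;
    }

    void recordRepair(String referrerUrl, String referencedUrl,
                      String oldLocator, String newLocator) {
        map.computeIfAbsent(key(referrerUrl, referencedUrl), k -> new HashMap<>())
           .put(oldLocator, newLocator);
    }

    // Called when a request following a possibly broken link arrives: returns the
    // corrected locator if one is known, or null if no redirection is needed.
    String resolve(String referrerUrl, String referencedUrl, String locator) {
        Map<String, String> repairs = map.get(key(referrerUrl, referencedUrl));
        return repairs == null ? null : repairs.get(locator);
    }

    // Once no cached copy of the referring document can remain (its Expires period
    // has elapsed), the mapping can be dropped to bound the size of the structure.
    void expire(String referrerUrl, String referencedUrl, String oldLocator) {
        Map<String, String> repairs = map.get(key(referrerUrl, referencedUrl));
        if (repairs != null) {
            repairs.remove(oldLocator);
            if (repairs.isEmpty()) {
                map.remove(key(referrerUrl, referencedUrl));
            }
        }
    }
}
```

A call such as expire corresponds to the point at which no cached copy of the referring document can remain, and the same hook could be used to shed entries when the structure grows too large under the notify-on-reference policy.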
The simplest way to deal with this single point of failure is to replicate the agents. With this approach, the same methods used to distribute the load between conventional web servers can be used to distribute the load between the agent processes. If this approach is adopted, steps will need to be taken to ensure that the link databases and link maps are kept synchronized between the agent processes. We intend to explore this approach more carefully in future work.

6.4.1 Increasing Document Processing Efficiency

An agent servicing a frequently updated web server will need to process a large number of documents concurrently. Most implementations of XML processors are written in a single-threaded fashion. Since a good portion of the time spent processing an XML document is spent doing I/O (especially for large documents), a large percentage of the time it takes a single-threaded processor to process a document will be spent in a blocked state. With an appropriate threads library, a multi-threaded, single-process XML processor will be able to process documents much more efficiently than its single-threaded counterpart.

We have modified version 2.0.15 of the XML4J processor from IBM [57] to transform it from a single-threaded into a multi-threaded implementation. As can be seen from Figure 6.6, the time that it takes to process a set of large documents is considerably improved by adding more threads.

We have actually modified the XML4J processor in a much more significant way than simply adding multi-threading capabilities. The changes we have made allow the processor to be used in a distributed fashion. Multiple processes running on distributed machines are then used cooperatively to process the same document. Implementing the system in this way allows resources such as memory to be aggregated to facilitate processing of large documents that a single machine might not be able to handle. It is this implementation, running on a single machine with the number of processing threads varied, that the measurements in Figure 6.6 refer to.

[Figure 6.6: Performance of modified parser using multiple threads (processing time versus number of threads).]

The distributed implementation works in the following way. When a client wishes to have a document parsed, it submits a request to the processing cluster by making a remote procedure call to one of the members of the cluster, specifying the URL of the document to be parsed. This processor parses the root node of the document, storing the text nodes of the document tree locally. As it processes the root, it keeps track of the byte offsets of any children that it encounters. When the root node is completed, the processor picks another processor from the cluster and sends it a message, asking it to process the children of the root node (originally, the processor would allocate each child as it was encountered to another processor; although this allows a greater degree of parallelism, the relatively high cost of remote procedure calls made this approach impractical). The processors recursively continue this process, treating the node currently being parsed as the root node of the document, until the document has been completely parsed. Figure 6.7 illustrates this architecture.
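The sketch below outlines this request-and-delegation flow in a simplified, single-process form; in the actual system the calls between processors are remote procedure calls, and every class and method name used here is an illustrative assumption. Figure 6.7, which follows, depicts the same architecture.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the recursive delegation used by the distributed parser.
class ClusterParser {

    // Remote reference identifying which processor holds a given node.
    static class NodeRef {
        final int processorId;
        final int nodeId;
        NodeRef(int processorId, int nodeId) {
            this.processorId = processorId;
            this.nodeId = nodeId;
        }
    }

    private final List<ClusterParser> cluster;  // all processors in the cluster
    private final int id;
    private int nextNodeId = 0;

    ClusterParser(int id, List<ClusterParser> cluster) {
        this.id = id;
        this.cluster = cluster;
    }

    // Entry point: the client (the agent) asks any member of the cluster to parse
    // a document and receives a reference to its root node.
    NodeRef parse(String documentUrl) {
        return parseNode(documentUrl, 0L);
    }

    // Parse the element starting at byteOffset, keeping its text nodes locally,
    // then ask a single peer to process all of the element's children.
    NodeRef parseNode(String documentUrl, long byteOffset) {
        NodeRef ref = new NodeRef(id, nextNodeId++);
        List<Long> childOffsets = scanElement(documentUrl, byteOffset);
        if (!childOffsets.isEmpty()) {
            pickPeer().parseChildren(documentUrl, childOffsets);  // one call per node, not per child
        }
        return ref;
    }

    // The peer applies the same procedure recursively to each child it was given.
    void parseChildren(String documentUrl, List<Long> childOffsets) {
        for (long offset : childOffsets) {
            parseNode(documentUrl, offset);
        }
    }

    private ClusterParser pickPeer() {
        return cluster.get((id + 1) % cluster.size());  // simple round-robin choice
    }

    private List<Long> scanElement(String documentUrl, long byteOffset) {
        // Placeholder for parsing a single element: store its text locally and
        // return the byte offsets of its child elements.
        return new ArrayList<>();
    }
}
```

Delegating all children of a node in a single step, rather than one child at a time, reflects the cost consideration noted above for remote procedure calls.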
[Figure 6.7: Distributed parsing architecture, showing the client (agent), the initial parse request, parse requests carrying a URL and byte offset, and the per-processor node tables.]

The client interacts with the processor cluster through the standard DOM API [44]. Each time a request is made for data associated with a node, a remote procedure call is made to the machine that holds that node. It is possible that a request for information about a node could be made before that node has been processed. In that case, the processor waits to respond to the client until the requested data arrives. List structures such as child lists contain references to remote nodes, which are used to contact the processors that actually hold the data associated with each node.

For reasons we have been unable to determine, however, the performance of the distributed system is too poor, in comparison to the single-processor model, for it to be practically useful. We believe that the potential improvements of the distributed approach merit further study of the problems we have found in our particular implementation.

6.5 Prototype Implementation

We have implemented a prototype agent which works with current web browsers and servers without modifications to them. The only deviation from the standards that we have had to make is to separate the URL portion of a link element from the XPointer portion using a | character, rather than the standard # character. This is because current browsers do not send anything after the # sign as part of the request; for our agent to work, it must know the locator the requester is using.

The prototype agent and the link repair algorithms that it uses were both implemented in the Java programming language. One of the main benefits of using Java is that it allows the agent process to execute on any platform, so long as there is an implementation of the Java Virtual Machine for that platform.

Since there are not yet any browsers that implement the XLink and XPointer proposed standards, our implementation does all of the XPointer processing at the agent. When a link containing an XPointer is followed, the agent retrieves the portion of the document located by the link and returns only that portion to the requester, rather than returning the entire document and allowing the browser to process the XPointer.

We have created a simple message editor to facilitate the creation and communication of link notification and check-in messages. A simple check-in program that checks in a document located in a file has also been written as part of our prototype.

6.6 Summary

This chapter has described an agent acting as a proxy for the web server(s) that wish to guarantee contextual link integrity. For guaranteed correctness, the agents for each web server must communicate with each other when links are created or documents are modified. The costs of implementing such a system have been shown to be comparable to those of mechanisms that are already in use, such as firewalls and the use of HTTP redirect to keep traditional links from breaking.

A prototype implementation has been developed using current web tools, including web browsers and servers. The fact that we were able to implement the prototype without changes to current browser and server software demonstrates its potential for quick deployment.

In the future, we plan to explore the issues of fault tolerance and performance discussed in section 6.4. Allowing agents to be replicated would greatly increase the usefulness of the system.
We would also like to continue to reduce the amount of cooperation needed between agent processes, which we discuss in chapter 7.

Chapter 7

Independent Operation

7.1 Overview

We now turn to work which has as its goal eliminating the need for cooperation between authors of documents in web sites administered by different entities. In particular, we are interested in eliminating the link notification requirement. This requirement states that authors must notify the agent associated with a web server every time a link pointing to a document in that server's document space is created or deleted. Eliminating this requirement is desirable since it will expand the set of documents for which contextual link integrity can be maintained.

As we stated in section 6.1, the requirement for agents to communicate the creation of links with each other exists because it is not possible to enumerate the document space of the WWW. While we cannot exhaustively search the universe of web documents to discover the existence of links, information is available that can inform an agent of the existence of a link. This information is provided in the form of the HTTP referer header. By monitoring the receipt of HTTP requests containing referrer information, we can discover the existence of new links and record them in the agent's link database. We will call links that are recorded in an agent's link database due to this process discovered links.

This method of discovery is not foolproof, however. Consider the example shown in figure 7.1.

[Figure 7.1: Discovering an incorrect link. Timeline: a link from the referring document to the referenced document is created; the referenced document is modified, making the link incorrect; the link is then referenced, and the agent for the web server records a link pointing to an incorrect location.]

In this example, a document containing a link is created. The author of that document uses the current version of the referenced document to create the locator for the link. The author of the referring document does not notify the agent responsible for the referenced document of the link's existence. Subsequently, the referenced document's author makes a change to that document which contextually breaks the referring document's link. It is significant that this change occurs before the link in question is ever followed.

Some time later, a user browsing the referring document decides to follow the link. It is at this time that the agent for the referenced document learns of the existence of the link and records it in its link database. Unfortunately, the agent has no way of knowing whether the link is contextually broken, and therefore it assumes the link is correct. Not only will the user browsing the referring document be pointed to an incorrect portion of the referenced document, but the incorrect resolution of the link will also be perpetuated by the referenced document's agent after any significant subsequent changes to the referenced document. In other words, once the link is incorrectly recorded in the database, it will be "fixed" by the link maintenance process so that it stays broken. We call a link which is discovered by the link agent in this situation an invalid discovered link.

Adding an invalid link to an agent's database does not really cause much harm. When the user follows the link, the result will be that the target of the link is not the same as that originally intended by the document author. If the agent did not record the invalid link, then the target of the link will also not be the author's intended target.
The actual target the link resolves to will most likely be different in the two cases, but in both cases it will be incorrect. If at some time in the future the owner of the referencing document does start sending explicit AddReference notifications, then it is important at that time to verify the correctness of any discovered links held by the agent. This is the case because at that point authors at the referencing site begin to have an expectation that links they create will remain correct.

The rest of this chapter describes the methods and results of an investigation into how frequently invalid discovered links are encountered. The results of this investigation give us an idea of how well an agent can maintain contextual link integrity without cooperation from document authors.

7.2 Methods

Determining the frequency of discovering an invalid link is a difficult proposition. We must be able to answer the question "How frequently is a referenced page modified between the time a link is created and the first time that link is followed?" Ideally we would need to be notified immediately whenever any link pointing into an agent's document space is created. However, the impossibility of obtaining such information in an unrestricted environment is the very reason that the notion of discovered links came about in the first place.

In order to approximate the answer to the question, our search is restricted to the document spaces of a specific set of agents. We periodically poll the document space of this set of agents and look for links that point into the document space of other agents in the set. This requires the ability to enumerate the document space of each agent in the set; access to the configuration files of the server(s) associated with the agents is necessary to perform this enumeration.

In addition to tracking the creation of links, we must track two other activities. The first of these activities is modifications made to the documents in the selected agents' aggregate document space. Again, we would ideally like to be notified immediately when a document modification takes place. Without operating system support, however, this is impossible, and so we resort again to periodic polling (the Win32 API used by Microsoft operating systems provides something close to this ability; Unix operating systems do not generally provide such a facility, although one has recently been proposed in [58]). The second activity we must track is referencing of the links in question. This activity can be precisely monitored, since most web servers provide the ability to log referrer information at the time they encounter it. Of course, permission to monitor these log files is required for the set of agents in question.

We tracked link creation, page modifications and link references for a period of six weeks. Due to administrative difficulties, the set of servers used in our study contained a single server, our departmental web server. Studies have shown [59] that documents located on servers in the .edu domain are the least frequently modified documents, so the bias of studying this particular server must be taken into account when analyzing the results.

We have divided the links discovered during the data gathering process into two categories: links that point to pages that were generated by some program, and links that were not. Generated links were typically produced by the javadoc [60]
program, and their large number appears to be due to an instructor requiring students to use javadoc frequently as part of their assignments.

7.3 Results

During the six week tracking period, we observed the creation of 2678 links that pointed from within one document in the departmental server's web space to another document in that same document space. 526 of these links were referenced during the experimental period. For each link referenced at least once during the period of time during which data was gathered, the chart in figure 7.2 indicates how many times the referenced document was modified before the link was referenced.

[Figure 7.2: Number of times modified before first reference occurs. Number of links versus number of modifications (0 through 14), separated into generated and non-generated links.]

The entry in the chart where the x-axis is 0 indicates links that, if discovered by an agent during a reference, would be valid. There are 346 links in this category, indicating that approximately 2/3 of the links referenced at least once would be valid at the time of discovery. Pages that were modified one or more times before they were referenced may lead to invalid discovered links, depending on whether the modification made at that time caused the link to be contextually broken. At this time, we do not have any data on how likely it is that a document modification would contextually break a link. This depends on the types of modifications that are typically made to documents, and the types of locators users typically create.

A substantial number of the links created during the trial were not referenced during the trial. Figure 7.3 shows the distribution of modification counts for those pages during the trial period.

For the generated links, we can argue that invalid discovered links are unlikely to occur when one of these links is referenced. This is because both the referring and referenced documents are typically generated by the program responsible for generating the pages, and it is likely that the program would correctly maintain any links it creates. While the user is able to add additional links to the generated documentation, creating candidates for invalid discovery, this seems unlikely in the case of javadoc, since these links would not be maintained between versions of the documentation; users would soon learn not to create links in the referring pages.

[Figure 7.3: Number of times modified without being referenced. Number of links versus number of modifications, separated into generated and non-generated links.]

7.4 Improving the Results

The statistics in the previous section compare the time of the first modification of a document with the time of the first reference of a particular link pointing to that document. We can improve the performance of the algorithm by a slight modification to the discovery process. As we stated in section 7.1, when the agent receives a request for a document which has the referer header set, and the agent does not already know about the existence of the link, it "discovers" that link (a minimal sketch of this bookkeeping appears below).
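The sketch below shows, under assumed names, the bookkeeping an agent might perform for this discovery step; it records only the single link being followed, and the refinement described next extends discovery to every link in the referring document.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of referer-based link discovery; the HTTP plumbing is omitted and the
// class and method names are hypothetical.
class LinkDiscovery {

    // For each document in the agent's document space, the referring links already
    // known (whether explicitly registered or previously discovered).
    private final Map<String, Set<String>> linkDatabase = new HashMap<>();

    // Called for every incoming request the agent proxies.
    void onRequest(String requestedUrl, String locator, String refererHeader) {
        if (refererHeader == null || refererHeader.isEmpty()) {
            return;                    // nothing to learn from this request
        }
        Set<String> known = linkDatabase.computeIfAbsent(requestedUrl, k -> new HashSet<>());
        String link = refererHeader + " -> " + locator;
        if (known.add(link)) {
            // A previously unknown link: record it as a discovered link. The agent
            // cannot tell whether it is already contextually broken, so it is simply
            // assumed correct from this point on.
            markAsDiscovered(requestedUrl, refererHeader, locator);
        }
    }

    private void markAsDiscovered(String requestedUrl, String referer, String locator) {
        // Persist the discovered link so that later modifications to requestedUrl
        // can trigger link maintenance for the referring document.
    }
}
```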
At the time the agent receives the request containing a referer header, it can in fact discover the existence of all links in the referring document that point into its document space. In other words, the agent is really discovering the existence of the referring document, not of the link itself.

Making this modification to the link discovery process changes the question we must ask to determine how frequently discovered links will be invalid. The question becomes "How frequently is the referenced page modified between the time the link in question is created and the first time any link in the referring document pointing into the referenced agent's document space is referenced?" Figure 7.4 shows how the results change when using this modified discovery strategy.

[Figure 7.4: Results using the modified algorithm. Number of links versus number of modifications, separated into generated and non-generated links.]

These results are based on the data that we gathered for this study, which tracks only intrasite links. The improvement is realized because most pages contain multiple links (and, since they are intrasite links, they must point to the same server). The likelihood of multiple links in a specific page pointing to a single agent's aggregate document space is probably lower than what we have observed in this data.

We have attempted to estimate this likelihood by reexamining the data on our departmental web server. Each document in the server's document space was analyzed, and we recorded the web server associated with each link pointing to an offsite server. This data was summarized by computing, for each document, the number of links pointing to each unique server. Figure 7.5 shows the results of averaging these per-server link counts for each document.

[Figure 7.5: Average number of references per server in a document. Number of pages versus number of offsite links (ceiling function).]

It is possible that as the use of XPointers as locators becomes more prevalent, the likelihood of multiple links in a page pointing to an agent's aggregate document space will increase. We say this because users may create more links when they have the ability to specify the target of a link more accurately.

7.5 Conclusions

We have seen that discovering links through the use of the HTTP referer header can allow an agent, with a reasonably high degree of success, to maintain contextual link integrity for referring documents residing on a server that is not serviced by an agent. The possibility of discovering an invalid link is a risk that the agent must take if it decides to add discovered links to its database. We have seen, however, that discovering an invalid link does not necessarily result in behavior that differs significantly from not discovering the link.

Discovered links may be treated differently by the agent than links whose existence has been explicitly communicated by another agent. For example, they may be discarded at any time if some of the resources used by the agent need to be reclaimed (for example, disk space to hold the link database or memory needed to hold the link map described in section 6.4).

The evidence we have shows that discovering links is a worthwhile proposition for an agent that wishes to increase the probability that links pointing into its document space are resolved correctly. A more expansive study is required to further validate this evidence.
Chapter 8

Conclusions and Future Work

8.1 Conclusions

The problem of maintaining contextual link integrity is an important one. As the use of standards such as XLink and XPointer becomes more widespread, web users will increasingly encounter documents containing contextually broken links. Preventing this situation gives an information provider a definite advantage in terms of the value of the information being provided. Most previous approaches to maintaining WWW link integrity apply only to preventing existentially broken links. This dissertation describes a system we have implemented to maintain contextual link integrity in today's WWW environment.

Our system is open in the sense that specialized editors are not required. Users can continue to use whatever tools they are familiar with to create their documents. The only requirement imposed by the system is that a "check-in/check-out" process is used by the users when they wish to begin or end editing of a particular document. This process then takes care of all communication with the relevant agents.

We have described algorithms that allow web servers to automatically and efficiently maintain link integrity. These algorithms have been shown to be efficient compared to simplistic brute-force algorithms that might be applied. As far as we know, our algorithm for contextually broken link detection is the first algorithm proposed to detect broken links efficiently. We have shown that our algorithm is an optimal algorithm for solving this problem.

Furthermore, we have developed an agent proxy to provide contextual link integrity. The most significant aspect of this proxy architecture is its ability to interact with the current generation of web browsers and servers with little modification. We have argued that the communication protocol used by these agents is correct, and shown that it does not impose a significant processing or communication overhead on the servers and network infrastructure of the current WWW. We have shown that the agent performs correctly in the event of several common failure scenarios.

One of the biggest criticisms of the agent approach is the dependence upon communication between independently managed agents. Indeed, several researchers dismiss the possibility of solving the link integrity problem in this way primarily due to this issue. We have shown that inter-agent communication is unavoidable for agents wishing to guarantee 100% contextual link integrity. We do not fully agree that inter-agent communication is the determining factor in deciding whether our approach is feasible. Microsoft is basing a large part of its corporate strategy for the early 2000s on communication between web services [61], which indicates that at least the corporate world does not view such communication as prohibitive. Even if most information providers were to agree to inter-agent communication, however, there certainly could be those who would not wish to do so. Thus, allowing agents to operate effectively without inter-agent communication is certainly an issue that needs to be addressed. In chapter 7, we described an alternative approach through which agents could "discover" the existence of links into their document spaces independently, reducing the need for inter-agent communication. Through a study of document modification histories and link access data, we have shown that this discovery technique can effectively provide link agents with the information necessary to maintain contextual link integrity most of the time.
In order for our system to be widely deployed, it must not introduce significant time delays in the document development process. We have extensively researched the possibility of implementing a distributed processing architecture for resource-efficient, fast processing of XML documents, an operation that will become increasingly important as the use of XML as a data description language proliferates. We have seen evidence that a distributed architecture would be beneficial, based on the data gained from implementing a multi-threaded document processor. This multi-threaded processor has been shown to have significant performance benefits over a single-threaded document processor, particularly when running on a machine with multiple processors.

8.2 Future Work

Taken as a whole, the algorithms and systems described in this dissertation can be used by an organization to guarantee contextual link integrity for the documents in its web servers' document spaces. However, there are several opportunities to extend and improve on the work we have already accomplished.

Firstly, and perhaps most importantly, the correctness of the agent communication protocol described in chapter 6 is a crucial component of our system. While we have argued for its correctness, a formal verification of the correctness of these protocols is certainly desirable.

A second area where further research could be beneficial would be to extend the study of document modifications and web access patterns described in chapter 7. The results we have reported are based on data that is limited in scope and size. To further validate these results, a longer study that would hopefully involve multiple web sites should be conducted. Involving multiple sites from different types of domains (e.g., .com and .org) would further confirm the validity of the discovered link approach.

Although we were not able to successfully implement distributed document processing with reasonable performance, the idea still seems to be a good one. Distributing the processing load among multiple machines allows us to process large documents more efficiently. In addition, by aggregating the memory resources of a set of machines, it should be possible to process large documents that a single document processor would be incapable of handling.

Due to the relatively high cost of making remote procedure calls, we have found that pre-fetching and caching of portions of the tree at the client is a useful technique. Determining an appropriate pre-fetch and cache replacement strategy is important, since the performance of the distributed document processor can be significantly hampered if poor decisions are made for these strategies. In order to decide what the optimal cache replacement algorithm is, it is necessary to have a model of typical application access patterns to the DOM representation of a document. A study of several significant DOM applications could be used to develop a model of how document content is typically accessed, thus leading to more efficient caching.

Finally, as the WWW continues to evolve, it is very likely that the proportion of dynamic documents will continue to increase. A dynamic document is a document that is generated at the time it is requested, and is typically not stored in a single file. Our system is currently based on the "document = file" paradigm, which is still present in many WWW sites. Developing a system that could deal with dynamic documents is certainly a challenge.
Bibliography

[1] Tim Berners-Lee, Robert Cailliau, Ari Luotonen, Henrik Frystyk Nielsen, and Arthur Secret. The World-Wide Web. Communications of the ACM, 37(8):76-82, August 1994.
[2] Theodor Nelson. Literary Machines. Ted Nelson, 1987.
[3] Nigel Woodhead. Hypertext and Hypermedia: Theory and Applications. Sigma Press, Winslow, England, 1st edition, 1990.
[4] Vannevar Bush. As we may think. The Atlantic Monthly, July 1945.
[5] International Organization for Standardization, Geneva, Switzerland. Information Technology - Hypermedia/Time-based Structuring Language (HyTime). Second edition, 1997-08-01. International Standard ISO 10744:1997. Available online at http://www.ornl.gov/sgml/wg8/docs/n1920/html/n1920.html.
[6] Frank Kappe. Hyper-G: A Distributed Information System. In B. Leiner, editor, Proceedings of INET, San Francisco, California, 1993. Internet Society.
[7] Hugh C. Davis, Simon Knight, and Wendy Hall. Light hypermedia link services: A study of third party application integration. In Proceedings of the ACM Conference on Hypertext, ECHT '94, pages 41-50. Association for Computing Machinery, 1994.
[8] Frank Halasz and Mayer Schwartz. The Dexter hypertext reference model. Communications of the ACM, 37(2):30-39, February 1994.
[9] Hugh C. Davis. Referential integrity of links in open hypermedia systems. In Proceedings of Hypertext 98, pages 207-216, Pittsburgh, PA, USA, 1998. Association for Computing Machinery.
[10] Text Encoding Initiative, [Online] Available http://www.uic.edu/orgs/tei, published 1998.
[11] Steven J. DeRose and David Durand. The TEI Hypertext Guidelines. Computers and the Humanities, 29(3):181-190, 1995.
[12] S. Hockey. Developing access to electronic texts in the humanities. Computers in Libraries, 13(2):41-43, February 1993.
[13] Douglas C. Engelbart. Toward augmenting the human intellect and boosting our collective IQ. Communications of the ACM, 38(8):30-33, August 1995.
[14] Tim Berners-Lee. Hypertext Markup Language - 2.0, [Online] Available http://www.w3.org/MarkUp/html-spec/html-spec.ps, published November 1995.
[15] Dave Raggett. HTML 3.2 Reference Specification, [Online] Available http://www.w3.org/TR/REC-html32, published January 1997.
[16] HTML 4.01 Specification, [Online] Available http://www.w3.org/TR/html4/, published January 1999.
[17] Extensible Markup Language, [Online] Available http://www.w3.org/TR/REC-xml, published February 1998. This is a W3C Recommendation.
[18] International Organization for Standardization, Geneva, Switzerland. Information Processing, Text and Office Systems, Standard Generalized Markup Language (SGML) = Traitement de l'information, systèmes bureautiques, langage standard généralisé de balisage (SGML). First edition, 1986-10-15. International Standard ISO 8879-1986. Federal information processing standard; FIPS PUB 152.
[19] Steven J. DeRose. The SGML FAQ Book. Kluwer Academic Press, Boston, Dordrecht, London, 1st edition, 1997.
[20] ISO/IEC 10744:1997. Hypermedia/Time-based Structuring Language (HyTime), [Online] Available http://www.ornl.gov/sgml/wg4/docs/n1920, published 1997.
[21] XML Linking Language (XLink), [Online] Available http://www.w3.org/TR/1998/REC-xlink-19980210, published March 1998. This is a work in progress and subject to change at any time.
[22] XML Linking Language (XLink), [Online] Available http://www.w3.org/TR/xlink, published June 2000. This is a W3C Candidate Recommendation.
[23] XML Pointer Language (XPointer), [Online] Available http://www.w3.org/TR/1998/WD-xptr-19980303, published March 1998. This is a work in progress and subject to change at any time.
[24] XML Pointer Language (XPointer), [Online] Available http://www.w3.org/TR/xptr, published June 2000. This is a W3C Candidate Recommendation.
[25] Niels Olof Bouvin. Unifying strategies for web augmentation. In Proceedings of the ACM Conference on Hypertext, ECHT '99, pages 91-100. ACM, 1999.
[26] Alberto O. Mendelzon and Tova Milo. Formal models of web queries. In Proceedings of PODS 97, pages 134-143, 1997.
[27] Frank Halasz and Mayer Schwartz. The Dexter hypertext reference model. In J. Moline, D. Benigni, and J. Baronas, editors, Proceedings of The Hypertext Standardization Workshop, pages 95-133, Gaithersburg, MD, 1990. National Institute of Standards.
[28] U. Black. OSI: A Model for Computer Communications Standards. Prentice-Hall International, Englewood Cliffs, New Jersey, 1991.
[29] Norman Meyrowitz. The missing link: Why we're all doing hypertext wrong. In Edward Barrett, editor, The Society of Text: Hypertext, Hypermedia, and the Social Construction of Information, pages 107-114. MIT Press, Cambridge, Massachusetts, 1989.
[30] Bernard J. Haan, Paul Kahn, Victor A. Riley, James H. Coombs, and Norman K. Meyrowitz. IRIS hypermedia services. Communications of the ACM, 35(1):36-51, January 1992.
[31] Steven J. DeRose and David G. Durand. Making Hypermedia Work: A User's Guide to HyTime. Kluwer Academic Publishers, Norwell, Massachusetts 02061 USA, 1994.
[32] Udo Flohr. Hyper-G organizes the web. Byte Magazine, 20(11):59-64, November 1995.
[33] Keith Andrews, Frank Kappe, and Hermann Maurer. The Hyper-G network information system. Journal of Universal Computer Science, 1(4):206-220, April 1995.
[34] Frank Kappe. A scalable architecture for maintaining referential integrity in distributed information systems. Journal of Universal Computer Science, 1(2):84-104, February 1995.
[35] Leslie Carr, Hugh Davis, David De Roure, Wendy Hall, and Gary Hill. Open information services. In Proceedings of the Fifth International World Wide Web Conference, Paris, France, 1996. World Wide Web Consortium.
[36] Jim Pitkow. Web characterization overview, [Online] Available http://www.w3.org/WCA/Reports/1998-01-PDG-answers.htm, published March 1999.
[37] Michael L. Creech. Author-oriented link management. In Proceedings of the Fifth International World Wide Web Conference, Paris, France, 1996. World Wide Web Consortium. Available at http://www5conf.inria.fr/fich_html/papers/P11/Overview.html.
[38] PURL home page, [Online] Available http://purl.oclc.org.
[39] P. Mockapetris. Domain Names - Implementation and Specification. RFC 1035, 1987.
[40] S. Ingham, D. Caughey, and M. Little. Fixing the "broken link" problem: The W3Objects approach. Computer Networks and ISDN Systems, 28(7-11):1255, 1996.
[41] Andres S. Noetzel and Stanley M. Selkow. An Analysis of the General Tree-Editing Problem, pages 237-252. Addison Wesley, 1983.
[42] Thomas A. Phelps and Robert Wilensky. Robust intra-document locations. In Proceedings of the Ninth International World Wide Web Conference, Amsterdam, Netherlands, 2000. World Wide Web Consortium, Foretec Seminars, Inc.
[43] Leslie Carr, David De Roure, Wendy Hall, and Gary Hill. The distributed link service: A tool for publishers, authors and readers. World Wide Web Journal, 1(1):647-656, 1995.
[44] Document Object Model (DOM) Level 1 Specification, [Online] Available http://www.w3.org/TR/REC-DOM-Level-1, published October 1998. This is a W3C Recommendation.
[45] Jon Bosak. XML version of the Bible, [Online] Available http://sunsite.unc.edu/pub/sun-info/standards/xml/eg/rel200.zip, published October 1998.
[46] James C. Galvin. Harmony of the Gospels, pages 1931-1936. Tyndale House Publishers and Zondervan Publishing Company, Grand Rapids, Michigan, 1991.
[47] David Megginson. SAX 1.0: The Simple API for XML, [Online] Available http://www.megginson.com/SAX/index.html, published May 1998.
[48] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, T. Berners-Lee, L. Masinter, and P. Leach. HTTP 1.1 Specification. RFC 2616, [Online] Available http://www.w3.org/Protocols/rfc2068/rfc2068, published June 1999.
[49] Sun Microsystems. Java Remote Method Invocation Specification, [Online] Available http://www.javasoft.com/products/jdk/1.2/docs/guide/rmi/spec/rmiTOC.doc.html, published October 1998.
[50] Sun Microsystems. Getting Started Using RMI, [Online] Available http://www.javasoft.com/products/jdk/1.2/docs/guide/rmi/getstart.doc.html, published 1999.
[51] D. Kristol and L. Montulli. HTTP state management mechanism. RFC 2109, February 1997.
[52] Junkbusters Corporation. How web servers' cookies threaten your privacy, [Online] Available http://www.junkbusters.com/ht/en/cookies.html, downloaded August 2000.
[53] Leslie Lamport. Time, clocks and the ordering of events in a distributed system. Communications of the ACM, 21:558-564, July 1978.
[54] D. Mills. Network Time Protocol (version 3) specification. RFC 1305, March 1992.
[55] T. Brisco. DNS Support for Load Balancing. RFC 1794, April 1995.
[56] E. Anderson, D. Patterson, and E. Brewer. The Magicrouter, an Application of Fast Packet Interposing, [Online] Available http://www.cs.berkeley.edu/~anders/projects/magicrouter/osdi96-mr-submission.ps, published May 1996.
[57] IBM alphaWorks. XML parser for Java, [Online] Available http://www.alphaworks.ibm.com/tech/xml4j/, published February 1998.
[58] Gaurav Banga, Jeffrey C. Mogul, and Peter Druschel. A scalable and explicit event delivery mechanism. In Proceedings of the 1999 USENIX Technical Conference, Monterey, California, 1999. Advanced Computing Systems Association. Available at http://www.usenix.org/events/usenix01/cfp/banga/banga-html/.
[59] Xiangping Chen and Prasant Mohapatra. Lifetime behavior and its impact on web caching. In International Conference on Computer Communications and Networks, Estes Park, Colorado, 1999. Institute of Electrical and Electronics Engineers.
[60] Sun Microsystems. Javadoc 1.3, [Online] Available http://www.javasoft.com/j2se/1.3/docs/tooldocs/javadoc/index.html, downloaded August 2000.
[61] Microsoft Corporation. Microsoft .NET: Realizing the next generation internet, [Online] Available http://www.microsoft.com/net/whitepaper.asp, published June 2000.