This is to certify that the dissertation entitled "Ontology-Based Similarity for Clustering in Text Space," presented by Nasser Assem, has been accepted towards fulfillment of the requirements for the Doctoral degree in Computer Science & Engineering.

ONTOLOGY-BASED SIMILARITY FOR CLUSTERING IN TEXT SPACE

By

Nasser Assem

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science & Engineering

2002

ABSTRACT

ONTOLOGY-BASED SIMILARITY FOR CLUSTERING IN TEXT SPACE

By Nasser Assem

With the advent of the World Wide Web, and the increasing popularity of web search engines, there has been a renewed interest in information retrieval systems. In this research, we introduce a system that combines category-based and keyword-based concepts for a better information retrieval system. For improved document clustering, we proposed a document similarity measure that is based on keyword frequency in documents, but also uses an input ontology. This ontology is domain specific and includes a list of keywords organized with their degree of importance to the categories of the ontology. We evaluated the performance of this similarity measure and compared it to the standard cosine vector similarity measure. For that, we used document data with pre-determined structure as well as actual web documents. We designed a framework to generate synthetic data to model documents, and analyzed statistical attributes of documents in high dimension. For synthetic data analysis, we designed a controllable structure using various distributions of angle to specify cluster compactness and angle-based inter-cluster overlap to specify cluster isolation. We address the issue of modeling text documents, and propose the use of a graph data model that is based on the concept of semantic groups. We present a mechanism by which semantic groups can be used with document processing.

Copyright by
NASSER ASSEM
2002

To my family

ACKNOWLEDGMENTS

I would like to thank Dr. Sakti Pramanik for being my guidance committee chair. Thanks also go to the other members of my guidance committee: Dr. Abdol H. Esfahanian, Dr. Eric Torng, Dr. Shawnee Vickery and Dr. George C. Stockman. Special thanks go to Dr. Wayne Dyksen for his support.

TABLE OF CONTENTS

LIST OF TABLES x LIST OF FIGURES xi 1 Introduction 1 1.1 Motivation ................................. 1 1.2 Research Work .............................. 3 1.3 Dissertation Outline ........................... 4 2 Background and Literature Review 6 2.1 Introduction ................................ 6 2.2 Document Modeling and Processing ................... 6 2.2.1 Document Preprocessing ..................... 7 2.2.2 Document Modeling ....................... 9 2.3 Data and Document Clustering ..................... 13 2.3.1 Hierarchical Clustering Techniques ...............
13 2.3.2 Complete Link Clustering Algorithm .............. 14 2.4 Graph Data Model and Documents ................... 15 2.4.1 Graph-Based Database ...................... 16 2.4.2 Domain .............................. 17 2.4.3 Semantic Group .......................... 17 2.5 Literature Review ............................. 18 2.5.1 Document Modeling and Similarity Measure .......... 18 2.5.2 Data and Document Clustering ................. 19 2.5.3 Graph Data Model and Semistructured Data .......... 21 2.6 Summary ................................. 23 3 Ontology-based Similarity Measure 25 3.1 Introduction ................................ 25 3.2 Similarity and Distance Measures for Documents ........... 26 3.2.1 Cosine Vector Similarity Measure ................ 26 3.2.2 Euclidean Distance Measure ................... 28 3.2.3 Discussion ............................. 32 3.3 Proposed Ontology-based Similarity Measure (OBSM) ........ 33 3.3.1 Motivation ............................. 33 vii 3.3.2 Ontology Used in Text Space .................. 3.3.3 Ontology-based Similarity Measure ............... 3.3.4 Performance Evaluation Approaches .............. 3.4 Data Sets With a Priori Knowledge ................... 3.4.1 Varying Intra-cluster Overlap .................. 3.4.2 Varying Inter-cluster Overlap .................. 3.4.3 Discussion ............................. 3.5 Real Data Set ............................... 3.5.1 Web Document Gathering .................... 3.5.2 Document Preprocessing ..................... 3.5.3 Clustering Technique ....................... 3.5.4 Web Documents Data Set .................... 3.6 Summary ................................. Synthetic Document Data Modeling and Generation 4.1 Introduction ................................ 4.2 Approach ................................. 4.3 4.2.1 Definitions ............................. 4.2.2 Distribution Analysis of Random Variables ........... Cartesian Coordinate System ...................... 4.3.1 Distribution of Euclidean Distance for Uniform Random Data 4.3.2 Distribution of Angle for Uniformly Distributed Random Data 4.3.3 Average Distance and Angle, and Deviation as Functions of Dimension ............................. 4.3.4 Distribution of Intra and Inter-Cluster Distances ........ 4.3.5 Expected Value of Eucliden Distance .............. 4.3.6 Distribution of Intra-Cluster Angles ............... 4.4 Polar Coordinate System ......................... 4.4.1 Polar vs. Cartesian Coordinates in t-dimensional Space . . . 4.4.2 Clustered Document Data .................... 4.5 Summary ................................. Semantic Groups of Graph Data Model and Documents 5.1 Introduction ................................ 5.2 Motivation ................................. 5.3 5.4 5.5 5.2.1 Cluster Representation and Reuse ................ 5.2.2 Update of Private Views ..................... Semantic Groups in Graph Data Model ................. 5.3.1 Semantic Groups and Views ................... 5.3.2 Separation of Semantic Groups and Private Update ...... 5.3.3 Effect of GDM Operations on Semantic Groups ........ Semantic Groups and Documents .................... Summary ................................. viii 34 36 36 37 42 43 43 44 44 45 49 55 57 57 58 59 59 61 61 63 63 67 69 70 71 72 73 80 81 81 82 84 85 86 87 89 90 94 96 6 Summary and Future Work 97 6.1 Summary ................................. 97 6.2 Future Work ................................ 
99
APPENDICES 100
A Some Web Pages from Data Set 101
B Commands of Graph Data Model 105
BIBLIOGRAPHY 109

LIST OF TABLES

3.1 Data set used to generate Figure 3.4 .................. 39
3.2 Data set used to generate Figure 3.5 .................. 42
3.3 Real Documents Data Set ........................ 45
3.4 Summary of the documents gathered from Yahoo and Detrol web sites 51
4.1 Distance between cluster centroids (2 clusters, 1000 samples) ..... 66
4.2 Angle between cluster centroids (2 clusters, 1000 samples) ...... 66
4.3 Summary of data set used to analyze intra and inter-cluster distance 67
4.4 Intra and inter-cluster average distance and deviation ......... 68
4.5 Inter-cluster overlap (2 dim, 5 deg. scatter) ............... 76
4.6 Inter-cluster overlap (3 dim, 5 deg. scatter) ............... 78
4.7 Inter-cluster overlap (3 dim, 10 deg. scatter) .............. 79
4.8 Inter-cluster overlap (100 dim, 10 deg. scatter) ............. 79
4.9 Inter-cluster overlap (1000 dim, 10 deg. scatter) ............ 79

LIST OF FIGURES

1.1 Architecture layers ............................ 3
2.1 Document preprocessing: from full text to index terms ........ 7
3.1 Illustration of distance and similarity measures ............. 28
3.2 Intra and inter-cluster CVSM similarities ................ 38
3.3 Intra and inter-cluster OBSM similarities ................ 40
3.4 Intra and inter-cluster OBSM similarities ................ 41
3.5 Intra over inter-cluster similarity ratio ................. 42
3.6 Hierarchical single link clustering .................... 46
3.7 Hierarchical average link clustering ................... 47
3.8 Hierarchical complete link clustering .................. 48
3.9 Clustering using CVSM ......................... 52
3.10 Clustering using OBSM (weight 5) ................... 53
3.11 Clustering using OBSM (weight 15) ................... 54
4.1 Distribution of distance between centroids ............... 62
4.2 Distribution of angle between centroids ................. 64
4.3 Average distance between cluster centroids ............... 65
4.4 PDF of intra and inter-cluster distances ................. 67
4.5 PDF of intra-cluster angles ........................ 70
4.6 Spherical coordinate system ....................... 73
4.7 Representation of clustered documents in polar coordinate system . 74
4.8 Clustered synthetic document data (2 dim, 5 deg. scatter) ...... 76
4.9 Clustered synthetic document data (3 dim, 5 deg. scatter) ...... 77
4.10 Clustered synthetic document data (3 dim, 10 deg. scatter) ...... 78
5.1 Cluster representation ..........................
5.2 Example of semantic group ........................
5.3 Querying based on semantic groups ...................
5.4 Separating semantic groups .......................
5.5 Semantic groups and basic graph operations ..............
5.6 Connected nodes .............................
5.7 Linking nodes in interacting semantic groups ..............
5.8 Context sensitivity using semantic groups ................
A.1 Yahoo's "Drugs and Medications" category page ............ 102
A.2 Part of the "D" medication index .................... 103
A.3 Detrol description page .......................... 104

Chapter 1

Introduction

1.1 Motivation

With the advent of the World Wide Web, there has been a renewed interest in Information Retrieval systems. Most current WWW search engines are based mainly on keyword-based search. Some words may be conceptually different, even though they are syntactically similar and vice versa.
Therefore, keyword-based search systems may retrieve documents which are not relevant to a query, or may not retrieve all relevant documents. To address this issue, some search engines also include category-based search. However, the two search mechanisms are usually used separately.

The space of category-based search systems can be characterized as a continuum. On one end the categories are automatically generated; such systems are fast and capture all possible categories. However, some categories may not necessarily make sense to users; the number of categories may also be huge. On the other end of the continuum, the categories are created by humans. Such categories fit perfectly in the context of user searches, since they are most likely to be understandable by users. However, the process of creating and maintaining categories in such systems is slow and costly. The categories may not always be up-to-date. Moreover, the categorization system may not be global, and may have a low coverage rate, that is, only a small subset of documents can be covered by such categorization. For example, in this continuum, the Yahoo [18] categorization system is located at the extreme end of "manual" systems, since the categorization is made manually off-line. More recent search engines, like Lycos [50], have been augmented with semi-automated category systems such as WiseWire. However, they still require human intervention.

A search engine will be more user friendly and easy to maintain if there is a system that automatically generates and uses human-like categories. As shown in Figure 1.1, conceptually such a system would consist of three layers: (a) a top layer that contains human-understandable categories, (b) a middle layer that contains automatically generated categories, and (c) a bottom layer that contains references to actual web pages. The top layer can be constructed based on a general categorization method. One of the main advantages of such a system is that different levels of categorization offer the user the best of both worlds. On one hand, the top layer provides the user with concise, closely related and human-readable query results which are as easy to browse as Yahoo's manual categories. On the other hand, the middle and bottom layers offer the opportunity to access broader results. Another advantage is that automatic categorization minimizes human intervention, and hence alleviates system maintenance and update.

[Figure 1.1: Architecture layers — high level categories (human-understandable); document clusters (automatically generated); web pages]

This work contributes to the middle layer by using an ontology-based similarity measure that combines category and keyword based information, to generate document clusters with improved quality. It also proposes a graph data model for representation of documents.

1.2 Research Work

We have designed a technique that integrates text categorization with the standard keyword-based search and produces improved document clustering. To design such a technique, we have investigated several distance measures for the domain of text documents. Text categories used in this work are composed of a set of keywords with varying weights, where the weights determine how important the keyword is to a category. The motivation for usage of an ontology-based weight is that keywords that are part of the same category should mutually contribute with a higher effect on the similarity of documents in which they appear.
Combining the ontology with standard keyword-based similarity adds a new semantic to the information retrieval system. Clustering is used to perform document categorization. We present a clustering technique based on the ontology-based similarity measure, thus improving the quality of clusters. As previous work suggests that the quality of hierarchical clustering methods depends on the application domain, we analyze the effect of the types of data distributions on the performance of the similarity measure. To achieve control of the parameters of synthetic data, such as the angle distance between data centroids, we investigate several methods to model and generate synthetic documents.

We have proposed the use of a graph data model [64] for modeling documents. This model provides a general framework for mapping clusters to high level categories, thus allowing for better presentation of documents. This model also allows for better update of the underlying structure of documents.

1.3 Dissertation Outline

This dissertation is organized as follows. Chapter 2 introduces background information and related research work in the area of document processing, modeling, and clustering techniques, as well as on graph data models. In Chapter 3 we define the proposed ontology-based similarity measure, and elaborate on how to evaluate its performance. Details on generation and characterization of synthetic data used for data distribution analysis are presented in Chapter 4. Chapter 5 focuses on the semantic groups of the graph data model suggested for use with documents. Chapter 6 summarizes the work of this dissertation and concludes with suggestions regarding future work.

Chapter 2

Background and Literature Review

2.1 Introduction

This chapter presents background information and related work on document modeling and clustering techniques and the graph data model. The concepts of term weighting and similarity measure are introduced. Different data and document clustering techniques are described. And since the graph data model provides a framework for modeling document clusters and categories, an introduction to graph data models is provided.

2.2 Document Modeling and Processing

Information Retrieval systems have been developed to help manage the huge volume of literature documents. Documents are generally represented by a set of index terms or keywords. A logical view of a document is its representation by the set of all the words it contains; however, this may require a lot of storage space for the whole collection of documents. That is why, in general, documents are first pre-processed after they are parsed to identify all terms. Some words like articles and connectives (also called stop-words) are eliminated. Some words are reduced to their common grammatical root; this process is called stemming. Figure 2.1 shows a typical preprocessing sequence of a document from full text to index terms for document modeling.

[Figure 2.1: Document preprocessing: from full text to index terms]

2.2.1 Document Preprocessing

Document preprocessing is the preparation of the documents for modeling, retrieval and access.
The processing consists of the following operations: (1) lexical analysis, which identifies words; (2) elimination of stop words, which filters out words with low discrimination value; (3) stemming, which, by removing affixes (suffixes), allows retrieval of documents containing syntactic variations; (4) construction of term categorization structures, or "Thesaurus", which can be used for the construction of ontologies. This processing is made to improve the quality and speed of document and data retrieval. Not all operations are performed for every data retrieval system; it mainly depends on the domain of use and the cost of using it. The following subsections describe each of the above operations.

Lexical Analysis

This is the process of extracting a stream of selected words from a stream of characters. To do this, it is required to identify the words of a document. Punctuation marks and single letters are usually removed.

Elimination of Stop Words

Words that occur in most documents, like "and" and "or", should be removed. They are considered useless because they do not have a high discrimination value. Their elimination also reduces the size of the document database. A list of stop words can be found in [22].

Stemming

Sometimes two different but related forms of a word exist in documents, like plurals and various verb forms. Stemming is the process of reducing a word to its origin. A stem is the portion of the word remaining after removing its affixes. This can be achieved using a finite state machine that is based on linguistic rules. This operation also reduces the size of the document database.

Thesauri Construction

This is the method of creating a standard vocabulary, which provides classified hierarchies. A thesaurus consists of a compiled list of important words in a given domain of knowledge. For each word in the list, a set of related words is defined. A term in the thesaurus is usually a single word. Using a thesaurus in data retrieval allows for adding concept to words, which significantly increases the quality of clustering. This doesn't always work well, because relationships captured in thesauri frequently are not valid in the local context of a data set. For example, general relationships between keywords may not be very helpful in a domain specific data set, like a medical database, as opposed to a more technical set of relationships, like those between brand names and generic names.

2.2.2 Document Modeling

This defines a framework in which documents are represented by logical views based on the information that the preprocessor extracts from the documents. It also defines how a similarity measure between documents is quantified. The main models proposed in the literature are the vector model [69], the boolean model and the probabilistic model [68, 70]. The boolean model uses a binary weight to represent documents. The relationship between the document and keywords is basically whether the keyword is cited in the document or not. The vector model uses a more involved weight system than binary weights to index terms in documents, allowing for partial matching by computing the degree of similarity between documents. This allows for a possible ranking of retrieved documents in decreasing order of degree of similarity. This is especially useful for a system where range queries are used.
The probabilistic model iteratively defines the relationship between documents and keywords using some initial formula estimates that are improved in a multi-step fashion based, for instance, on input from the user. In the following we focus on the vector model.

Vector Representation of Documents

The weight w_{ij} associated with a term k_j and document d_i is positive and non-binary, and the vector for a document d_i is \vec{d_i} = (w_{i1}, w_{i2}, \ldots, w_{it}), where t is the number of index terms in the system. The number t of keywords can be computed based on the preprocessor output by removing the keywords from a predefined stop-list, or by including only those terms that do not appear in too many documents; a threshold can be defined for this purpose. Another issue that is addressed here, as well as while preprocessing documents, is what form of the keywords should be used; stemming may not necessarily improve the quality of the representation of documents.

Term-Weighting System

Keywords that are frequently mentioned in a document are useful to categorize this document. Therefore, a term frequency, which is the number of times a keyword occurs in a document, should be used as part of the term-weighting system. As proposed in [69], the widely adopted term-weighting scheme is as follows. If freq_{ij} is the raw frequency of term k_j in the document d_i, that is, the number of times the term k_j is mentioned in the text of the document d_i, then the normalized frequency f_{ij} of term k_j in document d_i is

    f_{ij} = \frac{freq_{ij}}{\max_l freq_{il}}

where the maximum is computed over all terms of document d_i. Standard term weighting schemes also take into account the inverse document frequency, also called the idf factor:

    idf_j = \log \frac{N}{n_j}

where N is the total number of documents in the system, and n_j is the number of documents in which the term k_j appears. Terms that appear in many documents are not very useful for distinguishing documents. In these term-weighting schemes, the weight associated with d_i and k_j is

    w_{ij} = f_{ij} \times idf_j

The normalized frequency allows for a well defined domain, where a raw weight can be a real number between 0 and 1. The second term in the above formula, also called the idf term, allows terms that are not being used in too many documents to have more contribution in distinguishing those documents in which they appear; this is especially useful for domain specific terms, which are in general not widely used outside their domain. A user would definitely try to use those keywords to retrieve the documents that he/she wants.

Similarity Measure Between Documents

A similarity measure between two documents is based on the number of keywords that belong to both documents [69]. The degree of similarity of documents d_i and d_j is the correlation between their vectors, which can be quantified by the cosine of the angle between the two vectors. Let's consider that there are t different keywords in the collection of documents. If a document d_i is represented by the following vector in the t-dimensional space: \vec{d_i} = (w_{i1}, w_{i2}, w_{i3}, \ldots, w_{it}), then the following length can be used:

    |\vec{d_i}| = \sqrt{\sum_{k=1}^{t} w_{ik}^2}     (2.1)

and the similarity between two documents d_i and d_j can be defined by the cosine vector similarity measure:

    sim(d_i, d_j) = \frac{\sum_{k=1}^{t} w_{ik} \times w_{jk}}{\sqrt{\sum_{k=1}^{t} (w_{ik})^2} \times \sqrt{\sum_{k=1}^{t} (w_{jk})^2}}     (2.2)

The numerator of the above formula obviously is useful in associating two documents if they share many common keywords. Those keywords not shared by both documents are eliminated since the product would be zero.
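To make the weighting scheme and Equation 2.2 concrete, the following is a minimal sketch (hypothetical helper names and a toy three-document collection, not the system built in this work) of tf-idf weighting followed by the cosine vector similarity:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build tf-idf vectors: w_ij = f_ij * idf_j (Salton's scheme), where
    f_ij is the raw frequency normalized by the document's most frequent term."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # document frequency n_j
    vectors = []
    for doc in docs:
        raw = Counter(doc)
        max_freq = max(raw.values())
        vectors.append({t: (raw[t] / max_freq) * math.log(N / df[t]) for t in raw})
    return vectors

def cosine_similarity(u, v):
    """Cosine vector similarity (Equation 2.2) for sparse term -> weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [["drug", "dose", "patient"],
        ["drug", "dose", "trial"],
        ["graph", "node", "link"]]
vecs = tf_idf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))   # shared terms give a positive similarity
print(cosine_similarity(vecs[0], vecs[2]))   # no shared terms: 0.0
```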
The denominator of the formula, which is basically a normalizing factor, is used to avoid discriminating against small documents in favor of large documents. A normalization factor should be incorporated into the similarity measure to normalize the length of the document vectors. This way, relevant short documents have an equal chance to be clustered as larger documents.

Comparison of Vector Model to Other Models

The probabilistic model and the boolean model are the main alternative models used in Information Retrieval systems. The vector model is superior to the boolean model, and at least almost as good as the probabilistic model. The advantages of the vector model can be summarized as follows. First, the simple term-weighting scheme improves retrieval performance, since the similarity measure can be easily implemented, unlike the probabilistic model, where an iterative process is needed. Second, the partial matching strategy allows for the retrieval of approximately similar documents. This is useful for range queries, where exact match is not realistic, or is unnecessary. Finally, the cosine ranking formula sorts the documents according to their degree of similarity; this is useful for ranking the documents retrieved for a given cluster.

One of the disadvantages of the vector model is that the index terms are assumed to be mutually independent, which is not necessarily true, since some terms usually appear together in a lot of documents, and may therefore be correlated. The generalized vector space model addresses this issue. However, it has not been shown that the increase in complexity does improve performance compared to the standard vector model. Another disadvantage is the sparse vectors used for representing documents, with too many zeros; a lot of space may be wasted. However, some search data structures and algorithms are designed to take this issue into consideration.

2.3 Data and Document Clustering

A clustering is a type of classification of a set of objects [34]. There are hierarchical and partitional clustering methods. A hierarchical classification is a nested sequence of partitions. A partitional classification is a single partition. Hierarchical clustering is usually used to cluster documents, because it provides a hierarchy of clusters that can be navigated further.

2.3.1 Hierarchical Clustering Techniques

A hierarchical clustering method is a procedure for transforming a similarity matrix into a sequence of nested partitions. The resulting clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence. An agglomerative algorithm for hierarchical clustering starts with the disjoint clustering, which places each of the documents in an individual cluster. The similarity matrix is used to merge two of these clusters, thus nesting this clustering into a second partition. The process is repeated to form a sequence of nested clusterings in which the number of clusters decreases as the sequence progresses until a single cluster containing all documents remains. A dendrogram of the hierarchy of clusters can be obtained; it can be used to visualize and assess to some degree the resulting clusters [74]. The main hierarchical clustering methods in use differ in how they define the distance between two clusters: (i) complete link uses the maximum pairwise distance between members of the two clusters, (ii) single link uses the minimum pairwise distance, and (iii) average link uses the average pairwise distance.
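The three linkage rules above can be illustrated with a short sketch of agglomerative clustering over a precomputed dissimilarity matrix (an illustrative, unoptimized O(n^3) version with assumed function names, not the implementation used in this dissertation):

```python
def cluster_distance(ci, cj, dist, linkage):
    # Pairwise point dissimilarities between the members of the two clusters.
    pairs = [dist[i][j] for i in ci for j in cj]
    if linkage == "complete":
        return max(pairs)
    if linkage == "single":
        return min(pairs)
    return sum(pairs) / len(pairs)            # average link

def agglomerative(dist, linkage="complete"):
    """Agglomerative hierarchical clustering over a symmetric n x n
    dissimilarity matrix; returns the merge history as (cluster, cluster, level)."""
    clusters = [[i] for i in range(len(dist))]
    history = []
    while len(clusters) > 1:
        # Find the least dissimilar pair of current clusters under the chosen linkage.
        a, b = min(((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
                   key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]], dist, linkage))
        level = cluster_distance(clusters[a], clusters[b], dist, linkage)
        history.append((clusters[a], clusters[b], level))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history

D = [[0, 1, 5],
     [1, 0, 6],
     [5, 6, 0]]
print(agglomerative(D, linkage="complete"))   # merges {0,1} at level 1, then all at 6
```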
The complete link method has been widely used for document clustering; we describe an algorithm for it in the following section.

2.3.2 Complete Link Clustering Algorithm

A matrix updating algorithm, suggested by King [39] and made popular by Johnson [37], is used to implement the complete link clustering method. The algorithm, as described in [34], is as follows. Suppose that we have n documents, and let the n x n matrix D = [d(i, j)] be the proximity (dissimilarity) matrix. The clusterings are assigned sequence numbers 0, 1, ..., (n - 1) and L(k) is the level of the kth clustering. We denote the proximity between clusters (r) and (s) by d[(r), (s)].

Step 1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.

Step 2. Find the least dissimilar pair of clusters in the current clustering, say pair {(r), (s)}, according to d[(r), (s)] = min{d[(i), (j)]}, where the minimum is over all pairs of clusters in the current clustering.

Step 3. Increment the sequence number: m <- m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r), (s)].

Step 4. Update the proximity matrix, D, by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r, s), and an old cluster (k) is defined as follows (1): d[(k), (r, s)] = max{d[(k), (r)], d[(k), (s)]}

Step 5. If all objects are in one cluster, stop. Else, go to Step 2.

(1) For the single link method, the proximity is updated in Step 4 above as follows: d[(k), (r, s)] = min{d[(k), (r)], d[(k), (s)]}. For the average link method, it is updated as follows: d[(k), (r, s)] = avg{d[(k), (r)], d[(k), (s)]}.

2.4 Graph Data Model and Documents

Text documents, in general, and web documents, in particular, do not have a well defined structure or schema. In fact, in [1, 12], web documents are referred to as semistructured data. Therefore, it is not easy to model them using a classical data model, like the relational data model. To model documents and their clusters and categories, we propose to use a graph data model [64] with a novel idea of semantic group. The graph data model used has a structure similar to the ER (Entity-Relationship) data model [14], the Hypernode data model [44, 62], and the graph data model proposed in [41]. The basic structure of the data model used in this research is a graph where nodes correspond to entities, and links correspond to relationships between these entities. The links are labeled, and their labels define types of relations.

In traditional database models, such as the relational data model, views are built on top of base relations. In the graph data model, the definition of views is conceptually different. Views are built as part of the data itself. Views are subgraphs in which relationships are logically chained. Two fundamental issues addressed in the graph data model are data sharing and view update. Links shared by multiple views cause the problem of updating these views independently; that is, an update on a given view may affect other views in an unwanted way. In the following we will define the major components of a graph data model.

2.4.1 Graph-Based Database

A graph data model is a set of typed nodes (atomic information) connected with typed links (relations between nodes). A domain (subgraph) can also be abstracted by a node.
A graph-based database state can be defined by the 7-tuple DB = < N, NT, NM, L, LT, LM, D >, where N is the set of all nodes; it represents a collection of atomic pieces of information. NT is the set of node types; every node has a type. NM is the mapping between nodes and their types. L is the set of links; it represents the relationships between atomic entities (nodes). LT is the set of link types; every link has a type. LM is the mapping between links and their types. D is the set of domains.

2.4.2 Domain

The last element in the tuple defining the state of a graph-based database is D, the set of domains. A domain d is basically a subgraph; d = {c_i}, where the c_i are components (nodes and links) that belong to domain d. The notion of domain is close to the notion of traditional views, i.e., a part of a database selected according to some criteria. However, data in domains are independent, and modifications in one domain are not reflected in others, even if they contain the same components. A domain can be encapsulated into a node, which defines a composite structure similar to the Hypernode model [62]. Encapsulation of domains is recursive, i.e., a domain may contain other domains.

2.4.3 Semantic Group

The semantic group concept captures transitivity of relationships in a database. Traditional views are also used to capture transitivity. Views can be created for different contexts in the database. However, views are defined on top of actual data, and therefore, update of views in traditional databases is a complex problem. It will be shown in the proposed data model that transitivity of relations is naturally supported by the notion of semantic groups. Only relationships in the same semantic group are transitive.

2.5 Literature Review

This section presents an overview of literature work on document modeling and clustering techniques, and graph data models. Section 2.5.1 presents some work done in document modeling. Section 2.5.2 contains a survey of data and document clustering techniques. Section 2.5.3 presents a review of work done in graph data models.

2.5.1 Document Modeling and Similarity Measure

In this section we present related work done in the area of document preprocessing, modeling, term-weighting and similarity measure between documents. Salton [69], Frakes and Baeza-Yates [22] are among the classic references in the area of Information Retrieval. To extract the most useful keywords from documents, the preprocessing phase of Information Retrieval Systems usually includes the removal of useless words, also called a stop-list, from documents, as proposed by Fox [20]. The preprocessing phase of some Information Retrieval Systems also includes normalization of keywords, called document stemming, as proposed by the Porter algorithm [61, 21]. In most recent Information Retrieval systems, documents are modeled using the vector space as proposed by Salton [69], where each document is represented by a vector of keywords. The product of term frequency (tf) and the inverse document frequency (idf) has been proposed as a term-weighting scheme. Different similarity measures between documents and keywords can be used to determine how close documents are to each other. The cosine vector similarity measure is widely used [8, 22, 69, 77].

2.5.2 Data and Document Clustering

In this section we present related work done in the area of data and document clustering. Murtagh [57] presented a survey of hierarchical clustering algorithms.
The study focused on performance analysis of those algorithms with respect to time and space complexity; not much comparison of cluster validity was addressed. Another survey of hierarchical clustering methods is presented in [34]. It was suggested that no hierarchical clustering method is best for all clustering problems, and that the quality of hierarchical clustering methods depends on the application domain. It was also noted that with single link clustering method, clusters easily chain together and are less compact, since only a single small edge between two large clusters is needed to merge the clusters. The complete link clustering method is more conservative; in that, all pairs of patterns must be related before the patterns are clustered, resulting in more compact clusters. Recent work using hierarchical clustering has been done in the area of cellular manufacturing for identifying alternative cell designs. Measures of association for forming part of families and machine cells were defined, and different hierarchical clustering methods were compared. Mosier [56] showed that complete link dominated Single link for cluster recovery; and that single link was best overall for between-cells move measure. The study also Showed that there are significant interaction effects between clustering techniques, similarity measures and data structures. In [71], a comparison was performed between Single, average and complete linkage 19 clusterings for a new Similarity measure for the cell formation problem. The study showed that clustering techniques have greater impact than similarity measures, and there is no significant interaction between similarity measures and clustering tech- niques. Complete linkage clustering method performed best, and average linkage clustering performed worst. The authors introduced a new similarity measure that has the desirable property of yielding a perfect 1.0 whenever one part requires the same set or subset of the machines required by another part. Vakharia and Wemmerlov [75] showed that while the choice of clustering tech- niques is more critical than the choice of similarity measure, differences among cluster- ing techniques (due to chaining tendencies) can be reduced by restricting the solution space for acceptable cell configurations. Complete link performed poorly. Average link performed best. Some implementations of hierarchical clustering algorithms have been prOposed in [19]. In [17] a sequential, agglomerative, hierarchical, non overlapping (SAHN) clustering algorithm was described. Hierarchies have been used in Information Retrieval systems such as in [9, 23, 43], for organization, summarization and finding topics. Clustering has been applied to documents in [10, 51, 67, 76, 81, 80]. An evaluation of topic-driven web crawlers has been done in [52]. Iwazume et al. in [33, 32] proposed the use of an ontology for document navigation and text categorization. This is different from our approach, where we do not use ontology separately, but combined with keyword-based similarity measure. WordNet [54, 55] is a lexical database, where words are organized by different 20 types of relationships. WordNet thesauri can be used as an input ontology to our clustering technique. In [77], related work has been done, where words are clustered to produce such relationships. Other Artificial Intelligence approaches to classification of documents have been reported in [66, 6, 35, 47, 45, 46, 48, 78] where machine learning is emphasized. 
SQLEM [58] is an SQL implementation of a clustering algorithm, claimed by the authors to handle high dimensional data. “Scatter/ Gather” [29, 30, 60] is a novel idea proposed to cluster query results documents, before presenting them to the users. In [26], [5] and more recently in [3, 4], efficient clustering algorithms have been prOposed for large databases. 2.5.3 Graph Data Model and Semistructured Data In this section we discuss some graph-based data models, and we also present some relevant work that deals with modeling data and the Web. The graph data model that is used in this research has a structure similar to the ER (Entity-Relationship) data model proposed by Chen [14], the Hypernode data model [44, 62], and the graph data model proposed by Kunii in [41]. In the ER model, nodes represent entities and links represent relationships between them. The basic structure in the hypernode model is defined by G = (N, E), where G is the (unique) label of the hypernode, and (N, E) is a directed graph, whose nodes can be labels of other hypernodes (directed graph), attribute names, or atomic values. Since a node can itself be a nested graph, this model can support complex objects. The GRAS data model presented in [38] relies on attributed graphs. In this 21 model, objects are represented by typed nodes which may carry attributes. Relations between objects are modeled by bidirectional edges (with no attributes). The authors used graph schemes to represent the semantics of a database. This data model was biased toward modeling software engineering systems. Unlike in GRAS, in our data model, both the schema and instance of the database are Simultaneously represented. This results in a more dynamic update of the database. GOOD [27, 24, 59] started as a database interface, then evolved as a graph object oriented database system. Actually, it is a graph representation of an object oriented database. Nodes represent objects, and links represent relationships between objects. Here again, the authors distinguished between an object base instance, and an object base scheme. The H 31* system [15, 16] is a query and visualization system that provides the user with tools that assist in forming queries, and visualizing results. It is not a full graph data model. In [49] the authors presented a graph-based object model and focused on visual retrieval and user interface functionalities. Another graph-based framework for visual access has been proposed in [13]. Griffin et al. [25] used incremental view update through auxiliary tables that contain information recorded since the last view update. This work was motivated by performance improvement. It is different from our approach, where we address view update in the context of semantic groups. Related work has also been conducted in the area of semi-structured data [1, 2, 11, 12]. Theoretical issues related to modeling and querying such data were addressed. The Object Exchange Model (OEM) was compared in [1] to relational and object 22 data models. The OEM model was shown to be more appropriate for modeling semi-structured data. In [12] the query language UnQL was proposed for querying data organized as a rooted graph. Another system based on the OEM model called TSIMMIS [28] was proposed to extract semi-structured data from HTML documents using template matching. Other work has been done for integrating and querying data over the Web. A W resource database system [79], that uses limited hypertext structure in index- ing HTML documents was developed. 
Some SQL-like languages, such as WebSQL [53] and W3QS [40], considered both hypertext structure querying and content query- ing. In our approach, we consider modeling semi-structured data, thus we use a graph based query language rather than a SQL—like language. Other web querying systems such as [7] and [42] are based on a logic data model. The ADM model [7] was developed to manage and restructure Web data using an interesting page-oriented approach that considers HTML forms. The authors provided transformation mechanisms from hypertext to relational views. 2.6 Summary In this chapter we introduced the concepts of document modeling, clustering tech- niques and graph data models. These concepts will be used to describe the research work performed for this dissertation. We have also described the published research work related to document modeling and clustering techniques, and graph data models. The reviewed papers suggest that no specific clustering method is best for all clus- 23 tering problems and that the quality of clustering methods depends on the application domain. We observe that similarity measure based only on keyword frequency is not ap- propriate for all domains. In particular, if there is an ontology available for a specific domain, the use of that ontology can provide some added semantics that can improve document clustering. It is worth mentioning that most studies on graph data model focused on defining new approaches to presenting a graph data model. Our work, on the other hand, fo- cuses on using views through the notion of semantic groups for modeling and querying documents, and addresses the issue of updating views in graph data model. 24 Chapter 3 Ontology-based Similarity Measure 3.1 Introduction As mentioned in Chapter 1, similarity measure based only on keyword frequency is not appropriate for all types of data sets. In particular, if there is an ontology available for a specific domain, keyword-based similarity measure may not take advantage of the semantics that can be provided by this ontology. Section 3.2 contains an analysis of cosine vector similarity and Euclidean distance measures for the domain of text documents. Section 3.3, focuses on the ontology- based Similarity measure prOposed in this work. It also describes the ontology in text Space, and provides a general description of two approaches used to evaluate the performance of the proposed ontology-based similarity measure. The first approach uses a set of documents having predefined clusters, with a priori knowledge about its cluster structure. The second approach uses real document data sets, gathered from the web, without considering a priori knowledge about the structure of the data. 25 For the purpose of comparison, we provide for both approaches, results achieved using ontology-based Similarity measure as well as using the standard cosine vector similarity. Those results are reported in Sections 3.4, and 3.5. Finally, in Section 3.6, we summarize this chapter. 3.2 Similarity and Distance Measures for Docu- ments While designing our proposed ontology-based Similarity measure, we analyzed cosine vector similarity and Euclidean distance measures in Text Space. In this section, we discuss the relationship between standard cosine vector similarity measure and the Euclidean distance. These relationships lead us to focus on the cosine vector similarity measure and try to improve it. 
3.2.1 Cosine Vector Similarity Measure

As seen in Chapter 2, in the case of a t-dimensional space, that is, if the number of distinct keywords in the total collection of documents is t, and a document d_i in the collection is represented by the vector \vec{d_i} = (w_{i1}, w_{i2}, \ldots, w_{it}), then the cosine vector similarity between documents d_i and d_j is defined as:

    sim(d_i, d_j) = \frac{\sum_{k=1}^{t} (w_{ik} \times w_{jk})}{\sqrt{\sum_{k=1}^{t} w_{ik}^2} \times \sqrt{\sum_{k=1}^{t} w_{jk}^2}}     (3.1)

Salton in [69] gave a motivation for the cosine vector similarity measure by arguing that the sum of the products of the frequencies of corresponding keywords (or vector dot product) between two documents is a good indicator of the similarity between those documents. He then argued that, to avoid the problem of discriminating against smaller documents in favor of larger ones, a normalization by the length of the document vector is in order. Shapiro and Stockman in [72], invoking the Cauchy-Schwartz Inequality [73], also suggest the normalized dot product as a measure of similarity between vectors. We first set out to understand how normalization affects the document similarity measure.

Definition. Let d_i be any document represented by the vector \vec{d_i} = (w_{i1}, w_{i2}, \ldots, w_{it}), and let a be any positive real number. We define the scaling normalization of document d_i by a (scalar-document product) as the document \frac{1}{a} d_i, represented by the vector \frac{1}{a} \vec{d_i}. The kth component of the normalized vector is then:

    w'_{ik} = \frac{w_{ik}}{a}, \quad k = 1, 2, \ldots, t     (3.2)

Proposition. The cosine vector similarity measure is invariant with respect to scaling.

Proof. Let d_i and d_j be any two documents represented by the vectors \vec{d_i} = (w_{i1}, w_{i2}, \ldots, w_{it}) and \vec{d_j} = (w_{j1}, w_{j2}, \ldots, w_{jt}) in the t-dimensional space, and let a and b be any two positive real numbers. We shall prove that

    sim\left(\frac{d_i}{a}, \frac{d_j}{b}\right) = sim(d_i, d_j),     (3.3)

where sim(d_i, d_j) is the cosine vector similarity between documents d_i and d_j.

    sim\left(\frac{d_i}{a}, \frac{d_j}{b}\right)
    = \frac{\sum_{k=1}^{t} \left(\frac{w_{ik}}{a} \times \frac{w_{jk}}{b}\right)}{\sqrt{\sum_{k=1}^{t} \left(\frac{w_{ik}}{a}\right)^2} \times \sqrt{\sum_{k=1}^{t} \left(\frac{w_{jk}}{b}\right)^2}}
    = \frac{\frac{1}{ab} \sum_{k=1}^{t} (w_{ik} \times w_{jk})}{\frac{1}{ab} \sqrt{\sum_{k=1}^{t} w_{ik}^2} \times \sqrt{\sum_{k=1}^{t} w_{jk}^2}}
    = sim(d_i, d_j)     (3.4)

This result is another form of the property of normalized cross correlation being independent of scale factors [72].
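As a quick numerical check of this proposition (a hypothetical toy example, not part of the experiments reported here), scaling a document vector leaves the cosine similarity unchanged, while the plain Euclidean distance between the two vectors does not stay fixed:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_i = [3.0, 1.0, 0.0, 2.0]
d_j = [6.0, 2.0, 0.0, 4.0]        # the same document scaled by 2

print(cosine(d_i, d_j))            # 1.0 -- invariant under scaling (Equation 3.3)
print(euclidean(d_i, d_j))         # about 3.74 -- the plain distance is not scale invariant
```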
3.2.2 Euclidean Distance Measure

Figure 3.1 illustrates the cosine vector similarity and the Euclidean distance in a 3-dimensional space, where the cosine vector similarity between documents d_i and d_j is represented by cos α, and the standard Euclidean distance by dis(d_i, d_j).

[Figure 3.1: Illustration of distance and similarity measures]

In the following, we investigate what would happen if the Euclidean distance is used to quantify the dissimilarity of documents. We are here talking about dissimilarity instead of similarity because, as seen in Figure 3.1, the bigger the Euclidean distance between two documents on the surface of the unit hypersphere, the more they are dissimilar. After all, by using the Euclidean distance, both arguments presented by Salton in [69] remain valid. The sum of the squared differences between the frequencies of corresponding keywords is an indicator of how dissimilar two documents are.

Standard Euclidean Distance

Let's consider that there are t different keywords in the collection of documents. If a document d_i is represented by the following vector in the t-dimensional space: \vec{d_i} = (w_{i1}, w_{i2}, \ldots, w_{it}), then the dissimilarity between two documents d_i and d_j can be defined by the Euclidean distance as follows:

    dis(d_i, d_j) = \sqrt{\sum_{k=1}^{t} (w_{ik} - w_{jk})^2}     (3.5)

It has the following metric properties:

    \forall i, \quad dis(d_i, d_i) = 0     (3.6)
    \forall i, j \mid i \neq j, \quad dis(d_i, d_j) > 0     (3.7)
    \forall i, j, \quad dis(d_i, d_j) = dis(d_j, d_i)     (3.8)
    \forall i, j, k, \quad dis(d_i, d_j) \leq dis(d_i, d_k) + dis(d_k, d_j)     (3.9)

It also has the property of being invariant with respect to rotation and translation. Because of these properties, the Euclidean distance has been used as a measure of dissimilarity in many pattern recognition domains, like image retrieval. The problem with using the Euclidean distance for the domain of documents is that documents with comparable size tend to be more similar to each other and less similar to other much larger or much smaller documents, even though all these documents are semantically similar. In the following we address this problem by normalizing the document vectors by their vector length.

Normalization by Document Length

Let document d_i be represented by the t-dimensional vector \vec{d_i} = (w_{i1}, w_{i2}, \ldots, w_{it}). We define the length of document d_i as the length of vector \vec{d_i}, that is:

    |\vec{d_i}| = \sqrt{\sum_{k=1}^{t} w_{ik}^2}     (3.10)

We define the normalized document vector as:

    \vec{d_i}' = \frac{1}{|\vec{d_i}|} \vec{d_i}     (3.11)

The kth component of the normalized vector is then:

    w'_{ik} = \frac{w_{ik}}{|\vec{d_i}|}, \quad k = 1, 2, \ldots, t     (3.12)

The modified dissimilarity between two documents d_i and d_j can be defined by the Euclidean distance between the document vectors normalized by their length as follows:

    dis'(d_i, d_j) = \sqrt{\sum_{k=1}^{t} \left(\frac{w_{ik}}{|\vec{d_i}|} - \frac{w_{jk}}{|\vec{d_j}|}\right)^2}     (3.13)

The normalized distance between d_i and d_j in Figure 3.1 is illustrated by dis'(d_i, d_j).

Proposition. The Euclidean distance applied to the normalized document vectors is semantically equivalent to the standard cosine vector similarity measure. In fact, the cosine vector similarity is quadratically proportional to the Euclidean distance.

Proof. Let d_i and d_j be any two documents represented by the vectors \vec{d_i} = (w_{i1}, w_{i2}, \ldots, w_{it}) and \vec{d_j} = (w_{j1}, w_{j2}, \ldots, w_{jt}) in the t-dimensional space. We shall prove that

    sim(d_i, d_j) = 1 - \frac{dis'^2(d_i, d_j)}{2},     (3.14)

where sim(d_i, d_j) is the cosine vector similarity between documents d_i and d_j, and dis'(d_i, d_j) is the Euclidean distance between the normalized document vectors of d_i and d_j.

    dis'^2(d_i, d_j) = \sum_{k=1}^{t} \left(\frac{w_{ik}}{\sqrt{\sum_{l=1}^{t} w_{il}^2}} - \frac{w_{jk}}{\sqrt{\sum_{l=1}^{t} w_{jl}^2}}\right)^2
    = \frac{\sum_{k=1}^{t} w_{ik}^2}{\sum_{l=1}^{t} w_{il}^2} + \frac{\sum_{k=1}^{t} w_{jk}^2}{\sum_{l=1}^{t} w_{jl}^2} - 2 \frac{\sum_{k=1}^{t} (w_{ik} \times w_{jk})}{\sqrt{\sum_{l=1}^{t} w_{il}^2} \times \sqrt{\sum_{l=1}^{t} w_{jl}^2}}
    = 2 - 2\, sim(d_i, d_j)     (3.15)

This equation is another form of the cosine law in [73].

3.2.3 Discussion

In this section, we have shown that the Euclidean distance applied to the normalized document vectors is semantically equivalent to the standard cosine vector similarity measure. While the Euclidean distance has the property of being invariant with respect to rotation and translation, the cosine vector similarity measure has the property of being invariant with respect to scaling. In the domain of documents, the latter property is more important since, for example, we expect a technical paper or a news report to have a high similarity measure to their abstract or shorter versions. Moreover, with the advance in the development of ontologies and domain specific topical hierarchies, as in most recent Web search engines, the ability to automatically use the semantics provided by such knowledge structures is desirable. For those reasons, we proposed a hybrid similarity measure that combines both keyword information and ontology-based knowledge.
In the next section, we describe the proposed similarity measure, and report on its evaluation.

3.3 Proposed Ontology-based Similarity Measure (OBSM)

In this section, we introduce the use of ontology in text space in Section 3.3.2. We define the proposed ontology-based similarity measure (OBSM) in Section 3.3.3. Section 3.3.4 provides a general description of the two approaches used to evaluate the performance of OBSM. The first approach uses a set of documents having predefined clusters, with a priori knowledge about its cluster structure. The second approach uses real document data sets, gathered from the web, without considering a priori knowledge about the structure of the data.

3.3.1 Motivation

As discussed in the previous section, the use of ontology-based knowledge is motivated by the essence and intrinsic properties of the cosine vector similarity measure itself. Keywords that are part of a same category or topic should mutually contribute with a higher effect on the similarity of two documents in which they appear.

3.3.2 Ontology Used in Text Space

An ontology in the text space can be defined as a hierarchy of categories. Each category is defined by a set of keywords, ordered by their importance with respect to their category. This order is achieved by associating a weight that measures the importance of a keyword in that category. Keywords may be part of many categories. This allows for an overlapping hierarchical category structure. Categories are based on term-to-term relationships and not on their appearance in documents. The ontology relationships are obtained by considering the terms as part of concepts. Combining the ontology with standard keyword-based similarity adds a new semantic to the information retrieval system. An ontology can be constructed from a term categorization structure such as a thesaurus, which generally includes synonyms. However, relationships between keywords in an ontology may be general. Ideally, they should be specified by a domain specific expert.

3.3.3 Ontology-based Similarity Measure

An ontology is used to highlight those keywords that may be important to some specific categories and not others. In this ontology, a category is defined as a set of keywords sorted by order of importance for that category. A weight is used to capture this order of importance. A keyword that happens to be significant to a category will have a bigger contribution to the similarity between documents that contain it. If t is the number of distinct keywords in the whole collection of documents, s is the number of categories in the ontology, w_{ik} is the term weight associated with keyword k and document d_i, and w'_{ck} is the weight of keyword k (k = 1, 2, ..., t) with respect to category c, then the ontology-based similarity between two documents d_i and d_j, represented by the t-dimensional vectors \vec{d_i} = (w_{i1}, w_{i2}, \ldots, w_{it}) and \vec{d_j} = (w_{j1}, w_{j2}, \ldots, w_{jt}), is:

    sim(d_i, d_j) = \frac{\sum_{c=1}^{s} \left(\sum_{k=1}^{t} W_{ick} \times \sum_{k=1}^{t} W_{jck}\right)}{\sqrt{\sum_{c=1}^{s} \left(\sum_{k=1}^{t} W_{ick}\right)^2} \times \sqrt{\sum_{c=1}^{s} \left(\sum_{k=1}^{t} W_{jck}\right)^2}}     (3.16)

where W_{ick} = w_{ik} \times w'_{ck}. The original weight associated with keyword k and document d_i is multiplied by the relative weight of keyword k with respect to category c. The sum of the products is performed over all categories. Different keywords can be identified if they are in the same category.
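To make Equation 3.16 concrete, the sketch below computes the ontology-based similarity for two toy documents, following the formula as reconstructed above (the category names, weights, and helper functions are illustrative assumptions, not the dissertation's implementation):

```python
import math

def obsm(doc_i, doc_j, ontology):
    """Ontology-based similarity, following Equation 3.16 as reconstructed above.
    doc_i, doc_j: dicts keyword -> term weight w_ik.
    ontology: dict category -> dict keyword -> category weight w'_ck."""
    def category_scores(doc):
        # For each category c: sum over keywords k of W_ck = w_k * w'_ck.
        return [sum(w * cat.get(k, 0.0) for k, w in doc.items())
                for cat in ontology.values()]
    si, sj = category_scores(doc_i), category_scores(doc_j)
    num = sum(a * b for a, b in zip(si, sj))
    den = math.sqrt(sum(a * a for a in si)) * math.sqrt(sum(b * b for b in sj))
    return num / den if den else 0.0

ontology = {"medication": {"drug": 5.0, "dose": 3.0},
            "networking": {"node": 5.0, "link": 4.0}}
d1 = {"drug": 0.8, "dose": 0.5, "link": 0.2}
d2 = {"drug": 0.6, "node": 0.7}
print(obsm(d1, d2, ontology))      # fairly high: both documents load mostly on "medication"
```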
This guarantees that if there exists a category to which many common keywords of documents d_i and d_j belong, then the contribution of those keywords is expanded, causing the two documents to be more similar to each other than to other documents. Therefore, those documents will have a better chance to be classified in a cluster that semantically resembles the category of the input ontology. It is worth noting that we keep the normalization by the vector length, so that again we avoid discrimination against smaller documents in favor of larger ones. OBSM can also be used for the similarity between documents and queries in the case of interactive Information Retrieval systems as well, like a search engine. In that case, the user may even contribute interactively in the choice of the categories that may be used to expand the contribution of the keywords that he/she enters in the query. In the next sections, we present results of our evaluation of the performance of OBSM.

3.3.4 Performance Evaluation Approaches

We considered two different evaluation approaches. In the first approach, we generate clustered data (here clustered is in terms of keyword overlap). This is similar to the training set approach used in artificial intelligence based clustering. Then we compute the distance between these clusters in terms of similarity measure, using the standard cosine vector similarity measure (CVSM) and OBSM, and we check if the clusters are preserved. In the second approach, we collect real documents and cluster them using both similarity measures.

3.4 Data Sets With a Priori Knowledge

In this approach, to evaluate the performance of our technique using OBSM, we generated a set of documents, with keywords drawn from a limited set of keywords. An ontology is built associating keywords into different categories, with weights representing an ordering within a category. A document is then associated with a category by having more keywords drawn from that category than any other. We vary the overlap between documents in terms of the percentage of common keywords, both within the same cluster (intra-cluster overlap) and from different clusters (inter-cluster overlap). If a cluster i has n_i t-dimensional data points {d_{ik}} (k = 1, 2, ..., n_i), then the intra-cluster and inter-cluster average similarities are defined respectively as follows:

    \overline{sim}_{ii} = \frac{1}{n_i (n_i - 1)} \sum_{k=1}^{n_i} \sum_{l=1, l \neq k}^{n_i} sim(d_{ik}, d_{il})     (3.17)

    \overline{sim}_{ij} = \frac{1}{n_i n_j} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} sim(d_{ik}, d_{jl})     (3.18)

For various data sets, we computed the intra-cluster and inter-cluster average document similarities. A data set with a good clustered structure should exhibit a high intra-cluster similarity, showing a good compactness of the clusters, and a low inter-cluster similarity, showing a good isolation between clusters.

3.4.1 Varying Intra-cluster Overlap

Due to the correlation between categories used in the ontology on one hand, and the clusters used to generate the documents on the other hand, the ontology should be generated independent from the documents. First, we used a data set with a dimension of 20. The keywords have been drawn in a pool of two categories of 10 documents with an overlap of 10%. The results show that OBSM reflects better the clustered structure of the documents. This is illustrated in Figure 3.2 and Figure 3.3. Figure 3.2(a) shows the variation of intra-category and inter-category CVSM similarities. Figure 3.2(b) shows the ratio of intra-category over inter-category CVSM similarity as a function of the intra-category overlap.
We found that for this particular data set, the ratio of intra-cluster over inter-cluster CVSM similarity is not a non-decreasing function of the intra-category overlap.

Figure 3.2: Intra and inter-cluster CVSM similarities ((a) intra- and inter-cluster similarities; (b) intra over inter-cluster similarity ratio, as functions of the percentage of intra-cluster overlap)

Figure 3.3(a) and Figure 3.3(b) show the same quantities computed using OBSM. Unlike for CVSM, the ratio of intra-cluster over inter-cluster OBSM similarity is a non-decreasing function of the intra-category overlap. This means that for two data sets, one exhibiting more compact clusters than the other, OBSM will determine that the first data set has a better clustered structure; using CVSM, that may not be guaranteed.

Figure 3.3: Intra and inter-cluster OBSM similarities ((a) intra- and inter-cluster similarities; (b) intra over inter-cluster similarity ratio, as functions of the percentage of intra-cluster overlap)

As a next step, we decided to generate the ontology automatically. We generate documents varying: (i) the number of keywords; (ii) the number of clusters; (iii) the number of documents; (iv) the percentage of intra-cluster overlap; (v) the percentage of inter-cluster overlap. For different data sets, we compute the intra- and inter-cluster average similarities, as well as their ratio, for both CVSM and OBSM.

Dimension (keywords): 1000
Number of categories: 50
Number of documents: 2000
Inter-category overlap: 5%
Table 3.1: Data set used to generate Figure 3.4

Figure 3.4(a) shows the variation of intra-category and inter-category OBSM similarities for the data set described in Table 3.1. Figure 3.4(b) shows that the corresponding ratio of intra-category over inter-category OBSM similarity remains a non-decreasing function of the intra-category overlap. As the intra-cluster overlap increases, more keywords from a common category are shared among documents of a cluster; therefore, the intra-cluster average similarity increases. Similarly, fewer keywords from the same category are shared among documents from different clusters; therefore, the inter-cluster average similarity decreases, resulting in an increasing intra over inter-cluster average similarity ratio.

Figure 3.4: Intra and inter-cluster OBSM similarities ((a) intra- and inter-cluster similarities; (b) intra over inter-cluster similarity ratio, as functions of the percentage of intra-cluster overlap)

Dimension (keywords): 20
Number of categories: 2
Number of documents: 500
Table 3.2: Data set used to generate Figure 3.5
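The following is a loose sketch of the kind of keyword-overlap generation described above, assuming a simple sampling scheme; the helper generate_document and its parameters are hypothetical and are not the exact procedure used to produce these data sets.

```python
import random

def generate_document(category_keywords, other_keywords, n_terms, overlap):
    """Hypothetical generator for one synthetic document (a bag of keywords).

    category_keywords : keywords of the document's own category
    other_keywords    : keywords belonging to the remaining categories
    n_terms           : number of keyword occurrences to draw
    overlap           : fraction of terms drawn from other categories
    """
    doc = []
    for _ in range(n_terms):
        if random.random() < overlap:
            doc.append(random.choice(other_keywords))     # shared / overlapping term
        else:
            doc.append(random.choice(category_keywords))  # term from the document's own category
    return doc
```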
3.4.2 Varying Inter-cluster Overlap 5.5 l I I Y I I I 5 - \ -Ontology—based similarity - 4.5 - ‘ -Coeine-vector similarity - Ratio of intra over inter-cluster similarities 0) I 1 1.5’- '1 l J 1 l l l l 1 1 0 1 5 20 25 30 35 40 45 50 Percentage ol inter-cluster overlap Figure 3.5: Intra over inter-cluster similarity ratio Figure 3.5 generated for the data set listed in Table 3.2 shows that for small inter- cluster overlap, OBSM yields a slightly larger ratio of intra over inter—cluster similarity than CVSM. Again this is desirable, since a small inter-cluster overlap means better cluster isolation. 42 3.4.3 Discussion In both generated data sets, we could see the improvement brought about by com- bining ontology-based information with keyword-based similarity measure. This im- provement was quantified by the isolation and compactness of output clusters of data with varying clustered structure. 3.5 Real Data Set It should be emphasized that this performance evaluation approach is completely different from the previous one. In fact, in this approach, we collect real documents with no a priori knowledge in terms of initial cluster structure, then we cluster the documents using a clustering technique, using both similarity measures: CVSM and OBSM. This is difierent from the first approach, where we did not perform any cluster- ing. Instead, we generated clustered data, where the clustered structure is measured by overlap in terms of common keywords. Then, we computed the dissimilarity be- tween these clusters in terms of both similarity measure (standard cosine vector and our proposed similarity measure), and we checked if and how well the cluster structure was preserved. The methods of web document gathering, pre—processing, and clustering technique are covered in Sections 3.5.1, 3.5.2, and 3.5.3. In section 3.5.4, we address the issue of which web documents do we cluster: a randomly selected subset, or a subset selected from an ontology, like Yahoo [18]. In Section 3.5.4, documents are used from Yahoo’s medication index pages. 43 3.5.1 Web Document Gathering We needed to automatically gather web documents. We have implemented a gathering utility, based on the “HtDig” robot [31] and “Efficacy” [36], a data gathering system for the World Wide Web, for which I participated in the design and implementation. Our gathering utility starts with a set of initial web sites. Then the robot follows all hyper-links to a level depth specified as a parameter. A cache copy of all documents is created, and an index of web page URLs and all document information is created. 3.5.2 Document Preprocessing We needed to prepare the web documents for analysis. In the preprocessing phase, we performed the following operations: lexical analysis, elimination of stop words, stemming, construction of term categorization structures. In lexical analysis, we extracted a stream of selected words from a stream of characters. To do this, we have implemented an automata based lexical analyzer, which scans and identifies the words of a document, removing punctuation marks, and single letters. We eliminate those words that occur in most documents, called stop words. These words are considered usele because they do not have much discrimination pr0perty. This reduces the size of the document database considerably. We have implemented a stemmer that reduces a word to its origin, or stem, by removing its affixes. For this we have implemented a finite state machine that is based on linguistics rules. 
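As a rough illustration of these preprocessing steps (lexical analysis, stop-word elimination, and stemming), the sketch below uses a tiny stop-word list and naive suffix stripping; it is not the automaton-based lexical analyzer or the rule-based stemmer implemented in this work.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s")  # crude suffix stripping, not the dissertation's rule set

def preprocess(text):
    """Simplified preprocessing: tokenize, drop stop words and single letters, strip suffixes."""
    words = re.findall(r"[a-z]+", text.lower())           # keep alphabetic tokens only
    words = [w for w in words if len(w) > 1 and w not in STOP_WORDS]
    stems = []
    for w in words:
        for suf in SUFFIXES:
            if w.endswith(suf) and len(w) - len(suf) >= 3:
                w = w[: -len(suf)]
                break
        stems.append(w)
    return stems
```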
The advantage of this operation is that it further reduces the size of the document database.

We create an ontology in the form of classified hierarchies. It consists of a compiled list of important words in a given domain of knowledge. For each word in the list, a set of related words is defined. A term in the thesaurus is usually a single word. Using an ontology in data retrieval allows concepts to be attached to words, which significantly increases the quality of clustering.

We used the vector model to represent documents. For term weighting, we used the tf-idf scheme [69]. We use the inverse document frequency because terms that appear in many documents are not very useful for distinguishing documents.

3.5.3 Clustering Technique

The categorization in our clustering method uses hierarchical clustering, with a threshold on the number of steps after which clusters are formed. The main clustering methods considered are: (i) complete link: maximum distance, (ii) single link: minimum distance, and (iii) average link: average distance.

Categories                          Document IDs
Business & Economy (Jobs)           1-8
Education                           9-18
Government (Military)               19-29
Regional                            30-35
Science                             36-44
Health                              45-51
Recreation & Sports (Sports)        52-63
Entertainment (Movies)              64-76
Computers & Internet                77-82
Total: 9 categories                 82
Table 3.3: Real Documents Data Set

Figure 3.6: Hierarchical single link clustering
Figure 3.7: Hierarchical average link clustering
Figure 3.8: Hierarchical complete link clustering

Figures 3.6, 3.7, and 3.8 show the clustering results of the single link, average link and complete link clustering methods for a set of 82 documents, taken from nine categories of the Yahoo web directory and listed in Table 3.3. We chose the "complete link" method because it is a conservative method: all pairs of documents must be related before the documents are clustered. In the "single link" method, only a single edge between two large clusters is needed to merge the clusters; clusters easily chain together and are less compact.

3.5.4 Web Documents Data Set

Now that we have decided on the clustering technique, we need to decide on the type of document data set to use for the evaluation. There is a huge number of documents on the web. It is estimated that the Web contains about 320 million pages of information. Most search engines cover only a limited subset of the web; for instance, Lycos [50] covers only about 3 percent of these. Therefore, there are several approaches to deciding which documents to cluster: (i) cluster all web pages gathered by following all links, (ii) start from a predefined set of web pages, or (iii) select a subset of web pages within a site (assuming web pages are grouped per site), for example large pages, or virtual representative pages per site. In order to provide a good evaluation of the practicality and usefulness of OBSM, we used documents from Yahoo's medication index [18].
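The experiments described next combine the tf-idf weighting of Section 3.5.2 with complete-link hierarchical clustering. The snippet below is a minimal sketch of the weighting step only; the name tfidf_vectors and the plain-list representation are illustrative, not the system's actual code.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """tf-idf term weighting for a list of tokenized documents (sketch)."""
    n_docs = len(docs)
    vocab = sorted({w for doc in docs for w in doc})
    df = Counter(w for doc in docs for w in set(doc))            # document frequency
    idf = {w: math.log(n_docs / df[w]) for w in vocab}           # inverse document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[w] * idf[w] for w in vocab])          # weight w_ik = tf * idf
    return vocab, vectors
```

The resulting vectors can then be handed to any agglomerative clustering routine using 1 - similarity as the distance, for example scipy.cluster.hierarchy.linkage with method="complete".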
49 Description of Data Set Used In this experiment, We used documents from Yahoo’s drugs and medication indexl. We show the results of the clustering using hierarchy dendrograms, where the docu- ments of a targeted category (Detrol pages) were labeled. A good clustering would be indicated by all labeled documents being grouped together in the same compact cluster, isolated from other clusters. From Yahoo categories directory, we traced a path from the root of the Yahoo hierarchy to the sub-category “Specific Drugs and Medications”. As Figure A.1 of Appendix A shows, this path is: “Home > Health > Pharmacy > Drugs and Medi- cation” where there is a link to “Yahoo! Health: Drugs and Medications” web page, which has links to over 1000 drug and medication pages, listed by alphabetical order. For example under the “D” index”, a link to Yahoo’s Detrol3 page can be found. That page, for example, gives a description of Tolterodine (brand name: Detrol) “helps pa- tients with an overactive bladder...” Figures A.2 and A.3 of Appendix A show a part of Yahoo’s “D” index and Detrol pages. We used our web gathering utility to automatically download all 2,070 documents from the Yahoo medication index, grouped by drug or type of medication. Then we mixed in all 124 documents that we downloaded from the Detrol" web site. We preprocessed the documents using the techniques described in Section3.5.2. The number of distinct keywords in the whole document collection, which represent the 1httpz//health.yahoo.com/hea.lth/pdr.drugs/a..html 2http://hea.lth.ya.hoo.com/hea.lth/pdr.drugs/d.html 3http: / /health.yahoo.com / health] pdrdrugs/ 0908/ 0.html ‘http://www.detrol.com/ 50 dimension of the data set is 9,397. Table 3.4 shows a summary of the data set built. Yahoo’s medications index Detrol.com Total Number of documents 2,070 124 2,194 Dimension (keywords) - - 9,397 Table 3.4: Summary of the documents gathered from Yahoo and Detrol web sites Clustering Results We used both CV SM and OBSM to cluster this data set. The resulting clustering dendrograms are shown in Figures 3.9, 3.10 and 3.11. To make the documents of the targeted cluster, i.e. Detrol related documents, easy to spot on the dendrograms, we labeled them by “*” and “+++”: e each “*” represents one of the 124 Detrol.com web site documents. 0 each “+++” represent one of the two Detrol related documents contained in the Yahoo medications database. A good clustering would be indicated by all labeled documents being grouped together in the same compact cluster, isolated from other clusters. Figure 3.9 shows that with CVSM, the Detrol.com documents did cluster together, but did so in a different cluster than Yahoo’s Detrol related documents. This may be because Yahoo’s documents share many keywords that are specific to Yahoo’s style, such as the type of headers. The effect of these keywords seems to overtake Detrol related keywords. The clustering dendrograms obtained using OBSM are displayed in Figures 3.10 and 3.11. By using an ontology, we emphasize keywords that are specific to a cate- 51 1.0 0.8 _ ...f 3. :1 .... 3.... ___ __.... 3.. flu. 11.314 3.2: __.... ._:__ _. . . z . . Figure 3.9: Clustering using CVSM 52 1.0 0.8 Figure 3.10: Clustering using OBSM (weight 5) 53 .4 Figure 3.11: Clustering using OBSM (weight 15) 54 gory that an expert would recognize as relevant to Detrol medication, treatment and diagnosis. Specifically, the keywords in the Detrol category of this ontology are: “De- trol”, “Detrolla”, “overactive” , “bladder”, “Tolterodine”. 
We include these keywords in the same category of the ontology and give them higher weights than keywords related to other medications. Figure 3.10 was generated with a weight of 5, whereas Figure 3.11 uses a weight of 15. The different weight is the reason why both Yahoo documents cluster with the Detrol.com documents in Figure 3.11, but only one does so in Figure 3.10. An analysis of the Detrol web site shows that there are whole web pages that talk about the symptoms and diagnosis for overactive bladder. These web pages include multiple citations of the keywords used in the ontology. And since those keywords have been extracted from the Yahoo’s index, the dot-product between those documents induced by those keywords gets amplified by the weight added by the ontology. This makes those documents more similar and hence they cluster together. The differences between Figure 3.10 and 3.11, reveal that the choice of weight of the keywords in the categories affects the clusters. In fact the clusters shown in Figure 3.10 and 3.11 were obtained by experimentally varying the weights between 5 and 20, and monitoring the resulting cluster quality. 3.6 Summary The results presented in this chapter show that unlike the standard similarity measure, OBSM with a well chosen input ontology for a specific data set consistently reflects 55 the quality of the clustered structure of the data set. As shown for the data set collected from the Yahoo web site, we could see the improvement brought about by combining ontology-based information with keyword-based similarity measure. This improvement was emphasized in terms of the quality of the cluster of Detrol related web pages. OBSM can improve the quality of clusters, if it is combined with a good choice of domain specific categories. 56 Chapter 4 Synthetic Document Data Modeling and Generation 4.1 Introduction This chapter covers synthetic data generation for modeling documents. We investi- gate data generation in both standard Cartesian coordinate system and the (hyper- spherical) polar coordinate system. This investigation includes an analysis of the probability distribution of both Euclidean distance and angle between documents and/or centroids, as well as their variation with respect to the dimension of the data set. In Section 4.2, we describe our approach, including the parameters and tools used for data analysis. In Section 4.3, we investigate data generation using Cartesian coor- dinate system. In Section 4.4, we investigate data generation using polar coordinate system. Finally, we summarize our results in Section 4.5. 57 4.2 Approach As seen in Chapter 2, if the total number of distinct keywords in a whole collection of documents is t, then a document d,- is represented by the t-dimensional vector: __) d,- = (wfi, wig, . . . , wit). We generate synthetic data for modeling documents by representing a document by a feature vector in the t-dimensional vector space. Docu- ment data has some intrinsic structure, and document points are generally scattered around cluster centroids. A cluster centroid may be viewed as a representative point of the cluster. Synthetic data generation is controlled by the following parameters: 1. the total number of documents; 2. the data set dimension, i.e. the number of keywords; 3. the number of clusters, or categories; 4. the size (number of documents) of clusters; 5. the centers of clusters, or centroids; 6. the average scatter around cluster centroids, or cluster compactness; 7. 
the inter-cluster overlap, or cluster isolation.

The average scatter around a cluster centroid measures the cluster compactness, and is defined by the interval within which data points are scattered around the cluster centroid. This is sometimes represented by the cluster radius [82]. The inter-cluster isolation indicates how well the clusters are isolated from each other, and may be defined by the average number of documents that fall closer to a cluster centroid other than their own. In the following section, we formally define the above parameters.

4.2.1 Definitions

In the following, we define the centroid and radius of a cluster, as well as the intra-cluster and inter-cluster average distances. Let us assume that we have a data set with the following parameters: (i) the dimension is t, (ii) the total number of points is N, (iii) the number of clusters is K, (iv) the number of documents per cluster c_k is n_{c_k}. If a cluster c_k has n_{c_k} t-dimensional data points \{\vec{x_i}\} (i = 1, 2, ..., n_{c_k}), then the centroid \vec{C}, the radius R, and the intra-cluster and inter-cluster average distances are defined respectively as follows:

\vec{C} = \frac{1}{n_{c_k}} \sum_{i=1}^{n_{c_k}} \vec{x_i} \qquad (4.1)

R = \frac{1}{n_{c_k}} \sum_{i=1}^{n_{c_k}} \| \vec{x_i} - \vec{C} \| \qquad (4.2)

d_{intra}(c_k) = \frac{1}{n_{c_k} (n_{c_k} - 1)} \sum_{i=1}^{n_{c_k}} \sum_{j=1, j \neq i}^{n_{c_k}} \| \vec{x_i} - \vec{x_j} \| \qquad (4.3)

d_{inter}(c_1, c_2) = \frac{1}{n_{c_1} n_{c_2}} \sum_{i=1}^{n_{c_1}} \sum_{j=1}^{n_{c_2}} \| \vec{x_i} - \vec{y_j} \| \qquad (4.4)

where \{\vec{x_i}\} and \{\vec{y_j}\} denote the points of clusters c_1 and c_2 in equation (4.4), and \| \cdot \| is the Euclidean norm.

4.2.2 Distribution Analysis of Random Variables

In general, the distribution of a random variable is represented by two (related) functions. The probability distribution function (PDF) of a random variable X is

F_X(x) = P[X \le x] \qquad (4.5)

One main property of the PDF is that F_X(x) is a nondecreasing function of x, with range [0, 1]. The probability density function (pdf) of a random variable X is

f_X(x) = \frac{dF_X(x)}{dx} \qquad (4.6)

One main property of the pdf is that

\int f_X(x) \, dx = 1 \qquad (4.7)

The two distributions used in this chapter are the uniform (U) and normal (N) distributions, with probability density functions:

f_U(x) = \frac{1}{b - a} \quad \text{for } a < x < b \qquad (4.8)

f_N(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \qquad (4.9)

Normal distributions are characterized by the mean \mu and the standard deviation \sigma. They all share a common property, the 68-95-99.7% empirical rule: 68% of the observations fall within 1 standard deviation of the mean, 95% fall within 2 standard deviations, and 99.7% fall within 3 standard deviations.

4.3 Cartesian Coordinate System

We first investigated the generation of synthetic data in the standard Cartesian coordinate system. In this vector space, documents are represented by t-dimensional vectors. Every dimension represents an index term or keyword, where t is the total number of distinct index terms in the document data set. The vector components are the keyword weights for the corresponding document.

4.3.1 Distribution of Euclidean Distance for Uniform Random Data

We first investigated the distribution of the Euclidean distance between two random vectors, representing cluster centroids, for various dimensions. Since we could not assume a priori knowledge about the distribution of cluster centroids, we treated the Cartesian coordinates (weights) as uniform random variables. We ran our generation program 1000 times with various seeds, and monitored the distance distribution. Figures 4.1(a) and (b) show the distribution of the Euclidean distance between two cluster centroids for dimension 2 and dimension 5000, respectively. In the case of 2 dimensions, the distance has an average of 0.516 and a standard deviation of 0.25; the plot shows that it is hardly a normal distribution. However, in the case of 5000 dimensions, the plot shows that it is a normal distribution with average 28.87 and standard deviation 0.23.
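A small Monte Carlo sketch of this experiment is given below. It is illustrative only, not the generation program used in this chapter, and the function name centroid_distance_samples is an assumption.

```python
import math
import random

def centroid_distance_samples(dim, samples=1000, seed=0):
    """Mean and standard deviation of the Euclidean distance between two
    centroids whose coordinates are drawn uniformly from [0, 1]."""
    rng = random.Random(seed)
    dists = []
    for _ in range(samples):
        x = [rng.random() for _ in range(dim)]
        y = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))))
    mean = sum(dists) / samples
    var = sum((d - mean) ** 2 for d in dists) / samples
    return mean, math.sqrt(var)

print(centroid_distance_samples(2))      # roughly (0.52, 0.25)
print(centroid_distance_samples(5000))   # roughly (28.9, 0.23)
```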
Figure 4.1: Distribution of distance between centroids (normal probability plots; (a) dimension = 2, (b) dimension = 5000; 2 centroids, 1000 samples)

4.3.2 Distribution of Angle for Uniformly Distributed Random Data

Using the same data set as in the previous section, we investigated the distribution of the angle between documents or cluster centroids, for various dimensions. Other studies [63, 65] have suggested some properties of the angle in high dimension in the context of nearest neighbor queries. Figures 4.2(a) and (b) show the distribution of the angle between centroids for the same data set as Figures 4.1(a) and (b). In the case of 2 dimensions, the angle has an average of 27.3 degrees and a standard deviation of 19.0; the plot shows that it is hardly a normal distribution. However, in the case of 5000 dimensions, the plot shows that it is a normal distribution with average 41.4 and standard deviation 0.36.

Figure 4.2: Distribution of angle between centroids (normal probability plots; (a) dimension = 2, (b) dimension = 5000; 2 centroids, 1000 samples)

4.3.3 Average Distance and Angle, and Deviation as Functions of Dimension

We extended the previous experiments to multiple data sets of various dimensions. Figures 4.3(a) and (b) show the plots of the average Euclidean distance and angle between two cluster centroids versus dimension (number of terms), the components of the centroids being randomly scattered over the interval [0, 1] following a uniform distribution. While the Euclidean distance fits the \sqrt{t/6} function, the angle converges asymptotically towards 45 degrees as the dimension goes to infinity. Those figures also show the standard deviation as a function of dimension. The data plotted in those figures is described in Tables 4.1 and 4.2.

Figure 4.3: Average distance between cluster centroids ((a) Euclidean distance, (b) angle; average and standard deviation; 2 clusters, 1000 samples)
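The dimension sweep underlying Figure 4.3(b) can be reproduced with a short script like the one below (an illustrative sketch; the sample counts and function name are assumptions).

```python
import math
import random

def mean_angle_degrees(dim, samples=500, seed=1):
    """Empirical mean angle between pairs of vectors uniform on [0, 1]^dim."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = [rng.random() for _ in range(dim)]
        y = [rng.random() for _ in range(dim)]
        dot = sum(a * b for a, b in zip(x, y))
        nx = math.sqrt(sum(a * a for a in x))
        ny = math.sqrt(sum(b * b for b in y))
        cos_angle = max(-1.0, min(1.0, dot / (nx * ny)))   # clamp rounding noise
        total += math.degrees(math.acos(cos_angle))
    return total / samples

for d in (2, 10, 100, 1000):
    # the mean angle grows with dimension and levels off above 40 degrees (compare Table 4.2)
    print(d, round(mean_angle_degrees(d), 1))
```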
Table 4.1 shows the distance values for dimensions up to 5000, for the case of two clusters, with 1000 samples. Table 4.2 shows the angle values for dimensions up to 5000, for the case of two clusters, with 1000 samples.

Dimension   Euclidean distance   Deviation
2           0.5165               0.2455
3           0.6616               0.2447
4           0.7787               0.2446
5           0.8779               0.2528
6           0.9655               0.2499
7           1.0490               0.2461
8           1.1299               0.2478
9           1.2037               0.2552
10          1.2673               0.2478
20          1.8054               0.2427
50          2.8429               0.2444
100         4.0802               0.2331
200         5.7660               0.2383
500         9.1180               0.2354
1000        12.9166              0.2510
2000        18.2539              0.2479
5000        28.8672              0.2335
Table 4.1: Distance between cluster centroids (2 clusters, 1000 samples)

Dimension   Angle      Deviation
2           27.2858    19.0131
3           34.2791    17.0340
4           36.4735    14.1604
5           37.4477    12.9496
6           37.9835    11.9038
7           38.6798    10.6275
8           39.2921    9.9100
9           39.6707    9.5699
10          39.6838    8.8548
20          40.5166    6.1592
50          40.6047    3.9594
100         41.3060    2.6153
200         41.3147    1.9324
500         41.3683    1.1768
1000        41.4426    0.8923
2000        41.4035    0.6100
5000        41.4151    0.3666
Table 4.2: Angle between cluster centroids (2 clusters, 1000 samples)

4.3.4 Distribution of Intra and Inter-Cluster Distances

We then investigated the distribution of intra- and inter-cluster distances for clustered data.

Dimension: 1000
Number of clusters: 20
Number of documents: 2000
Radius: 2
Table 4.3: Summary of data set used to analyze intra and inter-cluster distance

Figure 4.4: PDF of intra and inter-cluster distances (distance on the horizontal axis)

For the data set summarized in Table 4.3, Figure 4.4 shows the PDF of the intra-cluster distance for cluster 1, and the PDFs of the inter-cluster distances between cluster 1 and all other clusters. Table 4.4 shows the corresponding averages and deviations.

Cluster   Distance from centroid 1   Average intra/inter-cluster distance   Deviation
1         0                          1.0541 (intra-cluster)                 0.548814
2         13.117636                  13.0601                                0.060046
3         12.804267                  12.7554                                0.048568
4         12.985431                  12.9205                                0.056116
5         12.758499                  12.6982                                0.061719
6         13.141036                  13.0786                                0.058101
7         12.809853                  12.7614                                0.050714
8         13.454486                  13.3826                                0.068648
9         13.249435                  13.1818                                0.064184
10        12.640610                  12.5795                                0.057982
11        12.824699                  12.7610                                0.059478
12        12.891011                  12.8274                                0.059240
13        13.233682                  13.1606                                0.066985
14        12.674950                  12.6179                                0.059596
15        13.328342                  13.2672                                0.063191
16        12.795399                  12.7443                                0.049922
17        13.467176                  13.3955                                0.064418
18        12.760791                  12.6972                                0.061297
19        12.996210                  12.9340                                0.059926
20        13.254703                  13.1969                                0.063341
Table 4.4: Intra and inter-cluster average distance and deviation

The first row of Table 4.4 shows the intra-cluster average distance and deviation for cluster 1. The remaining rows show the distances between the centroid of cluster 1 and all other centroids, as well as the inter-cluster average distances and their deviations. The fact that all the inter-cluster average distances are in the range of 12-13, which is much greater than the intra-cluster average distance (1.054), indicates that cluster 1 is well isolated from the other clusters. The standard deviation (essentially the steepness of the PDF around the average) characterizes the compactness of cluster 1.

4.3.5 Expected Value of Euclidean Distance

In this section, we derive the expected value of the Euclidean distance between two points in t-dimensional space, with components uniformly distributed over the interval [0, 1]. Let us suppose that X_i and Y_i are two independent random variables, uniformly distributed over the interval [0, 1].
The probability distribution function (PDF) of X_i is F_{X_i}(x) = x, for 0 < x \le 1 (and similarly for F_{Y_i}). The probability density function (pdf) of X_i is f_{X_i}(x) = \frac{dF_{X_i}(x)}{dx} = 1, for 0 < x < 1 (and similarly for f_{Y_i}). Then f_{X_i Y_i}(x, y) = f_{X_i}(x) f_{Y_i}(y) = 1, for 0 < x, y < 1, and

E(dis(X, Y)) = E\left( \left( \sum_{i=1}^{t} (X_i - Y_i)^2 \right)^{\frac{1}{2}} \right) \approx \left( \sum_{i=1}^{t} \int\!\!\!\int (x - y)^2 f_{X_i Y_i}(x, y) \, dx \, dy \right)^{\frac{1}{2}} = \left( \sum_{i=1}^{t} \int_0^1\!\!\int_0^1 (x - y)^2 \, dx \, dy \right)^{\frac{1}{2}} = \left( \frac{t}{6} \right)^{\frac{1}{2}} \qquad (4.10)

4.3.6 Distribution of Intra-Cluster Angles

For a cluster of documents scattered uniformly around the cluster centroid, we determine the distribution of intra-cluster angles.

Figure 4.5: PDF of intra-cluster angles (angle in radians on the horizontal axis)

Figure 4.5 shows the probability distribution function (dark line) of the intra-cluster angle, for a data set with 1000 keywords, 2 clusters, and 1000 documents (length = 1.00, radius = 0.145). It also shows the theoretical normal (Gaussian) distribution function with average 0.08 and deviation 0.0216 (gray line). The intra-cluster hyper-angles are distributed following a normal distribution. This type of distribution is similar to that of the intra-cluster distance shown in Figure 4.4. This shows that the generation of synthetic data using the standard Cartesian coordinate system yields comparable results in terms of the distribution of Euclidean distance and angle based similarities.

4.4 Polar Coordinate System

Since, as we increase the dimension, the expected angle between uniformly distributed documents asymptotically converges to a constant, it is not possible to vary document cluster separation. So we decided to use the (hyper-spherical) polar coordinate system instead of the Cartesian coordinate system to generate our synthetic clustered data set. We also decided to use the normal (Gaussian) distribution with various standard deviations to scatter documents within clusters. This allows for using different cluster spread values around centroids to specify different cluster compactness. Because documents belonging to the same cluster have similar angles, it is easier to visualize clusters using polar instead of Euclidean coordinates. In the polar coordinate system, we can specify documents belonging to the same cluster by giving them the same mean angle value. The separation between documents in the same cluster is controlled by the standard deviation around the centroid polar coordinates. We also use angle-based overlap to specify cluster isolation. It is worth mentioning that using the hyper-spherical polar coordinate system to generate our synthetic clustered data set is intuitively more appropriate for modeling documents, since we are using an angle-based similarity measure.

4.4.1 Polar vs. Cartesian Coordinates in t-dimensional Space

Since documents are characterized by keywords, to represent documents by polar coordinates we needed to derive the polar angles that represent documents as functions of the keywords. In 2-dimensional space, for 0 \le r < +\infty and 0 \le \theta \le 2\pi:

x = r \cos\theta, \qquad y = r \sin\theta \qquad (4.11)

In 3-dimensional space, for 0 \le r < +\infty, 0 \le \theta \le 2\pi, and 0 \le \psi \le \pi:

x = r \sin\psi \cos\theta, \qquad y = r \sin\psi \sin\theta, \qquad z = r \cos\psi \qquad (4.12)

Figure 4.6 shows the 3-dimensional (spherical) polar coordinate system. In general, in a t-dimensional system, where k_1, k_2, ..., k_t are the dimensions representing the keywords, a document d_i in a data set with t keywords will be represented in t-dimensional space by a point with t - 1 angles and the length |d_i|. The relationship between the Cartesian coordinates w_{i1}, w_{i2}, ..., w_{it} and the t - 1 angles \theta_1, \theta_2, ..., \theta_{t-1} is:
w_{i1} = |d_i| \sin\theta_{t-1} \sin\theta_{t-2} \cdots \sin\theta_2 \cos\theta_1
w_{i2} = |d_i| \sin\theta_{t-1} \sin\theta_{t-2} \cdots \sin\theta_2 \sin\theta_1
w_{i3} = |d_i| \sin\theta_{t-1} \sin\theta_{t-2} \cdots \sin\theta_3 \cos\theta_2
\vdots
w_{i,t-1} = |d_i| \sin\theta_{t-1} \cos\theta_{t-2}
w_{it} = |d_i| \cos\theta_{t-1} \qquad (4.13)

where 0 \le \theta_1 \le \frac{\pi}{2} is the horizontal angle measured on the K_1 K_2 plane from the K_1 axis, and 0 \le \theta_2, \theta_3, ..., \theta_{t-1} \le \frac{\pi}{2} are the polar angles measured from the K_3, K_4, ..., K_t axes.

Figure 4.6: Spherical coordinate system

4.4.2 Clustered Document Data

Figure 4.7: Representation of clustered documents in polar coordinate system

Figure 4.7 shows the representation of clustered documents in the polar coordinate system. C_1, C_2, ... represent the centroids around which document data is clustered. We generate clustered data around hyper-cones representing centroids; within a delta angle around each centroid, documents can be generated in the vector space. In the (hyper-spherical) polar coordinate system, the parameters of the synthetic data are represented as follows: (1) the keywords are represented by the dimensions of the vector space; (2) the clusters are represented by angles; (3) the documents are represented by points in the vector space; (4) the similarity between documents is represented by angles measured from the origin.

Generation of Clustered Document Data

In the polar coordinate system, a random point can be generated as [r \cos\theta_k], 1 \le k \le t, where r and \theta_k are random polar coordinates, \theta_k is distributed over [0, \frac{\pi}{2}], and r and \theta_k are independent. We can also generate the point as [r \cos(\frac{\pi}{2} U_k)], where U_k is distributed over [0, 1].

Algorithm

The following is an algorithm for generating clustered document data, adapted from [34].

Step 1. Establish the cluster sizes. Repeat Steps 2 to 4 for all clusters.
Step 2. Select a cluster centroid as a uniformly distributed random variable.
Step 3. Scatter documents in the hyper-cone surrounding the average centroid angle: select the t - 1 polar coordinates (angles) of each document relative to the cluster centroid as normally distributed random variables, with the centroid components as averages and deviation (or spread) \sigma. If a document falls outside the domain, repeat Step 3.
Step 4. Repeat Steps 2 and 3 until all cluster overlap is within the maximum allowed.

We generated data sets with different dimensions and different average scatter angles around centroids. Figure 4.8 shows an example data set of 1,000 documents, scattered around 5 clusters in 2 dimensions with a 5 degree spread and a maximum allowed overlap of 30%. The five clusters are labeled respectively '.', '+', '*', 'x' and 'o'.

Figure 4.8: Clustered synthetic document data (2 dim, 5 deg. scatter)

Table 4.5 shows the overlap (in %) between the five clusters (average = 3.25%).

      1       2       3       4
2     8.75
3     0.00    0.00
4     0.00    0.00    8.50
5     0.75    0.00    0.75    13.75
Table 4.5: Inter-cluster overlap (2 dim, 5 deg. scatter)
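The sketch below shows one way the hyperspherical mapping of equation (4.13) and the scattering of Step 3 could be coded. The names polar_to_cartesian and generate_cluster, and the clamping of out-of-range angles (instead of redrawing, as the algorithm above prescribes), are illustrative choices, not the generator used for Figures 4.8 through 4.10.

```python
import math
import random

def polar_to_cartesian(length, angles):
    """Map a vector length and t-1 polar angles to t Cartesian coordinates (Eq. 4.13)."""
    t = len(angles) + 1
    sin_tail = [1.0] * (t + 1)                 # sin_tail[m] = prod_{j=m}^{t-1} sin(theta_j)
    for m in range(t - 1, 0, -1):
        sin_tail[m] = sin_tail[m + 1] * math.sin(angles[m - 1])
    coords = [0.0] * t
    coords[0] = length * sin_tail[2] * math.cos(angles[0])        # w_1
    coords[1] = length * sin_tail[2] * math.sin(angles[0])        # w_2
    for m in range(3, t + 1):                                     # w_3 .. w_t
        coords[m - 1] = length * sin_tail[m] * math.cos(angles[m - 2])
    return coords

def generate_cluster(centroid_angles, n_docs, spread_deg, length=1.0, seed=0):
    """Scatter documents around a centroid direction: each polar angle is drawn
    from a normal distribution centred on the centroid's angle (cf. Step 3)."""
    rng = random.Random(seed)
    spread = math.radians(spread_deg)
    docs = []
    for _ in range(n_docs):
        # Angles outside [0, pi/2] are clamped here for brevity; the algorithm redraws instead.
        angles = [min(max(rng.gauss(a, spread), 0.0), math.pi / 2)
                  for a in centroid_angles]
        docs.append(polar_to_cartesian(length, angles))
    return docs
```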
Then we increased the dimension while keeping the same average scatter angle. Figure 4.9 shows an example data set of 10,000 documents, scattered around 5 clusters in 3 dimensions with a 5 degree spread and a maximum allowed overlap of 30%.

Figure 4.9: Clustered synthetic document data (3 dim, 5 deg. scatter)

Here again, the five clusters are labeled respectively '.', '+', '*', 'x' and 'o'. Table 4.6 shows the overlap (in %) between the five clusters (average = 2.53%).

Table 4.6: Inter-cluster overlap (3 dim, 5 deg. scatter)

Then we increased the average scatter angle while keeping the same dimension. Figure 4.10 shows an example data set of 10,000 documents, scattered around 5 clusters in 3 dimensions with a 10 degree spread and a maximum allowed overlap of 30%.

Figure 4.10: Clustered synthetic document data (3 dim, 10 deg. scatter)

Table 4.7 shows the overlap (in %) between the five clusters (average = 4.36%).

      1       2       3       4
2     9.95
3     0.33    0.50
4     9.75    0.55    0.40
5     1.52    0.00    0.05    20.62
Table 4.7: Inter-cluster overlap (3 dim, 10 deg. scatter)

Then we increased the data set dimension further. Table 4.8 shows the overlap between the five clusters for dimension 100 (average = 4.06%).

      1       2       3       4
2     10.12
3     0.05    0.00
4     2.65    3.05    0.00
5     5.22    5.53    0.00    13.98
Table 4.8: Inter-cluster overlap (100 dim, 10 deg. scatter)

Table 4.9 shows the overlap between the five clusters for dimension 1000 (average = 3.31%).

      1       2       3       4
2     0.85
3     1.95    0.00
4     2.35    0.05    16.98
5     5.65    3.05    1.05    1.20
Table 4.9: Inter-cluster overlap (1000 dim, 10 deg. scatter)

The variation of inter-cluster overlap in Tables 4.6 and 4.7 shows that the overlap increases with the average scatter around cluster centroids. Tables 4.7, 4.8 and 4.9 show, for the same scatter angle, a slight decrease of overlap as the dimension increases. Therefore, the inter-cluster overlap can be controlled by the average scatter value around cluster centroids.

4.5 Summary

We first generated synthetic data using the standard Cartesian coordinate system, and showed comparable results in terms of the distribution of Euclidean distance and angle based similarities when using uniform random data. We then used the polar coordinate system to apply a normal distribution to model cluster compactness and isolation. This is unlike generating data in the Cartesian coordinate system, where controlling parameters for a desired distribution in terms of angles is not easily achievable through Cartesian parameters. The data sets that we generated in this chapter show that we can produce data with various parameters. Our data generation mechanism based on the polar coordinate system achieves easy control of cluster centroids, cluster compactness and cluster isolation. We specify cluster compactness by using a normal distribution whose standard deviation controls how documents are scattered around cluster centroids, and we specify cluster isolation by using inter-cluster angle-based overlap.

Chapter 5

Semantic Groups of Graph Data Model and Documents

5.1 Introduction

In previous chapters we have developed clustering algorithms based on ontologies. For web applications, many ontologies may have to be defined, and these ontologies are related to each other. The ontologies also evolve as the document database changes and grows. In order to apply the proposed clustering to a database of documents, a framework to represent clusters of documents is needed. For this, a graph data model that uses semantic groups to add semantics to documents and keywords is suggested. Semantic groups can be used in the management of multiple ontologies. They can be built using keywords based on output clusters. These semantic groups can then be used in input ontologies for subsequent clustering. This reuse of semantics incrementally improves the quality of clustering. Moreover, since clustering requires
This reuse of semantics incrementally improves the quality of clustering. Moreover, since clustering requires 81 document processing and a structure to hold post-processed document data, the graph data model can also be used as an underlying structure to represent the relationships between documents and keywords. In this chapter, we present our conceptual work on graph data model. Section 5.2 contains the motivation for proposing the use of semantic groups of graph data model. In Section 5.3 we present the concept of semantic groups in a Graph Data Model. In Section 5.3.1 we show how semantic groups enhance sharing by allowing private views. We describe how semantic groups are affected by the graph operations in Section 5.3.3. We present an application of our graph data model in Section 5.4. We conclude with some discussions of the graph data model and semantic groups in Section 5.5. 5.2 Motivation In this chapter, we present a graph data model, which focuses on defining views for sharing data while allowing context-sensitive updates on views. The fundamental basis of the approach is the concept of a semantic group [64] which has some similar- ities to the concept of views. A primary difference between the concept of semantic groups and the traditional concept of views is that views are defined on top of base relations while semantic groups are built as part of the data. For example, in this graph data model new data can be incorporated to an existing semantic group, with- out necessarily afi'ecting some other semantic groups. The semantic groups provide more keyword-based semantic meaning than the traditional keywords’ occurrences in 82 documents. This limits the scope of the query by the user in the presence of a very large number of documents. The concept of views is important in database systems for controlling accesses and sharing of data. 'IYaditional database systems are based on the concept of a schema, which defines the database using a model and facilitates querying of that database. Views are defined on top of base tables, and are implemented by query expressions. Any update to the base relations is reflected in the views. As a result, updates done within the context of a given view may affect other views. The focus of this chapter is to develop a new concept called “semantic group” for implementing views for sharing and querying in schema-less graph data model. Semantic groups allow updates to be made to private views. The proposed graph data model is suitable for modeling semi-structured data, such as documents. In the graph data model, nodes represent entities and links represent relationships between them. The notion of semantic groups has some similarities to that of views. The fundamental difference between semantic groups and traditional concept of views is that semantic groups are not query statements, but are part of the graph data itself. This allows us to make local updates easier. This approach is different from the object oriented approach, in which the relationships are defined between classes, and not necessarily between instances. These relationships should be standardized for all objects in a class, which may not be true in the case of semi-structured data. We introduce a two-layer graph structure: the basic layer that corresponds to the traditional graph data model, and a second layer that provides a mechanism for creating and updating views. The second layer implements the concept of semantic groups. 
83 The graph data model and the semantic group concept can be used as a framework to represent clusters of documents. The idea is to build semantic groups based on out- put clusters and relationships to most important keywords. These relationships can then be reused in input ontology for subsequent clustering. This reuse of semantics is intended to incrementally improve the quality of clustering. Also, since clustering requires document processing and a structure to hold post-processed document data, the graph data model can also be used as an underlying structure to represent the re- lationships between documents and keywords and the corresponding weights. Update of this graph based structure is made easier by the concept of semantic groups. 5.2.1 Cluster Representation and Reuse A semantic group would include an output cluster, encapsulated by a domain node, with links to the most relevant keywords. The order of relevance can be established by a statistical analysis of the documents in the clusters, based on keyword frequency within a document and the number of documents containing such keywords. Unlike dendrograms, this graph based representation of cluster keywords allows for overlap- ping categories. For example, in Figure 5.1, the semantic group .91 includes cluster cl and its most important keywords [01, kg, 103, and semantic group 32 includes 02 and [02, [03, 104. The links between keyword nodes can be based on the order of importance of keywords V with respect to the corresponding clusters. The most relevant keywords obtained from the output cluster representation can be used to build new categories, which can define an input ontology for subsequent clusterings. 84 I l I l ' I ' I ' l 82 I I l l , : : ' . . c1 : I c2 : l I . | l . : I | 1 , I | l . l ' l . I ' l . I ' I , l ' I I I l l , l I I l l . l l I , I ' l . l ' I . I ' ' I l l , I o ----l- -O- ....... I : ------- -o- ---- -c -. I l : I : I k1 k2. I k3 k4 : I l I , I ' I . I Figure 5.1: Cluster representation The semantic groups of the graph data model, used for the management of on- tologies, can handle the complexity of relationships between keywords, and the over- lapping nature of such a graph. Multiple semantic groups, which are related to each other, can be merged to form a more general category. Querying capabilities of se- mantic group can be used to select and separate specific categories of an ontology for analysis and /or update. 5.2.2 Update of Private Views Data sharing is important in database systems. Different parts of a large graph structure should be accessed and updated in different contexts. In the traditional approach, sharing is done through views. A traditional view is a dynamic subset of the database whose entities satisfy a specific logical condition. It is defined on top of 85 base relations. In our approach, we use the concept of semantic groups which are not defined on top of base tables. Here, all the data are created and updated in the context of semantic groups. Since a semantic group is not a full replication of the subgraph data, update is done directly to one’s semantic group without having an unwanted side effect on other semantic groups. This is different from the materialized views approach [25], where physical copies are used to improve performance. Our approach is also different from the object oriented approach, where relationships are between classes, and not necessarily between instances. 
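The following toy fragment illustrates the idea of private update under stated assumptions: a keyword node shared by two semantic groups keeps its group-specific attributes per group (conceptually, on a subnode), so a change made in the context of one group leaves the other group's view untouched. The dictionary layout and function name are hypothetical.

```python
# Illustrative only: keyword k3 is shared by semantic groups s1 and s2,
# but its weight is stored per group rather than on the shared node.
group_data = {
    "s1": {"k3": {"weight": 0.8}},   # k3 as seen from semantic group s1
    "s2": {"k3": {"weight": 0.2}},   # k3 as seen from semantic group s2
}

def update_in_group(group_data, group, node, key, value):
    """Update an attribute of a node in the context of a single semantic group."""
    group_data[group][node][key] = value

update_in_group(group_data, "s1", "k3", "weight", 0.9)
print(group_data["s2"]["k3"]["weight"])   # still 0.2: s2's private view is unaffected
```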
5.3 Semantic Groups in Graph Data Model The basic structure of 3 Graph Data Model, as presented in Chapter 2, is a graph where nodes correspond to entities, and labeled links represent the relationships be- tween these entities. In addition to the concept of domain also presented in Chapter 2, the semantic group concept provides a contextual semantics for the components in- side a domain. A semantic group is a set of components within a domain with a specific context. Thus, the same component in a domain can have different meaning in different contexts. In the example of Figure 5.2, the “102” node can be linked to “dl” node in the semantic group of “dl” , and be linked to “d3” node in the semantic group of “d3”. Note that dashed links represent the semantic groups, whereas the solid arrows represent the user defined links. In this example, there are two semantic groups. The semantic group 31 contains the nodes “111” and “dz”, and the links between them. And the 86 1 | \ I I d2 : sz : l ..o a." : l ,I1 r‘ 1 d1 ,’ l 1 ‘4 d3 : I 1" I ‘s ' ' r’ ' 4% , I I ' W3 ' : “’13 :w22 E : : w2 : w34 I I W33 ' 1 ‘WI l I I | | 1 1 . 1 I I | ' : 'Wll : ' W’ I l : . - : : ------- 4— ----J -o -7“ $ I l : | : I kl k2: I k3 k4 ; I I I : l I I I I I I I I I I I I I l I I I l I l I \ Figure 5.2: Example of semantic group semantic group 32 contains the nodes “dz” , “d3” , and the links between them. To capture the concept of semantic groups we extend our definition of graph-based database in Chapter 2 as follows: DB =< D, S, N, NT, NM, L, LT, LM >, where the new element 5' is the set of all semantic groups. A domain can now be defined as d = {c,-} U {3,}, where c,- is a component, and s,- is a semantic group within the domain. The other components of the graph-based database have similar meaning as in the previous definition. 5.3.1 Semantic Groups and Views One important feature of semantic groups is the querying of graph structures. The Select() command is used to query a database in GDM. 87 Select(inDomain, component, linkType, direction, depth, logicalOperator, outDomaz'n) The main idea of Select() is to follow links, with a specified type and direction, within a semantic group, starting from a given component. The depth of the links followed can be controlled by the parameter “depth” . Logical operators can be used for link specification. The link type, direction, depth and logical operator are optional. If a Select() command is applied to a domain with a unique semantic group and these optional parameters are not specified, then the output domain will contain the same elements as the input domain. Select() command acts differently when semantic groups interact. Consider Se- lect() with no optional parameters specified, and the input domain consists of several semantic groups. Unlike the case with a simple semantic group, the only components presented in the output domain are those which are in the same semantic groups as the starting component. That is how Select() command uses the notion of independent relations within the same component. ’-——------------ —————--—————--————-——-J Figure 5.3: Querying based on semantic groups 88 The example in Figure 5.3 illustrates the use of Select() command in a domain with multiple semantic groups. The command Select(d, n1,t, to, null, null, 011) will return the domain d1 with nodes n1, ng and 714, and the links connecting them. Note that 113 is not returned by the query because it is not in the same semantic group as the starting node 711. 
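The traversal behaviour of Select() can be sketched as follows. This is an illustrative re-implementation, not the GDM code; the tuple-based link representation and the exact group memberships in the example are assumptions chosen to mirror the description of Figure 5.3.

```python
def select(links, groups, start, link_type=None, depth=None):
    """Follow links from `start`, keeping only nodes that share a semantic group with it.

    links  : list of (source, target, type) tuples
    groups : dict mapping semantic-group name -> set of node names
    """
    start_groups = [members for members in groups.values() if start in members]
    reached = {start}
    frontier = [start]
    level = 0
    while frontier and (depth is None or level < depth):
        nxt = []
        for node in frontier:
            for src, dst, typ in links:
                if src != node or (link_type is not None and typ != link_type):
                    continue
                if any(dst in members for members in start_groups) and dst not in reached:
                    reached.add(dst)
                    nxt.append(dst)
        frontier = nxt
        level += 1
    return reached

# Example mirroring Figure 5.3: n3 lies in a different semantic group from n1,
# and n5 is connected by a link of type t' rather than t, so neither is returned.
links = [("n1", "n2", "t"), ("n2", "n4", "t"), ("n2", "n3", "t"), ("n2", "n5", "t'")]
groups = {"s1": {"n1", "n2", "n4", "n5"}, "s2": {"n2", "n3"}}
print(select(links, groups, "n1", link_type="t"))   # {'n1', 'n2', 'n4'}
```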
On the other hand the node n5 was not returned by the query because it is connected to 712 with a link of type t’ and not t. 5.3.2 Separation of Semantic Groups and Private Update The strength of this data model consists of the creation and querying of semantic groups that contain shared components while allowing update of private data. This section shows how one can separate a semantic group so that private update can be done. The Separate (inDomain, outDomain) command allows users to create independent semantic groups. It copies semantic groups from an input domain into an output domain specified by the user. These semantic groups will contain the same elements as the corresponding semantic groups in the input domain. Figure 5.4 illustrates the separate() command. We start with domain d1 of fig- ure 5.4(a). After the operation Separate(d1, d2) is applied on this domain d1, the output domain £12 with its semantic group 32 becomes independent from the existing structure. Thus, this semantic group can be updated, i.e new nodes and links can be added to semantic group 32, without affecting the domain all. This flexibility in updating private views is especially beneficial when nodes con- sist of complex objects because components are not duplicated, but only subnodes 89 a. H I E5 1 I I. I I l I I I I I I I I I I I I I I p-L--- ----—---- ----4. ----------------------- ’———_——-—---- -—-—----—-d \——-—------—-——’ Figure 5.4: Separating semantic groups corresponding to these nodes are created. The subnodes are used to implement the se- mantic groups. Nodes and links are associated to semantic groups through subnodes. Subnodes are hidden from users. 5.3.3 Effect of GDM Operations on Semantic Groups In Section 5.3.2 we saw how one can separate semantic groups to be able to make updates to private views. This separation is possible because while components are created in a graph, the system creates the corresponding semantic groups. In this section we show how semantic groups are also captured through other operations of GDM, such as “create node” and “create link”. A complete list of graph data model commands is included in Appendix B. 90 Semantic Groups and Newly Created Nodes Let’s consider a graph database DB =< D, S, N, NT, NM, L, LT, LM > . O. O- J I 1 Q Q 5 ~ E; = h :1 N I I 3 --------------' -——----——---——-- -------- Figure 5.5: Semantic groups and basic graph operations First we create a domain (I, and a node 111 within this domain. The system associates a semantic group 31 = {711} when the node 111 is created (see Figure 5.5(a)). Initially, nodes are created with no link and belong to separate semantic groups. For example, in Figure 5.5(b) there are two semantic groups, 31 = {111} and 32 = {112}. Nodes belong to the same semantic group when they are connected by links. For example, in Figure 5.5(c), after connecting the two nodes 111 and 112, the newly created semantic group is 312 = {111, n2, 1}. A more formal definition of linking two components is given in the next subsection. Effect of Linking Components in Multiple Semantic Groups The linking of two components that are members of multiple semantic groups is a more sophisticated operation than connecting components that belong to a single semantic group. 91 (a) (b) (C) Figure 5.6: Connected nodes Consider the example of Figure 5.6(a). The node n; is connected to 111 and 712 in respectively two different semantic groups. 
Now if we want to connect 113 to 114, we should be able to preserve the semantic groups associating m to either 111 or 112, as shown in Figure 5.6(b). We may decide later that a new node 115 linked to 113 be associated to the semantic group {121,113,114} and not {712,113,114}, as shown in Figure 5.6(c). Since a node can be shared in different contexts, all combinations of contexts should be preserved. A private update to this node can subsequently be done by separating the specific semantic group. A union of semantic groups associated with the nodes being connected defines the semantic groups of the output domain. Every semantic group associated with the first linked node must be joined with every other semantic group associated with the other node. More formally, consider a link I of type t created between nodes 111 and n2, re- spectively in domains (11 and d2. Moreover, suppose that 111 belongs to the semantic 92 groups 311, 312, and 112 to 321, 322, Then, the newly created link I is added to the set of links, i.e. L <— L U {I}, and the type t is added to the set LT of link types, (LT +— LT U {t}). The output domain is d = d1 U d2 U {I}. The set S of semantic groups in d is determined as follows: Let SI = {311, 312, ..}, $2 = {321,322, ...}, and S’ be the Cartesian product of SI and 32, i.e., S’ = {(311,321), (311,322), ....} Then S consists of the union of {l} with the sets in each pair of S’, i.e., S = {311 U 321 U {l}, 311 U 322 U {I}, ...}. d=d1Ud2U[l} 112 06 J (a) (b) Figure 5.7: Linking nodes in interacting semantic groups Figure 5.7 illustrates how two semantic groups 31 = {711, n3,113} and 32 = {112,113,123} are combined with 33 = {114,115,145} and s4 = {114,716,146}, when linking the two nodes 113 and 114 with link (34. This creates the four semantic groups: 313 = 31 U .93 U {134} = {711,713,113M41n511451134}, 314 = 81 U 84 U {134} = {711.713.113.714msJ461134}, 323 = 32 U .93 U {134} = {n2,n3,123,n4,n5,l45,l34} and 324 = 32 U 34 U {(34} = {712, "311231 ”4171611461134} 93 5.4 Semantic Groups and Documents Typical Information Retrieval keyword searches return a large number of documents. Browsing through the whole set of retrieved documents may be a tedious task. Even though most Information Retrieval systems provide a way to tune the search, the protocol varies with each system. Some present users with categories of documents, so that searches can be performed within a subset of documents. However, these categories are static; and may not be up—to—date all the time. Querying document data is a difficult task because this data does not necessarily have a well defined schema that characterizes its structure. Therefore, it does not lend itself naturally to standard data models, such as relational model. A graph data model is more apprOpriate for such data. We propose the use of a graph-based system that provides a general framework for modeling documents. This system specifically uses the semantic group concept to provide a view-based framework for modeling documents. It adds more semantics to documents. Links represent semantic relationships between keywords or between documents and keywords. For instance, two or more documents may have information about the same subject of interest. Thus, they can be related in a semantic group. The weight of a keyword with respect to a given document is context- sensitive, in that it depends on the semantic group to which the relationship between a keyword and a document belong to. 
A graph-based system allows documents and keywords to be shared among several categories, or semantic groups. Such a system may also allow for a search that returns a set of categories (con- cepts). These concepts may be different from topics in that they may not necessarily 94 be meaningful to the user. They are composed of a set of keywords with varying weights. These weights are context sensitive, in that the same keyword may carry different weights in the presence of two semantic groups. I I I I I I I I I I l I I I I I I I I I I I I I I I I I I I I I I I I I _ I 51 -------------------------------- B . E Figure 5.8: Context sensitivity using semantic groups To illustrate this, let’s consider Figure 5.8, where there are two semantic groups 51 and S2 defined as follows: 5'1 = {011,612, €13,011, (12), (dz, d3), ’61, ’62, ’93, 154,061, ’92), (k2, k3)1(k31 (C4)}, and 52 = {613,614, d5, 40: (43,41), ((14,615), (‘15, d6): ’53, [C4, ’05, ’90, ’97, (’53, ’64). (’64, ’65), ((651,913), (’56: ’57)}, where d,- refers to the ith document, It; to the ith keyword, (d,-, dj) to the link between 11,- and d,, and (kg, kj) to the link between 1:,- and kj. In Figure 5.8 dashed links represent the semantic groups, whereas the solid arrows represent the user defined links. Each link has an associated weight. For example, wu refers to the weight of keyword 1:1 in document d1, and 111’ll refers to the weight of keyword In with respect to semantic group SI. Note that d3 is shared among the 95 two semantic groups 81 and 32. 10;, and 104 are also shared among the two semantic groups. The weight of ’03 in SI, 11213, is carried in the subnode of 103 in a sublink (102,103), and therefore is unseen in 52. Whereas the weight of kg in $2 is carried in the other subnode of 113 in sublink (’03, 104), and therefore is unseen in SI. The update of weights of keywords with respect to a category is done in the context of a specific semantic group. Thus, changes relative to a semantic group, such as the weight of kg with respect to group SI, do not affect how 103 is important to 32. This has the advantage of having more stable categories. This is also important when a new document is inserted into the system. If keywords in this document are only relevant to one semantic group for example, then we do not have to worry about computing the weights of keywords that are not part of this semantic group. This makes the process more dynamic. 5.5 Summary In this chapter, we presented the concept of semantic groups in a Graph Data Model, which provides a flexible mechanism for defining views while allowing update and incorporation of data to views. We proposed a mechanism by which semantic groups can be used in connection with document categorizing application. Since the weight of keywords depends on which category it is associated with, semantic groups can be used to capture this context-sensitivity. We showed that the semantic group concept can be used in situations where there is context sensitive information (in the above example, keyword weights within a given category), to make update more dynamic. 96 Chapter 6 Summary and Future Work 6.1 Summary In this chapter we summarize the results achieved in this research work. We have defined an ontology based similarity measure (OBSM) that improves the quality of document clustering. Unlike traditional keyword based cosine vector similarity mea- sure (CVSM), our measure takes into account an input ontology of keywords. 
5.5 Summary

In this chapter, we presented the concept of semantic groups in a graph data model, which provides a flexible mechanism for defining views while allowing update and incorporation of data into views. We proposed a mechanism by which semantic groups can be used in connection with a document categorization application. Since the weight of a keyword depends on which category it is associated with, semantic groups can be used to capture this context sensitivity. We showed that the semantic group concept can be used in situations where there is context-sensitive information (in the above example, keyword weights within a given category) to make updates more dynamic.

Chapter 6

Summary and Future Work

6.1 Summary

In this chapter we summarize the results achieved in this research work. We have defined an ontology-based similarity measure (OBSM) that improves the quality of document clustering. Unlike the traditional keyword-based cosine vector similarity measure (CVSM), our measure takes into account an input ontology of keywords. The weight of a keyword is no longer only its frequency in a document, but a function of its frequency and a context-based coefficient. We evaluated the performance of OBSM first using a set of documents with pre-defined clusters, then using actual web pages.

We generated clustered data, where the clustered structure is measured by overlap in terms of common keywords. We then computed the dissimilarity between these clusters in terms of both CVSM and OBSM. Unlike the CVSM similarity, the ratio of intra-cluster to inter-cluster OBSM similarity is a non-decreasing function of the intra-category overlap. This means that for two data sets, one exhibiting more compact clusters than the other, OBSM will determine that the first data set has the better clustered structure. OBSM also gave a slightly larger ratio of intra- to inter-cluster similarity than CVSM, which means better cluster isolation.

We gathered all documents from the Yahoo medication index and mixed in all documents obtained from the Detrol web site. This resulted in a database of preprocessed documents of dimension 9,397. We generated clustering dendrograms using complete-link clustering for both CVSM and OBSM with varying weights. The results show that, unlike with CVSM, with a proper choice of keyword weights in the categories all Yahoo Detrol-related web pages and Detrol.com documents cluster together when using OBSM.

We developed a system to generate synthetic data using spherical coordinates in high-dimensional space to allow control of cluster compactness and separation. This is unlike generating data in the Cartesian coordinate system, where controlling parameters for a desired distribution in terms of angles is not easily achievable. We used normal distributions with various standard deviations to scatter documents within clusters around centroid polar coordinates, and we used angle-based overlap to specify cluster isolation. This allows control of inter-cluster overlap.

We proposed the use of the concept of semantic groups in a graph data model and presented a methodology to integrate documents, categories, and the corresponding keyword weights by mapping categories to semantic groups. The semantic group concept can be used in situations where there is context-sensitive information, to make updates more dynamic.
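As an illustration of the kind of generation scheme summarized above, here is a minimal Python sketch (assuming the standard hyperspherical-to-Cartesian conversion; the function, parameter names, and NumPy usage are illustrative assumptions rather than the dissertation's actual generator) that scatters points around a centroid direction by perturbing its angles with a normal distribution:

```python
import numpy as np

def generate_cluster(centroid_angles, angle_std, n_docs, radius=1.0, rng=None):
    """Scatter synthetic points around a cluster centroid given by its
    spherical-coordinate angles in t-dimensional space.

    centroid_angles : sequence of t-1 angles defining the centroid direction.
    angle_std       : standard deviation of the normal perturbation applied to
                      each angle; a smaller value gives a more compact cluster.
    """
    rng = np.random.default_rng() if rng is None else rng
    centroid_angles = np.asarray(centroid_angles, dtype=float)
    t = centroid_angles.size + 1
    # Perturb each angle of the centroid with a normal distribution.
    angles = rng.normal(loc=centroid_angles, scale=angle_std,
                        size=(n_docs, t - 1))
    # Convert the spherical coordinates back to Cartesian points.
    points = np.empty((n_docs, t))
    for i, a in enumerate(angles):
        remainder = radius
        for j in range(t - 1):
            points[i, j] = remainder * np.cos(a[j])
            remainder *= np.sin(a[j])
        points[i, t - 1] = remainder
    return points

# Example: two clusters around the same centroid direction in 50 dimensions,
# one compact and one loose, controlled purely by the angular deviation.
centroid = np.full(49, np.pi / 4)
compact = generate_cluster(centroid, angle_std=0.05, n_docs=100)
loose = generate_cluster(centroid, angle_std=0.30, n_docs=100)
```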
6.2 Future Work

As the need for querying data over the Web grows, users are becoming more demanding and the volume of data on the Web increases. Any improvement in the similarity measure between documents will help improve search engine performance. In this work we assumed that an ontology existed for the similarity measure to use. In order for an ontology-based similarity measure to gain wide use, a method needs to be developed to build and/or identify this ontology, preferably with minimal human intervention.

As seen in Chapter 3, our tests on a real web data set showed that the choice of category weights does affect the quality of the clusters produced by the ontology-based similarity measure. This effect warrants further analysis to understand how to choose the best weights. For example, our method of synthetic data generation in the polar coordinate system can be applied to investigate the relation between data characteristics and the performance of category weights.

The ultimate test for the concept of semantic groups in a graph data model is its implementation in an integrated information retrieval system that uses our ontology-based similarity measure as part of its indexing scheme.

APPENDICES

Appendix A

Some Web Pages from Data Set

In this appendix, we include snapshots of some of the web pages that were used in the real data set. Figure A.1 shows Yahoo's "Drugs and Medications" category page.¹ Figure A.2 shows a part of the "D" medication index² of Yahoo. Figure A.3 shows the Detrol description page.³

¹ http://dir.yahoo.com/Health/Pharmacy/DrugsandMedications/
² http://health.yahoo.com/health/pdr.drugs/d.html
³ http://health.yahoo.com/health/pdr.drugs/0908/0.html

Figure A.1: Yahoo's "Drugs and Medications" category page

Figure A.2: Part of the "D" medication index
Figure A.3: Detrol description page

Appendix B

Commands of Graph Data Model

- CreateNode(inputDomain, type, name, outputDomain)
  This command creates a node with the given (unique) name and type in outputDomain. A new semantic group is created and associated with the node. All data from inputDomain is copied into outputDomain.

- CreateLink(inputDomain1, node1, inputDomain2, node2, type, direction, outputDomain)
  This command creates a link of the given type between the nodes node1 and node2 in outputDomain. A new semantic group will be created in outputDomain; it contains the Cartesian product of all semantic groups (with the specified nodes in the input domains) and the newly created link. The content of both input domains is copied into the output domain. Direction can be from or to; if the link is semantically symmetric, the option none can be used.

- DeleteNode(inputDomain, name, outputDomain)
  This command copies all data from inputDomain into outputDomain, then deletes the node from all semantic groups in outputDomain. Dangling links may be created.

- DeleteLink(inputDomain, node, node, type, direction, outputDomain)
  This command copies all data from inputDomain into outputDomain, then deletes the link from all applicable semantic groups in outputDomain. Isolated nodes may be created.

- Separate(inputDomain, outputDomain)
  This command creates copies of all semantic groups from inputDomain in outputDomain. It does not copy the content of the input domains. The input and output domains for this command must be different.

- SeparateComponent(inputDomain, component, outputDomain)
  This command creates a new semantic group for the specified component in outputDomain. It is similar to the Separate command; however, since this command proved very useful for users of the system, it was included in the command set.
- Select(inputDomain, component, linkType, direction, depth, logicalOperator, outputDomain)
  This command selects parts (or all) of the semantic groups that contain the specified elements from the input domains and puts them into outputDomain. It is essentially a generalized transitive closure with respect to the select parameters. Starting from the initial component, all components connected by links of the specified types, directions, and depth (i.e., the number of links from the initial node), along with the corresponding semantic groups, are selected into outputDomain. For the link specification, logical operators ("and", "or", "not") can be used. Special reserved keywords indicate two boundary situations: when no link type should be considered, the keyword "none" is used; "all" indicates that all link types will be followed. Unlike the Separate command, Select provides filtering only, i.e., components that satisfy some conditions are put into the output domain, whereas Separate creates new semantic groups, different from the initial groups, although they contain the same components. (A sketch of this traversal appears after this list.)

- Union({inputDomains}, outputDomain)
  This command takes the union of all semantic groups in all input domains and places it into outputDomain. If a component exists in multiple input domains, it is presented in outputDomain only once. If a component belongs to multiple semantic groups, all of those groups are presented in outputDomain.

- Intersection({inputDomains}, outputDomain)
  This command takes the intersection of the sets of components in the input domains, retrieves all semantic groups that correspond to those components, and places them in outputDomain. All components that exist in all input domains are presented in outputDomain, even if they belong to different semantic groups.

- CreateDomain(inputDomain, domainName, outputDomain)
  This command creates an empty domain domainName in outputDomain. The output domain also contains all components in inputDomain.

- DeleteDomain(inputDomain, domainName, outputDomain)
  This command deletes the domain domainName and all components that belong only to this domain. The remaining components are copied to outputDomain.
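The following minimal sketch (the graph representation, function name, and example identifiers are assumptions for illustration, not the system's actual code) shows one way the depth-bounded, type-filtered traversal behind the Select command can be realized:

```python
from collections import deque

def select(graph, start, link_types, max_depth):
    """Depth-bounded, type-filtered reachability, approximating Select.

    graph      : dict mapping a component to a list of (link_type, neighbour).
    start      : the initial component.
    link_types : set of link types to follow, or the keyword "all".
    max_depth  : maximum number of links to follow from the start component.
    Returns the set of components reachable under these constraints.
    """
    selected = {start}
    frontier = deque([(start, 0)])
    while frontier:
        component, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for link_type, neighbour in graph.get(component, []):
            if link_types != "all" and link_type not in link_types:
                continue
            if neighbour not in selected:
                selected.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return selected

# Example: follow only "cites" links up to two hops from document d1.
graph = {
    "d1": [("cites", "d2"), ("about", "k1")],
    "d2": [("cites", "d3")],
    "d3": [("cites", "d4")],
}
print(select(graph, "d1", {"cites"}, max_depth=2))   # {'d1', 'd2', 'd3'}
```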