ﬂﬂﬂmm.
._ z

in
.5” ts.
snub". 9
Eu... .2.

I
\1
:h i nu
.15

,.
mm

.1 ‘L
, ‘ .1... £1:
.uh’l.xxhﬂb, r

: h.
.14 5. hmﬁ .
ILﬂlnlwmri..q n
. .al {:2

{inns grit!
x! :
.I‘gs “r
1.... Eat.

I“:

z...
vauaawatn A.
.- (351?.

 

.411: r! ix

 

 

 

”)7 LIBRARY
{gt-1H 00 Michs-ve-z. . State
“ University

 

 

 

This is to certify that the
dissertation entitled

DIAS: DISTRIBUTED INTEREST-BASED ADAPTIVE
SEARCH FOR P2P NETWORKS

presented by

Ning Liu

 

has been accepted towards fulfillment
of the requirements for the

Master degree in Department of Computer
Science and Engineering

 

 

w./(/\‘ KL.“
Major Professor’s Signature

9/7/2003

Date

 

 

MSU is an Afﬁrmative Action/Equal Opportunity Institution

 

 

PLACE IN RETURN Box to remove this checkout from your record.
TO AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE DATE DUE DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

6’01 cJCIRC/DateDuepGS—sz

DIAS: DISTRIBUTED INTEREST-BASEDADAPTIVE SEARCH
FOR P2P NETWORKS

By

Ning Liu

A THESIS

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

MASTER OF SCIENCE
Department of Computer Science and Engineering

2003

ABSTRACT

DIAS: DISTRIBUTED INTEREST-BASED ADAPTIVE SEARCH FOR
P2P NETWORKS

By

Ning Liu

Content search in decentralized P2P networks is a challenging problem. Gnutella, a
widely used P2P system, employs the query ﬂooding approach. Although this approach is
simple and robust, its efﬁciency is fairly low.

In this thesis, we propose an efﬁcient P2P information retrieval system DIAS that
supports state-of-the-art content search in the semantic layer. DIAS avoids the inefﬁcient
random flooding used in Gnutella by using an interest-based search scheme. Peers in a
DIAS system sharing the common interest set are call “buddies”. Each DIAS peer creates
and updates its buddy list by computing the degree of interest similarity between its
neighboring peer and itself periodically. Different interest groups are generated and main-
tained according to the buddy-relationship between any two peers. DIAS peers will issue
and forward queries mostly within the interest groups associated with them.

Our simulation result veriﬁes that DIAS could achieve satisfactory search efﬁciency
without producing too more trafﬁc overhead, while retaining the simplicity and fairness
of Gnutella. The implementation of DIAS on top of JXT A shows that it is a feasible sys-

tem.

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Lionel M. Ni, for his continuous support and
help throughout the research project. It was deﬁnitely my great pleasure to conduct this
interesting and challenging research under his guidance. His broad and profound knowl-
edge and his effective instruction have given me a great help.

I am also very grateful to my co-advisor Dr. Li Xiao, my academic committee mem-
bers, Dr. Abdol H. Esfahanian and Dr. Matt W. Mutka, for their valuable advice and in-
spiring comments.

Many of my colleagues have contributions to my dissertation. I would like to thank
Zhenyun Zhuang, Feng Zhu, and Pei Zheng for their insightful comments and sugges-

tions.

iii

TABLE OF CONTENTS

List of Tables ................................................................................................................ vii
List of Figures ............................................................................................................. viii
Chapter 1 Introduction .................................................................................................... 1
1.1 Motivation ............................................................................................................ 1
1.2 Problem Statement ............................................................................................... 3
1.3 Deﬁnition of Terms .............................................................................................. 3
1.3.1 P2P Networks ................................................................................................ 3
1.3.2 Interest Base .................................................................................................. 5

1.4 DIAS ........................................................................................ . ............................ 6
1.5 Description of the Remaining Chapters ............................................................... 7
Chapter 2 Related Work ................................................................................................. 8
2.1 Overview of P2P Protocols .................................................................................. 8
2.1.1 Gnutella ......................................................................................................... 8
2.1.2 Napster ......................................................................................................... 11
2.1.3 Freenet ......................................................................................................... 11
2.1.4 Kazaa ........................................................................................................... 12
2.1.5 NeuroGrid .................................................................................................... 13

iv

2.1.6 Interest-Based Locality Using Shortcut ....................................................... 14

2.1.7 PeerSearch ................................................................................................... 14

2.2 Research Issues in Existing P2P Applications ................................................... 15
2.3 Information Retrieval ......................................................................................... 16
2.4 JXT A .................................................................................................................. 16
Chapter 3 DIAS Architecture ....................................................................................... 18
3.1 The Design of DIAS ........................................................................................... 18
3.1.1 Interest Database and Interest Index ............................................................ 19
3.1.2 Buddy List ................................................................................................... 21
3.1.3 Interest Matching ......................................................................................... 24
3.1.4 Search Strategy ............................................................................................ 36

3.2 Implementation of DIAS ......................................................... A ........................... 37
3.3 Simulation Setup ................................................................................................ 39
3.3.1 Parameters ................................................................................................... 39
3.3.2 Zipf Distribution .......................................................................................... 42

3.4 Simulation Results .............................................................................................. 43
3.4.1 DIAS vs. Gnutella ....................................................................................... 43
3.4.2 Different Conﬁgurations in DIAS of Polling Fashion ................................ 49
3.4.3 Different Conﬁgurations in DIAS of Reporting Fashion ............................ 54
Chapter 4 DIAS Implementation on JXTA .................................................................. 6O

4.1 JXTA Architecture ............................................................................................. 60

4.2 The Applications in DIAS .................................................................................. 62
4.2.1 Overview ............................................ 62
4.2.2 Chat Application .......................................................................................... 64
4.2.3 File Application ........................................................................................... 66

4.3 Implementation ................................................................................................... 68
4.3.1 Recommended System Requirements ......................................................... 68
4.3.2 Building Requirements ................................................................................ 68
4.3.3 Running DIAS ............................................................................................. 68

Chapter 5 Conclusions .................................................................................................. 76
5.1 Summary of the Work ........................................................................................ 76
5.2 Future Work ....................................................................................................... 77

BIBLIOGRAPHY ........................................................................................................ 79

vi

LIST OF TABLES

Table 1 An example of two—peer interesting matching ..................................................... 35

vii

LIST OF FIGURES

Figure l: A logical architecture of a P2P network .............................................................. 4
Figure 2: A new servent joining a P2P network .................................................................. 9
Figure 3: A servent discovers hosts in the network .......................................................... 10
Figure 4: A servent requests a ﬁle ..................................................................................... 10
Figure 5: An overview of DIAS ........................................................................................ 19
Figure 6: A typical information retrieval model ............................................................... 25
Figure 7: A two-dimension vector space .......................................................................... 28
Figure 8: A three-dimension vector space ........................................................................ 29
Figure 9: A simple structure of the vector space model .................................................... 30
Figure 10: Cosine coefﬁcient ............................................................................................ 33
Figure 11: An example of two-peer interesting matching ...................... H ........................... 35
Figure 12: Properties ﬁle ................................................................................................... 41

Figure 13: The 100000 searches in a 1000 node network, using Gnutella routing for a
Zipf distribution of documents over nodes. .............................................................. 45

Figure 14: The 100000 searches in a 1000 node network, using DIAS routing in polling
fashion for a Zipf distribution of documents over nodes. ......................................... 46

Figure 15: The 100000 searches in a 1000 node network, using DIAS routing in reporting

fashion for a Zipf distribution of documents over nodes. ......................................... 47
Figure 16: Gnutella vs. DIAS: Average Path Length ........................................................ 48
Figure 17: Gnutella vs. DIAS: trafﬁc overhead ................................................................ 49

Figure 18: Variation of average path length when BUDDY_UPDATE_LOOP varied in
the polling fashion ..................................................................................................... 50

Figure 19: Variation of trafﬁc overhead when BUDDY_UPDATE_LOOP varied in the
polling fashion ........................................................................................................... 51

viii

Figure 20: Variation of average path length when THRESHOLD varied in the polling

fashion ....................................................................................................................... 53
Figure 21: Variation of trafﬁc when THRESHOLD varied in the polling fashion ........... 54
Figure 22: Variation of average path length when NO_BUD_CHANGES varied in the

reporting fashion ....................................................................................................... 55
Figure 23: Variation of trafﬁc when NO_BUD_CHANGES varied in the reporting

fashion ....................................................................................................................... 56
Figure 24: Variation of average path length when NO_CONN_CHANGES varied in the

reporting fashion ....................................................................................................... 58
Figure 25: Variation of trafﬁc overhead when NO_CONN_CHANGES varied in the

Reporting Fashion ..................................................................................................... 59
Figure 26: JXTA application overview ............................................................................. 61
Figure 27: JXTA conﬁguration window ........................................................................... 69
Figure 28: Login window .................................................................................................. 70
Figure 29: Secure chat tab ................................................................................................. 71
Figure 30: The chat between two buddies ......................................................................... 72
Figure 31: Local content tab ................................................................... . .......................... 73
Figure 32: Search tab ......................................................................................................... 74
Figure 33: Maintain connection tab .................................................................................. 75

ix

CHAPTER 1 INTRODUCTION

1.1 Motivation

As the rapid growth of the Internet, P2P networks are gaining signiﬁcant popularity in
a quite short period of time. Centralized indexing, query ﬂooding, and hybrid are the
most widely used techniques in the present P2P systems to conduct content searching
from any given peer on the overlay. Centralized indexing is employed in Napster [l] sys-
tem, where peers automatically upload lists of available ﬁles to the central index, and
queries are answered using this index. Flooding-based approach is applied in Gnutella [1]
system, where queries are broadcasted over an overlay network connecting peers. Hybrid
architecture is employed in Kazaa [1] system, where some peers are elected as “super—
nodes” in order to index content available at peers in a nearby neighborhood and manage
queries. A

These approaches cannot provide deterministic guarantees because they do not make
use of peer’s knowledge. A knowledge-ware [3] system is a system in which informa-
tion/knowledge have been represented uniformly and well organized into the knowledge
base, and the resources can be wide-spread shared and coordinated used to provide on-
demand and intelligent service satisfying personal needs and expectations. As we know
that in searching the largely distributed data, the efﬁciency of a P2P Internet application
mostly depends on the scalability and versatility of its search mechanism [4]. The current

approaches are inefﬁcient not only because they will generate a large amount of network

trafﬁc for each query, but also because they do not utilize peer’s knowledge to improve
searching success rate.

In fact, there is an opportunity for many peers to identify a small set of neighbors that
have a higher probability to answer the queries from the current peers. Based on this idea,
many new systems have been developed to provide fast and reliable search and lower
trafﬁc overhead. PeerSearch [5] and NeuroGrid [6] are two of these systems that have
high efﬁciency, scalability, and reliability. PeerSearch is a P2P information retrieval sys-
tem that achieves both efﬁciency and determinism through an elegant combination of in-
dex placement and query routing [5]. NeuroGrid is an approach that decentralizes search
to ongoing network activity. In a NeuroGrid system, each successive search changes the
initiating peer’s knowledge that is about the relationship information between contents
and other peers in the network. And this peer will utilize the knowledge when it conducts
a search next times [6].

In this thesis, we propose an interest-based [7], semantic-layered peers clustering
scheme, DIAS, to improve searching efﬁciency without producing too more overall traf-
ﬁc. The design philosophy is motivated by the existing work of NeuroGrid, which retains
the primary characteristics of Gnutella network: simple, robust, and decentralized. We
ﬁnd that in the P2P networks, if there are some particular pieces of content residing in a
certain peer, then the queries issued by this peer are very likely to contain the keywords
which already exist in these pieces of content. Furthermore, if peer A is interested in a
special piece of content which is located in peer B, then it is quite possible that peer B
will also have other pieces of content which peer A may be interested in. At the same

time, it is very likely that peer B can also ﬁnd the pieces of content which it is interested

in located in peer A. Those peers who share the same interest can gather to compose a
group, and the queries issued by the peers in this group will be forwarded mostly within
this group. So the searches will succeed in a shorter time. Based on this scheme, we pro-

pose our Adaptive Interest-based File Sharing system: DIAS.
1.2 Problem Statement

As we have mentioned in the last section, many P2P protocols such as NeuroGrid use
knowledge base to improve search efﬁciency. We know that capturing the characteristics
of web users is an important task for the web site designers. By mining web users’ his-
torical access patterns, some demographics and behavioral characteristics of web users
could be determined. The navigation path of the web users, if available to the server, car-
ries valuable information [8]. Also in P2P networks, if we deploy peers’ historical search
behaviors, mine their interest characteristics from the information stored in their local and
contained in the messages issued or forwarded by them, we can establish some special
relationships among the peers. Using these relationships, we can navigate the transfer of

messages to optimize the search path, so that improve the search efﬁciency.

1.3 Deﬁnition of Terms

1.3.1 P2P Networks

In the traditional client/server architectures, some computers called servers are dedi-
cated to serving the other computers, and some computers call clients are dedicated to
being served by the server computers or server software. In a P2P network, each peer has

equivalent capabilities and responsibilities. A P2P network is a type of network in which

3

two or more peers are connected and share resources, such as ﬁles and devices like scan-

ners and printers without going through a separate server computer or server software.

 

Figure 1: A logical architecture of a P2P network

Figure 1 shows the architecture of a typical P2P network. A P2P system is an overlay
built on top of the physical infrastructure of a network. It can be a network of very small
scale. For example, it can be a permanent infrastructure that consists of tens of computers
and devices in an ofﬁce or a small company. Or it can be much larger, such as a network
in which different protocols and applications set up direct or indirect relationships among
computers and devices over the Internet. Also a P2P network can be in an ad hoc envi-
ronment — where network computers and devices are linked by wireless or temporary
plug-in connections. In this case, the computers and devices are components of the net-
work only for a certain period of time, for example, the duration of a communication ses-

sion.

1.3.2 Interest Base

Among the users who are using the P2P network, it is very likely that many of them
may be interested in some particular category of content. For example, some users may
be very interested in “car” or “automobile”, some users may have special taste in
“movie”, and “Britney Spears” may be the favorite singer star of many young users. It is
quite possible that the users who have a special interest keep many pieces of content as-
sociated with this interest in their local computer, and they are very likely to search simi-
lar stuff in a certain period of time.

In some traditional P2P networks such as Gnutella, peers will ﬂood their queries to
some randomly detected neighbors no matter whether their interests are similar to or far
from each other. So the query may be propagated to a very long distance (up to the TH.
of a query), and the search efﬁciency cannot be satisfactory. The same case also occurs in
another famous approach, random walk [9], where a query message is forwarded to a
randomly chosen neighbor at each step until the object is found. We intent to design a
scheme, through which the users who have the same or similar interest in a P2P network
can be clustered into a group and connected to each other in a certain period of time.
Then according to our common sense, if a user issues a query which is associated with
his special interest and the query is forwarded mostly within the interest group of other
users who also have this interest, then the wanted ﬁle or item will be found in a much
shorter time compared with those traditional P2P networks. This is the basic idea of our

DIAS system.

1.4 DIAS

An interest group is a set of nodes that show a high probability in querying the same
category of content in a P2P system. For example, nodes that frequently send queries of
Linux documents could be put into a “Linux Document” group. The protocol we propose
intends to provide a mechanism for each node to determine which groups to join, how to

join and leave, and its impact on content searching and retrieving.

The basic idea of DIAS is for any peer to select a number of current neighboring peers
whose “interest index” is close to the current peer. We call these special neighbors of a
peer “buddies” of that peer. The peers who share the same interests form interest groups
on the overlay. The number of those neighboring peers selected is called “group size”.
Peers in this system will calculate their buddy sets periodically, using active polling
scheme or passive reporting scheme, which will be detailed later. In addition to determin-
ing whether a neighboring peer is a buddy, a peer can also determine whether that
neighboring peer’s neighbors are its buddies, based on these peers’ interest information.
Then gradually buddies of same interest will be clustered in a group. Using this approach,
after a period of warm-up time, a peer will be surrounded by its buddies on the overlay.
And its query will be sent to these buddies directly. Also the queries coming into this

peer will be forwarded to these buddies.

Each node in a P2P system is associated with a number of interest groups. In the case
that multiple users may share the same node each user has a proﬁle that deﬁnes a set of
interest groups. While bootstrapping, a node is able to identify its interest groups based

on its query history. Thus an initial i-group list could be created in the node. To partici-

pate in an unstructured P2P network, the node has to discover a small number of buddies
who have the similar interest such that it can route query messages via these neighbors.
A buddy list can be obtained from buddy cache from previous operation history, or
gradually can be populated after some warming-up time. I Each i-group entry of a node
may have multiple ranked buddies. As a result, query search will have a large probability

to success in the interest group.
1.5 Description of the Remaining Chapters

We will elaborate our DIAS system in the following chapters: In Chapter 2, we intro-
duce some related works including Freenet, Gnutella, PeerSearch, Neurogrid, and JXT A.
We present the architecture of our DIAS system in detail and show the simulation result
in Chapter 3. In Chapter 4, we introduce the implementation of DIAS on JXT A platform.

At last, we draw a conclusion on this thesis and discuss the future work in Chapter 5.

CHAPTER 2 RELATED WORK

2.1 Overview of P2P Protocols

2.1.1 Gnutella

Gnutella [2] is a fully decentralized, P2P application layer network designed by Null-
soft, a subsidiary of American Online. In a Gnutella network, each peer stores some spe-
cial ﬁles and routes ﬁle searches (queries) from and to neighboring peers. If the searched
ﬁle is stored locally in a certain peer, then this peer will response to this query, and the
peer who initializes the query will download the ﬁle directly from the peer who has the
ﬁle. The Gnutella network work by simple blind ﬂooding: a peer who wants to search a
ﬁle sends a query to its neighbors on the overlay. When the query is forwarded to a peer,
and if this peer does not have the searched ﬁle, the query will be further forwarded to its
neighbors, and so on, until a certain query 'ITL is reached. Each querying message has a
global unique ID (GUID), and each peer maintains a list of recently seen GUID. Using
this strategy, peers in Gnutella networks can ﬁlter out the messages they have already
seen recently, so no messages can be received by a peer more than twice and query loop-
ing is avoided. In order to download the ﬁle that is found at some peer, each peer keeps a
table locally that stores the information about where each searching query comes from.
Using these tables, searching responses can be routed back to where they come from. At
last, a connection will be established between the peer that issues the query and the peer

who offers a response for the query, and the original peer can download the ﬁle from the

destination peer directly through HTTP. Figure 2 [18] shows how Gnutella works in each

step.

Hoalcatche Gnutella Network N
Request/Receive //”
a set of Active ,I’
Peers /

 

Figure 2: A new servent joining a P2P network

A Gnutella servent connects itself to the network by establishing a connection with
another servent currently on the network. Once the address of another servent on the net-
work is obtained, a TCP/IP connection to the servent is created. This step is shown in
Figure 2.

Once a servent has connected successfully to the network, it communicates with other
servents by sending and receiving Gnutella protocol descriptors.

A servent sends a Ping descriptor to actively discover hosts in the network. A servent
receiving a Ping descriptor is expected to respond with one or more Pong descriptors,
which includes the address of a connected Gnutella servent and information regarding the
amount of data it is making available to the network. This step is shown in Figure 3.

A peer sends a Query descriptor to search a particular document in the network. A

servent receiving a Query descriptor will respond with a QueryHit if a match is found

against its local data set. The QueryHit descriptor provides the recipient with enough in-
formation to acquire the data matching the corresponding Query. This step is shown in

Figure 4.

Gnutella Network N

 

ll

 

 

 

 

 

Servent p

 

Figure 3: A servent discovers hosts in the network

Gnutella Network N

 

 

 

 

Servent p

Servent p2

Figure 4: A servent requests a ﬁle

Gnutella works in a parallel way, in which the query will spread through the whole

network as far as possible until the connectivity and TTL (T ime-to—live) exhaust. If a

10

match or a possible loop is discovered in a peer, then the query will stop at this peer. But
the search will still continue to propagate to the other peers in the network. Thus a large
amount of trafﬁc overhead will be generated even if a query response is available from a

peer several hops away from the sender.

2.1.2 Napster

Napster [1] is a centralized application-level, client-server protocol over point-to-point
TCP for sharing ﬁles over Internet. The Napster system works in the following steps:
When a peer enters the system, it ﬁrst connects to the Napster server, and then uploads its
list of ﬁles to the server. This process is called “push”. The peer will also send the server
a set of keywords that can be used to identify its local ﬁles. When another peer sends
query to the Napster server, the server pings the hosts that apparently have the data, and

then select the best of the correct answers. Then this user retrieves the ﬁles it requires.

2.1.3 Freenet

Freenet [2] is another well-known system in P2P environments. In a Freenet network,
each ﬁle is associated with a unique key that is created by hash functions and can be used
to identify the ﬁle. At the same time, each peer has a knowledge base (KB) that stores
information about the relationships between keys and peers. When a peer initializes a
query, Freenet ranks the different keys it knows in the network, by comparing these keys
with the key included in the query message, the more similar keys are ranked higher.
Then the query will be forwarded to the peer that is associated with the most similar key.
If the requested ﬁle cannot be found in this peer, then the query will be forwarded to the

peer that is associated with the second most similar key, and so on. As we can see that

11

Freenet works in a greedy serial fashion. When a match occurs, that is, the search suc-
ceeds, the query is returned along the query path, and the searched ﬁle is passed to the
original peer together with the query along the query path. At the same time, the found
ﬁle is cached or replicated in all the peers along the querying route. Also to guarantee
anonymity, any peers along the querying route can declare that they are the source of the
found ﬁle randomly. The original peer that issues the query will update its knowledge
base (KB) so that the key that it searched can be associated with the peers that declared to
be the source of the found ﬁle. This will be quite helpful in the next time when the peer

issue a query including the similar keys.

2.1.4 Kazaa

Kazaa [9] uses a hybrid architecture in which some peers are elected as “supemodes”.
Content available at regular peers surrounding a supemode are indexed.

To search a particular piece of content, regular peers send query requests to their su-
pemodes. Supemodes communicate amongst themselves to ﬁnd where the wanted data is
located. When succeed, they return locations of matching objects to the requesting peer.
To download an object, a peer initiates one or more connections to other peers that have
replicas of the object. The requesting peers may either transfer the entire object in one
connection from a single peer or choose to download multiple fragments in parallel from
multiple peers. The download fashion in Kazaa is different with that in Gnutella, where

only one connection is used. Thus Kazaa carries a more efﬁcient download scheme.

12

2.1.5 NeuroGrid

NeuroGrid [6] is an adaptive decentralized network system. Peers in NeuroGrid sys-
tem support distributed searches through semantic routing, that is, queries are forwarded
based on content. A learning mechanism is applied in NeuroGrid system. It adjusts meta-
data describing the peers’ content and the documents that consist of those contents. Like
Freenet, each NeuroGrid peer also maintains a knowledge base (KB) locally. Any time
when a search succeeds, the original peer that issues the query will update its knowledge
base to store the information about the relationship between the document keywords and
remote peer in which the match occurs. At the same time, a direct connection is estab-
lished between the original peer and the destination peer. Each NeuroGrid peer forwards
queries only to a subset of peers in which it believes the wanted ﬁle may be found, in-
stead of forwarding queries to all of its neighbors. And this subset is determined by re-
sorting the knowledge base. [2].

NeuroGrid system improves the searching efﬁciency and reduces the trafﬁc overhead.
Compared with Gnutella, NeuroGrid supports a really efﬁcient search procedure because
it is able to consistently determine peer identities. Compared with Freenet, NeuroGrid
produces much fewer local cache overheads that are required to achieve some of its at-
tractive qualities. But there is an important side effect in NeuroGrid system: Whenever a
match occurs in a remote peer, a new connection will be established between the original
peer and the remote peer, and a new record storing the relationship between the querying
keywords and the destination peer will be added to the knowledge base of the original
peer. So after a long time, too many new connections will be added to the network, and

too many new records will be stored in each peer’s knowledge base, but no old connec-

l3

tions or old records will be deleted. This will ultimately be very expensive, with connec-

tions and knowledge bases growing indeﬁnitely.

2.1.6 Interest-Based Locality Using Shortcut

This content location solution using interest-based locality principle [7] is proposed by
a research lab in Carnegie Mellon University. In this solution, peers loosely organize
themselves into an interest-based structure on top of the existing Gnutella network. The
interest-based locality principle posits that if a peer has a particular piece of content that
one is interested in, it is very likely that it will have other items that one is interested in as
well. Shortcut discovery is used in this solution: When a peer joins the system, it ﬁrst at-
tempts to locate the content through ﬂooding. The lookup returns a set of peers that store
the content. These peers are potential candidates to be added to a “shortcut list”. Subse-
quent queries for content go through the shortcut list and select the peers with the highest
rank. If a peer cannot ﬁnd content through the list, it issues a lookup through Gnutella,

and repeats the process for adding new shortcuts.

2.1.7 PeerSearch

PeerSearch [5] is a P2P information retrieval system proposed by HP Lab. The basic
idea of this system is to combine index placement and query routing. It only needs to
search a small number of servants to identify matching documents. Leveraging some in-
formation retrieva] algorithms, PeerSearch represents documents and queries as vectors
and measure the similarity between a query and a document as the cosine of the angle

between their vector representations. PeerSearch stores a document index using its vector

14

representation as the coordinates, resulting in that indices stored close to each other are

also close in semantics.
2.2 Research Issues in Existing P2P Applications

By comparing these protocols, we can say that: Napster uses the centralized client-
server model, which is not as ﬂexible as the other protocols, and the additional server is a
big consuming. Kazaa employs hybrid architecture where a few supemodes act as the
servers. Gnutella is running a fairly inefﬁcient search procedure in a fully decentralized
network. Freenet, NeuroGrid, the shortcut solution, and PeerSearch appear to support a
more efﬁcient search process than Gnutella. But Freenet achieves this with much more
local cache requirements, because it works in a greedy serial fashion. In NeuroGrid, new
connections continue to be added among the peers in the network. New records continue
to be inserted into the knowledge database of each peer. So the connection overhead and
knowledge base overhead are really critical. The shortcut solution generates and main-
tains shortcut list for each peer according to the results of the searches issued by the peer,
Peers in the list are believed to share the same interests with this peer. PeerSearch com-
putes the similarity between document vectors and query vectors to optimize search proc-
ess.

By studying the P2P protocols described above, we get the purpose of this thesis: To
ﬁnd a scheme that can improve the search efﬁciency in P2P networks while producing
additional overhead as few as possible. The download efﬁciency also needs to be im-
proved, but we won’t cover this issue in this thesis. Our DIAS protocol can improve

search efﬁciency without needing to concern the connection overhead and local informa-

15

tion overhead while producing no more trafﬁc overhead. We achieve the enhancement by

deploying the interest similarity between peers and generating interest groups.
2.3 Information Retrieval

The interest base algorithm of our DIAS system is built from the information retrieval
conception. The problem of information storage and retrieval can be stated as: [10] we
have vast amounts of information to which accurate and speedy access is becoming ever
more difﬁcult. The difﬁculty is not only how to extract the information but also how to
use it to decide relevance.

The purpose of an automatic retrieval strategy is to retrieve all the relevant documents
at the same time retrieving as few of the non-relevant as possible. When the characteriza-
tion of a document is worked out, it should be such that when the document it represents
is relevant to a query, it will enable the document to be retrieved in response to that
query.When the indexing is done automatically it is assumed that by pushing the text of
a document or query through the same automatic analysis, the output will be a representa-
tion of the content, and if the document is relevant to the query, a computational proce-
dure will show this.

We will elaborate the conception of information retrieval in Chapter 3.
2.4 J XTA

As the collection of all software and devices connected to the Internet, the web is
evolving from a Web of Computers to a Web of Things [11] as new devices are connect-

ing to the Internet. Wireless devices, a new generation of web devices, such as pagers,

l6

PDAs, cell phones, and other consumer devices are among these new devices that are en-
tering the web world. Analysts predict that there will be a grand demand for wireless ap-
plications in the near future. P2P computing is leading itself to the more dynamic envi-
ronments of wireless devices rapidly [12]. Because the P2P applications have the proper-
ties that they are capable of creating, joining, interacting with peer groups, and posting
advertisements to provide and request resources, they can dynamically ﬁnd the stuff that
they need. This characteristic makes P2P applications suite to wireless environments
quite well.

JXT A [11] is deployed based on the imagination of a world in which peers can take
advantages of being connected with each other, despite what software and hardware plat-
forms they are using. It is an open-source project that speciﬁes a standard set of network
protocol upon the existing physical network infrastructure for ad hoc, pervasive, P2P
computing as a foundation of the upcoming Web of Things. A wide variety of P2P net-
work applications can be built above this thin but powerful layer. The JXTA project aims
at being independent of programming languages, system platforms, service deﬁnitions,
and network protocols. It provides a uniform, addressable network based on assigning
each peer in the network a unique peer ID, with all of the complexity of the underlying
physical network topologies being hidden. So peers in the JXT A network are allowed to
discover and communicate with each other, self-organize into groups, advertise and look
for network resources, without being required to understand and manage the physical
network topologies. Despite how complicated and dynamically the topologies are chang-

ing, JXT A peers do not need to care about.

17

CHAPTER 3 DIAS ARCHITECTURE

3.1 The Design of DIAS

The main idea of our DIAS system is to cluster peers on the overlay that share com-
mon interest set on the semantic layer based on the analysis of their local content, search-
ing entities, and forwarding entities. Peers in such an interest group (intemet zone) are
called group members. The members in a group are connected together according to
some scheme after a period of warm-up. And the quires issued by the group members in a
group are to be forwarded mostly within this group. To determine whether two peers
share the similar interest set, some vector space [14] based similarity coefﬁcients have
been investigated. Our preliminary simulation results show that such a semantic layer
improvement to content searching for P2P networks will signiﬁcantly improve searching

efﬁciency with moderate overhead. Figure 5 shows an overview or our DIAS system.

18

a, 0

. as

‘5 9
0

P1

P2

12

G I" 9
e 1‘ 9

V

Figure 5: An overview of DIAS

3.1.1 Interest Database and Interest Index

Our DIAS is based on the analysis of the interest set of each peer in the P2P networks.
Here we analyze the interest set according to the keywords that are contained in the ﬁles
or documents existing in the P2P networks and the querying messages going around. In
DIAS system, each peer maintains a local interest database that consists of three parts:
LCK (Local Content Keyword), LQK (Local Query Keyword), and FQK (Forwarding
Query Keyword). By comparing the keywords in these sets of each peer, our DIAS sys-
tem can determine whether any two peers in a network share the common interest set, and

according to the result of the comparison, each interest group can be formed.

Now let’s take a look at these three keyword sets:

19

LCK (Local Content Keyword): Keyword set of the local content of each peer. If
a user has some special interest, then it is very likely to have some ﬁles or docu-
ments related to its interest stored in its local computer or device, that is its local
content. So the keywords contained in the local content can indicate the interest of
the user, or, say, the peer. In our simulation that will be described later, the local
content of each peer will be assigned when the simulator starts, and LCK set will
be created at the same time. In addition, when the peer receives a query reply
from other peers, or retrieve some content from other peers, it will update its LCK
set by putting the keywords of the content in the response that are not included in
its LCK set into its LCK set. The maximum size of the LCK set is a parameter
that may affect the performance of our protocol. When the size of the LCK set of
a certain peer exceeds the speciﬁed maximum value, this peer will delete the old-
est keywords in its LCK set and maintain the size less than or equal to the speci-

ﬁed maximum value.

LQK (Local Query Keyword): Keyword set of last M; number of queries. If a
user has some particular interest, then it is quite possible that this user will issue
the queries that are related to its interest frequently. So the keywords contained in
the querying messages issued by a peer can also indicate the interest of this peer.
In our simulation, anytime when a peer issues a query, it will update its LQK set.
If some of the keywords contained in the querying message are not included in the
LQK set of this peer, it will put those keywords into the its LQK set. The maxi-

mum size of the LQK set is also a parameter that may affect the performance of

20

our protocol. Each peer will maintain the size of its LQK set according to the

same scheme used while maintaining its LCK set.

' FQK (Forwarding Query Keyword): Keyword set of last M2 number of forwarded
queries. According to our analysis to the behaviors of peers in P2P networks we
believe that some times, the keywords contained in the queries that a certain peer
often forwards may also indicate the interest of this peer. So in our simulation, we
considerate this factor in a few cases. Anytime when a peer forwards a query, it

will update its FQK set according to the same scheme described above.

Peers in our DIAS network system use these three keyword sets to estimate the inter-
est of each, look for their buddies, and generate interest groups. In the next section, we
will talk about how DIAS peers generate and maintain their buddy lists using these three

keyword sets.

3.1.2 Buddy List

A peer’s buddy list contains neighboring peers in the interest group to which the peer
should send its query. A peer decides whether one of its neighboring peers is its buddy
by comparing the LCK, LQK, FQK keyword sets of that peer and itself, so the peers in its
buddy list shares the common interest set with it. The average search path length in our
DIAS network system can be reduced considerably with each peer sending and forward-
ing queries directly to its buddies.

The buddy list of each DIAS peer has to be updated from time to time in order to re-
ﬂect the most recent interest distribution in the network. We provide two update schemes:

polling and reporting. The simulation can run based on either of these schemes.

21

. Polling

A peer will update its buddy list every BUDDY_UPDATE_LOOP times of its issuing
queries. Here BUDDY_UPDATE_LOOP is a parameter in our properties ﬁle that will
affect the performance of the system. For example, if BUDDY_UPDATE_LOOP is set to
10, then a peer will update its buddy list when it issues 10 queries, 20 queries, 30 queries,

etc.

When a peer is about to update its buddy list in the interest group, it will send interest-
PING message with speciﬁc indication of LCK, LQK, or FQK to its neighboring peers.
Its neighbors will reply to the interest-PING with their local interest databases. The in-
formation of the interest database of other peers will be saved (cached) in the underlying

peer for later use.

To go one step further (more step could be possible, which is determined by TTL of
interest-PING), when a peer is about to update its buddy list, it can also request interest
information of the peers who are the neighbors of its direct neighbors. For such request, a
peer can either reply with its local cached interest information, or initiate another interest-
PING with TTL set to 1 to its own neighbors, which in turn updates its local cached in-
terest information. To avoid ﬂooding when establishing buddy relationship for a peer, the
interest-PING should have a TTL of less than 3. Again this parameter will likely affect

the performance of the system.

We cannot directly exchange buddy list between two peers due to the fact that they
might be in different interest groups. Instead, upon receiving the interest information of
its neighbors and its neighbors’ neighbors, a peer can decide which one among these

peers can be added to its buddy list following some “Interest Matching” algorithms that

22

will be described later. When a peer identiﬁes a buddy after calculation, and this buddy
is not connected to this peer right now (the buddy must be the neighbor’s neighbor of this
peer), then a new connection will be established between the buddy and the polling peer.
As a result, the connections of each peer will become more and more gradually, and this
is an overhead to our system. The maximum number of the connections of each DIAS
peer is a parameter that may affect the performance of our protocol. When the number of
a certain peer’s connections exceeds the speciﬁed maximum value, this peer will ﬁrst de-
lete the connections of its neighbors that are not its buddies, and then, if its connections
are still more than the speciﬁed maximum value, it will delete the buddies with the lowest
rank in its buddy list and maintain its connections less than or equal to the speciﬁed
maximum value. We will describe how the buddies are ranked for a certain peer in Sec-

tion 1.3.
I Reporting

In the reporting pattern, DIAS peers will not inquire the interest database information
from their neighbors positively as in the polling pattern. Instead, they wait passively. Any
time after a peer updates its buddy list, it will check the difference between its old buddy
list and its new buddy list, that is, the number of the peers that are in the old buddy list
but not in the new buddy list plus the number of the peers that are in the new buddy list
but not in the old buddy list. If the total number exceeds a speciﬁed maximum number
NO_CONN_CHANGES then this peer will report to its neighbors. Here
NO_CONN_CHANGES is a parameter in our property ﬁle that may affect the perform-

ance of our protocol.

23

A peer reports its buddy list change in such a way: It sends a message to all of its
neighbors telling them that its buddy list has been changed enough for them to consider-
ate, and the neighbors may forward this message to their neighbors, according to the TTL
of interest-PING. Again, as in the polling fashion, to avoid ﬂooding, the interest-PING
should have a TTL value less than 3. And this parameter will likely affect the perform-

ance of the system.

Each DIAS peer maintains a variable “o_no_neibor_changes” that records how many
messages described above have been sent to this peer. If the number of the received mes-
sages exceeds a speciﬁed maximum number NO_BUD_CHANGES then this peer will
update its buddy list. Here NO_BUD_CHANGES is a parameter in our property ﬁle that
may affect the performance of our system. After the peer updates its buddy list, it will

report and set the variable “o_no_neibor__changes” to 0 to start a new turn of recording.

3.1.3‘Interest Matching

- Information Retrieval

Deﬁnition

In order to elaborate the scheme that we used in our DIAS network system for the in-
terest similarity measure, let’s ﬁrst take a look at the concept of information retrieval
based on which we developed our interest matching strategy.

What is information retrieval (IR)? It is a wide loosely-deﬁned term. Lancaster [15]
gave a perfect and straightforward deﬁnition: “Information retrieval is the term conven-
tionally, though somewhat inaccurately, applied to the type of activity discussed in this
volume. An information retrieval system does not inform (i.e. change the knowledge of)

the user on the subject of his inquiry. It merely informs on the existence (or non—

24

existence) and whereabouts of documents relating to his request.” Some basic concepts
in the information retrieval term are as follows:
I Document is represented by a set of keywords. The keywords can also be call
“index terms”.
I Keywords may be assigned binary weights or gain their weighs from the calcula-
tion on statistics of their frequency in documents.
I “Retrieval” is a “matching” process between the keywords in documents and the
keywords in queries.

The purpose of a retrieval strategy is to retrieve the relevant documents as many as
possible and at the same time, retrieve the non-relevant as few as possible. After the
characterizations of a document is mined out, when a document represented by these
characterizations is related to a query, this document should be able to be retrieved in re-

ply to that query. Below is a ﬁgure describing a typical information retrieval system [10]:

Feedback

 

 

Processor
Input Output

 

 

 

 

~— Documents

 

Figure 6: A typical information retrieval model
25

Model

Now let’s embody the theory of information retrieval using a model. A typical IR
model is shown in Figure 6. The purpose of an information retrieval model is to formalize
the way of thinking about the information retrieval. In the model, we deﬁne a set of ob-
jects and assertions can be made about these objects. Also in the model, we restrict the
ways in which classes of objects can interact with each other. The purpose of an informa-
tion retrieval model is to specify the representations that are used for documents and in-

formation needs, and how the comparison between the keywords of them is implemented.

An information retrieval model can be presented as a [D 9 Q, FaR(qi 9d} )]

where

I D is a set of representations for the documents in the collection
I Q is a set of representations for queries, that is, the user information needs

F is a framework for modeling document representations, user information

needs, and the relationships between them

I R (q i r d j) is a ranking function. It associates a real number with a query rep-

resentation q i (q iEQ ) and a document representation dj (dng) (Baeza-

Yates & Ribeiro-Neto, 1999 [16])There are several information retrieval models
that have been deployed: Boolean Model, Vector Space Model, Probabilistic Model, Lan-
guage Model, etc. The interest matching scheme used in our DIAS system is based on the

vector space model, which is an algebraic model. In the vector space model, documents

26

ments in the collection ( D ) and user information needs ( Q ) are both documents in L-

dimensional space.
Vector space model
Vector space model is based on the idea of L—dimensional document space. User in-
formation needs in this model are also located in document space. Documents and queries
are both points in L-dimensional space, where L is the number of unique index terms in
the data collection, and documents and queries are vectors of these index terms. Actual
vectors may have many terms, and can use binary keyword weights or assume 0-1

weights. Below is a simple example:

 

9’ 6‘ 9’ “

Terms: “duck”, “horse”, “dog , squirrel”, “pig , cat”
Binary: (l, O, 0, 0, 0,1),(0,1, 0,1, 0, 0)

Weighted: (0.01, 0.03, 0.01, 0.0, 0.05, 0.0)

 

 

 

A two-dimension vector space and a three-dimension vector space are shown in Figure
7 and Figure 8.

In the vector space model, not only documents can be weighted, but queries can also
be weighted. If we assume the index terms of documents have different values for re-
trieval, then we can assign weights to each term in each document.

The vector space model calculates degree of similarity between documents and queries
in the collection, and ranks output by sorting the values of similarity. Documents are
ranked according to how close they are to the query. There are many possible matching
functions in the vector space model and we will talk about them later. Figure 9 illustrates

a simple structure of the vector space model.

27

A two-dimension vector space

 

5 I I I I I I I I I

45*
4~ . 02=2t1+4tZ

3.5 ~

02=411+32

I

2.5

1.5

I

‘Dr=or1+1r2 I

0.5 >-

 

 

 

Figure 7: A two-dimension vector space

28

Imension vector space

Athree-d

    

 

\ \ \
x x x x a
x x \ a \ x
a a a \ x x
\ \ \ x x \
writtrllirsrtlitrrrrtrlrIL 4
\ x \ x x x.
x c 3\ x x x
a a 3‘ \ x x
\ +1 x x \
a----..--im----..----..----\
\ a \ a
\ a I. I \ \/ e
\ ‘ ‘1- ‘ \ c
x \ \M a a x , a
c x x = \ \ \ . p
1 ,
rIlrrrIIIILiQIILIIIILIIIIK s
\ x x x x > C
s c c \ x x, y r
k x x a a x , a X m
\ x. c \ s x , c
\ x s ‘ x x .\ a
\ c \x \N u m N, ,\ v
+ «
a a \ a 2 x. I n
x c k c s a\ . x,
\ c a c \ +\ , x , om

printsrruukurrrkulirkriihux .\ ,

. s , . . 4, ,-\ a m
, , , , s , : ,\ e
. , , , . , c

, , s , , . N x x. m
s 01
s, x a, x ,. xc , c, d
x o
1till+tlli4lrrlarirIJIII|J .\ u
. , , , , , .A‘ w

, , , , , .

. . 3 . . , , \. o m
, , a , , . s s a. t
. , + . . , ,\

, , 2 . 3 C .\ A
1|ttt4lﬁlt4tlrtartl I3IIJ ,\
, ,H , , ,m , x on
, , a , , , \
, ,ﬂ , , , , a. 8
, ,1 . , ,1. , \ , e
. D . , + , r.
T ,\
a g a a 1‘ ’
,rrriiAIrrrAIorrJrrlrﬂrrlrl,
, . , s 2 , y 0'.
, . , , D, , F
, , . , s , c
, , . , . ,\
. , , , , ,\
rnuulrlinup-illpnunucururk
’ I ’ I a ’
’ z y p a ’
, , , , , ,
g a ’ I a g
, . , , , ,
, . , . s ,
r r r r r
\ ﬂ \ \ \ >
5 4 3 2 4| 05
z

29

     

Figure 9: A simple structure of the vector space model

I Distributional Similarity Measures

In the vector space, the closer two vectors are, the more similar they are. Documents
and queries clustering together are very likely to be highly similar. We need a method
that allows ranking, using threshold and. reforrnulating query, to calculate the similarity
between documents and queries or documents and documents. There are several existing
distributional similarity measures in the vector space model: Cosine coefﬁcient [17], Jac-
card coefﬁcient [17], Dice coefﬁcient [17], Overlap coefﬁcient, Ll distance (City block
distance, Manhattan distance), Euclidean distance (L2 distance), Hellinger distance, In-
formation Radius (Jensen-Shannon divergence), Skew divergence Cosine coefﬁcient,

Jaccard coefﬁcient and Dice coefﬁcient are the most popular measures. We will only

30

elaborate these three measures that are based on contribution to similarity of co-occurring
terms.

Some conventions are assumed ﬁrst:
I N : the number of documents in collection

I L : the number of unique terms in collection

I D i ;the i th document in the collection

I Ti : the i th term in the collection

" Q : a query

I Wi,Dj : the weight of term i in Document j

' Wi,Q : the weight of term i in Query

Most similarity functions are based on the inner product. The similarity between a
document and a query is the sum of the products of the term weights in the document and

the query:

SIM(D,.,Q) = Z(W.-,D, *Wrg)

And the sinrilarity between two documents is the sum of the products of the term

weights in these two documents:
SIM(D,.,DJ.) = 2M“), *ij)

Dice coefficient

31

Using dice coefﬁcient measure, the degree of similarity between a document and

query is calculated using the equation below:

22<W.-,D,.. *W..Q>
IZWDfEZsz

The sum is taken over all terms for index 1 to the dimensionality of the vector space L

 

SIM(DJ.,Q) =

in the collection. For binary documents, the equation can be simpliﬁed to:

2A
SIM D, =——
( Q) B+C'

Jaccard coefficient

Using dice coefﬁcient measure, the degree of similarity between a document and

query is calculated using the following equation:

20%,», *W.,Q) ,
Zijz TZWmZ — 2(ij *Wi,Q)

The sum is also taken over all index terms for 1 to L in the collection. For binary

 

SIM(DJ.,Q)=

documents, the equation can be simpliﬁed to:

A
B+C—A°

 

SIM (D, Q) =

Cosine coefﬁcient
Cosine coefﬁcient shown in Figure 10 is the most commonly used measure. This
measure is based on the calculation of cosine of the angle between a document vector and

a query VCCIOI‘ 01' IWO document VCCIOI‘S.

32

 

Figure 10: Cosine coefﬁcient

Using this measure, the degree of similarity between a document and query is calcu-

lated using the equation below:
2 =|<

”WWW

Again the sum is taken over all index terms for 1 to L in the collection.
I Interest Matching in DIAS
In our DIAS system, each peer has three keyword set, LCK, LQK, and FQK. In order
to generate the buddy list for each peer, we need to calculate the degree of similarity be-
tween the keyword sets of two peers. Our computation is based on the cosine coefﬁcient
measure in the vector space, and some appropriate modiﬁcations are made according to

the real case in our system. The Unions of these three sets are used in the computation.

We know that in the standard cosine coefﬁcient measure, the convention “ILDJ. denotes

the weight of term i in Document j. In our DIAS system, we use the function

1(Snj a k i ) to denote the existence of keyword i in the keyword set j of peer 11. Here i

33

takes the value over all the keyword index in the keyword collection of our system, j

takes the value from 1, 2, 3, because we only set up three keyword sets in our system, and

n is the index of DIAS peer. If k: exists in the joined keyword set j of peer n, then the

value of [(Snjaki) is set to 1; otherwise 0. Let A and P2 be two peers in the

DIAS system. Because the contribution to the total degree of interest similarity between

two peers by each keyword set is different, we assign different weight to different set.

The weight for each keyword set is denoted as W j ,where j = l, 2, 3 for each set respec-

tively. The degree of interest similarity between two peers can be computed using the

following equation:

{ Z(I(Slj’ki) *I(S2j’ki))

SIM(p,p)= W-* 4—————-——-
l 2 142.3 J 21(Slj,ki)*ZI(S21-,ki))
(V i i

An example of Interest Matching computation is described in Table 1 and Figure 11.

 

 

Here the keyword collection is {10, 11, 12, 13, , 22}, W1=0.6, W2 =0.3, W3 =0.1.

Using the give equation, the degree of interest similarity between pl and p2 is calcu-

lated as:

SIM(p1, P2) =

——+ 0.1* =O.654

2
6*——-—+O
J4*23*J21*2 J32*

34

Table 1 An example of two-peer interesting matching

 

 

 

 

 

 

Keyword Set LCK LQK FQK

pl 10,11,12,15 15, 17 20,21,22
p2 10,11 15,20 21,22
SIM ( pl , p2) 0.654

 

 

 

 

LCK LQK

 

 

FQK

 

Figure 11: An example of two-peer interesting matching

After the calculation of the interest similarity between two peers, how can we deter-
mine whether these two peers are buddies of each other? We use a threshold to accom-
plish this. If the result of similarity calculation between two peers is equal to or more than
the threshold, that means, there is enough number of shared keywords between them,
then these two peers are considered as buddies, otherwise, not. This threshold is a pa-
rameter in our properties ﬁle and may affect the performance of our protocol.

Buddies in the buddy list of a certain peer are ranked according to the result of similar-
ity calculation between the buddies and this peer. The buddy with a high value of result is

ranked high, and the buddy with a low value of result is ranked lower, because the former

35

one has the more similar interest with the peer than the later one. When a peer has to de-
lete some of its buddies because the number of its connections exceeds the limitation, it

will ﬁrst delete the buddies with the lowest rank that has the least similar interest with it.

3.1.4 Search Strategy

We have elaborated the core of our DIAS system: How the calculation of interest
similarity between peers is implemented and how the buddy list of each peer is generated.
We know that the main purpose of this thesis is to improve the search efﬁciency, and
what we have done above are all for this purpose. Now let’s take a look at how searches
are implemented to improve the efﬁciency in the DIAS system.

Queries in DIAS network use similar techniques to Gnutella to prevent loops and in-
deﬁnite propagation: We use TI‘L (Time To Live) counters to limit the range of queries,
whenever the path length of a query exceeds TI'L, this search will terminate despite
whether it succeeds. We use GUID (Globally Unique Identiﬁer) to'prevent a node for-
warding a query more than twice. Whenever a querying message arrive at a certain peer,
this peer will ﬁrst check whether it has seen the GUID of this querying message. If so, it
will discard it; otherwise, it will process it.

In our simulation, each search is started at a randomly selected peer. When a peer is
selected to initiate a query, a document will be selected randomly from the local content
of this peer and a keyword will be selected randomly from the selected document. Then
all the documents contained the selected keyword in the document collection of our
DIAS system will be picked out except those in the local content of the selected peer.
And the target document that the selected peer wants to search is selected randomly from

the picked document set. Finally, the keyword set contained in the target document and

36

the target document itself are put into the corresponding ﬁelds of the querying message
and the message is added into the inbox of the selected peer that will initiate a query.

After the querying message is generated, the selected peer will ﬁrst update its keyword
set LQK by adding the keywords in the query into LQK if they are not already in LQK.
Then it will increase the number of searches it has initiated in the present turn. After
these two steps, it will decide whether it’s the time for it to update its buddy list based on
the polling scheme or reporting scheme, and then update its buddy list using the method
we described above if appropriate. Upon ﬁnished all of these steps, the peer will send its
query to all of its neighboring peers, including its buddies.

Whenever a peer receives a querying message, it will ﬁrst check whether the target
document contained in the message reside in its local content. If so, the search succeeds.
If not, the peer will update its keyword set FQK by adding the keywords contained in the
message into FQK if they are not already in FQK. Then it forwards the message to all of

its neighbors.

3.2 Implementation of DIAS

Our DIAS system is developed from the existing work of NeuroGrid. The system is
built on the platform of Unix, using Java programming language. The functions are pack-
aged into three major classes: Node, Network, DishonestNode and DishonestNetwork.

Major members in the class Nodes are:

I protected String o_node_ID: The unique ID of each node
I protected Hashtable o_conn_list: List of connections to other nodes

I public MultiHashtable o_contents: Stores docs indexed via keyword

37

protected Vector lck, lqk, qu: The three keyword sets

protected Vector o_buddy_set: Buddy list

public int o_no_search: The number of conducted searches in a certain interval
public int o_no_neibor_changes: The number of neighbors who have updated

their buddy list recently which is used in the reporting fashion.

Major functionality methods in the class Node are:

public void UpdateKeywordSet(Vector p_vector, Keyword[] p_keywords): Up-
date each keyword set according the parameter.

public void add_buddy(Node p_node, int p_rate): Add a buddy to the buddy list
public void IncreNoSearchO: Increase the value of the member o_no_search by 1.
public void IncreNoChanges(): Increase the value of the member

o_no_neibor_changes by 1

Major functionality methods in the class DishonestNode are:

public void UpdateBuddySet( int p_update_loop, double threshold): Update the
buddy list

public void report(): A node report to its neighbors that it has updated its buddy
list in the reporting fashion.

public boolean processMessage( Message p_message, boolean p_start) throws

Exception: Process a search message when it comes.

Major members in the class Network are:

public static double o_threshold: The threshold used when generating the buddy

list for each node.

38

I public static double o_lck_coeﬁ; o_lqk_coeﬁ§ ojqk_coeﬁ: The coefﬁcient used
when generating the buddy list.
Major functionality methods in the class DishonestNetwork are:

I public String[] searchNetwork( int p_start_TTL)I throws Exception: Initiate a
search

I public void generateContent(Hashtable p_nodes, int p_no_documents_per_node)

throws Exception: Assign documents to nodes using Zipf distribution

3.3 Simulation Setup

3.3.1 Parameters

In this thesis, preliminary simulations of the DIAS and Gnutella systems were per-
formed. We set the parameters of these two systems in our properties ﬁle. Both of these
two networks contained 1000 nodes that is set by the parameter NO_NODES in the prop-
erties ﬁle, a pool of 2000 documents that is set by the parameter NO_DOCUMENTS and
a pool of 1000 keywords that is set by the parameter NO_KEYWORDS. Here the nodes
we refer to are equal to the peers we refer to above. Each document in the document pool
is randomly assigned 1 keyword from the keyword pool. The reason why we set the pa-
rameter “NO_KEYWORDS_PER_DOCUMENT” to the value of 1 is because that by
doing so, the distribution of the documents among the nodes represented the distribution
of the keywords among the nodes. So we implemented the Zipf distribution of the key-
words by distributing the documents in Zipf fashion. In the next section, we will elabo-
rate how the documents were distributed among the nodes in Zipf fashion. Each node was

connected to 30 other nodes that were selected at random. This is set by the parameter

39

NO_CONNECTIONS_PER_NODE. In both simulations, each node could receive infor-
mation about the contents of their neighbor nodes. The initial 'ITL was set to 9 by the pa-
rameter START_'ITL. Both simulations were run for 100,000 searches set by the pa-
rameter INTERNAL_LOOP, with each search being started at a node selected at random.
The target of each search was for a document that shared the only keyword with docu-
ments in the starting node, which has been described in Section 1.4. The simulations were
run in both polling fashion and reporting fashion. The number of searches by which a
node will update its buddy set in the polling fashion is set by the parameter
BUDDY_UPDATE_LOOP. The number of connection changes by which a node will
report its change to its neighbors in the reporting fashion is set by the parameter
NO_CONN_CHANGES, and the number of neighbor changes by which a node will up-
date its buddy set in the reporting fashion is set by NO_BUD_CHANGES.

The properties ﬁle is shown in Figure 12.

4O

 

 

SIMULATION_TYPE=neuro grid

MAIN_LOG_FILE=ng_growing_simulation.log
SUMMARY_LOG_FILE=ng_growing_sum_simulation.log
ARCHITECTURE_LOG_FILE=ng_growing_ar’ch_simulation.l0g
PROBE_LOG_FILE=ng_gmwing_probe_simulation.log
BUDDY_LOG_FILE=ng_growing_buddy_simulation.log
SEARCH_LOG_FILE=ng_growing_search_simulation.log
APPLET=false

NO_KEYWORDS=1000
NO_DOCUMENTS=2000
NO_KEYWORDS_PER_DOCUMENT= l
NO_NODES=1000
NO_HONEST_NODES=990
NO_DOCUMENTS_PER_NODE=3
MAX_KNOWLEDGE_PER_NODE=1

N O_CONNECTIONS_PER_NODE=30
MAX_CONNECTIONS_PER_NODE=3O
START_'ITL=9
DEGREE_OF_CORRELATION=1

FORWARDING_MODEL=0
INTERN AL_LOOP= 1 00000
PROBE_LOOP=800000
NO_PROBES=0
GROWTH_LOOP=100001
STATS_LOOP=1000

KNOWLEDGEzfalse

LEARN ING=false

RING_TOPOLOGY=false
DOC_KEYWORD_ZIPF_DISTRIBUTION=false
N ODE_DOC_ZIPF_DISTRIBUTION=true
RANDOM_SEARCHES=false
RANDOM_FORWARDING=false

THRESHOLD=0.4
LCKCOEFF=O.6
LQKCOEFF=0.4
FQKCOEFF=0.0
BUDDY_UPDATE_LOOP=30
NO_CONN_CHANGES=5
NO_BUD_CHANGES=5
NO_GROUPS=50

 

Figure 12: Properties file
4 l

 

3.3.2 Zipf Distribution

The term Zipf is used to describe the phenomena in which large events are of low fre-
quency, but small ones are of quite high frequency. For example, large earthquakes sel-
dom occurred in the world but small ones often occurred. Only a few words, such as ‘is’,
‘are’ and ‘the’ are used very ﬁequently, but many other ones such as ‘phenomena’,
‘earthquake’ and ‘astronautics’ are rarely used.

The deﬁnition of Zipf’s law can be give as “The probability of occurrence of words or
other items starts high and tapers off. Thus, a few occur very often while many others oc-
cur rarely”. Zipt‘s law usually refers to the 'size' y of an occurrence of an event relative to
its rank r [18]. We can use the following equation to denote the frequency of the occur-

rence of the nth ranked item:

 

In our DIAS system, the distribution of documents among nodes follows Zipf fashion.
We assigned documents to nodes in two steps: First, the nodes were divided into several
groups, because in the real world, it is very possible that users with different interest clus-
ter into different groups. The number of groups was set by the parameter NO_GROUPS
that may affect the performance of our system. Second, within each group, documents
were assigned to nodes following the Zipf distribution. A few documents occurred fre-
quently, others seldom. The documents highly ranked were selected randomly, so in dif-

ferent groups, different sets of documents occurred frequently.

42

Because the distribution of documents represents the distribution of keywords among
nodes, which has been mentioned in the last section, keywords are also distributed in the

fashion described above.

3.4 Simulation Results

3.4.1 DIAS vs. Gnutella

In this simulation, we set the value of NO_GROUPS to 50 for both networks, the
value of THRESHOLD to 0.4 and the value of weight LCKCOEFF, LQKCOEFF,
FQKCOEFF to 0.6, 0.4 and 0.0 for our DIAS network. The value of
BUDDY_UPDATE_LOOP is set to 30 for the polling fashion. The value of
NO_CONN_CHANGES is set to 5 and the value of NO_BUD_CHANGES is set to 5 for
the reporting fashion. The results of simulations are shown in Figure 13-19.

From the data in Figure 16 we can see that the average path length was reduced dra-
matically in our DIAS system compared with the traditional Gnutella system from around
2.2 in Gnutella to around 1.6 in the polling fashion and around 1.3 in the reporting fash-
ion of DIAS. The decreasing rate is 27.3% and 40.9%. At the same time, Figure 17 shows
that the additional overhead resulting from the generation and maintenance of the buddy
lists was quite small. The number of message transfers was increased from 3"‘104 in the
Gnutella network to only 3.02"‘104 in the polling fashion of our DIAS system with in-
creasing rate 0.67%. And in the reporting fashion of our system, the average trafﬁc is al-
most not increased, except in the warm-up time at the very beginning.

We can also see that the performance in reporting fashion is better than that in polling

fashion from the data shown in Figure 16: Gnutella vs. DIAS: Average Path Length. Re-

43

porting fashion improved the searching efﬁciency more grandly than polling fashion. It’s
because in the reporting fashion, only the node with the number of neighbors that had up-
dated their buddy lists big enough would update its buddy list, and once the number met
the requirement, the node would update its buddy list, despite how many searches it had
conducted. And we know that the buddies of a certain node are among its neighbors, two
nodes that have similar interest can be connected by the introduction of their intermediate
buddies. So the updating of buddy list was quite efﬁcient in the reporting fashion, thus
nodes sharing the common interest set could be clustered together in a higher level. This
caused the high searching efﬁciency. But in the polling fashion, each node updated its
buddy list in a stable interval, despite whether necessary, so the updating efﬁciency is
lower. Also the reporting fashion produced less trafﬁc than the polling fashion. That’s
because in the reporting fashion, interest groups were generated more quickly in the
warm-up time, and after that, nodes could not update their buddy lists easily for the sake
of the limitation to the number of their neighbors that had updated their buddy lists. So
only in the warm-up time, the additional trafﬁc was a little much. But in the polling fash-
ion, nodes continued to update their buddy list with rhythm, so the additional trafﬁc ex-
isted all the time.

From these simulation results we can conclude that our DIAS system can improve the
search efﬁciency considerably with the acceptable additional overhead in P2P networks

in contrast to Gnutella system.

Ave Path Length

 

3.5 r i

' - -- 'i " " MIR - --~= use r
2.5 W i ii ii i] i i Tit iiTl r 11‘L

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

1.5r ‘ _

 

 

 

00 2 4 6 8 10

No of Searches x104
x104 Message Transfers

4 f r 4

 

 

2.5: a

1.5" ‘

0.5- .

 

 

 

00 2 4 6 8 10

No of Searches x10

Figure 13: The 100000 searches in a 1000 node network, using Gnutella routing

for a Zipf distribution of documents over nodes.

45

 

Ave Path Length

4
3.5L ~
3" .

 

 

 

1]

 

H M I H :P "Irm-
2-5 T”: I . H -‘In '2'" i
1 r W ‘ WT” ~ ilii‘rﬁ‘wﬂl

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 
   

 

 

 

1.5L “ii-iii Wt ""Wtr' v
"silt
1 h i J IL-I-MAHLULHluaummﬂmilw
0.5~ .
00 2 4 6 8 10
No of Searches x 104
4x104 Message Transfers
3.5- -
3 miinrlliiiiiiiiiluz :::::::::ill.'.lllll::::::::::::::::::::::'::ll:::'::"::
2.5: ~
2" i
1.5 _
1 _ a
0.5- -
00 2 4 6 8 10
4
No of Searches X 10

Figure 14: The 100000 searches in a 1000 node network, using DIAS routing in

polling fashion for a Zipf distribution of documents over nodes.

46

 

Ave Path Length

3.5 ,

 

 

2.5

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

In
I l
1
I 1
Lt"
L1;
L_

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0.5 ~

 

 

00 2 4 6 8 10

No of Searches x 104

4 x 104 Message Transfers

 

3.5: .

 

 

 

 

 

 

 

 

 

 

 

 

 

2.5-

0.5“ .

 

 

 

O0 2 4 6 8 10

No of Searches x104

Figure 15: The 100000 searches in a 1000 node network, using DIAS routing in

reporting fashion for a Zipf distribution of documents over nodes.

47

Ave Path Length

 

 

 

 

 

 

 

 

3 t I F
_.._ Gnutella
2.8— --— DIAS-Polling n
-e— DIAS-Reporting
2.6? ' ~
2.4 —
2.2 ’ ,
~‘

2— l
1.8» .. ,
1.6— \ -
1.4T “in A
1.2 "9 ‘ .. ‘ “We" 'L. 2

1 r r r L

0 2 4 6 8 10

No of Searches x104

Figure 16: Gnutella vs. DIAS: Average Path Length

48

3 2 x 104 Message Transfers

 

 

—~— Gnutella
-- DIAS-Polling _
-e— DIAS-Reporting

I

3.15

 

 

 

3.1

3.05

 

2.95

I
l

2.9

I
l

2.85 - *

 

 

 

2'80 2 4 6 8 10

No of Searches x104

Figure 17: Gnutella vs. DIAS: trafﬁc overhead

3.4.2 Different Conﬁgurations in DIAS of Polling Fashion

Parameters in the properties ﬁle may affect the performance of our protocol. Now let’s
take a look at the variation of performance when the parameters are set to different values
in the polling fashion.

First, we set the value of the parameter NO_GROUPS to 200, the value of THRESH-
OLD to 0.4, and the value of BUDDY_UPDATE_LOOP is set to 20, 30, and 40 respec-
tively. Figure 18 and Figure 19 present the simulation results when the parameter

BUDDY_UPDATE_LOOP varied.

49

Average Path Length

 

 

 

 

 

3 f
28 -'-30 “
—e-40
2.6r -

 

 

 

 

 

O 2 4 6 8 10
No of Searches x104

Figure 18: Variation of average path length when BUDDY_UPDATE_LOOP

varied in the polling fashion

50

x 104 Message Transfers

 

 

 

 

 

3.2 l l l I
—~— 20
- -"' 30 _
3.15 _e_ 40
31* .
3.05~ _

 

 

 

 

 

2.95 . ~
2.9 - -

2.85 r i
2'80 2 4 6 8 10

No of Searches x104

Figure 19: Variation of trafﬁc overhead when BUDDY_UPDATE_LOOP varied

in the polling fashion

The data in Figure 18 shows that the smaller the value of BUDDY_UPDATE_LOOP
was, the earlier the average path length began to decrease. This indicates that if nodes in
the DIAS system updated their buddy lists in shorter rounds, the time of warm-up in the
system would be reduced, that is, the interest groups were generated more quickly. We
know that when BUDDY_UPDATE_LOOP was wet to a smaller value, nodes in the sys-
tem should update their buddy list more frequently. This may cause more trafﬁc, and the
additional overhead should be grander. But Figure 19 shows that there was no big differ-

ence among these three runs when the parameter BUDDY_UPDATE_LOOP varied.

51

Why? Because the additional trafﬁc that the updating of buddy lists caused was so small
that can be ignored.

Now let’s take a look at the variation of the simulation result when the parameter
THRESHOLD varied. This time we set the value of BUDDY_UPDATE_LOOP to 30,
the value of THRESHOLD to 0.4, 0.5, and 0.6 respectively, and the value of
NO_GROUPS is remained 200. Figure 20 and Figure 21 present the simulation results
when THRESHOLD varied.

The data in Figure 20 shows that there was on big difference in average path length
when the threshold varied. The reason is because nodes in our simulation shared similar
interests in a high degree. 80 after a period of warm-up time, nodes sharing the common
interest set were mostly clustered together. Thus the difference of threshold did not put
much affection on the search efﬁciency. And because the additional trafﬁc that the up-
dating of buddy lists caused was so small that can be ignored, there was no big difference

in trafﬁc when the threshold varied either in Figure 21.

52

Average Path Length

 

 

 

 

 

3 T 1 .
-— 0.4

28 -— 0.5 ~
-e- 0.6

2.6~ ~

 

 

 

 

No of Searches x104

Figure 20: Variation of average path length when THRESHOLD varied in the

polling fashion

53

x 104 Message Transfers

 

 

 

 

 

 

 

 

3.2 l I I I

-- 0.4
_ --— 0.5 _

3'15 . -e— 0.6
3.1— -
305. ~
3W
2.95~ -
2.9 ~
2.85- —
2'80 2 4 6 8 10
No of Searches x104

Figure 21: Variation of trafﬁc when THRESHOLD varied inthe polling fashion

3.4.3 Different Conﬁgurations in DIAS of Reporting Fashion

In this section, we will show the simulation result when parameters varied in the re-
porting fashion.

First, we set the value of the NO_GROUPS to 50, the value of
NO_CONN_CHANGES to 5. Figure 22 and Figure 23 present the simulation result when
the parameter NO_BUD_CHANGES was set to 5, 10, and 15, respectively.

The data in Figure 22 shows that the smaller the value of NO_BUD_CHANGES was,
the more the average path length was reduced. That is because the requirement for a node

to update its buddy list was easier to meet, nodes could update more frequently during the

54

warm-up time at the very beginning. So the interest groups were generated in a higher
level. This caused the searching efﬁciency improved more. At the same time, the more
updating was conducted, the more trafﬁc was caused in the warm-up period. Thus, in
Figure 23, the peak of the curve for NO_BUD_CHANGES 5 is the highest, and the peak
of the curve for NO_BUD_CHANGES 15 is the lowest. And the additional trafﬁc in the
warm-up period was temporary, after that, the trafﬁc for all the values of
NO_BUD_CHANGES was equal to that in Gnutella network, because the interest groups

were generated and stable.

 

 

 

 

 

Average Path Length
3 I T r r
28- -— 10 ~
-e— 15
2.6~ 4

 

 

 

 

o 2 4 6 s 10
No of Searches ) x104

Figure 22: Variation of average path length when NO_BUD_CHANGES varied

in the reporting fashion

55

3 2 x 104 Message Transfers

 

 

. "" 10 -
3-15 —e— 15

 

 

 

h —1

3.1

3.05

 

2.95 - a

2.85 ~ _

 

 

 

2'80 2 4 6 8 10

No of Searches x 104

Figure 23: Variation of trafﬁc when NO_BUD_CHANGES varied in the report-

ing fashion

Now, let’s take a look at the variation of the simulation result when the parameter
NO_CONN_CHANGES varied. This time we set the value of NO_BUD_CHANGES to
5, the value of NO_CONN_CHANGES to 5, 10, and 15, respectively, and the value of all
the other parameters remained the same as in the simulation above.

The simulation results of this time shown in Figure 24 and Figure 25 is similar to the
simulation results shown in Figure 22 and Figure 23 when the parameter
NO_BUD_CHANGES varied. The data in Figure 24 shows that the smaller the value of
NO_CONN_CHANGES was, the more the average path length was reduced. The reason

is that the requirement for a node to report its variation was easier to meet, thus during

56

the same period, more nodes could get enough varying neighbors to update their buddy
lists. As a result, buddy lists were updated more frequently during the warm-up time. So
the interest groups were generated in a higher level, which caused the searching efﬁ-
ciency improved more. Also as described above, the more updating was conducted, the
more trafﬁc was caused in the warm-up period. Thus, in Figure 25, the peak of the curve
for NO_CONN_CHANGES 5 is the highest, and the peak of the curve for
NO_CONN_CHANGES 15 is the lowest. And the additional trafﬁc in the warm-up pe-
riod was temporary, after that, the trafﬁc for all the values of NO_CONN_CHANGES
was equal to that in Gnutella network, because the interest groups were generated and

stable.

57

Average Path Length

 

 

 

 

 

 

 

 

3 f

2.8- —_ 10 ~
-e- 15
2.6— ' ‘
2.4~ _
2.2 " 7" ‘33 . d
2- u
1.8» \‘ ‘.\ d
1.6 h “ 3“ '\ d
”w 19': {‘2 .' ITO a“: .' ' \n .- ‘. ' .‘ :32 r“ 5'; I i I '
1.4“ ﬁvmvuw,‘kéﬁ “MV.¥
1.2~ ‘
1 L ’ I I

No of Searches x 104

Figure 24: Variation of average path length when NO_CONN_CHANGES varied

in the reporting fashion

58

3 2 x 104 Message Transfers

 

 

—n— 05
_ _._ 10
3-15 -e— 15

 

 

 

3.1

h —-1

3.05

 

2.95 - r

2.9

I
l

2.85

I
LL

 

 

 

2'80 2 4 6 8 10

No of Searches x104

Figure 25: Variation of trafﬁc overhead when NO_CONN_CHANGES varied in

the Reporting Fashion

59

CHAPTER 4 DIAS IMPLEMENTATION ON

J XTA

4.1 JXTA Architecture

We have introduced JXTA brieﬂy in Chapter 1. We know that project JXT A is con-
ceived with three main objectives intended to address the shortcoming of the P2P system:
interoperability, platform independence, and ubiquity. Distributed computing applications
built on top of JXTA can across different P2P systems and communities. Language inde-
pendence, system independence and network independence can be achieved. Every de-
vice in these applications, wire or wireless, has a digital heartbeat.

In the JXTA network, each peer resource is associated with a single unique peer II).
For a given peer, all of the available network interface address is encapsulated for it by a
peer endpoint, so this peer can only be addressed by its peer ID. Others peers can make
choice among the list of available peer endpoint addresses to achieve the most efﬁcient
communication with this peer while getting the advertisement from the peer endpoint,
here “advertisements” are XML documents that peers cache, publish and exchange to ex-
plore and discover network resources. The notion of interest base is also applied in JXT A
project. Peers in JXTA network self-organize into groups based on their interest. If a set
of peers have a common set of interests, and have agreed on a common set of policies, for
example, membership, then they can establish a group [1 1]. Each group is also associated

with a unique group ID.

60

For developers, Project JXTA provides a set of building blocks that provide a solid
foundation for distributed applications. Also these blocks can support the common func-
tions required by any P2P system. Also we know that distributed computing applications
require a great deal of infrastructure, and building this framework is typically time con-
suming and resource-intensive. By using Project JXT A, we can focus on creating innova-
tive software applications, and do not need to care about the infrastructure of P2P system
[19]. Therefore we select JXTA as the underlying infrastructure for our DIAS applica-

tion. Figure 26 shows the JXTA implementation overview:

 

Applications

 

 

Application API

 

 

J XTA Protocols Implementation
(OS/Runtime independent)

 

 

 

Porting API

 

 

Apache Portable UND( Windows RTOS
Runtime

 

 

 

 

 

 

 

 

 

Figure 26: J XTA application overview

InstantP2P is a demo application that illustrates key concepts of Project JXTA and
P2P system. It is consist of two main applications: Chat application and File sharing ap-
plication. The InstantP2P application provides functionality for secure one-to-one chat,
group chat, and sharing, searching, downloading documents within a peer group by using

the JXTA platform core building blocks. It can create, discover, and join groups, create a

61

connection between two peers in the one-to-one chat function, create connections among

a group of peers in the group chat function, as well as operations on ﬁles.
4.2 The Applications in DIAS

In this section, we describe how our DIAS is designed on top of the InstantP2P appli-

cation.

4.2.1 Overview

A started InstantP2p application represents a peer in a P2P network. The name of the
peer is set as soon as the initial entry for the peer is created. After boots-trapping the
start-up of the JXTA platform, the GUI application will instantiate one instance of the
class net. jxta.instanthp.PeerGroupManager. In order to implement our DIAS system,
most of our modiﬁcation was focus on this class, because it is the heart of the application.
PeerGroupManager is responsible for monitoring and inﬂuencing the buddy discoveries.
It generates and maintains the buddy list for a peer using the algorithm we describe in
Chapter 3. Upon instantiation it will immediately trigger a remote search of neighbors in
the P2P network.

The GUI application can retrieve a list of remote peers using the following methods:

. public Enumeration getNeighborList(): Returns an enumeration of all the
neighbors of this peer. The objects in this enumeration are
net. jxta.protocol.PeerGroupAdvertisement objects.

. public Enumeration getBuddyList(): Returns an enumeration of all the buddies of

this peer.

62

The PeerGroupManager class might also want to search for peers with speciﬁc names
only. It can do so by using the following method:

0 public void searchPeers (String peerName): This will trigger the remote discov-
ery of peers. If the parameter peerName is not null, the remote search is restricted
to peers with the speciﬁed name.

The above methods do not return a list of discovered peers, since this happens asyn-
chronously. Instead interested classes can get the search result by implementing the inter-
face net. jxta. instanthp. GroupStructureListener via the methods in this interface. The
classes can receive the notiﬁcations by calling the method public void addStructureLis-
tener (roupStructureListener listener) to register, and terminate the notiﬁcations by call-

in g the method public void removeStructureListener (GroupStructureListener listener).

In order to be informed if new peers are discovered, classes can implement the
net. jxta.instanthp.GroupStructureListener interface via the method public void peer-
StructureChanged 0. To retrieve a list of peers, classes can call the method getPeersO in
the class net. jxta. instantp2p.PeerGroupManager. In order to be informed if the currently
selected peer changed, classes can implement the net. jxta.instanthp.PeerListener inter-

face via the method public void peerChanged (PeerGroupAdvertisement pgAdv)

The PeerGroupSearcher class is another major class we modiﬁed to implement our
system. It also has a notion of a currently selected peer. The GUI application is responsi-
ble to inform the PeerGroupSearcher instance of changes in the currently selected peer

using the method public void selectedPeerChanged (PeerGroupAdvertisement adv).

63

PeerGroupManager can also communicate the change of the peer in one part of the
GUI to the other parts of the GUI. Classes interested in the changes of the selected peer
implement the net. jxta. instantp2p.PeerListener interface and will then be informed via
the methods in that interface of the changes. To receive the notiﬁcations interested
classes register themselves by using the method addPeerListener ( PeerListener listener).
Notiﬁcation can be terminated by calling the method removePeerListener ( PeerListener
listener). The currently selected peer can be retrieved via the method PeerGroupAdver—

tisement getSelectedPeerO.

4.2.2 Chat Application

The chat application allows a peer to chat with its neighbors or its buddies. The func-
tions are implemented in the class net. jxta. instanthp. Chat and the interface
net. jxta. instanthp.MessageBoard that we modiﬁed to apply to the chat application in our
system.

The class net. jxta. instanthp. Chat establishes a chat between a peer and its neighbor
or its buddy. Once a peer wants to chat, this class will publish the appropriate advertise-
ment remotely by the method login. The name used for the user is its peer name returned
by the method getMyPeerName() in the class PeerGroupManager. In order for the peer
to receive message, an input pipe is established and monitored also by the login method
in the Chat class. If the peer sends a message, the message will be sent to the currently
selected peer as reported by the class PeerGroupManager via an output pipe created and

monitored by the login method.

The user is informed about the success or failure of the attempt to open a chat via calls
to the MessageBoard instance. If there is no currently selected peer to chat with, the peer
is instructed to select a peer. If a peer wants to chat with its buddy, then a secure chat is
initiated, i.e. all information sent and received between the chatting buddies is encrypted.
When an output pipe to a selected buddy is opened, a public key will be send to the cor-

responding buddy.

The constructor of this class public Chat (PeerGroupSearcher manager, Message-
Board messageBoard) can register itself as a PeerListener to the manager instance,
which allows the chat application to access the current neighbors of the peer. The in-
stance messageBoard of the MessageBoard class is used to display the received messages
to the peer. The method public synchronized void logout () terminates the current chat.
The monitoring of the input pipe is suspended. The method public void sendMessage-
ToPeers (String message) propagates the message to the output pipe and thus send to the
neighbor. If the chat channel is secure, the message is encrypted before it is sent to the
buddy.

Messages may be of different types and received by different neighbors. Thus, a GUI
application would probably render these messages in different colors. In addition there is
a method that requests information about whether the user wants to accept certain chat

proposals.

65

4.2.3 File Application

The ﬁle sharing application allows a peer to manage its local content, search remote
content from the peers in the P2P network, and maintain the connections between its

neighbors and itself. These functions are implemented in several classes and interfaces.

Content distributed in our DIAS system is described by
net. jxta. share. ContentAdvertisement object. The ContentAdvertisement class extends the
JXTA Advertisement object and thus allows propagation along normal JXT A discovery
channels. The methods below return information about the content described by the Con-

tentAdvertisement object:

. public abstract String getName(): Returns a name that identiﬁes the content. This
is often the ﬁle name of the ﬁle described by the ContentAdvertisement object.
. public abstract Contentld getContentId(): Returns the U11) that uniquely identi-
ﬁes the content described in the ContentAdvertisement object.
0 public abstract long getLength(): Returns the length of the content.
. public abstract String getType(): Returns the mime type of the content in order to
select the correct method to display the content once retrieved.
0 public abstract String getDescription(): Returns a detailed description of the con-
tent.
Local content is described by net. jxta.share. Content object. The method in this inter-
face public ContentAdvertisement getContentAdvertisement() allows to extract detailed
information about the local content by return a ContentAdvertisement instant. Remote

content is described by net. jxta. instanthp.SearchResult object. The method in this inter-

66

face public ContentAdvertisement getContentAdvertisement() allows to retrieve a descrip-
tion about the remote content.

The class net. jxta. instanthp.SearchManager is the interface to the ﬁle searching ap-
plication. Its constructor public SearchManagedPeerGroupSearcher manager) retrieves
peer information from the parameter manager. This class registers itself as a PeerListener
to manager. Some major methods in this class are described below:

. public void startSearch(String searchString): Starts an asynchronous search for
remote data. The result is communicated to the registered ContentListener objects.

0 public void addContent(String path) & public void addContent( File addFilesH):
Allows a peer to add content locally.

. public void removeContent(Enumeration enum) & public void removeCon-
tent( Content c): Allows a peer to remove its local content.

If a peer wants to download the data from a remote peer, the method retrieveContent()
in the class net.jxta.instantp2p.SearchManager can be used to retrieve. the data via the
net. jxta. instanthp. ContentListener interface. To retrieve the actual data, an input pipe
needs to be opened to the peer in which the actual data resides. Since the download proc-
ess may be time consuming, it is done asynchronously. The method public void ﬁn-
ishedRetrieve( String url) is called once the retrieval has succeeded.

The class net. jxta. instanthp. ConnectionManager is responsible for connection main-
tenance. Some major methods in this class are described below:

0 public void addConnection(String peerName): Allows a peer to add connection

between a remote peer and itself.

67

. public void removeContent(String peerName): Allows a peer to remove connec-
tion between its neighbor and itself.
The classes above are the major classes we modiﬁed for the ﬁle application of our sys-

tem.
4.3 Implementation

In this section, we describe how the DIAS application is built and work.

4.3.1 Recommended System Requirements

0 Any Java enabled platforms: Windows (95, 98, ME, 2000, NT, XP), Solaris,
Linux, Unix, MaxOSX.

. 200MHz Pentium class computer with 128MB of RAM, 5 MB free disk space.
4.3.2 Building Requirements

. Windows NT Version 4 Service Pack 3, Windows 2000/98/ME
. JavaZJDK 1.3.1
- Cygwin: a Linux-like environment for Windows

. Ant: A Java-based build tool in Apache.

4.3.3 Running DIAS

. Conﬁguration

68

The ﬁrst time when DIAS is run, the JXT A Conﬁguration window that is shown in

Figure 27 will appear.

 
 

 

“A ‘
E53 JXTA Confrgurator
. , 7 89;? 'hﬂpﬂsheltlxtagmﬂndexhtmf: '-.r1,'{,,1,’.l_§}_idj‘: #3»;- ' 3 _

 

advanced RendezvoustRelays‘ mm:

 

Security Settings

Secure Usemame

Password

 

Verify Password
Please note your secure usemame and password
They will be required for all future access

 

 

 

 

 

Figure 27 : JXTA conﬁguration window

In this window, a user is required to input some necessary information such as “Peer
Name” in the “basic” tab, “Secure Usemame” and “Password” in the “Security” tab.
After the initial setup, a user is only required to enter a name and password in the Secure

Login window shown in Figure 28 to connect to the J XTA network. Each time the DIAS

69

is restarted, the user will have to verify this Secure Usemame identity, because it is

his/her unique identity on the J XTA Network.

Ems

Starting AIF E:

[215 JXTA Secure Login

seems" 'eriﬁnié:"lblack |

 

 

Figure 28: Login window

When the application launches, the main window of the DIAS application will come
out. This window has menu options that control key functionality of the system and a
panel contains four tabs: Secure Chat, Local Content, Search, and Maintain Connections.
The ﬁrst tab corresponds to the Chat application. The last three tabs consist of the File
application.

. Secure Chat

The search tab is shown in Figure 29. In the left half of this tab, the buddies and
neighbors of this peer are listed. The user can select a certain neighbor to open a chat or a
buddy to open a secure chat. The text box in the bottom of the tab is used for the user to
write and send message. The sent and received messages are displayed in the right half of
the tab. The messages sent to and received from different neighbors of buddies are dis-

played in different color. That is shown in Figure 30.

70

AIFS: Peer black

File Edit Navigation Group

Add Buddy

Delete Neighbor

black is now logged in NeiPeerGroup

:3; en d M ea: age A

We iii n g for F: e n d 9 run?» u .;- .3 n n e i: ii 1:: n

 

Figure 29: Secure chat tab

71

All‘S: Peer black
File Edit Navigation Group Help

Simon} C hat i} J Loral ifcmienl irieauh ~ Kruiiairiiairi 'LIZIl‘Il'IE‘IINONE-

amen

black is now logged in NetPeerGroup

red> hello
black> Got it!

E: e n d M e a g e

'I.:"u.v'aiimg for F'eiidezmu-s. Ingririetiicm

 

Figure 30: The chat between two buddies
. Local Content Management
Figure 31 shows the local content tab. The user can use this tab to manage his/her lo-
cal content. The indexes of content are listed in the left half of the tab and the detailed
description of content are displayed in the right half. The user can add a new piece of
content by click the “Add” button, and delete an undesired piece of content by highlight

the index of this piece of content and click the “Delete” button.

72

AIFS: Peer black Elli???
_ AboutAlFS ____
::iE|jl.llE‘ I: hat 144‘ Ltnt'al I: om»: , _ [elusiaii'itairi IZ i:Ir'i.r'iectior‘rz.

"I!!!

«We in n g in r F' e n d e "Eu-'0 u -3: n: .3 n n E c ll 0 n

 

Figure 31: Local content tab

I Search Function

The search tab is shown in Figure 32. When the user wants to search a certain piece of
content, he/she should ﬁrst input the keywords he/she wants to search, then click the
“Search” button. When the search succeeds, all the matched content will be displayed in
the bottom box. The information includes the ID, the name, and the owner peer of each
piece of content. After the user has conducted several searches, he/she clear all the pervi-
ous search results by clicking the “Clear” button.

73

AIFS: Peel black EHEHEI

" itiiearitii ~ giltaii‘i‘rmii r2.mirxemraria

 

'v"'-.I':3iiirig for F‘ei'ider-,-'tuie utirinertitir'i

 

Figure 32: Search tab

I Connection Maintenance

Figure 33 shows the maintain connection tab. The peer ID and peer name of the user’s
buddies and neighbors are listed in the two boxes respectively. Though the connections of
each peer are maintained automatically using the algorithm we described in Chapter 3, a
user can manually delete a connection to its buddy or neighbor by clicking the “Add
Connection” or add a connection by clicking the “Remove Connection” button to look for
the available peers in the network according to some special willingness.

74

AIFS; Peer black
File Edit Navigation Group Help

Auto Share Preference

Ll

 

‘Ie"'l.|':_‘4llii'ig for F’E-i’ll'iEZ‘v'EIUS I: Lii’lr‘wtzlliiii'i

 

Figure 33: Maintain connection tab

75

CHAPTER 5 CONCLUSIONS

5.1 Summary of the Work

Content location in decentralized P2P networks is a challenging problem. Most P2P
protocols, such as Gnutella, employ fairly inefﬁcient search approaches. The main pur-
pose of our research is to ﬁnd out a scheme that can make use of peer’s interest to im-
prove the search efﬁciency in P2P networks without producing too much trafﬁc over-
head. By analyzing those major protocols, we presented our DIAS system, which built its
search scheme based on the conception of interest retrieval in the semantic layer. Peers in
a DIAS network are clustered into different interest groups according to their local con-
tent and historical searching behaviors. And they issue their search requirements mainly
within the interest groups associated with them. DIAS enhances the performance in a
considerable level, without complicating the simple architecture of unstructured P2P
networks.

The simulation results show that using our scheme in the P2P environment, including
ad hoc networks, the searching performance will be optimized dramatically. Also we im-
plemented DIAS on top of the JXT A application. Two applications are implemented in
the system: Chat application and File application.

Our main contributions are:

I Propose a P2P search approach which generates and maintains interest groups by

computing interest similarity between peers

76

I Show the improvement of search efﬁciency with moderate trafﬁc overhead by
simulation

I Demonstrate the feasibility of this approach by the implementation on J XTA.
5.2 Future Work

By the simulation result and the implementation on JXTA, we can conclude that our
AIFS system can support a more efﬁcient search process without producing additional
overhead. Also this scheme is feasible in real network system. But as we have mentioned
above, it cannot reduce trafﬁc overhead as in the NeuroGrid system. Our AIFS protocol
still employs the query flooding approach in the traditional Gnutella system. So one of
our future works is to try to reduce the trafﬁc overhead of the system. We may accom-
plish this goal by avoiding the random ﬂooding fashion. A possible solution is to utilize
the rank of each buddy and neighbor of a peer and only forward the querying message to
the limited number of buddies with the highest rank. A query from a node will be sent to
the ﬁrst several (group leader index) buddies in the interest group corresponding to the
keywords in a given query.

When a search succeeds, a peer will download its wanted content from the remote peer
using the traditional scheme in Gnutella and other P2P networks. In this scheme, only one
connection will be established between the querying peer and the remote destination peer.
Thus the download speed is really low and the efﬁciency is not satisfactory. So to opti-
mize the download pattern is another critical task in our future work. The strategy used in
Kazaa gives us some hint. We may utilize the idea in Kazaa to establish several connec-

tions between the querying peer and all the remote peers that have the wanted content.

77

Also the duplication of the searched content in each peer along the propagation path may

be employed.

78

BIBLIOGRAPHY

[l]

[2]

[3]

[4]

[5]

[6]

i7]

[8]

[9]

S. Saroiu, K. P. Gummadi, R. J. Dunn, S. D. Gribble, and H. M. Levy, "An
Analysis of Internet Content Delivery Systems," In Proceedings of OSDI 2002,
2002.

S. Joseph, "Adaptive Rooﬁng in Distributed Decentralized Systems: NeuroGrid,
Gnutella and Freenet," In Proceedings of workshop on Infrastructure for Agents,
MAS, and Scalable MAS, Autonomous Agents 2001 , 2001.

Y. Fu, L. Cherkasova, W. Tang, and A. Vahdat, "EtE: Passive End-to-End Inter-
net Service Performance Monitoring," In Proceedings of USENIX 2002 Annual
Conference, 2002.

E. Cohen, A. Fiat, and H. Kaplan, "Associative search in peerto -peer networks:
Harnessing latent semantics," In Proceedings of IEEE INF OCOM 2003, 2003.

C. Tang, Z. Xu, and M. Mahalingam, "Peersearch: Efﬁcient information retrieval
in peer-to—peer networks," HP Labs, Technical Report HPL-2002-198, 2002.

S. Joseph, "NeuroGrid: Semantically Routing Queries in Peer-to-Peer Networks,"

In Proceedings of the International Workshop on Peer-to-Peer Computing (c0-
located with Networking 2002), 2002.

K. Sripanidkulchai, B. Maggs, and H. Zhang, "Efficient Content Location Using
Interest-Based Locality in Peer-to-Peer Systems," In Proceedings of INF OCOM
2003, 2003.

J. Xiao, Y. Zhang, X. Jia, and T. Li, "Measuring Similarity of Interests for Clus-
tering Web—Users," In Proceedings of Australiasian Database Conference, Gold
Coast, Queensland, Australia, 2001.

Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker, "Search and replication in un-
structured peer-to-peer networks," In Proceedings of the 16th ACM Intema-tional
Conference on Supercomputing, 2002.

79

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

C. J. v. R. B.Sc., P. D. Dip. NAAC, M.B.C.S., F....IEE, C.Eng., and F.R.S.E., IN-
FORMATION RETRIEVAL, 2nd ed. London: Butterworths, 1979.

B. Traversat, M. Abdelaziz, D. Doolin, M. Duigou, J .-C. Hugly, and Eric Pouy-
oul, "Project JXTA-C: Enabling a Web of Things," In Proceedings of 36th Annual

Hawaii International Conference on System Sciences (HICSS'03), Big Island,
Hawaii, 2003.

J XTA for J 2MB: Extending the Reach of Wireless With JXTA Technology,
http://www.ixtg.org/Droiect/www/docs/JXTA4J2ME.pdf

Project JXT A 2.0 Super-Peer Virtual Network,
http://wwwjxta.or2/pro'Lect/www/white_papers.html

Z. D. Michael W. Berry, Elizabeth R. Jessup, "Matrices, Vector Spaces, and In-
formation Retrieval," Society for Industrial and Applied Mathematics, vol. 41, pp.
335 - 362, 1999.

J. W. Sons, Information Retrieval Systems: Characteristics, Testing and Evalua-
tion, 2nd ed. New York: F. Wilfrid Lancaster, 1979.

R. Baeza—Yates and B. Ribeiro-Neto, "Modern Information Retreival," ACM
SIGMETRICS Performance Evaluation Review, 1999.

P. Pathak, M. Gordon, and W. Fan, "Effective information retrieval using genetic
algorithms based matching functions adaptation," In Proceedings of the 33rd Ha-
waii International Conference on System Science, 2000, Hawaii, 2000.

Zipf, Power-laws, and Pareto - a ranking tutorial,
htt :/l in er.h l.h .com/shl/ a rs/rankin rankin .html

Project JXTA Technology: Creating Connected Communities,
http://www.jxta.org

80

IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

'IIIIIIIII IIIIII III IIIIIIIII II IIIIII

293 024700