flflflmm. ._ z in .5” ts. snub". 9 Eu... .2. I \1 :h i nu .15 ,. mm .1 ‘L , ‘ .1... £1: .uh’l.xxhflb, r : h. .14 5. hmfi . ILfllnlwmri..q n . .al {:2 {inns grit! x! : .I‘gs “r 1.... Eat. I“: z... vauaawatn A. .- (351?. .411: r! ix ”)7 LIBRARY {gt-1H 00 Michs-ve-z. . State “ University This is to certify that the dissertation entitled DIAS: DISTRIBUTED INTEREST-BASED ADAPTIVE SEARCH FOR P2P NETWORKS presented by Ning Liu has been accepted towards fulfillment of the requirements for the Master degree in Department of Computer Science and Engineering w./(/\‘ KL.“ Major Professor’s Signature 9/7/2003 Date MSU is an Affirmative Action/Equal Opportunity Institution PLACE IN RETURN Box to remove this checkout from your record. TO AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE DATE DUE 6’01 cJCIRC/DateDuepGS—sz DIAS: DISTRIBUTED INTEREST-BASEDADAPTIVE SEARCH FOR P2P NETWORKS By Ning Liu A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Department of Computer Science and Engineering 2003 ABSTRACT DIAS: DISTRIBUTED INTEREST-BASED ADAPTIVE SEARCH FOR P2P NETWORKS By Ning Liu Content search in decentralized P2P networks is a challenging problem. Gnutella, a widely used P2P system, employs the query flooding approach. Although this approach is simple and robust, its efficiency is fairly low. In this thesis, we propose an efficient P2P information retrieval system DIAS that supports state-of-the-art content search in the semantic layer. DIAS avoids the inefficient random flooding used in Gnutella by using an interest-based search scheme. Peers in a DIAS system sharing the common interest set are call “buddies”. Each DIAS peer creates and updates its buddy list by computing the degree of interest similarity between its neighboring peer and itself periodically. Different interest groups are generated and main- tained according to the buddy-relationship between any two peers. DIAS peers will issue and forward queries mostly within the interest groups associated with them. Our simulation result verifies that DIAS could achieve satisfactory search efficiency without producing too more traffic overhead, while retaining the simplicity and fairness of Gnutella. The implementation of DIAS on top of JXT A shows that it is a feasible sys- tem. ACKNOWLEDGEMENTS I would like to thank my advisor, Dr. Lionel M. Ni, for his continuous support and help throughout the research project. It was definitely my great pleasure to conduct this interesting and challenging research under his guidance. His broad and profound knowl- edge and his effective instruction have given me a great help. I am also very grateful to my co-advisor Dr. Li Xiao, my academic committee mem- bers, Dr. Abdol H. Esfahanian and Dr. Matt W. Mutka, for their valuable advice and in- spiring comments. Many of my colleagues have contributions to my dissertation. I would like to thank Zhenyun Zhuang, Feng Zhu, and Pei Zheng for their insightful comments and sugges- tions. iii TABLE OF CONTENTS List of Tables ................................................................................................................ vii List of Figures ............................................................................................................. viii Chapter 1 Introduction .................................................................................................... 1 1.1 Motivation ............................................................................................................ 1 1.2 Problem Statement ............................................................................................... 3 1.3 Definition of Terms .............................................................................................. 3 1.3.1 P2P Networks ................................................................................................ 3 1.3.2 Interest Base .................................................................................................. 5 1.4 DIAS ........................................................................................ . ............................ 6 1.5 Description of the Remaining Chapters ............................................................... 7 Chapter 2 Related Work ................................................................................................. 8 2.1 Overview of P2P Protocols .................................................................................. 8 2.1.1 Gnutella ......................................................................................................... 8 2.1.2 Napster ......................................................................................................... 11 2.1.3 Freenet ......................................................................................................... 11 2.1.4 Kazaa ........................................................................................................... 12 2.1.5 NeuroGrid .................................................................................................... 13 iv 2.1.6 Interest-Based Locality Using Shortcut ....................................................... 14 2.1.7 PeerSearch ................................................................................................... 14 2.2 Research Issues in Existing P2P Applications ................................................... 15 2.3 Information Retrieval ......................................................................................... 16 2.4 JXT A .................................................................................................................. 16 Chapter 3 DIAS Architecture ....................................................................................... 18 3.1 The Design of DIAS ........................................................................................... 18 3.1.1 Interest Database and Interest Index ............................................................ 19 3.1.2 Buddy List ................................................................................................... 21 3.1.3 Interest Matching ......................................................................................... 24 3.1.4 Search Strategy ............................................................................................ 36 3.2 Implementation of DIAS ......................................................... A ........................... 37 3.3 Simulation Setup ................................................................................................ 39 3.3.1 Parameters ................................................................................................... 39 3.3.2 Zipf Distribution .......................................................................................... 42 3.4 Simulation Results .............................................................................................. 43 3.4.1 DIAS vs. Gnutella ....................................................................................... 43 3.4.2 Different Configurations in DIAS of Polling Fashion ................................ 49 3.4.3 Different Configurations in DIAS of Reporting Fashion ............................ 54 Chapter 4 DIAS Implementation on JXTA .................................................................. 6O 4.1 JXTA Architecture ............................................................................................. 60 4.2 The Applications in DIAS .................................................................................. 62 4.2.1 Overview ............................................ 62 4.2.2 Chat Application .......................................................................................... 64 4.2.3 File Application ........................................................................................... 66 4.3 Implementation ................................................................................................... 68 4.3.1 Recommended System Requirements ......................................................... 68 4.3.2 Building Requirements ................................................................................ 68 4.3.3 Running DIAS ............................................................................................. 68 Chapter 5 Conclusions .................................................................................................. 76 5.1 Summary of the Work ........................................................................................ 76 5.2 Future Work ....................................................................................................... 77 BIBLIOGRAPHY ........................................................................................................ 79 vi LIST OF TABLES Table 1 An example of two—peer interesting matching ..................................................... 35 vii LIST OF FIGURES Figure l: A logical architecture of a P2P network .............................................................. 4 Figure 2: A new servent joining a P2P network .................................................................. 9 Figure 3: A servent discovers hosts in the network .......................................................... 10 Figure 4: A servent requests a file ..................................................................................... 10 Figure 5: An overview of DIAS ........................................................................................ 19 Figure 6: A typical information retrieval model ............................................................... 25 Figure 7: A two-dimension vector space .......................................................................... 28 Figure 8: A three-dimension vector space ........................................................................ 29 Figure 9: A simple structure of the vector space model .................................................... 30 Figure 10: Cosine coefficient ............................................................................................ 33 Figure 11: An example of two-peer interesting matching ...................... H ........................... 35 Figure 12: Properties file ................................................................................................... 41 Figure 13: The 100000 searches in a 1000 node network, using Gnutella routing for a Zipf distribution of documents over nodes. .............................................................. 45 Figure 14: The 100000 searches in a 1000 node network, using DIAS routing in polling fashion for a Zipf distribution of documents over nodes. ......................................... 46 Figure 15: The 100000 searches in a 1000 node network, using DIAS routing in reporting fashion for a Zipf distribution of documents over nodes. ......................................... 47 Figure 16: Gnutella vs. DIAS: Average Path Length ........................................................ 48 Figure 17: Gnutella vs. DIAS: traffic overhead ................................................................ 49 Figure 18: Variation of average path length when BUDDY_UPDATE_LOOP varied in the polling fashion ..................................................................................................... 50 Figure 19: Variation of traffic overhead when BUDDY_UPDATE_LOOP varied in the polling fashion ........................................................................................................... 51 viii Figure 20: Variation of average path length when THRESHOLD varied in the polling fashion ....................................................................................................................... 53 Figure 21: Variation of traffic when THRESHOLD varied in the polling fashion ........... 54 Figure 22: Variation of average path length when NO_BUD_CHANGES varied in the reporting fashion ....................................................................................................... 55 Figure 23: Variation of traffic when NO_BUD_CHANGES varied in the reporting fashion ....................................................................................................................... 56 Figure 24: Variation of average path length when NO_CONN_CHANGES varied in the reporting fashion ....................................................................................................... 58 Figure 25: Variation of traffic overhead when NO_CONN_CHANGES varied in the Reporting Fashion ..................................................................................................... 59 Figure 26: JXTA application overview ............................................................................. 61 Figure 27: JXTA configuration window ........................................................................... 69 Figure 28: Login window .................................................................................................. 70 Figure 29: Secure chat tab ................................................................................................. 71 Figure 30: The chat between two buddies ......................................................................... 72 Figure 31: Local content tab ................................................................... . .......................... 73 Figure 32: Search tab ......................................................................................................... 74 Figure 33: Maintain connection tab .................................................................................. 75 ix CHAPTER 1 INTRODUCTION 1.1 Motivation As the rapid growth of the Internet, P2P networks are gaining significant popularity in a quite short period of time. Centralized indexing, query flooding, and hybrid are the most widely used techniques in the present P2P systems to conduct content searching from any given peer on the overlay. Centralized indexing is employed in Napster [l] sys- tem, where peers automatically upload lists of available files to the central index, and queries are answered using this index. Flooding-based approach is applied in Gnutella [1] system, where queries are broadcasted over an overlay network connecting peers. Hybrid architecture is employed in Kazaa [1] system, where some peers are elected as “super— nodes” in order to index content available at peers in a nearby neighborhood and manage queries. A These approaches cannot provide deterministic guarantees because they do not make use of peer’s knowledge. A knowledge-ware [3] system is a system in which informa- tion/knowledge have been represented uniformly and well organized into the knowledge base, and the resources can be wide-spread shared and coordinated used to provide on- demand and intelligent service satisfying personal needs and expectations. As we know that in searching the largely distributed data, the efficiency of a P2P Internet application mostly depends on the scalability and versatility of its search mechanism [4]. The current approaches are inefficient not only because they will generate a large amount of network traffic for each query, but also because they do not utilize peer’s knowledge to improve searching success rate. In fact, there is an opportunity for many peers to identify a small set of neighbors that have a higher probability to answer the queries from the current peers. Based on this idea, many new systems have been developed to provide fast and reliable search and lower traffic overhead. PeerSearch [5] and NeuroGrid [6] are two of these systems that have high efficiency, scalability, and reliability. PeerSearch is a P2P information retrieval sys- tem that achieves both efficiency and determinism through an elegant combination of in- dex placement and query routing [5]. NeuroGrid is an approach that decentralizes search to ongoing network activity. In a NeuroGrid system, each successive search changes the initiating peer’s knowledge that is about the relationship information between contents and other peers in the network. And this peer will utilize the knowledge when it conducts a search next times [6]. In this thesis, we propose an interest-based [7], semantic-layered peers clustering scheme, DIAS, to improve searching efficiency without producing too more overall traf- fic. The design philosophy is motivated by the existing work of NeuroGrid, which retains the primary characteristics of Gnutella network: simple, robust, and decentralized. We find that in the P2P networks, if there are some particular pieces of content residing in a certain peer, then the queries issued by this peer are very likely to contain the keywords which already exist in these pieces of content. Furthermore, if peer A is interested in a special piece of content which is located in peer B, then it is quite possible that peer B will also have other pieces of content which peer A may be interested in. At the same time, it is very likely that peer B can also find the pieces of content which it is interested in located in peer A. Those peers who share the same interest can gather to compose a group, and the queries issued by the peers in this group will be forwarded mostly within this group. So the searches will succeed in a shorter time. Based on this scheme, we pro- pose our Adaptive Interest-based File Sharing system: DIAS. 1.2 Problem Statement As we have mentioned in the last section, many P2P protocols such as NeuroGrid use knowledge base to improve search efficiency. We know that capturing the characteristics of web users is an important task for the web site designers. By mining web users’ his- torical access patterns, some demographics and behavioral characteristics of web users could be determined. The navigation path of the web users, if available to the server, car- ries valuable information [8]. Also in P2P networks, if we deploy peers’ historical search behaviors, mine their interest characteristics from the information stored in their local and contained in the messages issued or forwarded by them, we can establish some special relationships among the peers. Using these relationships, we can navigate the transfer of messages to optimize the search path, so that improve the search efficiency. 1.3 Definition of Terms 1.3.1 P2P Networks In the traditional client/server architectures, some computers called servers are dedi- cated to serving the other computers, and some computers call clients are dedicated to being served by the server computers or server software. In a P2P network, each peer has equivalent capabilities and responsibilities. A P2P network is a type of network in which 3 two or more peers are connected and share resources, such as files and devices like scan- ners and printers without going through a separate server computer or server software. Figure 1: A logical architecture of a P2P network Figure 1 shows the architecture of a typical P2P network. A P2P system is an overlay built on top of the physical infrastructure of a network. It can be a network of very small scale. For example, it can be a permanent infrastructure that consists of tens of computers and devices in an office or a small company. Or it can be much larger, such as a network in which different protocols and applications set up direct or indirect relationships among computers and devices over the Internet. Also a P2P network can be in an ad hoc envi- ronment — where network computers and devices are linked by wireless or temporary plug-in connections. In this case, the computers and devices are components of the net- work only for a certain period of time, for example, the duration of a communication ses- sion. 1.3.2 Interest Base Among the users who are using the P2P network, it is very likely that many of them may be interested in some particular category of content. For example, some users may be very interested in “car” or “automobile”, some users may have special taste in “movie”, and “Britney Spears” may be the favorite singer star of many young users. It is quite possible that the users who have a special interest keep many pieces of content as- sociated with this interest in their local computer, and they are very likely to search simi- lar stuff in a certain period of time. In some traditional P2P networks such as Gnutella, peers will flood their queries to some randomly detected neighbors no matter whether their interests are similar to or far from each other. So the query may be propagated to a very long distance (up to the TH. of a query), and the search efficiency cannot be satisfactory. The same case also occurs in another famous approach, random walk [9], where a query message is forwarded to a randomly chosen neighbor at each step until the object is found. We intent to design a scheme, through which the users who have the same or similar interest in a P2P network can be clustered into a group and connected to each other in a certain period of time. Then according to our common sense, if a user issues a query which is associated with his special interest and the query is forwarded mostly within the interest group of other users who also have this interest, then the wanted file or item will be found in a much shorter time compared with those traditional P2P networks. This is the basic idea of our DIAS system. 1.4 DIAS An interest group is a set of nodes that show a high probability in querying the same category of content in a P2P system. For example, nodes that frequently send queries of Linux documents could be put into a “Linux Document” group. The protocol we propose intends to provide a mechanism for each node to determine which groups to join, how to join and leave, and its impact on content searching and retrieving. The basic idea of DIAS is for any peer to select a number of current neighboring peers whose “interest index” is close to the current peer. We call these special neighbors of a peer “buddies” of that peer. The peers who share the same interests form interest groups on the overlay. The number of those neighboring peers selected is called “group size”. Peers in this system will calculate their buddy sets periodically, using active polling scheme or passive reporting scheme, which will be detailed later. In addition to determin- ing whether a neighboring peer is a buddy, a peer can also determine whether that neighboring peer’s neighbors are its buddies, based on these peers’ interest information. Then gradually buddies of same interest will be clustered in a group. Using this approach, after a period of warm-up time, a peer will be surrounded by its buddies on the overlay. And its query will be sent to these buddies directly. Also the queries coming into this peer will be forwarded to these buddies. Each node in a P2P system is associated with a number of interest groups. In the case that multiple users may share the same node each user has a profile that defines a set of interest groups. While bootstrapping, a node is able to identify its interest groups based on its query history. Thus an initial i-group list could be created in the node. To partici- pate in an unstructured P2P network, the node has to discover a small number of buddies who have the similar interest such that it can route query messages via these neighbors. A buddy list can be obtained from buddy cache from previous operation history, or gradually can be populated after some warming-up time. I Each i-group entry of a node may have multiple ranked buddies. As a result, query search will have a large probability to success in the interest group. 1.5 Description of the Remaining Chapters We will elaborate our DIAS system in the following chapters: In Chapter 2, we intro- duce some related works including Freenet, Gnutella, PeerSearch, Neurogrid, and JXT A. We present the architecture of our DIAS system in detail and show the simulation result in Chapter 3. In Chapter 4, we introduce the implementation of DIAS on JXT A platform. At last, we draw a conclusion on this thesis and discuss the future work in Chapter 5. CHAPTER 2 RELATED WORK 2.1 Overview of P2P Protocols 2.1.1 Gnutella Gnutella [2] is a fully decentralized, P2P application layer network designed by Null- soft, a subsidiary of American Online. In a Gnutella network, each peer stores some spe- cial files and routes file searches (queries) from and to neighboring peers. If the searched file is stored locally in a certain peer, then this peer will response to this query, and the peer who initializes the query will download the file directly from the peer who has the file. The Gnutella network work by simple blind flooding: a peer who wants to search a file sends a query to its neighbors on the overlay. When the query is forwarded to a peer, and if this peer does not have the searched file, the query will be further forwarded to its neighbors, and so on, until a certain query 'ITL is reached. Each querying message has a global unique ID (GUID), and each peer maintains a list of recently seen GUID. Using this strategy, peers in Gnutella networks can filter out the messages they have already seen recently, so no messages can be received by a peer more than twice and query loop- ing is avoided. In order to download the file that is found at some peer, each peer keeps a table locally that stores the information about where each searching query comes from. Using these tables, searching responses can be routed back to where they come from. At last, a connection will be established between the peer that issues the query and the peer who offers a response for the query, and the original peer can download the file from the destination peer directly through HTTP. Figure 2 [18] shows how Gnutella works in each step. Hoalcatche Gnutella Network N Request/Receive //” a set of Active ,I’ Peers / Figure 2: A new servent joining a P2P network A Gnutella servent connects itself to the network by establishing a connection with another servent currently on the network. Once the address of another servent on the net- work is obtained, a TCP/IP connection to the servent is created. This step is shown in Figure 2. Once a servent has connected successfully to the network, it communicates with other servents by sending and receiving Gnutella protocol descriptors. A servent sends a Ping descriptor to actively discover hosts in the network. A servent receiving a Ping descriptor is expected to respond with one or more Pong descriptors, which includes the address of a connected Gnutella servent and information regarding the amount of data it is making available to the network. This step is shown in Figure 3. A peer sends a Query descriptor to search a particular document in the network. A servent receiving a Query descriptor will respond with a QueryHit if a match is found against its local data set. The QueryHit descriptor provides the recipient with enough in- formation to acquire the data matching the corresponding Query. This step is shown in Figure 4. Gnutella Network N ll Servent p Figure 3: A servent discovers hosts in the network Gnutella Network N Servent p Servent p2 Figure 4: A servent requests a file Gnutella works in a parallel way, in which the query will spread through the whole network as far as possible until the connectivity and TTL (T ime-to—live) exhaust. If a 10 match or a possible loop is discovered in a peer, then the query will stop at this peer. But the search will still continue to propagate to the other peers in the network. Thus a large amount of traffic overhead will be generated even if a query response is available from a peer several hops away from the sender. 2.1.2 Napster Napster [1] is a centralized application-level, client-server protocol over point-to-point TCP for sharing files over Internet. The Napster system works in the following steps: When a peer enters the system, it first connects to the Napster server, and then uploads its list of files to the server. This process is called “push”. The peer will also send the server a set of keywords that can be used to identify its local files. When another peer sends query to the Napster server, the server pings the hosts that apparently have the data, and then select the best of the correct answers. Then this user retrieves the files it requires. 2.1.3 Freenet Freenet [2] is another well-known system in P2P environments. In a Freenet network, each file is associated with a unique key that is created by hash functions and can be used to identify the file. At the same time, each peer has a knowledge base (KB) that stores information about the relationships between keys and peers. When a peer initializes a query, Freenet ranks the different keys it knows in the network, by comparing these keys with the key included in the query message, the more similar keys are ranked higher. Then the query will be forwarded to the peer that is associated with the most similar key. If the requested file cannot be found in this peer, then the query will be forwarded to the peer that is associated with the second most similar key, and so on. As we can see that 11 Freenet works in a greedy serial fashion. When a match occurs, that is, the search suc- ceeds, the query is returned along the query path, and the searched file is passed to the original peer together with the query along the query path. At the same time, the found file is cached or replicated in all the peers along the querying route. Also to guarantee anonymity, any peers along the querying route can declare that they are the source of the found file randomly. The original peer that issues the query will update its knowledge base (KB) so that the key that it searched can be associated with the peers that declared to be the source of the found file. This will be quite helpful in the next time when the peer issue a query including the similar keys. 2.1.4 Kazaa Kazaa [9] uses a hybrid architecture in which some peers are elected as “supemodes”. Content available at regular peers surrounding a supemode are indexed. To search a particular piece of content, regular peers send query requests to their su- pemodes. Supemodes communicate amongst themselves to find where the wanted data is located. When succeed, they return locations of matching objects to the requesting peer. To download an object, a peer initiates one or more connections to other peers that have replicas of the object. The requesting peers may either transfer the entire object in one connection from a single peer or choose to download multiple fragments in parallel from multiple peers. The download fashion in Kazaa is different with that in Gnutella, where only one connection is used. Thus Kazaa carries a more efficient download scheme. 12 2.1.5 NeuroGrid NeuroGrid [6] is an adaptive decentralized network system. Peers in NeuroGrid sys- tem support distributed searches through semantic routing, that is, queries are forwarded based on content. A learning mechanism is applied in NeuroGrid system. It adjusts meta- data describing the peers’ content and the documents that consist of those contents. Like Freenet, each NeuroGrid peer also maintains a knowledge base (KB) locally. Any time when a search succeeds, the original peer that issues the query will update its knowledge base to store the information about the relationship between the document keywords and remote peer in which the match occurs. At the same time, a direct connection is estab- lished between the original peer and the destination peer. Each NeuroGrid peer forwards queries only to a subset of peers in which it believes the wanted file may be found, in- stead of forwarding queries to all of its neighbors. And this subset is determined by re- sorting the knowledge base. [2]. NeuroGrid system improves the searching efficiency and reduces the traffic overhead. Compared with Gnutella, NeuroGrid supports a really efficient search procedure because it is able to consistently determine peer identities. Compared with Freenet, NeuroGrid produces much fewer local cache overheads that are required to achieve some of its at- tractive qualities. But there is an important side effect in NeuroGrid system: Whenever a match occurs in a remote peer, a new connection will be established between the original peer and the remote peer, and a new record storing the relationship between the querying keywords and the destination peer will be added to the knowledge base of the original peer. So after a long time, too many new connections will be added to the network, and too many new records will be stored in each peer’s knowledge base, but no old connec- l3 tions or old records will be deleted. This will ultimately be very expensive, with connec- tions and knowledge bases growing indefinitely. 2.1.6 Interest-Based Locality Using Shortcut This content location solution using interest-based locality principle [7] is proposed by a research lab in Carnegie Mellon University. In this solution, peers loosely organize themselves into an interest-based structure on top of the existing Gnutella network. The interest-based locality principle posits that if a peer has a particular piece of content that one is interested in, it is very likely that it will have other items that one is interested in as well. Shortcut discovery is used in this solution: When a peer joins the system, it first at- tempts to locate the content through flooding. The lookup returns a set of peers that store the content. These peers are potential candidates to be added to a “shortcut list”. Subse- quent queries for content go through the shortcut list and select the peers with the highest rank. If a peer cannot find content through the list, it issues a lookup through Gnutella, and repeats the process for adding new shortcuts. 2.1.7 PeerSearch PeerSearch [5] is a P2P information retrieval system proposed by HP Lab. The basic idea of this system is to combine index placement and query routing. It only needs to search a small number of servants to identify matching documents. Leveraging some in- formation retrieva] algorithms, PeerSearch represents documents and queries as vectors and measure the similarity between a query and a document as the cosine of the angle between their vector representations. PeerSearch stores a document index using its vector 14 representation as the coordinates, resulting in that indices stored close to each other are also close in semantics. 2.2 Research Issues in Existing P2P Applications By comparing these protocols, we can say that: Napster uses the centralized client- server model, which is not as flexible as the other protocols, and the additional server is a big consuming. Kazaa employs hybrid architecture where a few supemodes act as the servers. Gnutella is running a fairly inefficient search procedure in a fully decentralized network. Freenet, NeuroGrid, the shortcut solution, and PeerSearch appear to support a more efficient search process than Gnutella. But Freenet achieves this with much more local cache requirements, because it works in a greedy serial fashion. In NeuroGrid, new connections continue to be added among the peers in the network. New records continue to be inserted into the knowledge database of each peer. So the connection overhead and knowledge base overhead are really critical. The shortcut solution generates and main- tains shortcut list for each peer according to the results of the searches issued by the peer, Peers in the list are believed to share the same interests with this peer. PeerSearch com- putes the similarity between document vectors and query vectors to optimize search proc- ess. By studying the P2P protocols described above, we get the purpose of this thesis: To find a scheme that can improve the search efficiency in P2P networks while producing additional overhead as few as possible. The download efficiency also needs to be im- proved, but we won’t cover this issue in this thesis. Our DIAS protocol can improve search efficiency without needing to concern the connection overhead and local informa- 15 tion overhead while producing no more traffic overhead. We achieve the enhancement by deploying the interest similarity between peers and generating interest groups. 2.3 Information Retrieval The interest base algorithm of our DIAS system is built from the information retrieval conception. The problem of information storage and retrieval can be stated as: [10] we have vast amounts of information to which accurate and speedy access is becoming ever more difficult. The difficulty is not only how to extract the information but also how to use it to decide relevance. The purpose of an automatic retrieval strategy is to retrieve all the relevant documents at the same time retrieving as few of the non-relevant as possible. When the characteriza- tion of a document is worked out, it should be such that when the document it represents is relevant to a query, it will enable the document to be retrieved in response to that query.When the indexing is done automatically it is assumed that by pushing the text of a document or query through the same automatic analysis, the output will be a representa- tion of the content, and if the document is relevant to the query, a computational proce- dure will show this. We will elaborate the conception of information retrieval in Chapter 3. 2.4 J XTA As the collection of all software and devices connected to the Internet, the web is evolving from a Web of Computers to a Web of Things [11] as new devices are connect- ing to the Internet. Wireless devices, a new generation of web devices, such as pagers, l6 PDAs, cell phones, and other consumer devices are among these new devices that are en- tering the web world. Analysts predict that there will be a grand demand for wireless ap- plications in the near future. P2P computing is leading itself to the more dynamic envi- ronments of wireless devices rapidly [12]. Because the P2P applications have the proper- ties that they are capable of creating, joining, interacting with peer groups, and posting advertisements to provide and request resources, they can dynamically find the stuff that they need. This characteristic makes P2P applications suite to wireless environments quite well. JXT A [11] is deployed based on the imagination of a world in which peers can take advantages of being connected with each other, despite what software and hardware plat- forms they are using. It is an open-source project that specifies a standard set of network protocol upon the existing physical network infrastructure for ad hoc, pervasive, P2P computing as a foundation of the upcoming Web of Things. A wide variety of P2P net- work applications can be built above this thin but powerful layer. The JXTA project aims at being independent of programming languages, system platforms, service definitions, and network protocols. It provides a uniform, addressable network based on assigning each peer in the network a unique peer ID, with all of the complexity of the underlying physical network topologies being hidden. So peers in the JXT A network are allowed to discover and communicate with each other, self-organize into groups, advertise and look for network resources, without being required to understand and manage the physical network topologies. Despite how complicated and dynamically the topologies are chang- ing, JXT A peers do not need to care about. 17 CHAPTER 3 DIAS ARCHITECTURE 3.1 The Design of DIAS The main idea of our DIAS system is to cluster peers on the overlay that share com- mon interest set on the semantic layer based on the analysis of their local content, search- ing entities, and forwarding entities. Peers in such an interest group (intemet zone) are called group members. The members in a group are connected together according to some scheme after a period of warm-up. And the quires issued by the group members in a group are to be forwarded mostly within this group. To determine whether two peers share the similar interest set, some vector space [14] based similarity coefficients have been investigated. Our preliminary simulation results show that such a semantic layer improvement to content searching for P2P networks will significantly improve searching efficiency with moderate overhead. Figure 5 shows an overview or our DIAS system. 18 a, 0 . as ‘5 9 0 P1 P2 12 G I" 9 e 1‘ 9 V Figure 5: An overview of DIAS 3.1.1 Interest Database and Interest Index Our DIAS is based on the analysis of the interest set of each peer in the P2P networks. Here we analyze the interest set according to the keywords that are contained in the files or documents existing in the P2P networks and the querying messages going around. In DIAS system, each peer maintains a local interest database that consists of three parts: LCK (Local Content Keyword), LQK (Local Query Keyword), and FQK (Forwarding Query Keyword). By comparing the keywords in these sets of each peer, our DIAS sys- tem can determine whether any two peers in a network share the common interest set, and according to the result of the comparison, each interest group can be formed. Now let’s take a look at these three keyword sets: 19 LCK (Local Content Keyword): Keyword set of the local content of each peer. If a user has some special interest, then it is very likely to have some files or docu- ments related to its interest stored in its local computer or device, that is its local content. So the keywords contained in the local content can indicate the interest of the user, or, say, the peer. In our simulation that will be described later, the local content of each peer will be assigned when the simulator starts, and LCK set will be created at the same time. In addition, when the peer receives a query reply from other peers, or retrieve some content from other peers, it will update its LCK set by putting the keywords of the content in the response that are not included in its LCK set into its LCK set. The maximum size of the LCK set is a parameter that may affect the performance of our protocol. When the size of the LCK set of a certain peer exceeds the specified maximum value, this peer will delete the old- est keywords in its LCK set and maintain the size less than or equal to the speci- fied maximum value. LQK (Local Query Keyword): Keyword set of last M; number of queries. If a user has some particular interest, then it is quite possible that this user will issue the queries that are related to its interest frequently. So the keywords contained in the querying messages issued by a peer can also indicate the interest of this peer. In our simulation, anytime when a peer issues a query, it will update its LQK set. If some of the keywords contained in the querying message are not included in the LQK set of this peer, it will put those keywords into the its LQK set. The maxi- mum size of the LQK set is also a parameter that may affect the performance of 20 our protocol. Each peer will maintain the size of its LQK set according to the same scheme used while maintaining its LCK set. ' FQK (Forwarding Query Keyword): Keyword set of last M2 number of forwarded queries. According to our analysis to the behaviors of peers in P2P networks we believe that some times, the keywords contained in the queries that a certain peer often forwards may also indicate the interest of this peer. So in our simulation, we considerate this factor in a few cases. Anytime when a peer forwards a query, it will update its FQK set according to the same scheme described above. Peers in our DIAS network system use these three keyword sets to estimate the inter- est of each, look for their buddies, and generate interest groups. In the next section, we will talk about how DIAS peers generate and maintain their buddy lists using these three keyword sets. 3.1.2 Buddy List A peer’s buddy list contains neighboring peers in the interest group to which the peer should send its query. A peer decides whether one of its neighboring peers is its buddy by comparing the LCK, LQK, FQK keyword sets of that peer and itself, so the peers in its buddy list shares the common interest set with it. The average search path length in our DIAS network system can be reduced considerably with each peer sending and forward- ing queries directly to its buddies. The buddy list of each DIAS peer has to be updated from time to time in order to re- flect the most recent interest distribution in the network. We provide two update schemes: polling and reporting. The simulation can run based on either of these schemes. 21 . Polling A peer will update its buddy list every BUDDY_UPDATE_LOOP times of its issuing queries. Here BUDDY_UPDATE_LOOP is a parameter in our properties file that will affect the performance of the system. For example, if BUDDY_UPDATE_LOOP is set to 10, then a peer will update its buddy list when it issues 10 queries, 20 queries, 30 queries, etc. When a peer is about to update its buddy list in the interest group, it will send interest- PING message with specific indication of LCK, LQK, or FQK to its neighboring peers. Its neighbors will reply to the interest-PING with their local interest databases. The in- formation of the interest database of other peers will be saved (cached) in the underlying peer for later use. To go one step further (more step could be possible, which is determined by TTL of interest-PING), when a peer is about to update its buddy list, it can also request interest information of the peers who are the neighbors of its direct neighbors. For such request, a peer can either reply with its local cached interest information, or initiate another interest- PING with TTL set to 1 to its own neighbors, which in turn updates its local cached in- terest information. To avoid flooding when establishing buddy relationship for a peer, the interest-PING should have a TTL of less than 3. Again this parameter will likely affect the performance of the system. We cannot directly exchange buddy list between two peers due to the fact that they might be in different interest groups. Instead, upon receiving the interest information of its neighbors and its neighbors’ neighbors, a peer can decide which one among these peers can be added to its buddy list following some “Interest Matching” algorithms that 22 will be described later. When a peer identifies a buddy after calculation, and this buddy is not connected to this peer right now (the buddy must be the neighbor’s neighbor of this peer), then a new connection will be established between the buddy and the polling peer. As a result, the connections of each peer will become more and more gradually, and this is an overhead to our system. The maximum number of the connections of each DIAS peer is a parameter that may affect the performance of our protocol. When the number of a certain peer’s connections exceeds the specified maximum value, this peer will first de- lete the connections of its neighbors that are not its buddies, and then, if its connections are still more than the specified maximum value, it will delete the buddies with the lowest rank in its buddy list and maintain its connections less than or equal to the specified maximum value. We will describe how the buddies are ranked for a certain peer in Sec- tion 1.3. I Reporting In the reporting pattern, DIAS peers will not inquire the interest database information from their neighbors positively as in the polling pattern. Instead, they wait passively. Any time after a peer updates its buddy list, it will check the difference between its old buddy list and its new buddy list, that is, the number of the peers that are in the old buddy list but not in the new buddy list plus the number of the peers that are in the new buddy list but not in the old buddy list. If the total number exceeds a specified maximum number NO_CONN_CHANGES then this peer will report to its neighbors. Here NO_CONN_CHANGES is a parameter in our property file that may affect the perform- ance of our protocol. 23 A peer reports its buddy list change in such a way: It sends a message to all of its neighbors telling them that its buddy list has been changed enough for them to consider- ate, and the neighbors may forward this message to their neighbors, according to the TTL of interest-PING. Again, as in the polling fashion, to avoid flooding, the interest-PING should have a TTL value less than 3. And this parameter will likely affect the perform- ance of the system. Each DIAS peer maintains a variable “o_no_neibor_changes” that records how many messages described above have been sent to this peer. If the number of the received mes- sages exceeds a specified maximum number NO_BUD_CHANGES then this peer will update its buddy list. Here NO_BUD_CHANGES is a parameter in our property file that may affect the performance of our system. After the peer updates its buddy list, it will report and set the variable “o_no_neibor__changes” to 0 to start a new turn of recording. 3.1.3‘Interest Matching - Information Retrieval Definition In order to elaborate the scheme that we used in our DIAS network system for the in- terest similarity measure, let’s first take a look at the concept of information retrieval based on which we developed our interest matching strategy. What is information retrieval (IR)? It is a wide loosely-defined term. Lancaster [15] gave a perfect and straightforward definition: “Information retrieval is the term conven- tionally, though somewhat inaccurately, applied to the type of activity discussed in this volume. An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non— 24 existence) and whereabouts of documents relating to his request.” Some basic concepts in the information retrieval term are as follows: I Document is represented by a set of keywords. The keywords can also be call “index terms”. I Keywords may be assigned binary weights or gain their weighs from the calcula- tion on statistics of their frequency in documents. I “Retrieval” is a “matching” process between the keywords in documents and the keywords in queries. The purpose of a retrieval strategy is to retrieve the relevant documents as many as possible and at the same time, retrieve the non-relevant as few as possible. After the characterizations of a document is mined out, when a document represented by these characterizations is related to a query, this document should be able to be retrieved in re- ply to that query. Below is a figure describing a typical information retrieval system [10]: Feedback Processor Input Output ~— Documents Figure 6: A typical information retrieval model 25 Model Now let’s embody the theory of information retrieval using a model. A typical IR model is shown in Figure 6. The purpose of an information retrieval model is to formalize the way of thinking about the information retrieval. In the model, we define a set of ob- jects and assertions can be made about these objects. Also in the model, we restrict the ways in which classes of objects can interact with each other. The purpose of an informa- tion retrieval model is to specify the representations that are used for documents and in- formation needs, and how the comparison between the keywords of them is implemented. An information retrieval model can be presented as a [D 9 Q, FaR(qi 9d} )] where I D is a set of representations for the documents in the collection I Q is a set of representations for queries, that is, the user information needs F is a framework for modeling document representations, user information needs, and the relationships between them I R (q i r d j) is a ranking function. It associates a real number with a query rep- resentation q i (q iEQ ) and a document representation dj (dng) (Baeza- Yates & Ribeiro-Neto, 1999 [16])There are several information retrieval models that have been deployed: Boolean Model, Vector Space Model, Probabilistic Model, Lan- guage Model, etc. The interest matching scheme used in our DIAS system is based on the vector space model, which is an algebraic model. In the vector space model, documents 26 ments in the collection ( D ) and user information needs ( Q ) are both documents in L- dimensional space. Vector space model Vector space model is based on the idea of L—dimensional document space. User in- formation needs in this model are also located in document space. Documents and queries are both points in L-dimensional space, where L is the number of unique index terms in the data collection, and documents and queries are vectors of these index terms. Actual vectors may have many terms, and can use binary keyword weights or assume 0-1 weights. Below is a simple example: 9’ 6‘ 9’ “ Terms: “duck”, “horse”, “dog , squirrel”, “pig , cat” Binary: (l, O, 0, 0, 0,1),(0,1, 0,1, 0, 0) Weighted: (0.01, 0.03, 0.01, 0.0, 0.05, 0.0) A two-dimension vector space and a three-dimension vector space are shown in Figure 7 and Figure 8. In the vector space model, not only documents can be weighted, but queries can also be weighted. If we assume the index terms of documents have different values for re- trieval, then we can assign weights to each term in each document. The vector space model calculates degree of similarity between documents and queries in the collection, and ranks output by sorting the values of similarity. Documents are ranked according to how close they are to the query. There are many possible matching functions in the vector space model and we will talk about them later. Figure 9 illustrates a simple structure of the vector space model. 27 A two-dimension vector space 5 I I I I I I I I I 45* 4~ . 02=2t1+4tZ 3.5 ~ 02=411+32 I 2.5 1.5 I ‘Dr=or1+1r2 I 0.5 >- Figure 7: A two-dimension vector space 28 Imension vector space Athree-d \ \ \ x x x x a x x \ a \ x a a a \ x x \ \ \ x x \ writtrllirsrtlitrrrrtrlrIL 4 \ x \ x x x. x c 3\ x x x a a 3‘ \ x x \ +1 x x \ a----..--im----..----..----\ \ a \ a \ a I. I \ \/ e \ ‘ ‘1- ‘ \ c x \ \M a a x , a c x x = \ \ \ . p 1 , rIlrrrIIIILiQIILIIIILIIIIK s \ x x x x > C s c c \ x x, y r k x x a a x , a X m \ x. c \ s x , c \ x s ‘ x x .\ a \ c \x \N u m N, ,\ v + « a a \ a 2 x. I n x c k c s a\ . x, \ c a c \ +\ , x , om printsrruukurrrkulirkriihux .\ , . s , . . 4, ,-\ a m , , , , s , : ,\ e . , , , . , c , , s , , . N x x. m s 01 s, x a, x ,. xc , c, d x o 1till+tlli4lrrlarirIJIII|J .\ u . , , , , , .A‘ w , , , , , . . . 3 . . , , \. o m , , a , , . s s a. t . , + . . , ,\ , , 2 . 3 C .\ A 1|ttt4lfilt4tlrtartl I3IIJ ,\ , ,H , , ,m , x on , , a , , , \ , ,fl , , , , a. 8 , ,1 . , ,1. , \ , e . D . , + , r. T ,\ a g a a 1‘ ’ ,rrriiAIrrrAIorrJrrlrflrrlrl, , . , s 2 , y 0'. , . , , D, , F , , . , s , c , , . , . ,\ . , , , , ,\ rnuulrlinup-illpnunucururk ’ I ’ I a ’ ’ z y p a ’ , , , , , , g a ’ I a g , . , , , , , . , . s , r r r r r \ fl \ \ \ > 5 4 3 2 4| 05 z 29 Figure 9: A simple structure of the vector space model I Distributional Similarity Measures In the vector space, the closer two vectors are, the more similar they are. Documents and queries clustering together are very likely to be highly similar. We need a method that allows ranking, using threshold and. reforrnulating query, to calculate the similarity between documents and queries or documents and documents. There are several existing distributional similarity measures in the vector space model: Cosine coefficient [17], Jac- card coefficient [17], Dice coefficient [17], Overlap coefficient, Ll distance (City block distance, Manhattan distance), Euclidean distance (L2 distance), Hellinger distance, In- formation Radius (Jensen-Shannon divergence), Skew divergence Cosine coefficient, Jaccard coefficient and Dice coefficient are the most popular measures. We will only 30 elaborate these three measures that are based on contribution to similarity of co-occurring terms. Some conventions are assumed first: I N : the number of documents in collection I L : the number of unique terms in collection I D i ;the i th document in the collection I Ti : the i th term in the collection " Q : a query I Wi,Dj : the weight of term i in Document j ' Wi,Q : the weight of term i in Query Most similarity functions are based on the inner product. The similarity between a document and a query is the sum of the products of the term weights in the document and the query: SIM(D,.,Q) = Z(W.-,D, *Wrg) And the sinrilarity between two documents is the sum of the products of the term weights in these two documents: SIM(D,.,DJ.) = 2M“), *ij) Dice coefficient 31 Using dice coefficient measure, the degree of similarity between a document and query is calculated using the equation below: 22 IZWDfEZsz The sum is taken over all terms for index 1 to the dimensionality of the vector space L SIM(DJ.,Q) = in the collection. For binary documents, the equation can be simplified to: 2A SIM D, =—— ( Q) B+C' Jaccard coefficient Using dice coefficient measure, the degree of similarity between a document and query is calculated using the following equation: 20%,», *W.,Q) , Zijz TZWmZ — 2(ij *Wi,Q) The sum is also taken over all index terms for 1 to L in the collection. For binary SIM(DJ.,Q)= documents, the equation can be simplified to: A B+C—A° SIM (D, Q) = Cosine coefficient Cosine coefficient shown in Figure 10 is the most commonly used measure. This measure is based on the calculation of cosine of the angle between a document vector and a query VCCIOI‘ 01' IWO document VCCIOI‘S. 32 Figure 10: Cosine coefficient Using this measure, the degree of similarity between a document and query is calcu- lated using the equation below: 2 =|< ”WWW Again the sum is taken over all index terms for 1 to L in the collection. I Interest Matching in DIAS In our DIAS system, each peer has three keyword set, LCK, LQK, and FQK. In order to generate the buddy list for each peer, we need to calculate the degree of similarity be- tween the keyword sets of two peers. Our computation is based on the cosine coefficient measure in the vector space, and some appropriate modifications are made according to the real case in our system. The Unions of these three sets are used in the computation. We know that in the standard cosine coefficient measure, the convention “ILDJ. denotes the weight of term i in Document j. In our DIAS system, we use the function 1(Snj a k i ) to denote the existence of keyword i in the keyword set j of peer 11. Here i 33 takes the value over all the keyword index in the keyword collection of our system, j takes the value from 1, 2, 3, because we only set up three keyword sets in our system, and n is the index of DIAS peer. If k: exists in the joined keyword set j of peer n, then the value of [(Snjaki) is set to 1; otherwise 0. Let A and P2 be two peers in the DIAS system. Because the contribution to the total degree of interest similarity between two peers by each keyword set is different, we assign different weight to different set. The weight for each keyword set is denoted as W j ,where j = l, 2, 3 for each set respec- tively. The degree of interest similarity between two peers can be computed using the following equation: { Z(I(Slj’ki) *I(S2j’ki)) SIM(p,p)= W-* 4—————-——- l 2 142.3 J 21(Slj,ki)*ZI(S21-,ki)) (V i i An example of Interest Matching computation is described in Table 1 and Figure 11. Here the keyword collection is {10, 11, 12, 13, , 22}, W1=0.6, W2 =0.3, W3 =0.1. Using the give equation, the degree of interest similarity between pl and p2 is calcu- lated as: SIM(p1, P2) = ——+ 0.1* =O.654 2 6*——-—+O J4*23*J21*2 J32* 34 Table 1 An example of two-peer interesting matching Keyword Set LCK LQK FQK pl 10,11,12,15 15, 17 20,21,22 p2 10,11 15,20 21,22 SIM ( pl , p2) 0.654 LCK LQK FQK Figure 11: An example of two-peer interesting matching After the calculation of the interest similarity between two peers, how can we deter- mine whether these two peers are buddies of each other? We use a threshold to accom- plish this. If the result of similarity calculation between two peers is equal to or more than the threshold, that means, there is enough number of shared keywords between them, then these two peers are considered as buddies, otherwise, not. This threshold is a pa- rameter in our properties file and may affect the performance of our protocol. Buddies in the buddy list of a certain peer are ranked according to the result of similar- ity calculation between the buddies and this peer. The buddy with a high value of result is ranked high, and the buddy with a low value of result is ranked lower, because the former 35 one has the more similar interest with the peer than the later one. When a peer has to de- lete some of its buddies because the number of its connections exceeds the limitation, it will first delete the buddies with the lowest rank that has the least similar interest with it. 3.1.4 Search Strategy We have elaborated the core of our DIAS system: How the calculation of interest similarity between peers is implemented and how the buddy list of each peer is generated. We know that the main purpose of this thesis is to improve the search efficiency, and what we have done above are all for this purpose. Now let’s take a look at how searches are implemented to improve the efficiency in the DIAS system. Queries in DIAS network use similar techniques to Gnutella to prevent loops and in- definite propagation: We use TI‘L (Time To Live) counters to limit the range of queries, whenever the path length of a query exceeds TI'L, this search will terminate despite whether it succeeds. We use GUID (Globally Unique Identifier) to'prevent a node for- warding a query more than twice. Whenever a querying message arrive at a certain peer, this peer will first check whether it has seen the GUID of this querying message. If so, it will discard it; otherwise, it will process it. In our simulation, each search is started at a randomly selected peer. When a peer is selected to initiate a query, a document will be selected randomly from the local content of this peer and a keyword will be selected randomly from the selected document. Then all the documents contained the selected keyword in the document collection of our DIAS system will be picked out except those in the local content of the selected peer. And the target document that the selected peer wants to search is selected randomly from the picked document set. Finally, the keyword set contained in the target document and 36 the target document itself are put into the corresponding fields of the querying message and the message is added into the inbox of the selected peer that will initiate a query. After the querying message is generated, the selected peer will first update its keyword set LQK by adding the keywords in the query into LQK if they are not already in LQK. Then it will increase the number of searches it has initiated in the present turn. After these two steps, it will decide whether it’s the time for it to update its buddy list based on the polling scheme or reporting scheme, and then update its buddy list using the method we described above if appropriate. Upon finished all of these steps, the peer will send its query to all of its neighboring peers, including its buddies. Whenever a peer receives a querying message, it will first check whether the target document contained in the message reside in its local content. If so, the search succeeds. If not, the peer will update its keyword set FQK by adding the keywords contained in the message into FQK if they are not already in FQK. Then it forwards the message to all of its neighbors. 3.2 Implementation of DIAS Our DIAS system is developed from the existing work of NeuroGrid. The system is built on the platform of Unix, using Java programming language. The functions are pack- aged into three major classes: Node, Network, DishonestNode and DishonestNetwork. Major members in the class Nodes are: I protected String o_node_ID: The unique ID of each node I protected Hashtable o_conn_list: List of connections to other nodes I public MultiHashtable o_contents: Stores docs indexed via keyword 37 protected Vector lck, lqk, qu: The three keyword sets protected Vector o_buddy_set: Buddy list public int o_no_search: The number of conducted searches in a certain interval public int o_no_neibor_changes: The number of neighbors who have updated their buddy list recently which is used in the reporting fashion. Major functionality methods in the class Node are: public void UpdateKeywordSet(Vector p_vector, Keyword[] p_keywords): Up- date each keyword set according the parameter. public void add_buddy(Node p_node, int p_rate): Add a buddy to the buddy list public void IncreNoSearchO: Increase the value of the member o_no_search by 1. public void IncreNoChanges(): Increase the value of the member o_no_neibor_changes by 1 Major functionality methods in the class DishonestNode are: public void UpdateBuddySet( int p_update_loop, double threshold): Update the buddy list public void report(): A node report to its neighbors that it has updated its buddy list in the reporting fashion. public boolean processMessage( Message p_message, boolean p_start) throws Exception: Process a search message when it comes. Major members in the class Network are: public static double o_threshold: The threshold used when generating the buddy list for each node. 38 I public static double o_lck_coefi; o_lqk_coefi§ ojqk_coefi: The coefficient used when generating the buddy list. Major functionality methods in the class DishonestNetwork are: I public String[] searchNetwork( int p_start_TTL)I throws Exception: Initiate a search I public void generateContent(Hashtable p_nodes, int p_no_documents_per_node) throws Exception: Assign documents to nodes using Zipf distribution 3.3 Simulation Setup 3.3.1 Parameters In this thesis, preliminary simulations of the DIAS and Gnutella systems were per- formed. We set the parameters of these two systems in our properties file. Both of these two networks contained 1000 nodes that is set by the parameter NO_NODES in the prop- erties file, a pool of 2000 documents that is set by the parameter NO_DOCUMENTS and a pool of 1000 keywords that is set by the parameter NO_KEYWORDS. Here the nodes we refer to are equal to the peers we refer to above. Each document in the document pool is randomly assigned 1 keyword from the keyword pool. The reason why we set the pa- rameter “NO_KEYWORDS_PER_DOCUMENT” to the value of 1 is because that by doing so, the distribution of the documents among the nodes represented the distribution of the keywords among the nodes. So we implemented the Zipf distribution of the key- words by distributing the documents in Zipf fashion. In the next section, we will elabo- rate how the documents were distributed among the nodes in Zipf fashion. Each node was connected to 30 other nodes that were selected at random. This is set by the parameter 39 NO_CONNECTIONS_PER_NODE. In both simulations, each node could receive infor- mation about the contents of their neighbor nodes. The initial 'ITL was set to 9 by the pa- rameter START_'ITL. Both simulations were run for 100,000 searches set by the pa- rameter INTERNAL_LOOP, with each search being started at a node selected at random. The target of each search was for a document that shared the only keyword with docu- ments in the starting node, which has been described in Section 1.4. The simulations were run in both polling fashion and reporting fashion. The number of searches by which a node will update its buddy set in the polling fashion is set by the parameter BUDDY_UPDATE_LOOP. The number of connection changes by which a node will report its change to its neighbors in the reporting fashion is set by the parameter NO_CONN_CHANGES, and the number of neighbor changes by which a node will up- date its buddy set in the reporting fashion is set by NO_BUD_CHANGES. The properties file is shown in Figure 12. 4O SIMULATION_TYPE=neuro grid MAIN_LOG_FILE=ng_growing_simulation.log SUMMARY_LOG_FILE=ng_growing_sum_simulation.log ARCHITECTURE_LOG_FILE=ng_growing_ar’ch_simulation.l0g PROBE_LOG_FILE=ng_gmwing_probe_simulation.log BUDDY_LOG_FILE=ng_growing_buddy_simulation.log SEARCH_LOG_FILE=ng_growing_search_simulation.log APPLET=false NO_KEYWORDS=1000 NO_DOCUMENTS=2000 NO_KEYWORDS_PER_DOCUMENT= l NO_NODES=1000 NO_HONEST_NODES=990 NO_DOCUMENTS_PER_NODE=3 MAX_KNOWLEDGE_PER_NODE=1 N O_CONNECTIONS_PER_NODE=30 MAX_CONNECTIONS_PER_NODE=3O START_'ITL=9 DEGREE_OF_CORRELATION=1 FORWARDING_MODEL=0 INTERN AL_LOOP= 1 00000 PROBE_LOOP=800000 NO_PROBES=0 GROWTH_LOOP=100001 STATS_LOOP=1000 KNOWLEDGEzfalse LEARN ING=false RING_TOPOLOGY=false DOC_KEYWORD_ZIPF_DISTRIBUTION=false N ODE_DOC_ZIPF_DISTRIBUTION=true RANDOM_SEARCHES=false RANDOM_FORWARDING=false THRESHOLD=0.4 LCKCOEFF=O.6 LQKCOEFF=0.4 FQKCOEFF=0.0 BUDDY_UPDATE_LOOP=30 NO_CONN_CHANGES=5 NO_BUD_CHANGES=5 NO_GROUPS=50 Figure 12: Properties file 4 l 3.3.2 Zipf Distribution The term Zipf is used to describe the phenomena in which large events are of low fre- quency, but small ones are of quite high frequency. For example, large earthquakes sel- dom occurred in the world but small ones often occurred. Only a few words, such as ‘is’, ‘are’ and ‘the’ are used very fiequently, but many other ones such as ‘phenomena’, ‘earthquake’ and ‘astronautics’ are rarely used. The definition of Zipf’s law can be give as “The probability of occurrence of words or other items starts high and tapers off. Thus, a few occur very often while many others oc- cur rarely”. Zipt‘s law usually refers to the 'size' y of an occurrence of an event relative to its rank r [18]. We can use the following equation to denote the frequency of the occur- rence of the nth ranked item: In our DIAS system, the distribution of documents among nodes follows Zipf fashion. We assigned documents to nodes in two steps: First, the nodes were divided into several groups, because in the real world, it is very possible that users with different interest clus- ter into different groups. The number of groups was set by the parameter NO_GROUPS that may affect the performance of our system. Second, within each group, documents were assigned to nodes following the Zipf distribution. A few documents occurred fre- quently, others seldom. The documents highly ranked were selected randomly, so in dif- ferent groups, different sets of documents occurred frequently. 42 Because the distribution of documents represents the distribution of keywords among nodes, which has been mentioned in the last section, keywords are also distributed in the fashion described above. 3.4 Simulation Results 3.4.1 DIAS vs. Gnutella In this simulation, we set the value of NO_GROUPS to 50 for both networks, the value of THRESHOLD to 0.4 and the value of weight LCKCOEFF, LQKCOEFF, FQKCOEFF to 0.6, 0.4 and 0.0 for our DIAS network. The value of BUDDY_UPDATE_LOOP is set to 30 for the polling fashion. The value of NO_CONN_CHANGES is set to 5 and the value of NO_BUD_CHANGES is set to 5 for the reporting fashion. The results of simulations are shown in Figure 13-19. From the data in Figure 16 we can see that the average path length was reduced dra- matically in our DIAS system compared with the traditional Gnutella system from around 2.2 in Gnutella to around 1.6 in the polling fashion and around 1.3 in the reporting fash- ion of DIAS. The decreasing rate is 27.3% and 40.9%. At the same time, Figure 17 shows that the additional overhead resulting from the generation and maintenance of the buddy lists was quite small. The number of message transfers was increased from 3"‘104 in the Gnutella network to only 3.02"‘104 in the polling fashion of our DIAS system with in- creasing rate 0.67%. And in the reporting fashion of our system, the average traffic is al- most not increased, except in the warm-up time at the very beginning. We can also see that the performance in reporting fashion is better than that in polling fashion from the data shown in Figure 16: Gnutella vs. DIAS: Average Path Length. Re- 43 porting fashion improved the searching efficiency more grandly than polling fashion. It’s because in the reporting fashion, only the node with the number of neighbors that had up- dated their buddy lists big enough would update its buddy list, and once the number met the requirement, the node would update its buddy list, despite how many searches it had conducted. And we know that the buddies of a certain node are among its neighbors, two nodes that have similar interest can be connected by the introduction of their intermediate buddies. So the updating of buddy list was quite efficient in the reporting fashion, thus nodes sharing the common interest set could be clustered together in a higher level. This caused the high searching efficiency. But in the polling fashion, each node updated its buddy list in a stable interval, despite whether necessary, so the updating efficiency is lower. Also the reporting fashion produced less traffic than the polling fashion. That’s because in the reporting fashion, interest groups were generated more quickly in the warm-up time, and after that, nodes could not update their buddy lists easily for the sake of the limitation to the number of their neighbors that had updated their buddy lists. So only in the warm-up time, the additional traffic was a little much. But in the polling fash- ion, nodes continued to update their buddy list with rhythm, so the additional traffic ex- isted all the time. From these simulation results we can conclude that our DIAS system can improve the search efficiency considerably with the acceptable additional overhead in P2P networks in contrast to Gnutella system. Ave Path Length 3.5 r i ' - -- 'i " " MIR - --~= use r 2.5 W i ii ii i] i i Tit iiTl r 11‘L 1.5r ‘ _ 00 2 4 6 8 10 No of Searches x104 x104 Message Transfers 4 f r 4 2.5: a 1.5" ‘ 0.5- . 00 2 4 6 8 10 No of Searches x10 Figure 13: The 100000 searches in a 1000 node network, using Gnutella routing for a Zipf distribution of documents over nodes. 45 Ave Path Length 4 3.5L ~ 3" . 1] H M I H :P "Irm- 2-5 T”: I . H -‘In '2'" i 1 r W ‘ WT” ~ ilii‘rfi‘wfll 1.5L “ii-iii Wt ""Wtr' v "silt 1 h i J IL-I-MAHLULHluaummflmilw 0.5~ . 00 2 4 6 8 10 No of Searches x 104 4x104 Message Transfers 3.5- - 3 miinrlliiiiiiiiiluz :::::::::ill.'.lllll::::::::::::::::::::::'::ll:::'::":: 2.5: ~ 2" i 1.5 _ 1 _ a 0.5- - 00 2 4 6 8 10 4 No of Searches X 10 Figure 14: The 100000 searches in a 1000 node network, using DIAS routing in polling fashion for a Zipf distribution of documents over nodes. 46 Ave Path Length 3.5 , 2.5 In I l 1 I 1 Lt" L1; L_ 0.5 ~ 00 2 4 6 8 10 No of Searches x 104 4 x 104 Message Transfers 3.5: . 2.5- 0.5“ . O0 2 4 6 8 10 No of Searches x104 Figure 15: The 100000 searches in a 1000 node network, using DIAS routing in reporting fashion for a Zipf distribution of documents over nodes. 47 Ave Path Length 3 t I F _.._ Gnutella 2.8— --— DIAS-Polling n -e— DIAS-Reporting 2.6? ' ~ 2.4 — 2.2 ’ , ~‘ 2— l 1.8» .. , 1.6— \ - 1.4T “in A 1.2 "9 ‘ .. ‘ “We" 'L. 2 1 r r r L 0 2 4 6 8 10 No of Searches x104 Figure 16: Gnutella vs. DIAS: Average Path Length 48 3 2 x 104 Message Transfers —~— Gnutella -- DIAS-Polling _ -e— DIAS-Reporting I 3.15 3.1 3.05 2.95 I l 2.9 I l 2.85 - * 2'80 2 4 6 8 10 No of Searches x104 Figure 17: Gnutella vs. DIAS: traffic overhead 3.4.2 Different Configurations in DIAS of Polling Fashion Parameters in the properties file may affect the performance of our protocol. Now let’s take a look at the variation of performance when the parameters are set to different values in the polling fashion. First, we set the value of the parameter NO_GROUPS to 200, the value of THRESH- OLD to 0.4, and the value of BUDDY_UPDATE_LOOP is set to 20, 30, and 40 respec- tively. Figure 18 and Figure 19 present the simulation results when the parameter BUDDY_UPDATE_LOOP varied. 49 Average Path Length 3 f 28 -'-30 “ —e-40 2.6r - O 2 4 6 8 10 No of Searches x104 Figure 18: Variation of average path length when BUDDY_UPDATE_LOOP varied in the polling fashion 50 x 104 Message Transfers 3.2 l l l I —~— 20 - -"' 30 _ 3.15 _e_ 40 31* . 3.05~ _ 2.95 . ~ 2.9 - - 2.85 r i 2'80 2 4 6 8 10 No of Searches x104 Figure 19: Variation of traffic overhead when BUDDY_UPDATE_LOOP varied in the polling fashion The data in Figure 18 shows that the smaller the value of BUDDY_UPDATE_LOOP was, the earlier the average path length began to decrease. This indicates that if nodes in the DIAS system updated their buddy lists in shorter rounds, the time of warm-up in the system would be reduced, that is, the interest groups were generated more quickly. We know that when BUDDY_UPDATE_LOOP was wet to a smaller value, nodes in the sys- tem should update their buddy list more frequently. This may cause more traffic, and the additional overhead should be grander. But Figure 19 shows that there was no big differ- ence among these three runs when the parameter BUDDY_UPDATE_LOOP varied. 51 Why? Because the additional traffic that the updating of buddy lists caused was so small that can be ignored. Now let’s take a look at the variation of the simulation result when the parameter THRESHOLD varied. This time we set the value of BUDDY_UPDATE_LOOP to 30, the value of THRESHOLD to 0.4, 0.5, and 0.6 respectively, and the value of NO_GROUPS is remained 200. Figure 20 and Figure 21 present the simulation results when THRESHOLD varied. The data in Figure 20 shows that there was on big difference in average path length when the threshold varied. The reason is because nodes in our simulation shared similar interests in a high degree. 80 after a period of warm-up time, nodes sharing the common interest set were mostly clustered together. Thus the difference of threshold did not put much affection on the search efficiency. And because the additional traffic that the up- dating of buddy lists caused was so small that can be ignored, there was no big difference in traffic when the threshold varied either in Figure 21. 52 Average Path Length 3 T 1 . -— 0.4 28 -— 0.5 ~ -e- 0.6 2.6~ ~ No of Searches x104 Figure 20: Variation of average path length when THRESHOLD varied in the polling fashion 53 x 104 Message Transfers 3.2 l I I I -- 0.4 _ --— 0.5 _ 3'15 . -e— 0.6 3.1— - 305. ~ 3W 2.95~ - 2.9 ~ 2.85- — 2'80 2 4 6 8 10 No of Searches x104 Figure 21: Variation of traffic when THRESHOLD varied inthe polling fashion 3.4.3 Different Configurations in DIAS of Reporting Fashion In this section, we will show the simulation result when parameters varied in the re- porting fashion. First, we set the value of the NO_GROUPS to 50, the value of NO_CONN_CHANGES to 5. Figure 22 and Figure 23 present the simulation result when the parameter NO_BUD_CHANGES was set to 5, 10, and 15, respectively. The data in Figure 22 shows that the smaller the value of NO_BUD_CHANGES was, the more the average path length was reduced. That is because the requirement for a node to update its buddy list was easier to meet, nodes could update more frequently during the 54 warm-up time at the very beginning. So the interest groups were generated in a higher level. This caused the searching efficiency improved more. At the same time, the more updating was conducted, the more traffic was caused in the warm-up period. Thus, in Figure 23, the peak of the curve for NO_BUD_CHANGES 5 is the highest, and the peak of the curve for NO_BUD_CHANGES 15 is the lowest. And the additional traffic in the warm-up period was temporary, after that, the traffic for all the values of NO_BUD_CHANGES was equal to that in Gnutella network, because the interest groups were generated and stable. Average Path Length 3 I T r r 28- -— 10 ~ -e— 15 2.6~ 4 o 2 4 6 s 10 No of Searches ) x104 Figure 22: Variation of average path length when NO_BUD_CHANGES varied in the reporting fashion 55 3 2 x 104 Message Transfers . "" 10 - 3-15 —e— 15 h —1 3.1 3.05 2.95 - a 2.85 ~ _ 2'80 2 4 6 8 10 No of Searches x 104 Figure 23: Variation of traffic when NO_BUD_CHANGES varied in the report- ing fashion Now, let’s take a look at the variation of the simulation result when the parameter NO_CONN_CHANGES varied. This time we set the value of NO_BUD_CHANGES to 5, the value of NO_CONN_CHANGES to 5, 10, and 15, respectively, and the value of all the other parameters remained the same as in the simulation above. The simulation results of this time shown in Figure 24 and Figure 25 is similar to the simulation results shown in Figure 22 and Figure 23 when the parameter NO_BUD_CHANGES varied. The data in Figure 24 shows that the smaller the value of NO_CONN_CHANGES was, the more the average path length was reduced. The reason is that the requirement for a node to report its variation was easier to meet, thus during 56 the same period, more nodes could get enough varying neighbors to update their buddy lists. As a result, buddy lists were updated more frequently during the warm-up time. So the interest groups were generated in a higher level, which caused the searching effi- ciency improved more. Also as described above, the more updating was conducted, the more traffic was caused in the warm-up period. Thus, in Figure 25, the peak of the curve for NO_CONN_CHANGES 5 is the highest, and the peak of the curve for NO_CONN_CHANGES 15 is the lowest. And the additional traffic in the warm-up pe- riod was temporary, after that, the traffic for all the values of NO_CONN_CHANGES was equal to that in Gnutella network, because the interest groups were generated and stable. 57 Average Path Length 3 f 2.8- —_ 10 ~ -e- 15 2.6— ' ‘ 2.4~ _ 2.2 " 7" ‘33 . d 2- u 1.8» \‘ ‘.\ d 1.6 h “ 3“ '\ d ”w 19': {‘2 .' ITO a“: .' ' \n .- ‘. ' .‘ :32 r“ 5'; I i I ' 1.4“ fivmvuw,‘kéfi “MV.¥ 1.2~ ‘ 1 L ’ I I No of Searches x 104 Figure 24: Variation of average path length when NO_CONN_CHANGES varied in the reporting fashion 58 3 2 x 104 Message Transfers —n— 05 _ _._ 10 3-15 -e— 15 3.1 h —-1 3.05 2.95 - r 2.9 I l 2.85 I LL 2'80 2 4 6 8 10 No of Searches x104 Figure 25: Variation of traffic overhead when NO_CONN_CHANGES varied in the Reporting Fashion 59 CHAPTER 4 DIAS IMPLEMENTATION ON J XTA 4.1 JXTA Architecture We have introduced JXTA briefly in Chapter 1. We know that project JXT A is con- ceived with three main objectives intended to address the shortcoming of the P2P system: interoperability, platform independence, and ubiquity. Distributed computing applications built on top of JXTA can across different P2P systems and communities. Language inde- pendence, system independence and network independence can be achieved. Every de- vice in these applications, wire or wireless, has a digital heartbeat. In the JXTA network, each peer resource is associated with a single unique peer II). For a given peer, all of the available network interface address is encapsulated for it by a peer endpoint, so this peer can only be addressed by its peer ID. Others peers can make choice among the list of available peer endpoint addresses to achieve the most efficient communication with this peer while getting the advertisement from the peer endpoint, here “advertisements” are XML documents that peers cache, publish and exchange to ex- plore and discover network resources. The notion of interest base is also applied in JXT A project. Peers in JXTA network self-organize into groups based on their interest. If a set of peers have a common set of interests, and have agreed on a common set of policies, for example, membership, then they can establish a group [1 1]. Each group is also associated with a unique group ID. 60 For developers, Project JXTA provides a set of building blocks that provide a solid foundation for distributed applications. Also these blocks can support the common func- tions required by any P2P system. Also we know that distributed computing applications require a great deal of infrastructure, and building this framework is typically time con- suming and resource-intensive. By using Project JXT A, we can focus on creating innova- tive software applications, and do not need to care about the infrastructure of P2P system [19]. Therefore we select JXTA as the underlying infrastructure for our DIAS applica- tion. Figure 26 shows the JXTA implementation overview: Applications Application API J XTA Protocols Implementation (OS/Runtime independent) Porting API Apache Portable UND( Windows RTOS Runtime Figure 26: J XTA application overview InstantP2P is a demo application that illustrates key concepts of Project JXTA and P2P system. It is consist of two main applications: Chat application and File sharing ap- plication. The InstantP2P application provides functionality for secure one-to-one chat, group chat, and sharing, searching, downloading documents within a peer group by using the JXTA platform core building blocks. It can create, discover, and join groups, create a 61 connection between two peers in the one-to-one chat function, create connections among a group of peers in the group chat function, as well as operations on files. 4.2 The Applications in DIAS In this section, we describe how our DIAS is designed on top of the InstantP2P appli- cation. 4.2.1 Overview A started InstantP2p application represents a peer in a P2P network. The name of the peer is set as soon as the initial entry for the peer is created. After boots-trapping the start-up of the JXTA platform, the GUI application will instantiate one instance of the class net. jxta.instanthp.PeerGroupManager. In order to implement our DIAS system, most of our modification was focus on this class, because it is the heart of the application. PeerGroupManager is responsible for monitoring and influencing the buddy discoveries. It generates and maintains the buddy list for a peer using the algorithm we describe in Chapter 3. Upon instantiation it will immediately trigger a remote search of neighbors in the P2P network. The GUI application can retrieve a list of remote peers using the following methods: . public Enumeration getNeighborList(): Returns an enumeration of all the neighbors of this peer. The objects in this enumeration are net. jxta.protocol.PeerGroupAdvertisement objects. . public Enumeration getBuddyList(): Returns an enumeration of all the buddies of this peer. 62 The PeerGroupManager class might also want to search for peers with specific names only. It can do so by using the following method: 0 public void searchPeers (String peerName): This will trigger the remote discov- ery of peers. If the parameter peerName is not null, the remote search is restricted to peers with the specified name. The above methods do not return a list of discovered peers, since this happens asyn- chronously. Instead interested classes can get the search result by implementing the inter- face net. jxta. instanthp. GroupStructureListener via the methods in this interface. The classes can receive the notifications by calling the method public void addStructureLis- tener (roupStructureListener listener) to register, and terminate the notifications by call- in g the method public void removeStructureListener (GroupStructureListener listener). In order to be informed if new peers are discovered, classes can implement the net. jxta.instanthp.GroupStructureListener interface via the method public void peer- StructureChanged 0. To retrieve a list of peers, classes can call the method getPeersO in the class net. jxta. instantp2p.PeerGroupManager. In order to be informed if the currently selected peer changed, classes can implement the net. jxta.instanthp.PeerListener inter- face via the method public void peerChanged (PeerGroupAdvertisement pgAdv) The PeerGroupSearcher class is another major class we modified to implement our system. It also has a notion of a currently selected peer. The GUI application is responsi- ble to inform the PeerGroupSearcher instance of changes in the currently selected peer using the method public void selectedPeerChanged (PeerGroupAdvertisement adv). 63 PeerGroupManager can also communicate the change of the peer in one part of the GUI to the other parts of the GUI. Classes interested in the changes of the selected peer implement the net. jxta. instantp2p.PeerListener interface and will then be informed via the methods in that interface of the changes. To receive the notifications interested classes register themselves by using the method addPeerListener ( PeerListener listener). Notification can be terminated by calling the method removePeerListener ( PeerListener listener). The currently selected peer can be retrieved via the method PeerGroupAdver— tisement getSelectedPeerO. 4.2.2 Chat Application The chat application allows a peer to chat with its neighbors or its buddies. The func- tions are implemented in the class net. jxta. instanthp. Chat and the interface net. jxta. instanthp.MessageBoard that we modified to apply to the chat application in our system. The class net. jxta. instanthp. Chat establishes a chat between a peer and its neighbor or its buddy. Once a peer wants to chat, this class will publish the appropriate advertise- ment remotely by the method login. The name used for the user is its peer name returned by the method getMyPeerName() in the class PeerGroupManager. In order for the peer to receive message, an input pipe is established and monitored also by the login method in the Chat class. If the peer sends a message, the message will be sent to the currently selected peer as reported by the class PeerGroupManager via an output pipe created and monitored by the login method. The user is informed about the success or failure of the attempt to open a chat via calls to the MessageBoard instance. If there is no currently selected peer to chat with, the peer is instructed to select a peer. If a peer wants to chat with its buddy, then a secure chat is initiated, i.e. all information sent and received between the chatting buddies is encrypted. When an output pipe to a selected buddy is opened, a public key will be send to the cor- responding buddy. The constructor of this class public Chat (PeerGroupSearcher manager, Message- Board messageBoard) can register itself as a PeerListener to the manager instance, which allows the chat application to access the current neighbors of the peer. The in- stance messageBoard of the MessageBoard class is used to display the received messages to the peer. The method public synchronized void logout () terminates the current chat. The monitoring of the input pipe is suspended. The method public void sendMessage- ToPeers (String message) propagates the message to the output pipe and thus send to the neighbor. If the chat channel is secure, the message is encrypted before it is sent to the buddy. Messages may be of different types and received by different neighbors. Thus, a GUI application would probably render these messages in different colors. In addition there is a method that requests information about whether the user wants to accept certain chat proposals. 65 4.2.3 File Application The file sharing application allows a peer to manage its local content, search remote content from the peers in the P2P network, and maintain the connections between its neighbors and itself. These functions are implemented in several classes and interfaces. Content distributed in our DIAS system is described by net. jxta. share. ContentAdvertisement object. The ContentAdvertisement class extends the JXTA Advertisement object and thus allows propagation along normal JXT A discovery channels. The methods below return information about the content described by the Con- tentAdvertisement object: . public abstract String getName(): Returns a name that identifies the content. This is often the file name of the file described by the ContentAdvertisement object. . public abstract Contentld getContentId(): Returns the U11) that uniquely identi- fies the content described in the ContentAdvertisement object. 0 public abstract long getLength(): Returns the length of the content. . public abstract String getType(): Returns the mime type of the content in order to select the correct method to display the content once retrieved. 0 public abstract String getDescription(): Returns a detailed description of the con- tent. Local content is described by net. jxta.share. Content object. The method in this inter- face public ContentAdvertisement getContentAdvertisement() allows to extract detailed information about the local content by return a ContentAdvertisement instant. Remote content is described by net. jxta. instanthp.SearchResult object. The method in this inter- 66 face public ContentAdvertisement getContentAdvertisement() allows to retrieve a descrip- tion about the remote content. The class net. jxta. instanthp.SearchManager is the interface to the file searching ap- plication. Its constructor public SearchManagedPeerGroupSearcher manager) retrieves peer information from the parameter manager. This class registers itself as a PeerListener to manager. Some major methods in this class are described below: . public void startSearch(String searchString): Starts an asynchronous search for remote data. The result is communicated to the registered ContentListener objects. 0 public void addContent(String path) & public void addContent( File addFilesH): Allows a peer to add content locally. . public void removeContent(Enumeration enum) & public void removeCon- tent( Content c): Allows a peer to remove its local content. If a peer wants to download the data from a remote peer, the method retrieveContent() in the class net.jxta.instantp2p.SearchManager can be used to retrieve. the data via the net. jxta. instanthp. ContentListener interface. To retrieve the actual data, an input pipe needs to be opened to the peer in which the actual data resides. Since the download proc- ess may be time consuming, it is done asynchronously. The method public void fin- ishedRetrieve( String url) is called once the retrieval has succeeded. The class net. jxta. instanthp. ConnectionManager is responsible for connection main- tenance. Some major methods in this class are described below: 0 public void addConnection(String peerName): Allows a peer to add connection between a remote peer and itself. 67 . public void removeContent(String peerName): Allows a peer to remove connec- tion between its neighbor and itself. The classes above are the major classes we modified for the file application of our sys- tem. 4.3 Implementation In this section, we describe how the DIAS application is built and work. 4.3.1 Recommended System Requirements 0 Any Java enabled platforms: Windows (95, 98, ME, 2000, NT, XP), Solaris, Linux, Unix, MaxOSX. . 200MHz Pentium class computer with 128MB of RAM, 5 MB free disk space. 4.3.2 Building Requirements . Windows NT Version 4 Service Pack 3, Windows 2000/98/ME . JavaZJDK 1.3.1 - Cygwin: a Linux-like environment for Windows . Ant: A Java-based build tool in Apache. 4.3.3 Running DIAS . Configuration 68 The first time when DIAS is run, the JXT A Configuration window that is shown in Figure 27 will appear. “A ‘ E53 JXTA Confrgurator . , 7 89;? 'hflpflsheltlxtagmflndexhtmf: '-.r1,'{,,1,’.l_§}_idj‘: #3»;- ' 3 _ advanced RendezvoustRelays‘ mm: Security Settings Secure Usemame Password Verify Password Please note your secure usemame and password They will be required for all future access Figure 27 : JXTA configuration window In this window, a user is required to input some necessary information such as “Peer Name” in the “basic” tab, “Secure Usemame” and “Password” in the “Security” tab. After the initial setup, a user is only required to enter a name and password in the Secure Login window shown in Figure 28 to connect to the J XTA network. Each time the DIAS 69 is restarted, the user will have to verify this Secure Usemame identity, because it is his/her unique identity on the J XTA Network. Ems Starting AIF E: [215 JXTA Secure Login seems" 'erifinié:"lblack | Figure 28: Login window When the application launches, the main window of the DIAS application will come out. This window has menu options that control key functionality of the system and a panel contains four tabs: Secure Chat, Local Content, Search, and Maintain Connections. The first tab corresponds to the Chat application. The last three tabs consist of the File application. . Secure Chat The search tab is shown in Figure 29. In the left half of this tab, the buddies and neighbors of this peer are listed. The user can select a certain neighbor to open a chat or a buddy to open a secure chat. The text box in the bottom of the tab is used for the user to write and send message. The sent and received messages are displayed in the right half of the tab. The messages sent to and received from different neighbors of buddies are dis- played in different color. That is shown in Figure 30. 70 AIFS: Peer black File Edit Navigation Group Add Buddy Delete Neighbor black is now logged in NeiPeerGroup :3; en d M ea: age A We iii n g for F: e n d 9 run?» u .;- .3 n n e i: ii 1:: n Figure 29: Secure chat tab 71 All‘S: Peer black File Edit Navigation Group Help Simon} C hat i} J Loral ifcmienl irieauh ~ Kruiiairiiairi 'LIZIl‘Il'IE‘IINONE- amen black is now logged in NetPeerGroup red> hello black> Got it! E: e n d M e a g e 'I.:"u.v'aiimg for F'eiidezmu-s. Ingririetiicm Figure 30: The chat between two buddies . Local Content Management Figure 31 shows the local content tab. The user can use this tab to manage his/her lo- cal content. The indexes of content are listed in the left half of the tab and the detailed description of content are displayed in the right half. The user can add a new piece of content by click the “Add” button, and delete an undesired piece of content by highlight the index of this piece of content and click the “Delete” button. 72 AIFS: Peer black Elli??? _ AboutAlFS ____ ::iE|jl.llE‘ I: hat 144‘ Ltnt'al I: om»: , _ [elusiaii'itairi IZ i:Ir'i.r'iectior‘rz. "I!!! «We in n g in r F' e n d e "Eu-'0 u -3: n: .3 n n E c ll 0 n Figure 31: Local content tab I Search Function The search tab is shown in Figure 32. When the user wants to search a certain piece of content, he/she should first input the keywords he/she wants to search, then click the “Search” button. When the search succeeds, all the matched content will be displayed in the bottom box. The information includes the ID, the name, and the owner peer of each piece of content. After the user has conducted several searches, he/she clear all the pervi- ous search results by clicking the “Clear” button. 73 AIFS: Peel black EHEHEI " itiiearitii ~ giltaii‘i‘rmii r2.mirxemraria 'v"'-.I':3iiirig for F‘ei'ider-,-'tuie utirinertitir'i Figure 32: Search tab I Connection Maintenance Figure 33 shows the maintain connection tab. The peer ID and peer name of the user’s buddies and neighbors are listed in the two boxes respectively. Though the connections of each peer are maintained automatically using the algorithm we described in Chapter 3, a user can manually delete a connection to its buddy or neighbor by clicking the “Add Connection” or add a connection by clicking the “Remove Connection” button to look for the available peers in the network according to some special willingness. 74 AIFS; Peer black File Edit Navigation Group Help Auto Share Preference Ll ‘Ie"'l.|':_‘4llii'ig for F’E-i’ll'iEZ‘v'EIUS I: Lii’lr‘wtzlliiii'i Figure 33: Maintain connection tab 75 CHAPTER 5 CONCLUSIONS 5.1 Summary of the Work Content location in decentralized P2P networks is a challenging problem. Most P2P protocols, such as Gnutella, employ fairly inefficient search approaches. The main pur- pose of our research is to find out a scheme that can make use of peer’s interest to im- prove the search efficiency in P2P networks without producing too much traffic over- head. By analyzing those major protocols, we presented our DIAS system, which built its search scheme based on the conception of interest retrieval in the semantic layer. Peers in a DIAS network are clustered into different interest groups according to their local con- tent and historical searching behaviors. And they issue their search requirements mainly within the interest groups associated with them. DIAS enhances the performance in a considerable level, without complicating the simple architecture of unstructured P2P networks. The simulation results show that using our scheme in the P2P environment, including ad hoc networks, the searching performance will be optimized dramatically. Also we im- plemented DIAS on top of the JXT A application. Two applications are implemented in the system: Chat application and File application. Our main contributions are: I Propose a P2P search approach which generates and maintains interest groups by computing interest similarity between peers 76 I Show the improvement of search efficiency with moderate traffic overhead by simulation I Demonstrate the feasibility of this approach by the implementation on J XTA. 5.2 Future Work By the simulation result and the implementation on JXTA, we can conclude that our AIFS system can support a more efficient search process without producing additional overhead. Also this scheme is feasible in real network system. But as we have mentioned above, it cannot reduce traffic overhead as in the NeuroGrid system. Our AIFS protocol still employs the query flooding approach in the traditional Gnutella system. So one of our future works is to try to reduce the traffic overhead of the system. We may accom- plish this goal by avoiding the random flooding fashion. A possible solution is to utilize the rank of each buddy and neighbor of a peer and only forward the querying message to the limited number of buddies with the highest rank. A query from a node will be sent to the first several (group leader index) buddies in the interest group corresponding to the keywords in a given query. When a search succeeds, a peer will download its wanted content from the remote peer using the traditional scheme in Gnutella and other P2P networks. In this scheme, only one connection will be established between the querying peer and the remote destination peer. Thus the download speed is really low and the efficiency is not satisfactory. So to opti- mize the download pattern is another critical task in our future work. The strategy used in Kazaa gives us some hint. We may utilize the idea in Kazaa to establish several connec- tions between the querying peer and all the remote peers that have the wanted content. 77 Also the duplication of the searched content in each peer along the propagation path may be employed. 78 BIBLIOGRAPHY [l] [2] [3] [4] [5] [6] i7] [8] [9] S. Saroiu, K. P. Gummadi, R. J. Dunn, S. D. Gribble, and H. M. Levy, "An Analysis of Internet Content Delivery Systems," In Proceedings of OSDI 2002, 2002. S. Joseph, "Adaptive Roofing in Distributed Decentralized Systems: NeuroGrid, Gnutella and Freenet," In Proceedings of workshop on Infrastructure for Agents, MAS, and Scalable MAS, Autonomous Agents 2001 , 2001. Y. Fu, L. Cherkasova, W. Tang, and A. Vahdat, "EtE: Passive End-to-End Inter- net Service Performance Monitoring," In Proceedings of USENIX 2002 Annual Conference, 2002. E. Cohen, A. Fiat, and H. Kaplan, "Associative search in peerto -peer networks: Harnessing latent semantics," In Proceedings of IEEE INF OCOM 2003, 2003. C. Tang, Z. Xu, and M. Mahalingam, "Peersearch: Efficient information retrieval in peer-to—peer networks," HP Labs, Technical Report HPL-2002-198, 2002. S. Joseph, "NeuroGrid: Semantically Routing Queries in Peer-to-Peer Networks," In Proceedings of the International Workshop on Peer-to-Peer Computing (c0- located with Networking 2002), 2002. K. Sripanidkulchai, B. Maggs, and H. Zhang, "Efficient Content Location Using Interest-Based Locality in Peer-to-Peer Systems," In Proceedings of INF OCOM 2003, 2003. J. Xiao, Y. Zhang, X. Jia, and T. Li, "Measuring Similarity of Interests for Clus- tering Web—Users," In Proceedings of Australiasian Database Conference, Gold Coast, Queensland, Australia, 2001. Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker, "Search and replication in un- structured peer-to-peer networks," In Proceedings of the 16th ACM Intema-tional Conference on Supercomputing, 2002. 79 [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] C. J. v. R. B.Sc., P. D. Dip. NAAC, M.B.C.S., F....IEE, C.Eng., and F.R.S.E., IN- FORMATION RETRIEVAL, 2nd ed. London: Butterworths, 1979. B. Traversat, M. Abdelaziz, D. Doolin, M. Duigou, J .-C. Hugly, and Eric Pouy- oul, "Project JXTA-C: Enabling a Web of Things," In Proceedings of 36th Annual Hawaii International Conference on System Sciences (HICSS'03), Big Island, Hawaii, 2003. J XTA for J 2MB: Extending the Reach of Wireless With JXTA Technology, http://www.ixtg.org/Droiect/www/docs/JXTA4J2ME.pdf Project JXT A 2.0 Super-Peer Virtual Network, http://wwwjxta.or2/pro'Lect/www/white_papers.html Z. D. Michael W. Berry, Elizabeth R. Jessup, "Matrices, Vector Spaces, and In- formation Retrieval," Society for Industrial and Applied Mathematics, vol. 41, pp. 335 - 362, 1999. J. W. Sons, Information Retrieval Systems: Characteristics, Testing and Evalua- tion, 2nd ed. New York: F. Wilfrid Lancaster, 1979. R. Baeza—Yates and B. Ribeiro-Neto, "Modern Information Retreival," ACM SIGMETRICS Performance Evaluation Review, 1999. P. Pathak, M. Gordon, and W. Fan, "Effective information retrieval using genetic algorithms based matching functions adaptation," In Proceedings of the 33rd Ha- waii International Conference on System Science, 2000, Hawaii, 2000. Zipf, Power-laws, and Pareto - a ranking tutorial, htt :/l in er.h l.h .com/shl/ a rs/rankin rankin .html Project JXTA Technology: Creating Connected Communities, http://www.jxta.org 80 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 'IIIIIIIII IIIIII III IIIIIIIII II IIIIII 293 024700