la!!b£$:.:.$ is}: i. E. .1. 12.1.21. .3...- . K UNEIV RS ITY LIBRARI llllllllllllllllHlllHHllll lllllllllll 3 1293 01410 7365 ||||l|l This is to certify that the dissertation entitled RECURSIVE QUERY PROCESSING IN LARGE DATABASES presented by Sungwon Jung has been accepted towards fulfillment of the requirements for PhD degree in Computer Science g ' Q‘KMMLL‘ Major professor Date December 18, 1995 MS U is an Affirmative Action/Equal Opportunity Institution 0-12771 LIBRARY Michigan State Unlversity PLACE III RETURN BOX to roman thb checkout from your record. TO AVOID FINES return on or before date duo. DATE DUE DATE DUE DATE DUE MSU In An Al'firmlutlvo Action/Equal Opporlmly Ira-mulch W RECURSIVE QUERY PROCESSING IN LARGE DATABASES By Sungwon Jung A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DEGREE OF PHILOSOPHY Department of Computer Science 1995 ABSTRACT RECURSIVE QUERY PROCESSING IN LARGE DATABASES By Sungwon Jung Recursive database structures are the fundamental processing elements in numer- ous important applications including navigation systems, recursive rule processing in knowledge base systems, and processing IS-A hierarchy in object-oriented sys- tems. Query processing for these types of structures is the focus of numerous current database research projects, because traditional approaches are not suitable for this kind of query processing. In this thesis, we have investigated query processing meth— ods for large recursive relational databases. We have used automobile navigation systems as one of the model applications. More general models for recursive queries using transitive closures have also been investigated. In automobile navigation systems, a topographical road map can be defined by a large recursive relation, and “finding shortest paths” using this relation is a recursive query. We have developed a H iTz' graph model for structuring large recursive relations to speed up the shortest path computation. The H iTi graph model provides a novel approach to abstracting and structuring a topographical road map in a hierarchical fashion. We then proposed a new shortest path algorithm named SPAH, which utilizes the hierarchical abstraction of a topographical road map for its computation. Our performance analysis of SPAH showed that it dramatically reduces the search space. Within the H iTi graph framework, we also studied the parallel processing for intra and inter query Shortest path problems. Our empirical analysis of these two problems revealed that the inter query Shortest path problem provides more opportunity for scalable parallelism than the intra query Shortest path problem. We then studied recursive query processing for distributed fragments of recursive relations. The major performance issues involved were the description and location of recursive fragments in a distributed environment. Traditional approaches to de- scribing and locating a fragment in non-recursive databases were not effective here. We have developed a new method for describing and locating recursive fragments based on the mathematical properties of lattices. We showed that lattice structures provide a good theoretical and practical basis for describing and locating recursive fragments. The performance of this lattice approach was analyzed both theoretically and experimentally in this thesis. © Copyright 1995 by Sungwon Jung All Rights Reserved To my parents ACKNOWLEDGMENTS I wish to express my gratitude and appreciation to my thesis advisor Dr. Sakti Pramanik for his consistent guidance and encouragement right from the beginning, and his financial support. I am grateful for many discussions and invaluable comments he provided. I would like to thank my guidance committee members, Dr. Moon Jung Chung, Dr. .lon Sticklen, and Dr. David Rovner, for their help and guidance. I must express my heartful thanks to my father, BongJo Jung and my mother, JungIIee Kim for their Sincere pray during my studying here. I also thank my two sisters, Heewon Jung and Soonwon Jung, and my brother Sungkyu Jung for their encouragement. vi TABLE OF CONTENTS LIST OF TABLES ix LIST OF FIGURES x 1 Introduction 1 1.1 Problem Description ............................. 1 1.2 Previous Works and Our Approach ..................... 4 1.3 Dissertation Outline ............................. 11 2 HiTi Graph Model of Large Recursive Relations for the Shortest Path Computation 12 2.1 Introduction .................................. 14 2.2 Formal Framework and Description of HiTi Graphs ............ 19 2.3 Use of HiTi Graphs for Computing Shortest Route ............ 28 2.3.1 Shortest Path Algorithm SPAH ...................... 29 2.4 Performance Analysis ............................. 35 2.4.1 Comparison between SPAH and A* algorithm ............. 37 2.4.2 Effects of edge cost distribution ...................... 38 2.4.3 Effects of the number of hierarchical levels ................ 40 2.5 Updating HiTi Graph ............................ 42 2.5.1 Edge Deletion ................................ 42 2.5.2 Edge Addition ............................... 43 2.6 Conclusion ................................... 44 3 HiTi Graph-based Parallel Shortest Path Algorithms 46 3.1 Introduction .................................. 47 3.2 Intra Query Parallel Shortest Path Computation .............. 49 3.2.1 Use of HiTi Graph for Identifying Middle Nodes ............. 51 3.2.2 Efficient Stopping Criteria ......................... 53 3.2.3 Formal Description of PASPAH ...................... 55 3.2.4 Performance Evaluation .......................... 58 3.3 Inter Query Parallel Shortest Paths Computation ............. 62 3.3.1 Formal Description of ISPAH Algorithm ................. 62 3.3.2 Performance Evaluation .......................... 63 3.4 Conclusion ................................... 65 vii viii 4 Description and Location of Distributed fiagments of Large Recur- sive Relations 67 4.1 Introduction .................................. 69 4.2 Fragment Description by Lattice Structures ................ 71 4.2.1 Maximal and Minimal Nodes Approach ................. 75 4.2.2 Lattice Approach .............................. 76 4.2.3 Lattice Cover Problem is NP — complete . . . .. ............. 79 4.2.4 A General Heuristic for Finding a Good Lattice Cover ......... 83 4.2.5 Near Optimal Lattice Cover Size ..................... 91 4.3 Finding Relevant Fragments ......................... 94 4.4 Updating Recursive Relations ........................ 101 4.4.1 Edge Update ................................ 102 4.4.2 Node Addition and Deletion ........................ 107 4.4.3 Performance Analysis of Update Algorithms ............... 107 4.5 Conclusion ................................... 111 5 Conclusion and Future Research 112 5.1 Contribution of This Dissertation ...................... 112 5.2 Future Research ................................ 115 APPENDICES 116 A HiTi Graph Based Distributed 'D‘ansitive Closure Algorithm 116 B Empirical Analysis of Computing Shortest Path for Road Map Queries 124 8.1 Performance comparison of Shortest path algorithms ........... 124 8.2 Conclusion ................................... 129 C A Survey of Database Problems in Intelligent 'h‘ansportation Sys- tem 131 CI Introduction .................................. 131 C.2 Data Models for Vehicle Navigation Systems ................ 134 C21 Indexing Schemes for accessing multidimensional road objects ..... 135 C22 Large Database Issues ........................... 139 (1.2.3 Update ................................... 140 C.3 Queries .................................... 141 C.3.1 Optimization ................................ 142 C.3.2 Query Types ................................ 143 CA Positioning Technology ............................ 144 C5 Current Systems ............................... 146 BIBILIOGRAPHY 149 LIST OF TABLES 2.1 N}, B}, and W} of SC} in Figure 2.2 .................... 22 2.2 Boundary nodes, between and within edges of level 1, 2, and 3 subgraphs in Figure 2.4 ................................ 25 2.3 Effects of levels of HiTi graphs on H" when 10 S lNgll S 20 ....... 40 2.4 Effects of levels of HiTi graphs on H” when 4 S [Nil] S 8 ......... 40 2.5 Effects of levels of HiTi graphs on '1'" .................. .. . 41 4.1 An example of relation R .......................... 72 4.2 An example of relation R’ .......................... 72 4.3 Newly created nodes for each set S.- for transforming from MCP to LCP 82 4.4 Lattice L and Set 3;, ............................. 90 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 4.10 4.11 4.12 4.13 4.14 LIST OF FIGURES Examples of within and between connection of CROMS .......... 17 A digraph G and its level 1 subgraph SC}, SC; and SC; ......... 20 An example of level 3 subgraph tree ST .................. 24 An example of a digraph C(V, E) partitioned according to ST in Figure 2.3 25 A level 2 HiTi graph corresponding to the digraph in Figure 2.4 ..... 27 An example of level 4 subgraph tree ST .................. 29 SPAH: Finding a Minimum Cost Path from START to DEST ..... 34 A* Algorithm: Finding a Minimum Cost Path from START to DEST . 36 A 800 x 800 grid graph partitioned according to the level 4 ST ...... 37 Performance comparison between A* and Algorithm S PAH ....... 38 Effect of edge cost on the performance of MA and MD .......... 39 Relative size of explored search space for A* and 0T0par ........ 50 Explored search space size of PAS PAH .................. 50 Level 1 HiTi graph created from level 1 subgraphs ............. 51 PASPAH: Find a shortest path from S' in 3G}, to T in SC} ........ 56 An example of sets I and II ......................... 59 Performance of PASPAH, SPAH, MOTOpar, and MOTO over: a) SET I b) SET 11 ............................... 61 ISPAH algorithm ............................... 63 Performance of ISPAH ............................ 64 Performance of MISPAH ........................... 65 A digraph G for an acyclic relation ..................... 74 Three F -subgraphs .............................. 74 An Example of DAG GR(VR, ER) ...................... 76 Two lattice covers for the F -subgraph G2 .................. 78 Partial order sets created from MCP to LCP ................ 82 A digraph G and F-Subgraph of G ..................... 84 Algorithm 1: Partition an Unsuitable Candidate lattice L =< at, y > . . . 85 Changes of SGL(VL, EL) in Algorithm 1 .................. 86 Algorithm 2: Find a Lattice Cover for an F-subgraph ........... 88 A digraph G consisting of two F-Subgraphs ................. 89 Performance analysis for Algorithm 2 .................... 93 Procedure Block(;r,t) ............................. 97 a) DAG for the whole recursive relation b) Fragment table ........ 98 Algorithm 3: Find the Relevant Fragments ................. 99 X 4.15 4.16 4.17 4.18 4.19 A.1 A.2 A.3 A.4 A5 A6 8.1 8.2 8.3 8.4 85 xi Procedure AddEdge((:c, y), S) ........................ A DAG consisting of two F -Subgraphs ................... Procedure DeleteEdge((;r, y), s) ....................... Effects of updates on ILCpl for a) non-dense and b) dense F-Subgraphs . Snapshots of lLCpl and Q after 50 random updates ............ Algorithm TC.1: AscendingPhase (U, j, k) ................. Algorithm TC.2: DescendingPhase (Z, j,i) ................. G'( V, E) partitioned according to level 3 subgraph tree .......... A level 2 HiTi graph created from Figure A.2 ............... Algorithm TC.3: CoordinatorSite j(H,TC,k) ............... Algorithm TC.4: ParticipantSite 5(F, m, j) ................ Dijkstra’s shortest path algorithm ...................... Nicholson’s Shortest path algorithm ..................... Performance comparison between Dijkstra and Nicholson’s algorithms . . Our two tree expansion shortest path algorithm .............. Performance comparison between Nicholson’s and Our algorithms . . . . Chapter 1 Introduction 1.1 Problem Description Recursive relations have become one of the dominant types of relations in next gen- eration databases such as deductive and object-oriented databases. The recursive relation can be easily viewed as a directed graph. Each instance of attributes partic- ipating in the transitive closure represents a node in the graph, and each tuple in the relation represents an arc [53]. The above recursive relation is usually called an edge relation, where each tuple corresponds to a labeled edge. Traditionally, the recursive relation refers to this edge relation. Queries are recursive if they are defined over re- cursive relations and their processing requires transitive closure computations. These recursive queries also must access detailed information about the nodes. Thus, we are defining another type of relation, called a node relation, where each tuple represents the detailed information of a node in the graph. A significant amount of research has been done on recursive relations, primar- 2 ily in computing transitive closures [4, 42, 46, 49, 50, 57, 74, 83, 99, 102]. This is - because every linearly recursive query requires transitive closure computation [52]. The-Shortest path problem is a Special case of the transitive closure computation problem. The shortest path problem is fundamental for processing road map queries. Very large recursive relations are needed to store topographical road maps. Previ- ously suggested transitive closure or graph traversal algorithms are not suitable for shortest path computations on topographical road maps because of their large search space. In this thesis, we will first study an efficient database organization method that allows us to speed up the single pair shortest path (SPSP) computation from source to destination nodes on large topographical road maps. We will then investigate a fast SPSP algorithm that takes advantage of our proposed data organization method. The efficiency of our proposed SPSP algorithm and data organization method will be thoroughly analyzed on two dimensional grid graphs. Two dimensional grid graphs are considered to be typical examples of topographical road maps [64, 91]. Grid graphs not only closely model topographical road maps but also provide road map databases for controlled analysis of our algorithms. This thesis wll then examine the parallel processing for intra and inter query SPSP problems on large topographical road maps. Intro query SPSP problems deal with parallelizing a Single transaction of SPSP computation for further speedup, while inter parallel SPSP problems deal with computing multiple SPSP transactions in parallel. The inter query SPSP problem arises in the domain of automobile navigation systems where a single central server provides route information to automobile navigators. In 3 these systems, many vehicles send their SPSP computation requests to a central server and the server must be able to compute these multiple SPSP computation transactions within a reasonable amount of time. Thus, it is very useful to develop an efficient inter query parallel SPSP algorithm. In a centralized database, the primary concern of recursive query processing is an efficient transitive closure computation (i.e., a Shortest path computation for to- pographical road maps). However, in a distributed database system, the problems of describing and locating data are interrelated, and their optimization is an impor- tant goal for recursive query processing. One important optimization parameter for describing and locating distributed data-is the data transmission cost, because this is a major performance bottleneck in distributed query processing. In distributed database systems, it is often convenient and beneficial to fragment and distribute data according to referencing frequency or locality. By storing only frequently ac- cessed data at the site, we can significantly reduce the data maintenance and trans- mission cost. In recursive relations, frequently accessed data are likely to be related through the transitive relationships between them. Most papers on parallel and distributed computation of transitive closure (includ- ing the shortest path computation) have used either predicates or hash functions to describe, distribute, and locate the data [32, 45, 47, 74, 84, 102]. However, the above two methods are not suitable for describing and locating the distributed fragments because they do not properly capture transitive relationships between data. In this thesis, we will investigate the problem of describing distributed fragments of the re- cursive relation that capture transitive relationships between data. Next, we will 4 show how to locally determine the location of the remote fragments of the recursive relation. By locally determining the location of remote fragments, we can avoid a Significant amount of communication cost for recursive query processing. 1.2 Previous Works and Our Approach Shortest path problems have been the fundamental problems in computer science. Therefore, a Significant amount of research has been done on these problems, and excellent survey papers can be found in [23, 25, 31, 34, 81]. In the domain of auto- mobile navigation systems where a database contains large recursive relations such as topographical road maps, one of the primary functions is to compute the shortest route from the current location (i.e. where a driver is) to the destination (i.e. where the driver wants to go). Thus, computing SPSP is also essential for processing of road map queries. As a database problem, the SPSP problem has been traditionally studied as a special case of Single source and all pair Shortest path problems in the context of recursive query processing [7, 8, 46, 65, 86, 101], transitive closure computa- tion [3, 21, 22, 40, 48, 49, 52, 53, 57, 70, 103], and database query languages [2, 26, 71]. However, due to the large search Space a topographical road map creates [64], these single source or all pair shortest paths algorithms are not suitable for computing SPSP on this kind of map. Dijkstra first proposed an algorithm developed Specifically for SPSP problems [24]. This algorithm further reduces the search space required by single—source and all-pair shortest path algorithms. However, the search Space Dijkstra’s algorithm has to 5 explore still remains very large when large topographical road maps are considered. Nicholson [77] proposed a scheme that finds SPSP by alternately building two trees, one rooted at the source node and the other rooted at the destination node. By exploring the search space from both ends simultaneously, his algorithm can reduce the explored search Space of Dijkstra’s algorithm by roughly half. Another approach to reducing the explored search Space is to use semantic in- formation about the domain of topographical road maps. That is, each node in the graph corresponding to a topographical road map has its distinct coordinates repre- sented in terms of longitude and latitude. By taking advantages of this information, we can estimate the cost of the Shortest path between any two nodes. Either Eu- clidean or Manhattan distances are usually used for the estimation of shortest path cost between nodes. A* algorithm [30] is a typical example of this approach. A* algorithm using Manhattan distance estimation gives an optimal shortest path only when edge costs are uniform. This is because when edge costs are nonuniform, the Manhattan distance is usually overestimated. Unlike A* algorithm using Manhattan distance estimation, A* algorithm using Euclidean distance estimation always gives optimal Shortest paths. The performance of A* algorithm was empirically analyzed on grid graphs as well as on actual topographical road maps by Pearl and Shekhar et. al. in [80,91]. Pearl [80] showed that A* algorithm was more efficient than Breadth First Search (BFS) [20] when the database fit in main memory. However, Pearl’s analysis did not examine the effect of path length and edge costs on the relative performance of A* algorithm. This was later analyzed by Shekhar et. al. [91]. Their analysis was done on 6 10 x 10, 20 x 20, and 30 x 30 grid graphs with different edge-cost distributions and one actual digitized Minneapolis road map consisting of 1089 nodes and 3300 edges. Their analysis shows that A* outperforms BFS if the path (source,destination) is much smaller than the diameter of the graph, or if the edge—cost distribution is skewed. The explored search space can be further reduced by combining A* algorithm and Nicholson’s two-tree building approach. That is, each tree is expanded by using A* algorithm. Mohr and Pasche [72] proposed a new SPSP algorithm named 0T0, which is a hybrid of A* and Nicholson’s algorithms. They provided an empirical analysis of 0T0, Nicholson’s, A*, and Dijkstra’s algorithms on three grid graphs (i.e., 100 x 100 nodes with three different edge cost distributions) and one road map of Switzerland (i.e., 3937 nodes and 12500 edges). For their empirical analysis, they used Manhattan distance as the estimation of shortest path cost between nodes. Their analysis showed that 0T0 outperforms Dijkstra and A* algorithms. The 0T0 algorithm was also Shown to perform far better than N icholson’s if the degree of edge cost variation is not high. Agrawal and J agadish [5] developed a data organization approach to further speed up SPSP computation. Their approach was to precompute some partially precom- puted path information and then use it at run time to prune the explored search space. This data organization technique assumes that a graph (i.e., topographical road map) can be decomposed into a set of subgraphs. This assumption is reasonable in the domain of road maps, since a large road map can be easily decomposed into a set of small sub-road maps. While this approach gives much faster computation time than all the algorithms discussed above, it has additional storage overhead and update complexity. Although the approach of Agrawal and Jagadish speeds up the Shortest path computation Significantly, it does not optimize processing the shortest path queries based on road map navigation. The reason for this is that the shortest path obtained from their approach still includes all the detail intermediate nodes between the source and destination, and navigators do not usually need this information. This problem would be very serious if we needed to compute a shortest path on a large topographical road map on which the discretized interval is fine—grained, and the two end nodes on the path are far apart. To cope with this problem, we have developed a H iTi (Hierarchical mulTi) graph model of very large recursive relations (e.g., topographical road maps). The H iTi graph provides a novel approach to abstracting and structuring road maps in a hierarchical fashion. It is modeled after the mechanism people use to select a route on a road map. That is, when a person wants to find the shortest route from a current location in one state to a destination point in another, he/ she usually reads road maps in a deceasing map-scale order. For example, a driver who wants to find the shortest path from a current location in East Lansing to a destination in Los Angeles, will consult the highway road map for the entire U.S.A. first. The person will then find a more detailed highway connection by reading state road maps (i.e., Michigan, and California). Finally, the driver will consult city road maps of East Lansing and Los Angeles to find very detailed information about the starting point and destination. By maintaining road maps in this hierarchical structure, H iTi graphs provide the following three features: 8 1. They allow the computation of a Shortest path without having to search through all the detailed maps of the intermediate states and cities. 2. They provide a basis for storing road maps at various levels of hierarchical ab— straction. Thus, a more controlled storage management suitable for navigation systems (e.g., automobile navigation systems) with limited available storage can be built. 3. They are able to support various types of database queries on hierarchical ab- stractions of a road map. HiTi graphs, therefore, provide a powerful framework for implementing road-map queries (e.g., shortest-path computation) as well as for storing very large topograph~ ical road maps. We developed a new Shortest-path algorithm named S' PAH , which utilizes H iTi graph structure. By taking advantage of hierarchical edge levels in the H iTi graph, SPAH Significantly reduces the explored search space. The performance of SPAH can be further improved if Euclidean distance estimation is used. Although S PAH reduces the explored search space significantly, its performance improvement is limited by the computation power of a single processor. In order to overcome this performance limitation, we studied parallel processing for Shortest path problems. In this study, we considered not only intra but also inter query SPSP problems. Intro query parallel SPSP problems deal with parallelizing a single pair Shortest path transaction, whereas inter query SPSP problems deal with parallelizing multiple Single-pair shortest-path computations. Mohr and Pasche [72] proposed an intra parallel SPSP algorithm called 0T0par, which is a parallel im- 9 plementation of the 0T0 algorithm mentioned above. That is, two processors are used in OTOpar, and each processor executes A* algorithm with Manhattan distance estimation to build its corresponding tree. To the best of our knowledge, no previous work has been done on inter query SPSP problems. The 0T0par algorithm is not scalable because its use of only two processors limits its performance improvement. Therefore, we need a parallel intro query SPSP algorithm that is more scalable and fine-grained than OTOpar. In this thesis, we proposed both intra query and inter query parallel SPSP algorithms named PAS PAH and M I SPAH respectively. These two algorithms are based on the HiTi graph structure. The H iTi graph structure provides the opportunity for fine-grained parallel processing for intra query SPSP problems. This is because multiple parallel shortest- path computations can be initiated based on the nodes at a higher level of abstraction of a topographical road map. This far, we have discussed recursive query processing in a centralized database system where data are stored at a single location. In this environment, the primary goal of optimizing recursive query processing is to speed up either Shortest path computation or, more generally, transitive closure computation. Location of the data in recursive relations is not an issue because they are all in one location. However, in a distributed database system, recursive relations are often fragmented and distributed over various sites to reduce the data maintenance cost and the data access cost. Each site stores the data (i.e., fragments) that are frequently accessed at that sites. In such a system, the reduction of transmission of data between sites is also very critical for the optimization of recursive query processing. Thus, we need an efficient fragment 10 description and location method to avoid the significant communication costs for query processing. This could be achieved by keeping remote fragment descriptions in local Sites. Most papers on parallel and distributed computation of transitive closure (includ- ing the shortest path) have used either predicates or hash functions to fragment, distribute, and identify the data in recursive relations [32, 45, 47, 74, 84, 102]. Huang et.al. [45] and Valduriez et.al. [102], studied the parallel computation of transitive closures where a hash function is applied to fragment a recursive relation. Gan- guly et.al. [32] suggested a framework for the parallel processing of datalog queries where hash functions are also used to fragment recursive relations. Nejdl et.al. [74] studied evaluating recursive queries in distributed databases in which recursive rela- tions are fragmented by predicates. However, these fragmentation criteria, defined by hash functions or predicates, are not suitable for fragmenting recursive relations in distributed environments. This is because they do not properly capture the referenc- ing locality of data, which is of paramount importance to the success of distributed database systems. Locality often comes from the transitive relationships between data. Stated another way, relevant data, frequently accessed by recursive queries, are likely to be related in the transitive closures. Hence, in distributed databases, fragmentation criteria based on the transitive relationship are considered to be more suitable for recursive relations than traditional criteria. Based on the criteria defined by transitive relationships, we developed a method to describe and locate the fragments of a recursive relation. Our method uses lattice structures. Lattice structures provide a powerful mathematical representation tool 11 for the description of recursive fragments. A distinct set of lattices describes each recursive fragment. This lattice representation provides a concise but flexible frag- ment description. It is concise in that a large recursive fragment is represented by a small set of lattices. At the same time, it is flexible in that it can represent any type of recursive fragment. Finding the optimal lattice descriptions of fragments is Shown to be an NP-complete problem. We proposed an efficient heuristics for this problem. The performance of our heuristics was analyzed theoretically as' well as empirically. The empirical analysis Showed that our heuristics gives a near-optimal lattice description. The lattice approach creates unique update problems. We have extensively analyzed the performance of our proposed update algorithms. 1.3 Dissertation Outline The structure of this thesis is as follows: Chapter 2 presents a HiTi graph model of large recursive relations for efficiently computing an optimal single pair shortest path. In Chapter 3, parallel processing for intra and inter single pair shortest path problems is discussed. Chapter 4 presents a methodology for describing and locating the distributed fragments of recursive relations in a distributed database. Chapter 5 concludes the thesis with a list of future research issues. The following parts of this thesis have been published: The content of chapter 2 appears in [59]. The content of Chapter 4 appears in [82]. The content of Appendix A appears in [58]. Appendix B [60] and C [104] are being submitted. Chapter 3 will. be submitted after additional work has been done. Chapter 2 HiTi Graph Model Of Large Recursive Relations for the Shortest Path Computation In navigation systems, a primary task is to compute the minimum cost route from the current location to the destination. One of the major problems for navigation systems is that a significant amount of computation time is required to find a minimum cost path when the topographical road map is large. Since navigation systems are real time systems, it is critical that the path be computed while satisfying a time constraint. An efficient database organization method is proposed for structuring a topographical road map to speed up the computation of a minimum cost path. Data organization methods previously suggested either do not allow optimal minimum cost path generation or require searching all intermediate nodes on the minimum cost path. 12 13 In navigation systems, a navigator may not have to know all the intermediate nodes to go from the current location to destination. In this chapter, we propose a new graph model named HiTi (Hierarchical mulTi graph model), for efficiently computing an optimal minimum cost path. Multiple levels of geographical boundaries (e.g. cities, counties, and states) can be easily mapped into the hierarchical structure of the proposed graph model. Various levels of hierarchical abstractions can also serve as the basis for efficient storage management for large road maps. Thus, it provides the basis to develop a more controlled storage management suitable for navigation systems with limited available storage (e.g., navigation system inside an automobile). Based on HiTi graph model, we propose a new single pair minimum cost path algorithm. We empirically Show that our proposed algorithm dramatically reduces the explored search space. Further, we empirically analyze our algorithm by varying both edge cost distribution and hierarchical level number of H iTi graphs. 14 2.1 Introduction In navigation systems, a primary function is to find possible routes from the current location (e.g., where a driver is currently positioned) to the destination (e.g., where the driver wants to go) with a minimum expected cost. For this purpose, they use a topographical road map which is in the form of the following recursive relation : topographical.road-map (source, destination, cost) where the cost attribute indicates, for example, a minimum expected time of travel from point source to point destination. Another applicable cost can be the Shortest distance between the two end points. One of the major difficulties of navigation systems is the size of the topographical road map data. It requires about 2.4 Gbytes of storage to store a small 100 mi x 100 mi map discretized at 100 feet intervals [5, 64]. Thus, the size of data involved is very large when larger maps are considered. Minimum cost route computation with this large amount of road map data requires a Significant amount of computation time. Since navigation systems are real time systems, it is critical to compute a minimum cost route satisfying a time constraint. Previously suggested transitive closure or graph traversal algorithms [4, 46, 49, 50, 57, 83, 86, 99] are not directly applicable to topographical-road-map (source, destination, cost) for the computation of a minimum cost path due to the very large volume of data they have to search. Thus, we need an efficient database organization method for structuring the topographical road map to speed up the computation of 15 a minimum cost path. In this regard, two different approaches have been studied in the past. One approach is to develop a database structure which gives a suboptimal minimum cost path quickly. The other is to develop the database structure which gives an optimal Shortest path. For the suboptimal shortest path generation, Ishikawa et. al., Shapiro et. al., and Liu et. al. used road hierarchies (i.e., Freeways, Highways, . . ., Side roads) to speed up minimum cost routes [51, 68, 89, 107]. They used multiple levels of hierarchical details of road maps to cut down the unnecessary search Space. Ishikawa et. al. [51] applied Dijkstra algorithm to the hierarchically structured road maps. Shapiro et. al. [89] proposed a new graph structure, named LGS' (Level Graph Structure ) which models the road hierarchies theoretically. Based on LGS, Shapiro et. al. gave a new algorithm which generates approximate shortest paths rapidly. Their study Showed that the length of the path produced by LGS converges rapidly to that of the actual minimum cost path as the distance between the source and destination nodes increases. Liu et. al. [68] studied integrating Dijkstra’s algorithm with an Al knowledge—based approach and case-based reasoning for computing a minimum cost path. For the optimal Shortest path generation, Agrawal and Jagadish recently studied a data organization technique which precomputes and stores some partial path infor- mation [5]. They use the precomputed partial path information to prune the search space when computing a minimum cost path. While their approach Speeds up the computation of a minimum cost path, it does not optimize processing minimum cost path queries based on road map navigation. The reason is that their approach still 16 requires searching all the intermediate nodes on a minimum cost path. In navigation systems, it is not necessary for a navigator to know all the intermediate nodes on a minimum cost path. This can be illustrated by the following example. Consider an automobile navigation system where a driver wants to find a minimum cost path from a place in East Lansing to a destination point in New York city. He/she needs to know more about detailed route information near East Lansing and New York city than those in other intermediate places (e.g., Toledo, and Pittsburgh). It is not likely that that person has to know about detailed route information within all the intermediate cities. The rational for this is based on the following two observations. First, a driver likes to have detailed route information near East Lansing to get into an interstate highway from a current location in East Lansing. Second, he/she needs to know detailed route information near New York city to get into the local road which leads to the destination point in New York city. The above problem would be Very serious if we need to compute a minimum cost path on a large topographical road map where its discretized interval is fine-grained and two end nodes on the path are far apart. To cope with this problem, we have developed a H iTi (Hierarchical mulTi) graph model of very large recursive relations (e.g., topographical road maps). In the following paragraphs, we will first describe Hi Ti graphs informally followed by a more formal description of the model. Consider a topographical road map viewed as a directed graph G( V, E). Nodes in V correspond to discretized grid points representing map objects in a road map. Edges in E correspond to the connections between the nodes in V. Then, regional bound- 17 aries partition a road map C(V, E) into a set of Component ROad Maps (CROM) exclusively. Each CROM can be viewed as a subgraph with its boundary nodes defin- ing the boundary of the CROM. Connectivity between CROMS is represented by the connectivity of their boundary nodes. One CROM has direct connections with an- other, if boundary nodes of the former are directly connected to boundary nodes of the latter. We call this kind of connectivity as between connections of CROMS. For each CROM, we precompute the cost for each pair of connected boundary nodes. If two boundary nodes of the same CROM are connected by a path solely contained within the CROM, we call this a within connection. By using the between and within connections, we can partition the entire road map. Examples of between and within connections of CROMS are shown in Figure 1.1 where dotted lines represent regional boundaries. Withinconnection forCROM 1: “MD! Within connection for CROM 2: 1(B.C).(B.D).(C.D)l Within connection for CROM 3: [(EG).(B.F).(F.G)I Betweenconneetion forCROM I and2: 1(A.B)l Between connection for CROM 2 and 3: ((C3).(D.F)l Between connection for CROM] and 3: 101.0” Figure 2.1: Examples of within and between connection of CROMS HiTi graph is a graph whose nodes are the boundary nodes of the CROMS and edges are the within and between connections of CROMS. Note that a CROM can be 18 defined to contain a set of CROMS, thus creating a multi level hierarchy. Thus, we first need to determine a set of the lowest level CROMS which exclusively partition an entire road map. We call these CROMS as level 1 CROMS. Then, we can recursively construct a set of level k CROMS by grouping a set of geographically adjacent level 19-] CROMS where k 2 2. These sets of level I, 2, . .. ,k CROMS form a complete balanced tree structure where the root node of the tree is a whole road map. For example, consider a topographical road map covering the entire area of U.S.A. The U.S.A. consists of 4 subregions which are Eastern, Midwestern, Southern and Western regions. These 4 regions consist of their subregions which are the states of U.S.A. Each state consists of a set of counties and each county consists of a set of cities. If a city road map is defined as basic i.e., level 1, CROMS, then level 2 CROMS are county road maps, level 3 CROMS the state road maps, and SO on. Given a complete balanced tree representing level I, 2, ,k CROMS, level 1 CROMS will be the leaf nodes, i.e., level 1 node of the tree. Each level 2 nodes ‘of the tree will represent the level 2 CROMS and capture the between and within connections of the level 1 CROMS. Thus, in general, a complete balanced tree has 1 thru k levels corresponding to 1 thru k level CROMS. A level i (S k) node of the tree corresponds to between connections of level i CROMS plus level 1 within connections. Note that level 1 within connections capture the between and within connections of the level 1— 1 CROMS that corresponds to the child nodes of the tree. A HiTi graph is named the level I: Hi Ti graph if its nodes and edges represent boundary nodes, between and within connections of level I, .‘3, ,k CROMS. The HiTi graph constructed for the 19 United States example above is a level 4 Hi Ti graph. The rest of the paper is organized as follows. In section 2.2, a formal framework and description of HiTi graphs are discussed. In section 2.3, we propose and empiri- cally analyzed a new Single pair shortest path algorithm that takes advantage of Hi Ti graphs. Section 2.5 discusses the effects of updates on HiTi graphs. Finally, section 2.6 gives concluding remarks. 2.2 Formal Framework and Description of HiTi Graphs Recursive relations have the generic form R (attl, attz, att3) where attributes satisfy the following 3 conditions: 1. attl and att-z are the key attributes of R and share the same domain 2. att, and att-z are the recursive join attributes and their corresponding values are related by some transitive relationship 3. att3 describes the transitive relationship between attl and attg Based on the above generic form, a recursive relation can be viewed as a directed graph G(V, E). Each node in V is represented by the value of attl. Each edge in E is represented by values of (attl, attg, att3). 20 Suppose that a recursive relation G(V, E) is partitioned into a set of subgraphs (i.e. SG] (141,531), SG; (V21,E._}), ..., SG1 (VulnEl ) ) such that: 1lj 111 V,‘OV.,‘u-~uv‘ =v, E,‘uE,§u---uE‘ CE n1 n1 Vilfll/j1 = w and £3,105};- = Q) where 1Si,an] and iyéj We name SG} (W, EB) a level 1 subgraph i and the connections between all level 1 subgraphs are captured in PE1 2 E— (E,1 U E} U- - -U E)“ ). Then, each (:r,y,z) E PEl has the property that a: 6 V,‘ and y 6 VJ} where i 9i j. Note that (:r,y,z) corresponds to (attl, attg, att3) where a: = attl, y = attg, z = att3. Definition 2.2.1 Let N,1 denotes a set of nodes in SC} (Vil, E3) connected to/from nodes in other subgraphs SG] (VJ-1, E,‘) where i 79 j and 1 S i,j S n1. Specifically, N,‘ ={;r|(a:,y,z) 6 PE1 /\ :1: 6 V,‘} U {y|(;r,y,z) 6 PE1 /\ y 6 14‘}. Set N-1 is called level I boundary nodes of SG}. 3 Figure 2.2: A digraph G and its level 1 subgraph SG], SG; and 5G; 21 For example, consider Figure 2.2 where a digraph G and its subgraphs SG}, SG; and 5G}, are Shown. Set N21 of subgraph SG; is { G, H, M, O }. Each level 1 subgraph is described and identified by its boundary nodes, since they exclusively belong to one level 1 subgraph. Based on boundary nodes of 56'}, definition 2.2.2 gives formal definitions of level 1 between and within edge sets B} and W,‘ of SG]. Definition 2.2.2 Let { SG} ,1,E,l) | j = l,n1 } be a set of level I subgraphs of graph G(V, E) with the corresponding N}. Then, for each SG} where 1 S i S n1, B.’ ={ (mas/.2) l ($.y.z) 6 PE‘ A w 6 V9 } and W.‘ = { ($,y,fz(2=,y)) l ($.31) E (N1 x N3) A (x’fl’y in $03) /\ as ye y}. Each set B} contains an edge (:r,y,z) if there is an edge (x,y,z) in PE1 where a: is in SG} and y is in a different level 1 subgraph. In other words, set B} contains connectivity information between level 1 subgraph SG} and other level 1 subgraphs of G(V, E). Then, the following condition holds: 1 _ n 1 PE ._ u,;,B,- 11.1 From the above, it can be noted that edges in szlBJl exclusively partition graph G(V, E) into a set of subgraphs SG] (V,‘, E]), SG; (V21, Eg), ..., 5G,!“ (an1, E3“). Each set W} contains an edge (2:, y, fz(:r,y)) if there exists a directed path from the boundary nodes .7: to y of 5G}. Function fz(a:,y) gives an aggregated cost (i.e. from node 2: to y) with respect to :3 within SG}. In other words, each edge in W? represents additional precomputed connectivity information of boundary nodes within 22 SG,! . Thus, it I n 1 n 1 AS an example, Table 2.2 shows the corresponding values of N}, B} and W} of the three subgraphs Shown in Figure 2.2. In the following example, fz(:r, y) gives the cost of a minimum expected time of travel between node 2: and y. N-1 B.1 w} 1 {Or} {(D,G,2),(F,H,3)} {(P,D,4)} 2 {G’H’MIO} {(OIQI2)I(MIR74)7(G1FI2)} {(G1016)’(G1M13)’(H1M13)} 3 {QiR} 0 (0 Table 2.1: N}, B}, and W-1 of SG} in Figure 2.2 I So far, we have introduced level 1 subgraphs and their associated information such as level 1 Boundary node sets, and between and within edge sets. In general, based on these level 1 subgraphs and their associated information, we can recursively define level I: subgraphs SG]c and their corresponding N}, B}, and W} for any h 2 2. Their details are discussed in the following definitions. Definition 2.2.3 Let \II"‘1 = { 5qu (Vik'l, Eff-l) I i = 1,nk_1 } be the set of all level k-I subgraphs of graph G(V, E) where k 2 2. Then, a level I: subgraph SG]c (Vf, Ef) is defined as a subgraph induced by all nodes of level k-I subgraphs in (I) Q ‘11,“1 where [CPI Z 2. After all level k subgraphs in \II" are defined, the following three requirements must be satisfied: 1. Each level k-I subgraph in ‘1‘,“1 is a subgraph of some. level I: subgraph in ‘1”. 23 2. All level It subgraphs in W“ do not overlap among themselves. That is, V,"uV,*u-~LJV’° = V, Equg‘uo-bEfik c E nk VfflVf = 0 and EfflEf = Q) where 1Si,an;c and iaéj 3. All edges belonging to level 1: between edge sets are removed from level k-I between edges sets. AS a result, (Uj‘éj' Bf"l fl UjvfilBJ-k) = (0. By satisfying the preceding first two requirements, all subgraphs in U53 Q” are related to each other in a complete balanced tree structure where the root node of the tree symbolizes G(V, E). This complete balanced tree is named a subgraph tree (ST). The root node of ST is considered as a level k+I subgraph and it has all level I: subgraphs as child nodes. Recursively, each node symbolizing a level I: subgraph has a set of level k-I subgraphs as child nodes. All leaf nodes of ST symbolizes level 1 subgraphs. This subgraph tree is named level k+1 ST if the root of ST symbolizes level k+1 subgraph (i.e. G( V, E)). The following Figure 2.3 shows an example of level 3 subgraph tree. This tree structure shows a level 3 subgraph SG? induced by the nodes in level 2 subgraphs (i.e. S G?, S G: and S G3) It also shows a level 2 subgraph 3G? induced by the nodes in the two level 1 subgraphs SG] and SG;. For each level It subgraph SGf" of ST where 1 S i S nk, there exist corresponding level I: boundary nodes (i.e. N1“) and level 1: between and within edge sets (i.e. B!“ and W,"). The semantics of the level I: boundary nodes, level 1: between and within edge sets are same as that of level 1 boundary nodes, level 1 between and within edge sets. 24 so? so, so; so; so} so5 so, Figure 2.3: An example of level 3 subgraph tree ST They are formally defined in definition 2.2.4. Definition 2.2.4 Let PE" = E — ([3]c U EfUn-U Eikl- Then, N," = { :c | (z,y,z) 6 PEkAxEI/iij{yl(x,y,;/)E PEkAyel/f},B,-k={(x,y,z)](:1:,y,z) k k k k k M”) - 6 PE /\ we V, }andW,- ={ (:r,y,fz(:r,y)) | (x,y) E (N,- x N,)/\ (x ——i’ y in 5G5“) /\ :1: # y }. Function fz(;r,y) gives an aggregated cost (i.e. from node a: to y) with respect to 2 within SGf. As can be noticed from the above definition, PE" = @118}. Note that of: Bf" = U32? Bf" — Uj-‘éle. In order to Show how NI“, Bf, and W!“ are obtained, we use the digraph in Figure 2.4 as an example. From Figure 2.4, we got PE1 2 { (4,6,2), (5,7,3), (5,20,7), (10,11,4), (10,12,6), (14,16,4), (l5,14,3), (l7,19,5), (l7,20,4), (18,20,2), (23,25,4), (24,25,5) } and PE2 = { (5,20,7), (10,11,4), (10,12,6), (l7,19,5), (l7,20,4), (18,20,2) }. By utilizing these PE] and PE), we can obtain boundary nodes, within and between edges of level 1 and 2 subgraphs. These are Shown in Table 2.2. Note that in Table 2.2, between edge sets of SG; and SGl, do not contain edge(s) in {(5,20,7), (10,11,4), (10,12,6)} Figure 2.4: A11 example of a digraph G( V, E) partitioned according to ST in Figure 2.3 and {(l7,19,5), (l7,20,4), (18,20,2) } respectively. This is due to the last requirement related to Definition 2.2.3. Thus, in general, given level k+1 ST, edges in B!“ of S Gf capture connections between boundary nodes of level k-I child subgraphs of S'G]c and. those of the rest of level k-I subgraphs. Between edges of level k-I child subgraphs of i S G? capture connections between boundary nodes of themselves. i N} W} B: 1 {4,5} 0 {(4,6,2),(5,7,3)} 2 {6,7,10} {(6,10,3),(7,10,3)} ll 3 {11,12,14} {(11,14,2),(12,14,1)} {(14,16,4)} 4 {15,16,17,18} {(16,15,3),(16,17,2),(16,18,2)} {(15,14,3)} 5 {1920,2324} {(20,23,2),(20,24,3)} {(23,25,4),(24,25,5)} 6 {25} 0 0 i N} W} B? 1 {5.10} {511016)} {(5.2017).(10.11141.00.12.61} 2 {11,12,17,18} {(1l,17,8),(11,18,8),(12,17,7),(12,18,7)} {(l7,19,5),(17,20,4),(18,20,2)} 3 {19,20} 0 0 1 N? w? B? 1 0 0 0 Table 2.2: Boundary nodes, between and within edges of level 1, 2, and 3 subgraphs in Figure 2.4 26 We have given basic definitions necessary to define a Hi Ti graph. However, these definitions (i.e. Nf, BI“, and Wik) do not provide an efficient way of obtaining their corresponding values. A primary reason for this inefficiency is that their definitions require obtaining their values by using V," and Ef, which is computationally expen- sive. Thus, we need a more efficient way of computing these values. The following definitions provide a recursive way of obtaining the values of N,-", Bf, and W," based on the values of Nik", Bf", and Wf'l for all level It — l subgraphs in \Ilk‘l. Definition 2.2.5 Let D = { I, 2?, ..., nk_1 } where nk_1 = [Wk-1|. Suppose level I: subgraph SG]c (Vik, BI“) is induced by all nodes in level k4 subgraphs SGf’l, V t E I where I C D. Then, N!“ = { :c I (x,y,z) E UjiLI'Bj-P'l A :1: E Utethk'l A y ¢ UtelNzk-l } U i y 1 (3313/12) 6 Ujléi’Bjc-l A y E Utethk-l A 33 if UtEINtk-l } Definition 2.2.6 Let D = { I, 2, ..., 711..-} } where nk_1 = [Wk-1|. Suppose level I: subgraph SG]c (Vf, BI“) is induced by all nodes in level k-I subgraphs 50:”, V t E I where I C D. Then, Bf = { (:c,y,z) | (z,y,z) E U'-"‘"‘B'~"l A a: E Utele’l A 1:1 J y E Utethk-l } Definition 2.2.7 Let D = { 1, f3, ..., 721..-, } where nk_1 = |W"“|. Suppose level I: subgraph SGf (Vik, El“) is induced by all nodes in level k-I subgraphs SGf’l, V t E I k _ k k III-1'1 ) . where I C D. Then, W,- — { (:r,y,fz(a:,y)) | (x,y) E (N,- x N,) A (a: ——i y in UtEl(Wtk-l U Bic-l) A 37 E y I- By taking advantages of the above three definitions, we can efficiently obtain values of different levels of boundary node, within, and between edge sets. Based on within and between edge sets, the formal definition of Hi Ti graph is given as below: 27 Definition 2.2.8 A HiTi graph is a directed labeled graph defined in terms of between and within edges associated with all nodes of subgraph tree ST. A HiTi graph defined over level k+I subgraph tree ST is called level It HiTi graph. It is represented by Hk(Pk,Ak) where Pk = { :1: I (.r,y,z) E Uf=,U;-‘j;_,B; } U {y I (2:,y,z) e Uf=,U;-‘;,B; } and A" = I (may, [11W1fz($1y)l) | ($,y,fz($,y)) E U?‘=1W} f0?“ 1 = 11k } U { (x,y,[l,B,z]) I (:r,y,z) E U?‘=1BJ’- for l=1,k}. Note that the symbols “1” and “W”( “B ”) in the definition of an edge in A’c represent the edge is level 1 within(resp. between) edge. An example of level 2 HiTi graph corresponding to the digraph in Figure 2.4 is shown in Figure 2.5. . (1:21 6 . ./'. "'3’ 5 IIJJI ‘ my, '° \ [31"07 /’ Figure 2.5: A level 2 HiTi graph corresponding to the digraph in Figure 2.4 We have introduced basic concepts and formal definitions of a level I: Hi Ti graph in this section. It is easy to see that the overhead of a level 11: Hi Ti graph comes from n. the within edges in sets UleUFIWj. Thus, for the HiTi graph model to be practical, we introduce one requirement. That is, IUf=1U;-‘_'__1W;I should be small compared with IEI in G(V, E). In practice, IUf=1U3‘;,W;I is likely to be very small. This is because |W;I can not be greater than INjI x Ile where the boundary nodes of 5G;- are usually quite small compared to IVJ-‘I. 28 2.3 Use of HiTi Graphs for Computing Shortest Route In this section, we discuss the use of HiTi graphs for computing a minimum cost path. It is a Simple bookkeeping problem to keep track of nodes on a minimum cost path; hence we will focus on an algorithm for simply finding a minimum cost of the path. We first introduce a set of basic notations which will be used in the rest of this paper. Definition 2.3.1 Let ST represent a level I: subgraph tree. Assume set X consist of a set of subgraphs (i.e. nodes) of ST. Then, S],(X) { y I y is a level I subgraph which is an ancestor of each subgraph in X }, SA(X) = UleSflX), and SC(X) = { YI Y is a direct child node of a subgraph in X }. Note that Sc(X) = X for all leaf node subgraphs S G} in X. Definition 2.3.2 Assume set X consist of a set of subgraphs. Then, SB(X) and Sw(X) consist of between and within edge sets associated with all the subgraphs in X respectively. SN(X) consists of boundary node sets associated with all subgraphs in X. Definition 2.3.3 Let S G} and SG} be two distinct level I subgraphs defined in a level I: subgraph tree ST. Then, LU B_g(,-;(SG1,SG}) is the least leveled common ancestor subgraph of SC] and SC]. 29 The examples of the definition 2.3.1, 2.3.2 and 2.3.3 are illustrated through the level 4 subgraph tree shown in Figure 2.6. Assume that X = { SGé, SG]l }. Then, 91(x )- — { 9 9a.. 1, 9W) = { .903. .90: }. SAX) = { 90;. 90:, 903, 9'6"}, 5'6“ 111 . 9G5}, S(,~(X) = { sag, SGI, }, SC(S'2(X )) ={ 9G8, L 9G2” SGIO, SGI,, 3612 }1SB(X) = { 8211 311 }. Sw(X) = { W31. W111 1.5N(X) = { N819N111 }. and LUBSG(SG,‘,, 501,) is 50;. SO /\ /I\ /I\ /\,l A \ \/,\ i 5(i. 50i 5011 SG\QSGIOSGII SGiszis 561456155016 SGiv SGis i9 Figure 2.6: An example of level 4 subgraph tree ST 2.3.1 Shortest Path Algorithm SPAH We now describe a Shortest Path Algorithm based on H iTi graph (S PAH ) S PAH takes advantage of Hi T i graphs to speed up the computation of a minimum cost path. Our algorithm explores at most ES search space necessary for computing a shortest cost path. The size of ES is shown in the The following theorem. 30 Theorem 2.3.1 Let ST represent a level k+I subgraph tree where level Is: HiTi graph H"(P’°,Ak) is defined. Assume that we want to compute a minimum cost path from START node in SG,1 to DEST node in SG] Then, the explored search space ES is at most the union of E}, E} and A’k Q A" where A’k = SB(SC(SA({SG},SG]}))) U sw(90(.9..({9G:.96:11)). Proof: Let I be the level number of LUBSG(SG,I, SGj). Furthermore, let E31 13; u E}. u SB(SC(U’ 9;;(S(;:,SC;))) u .S'w(Sc(Uf,=,S,';(SG},SG}))) and ES2 n=1“ II = -S'B(SC(U:T—_l+152(LUBSGISGl1SUD») U Sw(SC(Uii:i+1SKILUBSGISGia 501)»)- Then, we can represent ES as a union of BS, and E32. From ES, it is easy to see that at least ES} search space is necessary to compute a minimum cost path from the nodes START and DEST. Hence, we only need to show that E32 is also a necessary search space. Assume that the shortest cost path from the nodes START to DEST is 2:0 = START —> $1 —> :22 —> —> :rm —> mm“ —> xm+2 —> xm+3 —> -—> em“ = DEST. It is possible that ({xm} U u;‘=+,,‘{+2{x,,}) c .9~<.9c(u:.....92(.9G:,90;») and 9.... e 9~(so(u.’:::..9::(LUBsa(SG.%SCI)» where (x,,,,:r,,,+1,[C,B,zI) and (:r,,,+1,:rm+2,[(,B,z]) are level C between edges for some C such that 1 S (I S k. D From theorem 2.3.1, it is easy to see that how Hi Ti graphs can significantly reduce the search space. That is, without using HiTi graph, ES would be U?;1E,l where 72, represents a total number of level 1 edges sets. Furthermore, theorem 2.3.1 allows more controlled storage management that is suitable for navigation systems (e.g. 31 automobile navigation system) where available storage is limited. This is possible because we do not need to have all level 1 edge sets to compute a minimum cost path. Instead, we only need a part of the level I: HiTi graph and two level 1 edge sets (i.e. one for START node and the other for DEST node). Note that the major storage overhead of a topographical road map comes from the size of level 1 edge sets. We take a level 4 subgraph tree in Figure 2.6 to give an example of theorem 3.1. Assume that we wants to find a minimum cost path from node START in SG] to node DEST in SGia- Then, our search space ES consists ofE,1 U Eis U SB(SC(SA(SGI))) U SWISCISAISGi)» UizGISBISGi’) U SWISGill UiinlsBISGi) U SWISGill- Within the search Space E S, SPAH traverses edges from node START in S G} to node DEST in SG] to find a minimum cost (i.e. with respect to cost 2) path. Its edge traversal consists of two phases, the ascending and descending phases. The ascending phase corresponds to the period of edge traversal from node START to the boundary nodes in SN(S:,'1(SG;-)) where l is a level number of LU Bsa(SGI,SGj-). Similarly, the descending phase corresponds to the period of edge traversal from the nodes in SN(.5'34'1(SG])) to node DEST. During the ascending and descending phases, SPAH traverses edges in a non-decreasing and non-increasing edge level order respectively. Note that each of the considered paths has its own thread of processing. Some of the paths are still in the ascending phase, whereas the others are in the descending phase. The following proposition 2.3.1 provides the theoretical basis of the detailed ascending method given in SPAH. 32 Proposition 2.3.1 Given a level I: HiTi graph, assume that I be the level number of LUBSG(SGI,SG]) where node START E V,1 and DEST 6 Vi]. Assume that level 11 to r edges are incident on node :1: such that :1: E (V,1 U SN(SC(SA(SG})))) and 0 S u S r. In order to find the shortest path from node a: to any node in SN(S§"'(SG})), it is enough to follow only level p edges from node a: where min(l — 1, 7') S p S r. Proof: Since level 11 to r edges are incident on node x, we only need to Show that it is not necessary to follow level ,11 to min(l — l, T) — 1 edges from node 3:. Then, we need to consider two cases when l— 1 S r and l— l > 7. Let SGI, — 1 be SQ'I(SG}). 0 Case I — 1 S r: In this case, we need to show that we do not have to visit the nodes adjacent to :1: by following level p to l -— 2 edges for the shortest path computation within 5fo1 from node a: to the boundary nodes of SGL“. Since level 1 — 1 edges are incident on node 2:, node a: is in SN(SGI,"1). From the definitions of WI", it is true that W1“ contains all the shortest path connections between the boundary nodes of SGI,’1 within SGL“. Thus, all shortest paths between node :1: and y within SGI,"l are also captured through level l— 1 within edges (1:,y, [l — l, W, z]) 6 W1" where y E SN(SGI,‘1), which proves this case. 0 Case I - l > r: In this case, we need to show that we do not have to visit the nodes adjacent to :1: by following level In to r — 1 edges for the shortest path computation within SGI,‘l from node a: to the boundary nodes of S G5,“. Since level 7’ edges are incident on node 2:, node a: is in SN(SG;) where SC; is a descendant subgraph of SGI,’1 in ST. Then, it is clear from the definition of 33 HiTi graphs that all paths generated from node a: by following level p to r — 1 edges must pass through level 7’ boundary node(s) of SGb’. From the definitions of W;, it is true that W; contains all the Shortest path connections between the boundary nodes of SC; within SGI. Thus, all Shortest paths from node :1: to y within SGZ are also captured through level 1' within edges (1:,y, Ir, W, 2]) 6 W; where y E Sit/(SGI), which proves this case. We take Figure 2.3, 2.4, and 2.5 to exemplify proposition 2.3.1. Assume that we want to find the shortest path from node 5 in SG] to node 25 in SGé. Since LUBSG(SGI,SG(1,) = 3, l = 3. Thus, we have SN(S§(SG,15)) = N32 = {19,20,25}. From Figure 2.4 and 2.5, we can see that three edges are incident on node 5. They are: one level 1 between edge (5, 7, 3), one level 2 within edge (5, 10, 6), and one level 2 between edge (5,20, 7). Then, by the above proposition, in order to find the shortest path from node 5 to any node in {19,20,25}, we do not need to follow the level 1 between edge (5, 7, 3) from node 5. After obtaining a minimum cost path from S PAH shown in Figure 2.7, a navigator (e.g. driver) may want to find a more fine-grained path connection on some edges (i.e. high level edges) with a level number greater than 0. This can be accomplished by specializing a high level edge. High level within edges are specialized by representing them in terms of lower level edges. For this purpose, we keep actual shortest path information for each corresponding within edges. A high level between edge cannot 34 begin Step 1: Assume we have a level It Hi Ti graph defined on a level I: + 1 ST; Find SG} and SG} to which START and DEST belong respectively; We maintain G(V, E) together with H"(Pk, A") as adjacency lists; Let 1 be the level number of LUBgc;(SG}, SGI); f°r(P=k;p>=1;P--) Mark all boundary nodes in SN(SC(S2(SG]))) with level p — 1; Step 2: /\(START) = 0; FSet = { START}; ESet = 0; while (FSet 96 0){ Select u from F'Set with minimum /\(u) + f(u, DEST); /* the function f (u, DEST) estimates the cost of the shortest path from node u to DEST */ FSet = FSet-{ u}; ESet: ESet U { u }; if (u = DEST) stop; Let r be the highest level number of the edges incident on node u; if (u is NOT marked) { /* Start Ascending Phase */ for each level p where min(l — l,r) S p S 1' { for each edge u —z-> v with level p { if ((11 ¢ FSet) and (u g ESet)) { Mo) = Mu) + z; FSet = FSetU{11};} else { if (Mv) > Mu) + 2) Mv) = Mu) + z; 1 } } } else { /* u is marked */ /* Start Descending Phase */ Let [i be the marked level number on node u; for each edge u ——z—> v with the levels from B to r { if ((v Q ESet) and (v Q ESet)) I X(v) = Mu) + z; FSet = FSct U I v I;} else { if (Mv) > Mu) + 2) Mu) = Mu) + z; } } } 1 end Figure 2.7: SPAH: Finding a Minimum Cost Path from START to DEST 35 be specialized in terms of lower level edges. This is because all between edges in ULIU}; B; are the edges in G(V, E). 2.4 Performance Analysis The algorithm SPAH can be classified as a variation of A* algorithm in that it uses the function f(u,DEST) to estimate the cost of the shortest path from node u to DEST. In the domain of road maps, the function f(u,DEST) com- putes the Euclidean distance between the node u and DEST. This is possible because the coordinates (i.e. longitude and latitude) of all nodes on a road map are assumed to be available. Assuming (u.:1:,u.y) and (DEST.:r,DEST.y) are the corresponding coordinates of the nodes u and DEST, f (u, DEST) computes \flDESTJE — u.a:)2 + (DEST.y - u.y)2. Since Euclidean distance is not an over— estimated cost between node u and DEST, our algorithm finds the optimal shortest path. The detailed proof for the optimality of A* algorithm is found in [30]. The following Figure 2.8 shows A* algorithm which finds the shortest path from the nodes START to DEST 011 G(V, E). Note that l(:r,y) represents the cost associated with each edge (:r,y) E E. A* shortest path algorithm are Shown to be more efficient than the breadth- first search single pair shortest path algorithm when the database can fit in main memory [64, 80]. For algorithm SPAH, this is the case since we showed in Theorem 2.3.1 that its explored search space is at most ES which is small enough to fit in main 36 memory. begin For each node u E V, Mu) = 00; Let /\(START) = 0, ESet = {START}, and ESet = 0; while (FSet 75 0) { Select a node u in FSet for which /\(u) + f(u, DEST) is minimum; FSet = FSet — {u} and ESet = ESet U {u}; If (u = DEST), then /\(u) is the Shortest path cost and Stop; else { For every edge (u,v) in E, if /\(v) > z\(u) + l(u,v) { Let /\(v) = A(u) + l(u,v); Let ESet = ESet U {12} if v a (ESet U ESet); } I} end Figure 2.8: A* Algorithm: Finding a Minimum Cost Path from START to DEST For our empirical analysis of Algorithm S PAH , we create two dimensional grid graphs G(V, E) with 4 adjacent nodes. Two dimensional grid graphs are considered as typical examples of road maps [64, 91]. 111 grid graph G, IVI and IE I are equal to 800 x 800 nodes and 4 x 800 x 799 directed edges. From G( V, E), we create a level 4 subgraph tree ST where each level 1 subgraph SGI(V,~‘, E3) has IV,-1I = 100 x 100, IE,-1I = 4 x 100 x 99. Thus, the level 4 subgraph tree ST consists of 64 level 1, 16 level 2, 4 level 3, and 1 level 4 subgraphs, which are shown in Figure 2.9. We use the above level 4 subgraph tree ST to generate level 3 HiTi graph for the empirical analysis of SPAH in the following subsections. ........................................ EEC] DEED [3.513 [:1 El g;j———............. 33GB in. - 1.13? 1:11;! 1:11;}, , ‘ “"‘ ...,... EDD? EU: :‘Ej"t':i"§gtj""1jg5-—W~w 531:1-131 in. - :1 1:1 , 1;}? {g _ D»— . .... . 1.... ............................ I...------~ p----—--o —-—----- ..-------o : -s-------v s-------o; -~-------I ~-------o; -_~-------o \-------l:' Figure 2.9: A 800 x 800 grid graph partitioned according to the level 4 ST 2.4.1 Comparison between SPAH and A* algorithm To show the search Space savings of SPAH over the traditional A* algorithm (pre- sented in [91]), we create 5 level 3 HiTi graphs where 10 S INEI S 20 and where the edge cost is generated based on a uniform distribution [100,120] with 5 different seeds. We represent IN,‘I as the total number of boundary nodes defined on level 1 subgraph SG}. Next, we create 5 plain grid graphs which are simply unions of PE1 and level 1 subgraphs used in the above 5 level 3 HiTi graphs. For each level 3 HiTi graph and plain grid graph created above, we compute 20 different Shortest paths randomly prefixed pairs of source and destination nodes. Let H” and A, be the total number of edges visited by S PAH and A* respectively. Note that throughout this paper, we multiply the estimation of f (u, DEST) by 100 to normalize the estimation with respect to the edge cost. The number 100 is used for this normalization because the Euclidean distance between two adjacent nodes is l and the edge cost between 38 two nodes is at least 100. We compare SPAH and A* by observing the ratio An/Hn. these values are then averaged over the 5 different level 3 Hi Ti and plain grid graphs with the same source and destination nodes. They are shown in Figure 2.10 where the numbers on :1: axis represent 20 ordered < source, destination > pairs with path length increasing from 1 to 20. 35 30 25 20 15 10 Art/H11 l I I l l l l l l 0 2 4 6 8101214161820 the number of shortest paths Figure 2.10: Performance comparison between A* and Algorithm S PAH Figure 2.10 clearly shows how effectively Algorithm S PAH cuts down the search space over the traditional A* algorithm. It is interesting to observe that the ratio An/ Hn increases rapidly as the path lengths from source to destination increase. This occurs because the search space A* needs to explore grows exponentially whereas that of S PAH grows very slowly due to the hierarchical structure of H iTi graphs. 2.4.2 Effects of edge cost distribution In this section, we studied the effects of edge cost distributions on the performance of SPAH. For this study, we generate 4 x 5 (i.e. 4 uniform distributions with 5 seeds) level 3 HiT i graphs where 10 S INEI S 20. Note that the 4 uniform distribu- 39 tions [100,120], [100,200], [100,300], and [100,500] correspond to 20%, 100%, 200%, and 400% variations of edge costs respectively. We apply S PAH to each level 3 Hi Ti graph by randomly creating 50 different < source, destination > pairs and then aver- aging the cost. Let MA and MD symbolize SPAH when the estimator f(u, DEST) gives Euclidean distance and zero (i.e. no optimization based on Euclidean distance) respectively. Figure 2.11 Shows the effect of edge cost distributions on the perfor- mance of SPAH in terms of flu (i.e. the total number of visited edges) of MA and MD. Note that the values of H" are averaged over 5 seeds. 72000 70000 68000 66000 64000 Hn 62000 60000 58000 56000 54000 52000 1 J 1 l l J l 0 50 100 I50 200 250 300 350 400 Edge cost variation in % IIIIIIII+I Figure 2.11: Effect of edge cost on the performance of MA and MD When f (u, DEST) gives Euclidean distance estimation, the performance of SPAH deteriorates as the variation of edge cost is increased. The primary reason is that increasing the variation degrades the quality of Euclidean distance estimation to shortest path. This degradation of Euclidean distance estimation seems to have more severe impact on the performance of MA for the first part of edge cost variation (i.e. between 20% and 200%) than for the rest (i.e. between 200% and 400%). Unlike MA, MD gives a very stable performance with varying edge cost distribution. 40 2.4.3 Effects of the number of hierarchical levels We examined the effects of different levels of Hi Ti graphs on the performance of S PAH (i.e. MA and MD). We constructed different levels of HiTi graph out of the same set of level 1 subgraphs. For this analysis, we create 2 (i.e. 4 g |N§| g 8 and 10 S INBI S 20) x 5 (i.e. an edge cost distribution [100,200] with 5 seeds) level 3 HiTi graphs. Similarly, we create 2 x 5 level 1 and level 2 HiTi graphs. Then, we measure average H,, of MA and MD the same way as we did in section 3.2.2. They are shown in Table 3.1 and 3.2. level 1 H iTi graph level 2 H iTi graph level 3 H iTi graph Hn for MA H,, for MD 57428 72227 57336 71953 57063 69392 Table 2.3: Effects of levels of HiTi graphs on H" when 10 S IN,-1I S 20 level 1 H iTi graph level 2 H iTi graph level 3 H iTi graph Hn for MA Hfl for MD 50650 59684 50370 59642 50669 60114 Table 2.4: Effects of levels of HiTi graphs on H, when 4 S IN,‘| S 8 AS we can see from table 2.3 and 2.4, higher level Hi Ti graphs do not necessarily guarantee the better performance when using S PAH than the lower level Hi Ti graphs. It depends 011 the the entire search space for S PAH . The entire search space for S PAH on a level k H iTi graph, represented by T", is formulated as follows: 41 nk 27111—1 _ "Is-2 T5 = 2 IE‘|+}:|W“I+—ZIW" 11+;— 21W“ 21+ i=1 i=1 +—Z|W?|+— EEZIW‘I forl=1,k 3i=l Note that IEII represent the total number of edges of a level 1 subgraph and n} represent the total number of level I subgraphs where 1 S l S k. If T” < T" where p > q, then S PAH is likely to perform better on the higher level p HiTi graph than on the lower level q HiT i graph. Otherwise, the higher level HiTi graph is not likely to provide a performance advantage over the lower level Hi Ti graph. Table 2.5 shows the values of T1, T2, and T3 corresponding to level 1,2, and 3 H iTi graphs used in the tables 2.3 and 2.4. T1 T2 T3 10 g IN,‘| g 20 99000 98737 98487 4 S IN,‘| S 8 82272 83364 83332 Table 2.5: Effects of levels of HiTi graphs on T" Table 2.5 verifies our conjecture on the performance of S PAH for different levels of H iTi graphs. An interesting thing to note from the tables 2.3 and 2.4 is that the search space (i.e. H,,) does not vary significantly going beyond level 1 HiTi graph. This is because our experiments were done with the grid graphs havingthe following property. That is, the difference between T5 and Ti“ does not vary significantly as i increases. We believe that for most road maps this is the case. As a result, creating 42 higher level Hi Ti graphs does not contribute to the reduction in computation time. From this observation, we can conclude that level i HiTi graph is good enough for road map applications where the difference between Ti and TH“ is small. 2.5 Updating HiTi Graph There are two different types of updates for a recursive relation G(V, E) (e.g. topo— graphical road map). They are edge addition, and edge deletion. An edge addition (deletion) can connect (resp. disconnect) two existing nodes belonging to either a level 1 subgraph or two different level 111 subgraph where 1 S m S 11:. Since updating G( V, E) may require modifying level I: HiTi graph, special update problems associated with HiTi graph will need to be addressed. In the following subsections, we will discuss the update problems on level I: HiTi graph. 2.5.1 Edge Deletion Suppose edge (2:, y, z) is deleted from E. Then, we have two possible cases to consider. The first case case is when (x, y, z) is in level m between edge set where 1 S m S k. The second case is when edge (x,y,z) is in E} where 1 S i S n]. The first case may change boundary nodes (i.e. a: and y) to non-boundary nodes and it may invalidate existing within edges. Boundary node :1: (y) becomes non- boundary nodes if there is no other level m between edges incident on node a: (resp. 43 y). Assume that node x and y become non-boundary nodes. We need to delete the nodes :1: and y from their corresponding boundary node sets. For this purpose, we first need to identify two level 1 subgraphs SG} and SG} where :1: E SN(SG}) and y e SN(SG}). Then, we delete node :1: from SN( 7;,Sj,({SG}})) and node y from SN(UI;1S]4({SG]})). Next, we find all existing invalid within edges (a,b,fz(a,b)) in Sw(SA({SG,l,SG]-})). A within edge (a,b,fz(a,b)) becomes invalid in the following three cases. case 4.1.1 Nodes a or b is :1: or y. case 4.1.2 A path from node a to b is disconnected. case 4.1.3 The value of fz(a,b) is not correct. For the case 4.1.1 and 4.1.2, we delete (a,b, fz(a,b)) from the corresponding within edge set. In the case 4.1.3, simply replace the incorrect value of fz(a, b) with a correct one. In the second case, the deletion of edge (:r,y,z) from E} may invalidate some existing within edges in Sw(SA({SGI})). Thus, we need to identify all the existing invalid within edges. After the identification of all invalid within edges, we perform the same actions as we did in the cases 4.1.2 and 4.1.3. 2.5.2 Edge Addition Suppose a new edge (9:, y, z) is added to G(V, E). Then, there are two cases we need to consider. The first case is that edge (2:, y, 2) becomes a new level m between edge 44 where 1 S m S k. The second case is that edge (:r,y, z) is added to E}. The first case changes non-boundary node :1: (y) to boundary node if there is no other level m between edges incident on node a: (resp. y). Assume that node :1: and y become new boundary nodes. Then, it is necessary to add node x and y to their corresponding boundary node sets. For this purpose, we need to identify two level 1 subgraph SG} and SG] where :1: E V,‘ and y E VJ-l. Then, we add node a: to SN( I;S£,({SGI})) and node y to SN( I;,.S],({SG}})). Next, we add edge (:r,y,z) to level m between edge set SB(S,',I‘({SG,‘}). The addition of this new level 772 between edge may cause the following cases in Sw(SA({SG,l, SG}})). case 4.2.1 Need to create new level ml within edges where 1 S m1 S 19. case 4.2.2 Need to update f,(a,b) of some existing level m1 within edges where 1 S m, S k. In the second case, the addition of edge (1:, y, z) to E3 may cause the same cases as the cases 4.2.1 and 4.2.2 for edges in Sw(SA({SG}}). 2.6 Conclusion In this chapter, we developed a new graph, called a Hi T i graph, to model very large topographical road maps. Hi Ti graphs provides a powerful formal framework for structuring topographical road map data in a hierarchical fashion. We have em- pirically shown that our proposed shortest path algorithm, based on HiTi graphs, significantly reduces the search Space for computing the minimum cost path over a 45 very large topographical road map. Our algorithm is also empirically analyzed by varying edge cost distributions and the number of levels of HiTi graphs. Finally, we have investigated the problems of updating a Hi Ti graph in this paper. The applica- tions of Hi Ti graph structure go beyond the domain of topographical road maps. It can be applied to any very large recursive relations where hierarchical abstractions of data are useful. In addition to SPSP problem discussed in this chapter, there are quite a few other database research problems associated with the automobile navigation system. They include data models for storing large amounts of road map data, the determination of current location in the database, and physical storage structure of road maps. We give our survey on these problems in Appendix C of this thesis. Chapter 3 HiTi Graph-based Parallel Shortest Path Algorithms In this chapter, we study for both intra and inter query parallel processing for SPSP problems. Based on H iTi graph structure, we propose two parallel shortest path algorithms named PASPAH and ISPAH for intra and inter query SPSP problems, respectively. We empirically analyze the performance of PASPAH and ISPAH al- gorithms on two-dimensional grid graphs by implementing them on BBN GP1000 shared memory multiprocessor system. The analysis of PAS PAH shows that the average execution time of PASPAH is not significantly better than those of S PAH and M 0T Opar. From these performance, we conjecture that inter query shortest path problem has more potential for the parallel processing than intra query shortest path problem. 46 47 3.1 Introduction A Single Pair Shortest Path (SPSP) computation is fundamental to topographical road map queries. In order to reduce the large search space of topographical road maps, a graph traversal approach is usually used. Although the graph traversal ap- proach reduces the search Space significantly, its search space is still large for most real world road maps. To cope with this limitation, two new approaches are pro- posed I5, 59]. Agrawal and Jagadish [5] studied SPSP computation in the context of problem of performing efficient search over disk-resident massive graphs. Jung and Pramanik [59] proposed a new graph model, named Hi Ti graph model for this prob- lem. These two methods narrow down the search space very significantly at the cost of storing the partially precomputed path information. However, their performance improvement is still limited to a Single processor case. For interactive applications, further reduction in response time is beneficial. Thus, efficient parallel SPSP algo- rithms have been developed to improve the performance further. Little research has been done on parallel SPSP algorithms. Mohr and Pasche [72] presented a new parallel shortest path algorithm named OTOpar which is a parallel implementation of 0T0 algorithm. OTOpar uses two processors where each processor uses A* algorithm with Manhattan distance estimation to build its corresponding tree, one rooted at the source node and the other rooted at the destination node. Their empirical analysis shows that the average execution time of OTOparis roughly half of 0T0. However, OTOpar algorithm is not scalable at all. That is, their performance improvement is limited by only two processors. Thus, we need more scalable parallel 48 SPSP algorithms than OTOpar. In this chapter, we study the parallel processing for intra as well as inter query shortest path problems. Intra query SPSP problem deals with parallelizing a single transaction of SPSP problem. The parallel SPSP algorithm proposed by Mohr and Pasche [72] belongs to this category. Inter query SPSP problem deals with paral- lelizing multiple single-pair shortest-path computations. Inter query SPSP problem arises in the domain of automobile navigation systems where many vehicles send their shortest route computation requests to a. central server. Then the server must be able to handle these multiple SPSP computation requests satisfying a real time constraint. We have developed two parallel SPSP algorithms. One is for intra query SPSP problem and the other for inter query SPSP problem. Both algorithms are based on HiT i graph introduced in Chapter 2. Hi Ti graph structure provides the opportunity for fine-grained parallel processing for intra query SPSP problem. This is because multiple parallel Shortest-path computations can be initiated based on the nodes at a higher level of abstraction of a topographical road map. Note that H iTi graph structure can capture a hierarchical abstraction of the topographical road map and it is easy to identify the appropriate nodes at the higher level. Even though we have done extensive analyses on parallel processing of H iTi graph, we also have explored the possibilities of using H iTi graph in a distributed environ- ment. 111 a distributed environment, H iTi graph provides an efficient structure for computing a distributed transitive closure. That is, it allows a dynamic decomposi- tion of a transitive closure computation. This dynamic decomposition in turn allows 49 us to minimize the communication costs between the sites involved for the transitive closure computation. Note that reducing the communication costs is a major opti- mization goal for distributed transitive closure computations. We give the detailed description of our proposed distributed transitive closure algorithm in the appendix A of this thesis. The rest of the chapter is organized as follows. In section 3.2, we give the detailed description of intra query parallel Shortest path algorithm and empirically compare its performance with those of the previous work. Section 3.3 discusses two inter query parallel shortest path algorithms. We give empirical performance analysis of these two inter query parallel algorithms in this section. Finally, section 3.4 gives the concluding remarks. 3.2 Intra Query Parallel Shortest Path Computa- tion The main performance overhead of the shortest path computation comes from the size of the search Space it needs to explore. In this section, we describe a PArallel Shortest PAth algorithm based on H iTi graph (PASPAH). To Show the basic idea behind PASPAH, we first graphically show the explored search space of the algorithms A* and OTOpar in Figure 3.1. Note that in the figure, the rectangle and eclipse represent the entire and the explored search Space respectively. Arrows indicate the direction of exploring the search space. It is intuitively obvious that A* algorithm 50 e) henflaedeeuchspeetorA‘nlgonuen b) beWmhw-Iwmtmm Figure 3.1: Relative size of explored search Space for A* and OTOpar explores the search space in an elliptical shape, which is shown in Figure 3.1 a. Since OTOpar explores the search Space by applying A* algorithm from both the source (i.e. S) and destination (i.e. T) nodes, the Size of its explored search Space becomes smaller than that of A*. This is clearly illustrated in Figure 3.1 b. Based on this observation, we can infer that, if a middle node M lying on the shortest path is known, we can further narrow down the search space by applying A* algorithm from the middle node to both source and destination, simultaneously. Our PAS PAH is based on this idea which is illustrated in Figure 3.2. Figure 3.2: Explored search space size of PASPAH 51 3.2.1 Use of HiTi Graph for Identifying Middle Nodes The problem that remains is how to identify the middle node without computing the shortest path. This problem can be partially resolved by taking advantage of HiTi graph structure, which we explain through Figure 3.3. Figure 3.3: Level 1 HiTi graph created from level 1 subgraphs In Figure 3.3, the rectangles symbolize the level 1 subgraphs SGl to SG9 and their boundary nodes are represented by B1 through B24. Level 1 HiTi graph de- fined on these level 1 subgraphs is then easily represented by all the boundary nodes interconnected by the dotted arrows. Assume that we want to compute the shortest path from node S in SG7 to node T in 302. It is clear from the figure that the shortest path from the nodes S to T must pass through at least one boundary node of S G; and one boundary node of S G2. From this observation, we can approximate the two possible sets of middle node candidates for the shortest path. They are 52 M} = {818, B21} and M2 = {B2, B3, B6} where set M] is chosen in this example. Then, the shortest path cost from the nodes S to T, represented by PSP_COST(S, T), can be computed by exploring the search space from each node in M1 to S and T simultaneously. Let S P_COST(X , Y) represent the shortest path cost from the nodes X and Y. To obtain PSP_COST(S, T), we first compute SP_GOST(B18,S), SP_C’OST(Bl8,T), SP-GOST(B21,S), and SP..COST(B21, T) in parallel. Note that each processor use A* algorithm to compute SP_COST(X, S) and SP-GOST(X, D) where X 6 M1. PSP-COST(S, T) is then obtained by selecting the minimum of { SP_COST(BI8,S) + SP_COST(BI8,T), SP_COST(B21,S) + SP_COST(B21,T) }. Note that we choose M1 over M2 since the size of set M1 is smaller than that of M2. This way, we can minimize the number of processors involved for the computation. As it might be noticed from the above example, there is a possible performance bottleneck problem in this approach. We re—take Figure 3.3 to explain the bot- tleneck in the following example. Assume the shortest path computation from the nodes A in SG4 to D in SG6. Since the number of the boundary nodes of SG4 and SG6 are equal, we will randomly choose one of them. In this example, set {38,811,315} is selected. Then, in order to compute SP-COST(A, D), our method will compute SP_COST(BS,A), SP-COST(B8, D), SP-COST(BII,A), SP_COST(B11,D), SP_COST(BI5,A), and SP_COST(BI5,D) in parallel. As we can see from Figure 3.3, since node A is located a lot closer to node Bll than the nodes B8 and 815, the actual shortest path from A to D is likely to pass through 53 the boundary node B11. In such case, SP-GOST(B11,A) and SP-GOST(B]1,D) computations will be done a lot faster than the others. However, we have to wait until the rest of computations end, although we al- ready got SP-COST(Bll,A) and SP_COST(Bll,D). This waiting time is re- quired to guarantee the optimality of SP-COST(BII,A) + SP-COST(B11,D) among the rest of results, which becomes the performance bottleneck of PAS PAH . To cope with the performance bottleneck, it is necessary to minimize the waiting time. The minimization of the waiting time can be realized if we have a cor- rect and efficient stopping condition. The stopping condition allows us to decide the optimality of SP_COST(Bll,A) + SP-GOST(BII,D) before completing all the computation of SP_COST(B8, A), SP-C'OST(BS, D), SP-COST(815, A), and SP-COS'T(815, D). 3.2.2 Efficient Stopping Criteria Before we explain stopping condition, we first examine how S P-COST(B8, A) in the above example is computed in A* algorithm. In A*, the search space is explored by permanently labeling a node :1: the shortest path cost from B8 to :1: until the next selected node :1: becomes the node A. The next unlabeled node a: is selected from the search space such that SP_COST(B8,:1:) + EuDist(a:,A) is minimum. Note that EuDist(:c,A) is the Euclidean distance estimation from the nodes :7: to A. Based on this observation, we now formally describe the stopping condition in the following proposition. 54 Proposition 3.2.1 Given a directed graph G(V, E), assume that the shortest path from the nodes S to T must pass through at least one node from set B = { B}, 82, ..., B" }. Let set C Q B where SP-GOST(X, S) and SP_COST(X, T) are already computed for each node X E (7 at time to. For each X E B — C, let N(X, S, to) and N(X, T, to) be the next node at time to selected by A* algorithm for the computation of SP-COST(X, S) and SP_GOST(X, T) respectively. Then, PSP-COST(S, T) is costo = mianc { SP_COST(X, S) + SP_C’OST(X, T) } if costo S minyeB-c { SP-COST(Y,N(Y, S, 150)) + EuDist(N(Y, S, to),S) + SP-COST(Y,N(Y,T,to)) + EuDist(N(Y, T, to),T) }. Proof: Assume that costo is not PS P-COST(S,T) although costo S miny53_c { SP-COST(Y,N(Y,S,to)) + EuDist(N(Y,S,to),S) + SP_COST(Y,N(Y,T,to)) + EuDist(N(Y, T, to),T) }. Then, at some time t, > to, there must be some node Z G B — C such that SP-COST(Z,S) + SP_COST(Z,T) < costo. However, SP_C()ST(Z, S) + SP-COST(Z, T) cannot be smaller than SP_COST(Z, N(Z, S, to)) + EuDist(N(Z, S, to),S) + SP_COST(Z, N(Z, T, to)) + Eu Dist(N(Z, T, to), T). This is because SP_COST(Z,N(Z,S,t)) + EuDist(N(Z,S,t),S) and SP-COST(Z,N(Z,T,t)) + EuDist(N (Z, T,t),T) are always monotonically increasing in A* algorithm where to < t g t]. As a result, we have costo g miny€3_o { SP-COST(Y,N(Y,S,to)) + EuDist(N(Y, S, to),S) + SP-COST(Y,N(Y, T, to)) + EuDist(N(Y, T, to),T) } < SP_COST(Z, S) + SP_COST(Z,T) < costo, which is a contradiction. Cl 55 To exemplify how proposition 3.2.1 works, we show the stopping con- dition for the computation of SP-COST(A, D) in Figure 3.3. Assume that SP-COST(BII,A) and SP_COST(B11,D) are obtained at time to. Then, PSP_GOST(A,D) is costo = SP_COST(BII,A) + SP-GOST(BII,D) if costo S min { SP-COST(B8,N(B8,A,to)) + EuDist(N(B8,A,to),A), SP_COST(815, N(Bl5, D, 70)) + EuDist(N(Bl5, D, to), D) }. 3.2.3 Formal Description of PASPAH In the previous section, we described the basic idea as well as a theoretical foundation of PASPAH when level 1 HiTi graph is given. We now formally present PASPAH in Figure 3.4 where we use the same notations as those defined in the previous chapter 2.3. Our PASPAH is developed for Shared memory multiprocessor systems with the following architecture assumptions: 0 All processors have their own local memory. 0 All Processors communicate with each others through globally shared memory. 0 All processors share disk. Based on the preceding assumptions, PAS PAH accesses a level k + 1 subgraph tree, level Is: HiTi graph H"(P",Ak), and level 1 subgraphs U?=‘1SG2(V,-1,E,-l), which are stored in the globally Shared memory. 56 begin Step 1: . for(j=k+l;j>=l;j——) . Mark all boundary nodes in SN(SC(S,’,(SG1, SGI))) with level j — 1; Let l be the level number of LUBSG(SG;, SGI); f0f(J'=l-1;j>=1;j--) _ Mark all boundary nodes in SN(SC(SQ(SG:, SG}))) with level j — 1; Step 2: ' If(|3~(Sit-1(SGI))I .<. ISN(S.I4-1(SGi))l) 501;] = 32"(501); else SGLII = Sfl(SG}); Step 3: for each boundary node a: in SN(SGI,I1) Enqueue (9:,S) and (2:,T) to BQ; Step 4: while (IBQI > 0) { Dequeue (x,y) from BQ; A next available processor performs S P(:c, y); /* 'SP(:r, y) computes the shortest path from :1: to y */ /* It checks stopping criteria given in proposition 3.2.1 when the computation is done */} end Figure 3.4: PASPAH: Find a shortest path from S in SG; to T in SC] 57 PASPAH consists of four steps. At step 1, the boundary nodes of HiTi graph is marked with edge level numbers. This marking step is necessary to avoid traversing the unnecessary low level edges incident on the marked nodes. When a marked node is visited, the corresponding marked level (i.e. lowJevel) Specifies the lowest level of the edges to be traversed from the marked node. Since the highest level (i.e. hithevel) of the edges incident on the marked node can be easily identifiable, the nodes expanded directly from the marked node are only those reachable through the edges with the levels ranging from lowJevel to hithevel. This marking step guarantees that the explored the search space of PASPAH is at most E,1 U E',1 U 53(5c(U"+'Sii(SGLSGI))) U Sw(SC(Ui:.’Si(SGl.303))- i=1 At step 2, we determine the level I subgraph SGI;1 whose boundary nodes will be used as the middle nodes as explained in the previous section 3.2.1. The step 3 is self explanatory. In step 4, the shortest path computation S P(:r,y) for each pairs of nodes (2:,y) in BQ is performed in parallel by the available processors. S P(:1:,y) computes the shortest path from the nodes :1: to y by using A* algorithm. Note that At: algorithm in S P(:z:, y) explores the neighbor nodes of the currently expanding node 2 by following only those edges whose level number is not less than the marked level number of the node 2. Whenever each processor finish its computation, it exclusively checks the stopping condition described in the previous section 3.2.2. If the condition is satisfied, then it aborts all the computations being performed by other processors. Otherwise, it simply stops its computation. 58 3.2.4 Performance Evaluation We empirically compare PASPAH with 0T0, OTOpar [72] and SPAH [59] algo- rithms. S PAH is an improved A* algorithm which takes advantage of HiTi graph structure. For fair comparisons, we modified 0T0 and OTOpar so that they also uti- lize HiTi graph structure. In other words, in the modified 0T0 and OTOpar named M 0T0 and M OTOpar respectively, we use S PAH algorithm for building two trees from both source and destination nodes. Note that we use Euclidean distance for the lower bound estimation in S PAH rather than Manhattan distance of OTOpar. This is because Manhattan distance estimation does not guarantee the optimal shortest path generation. We implemented SPAH, MOTO, MOTOpar and PASPAH on a BBN GP1000 Shared memory multiprocessor system, which has a nonuniform memory architec- ture. The BBN GP1000 multiprocessor currently consists of 85 nodes, each one with 4MBytes of local memory, linked together by a high speed butterfly switch. In this system, the globally shared memory is the sum of the memories local to all processors. Thus, the size of available main memory increases with increasing number of nodes in the system. The BBN GP1000 system can have up to 250 processing nodes. For our empirical analysis, we create two dimensional grid graphs G(V, E) with 4 adjacent nodes as we did in chapter 2.4. In grid graph G, IVI and IEI are equal to 400 x 400 nodes and 4 X 400 X 399 directed edges. From G(V, E), we create a level 4 subgraph tree ST where each level 1 subgraph SG}(V,-1,E}) has IV,‘I = 50 x 50, 59 IEII = 4 x 50 x 49. The level 2 subgraph tree ST consists of 64 level 1, 16 level 2, 4 level 3, and 1 level 4 subgraphs. Based on the above level 4 subgraph tree, we create level 1 HiTi graph which will be used throughout this section. To Show performance comparison between PASPAH, MOTOpar, MOTO, and SPAH, we create 5 level 3 HiTi graphs where 3 S INEI S 8 and where the edge cost is generated based on a uniform distribution [100,120] with 5 difierent seeds. Note that INRI is the total number of boundary nodes defined 011 level 1 subgraph S G}. For each level 1 Hi Ti graph, we compute two different sets SET I and SET II of 20 shortest paths randomly prefixed pairs of source and destination nodes. SET I consists of the pairs whose source node location within the corresponding level 1 subgraph is not far apart from the destination nodes. A simplified example of these two sets is shown in Figure 3.5. MbHIuw‘huet-M“i 9 . :J sg/V \_' . “ :rj/1/1 0 ‘ . ..r/Tw \. .444 (/ ' v \I \\\3 . x I -.u,- x //‘3' 1V?“ 3544/3/4, /5,/. .9 &\-.:‘.\§~o Mfy/fla xx,- ‘sEth-‘ek “Wu/ .(ouéu’ I Figure 3.5: An example of sets I and II We then compare the above 4 algorithms by observing their average execution times of two sets SET 1 and SET 11 over 5 different level 1 HiTi graphs. The per- formance of the four algorithms are given in Figure 3.6 a and b. As it is shown in 60 Figure 3.6, PASPAH requires 42 processors to achieve the maximum parallelism in this empirical analysis. Figure 3.6 a Shows that PASPAH outperforms the rest of the algorithms. However, Figure 3.6 b Shows that PAS PAH does not perform sig- nificantly better than S PAH and M OTOparl. This is because, when source nodes comes from SET 11, the explored search space of PAS PAH becomes almost same as those of SPAH and MOTOpm'. Thus, the average execution time of PASPAH is not significantly better than S PAH and M OTOpar. From this observation, we conjecture that the parallel processing for intra query SPSP problem is not promising. We have observed that S PAH performs better than M 0T0 and M OTOpar, which is a different result from the one given in [72]. Our empirical observation can be explained by the following two reasons. The first reason is that HiTi graph structure significantly reduces the advantage of using the two tree expansion approach given in [72]. In other words, the most of the nodes in the two level 1 subgraphs (i.e. where source and destination nodes are in) are already explored when two trees start to include the nodes in Hi Ti graph. This conjecture is verified in the appendix B of this thesis where the two tree expansion approach performs far better than than that of one tree expansion approach when H iTi graph is not used. The second reason is that the lower bound estimation of M OTOpar is Euclidean distance which provides a lot tighter bound than Manhattan distance of OTOpar in [72]. As a result, compared with OTOpar, M OTOpar takes much longer time to stop building the two trees before it finds the shortest path. 1In Figure 3.6 b, the minimum execution time of PAS PAH , SPAH , M OTOpar, and M 0T0 are 8.36, 8.46, 9.92, and 15.56 seconds respectively. 61 60 1 1 1 1 1 1 1 1 5O . paspah G— .. spah ~+ - 40 moto B— — motopar ->(- - Seconds 30 20 . lO 0 3 .9. 0 51015202530354045 the number of processors 85) 250‘11111111 paspah 9— 200 spah .+. . - moto B— I 150 motopar ->(- - ‘ Seconds 100 50 0 0 51015202530354045 the number of processors b) Figure 3.6: Performance of PASPAH, SPAH, MOTOpar, and MOTO over: a) SET I b) SET 11 62 3.3 Inter Query Parallel Shortest Paths Computa- tion In this section, we study the parallel processing for computing inter query Shortest path based 011 Hi Ti graph. Efficient and fast computation of multiple SPSP shortest paths are very critical for those automobile navigation systems where all route com- putations are performed 011 a Single server. For this purpose, we propose a new inter query parallel shortest path algorithm in the following subsection. 3.3.1 Formal Description of ISPAH Algorithm As it was already shown in section 3.2.4, S PAH outperforms M 0T0 and M OTOpar. Due to the number of processors P AS PAH requires, PAS PAH is not appropriate for inter query parallel shortest paths computation. We therefore use S PAH as the unit operation for parallelizing inter query Shortest paths computation. In other words, S PAH is performed in parallel for each Shortest path computation request. The detail description of the inter query parallel shortest path algorithm, named Inter SPAH (ISPAH), is shown in Figure 3.7. Algorithm ISPAH executes SPAH in parallel on a shared memory multiprocessor system. For SPAH running on each processor, it accesses both local memory and globally shared memory for the computation. Algorithm SPAH accesses the glob- ally shared memory only when it needs to accesses G(V, E) or level ls: HiTi graph Hk(Pk, Ah). Other than that, SPAH accesses the local memory. Note that there is 63 begin Adjacency lists of G(V, E) and level It HiTi graph H"(P", Ak) are stored in the globally shared memory; Let R contain a set of source and destination nodes pairs (x,y); for each (x,y) E R do in parallel { Find SG; and SG; to which :1: and y belong respectively; Let I be the level number of LUBS(_;(SG;, SGI); f0r(P=l-1;P>=1;P--) Locally mark all boundary nodes in SN(SC(SZ(SG]-))) with level p — 1; Perform Step 2 of SPAH for (x,y); } ' /* SPAH is given in Figure 2.7 in section 2.3.1 */ end Figure 3.7: ISPAH algorithm no memory access contention for reading or writing a local memory. In the following section, we empirically analyze the performance of I SPAH . 3.3.2 Performance Evaluation We implemented I S PAH on a BBN GP1000 shared memory multiprocessor system. For our empirical analysis, we used the same grid graph as described in section 3.2.4. Based on these data sets, we computed 43 randomly prefixed shortest paths. The performance is then measured in terms of the average speedup Tn, where Tfl represents the total time taken by n processors to compute M shortest paths. We express the speedup S,, of 12 processors as T}/ T,,. Figure 3.8 shows Sn as n is increased from 1 to 43. As we can see from Figure 3.8, the Speedup of I S PAH increases almost linearly up to 10 processors, after 10 it is beginning to level off, and after 25 processors, 64 4'5 I I I I f I I I 40 ISPAH e- ‘ 35 30 Speed 25 U 1) 20 I’I I I I I 0 51015202530354045 the number of processors Figure 3.8: Performance of ISPAH the performance is ever deteriorating. This occurs because of the globally shared memory access contentions. In I S PAH , all processors have to access a single H iTi graph in a globally Shared memory, which causes a severe memory access contention when more than 25 processors are used. In order to verify our conjecture of this memory access contention, we modified ISPAH so that an entire level I: H iTi graph H "(Pk , A") is replicated on the local memory of each processor. This revised I S PAH is named a Modified ISPAH (MISPAH). We empirically analyzed M I SPAH the same way as we did for ISPAH. Figure 3.9 shows the speedup of MISPAH up to 43 processors showing the advantage of replicating a level I: H iTi graph on each processor. Although M l S PAH performs better than I S PAH , it is scalable up to 41 processors. From this analysis, we conjecture that the parallel processing for inter query SPSP problems are much more promising than intra query SPSP problems. 451lTlTllT 40 t MISPAH e— " 35 ~ — 30 - - Speed 25 L- Up 20 ~ 15 - - 10 — 5 0 v 1 1 1 l 4 #1 1 0 51015202530354045 the number of processors T I Figure 3.9: Performance of MISPAH 3.4 Conclusion In this chapter, we studied the parallel processing for intra as well as inter query shortest path problems on topographical road maps. Based on Hi T i graph structure, we first developed a parallel shortest path algorithm named PAS PAH for intra query shortest path problem. Hi Ti graph structure provides an opportunity for developing more scalable parallel shortest path algorithms than two tree building approach used in M OTOpar algorithm. In M OTOpar algorithm, only two processors are used for building two trees, one from source and the other from destination. As a result, M OTOpar algorithm is not scalable at all. We empirically analyzed the performance of PAS PAH by comparing its execution time on grid graphs with those of M OTOpar and SPAH. We used a BBN GP1000 Shared memory multiprocessor system to implement PASPAH, MOTOpar, and SPAH. The BBN GP1000 has a nonuniform memory architecture. Note that SPAH presented in chapter 2 is the H iTi graph- based sequential shortest path algorithm. Our empirical analysis shows that, although 66 PAS PAH is more scalable than M OTOpar, the average execution time of PASPAH is not significantly better than those of M OTOpar and S PAH . From this analysis, we conjecture that the parallel processing for intra query shortest path problem is not very promising. For inter query Shortest path problem, we proposed a H iTi graph-based parallel algorithm named I SPAH . We empirically analyzed the performance of I S PAH on the BBN GP1000 shared memory multiprocessor system by measuring its speedup on grid graphs as processors are increased. Our analysis shows that the speedup of I S PAH increases almost linearly for up to 10 processors. However, I S PAH is scalable up to 25 processors. This performance degradation occurs due to the severe memory access conflicts of processors. We then presented an improved version of I S PAH named M I S PAH which reduces the memory access conflicts through partial data replication. The peformance analysis of M I S PAH shows that it is scalable up to 41 processors. From this empirical analysis, we conjecture that inter query Shortest path problem has more potential for the parallel processing than intra query shortest path problem. Chapter 4 Description and Location of Distributed Fragments of Large Recursive Relations In a distributed environment, it is advantageous to fragment a relation and store the fragments at various sites. 111 this chapter, based on the concept of lattice structures, we develop a framework to study the fragmentation problems of distributed recursive relations. Two of the fragmentation problems are how to describe and locate frag— ments. Description and location methods previously suggested are more suitable in parallel environments than in distributed databases. In this chapter, we propose a method to describe and locate fragments based on lattice structures. Finding lattice descriptions of fragments is shown to be an N P-complete problem. We analyze the performance of lattice approach both theoretically and experimentally. This is done 67 68 by creating a database of recursive relations. The empirical analysis shows that our proposed algorithms give near-optimal solutions. 69 4.1 Introduction In distributed database systems, it is often convenient and beneficial to fragment and distribute data according to referencing frequency or locality. By having only fre- quently accessed data stored in each local site, we can reduce the data maintenance cost. For example, in a distributed database for designing automobiles or aircraft, it is desirable to store the relevant and frequently-accessed parts together. This data can be maintained in a Single recursive relation such as a parts/subparts relation. There are three major problems in fragmenting recursive relations in distributed en- vironment. The first is how to fragment a recursive relation. The second and third is how to describe and locate the fragments of the recursive relation. The focus of our research is to investigate the second and third problems. Fragmentation in traditional (non recursive) databases is accomplished through the use of logical predicates, which the tuples of the fragments must satisfy. These logical predicates can easily capture the referencing locality of non-recursive data. They are formed by conjunction, disjunction, and negation statements defined on the attribute values of the tuples. The fragmentation criterion, or Simply criterion, for a fragment F is a predicate which, when applied to a relation R, will determine which tuples of R belong to F. This fragmentation of non-recursive relations is investigated in [16]. Fragmenting recursive relations in distributed databases has not been extensively studied. Most papers 011 parallel and distributed computation of transitive closures 70 have applied the traditional fragmentation criteria to recursive relations [32, 45, 47, 74, 84, 102]. However, this criteria, defined by a predicate or a hash function, does not properly capture the referencing locality of data in recursive relations. Locality often comes from the transitive relationship between data. Stated another way, relevant data, frequently accessed by recursive queries, are likely to be related in transitive closures. Although the fragmentation based on the transitive relationship provides less efficient non-recursive query processing strategies for recursive relations than the traditional predicate-based fragmentation, it gives a very powerful basis for an efficient fragment description and location technique suitable for recursive query processing. An efficient processing of recursive queries is important for large recursive relations in distributed databases. Based on the criteria defined by transitive relationships, Houtsma, et al., studied the parallel computation of transitive closures in [42, 43] and designed strategies for fragmentation in [44]. However, Houtsma’s paper [44] did not discuss how to describe and locate fragments in a distributed environment. In a distributed database, we need an efficient fragment description and location method because we can avoid a significant amount of communication cost for query processing by keeping remote fragment descriptions in local sites. In this paper, we focus only on acyclic recursive relations. Unless otherwise stated, recursive relations and acyclic recursive relations will be used interchangeably. The rest of the chapter is organized as follows. Section 4.2 describes a fragment description method based on lattice structures. In Section 4.3, a fragment location 71 method is described. In section 4.4, we suggest methods to incrementally update description of fragments when a recursive relation is modified. The performance analysis of our update algorithm is also discussed in this section. Finally concluding remarks are given in Section 4.5. 4.2 Fragment Description by Lattice Structures In this section we use the properties of partial order sets (posets) to define and locate fragments of acyclic recursive relations. The acyclic recursive relations, considered in this paper, have the generic form R (attributel, attributeg, attributeg, . . . , attributem) where the attributes of R satisfy the following: I. attribute} and attribute; are the key attributes of R and share the same domain 2. attribute} and attribute; are the recursive join attributes and their correspond- ing values are related by some transitive relationship 3. attributeg through attribute”, describe the relationship between attribute} and attributeg The above relation R is associated with relation R’ which maintains detail descriptions 0f attribute} of R. Thus, relation R’ has the generic form of R’ (attributel', attributeg’, attribute3', ...,attribute,,’) where the attributes of R’ satisfy the following: I. attribute}I of R’ is the same as either attribute] or attributeg of R 72 2. attribute,’ is the key attribute of R’. 3. attribute; through attribute,,' describe attribute}, Examples of relation R and R’ are Shown in Table 4.1 and Table 4.2 where R and R’ correspond to part / subpart and part relations respectively. part subpart quantity engine cylinder head 1 engine fan belt 1 cylinder head piston 4 Table 4.]: A11 example of relation R part color weight dimension cost engine black 800 3 by 2 by 2 1000 cylinder head gray 100 1 by 2 by 2 250 fan belt black 30 1 by 5 by 1 250 piston Silver 70 1 by l by 1 100 Table 4.2: An example of relation R’ The above relations part/subpart and part together can be viewed as a directed acyclic graph (DAG) GR(VR, ER). Each node in V3 represent a tuple of the part relation and symbolized by the key attribute of the part relation. Each edge in ER represent a tuple of the part / subpart relation and symbolized by the key attributes of the part/subpart relation. The values of attributes part and subpart are related by an inclusion (i.e. transitive) relationship. For example, engine directly includes cylinder head and fan belt while piston is indirectly included in engine. If we consider this inclusion relationship as a partial order, then DAG GR(VR, ER) is a poset. A 73 partial order, :1: j y, is represented as a path in this DAG if there is a path of length 0 or more from node :1: to y. In our fragmentation method, each site keeps ER while VB is fragmented exclu- sively and distributed over various sites. Thus, a fragment is a subset of nodes in V3. Since each site stores ER, our fragmentation technique is more effective when the ratio of [ER] to Will is small. When IERI is much larger than IVRI, the benefits of fragment- ing ER are likely to exceed the benefits of just fragmenting VR. However, fragmenting ER causes one problem in our fragment description and location method presented in the following sections. The problem is, if E; is fragmented and distributed, then we cannot locally determine the location of remote fragments necessary for recursive query processing. Definition 4.2.1 Let GR(VR, ER) represent an acyclic recursive relation. Then, an F—subgraph for a fragment F is a subgraph Gp(Vp,Ep) of the graph G3 induced by the nodes offragment F. For example, Figure 4.1 shows a digraph GR(VR, Eg) for a relation R. Suppose G3 has three fragments named F] = { 1, 2, 5, 8, 9 }, F2 = { 3 ,5, 7, 10, 11, 12}, and F3 ={4,13}. Figure 4.2 shows the three F-subgraphs G1, G2, and G3 induced by the fragments F1, F2 and F3, respectively. The F—subgraph is not likely to be a set of randomly unrelated nodes. More likely, it will be a connected component or a collection of several connected components. This is because, as mentioned before, the referencing locality of nodes in V3 often comes from the transitive relationships among the nodes. Figure 4.1: A digraph G for an acyclic relation An F -subgraph is also a poset, since each subset of a poset is itself a poset. Therefore, the properties of posets can be used to describe and locate fragments. MW 03 Figure 4.2: Three F -subgraphs The descriptions of fragments are maintained in a fragment table at each site. A fragment table is a binary relation of the form (f, d) where f is a fragment identi- fier and d is the corresponding description. A naive description (I of a fragment is defined as the set of key attribute values for nodes contained in the fragment. This naive description is quite flexible in the sense that any fragment can be described by grouping the key attribute values of its elements. It is also easy to maintain when updating fragments. However, this naive description suffers from the following serious drawbacks: 75 1. Fragment description size is large 2. Fragment description size grows linearly with respect to fragment size 3. Communication of fragments’ descriptions with other sites is costly due to their large size 4. A large search space is needed to find the fragments containing certain nodes. The above drawbacks can be very costly, especially when the original relation is very large. A more efficient approach is required for practical purposes. By taking advan- tage of ER at each site, we can construct a concise but flexible fragment description method. This method is illustrated in the following subsections. 4.2.1 Maximal and Minimal Nodes Approach We can represent a fragment by a set U of maximal nodes and a set of V minimal nodes of the F-subgraph GF(VF, Ep). Let F be a fragment. Then F: {.2le VRwhere ajzij aGU A bEV} Thus, a node is in the fragment if it has an ancestor node in U and a descendant node in V. The fragment F is then represented by the notation < U, V >. This approach, however, is not complete because it cannot describe certain types of fragments. For example, consider the DAG GR(VR, ER) in Figure 4.3. 76 @0 00 I (9 0 Figure 4.3: An Example of DAG GR(VR, E3) The fragment F = { 1, 2, 3, 7, 8, 9 } cannot be represented by this approach. This is because the fragment represented by < {1, 7}, {3, 9} > contains the additional node 5. In order for this maximal and minimal nodes approach to be applicable, we need the restriction which should be imposed on fragments. That is, for each fragment F represented by < U, V >, the intersection between U U { all descendants of nodes in U } and V U { all ancestors of nodes in V } should be equal to F. In the next subsection, we give a representation method based on the lattice approach which does not require the above restriction. 4.2.2 Lattice Approach 11) this approach, a fragment is described by a set of lattices defined on the corre Sponding F-subgraph Gp(Vp, Ep). A lattice is represented by L = < 2:1,,y1, > where ml, is the least upper bound (LUB) and y}, is the greatest lower bound (GLB). Next, We present a list of definitions which will be used throughout the rest of the paper. Definition 4.2.2 Let GF(VF, Ep) represent an F—subgraph. Then, any pair of nodes (any) in Vp is a candidate lattice < :r,y > for GI“ ifs: :5 y. 77 Definition 4.2.3 A candidate lattice < 1:,y > is called a simple lattice ifa: 74 y and there is no 2 such that a: j z j y, z 76 x and z 75 y. A candidate lattice of the form < 3,:1: > is called a trivial lattice. Definition 4.2.4 Let CR(VR, ER) represent an acyclic recursive relation and CF(VF,EF) be an F—subgraph of CR. A candidate lattice < 1:,y > for CF is suit- able if the intersection between { a: } U { all descendants ofz } and{ y } U { all ancestors ofy } forms a nonempty subset of VF. Definition 4.2.5 Let L = < :c,y > be a suitable candidate lattice for Cp(Vp, Ep). Then, lattice L covers node 2 E Vp ifs: j z j y. Definition 4.2.6 Let CL be a set of suitable candidate lattices defined on the F- subgraph Cp(Vp, Ep). Let LC Q CL. Then, we say LC is a lattice cover for CF if every node of Vp is covered by a member of LC. We will represent a fragment by a lattice cover. Unlike the maximal and minimal nodes approach, the lattice approach is complete in that it can describe any fragment by a set of suitable candidate lattices defined on the corresponding F-subgraph. For example, { < 1,3 >, < 7, 9 > } is a lattice cover precisely describing the fragment { 1, 2, 3, 7, 8, 9 } shown in Figure 4.3. Thus we can define a fragment F by a lattice cover as follows: F = {zlszjy for at least one €LC} 78 More than one lattice cover exists for a given F-subgraph. Figure 4.4a and 4.4b show two different lattice covers for the same F-subgraph G2 of Figure 4.2. Figure 4.4: Two lattice covers for the F-subgraph G2 The above Figures 4.4a and 4.4b correspond to the lattice covers LC; = { < 3,10 >, < 3,11 >, < 3,12 > } LC2 = { < 3,10 >, < 7,11 >, < 3,12i> } respectively. So far, we have presented the basic idea for describing the fragment by using a set of suitable candidate lattices called a lattice cover. One important thing to note from this lattice approach is that our lattice representation scheme (i.e. L = < 1:1,,yL >) has one representational ambiguity. That is, we cannot always uniquely reconstruct any lattice L by using its LUB x1, and GLB yL alone. This problem will be shown from the lattice L]: < 3,11 > = {3 j 6, 6 j 11, 3 j 7, 7 j 11 } in Figure 4.4a. From L1, consider two lattices L2 = {3 j 6, 6 j 11 } and L3 = { 3 j 7, 7 j 11 } . Even though all three lattices L1, L2, and L3 are represented by the same LUB and GLB (i.e. < 3,11 >), they are different lattices. However, this representational ambiguity is not a problem with our lattice approach. This is because in our lattice ELpproach, lattice L = < ethyl, > always refers to the one which maximally covers 79 nodes in GR(VR,ER). Thus, in the above example, lattice < 3,11 > refers to the lattice L1. The question that remains is what criteria will determine a good lattice cover for an F-subgraph. We define a good lattice cover as one that minimizes the search space for finding a lattice incorporating a given node. The search time will depend on the number of lattices in a lattice cover and not on the size of each lattice. In other words, the number of lattices in a good lattice cover should be as small as possible while the amount of overlap among the lattices is not important. An index called lattice cover size will be used to represent the number of lattices in a lattice cover. In the next subsection, we show that finding a lattice cover with a minimal index is NP — complete. We then present a heuristics for finding a good lattice cover. 4.2.3 Lattice Cover Problem is NP — complete The lattice cover problem, LC P, is defined as follows. Let P be a set of nodes where the partial order j is defined. For a given positive integer K, the problem is to determine the existence of a lattice cover LC whose cover size S K such that every node of P belongs to at least one of the lattices in LC. To prove it is NP — complete, we will map the minimum cover problem [33] to LC P. The minimum Cover problem, M C P, is stated as follows: Given a collection C of subsets of a finite set S with a positive integer K g |C|, is there a subset C’ Q C with IC’I S K such that every element of S belongs to at least one member of C’? 80 Before presenting a detailed proof for NP — completeness of LCP, we first explains the basic intuition on how to check if a given set of lattices is a lattice cover for the F-subgraph. Given a lattice in a lattice cover, finding a set of nodes covered by the lattice requires performing two transitive operations, one from the LUB (on the original edge graph) and the other from the GLB (on the inverted edge graph) of the lattice, and finding the intersection of the two node sets. Checking that a set of lattices is a lattice cover involves taking the union of the sets of nodes covered by the lattices and checking if this is identical to the sets of all nodes in a fragment. Next, we prove that LCP is NP — complete. Lemma 4.2.1 LCP is NP — complete Proof: LC P is in NP, since a nondeterministic algorithm may guess a set of suitable candidate lattices LC for a partial order set (P, j) such that |LC I S K, and then check in polynomial time whether LC is a lattice cover. We transform MCP to LCP. Let C = { S], 3;, ..., Sn } be a collection of subsets of a finite set S. The poset (P, j) is constructed as follows. For each subset S.- E C, we create four nodes, M,“ M,-,, N,, , and N,,. Next, we create direct partial orders for each i such that M,, j N,, and Mg, j N,., wherei = 1, 2, . .. ,n. For every x E S,- where i = 1, 2, .. . ,n, we create a direct partial order such that Mg, j .7: j N,,. This newly constructed poset (P, j) has 2n maximal nodes (i.e. M1,, M1, M2,, M2,, . . . a ".11, Mn_12, Mm, Mm) and 2n minimal nodes (i.e. N1“ N12 N21, N22, . . . ,Nn_1,, N,,.],, NM, NM). From this poset (P, j), let set M and N consist of all maximal and minimal nodes of P. Then, we create CL as such CL = { < x,y > : a: E M /\ 81 y 6 N A 1' j y }. Then, in any lattice cover LC Q CL, we must have at least 2n suitable candidate lattices to cover both the n maximal nodes (i.e. M1,, M2,, ,Mn,) and the n minimal nodes (i.e. N12, N22, ,N,g,). Furthermore, it is always possible to derive the above 2n suitable candidate lattices as 2n simple lattices. This is because we have Mg2 :5 Ng, and Mg, j Ng2 where i = 1, 2, ,n. Let a set X denote the above 2n simple lattices. It is obvious that the lattices in X also cover the remaining 12 maximal nodes (i.e. M1,, M2,, ,M,,,) and n minimal nodes (i.e. N1,, N2” ,an). Note that no lattice in X covers any node in S because the lattices in X are simple lattices. Therefore, if we can derive a lattice cover LC, where ILC I S (K + Zn), from this poset (P, S), then we have a minimum cover C’ Q C with IC’I S K. To see that this transformation can be performed in polynomial time, it suffices to observe that the total number of partial orders in (P, j) are bounded by a polynomial of 0(ICI - max{|Sg| : Sg E C}). D In order to demonstrate how the above transformation works, we will illustrate an example as follows. Let MCP have the instances such as S = { 1, 2, 3, 4, 5 } and C ={51,52,33,S.g,55}where31={1,2,3},Sg={4,5},S3={1,2},Sg={3, 4 }, and 55 = { 5 }, From this MCP, we need to construct a poset (P, '1). For each subset S,- E C, we first create four nodes Mg,, Mg2, Ng,, and Ng,, which are shown in Table 4.3. 82 i5; Mg1Mg°2 Ni1Ni2 1s,={1,2,3} A B K L 2 Sz={4,5} I J T U 3 S3={1,2} C D N O 4 s..={3,4} E F P Q 5 55:93} G H R s Table 4.3: Newly created nodes for each set Sg for transforming from MCP to LCP Next, we create direct partial orders for each set S'g where i = 1,2, . . ., 5. For 51, the direct partial orders created are A :5 L, B j K, A j 1 j K, A j 2 j K, and A . j 3 j K. The rest of direct partial orders created for other sets in S are shown in Figure 4.5. \ I \ I \ I \ I \ \ I I \ I \ I \ I \ I \ I \ I ' \ I \ ’ I X l is 2 ii is ‘ A I\ ’\ I I\ I, \/ Ii ///i‘\/ \‘/ \ I \ I V s ‘ r u r v I ‘ L K O N Q P S R U T Figure 4.5: Partial order sets created from MCP to LCP Note that in Figure 4.5, dashed arrows (undashed arrows) represent direct partial orders constructed for newly created nodes (resp. nodes in Sg). Then, set CL consists of suitable candidate lattices in { < B,K >, < A,L >, < A,K >, < A,N >, < A,P >, < D,N >, < 0,0 >, < C,N >, < C,K >, < F,P >, < E,Q >, < E,K >, < E,P >, < H,R >, < G,S>, < G,R>, < G,T >, < J,T>, < I,U >, < I, R >, < I,T > }. It is easy to see from Figure 4.5 and CL that if we can derive a lattice cover LC Q CL, where ILCI S K + 2 - 5, then we have a minimum cover 83 C’ c c with m s K. 4.2.4 A General Heuristic for Finding a Good Lattice Cover In this subsection, we present a heuristics for finding a good lattice cover. The following proposition will establish the relationship between the lattice cover size and the number of maximal and minimal nodes in the F-subgraph Gp(Vp, Ep). Proposition 4.2.1 Given an F-subgraph CF with M maximal nodes and N minimal nodes, the index for any lattice cover of CF must be at least max(M, N). Proof: Assume that M _>_ N. Since two maximal nodes can not belong to the same lattice, we need at least M lattices to cover all those maximal nodes. Similarly, if N 2 M, we need at least N lattices to cover those minimal nodes. D From the proof of proposition 4.2.1, it is obvious that all maximal (minimal) nodes in Cp must be used at least once as the LUB (resp. GLB) node of some lattice in any lattice cover of CF. This is because a maximal (minimal) node cannot be covered by any lattice whose LUB (resp. GLB) node is not a maximal (resp. minimal) node. This property provides a good basis for choosing a set of suitable candidate lattices, i.e. CL, from all possible suitable candidate lattices. We restrict all suitable candidate lattices to consist of those whose LUB and GLB nodes are chosen from maximal and minimal nodes (respectively) of F—subgraph. For example, consider the F-subgraph G2 in Figure 4.2. The suitable candidate lattice 84 < 7,11 > defined on G2 is not chosen as a member of set CL. This is because LUB node 7 is not a maximal node of the F—subgraph G2. The set CL consists of the lattices < 3,10 >, < 3,11 > and < 3,12 >. However, it is not always possible to derive a lattice cover for a given CF from the subset of suitable candidate lattices we just described. We take the DAG GR in Figure 4.6a as an example to illustrate this if) mi A? z W. n: ® 03 ® a b problem. Figure 4.6: A digraph G and F-subgraph of G From the graph G3 in Figure 4.6a and the fragment F = { 2, 5, 6, 9, 10, 13, 14 }, we get the corresponding CF shown in Figure 4.6b. It follows that CL = { < 2,13 > } from which we cannot derive a lattice cover for CF. This is because set CL cannot contain the unsuitable candidate lattice < 2,14 > which also includes additional node 11 ¢ F'. Thus, we need to enumerate additional suitable candidate lattices to derive a lattice cover. For this purpose, we present Algorithm 1 in Figure 4.7. It obtains additional suitable lattices by partitioning an unsuitable lattice < x, y > into a set of suitable lattices. The union of these suitable lattices then covers all nodes in {zlszjy /\ zEVp}. " . i . 85 The input to the algorithm is VF, unsuitable lattice L =< :r, y > and a subgraph SG;(V1, E1) induced by the elements of L =< :r, y >. The elements of L =< :r,y > was obtained from the intersection between ({x} U {all descendants of x}) and ({y} U {all ancestors of y}) by using ER. Note that the we represent E1 and ER as adjacency lists. begin 1. Let a subgraph SC1(V1, E1) corresponds to L = < :c,y >; 2. Let \I' = 0; 3. Obtain IV] = V] - V}: and V1, -'= V] — IV]; 4. Let SGL(VL, EL) be a subgraph induced by nodes in V1,; 5. Let P, = 0, S] = {x}, and i = 2; 6. Let S-z = { v | (u,v) 6 EL for all uE S] }; 7. If 32 = (ll then goto step 10; 8. For each v in 5;, let P,, = i; 5" i = i + 1, 5'1 = $2, and goto step 6; 10. Find 5 = {(u,v) | (u,v) E EL A (u,w) E E1 where w 6 IVI}; ll. EL = EL—tiand N={u | (u,v)Eb}; 12. Find w 6 N such that PW = min{ Pg, I v E N}; 13. Let M consist of all maximal nodes in EL; 14. Let \II = \IIU { |ueMA(ujw is true in EL)}i 15. From EL, obtain S = {elm j c j n} for each < m,n >€ ‘1’; 16. Obtain VL = v,, — {cl 0 e 50....) where < m.n >6 ‘1'}; 17. If VL = (D then goto step 20; 18. Let EL consist of those edges whose end nodes are in V1,; 19. Let N consists of all minimal nodes in EL and goto step 12; 20. return (\II, {SLIL 6 .\II}); end Figure 4.7: Algorithm 1: Partition an Unsuitable Candidate lattice L =< 1:, y > We re—take the DAG GR and CF in Figure 4.6 as an example to demonstrate Algorithm 1. The input to this algorithm is a VF = { 2, 5, 6, 9, 10, l3, l4 }, the unsuitable candidate lattice < 2,14 > and the subgraph SG;(V1, El) corresponding to < 2,14 >. Figure 4.8a shows SCL(VL,EL) obtained at step 4. The value P,, of 86 each node v in V1, is also shown beside node v in Figure 4.8a. Note that due to edge (5,14), P14 is initially set to 3 and then reset to 4. in: xi: ., ’39:: A; A: a) b) C) Figure 4.8: Changes of S'CL(VL, EL) in Algorithm 1 At step 10, since node 6 connected to node 11 in [V], 6 contains edge (6, 10). We delete edge(s) in 6 from EL at next step. Thus, the structure of $01, is changed, which is shown in Figure 4.8b. After the executions of step 12 to 15, we obtain \II = { < 2,6 > } and change SCI, This changed 50;, is shown in Figure 4.8c. Since V1, is still non-empty, steps 12 to 16 are repeated. During this repetition, set \I' is changed to \II = { < 2,6 >, < 5,14 > }. This set \I' contains suitable candidate lattices partitioned from the unsuitable candidate lattice < 2, 14 >. Next, we discuss the time complexity of Algorithm 1. The primary complexity of Algorithm 1 comes from traversing edges in EL. Then, it is easy to see that Algorithm 1 will not traverse more than |\II| - IEL| edges. Thus, the time complexity is 0(|\II| - IELI). 87 We now present Algorithm 2 which generates a good lattice cover for an F- subgraph. It is given in Figure 4.9. The input to Algorithm 2 is theyF-subgraph Gp(Vp, Ep), edge set ER, and set M and N. Set M and N denote the sets of max- inial and minimal nodes of the F—subgraph respectively. Note that edge set ER and Ep are maintained as adjacency lists. Algorithm 2 first creates a set of suitable candidate lattices CL by using Algorithm 1 and the method we described in the early part of this subsection. Next, Algorithm 2 tries to identify essential lattices among the lattices in CL. The formal definition of essential lattices is given in the following definition 4.2.7. Definition 4.2.7 Let CL be a set of suitable candidate lattices for an F—subgraph CF(VF,EF). Suppose that CL' = { LC], LC-z, ...,LCk } consists of all possible lattice covers for CF where LCg Q CL fori = 1,2,...,k. Then, a lattice in CL is essential if it belongs to any lattice cover LCg fori = 1,2, . . . ,k. Identifying all the essential lattices in CL was shown to be NP-Hard [73]. How- ever, some of the essential lattices can be identified by checking each node in the F—subgraph. That is, if a node is covered by a single lattice in CL, then the lattice is essential. After finding all identifiable essential lattices in CL, Algorithm 2 compute a set of nodes, V, not covered by identified essential lattices in CL. It then repeatedly apply a greedy heuristics to the nodes in V to select a lattice from CL that covers the most remaining uncovered nodes in V (with ties broken arbitrary). Figure 4.10 shows a digraph CR(VR,E’R) consisting of two F-subgraphs corre- sponding to the fragments { l, 2, ,14, 15 } and { 16, 17, , 21, 22 }. The 88 Stage 1 1. Let LC = (b. For each node v in Vp, create the corresponding ances- tor and the lattice set A,, and L3,, respectively. They are initialized to (0. Create two sets EC L (Essential Candidate Lattice) and U C L (Unsuitable Candidate Lattice). Let ECL = UCL = (0. 2. For each maximal node m in M, let A", = { m }. For every v which is a descendant of m in Ep, let also A,, = A, U { m }. Therefore, given a node v, its ancestor set A,, contains all the maximal nodes which are ancestors of v. For each maximal node m in M, let Am = {m}- 3. For each minimal node n in N, and each m 6 An, the node pair (m,n) suggests a candidate lattice < m,n >. There are EneN IAnI such pairs. Let CL denote the set of ZneN IAnI candidate lattices. 4. For each candidate lattice L =< 2:,y > in CL, create the corre- sponding set S L by using E3. The elements of S L are obtained from the intersection between ({ a: } U { all descendants of z }) and ({ y } U { all ancestors of y }). If L is not suitable, move L from CL to U C L. 5. For each L 6 CL, L5,, = L5,, U { L } ifv 6 51,. Let UCL' = 0. For each node v in Vp such that L3,, = 0: UCL' = UCL’ U { L | L e UCL A v e .91, }. 1f UCL’ = (2) goto step 8. 6. Apply Algorithm 1 to each unsuitable candidate lattice L 6 UCL’. Let PC L consists of those suitable candidate lattices returned from Algorithm 1. 7. For each L E PCL, L5,, = L5,, U { L } ifv 6 31,. CL = CL U PCL. 8. For all v 6 VP, if L5,, contains only one lattice L, then move L from CL to ECL. 9. Let L0 = LC U ECL and V}? = VF - ULeECLSL- Stage 2 V = Vp; U = CL; while ( V ¢ lb) { Select an L E U that maximizes ISL 0 VI; U=U—L; V=V—SL; LC=LCUL;} LC is the lattice cover for the F-subgraph; Figure 4.9: Algorithm 2: Find a Lattice Cover for an F -sub2raph 89 F-subgraph Cp(Vp, Ep) corresponding to the fragment { 1, 2, ,14, 15 } will be used to demonstrate Algorithm 2 as follows. 01110121°l31 [01110111 m 0 0m Om @ Gm \ . @III Guam Guam Gm Figure 4.10: A digraph G consisting of two F-subgraphs Stage 1 2. The ancestor set A,, of each node v in VF is shown beside node v in Figure 4.10. 3. The set CL of candidate lattices is obtained: CL = { < 1,12 >, < 1,13 >, <1,14 >,<2,13 >,<2,14 >,<3,13 >,<3,14>,<3,15>}. 4. Table 4.4 shows the nodes of each candidate lattice L in CL. The nodes of each lattice are computed by using the edge set ER. Among the candidate lattices in CL, lattice < 1,12 > is not suitable, since it also covers nodes 20 and 21. Thus, CL 2 CL — {< 1,12 >} and UCL ={ <1,12 > }. 5. In this step, wegot LS] ={< 1,13 >,< 1,14 > },L32={<2,l4 >,<2,15> },L53={<3,13 >,<3,14 >,<3,15> },LS,=0,...,LS,5={<3,15> }. Then, by the definition of UCL’, UCL’ = { < 1,12 > }. 90 11 SL L 51, < 1,12> SL={1,4,9,12,20,21} <2,13> SL={2,6,10, 13} < l,13> SL={1,4,5,9,10,13} <3,13> SL={3,7,10,13} < l,14> SL={1,5, 10,14} <3,14> SL={3,7,8,10, 11,14} < 2,14> SL={2,6,10,14} <3,15> SL={3,7,8,11,15} Table 4.4: Lattice L and Set 3;, 6. By applying Algorithm 1 to unsuitable candidate lattice < 1,12 >, we got PC L ={<1,4>,<9,12>}. 7- In this step, by using Ep, we got 50,4) = { 1, 4 }, 5(9’12> = { 9, 12 }, L31 ={<1,4>,<1,13>,<1,14>}, LS,={<1,4>,<1,13>},L39={ <1,13 >, < 9,12 > }, L512 = { < 9,12 > }. And CL = CL U { < 1,4 >, < 9,12 > }. 8° The lattices < 9,12 > and < 3,15 > are essential because L312 : { < 9,12 > }, L515 = { <3,15 > }. Hence, ECL = { <9,12 >, <3,15>}and CL ={ <1,4>,<1,13>,<2,13>,<2,14 >, <3,13 >, <3,14> }. 9‘ LC and V consist of{ < 9,12 >, < 3,15 > } and { 1, 2, 4, 5, 6,10, 13,14} respectively. Stage 2 A greedy method is performed in while loop. At the first execution of while loop, lattice < 1, 13 > will be selected from CL. Then lattice < 2,14 > will be selected in the next execution of the while loop and Stage 2 will end. Therefore, 91 the lattice cover LC consists of{ < 9,12 >, < 3,15 >, <1,13 >, < 2,14 > }, which in this example is the minimal lattice cover. 4-2 -5 Near Optimal Lattice Cover Size Corrnen ct al. [20] used the same greedy heuristics as the one in stage 2 of Algorithm 2- They derived a logarithmic ratio bound for their algorithm. Our greedy approxi- rn ation algorithm has the same performance ratio bound in the worst case as the one shown in [20]. This is because, in the worst case, our algorithm may not identify any essential lattice. As a result, only the greedy heuristics of stage 2 is used for deriving a l at t ice cover. Let OPTU‘“) be the optimal solution and LCP(I) the heuristic solution for LCP PTOblem. Cormen et al. prove in [20] that LCP(I) < (ln(max{|SL| : L 6 CL}) + 1) , O P T( I ‘). However, for our algorithm, this bound represents the worst case situation. To measure the performance of Algorithm 2, we create random DAGs GR(VR, ER). T118 generation of DAGs is based on a uniform distribution in the range [1,IVRI] Where |VR| represents the total number of nodes. Then, we' create F-subgraphs by partitioning 03. Each F-subgraph has a small number of maximal and minimal nodes C0111 pared to lVFl (i.e. about 1 % of |Vp| on the average). Based on the above scheme, we Create a set of random F-subgraphs Cp(Vp, Ep) as f0110W53 . Size of Vp is varied from 1000, 1500, to 2000 nodes. . For leI = 1000, we create CF by varying IEpI from 1500 to 10000 with an 92 interval 500 edges. . For IVpI = 1500, we create Gp by varying IEpI from 2500 to 10000 with an interval 500 edges. 0 For lVFl = 2000, we create CF by varying IEFI from 3000 to 10000 with an interval 500 edges. 0 For each fixed size of VF and Ep, we create 10 Gp’s using 10 different seeds. For each F-subgraph created as above, we compute the lattice cover size ILCFI and the percentage ratio between the total number of essential lattices (i.e. IEC LI) and sui table candidate lattices (i.e. ICLI), by using Algorithm 2. Note that CL (ECL) is Obtained in step 7 (resp. step 8) of Algorithm 2. These values are then averaged over the 10 F-subgraphs with same values of I VpI and IEpI. These are shown in Figure 4-11 Figure 4.11a shows that the lattice cover size |LCp| converges to the theoretically DOSSible minimal lattice cover size as we increase lEFl- The theoretically possible Irlinirnal lattice cover size is given in Proposition 2.1. From Figure 4.11b, we can clearly observe that the value of P approaches close to 100 % as We decrease IEFI This i ‘11 plies that the total number of essential lattices approaches close to the total number of the suitable candidate lattices, as we decrease IEpI. Since IECLI S ILCpI S ICLI, Algorithm 2 gives LCp close to the minimal lattice cover, as IEpI is decreased. Therefore, we can conclude that Algorithm 2 gives near optimal lattice covers for both dense and non-dense F-subgraphs. Note that dense F —subgraphs have much larger 93 number of edges per node than that for non-dense F-subgraphs. 50 l l T ILCFI ITIIIIIII b) P = 100 - IECLI/ICLI Figure 4.11: Performance analysis for Algorithm 2 Next, we discuss the complexity of Algorithm 2. Algorithm 2 consists of stage I and 2. The time complexity of stage I is dominated by step 4. Let Nd (Na) be the 1hfiximum number descendants (resp. ancestors) of node a: E M (resp. N). Let c, be the total number of edges to be traversed to find the above ancestor and descendant nodes. Then step 4 will run in 0(e, + (Nd + Na) - ZnEN IAnI)- For stage 2, we can ea8in see that its time complexity is 0(ICLI2 - IVpI). Thus, the time complexity of the algorithm is 0m + (N. + N.) - gal/4.4 + ICle - val). Note that. as the denseness of F-subgraph is decreased, the time complexity will approach to 0(eg + 94 (N4 + Na) - ZneN |A,,I). This is explained by the result shown in Figure 4.11b. 4-3 Finding Relevant Fragments The fragments containing the nodes a query needs to access, are defined as relevant fragments of the query. Note these set of nodes may be a super set of the result 11 odes. In order to find relevant fragments of the query, each site needs to access its local fragment table and local edge set ER. Before discussing methods of locating the relevant fragments, we first introduce a set of basic notations which will be used in t he rest of this paper. They are defined as follows: FT : the set of all lattices in the fragment table; Q : a set of nodes; DQ : Q U { all descendants of nodes in Q }; AQ : Q U { all ancestors of nodes in Q }; Sgub: {x|6FT}; Sgubq : {xIxESgubeE/lq }; Sggb:{yl<:r,y>EFT}; SglbqiiylyESgIbAl/GDQ }i Note that DQ and AQ are computed from E3 maintained in adjacency lists. The following proposition provides the basis for locating relevant fragments of the query. Proposition 4.3.1 Suppose Q consists of a set of nodes needed by a query. Let A be the set of lattices in FT which cover all nodes in Q. Then, the lattices in A are only those lattices in T ={ < x,y > I :1: E Stubq A y 6 Sglbq A < :c,y >6 FT }. 95 Proof: We need to show that A Q T and T Q A to prove A E T. Case 1: To prove A Q T, we need to show V < 2:,y > (< 1:,y >6 A —> < a:,y >6 T). Assume that there is a lattice < u,v >6 A such that < u,v >6 T. Then, < u,v > 11) ust cover at least one node 2 6 Q. By the definition of T, u 6 Sgubq or '0 ¢ Sggbq. However, since u j z j v and z 6 Q, u 6 AQ and v 6 Dq. This implies that u 63 Sgubq and v 6 39150, which is a contradiction. Case 2: To prove T Q A, we need to show V < 2:,y > (< a:,y >6 T —> < $,y >6 A). Assume that there is a lattice < u,v >6 T such that < u,v >6 A. Then, by the definition of T, < u,v >6 FT and it must cover at least one node in Q. Since A Con tains all lattices in FT which cover all nodes in Q, A must contain < u,v >. This contradicts our assumption. 0 Even though proposition 4.3.1 gives a general way of locating relevant fragments, it is an inefficient approach because it requires computing two transitive closures (i-e. finding all ancestors and descendants) of all nodes in Q. Thus, we need an efficient algorithm which minimizes transitive closure computations. Fortunately, for recursive queries, the nodes they need to access are likely to be a connected Component or a collection of several connected components. In this case, we can lhinimize computing two transitive closures of all nodes in Q by taking advantage of the transitive relationships among the query nodes. Furthermore, if a recursive query processing requires a transitive closure computation on ER, then either Q E DQ or Q _=. Aq. Then, in either cases, no more transitive closure computation is necessary. 96 This will be proved in the following proposition 4.3.2 and 4.3.3. Proposition 4.3.2 Let DQ E Q and ll ={ < 2:,y > I y 6 5'9ng A < :c,y >6 FT}. Then, 11 E T. Proof: We need to show that I] Q T and T Q IT to prove ll 5 T. Case 1: To prove T Q II, we need to show V < x,y > (< :c,y >6 T —-1 < 2:,y >6 II). For all < Ir,y >6 T, by definition of T, it is true that y 6 S9150 and < :c,y >6 FT. Tl) us, all < :r,y >6 T belongs to ll. Case 2: To prove ll Q T, we need to show V < x,y > (< :L‘,y >6 ll —+ < :c,y >6 T). Assume that there is a lattice < u,v >6 ll such that < u,v >¢ T. Then, it must be true that v 6 Sglbq, < u,v >6 FT and u 6 Szubq. By proposition 4.3.1, < u,v > Callliot cover any node in Q. However, < u,v > covers the node v 6 Q, since '1’ G Sggbq Q DQ E Q. Thus, < u, v >6 T, which is a contradiction. Cl Proposition 4.3.3 Let Ag 5 Q and Q ={ < 3:,y > I a: 6 Siubq A < x,y >6 FT}. Then, 9 E T. Proof: The proof is similar to the one for Proposition 4.3.2. Before presenting an efficient algorithm (i.e. Algorithm 3) for finding relevant fragments, we first introduces a function named Block(x,t), which is called within Algorithm 3. Note that parameter t represents either symbol “L” or “G”. Function Block(z,C) (Block(:r,L)) returns a set of GLB (resp. LUB) nodes from which a given 97 node :1: is reachable. This function is described in Figure 4.12. Block(m,t) /* a: : node; t: either C or L; */ begin current = {1:}; Block 2 0; Ift = G then .S = $915; else s = Slug; While ( current # 0 ) { temp = current 0 3; Block = Block U temp; current = current — temp; Ift = C then current = {y|(:r,y) 6 ER A a: 6 current}; else /* t = L */ current = {:r|(:c,y) 6 ER A y 6 current}; } return (Block); end Figure 4.12: Procedure Block(x,t) To show how Block(r, t) works, we take Figure 4.13 as an example. Figure 4.13a shows three F-subgraphs whose node information is stored at different Sit-es. The fragment table corresponding to the DAG CH of Figure 4.13a is shown in ITigure 4.13b. Note that this fragment table is stored at each site. From Figure 4.13b, We have Sglb = I 6, 7, 13, 14, 15, 24, 20, 23 } and Slug, = { 1, 2, 8, 9, 16, 17, 21 }. Then, it is easy to see from the procedure that Block(4,G) = {6,7}, Block(6,G) = {6}, Block(‘23,L) = {9, 16, 17, 21}, and Block(21,L) = {21}. We now describe Algorithm 3 in Figure 4.14. Let Q consists of a set of nodes a query need to access. Algorithm 3 first checks if either Q E DQ or Q 5 Ag. If it is, Using proposition 4.3.2 and 4.3.3, Algorithm 3 easily locates relevant fragments. Oth- erwise, the algorithm tries to find relevant fragments taking advantage of transitive 98 Fragrant O Lattice l 47> <8.l3> <9.15> <17,24> <16.20> d!.23> wwwnnu- a) b) Figure 4.13: a) DAG for the whole recursive relation b) Fragment table relationships among the nodes in Q. We re—take the preceding Figure 4.13 to demonstrate Algorithm 3. Assume that t-V'\Io recursive queries RQ1 and HQ; need to access nodes in Q1 = { 8, 10, 11, 13, 14 } Ei-nd Q2 ={ 12, 18, 19, 20, 21 } respectively. We show how to find relevant fragments of RQ1 and RQ2 in the following two cases. Case Rle We have [IQ = {8}, G’Q = {13,14}, EQ = { (8,10), (8,11), (10,13), (11,13), (11,14) }, EL = { (6,8), (9,11) }, and Ea = 0. Since E0 = 0, we obtain LT = {< 8,13 >, < 9, 14 >} at step 3. Thus, fragment 2 is the relevant fragment of the query RQI . 99 begin 1. Compute LQ = Stub O Q and GQ = Syn, (1 Q; 2. Compute EQ = {(x,y)la: 6 Q A y 6 Q A ($.31) 6 ER}, EL = {(x,y)lx ¢ QAy 6 QA (33,31) 6 ER}, Ea ={($,y)|1‘ E Q /\ y 9! Q A ($.11) 6 ER}; 31ng =Iothen /* Q2 Do */ {LT = {< x,y > Iy 6 CqA < x,y >6 FT}; Goto step 10; } IfEL=lllthen /*QEAQ */ {LT = {< x,y > Ia: 6 LqA < x,y >6 FT}; Goto step 10;} 4. For each node a: 6 Q, L, = C, = 0; 5. so, = {$l($,y) e Ea} u Go; 81.0 = {yl($,y) 6 EL} U Lo; 6. For each a: 6 800’ C, = Block($,G); For each a: 6 81,0, L, = Block(zr, L); 7. For each a: 6 Bag For every y which is an ancestor of :1: within EQ C, = C, U (1,; 8. For each a: 6 BLQ { L’ = L,; G’ = 0,; LT, = {< a,b> Ia e L.Abe (LA < a,b>6 FT}; while (LT, = 0) { /* No lattice covering node x is yet found */ L' = {uly E L’ A (my) 6 ER}; C’ = {va 6 G" A (x,v) 6 133}; For each u 6 L’, L, = L, U Block(u, L); For each v 6 C’, C, = C, U Block(v, C); LT,= {< a,b> |a6 L,Ab6C,A 6 FT}; }} 9. LT = U,€3LQ LT,; 10. All lattices in LT describe the fragments containing nodes in Q; end Figure 4.14: Algorithm 3: Find the Relevant Fragments 100 Case RQ2: We have LQ = {21}, CQ = {20}, EQ = { (18,20), (19,20), (20,12), (20,21) }, EL = { (9,12), (16,18), (17,19)}, and E0 = { (12,15), (12,23), (21,23), (20,22) }. At step 5, we obtain 800 = {12, 20, 21} and 8,0 = {12, 18, 19, 21}. At next step, function Block-(2:, C) and Block-(w, L) gives 0,, = {15,23}, G20 = {20}, 621 = {23}, L1 2 = {9}, L13 = {16}, L19 = {17}, and L2] = {21}. Since nodes 18,19, and 20 are tlle ancestors of each node :1: 6 Bag within Eq, G18 = { 15, 20, 23 }, C19 = { 15, 20, 23 } and C20 = { 15, 23 } are computed at step 7. In the following step, we identify all the lattices covering nodes in BLq. By the definition of LT,, LT12 = {< 9,15 >}, LT” ={<16,‘20 >}, L7}, = (21, and LT21 = {g 21,23 >}. Since LT19 = 0, we need to execute the while loop. In that loop, we got L’ = ID and G’ = {22}. Then, we cOmpute C19 = { 15, 20, 23 } U {24}, which gives LT19 = {< 17,24 >}. Thus, the lattices in { < 9,15 >, < 16,20 >, < 17,24 >, < 21,23 > } describe the relevant friitgments (i.e. fragment 2 and 3) of RQg. Next, we discuss the time complexity of A 1 gorithm 3. The time complexity of Algorithm 3 is varied depending the following two cases. The first case is when EC; or EL is equal to (ll. In this case, the complexity is 0(IEQ U EL U EGI + ILTI - IFTI). The second case is when the first case is not true. Let e1, l‘epresent the total number of edges belong to lattice L. Then, Algorithm 3 in the Second case do not traverse more than 0(21,e LT IeLI) edges. Thus, the complexity is 0(Ztem ICLI + ILTI ' lFTl)- 101 4.4 Updating Recursive Relations There are four different types of updates on a fragmented recursive relation GR(VR,ER). They are node addition, node deletion, edge addition, and edge dele- tion. A new node can be added to V3 or an existing node can be deleted from V3. This node addition and deletion require updating Vp at the corresponding site. An edge addition (deletion) can connect (resp. disconnect) two existing nodes belong- in g to either same fragment or two different fragments. This edge update requires revising the local copy of edge set ER at each site. These 4 different types of updates must be performed by following the two update constraints given below: 1. A new edge (as, y) can be added to ER if node a: and y already exists in VB. ‘2. A node a: can be deleted from V3 after all edges incident on node a: are deleted from E R. The preceding two constraints are introduced to make each type of update atomic. The updates may also require revising the descriptions of some fragments (i.e. lattice covers). If we have to recompute new lattice covers for those fragments affected by updates, the overhead of maintaining the fragment description would be high. However, we can avoid this re—computation by using an efficient incremental update algorithm. This is presented in the following subsections. 102 4.4.1 Edge Update Suppose that an edge (x,y) is added (deleted) to (resp. from) DAG GR(VR, ER) at site h. Then, by using Algorithm 3, site k can easily locate which fragments contain Ilode a: or y. Assume that fragments F.- at site i and Pi at site j contain node a: and y respectively. Note that i,j and k are not necessarily distinct. Based on the above assumption, we describe the detailed update algorithms for edge addition and d eletion. E dge Addition Site It sends the update information (i.e. an addition of edge (x,y)) to site i alld j. After receiving the update information, site i (j) performs the procedure AddEdge((:r,y),i) (resp. AddEdge((:r,y),j)). This procedure is described in Figure 4- 1 5. The DAG GR(VR,ER) in Figure 4.16 is used as an example to show the‘ex- eoCution of procedure AddEdge((x,y),s). Figure 4.16 shows two F -subgraphs whose 1lodes are stored at site 1 and ‘2. Assume that we add edge (13,15) first and then (9,7) to ER. Site 2 performs AddEdge((13, 15),2) because both node 13 and 15 are available locally. Since A is an empty set, we obtain L13 = {< 12,13 >} and L15 = {< 15,15 >}. Lattices < 12,13 > and < 15, 15 > are merged into a new lattice < 12,15 > because lattice < 12,15 > is suitable. Thus, lattice cover LC‘Z is changed to {< 8,10 >,< 12,15 >}. Next, we consider edge (9, 7) addition to 133. Since site 1 and 2 stores node 9 and 103 AddEdge((x,y),s) /* (x,y): edge; 3: site number; */ begin Let LC, be a lattice cover for a fragment F, at site .5; Let Ep, be an edge set corresponding to F,; Let Q1 = {10}, Q2 = {31}; By using En, compute $1.,le and 391502; Obtain A = { < u,w > | u 6 31,50! /\ w 6 5,1qu /\ < u,w >6 LC, }; If ( == 0) { /* No lattice becomes unsuitable by the addition of edge (x,y) */ Obtain L, = { < a,2: > I < a,;r >6 LC, }; Obtain L, = { < y,b> I < y,b>€ LC, }; 133' = ER U {(X,)’)}; For each < a,x > in L, For each < y,b > in L1’ If (lattice < a,b > is suitable) { /* Suitability checking is done by using ER’ */ LC, = LC, — {< a,;r >,< y,b >}; LC, = LC, U {< a,b >}; }} else { /* Some lattices in A may become unsuitable */ For each < u,w > in A If (< u, w > is not suitable) { /* Suitability checking is done by using ER’ */ LC, = LC, — {< u,w >}; Apply Algorithm 1 to lattice < u,w >; Let LC, include those lattices returned from Algorithm 1; } } end Figure 4.15: Procedure AddEdge((:i:, y), s) Site] Site2 3 9 8% .3 9 Lattice Cover LC]: Lattice Cover LC2: { <1.3>,<5.7> } { <8,10>.<12.l3>,<15,15> } Figure 4.16: A DAG consisting of two F—subgraphs 104 7 respectively, the site 1 (2) performs AddEdge((9,7),l) (resp. AddEdge((9,7),2)). For site 2, no change is made on lattice cover LC2 because of A = L9 = L7 = 0. For site 1, we got A = { < 5,7 > }. Lattice < 5,7 > is not suitable because it includes additional node 9 stored at site 2. Therefore, it is necessary to partition lattice < 5, 7 > into a set of suitable lattices. By applying Algorithm 1, we partition the unsuitable lattice < 5, 7 > into suitable lattices < 5,6 > and < 4, 7 >. Thus, we obtain L(31={<1,3>,< 5,6 >, < 4,7 > }- After performing AddEdge((;z:,y),i) (AddEdge((a:,y),j)), site i (resp. j) notifies to site 1: its fragment description (i.e. lattice cover) changes. Upon receiving the replies from site i and j, site k updates its local fragment table and ER. Then, site I: broa-dca.sts these changes to all other sites. They in turn update their local fragment table and ER. We now discuss the time complexity of procedure AddEdge((a:,y),s). Its com- pleXity is determined by the total number of edges to be traversed in ER'. Let LU B, = { l IE(AUL,)}andGLB,={g|€(AULy)}whereset A’ L;- and L, are shown in Figure 4.15. We denote a, (a,) as the total number of edges to be traversed to compute all descendants (resp. ancestors) of each node in LI] 8,7, (resp. GLBy). Then, it is easy to see that the time complexity of procedure AddEdge((z,y),s) is 0(a, + ay). 105 Edge Deletion Site Is: checks if fragments F, and F, are two distinct fragments. If they are, then no update is necessary on the fragment table. Site Is updates only its local ER and broadcast this change to all other sites. They in turn updates their local ER. If F,- and F, are same fragment (i.e. i = j), then site k sends the update informa- tion (i.e. a deletion of edge (a, y)) to site 2'. After receiving the update information, site i performs the procedure DeleteEdge((m,y),i). This procedure is described in Figure 4.17. We re—use DAG GR(VR,ER) in Figure 4.16 as an example to show the exe- cution of procedure DeleteEdge((x,y),s). Assume that we delete edge (8,9) first and then (8,11) from E3. Thus, site 2 first executes DeleteEdge((8,9),2) and then DeleteEdge((8,11),2). For DeleteEdge((8,9),2), we got A = {< 8,10 >}. Since .5'1={8,10,11 } and 5'2 = {8, 9,10,11},LC2is{< 8,10 >, < 9,10 >, < 12,13 >, < 15,15 > }. For DeleteEdge((8,11),2), we obtain A = {< 8,10 >}. Lattice < 8, 10 > becomes unsuitable because S, is an empty set. It is easy to see from the procedure how LC2 is changed to { < 8,8 >, < 11,10 >, < 9,10 >, < 12, 13 >, <15,15> }. After performing DeleteEdge((m, y),i), site i notifies to site Is: its fragment descrip- tion (i.e. lattice cover) changes. Upon receiving the reply from site 2', site 1: updates its local fragment table and ER. Then, site k broadcasts these changes to all other sites. They in turn update their local fragment table and ER. In the next paragraph, v “J:_‘l‘_1__.i‘ - l a Q 106 we discuss the time complexity of procedure DeleteEdge((z, y),s). DeleteEdge((zr,y),s) /* (x,y): edge; .5: site number; */ begin Let LC, be a lattice cover for a fragment F, at site .9; Let Ep, be an edge set corresponding to F,; Let BF. : BF: —{(:c,y)},Q1={:c}and Q2 = {y}; By using En, compute $1,,le and Sglqu; Obtain A = { < u,w > I u 6 5'1,le /\ w 6 391qu A < u,w >6 LC, }; If (A # 01) { /* Lattice in A may be affected by the deletion of edge (x,y) */ For each lattice < u,w > in A { By using Eh, compute S] = { v I u j v j w}; If (5'1 == ) { /* Lattice < u,w > is not suitable */ LC,=LC,-{}; LC,=LC,U{, };} else { /* lattice < u,w > is still suitable */ By using EFL: compute 52 = { v I u j v j w}; If1|32|75|51|1 { /* The number of nodes lattice < u,w > covers is decreased */ If(u;é;r) LC, = LC,U{ }; If(yafiw) LC.=LC.U{};}}}} end hill-I Figure 4.17: Procedure DeleteEdge((z, y), s) The time complexity of procedure DeleteEdge((m, y),s) is determined by the total number of edges to be traversed in ER. Let LUB, = { u I < u,w > 6 A } and GLB, = { w I < u,w > 6 A } where set A is referred in Figure 4.17. We denote d, ((1,) as the total number of edges to be traversed to compute all descendants (resp. ancestors) of each node in LU B, (resp. CLBy). Then, it is easy to see that the time complexity of procedure DeleteEdge((x, y),s) is 0(d, + d,). 107 4.4.2 Node Addition and Deletion Suppose that a new node x is added to DAG GHQ/R, ER) at the site It. In this case, we assume that site k wants to store node a: in its fragment. Site k first creates a new trivial lattice < 3:,3: >. It then updates its local fragment table by adding < m,n: > to its fragment description. Site k broadcasts this change to all other sites. They in turn update their local fragment table. Suppose that a node a: is deleted from DAG GR(VR, ER) at the site h. Then, by using Algorithm 3, site k can easily locate which fragment contains node 2:. Assume that fragment F,- at site i contains node 11:. Note that i and k are not necessarily distinct. Further note that, by the 2nd update constraint, all the edges incident on node a: are already deleted. Therefore, except for a trivial lattice < :c, a: >, no lattice in the description of fragment F,- covers node 3:. Site 1: updates its local fragment table by deleting lattice < 13,2: > from the description of fragment E. It then broadcasts this change to all other sites. They in turn update their local fragment table. 4.4.3 Performance Analysis of Update Algorithms In this section, we discuss the performance analysis of our update algorithms given in the previous section. We use random F-subgraphs for this analysis. Random F -subgraphs GF(VF, Ep) are created the same way as that given in Section 4.2.5. Let LCP be a lattice cover for an F-subgraph GF(VF, Ep). After each update on CF, the size of LCF may change. We denote this changed LCF by SLO (Semi Local 108 Optimal)-LCF. Besides SLO-LCp, we consider two other LC ps after each update on CF. They are TOP (Theoretically OPtimal)-LCF and N GO (Near Global Optimal)- LCp. N GO-LCF is obtained by applying Algorithm 2 to CF. In order to show the performance of our update algorithm, it is ideal to compare the size of SLO-LCF with that of TOP-LCP. However, TOP-LCp is computationally very difficult to achieve due to the N P-completeness of LCP. Thus, as an alternative, we compare the size of SLO-LCF with that of NGO-LCF. 200 1 I T r 180 SLO — ~ 160 ..- 140 - 120 ~ ILCpl 100 L I. I 2 Q C l 0 22 0 0 8 100 Num er 0 ran om up ates a) IVFI = 1000 and lEFl =1500 200 1 l i 1 180 160 140 120 |LCp| 100 80 60 40 I... 20 ..................... 0 l L l l 0 22 0 0 8 100 Num er 0 ran om up ates b) IVFI = 1000 and IEFI = 4000 llLllll LA Figure 4.18: Effects of updates on ILCpI for a) non-dense and b) dense F—subgraphs Figure 4.18 shows the effect of random updates on the size of LC}: for a non-dense as well as a dense F-subgraph. Note that random updates consist of node addition, 109 node deletion, edge addition, and edge deletion where each type of update is randomly chosen. Figure 4.18a shows that, for a non-dense F-subgraph, the performance of our update algorithms deteriorates very slowly. For a dense F-subgraph, the performance deteriorates faster than a non—dense F-subgraph. This is shown in Figure 4.18b. However, in the following paragraph, we will show that this performance degrade will not cause the frequent application of Algorithm 2 to compute N GO-LCF. We have measured the impact of updates on ILCFI for varying denseness of F- subgraphs. In this experiment, ILCFI is measured after 50 random updates. We averaged the value of ILCFI for 10 F -subgraphs using 10 different seeds, as we did in section 4.2.5. The results are shown in Figure 4.19. From Figure 4.19a, it is easy to see that the performance of our update algorithms is good for non-dense F—subgraphs while it is not as good for dense F—subgraphs. For a non-dense F-subgraph, SLO-LCF remains near optimal for a large number of random updates. This is seen in Figure 4.18a and 4.19a. Thus, we do not have to frequently apply Algorithm 2 to obtain N GO LCF. For a dense F-subgraph 0,2, we do not need to frequently obtain N GO LC p, either. This is justified from Figure 4.19b, which shows that the percentage ratio between ILCFI and IVpI for SLO-LCp grows very slowly with increasing denseness of F -subgraph. Therefore, we conclude that our update algorithms perform well without requiring frequent application of Algorithm 2 for both dense and non-dense F-subgraphs. 110 110 80 70 ILCFI 6°~ .. 50 40 30 20 10 l 100 *SLO <9— 90 WGO +- llfllllllLl 0 1500 2000 2500 3000 3500 4000 IEFI a) IVFI21000 100 90 80 7o 60 Q 50 4o 30 20 l 1 1 I SLO @- rNGO + ' lJllllll Tlllrfi 104 0 1500 2000 250I)E 3000 3500 4000 PI b) |Vp|=1000 and Q = 100 . ILCFI/IVFI Figure 4.19: Snapsho ts of ILCpI and Q after 50 random updates 111 4.5 Conclusion In this chapter, we developed a distributed database model for studying the problem of fragmentation of recursive relations. We used the mathematical properties of lattices to investigate fragmentation schemes. We showed that the lattice structures are powerful representation tools for distributed fragments of recursive relations. Both theoretical and empirical performance analysis of the lattice approach are provided. The empirical analysis verifies the near optimality of the performance of our proposed algorithm. We also showed that lattice approach provides an efficient way of locating the relevant fragments of a recursive query. There are special problems with updating fragmented recursive relations which do not occur in updating fragmented non—recursive relations. We discussed these PFOblems and their solutions. The empirical performance analysis of these solutions are also discussed. Chapter 5 Conclusion and Future Research 5 - 1 Contribution of This Dissertation We have examined recursive query processing strategies for large recursive relations. Recursive query processing requires transitive closure computations. As a special Case of transitive closure computation, the shortest path computation is essential for recursive query processing for the large recursive relations of topographical road In this thesis, we proposed an efficient recursive data organization method called the HiTi graph model. The H iT i graph model allows structuring of large recur- Sive relations for hierarchical abstraction. Based on the H iTi graph structure, we de‘VeIOped a new fast shortest path algorithm named S' PAH . S PAH uses the pre- C01I‘Dutations captured by the H iTi graph to speed up SPSP computation. These 112 113 precomputations are not used in the traditional SPSP algorithms, including A*. By using these precomputations, S PAH dramatically reduces the explored search space. This is shown through our empirical analysis of S PAH on grid graphs. We also stud- ied intra and inter query parallel processing for shortest path computations within the H iTi graph model framework. The traditional intra query parallel shortest path algorithm known as 0T0par is not scalable and was shown to be inefficient on the H iTi graph structure. We proposed a new parallel intra query shortest path algorithm named PAS PAH . We implemented PASPAH and MOTOpar on a BBN GP1000, Which has a nonuni- forn] memory architecture. The BBN GP1000 multiprocessor currently consists of 85 nodes, linked together by a high speed butterfly switch. In this system, the globally shared memory is the sum of the memories local to all processors. Our empirical anal- y Sis of PASPAH on grid graphs showed that it is more scalable and performs better ‘31) an M OTOpar. However, our analysis also showed that the average execution time of PAS PAH is not significantly better than that of S PAH and M 0T0par. Note th at M 0T0par is a version of OTOpar modified to utilize the H iTi graph structure. Next, we proposed a new inter query parallel shortest path algorithm called I ‘9 PAH. The performance evaluation of I S PAH was also performed on the BBN G P 1 000 shared-memory multiprocessor system by using the same grid graphs as those for PAS PAH . The empirical analysis of I S PAH revealed that I SPAH is scalable up to 25 processors although the speedup of I S PAH increases almost linearly for up 1‘. . O 1 0 processors. One of the major performance bottlenecks of I S PAH results from 114 the severe shared-memory access conflicts between processors. We then proposed an improved version of I S PAH named M I S PAH , which reduces the memory-access conflicts through partial data replication. As a result of this replication, M I S PAH is scalable up to 41 processors. From this analysis, we conjecture that the parallel processing for inter query SPSP problems is more promising than for intra query S PSP problems. We then studied the problems of describing and locating the fragments of large re- cursive relations in a distributed database. These two problems are not primary query optimization issues in a centralized database. In a centralized database, distribution and location of the recursive data are easily handled through hashing or predicates. Thus, its query optimization is primarily done on shortest path or transitive closure computations. However, in a distributed database, the above two problems are the major performance bottlenecks for recursive query processing. This is because the COSt of processing recursive queries is strongly dependent on the cost of transmitting information between sites in a distributed database. Traditional use of hashing is not Sui table for distributed recursive databases, since they cannot capture the referencing lOCEAJity of data. We used the mathematical properties of lattices to describe and idelltify the distributed recursive fragments. We found that lattice structures provide a. C'Dtliplete and sound description of recursive fragments. Our theoretical and empiri- Cal performance analysis of the lattice approach verifies its effectiveness in distributed dat abases. 5.2 115 Future Research Shortest path computation on topographical road maps requires accessing the road segments on the disk. Since representation of topographical road maps requires spatial data structures, we need spatial index schemes to enhance the performance of shortest path computation. Traditionally, R, R+, R*, k-d, k-d- b, MD, and GDB trees are used for indexing general spatial objects. However, the above methods are not optimized for the identification of road segments. Thus, more efficient spatial index schemes for the road segments need to be developed. Physically clustering adjacent road segments together in the same disk sector is an important research issue for the performance of shortest path computation. 1f adjacent road segments are not clustered, this will cause very frequent page faults which in turn will slow down the shortest path computation. In navigation systems, a shortest path may need to be recomputed because of changing traffic conditions, such as roadblocking. Instead of recomputing an entire shortest path, it may be possible to optimize recomputation based on previously computed path information. The M I SPAH algorithm given in section 3.3.2 is scalable up to 41 processors. More research is needed for further improvement. APPENDICES Appendix A HiTi Graph Based Distributed ' ’I‘ransitive Closure Algorithm In a distributed database, H iTi graph can be used for describing and identifying local relations (i.e. level 1 subgraphs). That is, each boundary node of a level 1 subgraph 1111i quely describes and identifies the corresponding local relation. Furthermore, the StI‘UCture of H iTi graph allows dynamic decomposition of a transitive closure compu- tation. By utilizing the above two advantages, we developed an efficient distributed tram sitive closure computation. We first describe two primary algorithms named TC.1 an d TC.2 which will be used for our proposed distributed transitive closure algorithm. Their descriptions are given in Figures A.1 and A.2. 116 117 /* U : Relation U(a:,y); J. site number; */ /* k : the highest level number of a subgraph tree ST; */ begin i— - 1; X: {30}; skip: TRUE, BN’-— - boundary node information of SC(S’+1(X)); R} = FU.,.,nyBN}.sg,BNJ‘.site(U Ddy=:c BNj); ifR' =0then{i=i—1;return((ll,z)} L1. WB;=(SB(SC(S'I,“(X)))USW(SC(S',+1(X)))); R— — 7IrR'..1:,WB".,y R; .,sg R' .site(Rj' Dal/=3 WB Ji) while (R 75 0) { skip = FALSE, R3: Rj U R, R“ - "news; .,yBN;. sg,BN;. sitc(R Wyn WBil- R}? } R; = ”I? .,.w: R' .y BN1. .99, BN'. sitc(Rj' My=3 BNj'); if (i— - - 1) then return(R1-, R2, , R}, i); BN‘-+1k = boundary node information of SC(S'+2(X)); Rf?- = "R' .,zR' .;y,BN .sg,BN'-. site(Rj' Mv=r BN‘IH); if (skip = TRUE) then R‘ = 0; if R‘“- = 0 then return(R‘-, R3, ..., RE, i); else{X= S'+1(X); i=i+1; skip- — TRUE; goto Ll; } end Figure A.1: Algorithm TC.1: AscendingPhase (U,j,k) These two algorithms use three different types of relations such as WBJ':(:r,y), B N;(:r, sg, site), and R;(:::, y, sg, site). Relation WBj(a:,y) consists of tuples t where t corresponds to either a level i between or within edge. Relation BN;(x,sg, site) C011 tains tuples t where t.a: corresponds to a level i boundary node, t.sg corresponds to the notation of the level i subgraph to which t.x belongs, and t.site corresponds to tllfi site number storing the level 1 subgraph to which to: belongs. We assume that the: subscript of each level 1 subgraph corresponds to a site number. For example, hQ local relation corresponding to SC}3 lS stored at Site 3. Relation R3 1S used as 118 a working relation for computing a distributed transitive closure. Thus, each site j has Rj-(x,y,sg, site) for each level i where 1 S i S k. Note that we use the same notations as those defined in the chapter 2.3 in TC.1 and TC.2 algorithms. /* Z : Relation 2(219139135te); */ /* j: site number; i: level number; */ begin x = STUD; EN; = boundary node information of SC(X); Z = Wz.:,z.,,,s~;.ag,s~;.me(Z NF, BNi')? WBj = (SB(SC(X)) u swam»; R = 7rZ.x,WB;.y,Z.sg,Z.site(Z ”11:2 W8”; while (R # 0) { skip = FALSE; Z = Z U R; R = 7rR.z,WB;.y,R.sg,R.site(R NF, WBi) — Z i } if (skip = FALSE) then Z = 7rZ.z,Z.y,BN;.sg,BN;.aite(Z NF, BNJi); return(Z); end Figure A.2: Algorithm TC.2: DescendingPhase (Z, j,i) Algorithm AscendingPhase() traverses Hi Ti graph by following edges with non- decreasing level number. To demonstrate Algorithm TC.1, we use the digraph and its corresponding HiTi graph shown in the Figures A3 and A.4 respectively. For site 1, which stores the local relation corresponding to so}, we have we; e { (4,6), (5,7), (6,10) }, we,2 = { (10,11), (11,17), (17,19), (17,20) }, 131v,l = { (4. so}, 1), (5, so}, 1), (0, 50;, 2), (7, 36;, 2), (10, so), 2) }, and 131v,2 = { ( 1 O, SG¥, 2), (11, 303, 3), (17, SGg, 4), (19, 50%, 5), (20, SC; 5) }. Assume that at site 1, AscendingPhase(U, 1, 3) is invoked where U = { (4,4) }. The outputs of Figure A.3: C(V, E) partitioned according to level 3 subgraph tree . n 2 l1 2 I! 23 I 3 l ' 2 l “ i 1’ 3 an/ .3 , 7. IO/ I I6 . Figure AA: A level 2 HiTi graph created from Figure A.2 AscendingPhase(U, 1, 3) are stored in relations R} and R? where R} = { (4, 6, SC; 2), (4, 10, 30.3,, 2) } and Rf = { (4, 11, .963, 3), (4, 17, 30;, 4), (4, 19, 503,, 5), (4, 20, 50:21, 5) }- In contrast to AscendingPhase (U, j, k), DescendingPhase (Z, j, i) traverses Hi Ti graph by following edges with a given level number. We use Hi Ti graph shown in A.4 as an example to demonstrate Algorithm TC.2. Assume that at site 3, l3'E=Sc.e11di11gP11ase(Z, 3, 2) is called where Z = { (4, 11, 503, 3), (4, 17, SC;, 4) } 7 Set X is initialized with SC}3 and it becomes to contain 5C3. Then, site 3 has BN3 = { (11, sag, 3), (14,301,3), (15, 561, 4), (16, $03,, 4), (17, 50;, 4) }, and WE; = (11,14), (14,16), (16,15), (16,17), (15,14) }. After the execution of while 1001) and the join with BNg, Z = { (4, 11, SCQ, 3), (4, 17, SCI, 4), (4, 14, SC;, 3), (4. 10, so}, 4) }. 120 Based on the above two algorithms, we discussed how to compute transitive clo- sure in distributed database systems. In our distributed transitive closure algorithm, we have two types of sites where a transitive closure is computed. They are coor- dinator and participant sites. A site is named a coordinator if a transitive closure computation is initially started at this site. The coordinator decomposes a transitive closure computation into a set of sub-computations if needed. A site is called a par- ticipant if it performs the sub-computation by the request of either a coordinator or another participant. We give two algorithms TC.3 and TC.4 in Figures A5 and A.6 where the former is for a coordinator and the latter for a participant. /* H: a set of nodes; */ /* TC: transitive closure of nodes in H */ /* k : the highest level number of a subgraph tree ST; */ begin D] = "Halal-.171” Dds/=4c Ei); D2 = 0; D2 = Dz U 01; D] = "01.x,E;.y(DI NF: Ej) " Dz; } AscendingPhase(Dg, j, k) returns (R}, R?, .. ., RI, 2); for (m = i downto 1 by -1) { Partition Rg" such that each partition 1" share the same value of Ry‘sg; Let 2 consists of PS of R3"; for each F 6 2 { s = a majority value of f‘.site; SendParticipant (F, m, J, 3); } } TC = 02 U Wait_Result.From_Participant(R,*); /* * represents a site number */ end Figure A.5: Algorithm TC.3: CoordinatorSite j(H,TC,k) 121 /* F : Relation I‘(2:,y,sg, site); */ /* 171: level number; j: coordinator site number; */ begin Receive(F, m, j, 4:); /* * represents a site number */ if (m = 1) { D = 7rl‘.a-,Ejl.y(r M =1: Ej)i R = 7rx.y(r)i while ( D 75 0 ) { R = R U D; D = "D.z,EJ‘.y(D Dds/=3 Ej) — R; } SendCoordinator(R, j); } else { DescendingPhase (F, j, m) returns Z; Partition Z such that each partition T share the same value of Z.sg; Let 2 consists of Ts of Z; for each T 6 Z { Check if any value of T.site is 3; If yes then h = 3; else h = a majority value of T.site; If (h 75 s) then notify coordinator site j that site h will send the coordinator the partial results of the transitive closure; SendParticipant (T, m — 1, j, h); } } end Figure A.6: Algorithm TC.4: ParticipantSite 3(1‘, m, j) To demonstrate Algorithm TC.3, we also use a recursive relation C(V, E) and Hi Ti graph shown in Figures A3 and A.4 respectively. Assume that a recursive query “Find all descendants of node 3” is issued at site 1. Then, site 1 becomes COOrdinator and algorithm TC.3 is executed at site 1. Algorithm TC.3 first computes a. transitive closure of node 3 (i.e. H = { 3 }) with respect to SC} (V11, E11). As a reSult, D; = { (3,4) } and AscendingPhase (D2, 1, 3) is called. After the execution of AscendingPhase (Dz, 1, 3), we got i = 2, RI = 1 (31 6, 56;, 2), (31 101 5617 122 2) } and R? = (3, 11, SCg, 3), (3, 17, 5C3, 4), (3, 19, 5C3, 5), (3, 20, 5C3, 5) }. R? is partitioned into two relations F, = { (3, 11, 30;, 3), (3, 17, SC;, 4) } and F2 = { (3, 19, 503,, 5), (3, ‘20, 503,, 5) }. Next, coordinator site 1 sends its decomposed sub-transitive closure computation requests to site 3 and 5 by executing SendParticipant (1‘1, 2, 1, 3) and SendParticipant (F2, 2, 1, 5). Note that for 1“,, site 4 could be chosen as a participant since there is no majority site number of elements in 1‘]. For RI, since all tuples share the same RI.sg (i.e. SCé), we have only one F; = R} in Z and SendParticipant (1‘1, 1, 1, 2) is executed. Algorithm TC.4 is executed at a participant site upon receiving the sub-transitive closure computation request (e.g. SendParticipant (F1, 1, 1, 2)) from a coordinator. To show 110w participant sites work, SendParticipant (1‘1, 2, 1, 3) and SendPar- ticipant (1‘ 1, 1, 1, 2) mentioned in above paragraph are considered as examples. For SendParticipant (F1, 2, 1, 3), since m = 2, site 3 first calls DescendingPhase (F1, 1, ‘2) and gets Z = { (3, 11, SC},, 3), (3, 14, 503,, 3), (3, 16, SC}, 4), (3, 17, SC}, 4) }. Relation Z is partitioned into two relations T1 = { (3, 11, SC?” 3), (3, 14, 490.33, 3) } and T2 = { (3, 16, SC}, 4), (3, l7, SCI, 4) }. For T1, site 3 sends its deCOI‘DpOSCd transitive closure computation requests to itself by executing SendPar- ticIipant (T1, 1, 3, 3). For T2, site 3 first notifies the coordinator site 1 that site 4 Will send partial results of the transitive closure. Site 3 then sends its decomposed $11 b-transitive closure computation to site 4 by executing SendParticipant (T2, 1, 3, 4) - As for SendParticipant (F1, 1, 1, 2), since m = 1, site 2 computes a transitive c1Osure with respect to SC; (V21, E3). Relation R contains { (3,6), (3,8), (3,9), (3,10) 123 } and site 2 sends R back to the coordinator site 1 by executing SendCoordinator (R, 1). Appendix B Empirical Analysis of Computing Shortest Path for Road Map Queries B.1 ' Performance comparison of shortest path al- gorithms In this section, we empirically compare the performance of Dijkstra’s and Nicholson’s Shortest path algorithms on road map graphs. Their pseudo-codes are described i 11 Figure RI and R2 respectively. For the detailed demonstration of Nicholson’s E11 gorithm, readers can refer to Nicholson’s paper in [77]. The inputs to the following aJgorithms are graph C(V, E), source and destination nodes 3 and d respectively. 124 125 Note that [(x,y) represents the cost associated with each edge (x,y) 6 E. 1. For each node 11 6 V, A(u) = 00. Let Ms) = 0, FS = {s}, and E5 = 0. Select a node v. in PS for which Mn) is minimum. FS = FS — {u} and ES = ES U {u} If (u = d), then Mn) is the shortest path cost and Stop. sweep For every edge (u, v) in E, if A(v) > Mu) + l(v, v), (a) Let /\(v) = Mu) + l(u,v). (b) Let FS = FS U {v} if v ¢ (FS u ES). 7. Go to step 3. Figure 8.1: Dijkstra’s shortest path algorithm To compare the performance of Dij kstra’s and Nicholson’s algorithms on road map graphs, we create two dimensional grid graphs CR(VR, ER) with 4 adjacent nodes. The grid has N at N nodes where each row and column includes N nodes. Thus, in a graph G3, IVRI and IERI are equal to N * N and 4 at N at (N - 1) respectively. These two di mensional grid graphs are considered as typical examples of road maps [64]. Based On the above scheme, we create a grid graph CR where IVRI = 200 a: 200 and lERl = 4 * 200*199. The cost of each edge in ER is generated based on a uniform distribution [1 , 100] with 10 different seeds. As a result, we have 10 different grid graphs CRs. For each grid graph created above, we compute 50 different shortest paths for ‘30 randomly prefixed pairs of source and destination nodes. Let D] J (N I C) be the total number of nodes visited at step 6 (at step 5.(a).i and 6.(a).i) by Dijkstra’s ( Nicholson’s) algorithms. Note that we also count the nodes re—visited for DI J (N I C ) We compare the two algorithms by observing the ratio D] J / N I C . These values are 126 99°!" For each node u 6 V, A,(u)=/\,(u)=oo. Let FS, = FS, = 0 and 2: = y = 0. For each edge (s,u) in E, A,(u) = l(s,u) and FS, = FS, U {v}. For each edge (d, v) in E, A,(v) = l(d, v) and FS, = FS, U {v}. Let C, = min/Mu)” A,(u) and C, = minhm” A¢(v) where u 6 FS, and U 6 FSg. If (Cs S Ct): (a) For each m for which A,(m) = minA’M” A,(u) for all u 6 FS, i. For every edge (m, u) in E, if A,(u) > A,(m) + l(m,u), A. A,(u) = A,(m) +l(m,u) B. if 11 ¢ FS,, FS, = FS, U {u}. (b) Reset :r = minA'M» A,(u). . If (C; < C,), (a) For each m for which At(m) = minMv)» A¢(v) for all v 6 FS, i. For every edge (m, U) in E, if A¢(v) > A,(m) + l(m, v), A. A,(v) = A,(m) + l(m,v) B. if v ¢ FS,, FS, = FS, U {U}. (b) Reset y = min/MM» A¢(v). If minuer,qu. (481“) + Atlull S (minA.(w)>:c 8:0”) + minA¢(v)>y 81(0)) where w 6 FS, and v 6 FS,, the shortest path cost is minu(z\,(u) + At(u)) and Stop. . Go to step 4. Figure B.2: Nicholson’s shortest path algorithm 127 then averaged over the 10 different grid graphs with same source and destination nodes. These are shown in Figure B.3. DIJ NIC 0 5101520253035404550 the number of shortest paths Figure B.3: Performance comparison between Dijkstra and Nicholson’s algorithms Figure B.3 clearly shows that Nicholson’s algorithm visit far less nodes than Dijk- stra’s when they are applied to grid graphs. It is interesting to observe that the ratio DI J / N I C is small for shorter paths. This happens when source and destination nodes are closely located in the graph. However, even in this case, Nicholson’s algo- rithm still performs slightly better than Dijkstra’s. Therefore, we can conclude from Figure B.3 that Nicholson’s algorithm is superior to Dijkstra’s when graphs capture road maps. Next, we present our shortest path algorithm in Figure 3.4 which finds shortest path by simultaneously constructing two trees, one rooted at the source node and tlle other at the destination node. This simultaneous expansion of two trees is the primary difference between our and Nicholson’s algorithms. Note that the two trees in Nicholoson’s algorithm are expanded alternatively. That is, if one tree has the IZlext shortest route than the other, the one with the next shortest route is expanded. This alternating tree expansion may cause one tree to be expanded more than the 128 PPN 9" 5090719» 10. 11. 12. 13. . For each node n 6 V, A,(u)=/\¢(u)=oo and Marked(u) = 0. Let A,(s) = A,(d) = 0, Marked(s)=1 and Marked(d)=2. Let FS,={s}, FS¢={t} and E, = E, = 0. Select 711, and m, respectively from FS, and FS, such that A,(m,) and At(mt) are minimum. Let CI = A,(m,) + At(m,). Let FS, = FS, — {711,}, ES, = ES, u {m,}, FS, = FS, — {m,}, E3, = ES¢U{m,}. Obtain I = { u I Marked(u) = 2 for all u 6 FS, }. [f I 75 0, C2 = min,€1(/\,(u) + A,(u)). Otherwise C2 = 00. If (m, = 171,), the shortest path cost is min(C1,C2) and Stop. 1f (C2 5 C1), the shortest path cost is C2 and Stop. If (Marked(m,) = 0), Marked(m,) = 1. 1f (Marked(m,) = O), Marked(m,) = 2. For every edge (m,, u) in E, (a) If A,(u) > A,(m,) + l(m,,u), i. A,(u) = A,(m,) + l(m,,u) ii. If u 9! (FS, 11 ES,), FS, 2 FS, u {ii} For every edge (m,, U) in E, (a) If )1,(v) > A¢(m¢) + l(m,,v), i. /\¢(v) = /\¢(m¢) + l(m,,v) ii.1fv g (FStUESg), FS¢ = F5: U {U} Go to step 4. Figure 3.4: Our two tree expansion shortest path algorithm 129 other. As a result, this unbalanced tree expansion may delay finding the shortest path. However, in our approach we expand the shortest paths of the two trees. In order to show the performance comparison between these two algorithms, we used the same data set and methods as we used for those between Dijsktra’s and Nicholson’s. This is shown in Figure B.5. 1.2IIIIII1II I}: NIC/JUN e- 1 1 v NIC 1-1 / 110111 JUN 1:06 1.04 1.02 0.9; l l 1 l 1 l l l l 0 51015202530354045 the number of shortest paths 4 ‘2 lllTlflll lllllll O1 C Figure B.5: Performance comparison between Nicholson’s and Our algorithms In Figure B.5, JUN represents the total number of nodes visited (including the revisitation) at step 11.(a) and 12.(a) by our algorithm. Figure B.5 clearly verifies Our conjecture that Nicholson’s alternating tree expansion delays finding the shortest Path. From this figure, we can conclude that our algorithm on the average always performs better than Nicholson’s. B .2 Conclusion 111 this appendix, we studied a single pair shortest path problem on the domain of I‘Oad map queries. We have compared Nicholson’s shortest path algorithm with that of Dijkstra. Our empirical analysis shows that Nicholson’s two tree expansion approach 130 outperforms Dijkstra’s one tree expansion method for road map queries. We prOposed a more efficient two tree expansion shortest path algorithm than Nicholson’s. This is justified through empirical analysis. Appendix C A Survey of Database Problems in Intelligent Transportation System C. 1 Introduction Intelligent Transportation Systems, ITS, are increasingly being used to automate the navigation function of the automobile. The navigation infrastructures such as Global Positioning Systems, GPS, and beacons have a profound impact on the design and development of modern navigation systems. Along with this, ITS also requires power- ful software systems including very large database system support. The development of these infrastructures requires substantial resources involving coordination on a mammoth scale. The US. government has played a major role in providing the nec- essary leadership in the development of these infrastructures. Software systems and databases have been developed by governmental as well as non-governmental agencies 131 132 such as Universities and corporations. These infrastructures and the software systems are built on top of large digital map databases. Much work has been done in the area of Geographic Information Systems to create large databases of digital maps. In this paper we survey the database work which is relevant to ITS. We also discuss other important work that is necessary for an understanding of ITS. ITS must be able to look up information on a digital map in order to automate navigation. A digital map not only stores location information about objects, but also shows the relative locations of objects. These objects are of differing dimensions such as points, line segments, and regions [106]. Multidimensional objects require more complex modeling and processing than those required in traditional applications. A database design for ITS must also capture the spatial relationships between these objects. One important spatial relationship is that of neighbor. The clustering of neighbor objects in the same area of the disk is an important database research issue [93]. A competing topic is the dispersion of stored neighbor objects to different disks to enable multiprocessing [55]. The database structures for these very large databases must be effective in quickly accessing multidimensional objects. For example, the map database stored for a system developed locally for the Twin Cities in Minnesota is approximately 10 Mbytes [92]. A system which uses country or continent wide data will access a much larger database. Efficient organization of large multidimensional databases for fast access is a paramount database research issue in this area. Accessing large databases of multidimensional objects requires efficient spatial indexing schemes. In section 0.2 J Ilj‘l 133 we discuss such indexing schemes. The currently available commercial navigation systems use one or more of these index methods. However the actual details of the indexing scheme used by particular current commercial systems is not known because they are proprietary. The currently available position systems, such as GPS and dead reckoning, play an important role in the design of these database systems. If the GPS system is used it automatically determines latitude, longitude, and altitude [1]. However, the returned position is not accurate. If dead reckoning is used, the position of a vehicle is known precisely at certain locations. As the vehicle travels, the new location is calculated. Dead reckoning systems often use a beacon to find the initial known position. The position of the vehicle is then calculated as a displacement from the known position as the vehicle moves. This calculation of displacement is also prone to errors. An important issue in designing ITS is to minimize these errors. The correct determination of a vehicle’s current location is essential for the successful operation of an automated navigation system. Positioning technologies such as GPS and dead reckoning are often combined to deliver the desired accuracy needed by ITS [61, 67]. In ITS, quick real time response to large database queries is essential. The simplest query in ITS is “where am I?” This query is answered partially by the positioning technology which determines the coordinates of the vehicle. These coordinates are usually mapped to an object in the database such as a landmark, a road, or a region. An important database design issue is the representation of the inherent hierarchy of the data such as roads and landmarks within cities, and cities within counties. An H-i 134 example query using this hierarchy is ”List all counties in Michigan which intersect Interstate Highway 194.” A number of typical road map queries have been discussed in [92]. Another issue of importance is the placement of the data; on board the vehicle or at a central site. If the data is 011 board the vehicle, then each vehicle must have a copy of the necessary data in order to answer the queries. Real time updates through a central system is expensive. These updates must be communicated in real time for up—to—date queries at the vehicle. If the data is maintained at some central site, the vehicle must access this remote database in order to receive the data or answers to its queries. 111 this model the maintenance of the database is simpler but at the cost of increased communication overhead. The remainder of this paper is organized as follows. In section C2 of this paper ‘ we describe the data model for ITS, and discuss classes of road map queries for ITS in section C.3. In section C.4 we briefly review the existing positioning technologies. Section C.5 describes a number of commercially available intelligent transportation systems. C.2 Data Models for Vehicle Navigation Systems ITS databases contain information about objects such as landmarks, roads, and ge- ographic regions We can classify these objects as O-Dimensional, 1-Dimensional, 2- Dimensional, and 3-Dimensional objects [106]. 0-Dimensional objects are points and 135 represent landmarks such as building sites or intersections of roads. A typical 1- Dimensional object is a road segment, and it is a primary object for road map queries. Roads are formed from straight line segments which in turn form the network of roads. Examples of 2-Dimensional objects are geographic regions such as cities, counties, and states. 3 dimensional objects are necessary for recognizing certain types of landmarks. A11 object such as a building must be described by latitude, longitude, and altitude. Efficient spatial indexing schemes for these multidimensional objects are critical to the performance of ITS queries. C.2.1 Indexing Schemes for accessing multidimensional road objects Indexing schemes for ITS databases can be classified into 2 groups based on the type of object indexed. They are point indexes for point objects, and indexes for higher dimensional objects. We first discuss point indexes. In order to index objects such as line segments or regions, each vertex is a point to be indexed. Alternatively, the object may be considered as a point in a higher dimensional space. Point indexing schemes include quad-trees, oct-trees, point quad-trees, k-d trees, k-d-B-trees, and the grid file. A quad-tree is an indexing scheme for objects in a planar region. A planar object consists of a collection of points in the plane. The quad-tree indexes the planar object by dividing the planar region into regular quadrants. This is a recursive process, and 136 halts when all points in a quadrant are of the same type, i.e., part of the object or not part of the object. The oct-tree is a generalization of the quad-tree to 3 dimensional space. The quad-tree can be further generalized to any k-dimensional space. Quad- trees are efficient mechanisms for computer vision, but not for ITS. The quad—tree (or oct—tree) is not height balanced. The point quad-tree (or point oct-tree) picks points of interest at which the plane (or 3—dimensional space) is subdivided into unequal size quadrants. This gives an index tree which is closer to being height balanced [88]. The k-d tree is a generalization of the point quad tree. A k-d tree is a binary tree in which k levels of the tree are necessary to partition the k-dimensional region into 2" regions. One significant difference between k-D trees and point quad-trees is that in k-d trees these 2“ regions do not meet in a single point [88]. The grid file approach partitions a k-dimensional space into non-uniformly sized k—dimensional grid blocks in order to index the data points located within the blocks [76]. This grid file approach is not suited to indexing spatial objects of higher dimension than points. The primary reason is that k-dimensional objects are mapped into points in 2k-dimensional space in the grid file approach. As a result, most grid blocks will not be used. Various versions of the grid file have been proposed in order to improve their performance [28, 63, 96, 12]. Freeston [28] has developed a nested and balanced grid (BANG) file which has a tree structured directory. This tree struc- tured directory has the self height balancing B-tree property. Kriegel and Seeger [63] have proposed a method called PLOP-Hashing to eliminate the grid directory. This is a multidimensional extendible hashing scheme which expands the pages of the file 137 corresponding to grid blocks in a piecewise linear fashion. However, the problem with PLOP-Hashing is that it can not fully capture natural ordering of higher di- mension spatial objects. Six and Widmayer [96] present a multilayer paradigm for transforming point index structures (such as the grid file) into index structures for k-dimensional intervals. This paradigm allows the preservation of the natural order- ing of interval data. Blanken et. al. [12] propose a generalized grid file which more fully preserves the natural ordering between geometric objects. The generalized grid file integrates the B+-tree, the grid file, and the k-d-B-tree. The second group of indexing schemes are better suited to indexing k—dimensional objects in k-dimensional space. The most important indexing scheme is the R- Tree [38]. The R—tree is a generalization of the B—tree to higher dimension objects. All objects are in the leaves of the tree, and the leaves are all at the same level. The R-tree functions by enclosing objects in rectangles. Nodes of the R-tree are formed by enclosing groups of rectangles in a larger rectangle which becomes an R-tree node. A problem with the R-tree structure is that these rectangle nodes in the tree overlap, which causes the lookup procedure to follow multiple tree paths. Several modifica- tions to the R—tree have been developed to alleviate this problem, such as the R+-tree, the R*-tree and the packed R-tree [10, 35, 79, 87]. These alterations to the R—tree prevent overlap (R+-trees) or minimize the overlap by using a complicated algorithm (R*-trees). A weakness of the R—trees is that rectangles are the stored object. An- other weakness is that R-trees have poor storage utilization. Objects such as diagonal line segments and convex objects are not well represented by bounding rectangles. 138 The cell tree is a modification of the R—tree in which the bounding regions do not need to be rectangular [36]. Jagadish uses bounding polyhedra in the P-tree [54] to eliminate the space wasted by bounding rectangles. Jagadish has also proposed a line segment index scheme based on the Hough transform. This index scheme always outperforms minimum bounding rectangle index schemes for retrieving line segments which go through a specific point or intersect a specific line segment. Several structures improve the worst case storage utilization found in R-trees. The MD—tree [75], like the R-tree, is a generalization of the B-tree to higher dimensions utilizing bounding rectangles. The storage utilization of the MD-tree has a worst case of 66.7% due to the dynamic readjustment of split regions. The GBD-tree [78] is similar to the R-tree, but uses extra data to increase the storage utilization. GBD- trees also speed the processes of insertion and deletion over R-trees, but at a higher memory cost and at a higher CPU. cost for large extent data. Freeston has pro— posed a generalization of n-dimensional B-trees in which the worst case performance is controlled [29]. This solution recursively partitions the data space, guarantees loga- rithmic access and update, and has a worst case 33% storage utilization. The hB-tree proposed by Lomet and Salzberg [69] has competitive storage utilization. The hB-tree differs from B+-trees in that the index terms are organized into k-d trees. Kolovson and Stonebraker propose segment indexes [62] as an extension to database indexing structures such as R-trees. These segment intervals improve the search performance on interval data with non—uniform length distributions. 139 C.2.2 Large Database Issues The amount of stored data is massive. If the data is kept entirely at the vehicle, then the vehicle system is responsible for all queries and updates. This requires a powerful system in the vehicle, both in terms of storage space and processing power in order to access the information. External communications will be minimum for this configuration. If the data is stored entirely at some site outside of the vehicle, then the vehicle system will be primarily a communications system. The external processor(s) must be extremely powerful since all the vehicles in the processing area will be sending their information requests to that external site. A third alternative is to distribute the data and processing between the external site and the vehicle. In this case, the vehicle can download the needed data from an external system, then answer queries on that downloaded data. Troy Michigan has a system developed by Siemans which uses this distributed approach. There are currently a number of systems under development and testing. These systems are restricted to a small geographic area. In order to develop a system which is viable in a large scale system, such as the continental United States, large database issues such as the clustering of data must be explored. Shekhar et. al., [94] discuss a partitioning technique based on max out partitioning of a similarity graph whose nodes are data items. The similarity is the probability that the data items will be accessed together. This technique assigns the data items to distinct disks for a speed up using parallel processing. The size of the database is indicated by Shekhar, et.al [92], in that the storage size of the data for the Twin Cities metropolitan area 11" 140 in Minnesota is 10 Mbytes. Jagadish discusses clustering of data so that objects which are close in multidimensional space are close together after mapping into a one dimensional [55] in order to minimize disk accesses. C.2.3 Update There are two different types of algorithms which are useful in automated navigation. A real-time system must be continuously updated according to the changes in road conditions. This also implies that the shortest route is dynamic; that is, the route planned at the beginning of the trip may not be the route finally taken. Methods to alter an already computed path using the computed path information could be faster than completely determining the path from the new current location. For long trips, a pre—computed path may be used as part of the overall trip. If it can be determined that the route from source to destination must pass through certain locations, then shortest path information which is already computed between those locations can be used in order to improve the speed of the route calculation. Updating the multidimensional objects in the database can be costly. The cost depends on the storage mechanism used for the objects. There is also the problem of changing the stored information about the relative locations of objects. For example, if a road is closed, all the information about the intersecting roads must be updated. El—"_ a." I" ..-4 I 0' 141 C.3 Queries 011 a 0-D object, a typical query is to find the landmark (or nearest landmark) for a particular location. Locations are to be described by latitude and longitude. The queries on l-D objects are more common, since they include all the find the path types of queries. Some of these queries are: find the shortest path between two locations; find the shortest path between two locations going through some list of other locations; find the shortest path between two locations not going through any locations in some list. There is also the query to determine which road segment goes through a particular location. The database must include the connectivity information about road segments in order to answer these queries. One practical problem that arises is that there is as yet no accurate large area map. The most common starting point for automated navigation is the Tiger file produced by the US. Census Bureau. These files have only road map accuracy. For 2-D objects, typical queries are: which region includes a particular landmark, which region includes a particular road segment. Another query type is to determine ' the neighboring regions of a particular region. Of all these types of queries, the predominating ones are the find the path queries. These types of queries are usually recursive, and can be answered by many types of algorithms, such as transitive closure, recursive query processing, partial transitive closure, depth first search, and A* [2, 3, 4, 7, 8, 21, 22, 26, 30, 40, 42, 46, 48, 49, 50, 52, 53, 57, 59, 70, 99, 101, 103]. The interface between the system and the driver gulf .- -.‘;§i; Z.- 142 must be as unobtrusive as possible. A distracting system could become the cause of accidents. Some possibilities include voice recognition or point to menu choices for input. For output, voice is the least distracting; reading a map with a blip to indicate a location on a map is more distracting. C.3.1 Optimization Storing the data in an hierarchic fashion can help optimize the query processing. One method of storing road map data which has been explored is storing the road infor- mation in a hierarchy. That is, interstate expressways are at level 1, state highways are at level 2, and county roads are at level 3. An example of the use of this in finding the path algorithms is that a path from a source to a destination uses level 2 and level 3 roads in the immediate area of the source and destination; level 1 roads are used to find a path between these areas. This type of method does not guarantee an optimal path, however, the amount of processing to find a good path is greatly reduced. In general, the queries must access objects of various types. Indexing schemes such as k-d-trees and point quad trees are efficient at finding point data (regions can be transformed into higher dimension points). These methods suffer from an unbalance in the tree heights. This leads to multiple disk accesses for a single object. The methods based on B-trees, such as the R-tree, are better in terms of finding a leaf node in the tree which contains the desired search object, however, multiple leaves may be discovered in the search due to overlap. The MD-tree and GDB-tree were developed to decrease the number of disk accesses in a search for an object. Shekhar 143 and Liu have presented a clustering method for storing objects such that connected objects are likely to be stored in the same disk sector [93] Shekhar and Yang have described MoBiLe Files in which the population density of objects determines the sector location for storage [90]. The number of processors available to answer queries is a vital factor in optimiza- tion. For a single processor system, storing neighboring objects in the same bucket is efficient, since all probable desired objects can be retrieved with a single disk ac- cess of that bucket. In a multiprocessor environment, it is more efficient to store the neighboring objects in different buckets on different disks. All probable desired search objects can be accessed in parallel. C.3.2 Query Types We divide the types of queries into two categories, shortest path, which has been extensively covered in the literature, and all other types of queries which we discuss in this section. One query is to retrieve certain information about an object in the database. For example, return the name of a road at a particular latitude and longitude; list the nearest landmark(s) (building, road intersection, etc.) to a certain location. Geographical data is inherently hierarchical. Queries are often asked in this context: list all the restaurants in a particular city; list all the roads in a particular county which are currently undergoing repair. Another classification of queries is by neighborhood. List all the national parks near 196; list all public colleges within 200 miles of home. A query such as list all the roads which intersect Maple street [RAJ—A- " _ weave : . 144 within a particular county will use hierarchic information as well as the connectivity information which is necessary to answer shortest path queries. C.4 Positioning Technology One of the main components of a navigation system is the current position determi- nation system. There are a number of hardware systems available for determining the current position. Typical examples of them include global positioning system (GPS), differential global positioning system (DGPS), Loran-C, dead reckoning, and map matching. The above systems can be used either independently or combined in the navigation system to enhance the accuracy of the vehicle’s position. In this section, we discuss the above systems. The U .S. government has placed satellites in orbit around the earth. GPS func- tions by sampling signals from several satellites. The hardware systems are able to use this sampled information to determine the absolute current position of the ve- hicle in latitude and longitude. GPS is inaccurate and the error can be as great as 100m [61]. Many papers have been published which show methods of reducing this error [1, 27, 100]. DGPS is a combination of on-board GPS hardware and GPS hardware at a base station. Here, the precise location of the base station is known. It is assumed that the positional error, determined by on-board GPS hardware, is identical to that deter- mined at the base station. By adjusting the reported position of the vehicle according 145 to the error at the base station, the error factor can be reduced to as little as 5m. Problems with DGPS are that the base station must communicate with the vehicle, requiring additional capabilities on the receivers [27]. Loran-C is a system for sending location information by radio signals from land based sites. Loran-C signals are propagated along the earth’s surface. The vehicle hardware samples signals from multiple transmitters, and determines a location. The positional error of Loran-C is 500m, but can be reduced to 100m. This compares favorably with that of GPS [66]. Dead Reckoning is a method of locating one’s current position on a map, given an accurate starting position and the vectors representing headings and velocities of vehicles. The instruments that are used in dead reckoning to measure changes in a vehicles direction and speed includes: 0 Wheel odometers for measuring distance, these are subject to changes in tire inflation, tire tread wear and slippage. o Compasses for determining direction, these are subject to errors due to the magnetic field of the vehicle and changes in the earth magnetic field in different locations. 0 Gyroscopes for measuring angular velocities of the vehicle, they suffer from changing temperatures which cause drifting problems. 0 Accelerometers for measuring changes in acceleration, both Gyroscopes and accelerometers cannot accurately measure small changes in movement [67]. It is possible to precisely determine the current position if a complete and accu- rate information about a vehicles movements and its starting point is given. However, in reality, the information gathered from the instruments for a vehicle’s movements 146 is likely to be inaccurate. As a result, the dead reckoning method accumulates er- rors as the vehicle continues to move [67]. This error accumulation is the primary disadvantage of the dead reckoning method. Map matching is a method of correcting errors in vehicle’s position caused by either GPS or dead reckoning systems. It tries to locate a logical path on the map that closely resembles the physical path traveled. For example, if after traveling 5 miles from the starting point the vehicle turns 90 degrees to the right, the map— matching algorithm will search for a road to fit the new direction. C.5 Current Systems In this section, we describe currently operational vehicle navigation systems in Eu- rope, U.S.A. and Japan. The current vehicle navigation systems can be classified based on 3 different positioning technologies. They are systems based on dead reck- oning with map matching, GPS, and the combined use of GPS and dead reckoning with map matching. The systems based on the dead reckoning with map matching include Ali Scout, Euro-Scout, Autoguide, Carin, and ETAK Travelpilot [14, 15, 18, 85, 97, 98]. Among these system, ETAK Travelpilot is a stand-alone system which does not require any infrastructure [14, 18]. Its map databases are stored in one CD-ROM disk. Trav- elpilot provides two major functions to a driver. One is to display to the driver a road map of the area surrounding the car. The other is to locate destinations by the 147 street address, and to indicate those destination on the map. The main disadvantage of this system is that it does not provide a route guidance to the driver. The rest of the systems [15, 85, 97, 98] are based on the infrastructure called Beacon. Bea- cons are usually installed at selected traffic signal lights and other strategic locations. They broadcast traffic information to the vehicles, such as the current location and the description of the surrounding road networks. As a result, these beacon-based navigation systems provide most recent real time route guidance to the drivers. In- teresting thing to note is that Ali-Scout and Euro-Scout do not require the car be equipped with CD—ROMs maintaining road maps. In other words, vehicles receive all necessary map data from the Beacons. The GPS-based system are GuideStar, Milemaster, Satellite Cruising System, and N V-l [41, 95]. Milemaster [41] is based on the text—based routes databases while others are based on the digitized road maps. Unlike beacon-based systems, GPS- based systems do not provide the most up—to—date real time routing guidance to the driver. Advance, Multi Vision, and Socrates are the typical examples of the navigation systems using GPS as well as dead reckoning with map matching [13, 108]. Compared with the other systems based on either GPS or dead reckoning with map matching alone, these system provide very accurate current position of the vehicles. Two of the most popular providers of digital maps and software for map access are ETAK and NavTech[][]. ETAK is a provider of a map database and software for ac- cessing the database. ETAK has been instrumental in a number of ITS projects such a." 4". 148 as Pathfinder ATIS in the Los Angeles area; TravTek in greater Orlando, Florida; Minnesota Guidestar; TRiMS (Trip Reduction Information Management System); and'others. ETAK actually uses two separate databases: one database for drawing maps and another network database with connectivity information for path calcu— lations. The network database is organized in connectivity—based clusters to reduce access time. ETAK uses a modified Moore’s algorithm for path finding. NavTech developed their own data model which has some of the features of the relational data model to support a navigation system. The physical organization of database is based 011 the k-d trees which uses hierarchical notions of parcels, zones, and regions. Parcels are lowest logical entity and it defines a geographical area with a certain number of road segments (always less than some maximum) [39]. The stored regions represent physically overlapping geographic areas. The NavTech database is compiled from a set of standard files on IBM main- frame computer, for a particular region. It contains sufficient detail, accuracy, and breadth to support the advanced transportation and navigation applications. The database contains navigation and digital cartography information partitioned into regions. The data includes detailed city and inter-town databases including traf- fic control data such as time dependent restriction on road accesses. The NavTech database consists of 7 database layers which are customer specific, points of interest, cartography, geo-political, navigation, path, and geometry layers [19]. NavTech allows the database to be compiled into different standards such as the European standard called GDF(Geographic Data File) [17]. Bibliography [1] [‘21 [3] [41 151 [6] [7] 181 19] M. Abousalem and E. Krakiwsky, ”A Quality Control Approach for GPS-Based Automatic Vehicle Location and Navigation Systems” IEEE Int ’1 Conf. on Ve- hicle Navigation and Information Systems(VNIS I VHS) pp. 466-470, 1993. R. Agrawal, “Alpha: An Extension of Relational Algebra to Express a Class of Recursive Queries”, IEEE Transactions on Software Engineering, Vol. 14, No. 7, 1988. R. Agrawal, A. Borgida, and H. Jagadish, “Efficient Management of Transitive Relationships in Large Data and Knowledge Bases”, In Proc. ACM-SICMOD 1.98.9 Intl. Conf. on Management of Data, pp. 253-262, 1989. R. Agrawal, S. Dar, and H. Jagadish, “Direct Transitive Closure Algorithms: De- sign and Performance Evaluation”, In ACM Transactions on Database Systems, Vol. 15, No.3, pp, 427-458, September 1990. R. Agrawal, and H. Jagadish, “Algorithms for Searching Massive Graphs”, In IEEE Transactions on Knowledge and Data Engineering, Vol. 6, No.2, pp. 225- 238, April 1994. S. Azuma, K. Nishida, and S. Hori, “The Future of In-Vehicle Navigation Sys- tems”, IEEE Int ’1 Conf. on Vehicle Navigation and Information Systems{VNIS I VHS), pp. 537-542, 1994. F. Bancilhon, “Naive Evaluation of Recursively Defined Relations”, 0n Knowl- edge Base Management Systems — Integrating Database and AI systems, Spring Verlag, 1985. F. Bancilhon and R. Ramakrishnan, “An Amateur’s Introduction to Recursive Query Processing Strategies”, In Proc. ACM-SICMOD 1986 Intl. Conf. on Man- agement of Data, pp. 16—52, 1986. J. Bander and C. White, “A New Route Optimization Algorithm for Rapid Decision Support”, IEEE Int ’1 Conf. on Vehicle Navigation and Information Systems(VNIS I VHS), pp. 709-727, 1991 149 150 [10] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger, “The R* —tree: An Efficient and Robust Access Method for Points and Rectangles”, Proc. ACM- SICMOD 1990 Intl. Conf. on Management of Data, pp. 322-331, 1990. [11] B. Becker, H. Six, and P. Widmayer, “Spatial Priority Search: An Access Tech— nique for Scaless Maps”, Proc. ACM-SICMOD 1991 Intl. Conf. on Management of Data, pp. 128—137, 1991. [12] H. Blanken, A. Ubema, P. Meek, and B. van den Akker, ”The Generalized Grid File: Description and Performance Aspects”, Proc. IEEE 6th Int’l Conf. on Data Engineering, pp. 380-388, 1990. [13] D. Boyce, A. Kirson, and J. Schofer, “Design and Implementation ofADVANCE: The Illinois Dynamic Navigation and Route Guidance Demonstration Program”, IEEE Int ’1 Conf. on Vehicle Navigation and Information Systems(VNIS IVHS), pp. 415-426, 1991 [14] .1. Buxton, S. Honey, W. Suchowerskyj, and A. Tempelhof, “The Travelpilot: A Second-Generation Automotive Navigation System”, IEEE Transactions on Vehicular Technology, Vol. 40, No. 1, pp. 41-44, February 1991. [15] I. Catling and B. McQueen, “Road Transport Informatics in Europe — Major Programs and Demonstrations”, IEEE Transactions on Vehicular Technology, Vol. 40, No. 1, pp. 132-140, February 1991. [16] S. Ceri, M. Negri, and G. Pelagatti, “Horizontal Partitioning in Database De- sign”, In Proc. ACM SICMOD 1982 Intl. Conf. on Management of Data, 1982. [17] H. Claussen, W. Lichtner, L. Heres, P. Lahaije, and .1. Siebold, “GDF, A Pro- posed Standard for Digital Road Maps to be Used in Car Navigation Systems”, IEEE Int ’1 Conf. on Vehicle Navigation and Information Systems(VNIS IVHS), pp. 324-330, 1993. [18] H. Claussen and R. GmbH, “Status and Directions of Digital Map Databases in Europe”, IEEE Int 7 Conf. on Vehicle Navigation and Information Sys- tems(VNIS IVHS), pp. 25-28, 1993. [19] Navtech Company Document, Navigation Technologies, 740 East Arques Avenue, Sunnyvale, CA USA 94086-3833 ' [20] T. Cormen, C. Leiserson, and R. Rivest, “The Set-Covering Problem”, In In- troduction to Algorithms, MIT Press, Cambridge, Mass. and McGraw-Hill, New York, 1990, Chapter 37.3, pp. 974-978. [21] I. Cruz and T. Novell, “Aggregate Closure: An Extension of Transitive Closure”, Proc. IEEE 5th Int’l Conf. on Data Engineering, 1989. [22] S. Dar and H. Jagadish, “A Spanning Tree Transitive Closure Algorithm”, Proc. IEEE 8th Int ’1 Conf. on Data Engineering, 1992. ‘4: 1 1' ._ 151 [23] N. Dec and C. Pang, “Shortest-Path Algorithms: Taxonomy and Annotation”, Networks, Vol. 14, pp. 275-323, 1984. [24] W. Dijkstra, “A Note on Two Problems in Connection with Graphs”, In Numer. Math., Vol. 1, pp. 269-271, 1959 [25] S. Dreyfus, “An Appraisal of Some Shortest-Path Algorithms”, Operation Re- search, Vol. 17, pp. 395-412, 1969 [26] J. Eder, “Extending SQL with General Transitive Closure and Extreme Value Selections”, IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 4, 1990. [27] C. Feijoo, J. Ramos, and F. Perez, ”A System for Fleet Management using Dif- ferential GPS and VHF Data Transmission Mobile Networks” IEEE Int ’l Conf. on Vehicle Navigation and Information Systems(VNIS I VHS), pp. 445-448, 1993. [28] M. Freeston, “The BANG file: A New Kind of Grid File”, Proc. ACM-SICMOD 1987 Int ’l. Conf. on Management of Data, pp. 260-269, 1987. [29] M. Freeston, ”A General Solution of the n-dimensional B—tree Problem”, Proc. ACM-SICMOD 1995 Int ’1. Conf. on Management of Data, pp. 80-91, 1995. [30] D. Galperin, “On the optimality of A*”, In Artificail Intelligence, Vol. 8, No. 1, pp. 69-76, 1977. [31] G. Gallo and S. Pallottino, “Shortest Path Methods: A Unifying Approach”, Math. Programming Study, Vol. 26, pp. 25-36, 1984. [32] S. Ganguly, A. Silberschatz, and A. Tsur, “A Framework for the Parallel Pro- cessing of Datalog Queries”, In Proc. ACM-SICMOD 1990 Intl. Conf. on Man- agement of Data, pp. 143-152, 1990. [33] M. Garey and D. Johnson, “Computers and Intractability: A Guide to the The- ory of NP-Completeness”, W.H. Freeman and Company, New York, 1979, pp. 222. [34] B. Golden and T. Magnanti, “Deterministic Network Optimization - A Bibliog- raphy”, Networks, Vol. 7, pp. 149-183, 1977. [35] D. Greene, “An Implementation and Performance Analysis of Spatial Data Ac- cess Methods”, Proc. IEEE 5th Int ’1 Conf. Data engineering, pp. 606-615, 1989. [36] O. Giinther, “The Design of the Cell Tree: An Ob ject-Oriented Index Structures for Geometric Databases”, Proc. IEEE 5th Int’l Conf. Data engineering, pp. 598-605, 1989. [37] O. Giinther and H. Noltmeier, “Spatial Database Indices for Large Extended Objects”, Proc. IEEE 7th Int ’1 Conf. Data engineering, pp. 520-526, 1991. 152 [38] A. Guttman, “R-trees: A Dynamic Index Structure for Spatial Searching”, Proc. ACM-SICMOD 1987f Intl. Conf. on Management of Data, pp. 47-57, 1984. [39] J. Guzolekm and E. Koch, “Real-Time Route Planning in Road Networks”, IEEE Int ’1 Conf. on Vehicle Navigation and Information Systems(VNIS I VHS), pp. 165-169, 1989. f [40] .1. Han, G. Quadah, and C. Chaou, “Processing and Evaluation of Transitive Closure Queries”, Proc. Int ’1 Conf. on Extending Database Technology, 1988. [41] S. Hoffman and C. Stewart, “Text-based Routing: An Affordable Way Ahead ?”, IEEE Int ’l Conf. on Vehicle Navigation and Information Systems(VNIS I VHS), pp. 45-48, 1993. [42] M. Houtsma, P. Apers, and S. Ceri, “Distributed Transitive Closure Computa- tions: The Disconnection Set Approach”, In Proc. of the 16th VLDB Conference, pp. 335-346, 1990. [43] M. Houtsma, F. Cacace, and S. Ceri, “Parallel Hierarchical Evaluation of Tran- sitive Closure Queries”, In Proc. Ist Int. Conf. on Parallel and Distributed In- formation Systems, pp. 130-137, Dec. 1991. [44] M. Houtsma, P. Apers, and G. Schipper, “Data Fragmentation for Parallel Tran- sitive Closure Strategies”, In Proc. IEEE 9th Int ’1 Conf. Data Engineering, pp. 447-456, 1993. [45] Y. Huang and J. Cheiney, “Parallel Computation of Direct Transitive Closures”, In Proc. IEEE 7th Int ’l Conf. Data Engineering, pp. 192-199, 1991. [46] K. Hua, J. Su, and C- Hua, “Efficient Evaluation of Traversal Recursive Queries Using Connectivity Index”, In Proc. IEEE 9th Int ’1 Conf. Data Engineering, pp. 549-558, 1993. [47] G. Hulin, “Parallel Processing of Recursive Queries in Distributed Architec- tures”, In Proc. of the 5th Intl Conf. on VLDB, pp. 87—96, 1989. [48] Y. Ioannidis, “On the Computation of the Transitive Closure of Relational Op- erators”, Proc. of the 13rd VLDB Conference, 1987. [49] Y. Ioannidis and R. Ramakirishnan, “Efficient Transitive Closure Algorithms”, In Proc. of the 14th VLDB Conference, pp. 382-394, 1988. [50] Y. Ioannidis, R. Ramakirishnan, and L. Winger “Transitive Closure Algorithms Based on Graph Traversal”, In ACM Transactions onDatabase Systems, Vol. 18, No.3, pp. 513-576, September 1993. I [51] K. Ishikawa, M. Ogawa, S. Azume, and T. Ito, “Map Navigation Software of the Electro Multivision of the ’91 Toyota Soarer”, Int. Conf. on Vehicle Navigation and Information Systems(VNIS I VHS), IEEE, (1991) 463-473. 153 [52] H. Jagadish, R. Agrawal, and L. Ness, “A Study of Transitive Closure As a Re- cursion Mechanism”, In Proc. ACM-SICMOD 1987 Intl. Conf. on Management of Data, pp. 331-344, 1987. [53] H. Jagadish, “A Compression Technique to Materialize Transitive Closure”, In ACM Transactions on Database Systems, Vol. 15, No.4, pp, 559-598, December 1990. [54] H. Jagadish, “Spatial Search with Polyhedra”, Proc. IEEE 6th Int ’1 Conf. Data engineering, pp. 311-319, 1990. [55] H. Jagadish, “Linear Clustering of Objects with Multiple Attributes”, Proc. ACM-SICMOD 1990 Intl. Conf. on Management of Data, pp. 332-342, 1990. [56] H. Jagadish, “On Indexing Line Segments”, Proc. ACM- VLDB 1990 Intl. Conf. on Very Large Data Bases, pp. 614-625, 1990. [57] B. Jiang, “A Suitable Algorithm for Computing Partial Transitive Closures in Databases”, In Proc. IEEE 6th Int ’1 Conf. Data engineering, pp. 264-271, 1990. [58] S. Jung and S. Pramanik, “An Efficient Representation of Distributed Fragments of Recursive Relations” In Proceedings of the Fourth International Conference on Database Systems for Advanced Applications, pp. 394-407, 1995. [59] S. Jung and S. Pramanik, “HiTi Graph Model of Topographical Road Maps in Navigation Systems”, In Proc. IEEE 12th Int ’1 Conf. on Data Engineering, 1996. [60] S. Jung and S. Pramanik, “Empirical Analysis of Computing Shortest Path for Road Map Queries”, submitted to Information Processing Letters. [61] W. Kao, ”Integration of GPS and Dead Reckoning Navigation Systems” IEEE Int ’l Conf. on Vehicle Navigation and Information Systems(VNIS I VHS), 1991, pp. 635-643 ' [62] C. Kolovson and M. Stonebraker, “Segment Indexes: Dynamic Indexing Tech- niques for Multi-Dimensional Interval Data”, Proc. ACM-SICMOD 1991 Int ’l. Conf. on Management of Data, pp. 138-147 1991. [63] H. Kriegel and B. Seeger, “PLOP-Hashing: A Grid File without Directory”, Proc. IEEE 4th Int ’1 Conf. Data engineering, pp. 369-375, 1988. [64] R. Kung, E. Hanson, Y. Ioannnidis, T. Sellis, L. Shapiro, and M. Stonebraker, “Heuristic Search in Data Base System”, In Proc. Ist Int. Workshop Expert Database Systems, pp. 96—107, Oct. 1984. [65] Y. Kusumi, S. Nishio, and T. Hasegawa, “File Access Level Optimization Us- ing Page Access Graph 011 Recursive Query Evaluation”, Proc. Int ’1 Conf. on Extending Database Technology, 1988. ' 1661 [57] [68] [59] [701 [71] [72] [731 [74] [751 [76] [771 [78] 154 G. Lachapelle, B. Townsend, H. Gehue, and M. Cannon, ”GPS versus Loran- C for Vehicular Navigation in Urban and Mountainous Areas” IEEE VNIS pp. 456-459, 1993. T. Lezniak, R. Lewis, and R. McMillen. “A Dead Reckoning/ Map Correlation System for Automatic Vehicle Tracking”, IEEE Transactions on Vehicular Tech- nology, Vol. 26, No. 1, pp. 47-60, February 1977. B. Liu, S. Choo, S. Lok, S. Leong, S. Lee, F. Poon, and H. Tan, “Intergrating Case-Based Reasoning, Knowledge—Based Approach and Dijkstra Algorithm for Route Finding”, Proc. Tenth Conf. Artificial Intelligence for Applications (CAIA ’94), (1994) 149-155. E D. Lomet and B. Salzberg, ”A Robust Multi-Attribute Search Structure” Proc. IEEE 5th lnt’l Conf. Data engineering, pp. 296-304, 1989. H. Lu, K. Mikikilineni, and J. Richardson, “Design and Evaluation of Algorithms to Compute the Transitive Closure of a Database Relation”, Proc. IEEE 3rd Int ’1 Conf. on Data Engineering, 1987. M. Mannino and L. Shapiro, “Extensions to Query Languages for Graph Traver- sal Problems”, IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 3, pp. 353-363, 1990. T. Mohr and C. Pasche, “A Parallel Shortest Path Algorithm”, Computing, Vol 40, pp. 281-292, 1988. S. Moran and Y. Perl, “The Complexity of Identifying Redundant and Essential Elements”, In Journal of Algorithm, Vol 2, pp. 22—30, 1981. W. Nejdl, S. Ceri, and G. Wiederhold, “Evaluating Recursive Queries in Dis- tributed Databases”, In IEEE Transactions on Knowledge and Data Engineer- ing, Vol 5, pp. 104-121, February 1993. Y. Nakamura, S. Abe, Y. Ohsawa, and M. Sakauchi, ”A Balanced Hierarchical Data Structure for Multidimensional Data with Highly Efficient Dynamic Char- acteristics”, IEEE Trans. on Knowledge and Data Engineering, Vol 5, No. 4, pp. 682-693, August 1993. J. Nievergelt and H. Hinterberger, “The Grid File: An Adaptable, Symmetric Multikey File Structure”, ACM Transactions on Database Systems, Vol 9, No. 1,pp. 38-71, March, 1984. T. Nicholson, “Finding the shortest route between two points in a network”, Computer Journal, Vol. 9, (1966) 275-280. Y. Ohsawa and M. Sakauchi, “A New Tree Type Data Structure with Homoge- neous Nodes Suitable for a Very Large Spatial Database”, Proc. IEEE 6th Int ’l Conf. Data engineering, pp. 296-303, 1990. If...» A “‘1 155 [79] D. Papadias, Y. Theodoridis, T. Sellis, and M. Egenhofer, ”Topological Relations in the World of Minimum Bounding Rectangles: A Study with R-trees”, Proc. ACM-SICMOD 1995 Int ’1. Conf. on Management of Data, pp. 92-103, 1995. [80] J. Pearl, In Heuristics: Intelligent Search Strategies for Computer Problem Solv- ing, Addison Wesley, Reading, Mass., 1984. [81] A. Pier, “Bibliography on Algorithms for Shortest Path, Shortest Spanning Tree, and Related Circuit Routing Problems (1956-1974)”, Networks, Vol. 5, pp. 129- 149, 1975. [82] S. Pramanik, and S. Jung, “Description and Identification of Distributed Frag- ments of Recursive Relations”, accepted for IEEE Transactions on Knowledge and Data Engineering. [83] G. Qadah, L. Henschen, and J. Kim, “Efficient Algorithms for the Instantiated Transitive Closure Queries“ In IEEE Transactions on Software Engineering, Vol 17, No. 3, pp. 296-309, March 1991. [84] Ries and Epstein, “Evaluation of Distribution Criteria for Distributed Database Systems“ In U. C. Berkley, Tech. Report UCB/ERL M78/22, May 1978. [85] D. ROper and G. Endo, “Advanced Traffic Management in California”, IEEE Transactions on Vehicular Technology, Vol. 40, No. 1, pp. 152-158, February 1991. [86] A. Rosenthal, S. Heiler, U. Dayal, and F. Manola, “Traversal Recursion: A practical approach to supporting recursive applications” In Proc. IEEE 3rd Int ’l Conf. Data Engineering, pp. 580-590, 1987. [87] N. Roussopoulos and D. Leifker, ”Direct Spatial Search on Pictorial Databases Using Packed R-trees”, Proc. ACM-SICMOD 1985 Int ’1. Conf. on Management of Data, pp. 17-31, May 1985. [88] H. Samet, Applications of spatial data structures : computer graphics, image processing, and CIS, Addison-Wesley series in computer science, 1990 [89] J. Shapiro, J. Waxman, and D. Nir, “Level Graphs and Approximate Shortest Path Algorithms”, In Networks, Vol. 22, pp. 691-717, 1992. [90] S. Shekhar and T. Yang, “Motion in a Geographical Database System”, Proc. 9nd Symp. Design and Implementation of Large Spatial Databases, 1991. [91] S. Shekhar, A. Kohli, and M. Coyle, “Path Computation Algorithms for Ad- vanced Traveller Information System (ATIS)”, In Proc. IEEE 9th Int ’1 Conf. Data engineering, pp. 31-39, 1993. [92] S. Shekhar and D. Liu, “Genesis and Advanced Traveler Information System (ATIS): Killer Application for Mobile Computing?”, NSF MOBIDATA Work- shop on Mobile and Wireless Information System, Nov. 1994. 156 [93] S. Shekhar and D. Liu, “CCAM: A Connectivity-Clustered Access Method for Aggregate Queries on Transportation Networks: A Summary of Results”, IEEE 11th Int ’1 Conf. Data engineering, pp. 410—419, 1995. [94] S. Shekhar and D. Liu, “A Similarity Graph Based Approach to Declustering Problems and its Application towards Parallelizing Grid Files”, In IEEE 11th Int ’1 Conf. on Data engineering, 1995. [95] M. Shibata and Y. Fujita, “Current Status and Future Plans for Digital Map Databases in Japan”, IEEE Int’l Conf. on Vehicle Navigation and Information Systems(VNIS 1 VHS), pp. 29—33, 1993. [96] H. Six and P. Wildmayer, “Spatial Searching in Geometric Databases”, Proc. IEEE 4th Int ’1 Conf. on Data Engineering, pp. 496-503, 1988. [97] H. Sodeikat, “EURO-SCOUT is Facing the German 1994 Market”, IEEE Int’l Conf. on Vehicle Navigation and Information Systems(VNIS I VHS), pp. 551-556, 1994. [98] R. Tomkewitsch, “Dynamic Route Guidance and Interactive Transport Manage- ment with ALI-SCOUT” IEEE Transactions on Vehicular Technology, Vol. 40, No. 1, pp. 45—50, February 1991. [99] 1. Toroslu and G. Qadah, “The Efficient Computation of Strong Partial Transitive-Closures”, In Proc. IEEE 9th Int ’l Conf. Data Engineering, pp. 530- 537, 1993. [100] H. Tsuji, H. Maeda, A. Shibata, and F. Morisue, ”Evaluation of Location Sys- tem Combining a GPS Receiver with Inertial Sensor”, IEEE Int ’1 Conf. on Ve- hicle Navigation and Information Systems{VNIS I VHS), 1991, pp. 645-649 [101] P. Valduriez and H. Boral, “Evaluation of Recursive Queries Using Join In- dices”, Int ’1 Conf. on Expert Database Systems, 1986. [102] P. Valduriez and S. Khoshafian, “Parallel Evaluation of the Transitive Closure of a Database Relation”, In Intl. Journal of Parallel Programming, pp 19-42, 1988. [103] P. Valduriez and S. Khoshafian, “Transitive Closure of Transitively Closed Re- lations”, Int ’l Con. on Expert Database Systems, Benjamin Cumming, 1989. [104] D. Vineyard, S. Jung, and S. Pramanik, “A Survey of Database Problems in IVHS”, submitted to IEEE Transactions on Knowledge and Data Engineering. [105] H. Wang and B. Zhang, “Route Planning and Navigation System for an Au- tonomous Land Vehicle”, IEEE Int ’1 Conf. on Vehicle Navigation and Informa- tion Systems(VNIS IVHS), pp. 135-141, 1992. 157 [106] R. Weiland, ”Standards for Navigable Databases: A Progress Report” IEEE Int ’1 Conf. on Vehicle Navigation and Information Systems(VNIS I VHS), pp. 185-192, 1991. [107] T. Yang, S. Shekhar, B. Hamidzadeh, and P. Hancock, “Path Planning and Evaluation in IVHS Databases”, IEEE Int’l Conf. on Vehicle Navigation and Information Systems(VNlS I VHS), pp. 283-290, 1991 [108] F. Zijderhand and J. Biesterbos, “Functions and Applications of SOCRATES: A Dynamic In-Car Navigation System with Cellular-Radio Based Bi-directional Communication Facility”, IEEE Int ’l Conf. on Vehicle Navigation and Informa- tion Systems(VNIS I VHS), pp. 543-546, 1994. HICHIGQN STATE UNIV. LIBRARIES IIIII[IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 31293014107365