LIBRARY, Michigan State University

This is to certify that the dissertation entitled

OPTIMIZING THE COST OF RELATIONAL JOIN QUERIES

presented by

Farshad Fotouhi

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

Major professor
4-27-88

OPTIMIZING THE COST OF RELATIONAL JOIN QUERIES

By

Farshad Fotouhi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1988

ABSTRACT

OPTIMIZING THE COST OF RELATIONAL JOIN QUERIES

By Farshad Fotouhi

The join operation is one of the most time consuming operations in a relational database system because it requires a large amount of cross referencing between tuples of different relations. Therefore, the efficiency of the join operation has a deterministic effect on the system performance. The best method for performing join depends, in general, on the access methods available, the parameters of the relations involved, and the context in which the query is presented. The objective of this thesis is to determine optimal strategies for performing join for a given set of constraints and assumptions. Here, the theoretical lower bound on the number of disk I/Os is achieved. Both join-only queries and queries involving restrictions, projections and join are considered.

Here, the existing join algorithms are classified into Relation-Scan, Index-Scan, and Hybrid classes. This classification is based on the availability and use of indices on the join attribute values. This research will show that each class of algorithms performs best for a range of parameter values. It is shown that the relation-scan class of algorithms performs best when all or most of the tuples of the joining relations participate in the join. For the index-scan class of algorithms, several graph models are proposed in order to show that the optimization problem for these algorithms is NP-hard. Therefore, a heuristic algorithm with linear time complexity is given. For the hybrid class, several algorithms which are based on preprocessing a new auxiliary data structure, called the Partial-Relations, are proposed. It is shown that for a wide range of parameter values the proposed algorithms perform better than the best available algorithms of the other two classes.

To my parents, Mahmmoud Fotouhi and Azarmidokht Shazad

ACKNOWLEDGEMENTS

I wish to express my appreciation to Professor Sakti Pramanik for his valuable assistance and continual support in every phase of the preparation of this work. I also wish to thank my committee members, including Professor Erickson, for their helpful comments on the contents of this work. Professor Esfahanian provided many useful suggestions and participated in many of the discussions which helped to develop some of the theoretical results in Chapter 3 of this document. Finally, I want to thank my family, without whose love and inspiration this would not have been possible.

TABLE OF CONTENTS

List of Tables
List of Figures
1. Introduction and Problem Statement
1.1. Relational Database Systems
1.2. Operations on a Database
1.3. Access Methods
1.3.1. Indexing
1.3.2. Hashing
1.4. Motivation
1.5. Problem Statement
2. A Classification Scheme for Join Algorithms
2.1. Introduction
2.2. Relation-Scan Class of Algorithms
2.2.1. Nested-Loop Join Algorithm
2.2.2. Sort-Merge Join Algorithm
2.2.3. Hash-Based Join Algorithms
2.2.3.1. Simple-Hash Join Algorithm
2.2.3.2. GRACE-Hash Join Algorithm
2.2.3.3. Hybrid-Hash Join Algorithm
2.2.3.4. Join Algorithms Based on Fragmentation Technique
2.3. Index-Scan Class of Algorithms
2.4. Hybrid Class of Algorithms
3. Optimal Access Sequence for the Index-Scan Class of Join Algorithms
3.1. Introduction
3.2. The Graph Models
3.2.1. Page Connectivity Model
3.2.2. Block Connectivity Model
3.2.3. Tuple Connectivity Model
3.3. The OBAS-problem and the OPAST-problem are NP-hard
3.4. Complexity of Computing the Least Upper Bound on the Buffer Size
3.5. A Heuristic Algorithm for the Page Connectivity Model
3.6. Determining an Optimal Page Access Sequence
3.7. Cost Models and Performance Comparisons
3.7.1. Cost of the P Algorithm
3.7.2. Cost of the Fragmentation Technique
3.7.3. Cost of the Sort-Merge Algorithm
3.7.4. Cost Comparisons
4. Join and Semijoin Algorithms Based on the Partial-Relation Scheme
4.1. Introduction
4.2. An Access Path Based on a Partial-Relation Scheme
4.2.1. Description of Partial-Relation Schemes
4.2.2. Partial-Relation Scheme vs. Indexing and Hashing
4.3. Algorithms Based on the Partial-Relation Schemes
4.3.1. Algorithm PRJ of the Index-Scan Class
4.3.2. Algorithm PRS of the Hybrid Class
4.3.3. Cost Model
4.3.3.1. Cost Expression for Algorithm PRJ
4.3.3.2. Cost Expression for Algorithm PRS
4.4. Performance Evaluation
4.4.1. Comparisons with the Sort-Merge Join Algorithm
4.4.1.1. Interpretation of Results
4.4.2. Comparisons with the Hybrid-Hash Algorithm for Join-Only Queries
5. Conclusions and Future Study
5.1. Conclusions
5.2. Future Study
5.2.1. M-Way Join Operation
5.2.2. Recursive Queries
Appendices
A. Cost Comparisons of Creating an Index vs. Creating a PR
B. Computing the Number of Unique Attribute Values in a PR
C. I/O Cost of the Sort-Merge Algorithm Under Condition A
D. Execution Time of the Hybrid-Hash, SM, PRS and MPRS Algorithms
References

LIST OF TABLES

3.1. The Difference Between the Graph Models
3.2. The OPAS for the Graph of Figure 3.2
3.3. The OBAS for the Graph of Figure 3.3
3.4. The OPAST for the Graph of Figure 3.4
3.5. Value of ξ_i for the Graph of Figure 3.6
3.6. Trace of the Heuristic Algorithm for the Graph of Figure 3.2
A.1. Cost Comparisons of Creating a B-tree vs. a PR

LIST OF FIGURES

1.1. A Simplified Relational Database for Keeping Inventories
1.2. Effect of Projecting on P# of Relation SUPPLY
1.3. An Example of a Selection Operation
1.4. An Example of a Natural Join
3.1. An Example of a Page Connectivity Graph
3.2. A Page Connectivity Graph
3.3. An Example of a Block Connectivity Graph H(B,E)
3.4. An Example of a Tuple Connectivity Graph
3.5. The Cases for when the Buffer Size for A'_1 A'_2 ... A'_n is Computed
3.6. A Graph G(V,E)
3.7. A Possible Cut of the Graph of Figure 3.2
3.8. Average Buffer Size for the Heuristic Algorithm
3.9. Trace of Algorithm A for the Graph of Figure 3.2
3.10. Average Buffer Size for the Heuristic and the A Algorithms
3.11. Comparisons of the Sort-Merge, Fragmentation, and P Algorithms
4.1. An Example of a Partial-Relation
4.2. Number of Characters Selected vs. Fraction of False Tuples
4.3. Performance of Algorithm PRS when PRs Contain False Tuples
4.4. Comparisons Under Condition A for Different Restriction Factors
4.5. Comparisons Under Condition A
4.6. Comparisons Under Condition B
4.7. Comparisons Under Condition C
4.8. Comparisons Under Condition B for Different PR Sizes
4.9. Comparisons of Algorithms PRS, MPRS, SM and Hybrid-Hash

CHAPTER 1

INTRODUCTION AND PROBLEM STATEMENT

1.1. Relational Database Systems

As mass storage prices fall and computers are used extensively throughout businesses and other organizations, vast amounts of valuable information are piling up in electronic repositories. Finding ways to organize this information, and to allow convenient access to it, is steadily becoming more important. The emergence of advanced concepts based on artificial intelligence will make databases and their management even more critical.

Software developers have come up with a variety of database management systems, which accept, organize, store, and retrieve information quickly and efficiently. A database management system (DBMS) consists of a collection of interrelated data and a set of programs to access that data. The collection of related data which contains information about an enterprise is usually referred to as the database.

In order to describe the structure of a database, a data model is created and is administered by a database administrator. A data model is a collection of conceptual tools for describing data, data relationships, data semantics, and data constraints [Kort86]. Database designers can choose among several data models, of which three currently reign as the most popular [Date75]. These models are the network data model, the hierarchical data model and the relational data model. In this research, only the relational data model [Codd70], which is the data model for relational database systems, is considered. A relational database consists of a collection of relations, each of which is assigned a unique name. In this model, data are represented as a table, with each horizontal row representing a tuple or a record and each vertical column representing one of the attributes, or fields, of the tuple. In other words, a relation of arity n (i.e., a relation with n attributes) is a subset of tuples from the cartesian product of the domains D_1 × D_2 × ... × D_n. The cartesian product of two relations R and S, written R × S, of arity k_1 and k_2, respectively, is the set of (k_1 + k_2)-tuples whose first k_1 components are from a tuple in R and last k_2 components are from a tuple in S.

Figure 1.1 shows a relational model of a simplified inventory database. This example was first popularized by Codd [Codd70]. Individual rows in the PART relation correspond to individual types of parts in the inventory. Because rows are required to be nonredundant within a relation, each row can be identified uniquely by a subset of its attributes. For example, in Figure 1.1a S# can be used to identify a row uniquely, while S# and P# must be used together to identify a row of the SUPPLY relation. Such a set of attributes is called the key of the relation. If a subset of the attributes in one relation is the key of the second relation, then such a key is called a foreign key. For example, in Figure 1.1, S# and P# are the keys of the SUPPLIER and PART relations, respectively. Thus, S# and P# in the SUPPLY relation are called foreign keys.

S#   SN        SL
S1   ABC Co.   MI
S2   XY Co.    NY
S3   US Co.    NY

(a) SUPPLIER Relation

P#   PD
P1   CAM
P2   GM

(b) PART Relation

S#   P#   Q
S1   P1   5
S1   P2   6
S3   P1   6

(c) SUPPLY Relation

Figure 1.1. A Simplified Relational Database for Keeping Inventories.
1.2. Operations on a Database

Modeling data in a relational structure lends itself to the development of high-level query languages. Several styles of relational query languages have been developed. One of these is relational algebra. Relational algebra consists of a set of operators which can be applied to relations to create a new relation. These operators can be composed to form expressions. A typical set of relational operators is projection, selection, join, union, intersection and difference. Here, both join-only queries† and queries involving selections, projections and join are considered.

† A query is a statement requesting the retrieval of information.

A projection operation is the elimination of certain fields of all tuples. In many cases, a number of the fields contain information which is irrelevant for the query and these fields are deleted. This may have the effect, however, of producing multiple, identical tuples, since many tuples may be identical in a number of attributes. The relational database model assumes that all the tuples are unique. Thus, in theory, duplicates must be eliminated. Figure 1.2 shows the effect of projecting on P# of the relation SUPPLY. Note that the result is a relation (duplicates of P1 have been removed) with only one attribute.

P#
P1
P2

Figure 1.2. Effect of Projecting on P# of Relation SUPPLY.

The selection operation, sometimes referred to as restriction, is a specification which eliminates some of the tuples of the relation. The specification is a Boolean expression defined over the columns of the relation. The expression may consist of arithmetic comparison operations (i.e., =, ≠, <, >, ≤, ≥) on the value of one or more columns. The result of selection is a new relation with only the rows from the original relation which satisfy the selection condition. Figure 1.3 gives an example of the selection operation on the SUPPLIER relation where SL = "NY".

S#   SN       SL
S2   XY Co.   NY
S3   US Co.   NY

Figure 1.3. An Example of a Selection Operation.

The join operation is one of the most important operations in query processing. Join is a binary operation that allows us to combine the selection and cartesian product into one operation. The θ-join of two relations R and S on columns i and j is denoted by R ⋈_{iθj} S, which is shorthand for σ_{i θ (r+j)}(R × S), where r is the arity of R and the subscripts refer to column positions of the product. Here σ represents the selection and θ is an arithmetic comparison operator. If both R and S have columns that are named the same, then the equijoin of R and S on their common attributes is called natural join and it is denoted by R ⋈ S. Figure 1.4 gives an example of a natural join between the two relations SUPPLY and SUPPLIER of Figure 1.1. As shown in the example, this operation allows a user to navigate through the database. In this research the natural join of two relations on a single domain is considered, and from now on wherever the term join is used we mean natural join.

S#   P#   Q   SN        SL
S1   P1   5   ABC Co.   MI
S1   P2   6   ABC Co.   MI
S3   P1   6   US Co.    NY

Figure 1.4. An Example of a Natural Join.
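To make the three operations concrete, the following sketch replays the examples of Figures 1.2 through 1.4 in Python. It is illustrative only: relations are modeled as lists of dictionaries, and the function names are not part of the thesis.

```python
SUPPLIER = [{"S#": "S1", "SN": "ABC Co.", "SL": "MI"},
            {"S#": "S2", "SN": "XY Co.",  "SL": "NY"},
            {"S#": "S3", "SN": "US Co.",  "SL": "NY"}]

SUPPLY = [{"S#": "S1", "P#": "P1", "Q": 5},
          {"S#": "S1", "P#": "P2", "Q": 6},
          {"S#": "S3", "P#": "P1", "Q": 6}]

def project(relation, attrs):
    # Projection: keep only attrs and eliminate duplicate tuples.
    seen, result = set(), []
    for t in relation:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attrs, key)))
    return result

def select(relation, predicate):
    # Selection (restriction): keep the tuples satisfying the predicate.
    return [t for t in relation if predicate(t)]

def natural_join(r, s):
    # Natural join: equijoin on all attribute names common to r and s.
    common = set(r[0]) & set(s[0])
    return [{**tr, **ts} for tr in r for ts in s
            if all(tr[a] == ts[a] for a in common)]

print(project(SUPPLY, ["P#"]))                      # Figure 1.2
print(select(SUPPLIER, lambda t: t["SL"] == "NY"))  # Figure 1.3
print(natural_join(SUPPLY, SUPPLIER))               # Figure 1.4
```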
1.3. Access Methods

Many queries require only a small number of the tuples to be retrieved from a relation. Therefore, the system is not efficient if it must access all the tuples of the relation to check for the tuples which satisfy the user's request. Ideally, the system should be able to locate those tuples directly. In order to allow this form of access, an additional data structure associated with relations was designed. Here two such structures, namely indexing and hashing, are considered.

1.3.1. Indexing

An index on attribute A of relation R permits rapid access to a single tuple that has the desired value in that attribute. An index consists of pairs whose first component is a value from attribute A and whose second component is the Tuple Identifier (TID) of a tuple having that value. A TID is assumed to give direct access to the tuple, so that at most one page is accessed if a tuple is referenced using a TID. An index is stored in a special way to provide rapid access to it. The model which is widely used in most database systems is the B-tree [Baye72] or a variant of it [Knut73, Come79]. A B-tree index takes the form of a balanced tree in which the path from the root to any leaf of the tree is the same length. Each node in the tree has between ⌈n/2⌉ and n children, where n is fixed for a particular tree. Leaf pages contain (key, TID) pairs in sorted order, and the higher level pages contain pairs that consist of the largest key of the lower-level page with a pointer to that page. These pairs are also sorted. The cost of a search operation is proportional to log N, where N is the number of tuples in the file [Baye72].

The efficiency of an index in query processing depends on whether the relation is clustered or unclustered with respect to the index. Suppose an index I is used to extract the tuples of relation R. If each data page of R is accessed at most once, then R is clustered with respect to the index I. The index I may be termed a clustering index with respect to R. On the other hand, if the data pages of R are referenced in a random, approximately uniformly distributed manner, then R is unclustered with respect to I.

1.3.2. Hashing

One disadvantage of indexing schemes is that an index structure must be traversed in order to locate a datum. Hashing avoids the traversing of an index structure and it gives the best average access time to a single tuple. However, indexing is more efficient for range queries. The basic idea in hashing is to divide the tuples of a relation among buckets, each consisting of one or more blocks of storage. The address of a tuple is then determined by computing a function, called the hash function, on the search-key value of the desired tuple. Conventional hashing functions such as those discussed by Lum [Lum71, Lum73], Knuth [Knut73], Severance [Seve74] and Knott [Knot75] are useful for static files. However, when the file size is dynamic, either a large amount of secondary storage must be pre-allocated or long overflow chains and frequent rehashing must be tolerated. Starting in the mid-70s several new hashing techniques such as linear hashing [Litw80, Lars85], virtual hashing [Litw76], extendible hashing [Fagi79, Lome83], and dynamic hashing [Lars78] were proposed. In these techniques rehashing is avoided by allowing the storage space to dynamically adjust to the number of tuples actually stored. Most of these techniques, such as extendible hashing, do not allow overflow to occur; thus, the expected number of accesses in these techniques is minimized. Here, wherever the term hashing is used, I mean a conventional hashing function.
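As a small illustration of the bucket idea described above, the sketch below implements a conventional (static) hash file in Python. The bucket count, the choice of S# as the search key, and all names are assumptions made for the example.

```python
NUM_BUCKETS = 8  # fixed in advance: this is what makes the scheme static

def h(key):
    # The hash function: map a search-key value to a bucket address.
    return hash(key) % NUM_BUCKETS

class HashFile:
    def __init__(self):
        # Each bucket stands for one or more blocks of storage.
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def insert(self, t):
        self.buckets[h(t["S#"])].append(t)

    def lookup(self, key):
        # Exactly one bucket is examined; no index structure is traversed.
        return [t for t in self.buckets[h(key)] if t["S#"] == key]

f = HashFile()
f.insert({"S#": "S1", "SN": "ABC Co.", "SL": "MI"})
print(f.lookup("S1"))
```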
1.4. Motivation

The rapid technological advances of the recent decade have made possible the use of database management systems over a wide range of applications. Among the most significant problems in computer science today is the issue of efficient implementation of database management systems to support an array of applications. Certain types of database operations are not widely performed today because of the high cost. One such operation is the join.

The join operation is one of the most important and time-consuming operations in a relational database system. Because it allows the user to navigate through the database, it is an important operation. In addition, many of the techniques used for performing join can be used for other relational operators such as a cross product and aggregate functions. Join is a time-consuming operation because it requires a large amount of cross referencing between tuples of different relations. Therefore, the efficiency of the join operation has a deterministic effect on the system performance.

Because of its importance, the join operation has been a subject of intensive study in the development of relational database systems, and different approaches have been proposed. The objective of this thesis is not to give a join algorithm but to develop principles by which a join can be implemented. The best method depends on the access methods available, the parameters of the relations involved, and the context in which the query is presented.

1.5. Problem Statement

The primary goal of this work is to determine optimal strategies for performing join for a given set of constraints and assumptions. Both join-only queries and queries involving restrictions, projections and join are considered. The optimal strategy requires the least cost among all other strategies. The cost is measured based on the number of disk I/Os when a small size main memory buffer is assumed. When a large size main memory buffer is available, the cost is measured based on the number of disk I/Os as well as the amount of CPU usage.

In this thesis, the existing join algorithms are classified into Relation-Scan, Index-Scan, and Hybrid classes. This classification is based on the availability and use of indices on the join attribute values. This research will show that each class of algorithms is best for a range of parameter values. There has been intensive study of the relation-scan class of algorithms. Therefore, this research concentrates on the last two classes of algorithms. Using graph models for the index-scan class of algorithms I will show that the optimization problem for these algorithms is NP-hard. For the hybrid class, several algorithms which are based on preprocessing a new auxiliary data structure, called the Partial-Relations, are proposed here. The performance of these algorithms will be compared with that of the existing ones. This research will show that for a wide range of parameter values the proposed algorithms perform better than the relation-scan class of algorithms.

The organization of this thesis follows. Chapter 2 presents a classification of various join algorithms proposed in the literature. In chapter 3 several graph models are presented to analyze the performance of the join operation in a paging environment. The graph models are based on the block or page connectivity of the joining relations. In chapter 4 several join algorithms based on the Partial-Relation Scheme are presented. The performances of these algorithms are compared with some of the existing algorithms under various conditions. Chapter 5 contains the conclusions and suggestions for further work.
CHAPTER 2

A CLASSIFICATION SCHEME FOR JOIN ALGORITHMS

2.1. Introduction

The availability of inexpensive large main memories coupled with the demand for faster response time is bringing a new perspective to database technology. Designers of database systems have assumed until recently that databases reside on disk during transaction processing. Thus, substantial performance gains can be achieved by letting a large portion of, or the entire database, reside in main memory. So far two approaches have been proposed for using large amounts of main memory in database systems. The first approach uses very large buffers to improve conventional disk access methods by storing part of the database or part of the indices. DeWitt et al. [DeWi84], Shapiro [Shap86], and Elhardt et al. [Elha84] have taken this approach in their work. The second approach, the memory-resident database approach, is to store the entire database in main memory. Ammann et al. [Amma], Lehman and Carey [Lehm86a, Lehm86b], and Leland and Roome [Lela85] take this approach.

Here a classification of join algorithms under the assumption that the database is disk resident and a small or large main memory buffer is available (first approach) is considered. The second approach (i.e., memory-resident database) is not considered because it requires a redesign of the database management systems. That is, the algorithms and the data structures for query processing, concurrency control and recovery must all be restructured to stress the efficient use of CPU cycles and memory rather than disk accesses and disk storage.

The existing join algorithms may be categorized into three classes. The classification is based on the availability and use of the indices on the join attribute values. As shown in later chapters each class of algorithms is best for a range of parameter values. The first class contains those algorithms which perform join by accessing all the tuples of the joining relations. This class of algorithms is referred to as relation-scan algorithms. This thesis shows that the relation-scan class of algorithms performs better than the algorithms of the other two classes when most of the tuples of the joining relations participate in the join. The second class of algorithms uses indices to access only those tuples of the joining relations which participate in the join. Therefore, as shown in this thesis, the algorithms of this class perform better than the algorithms of the other two classes when a small number of tuples of the joining relations participate in the join. These algorithms are referred to as index-scan algorithms. The third class of algorithms uses the semijoin technique to perform join. The semijoin technique has been widely used in distributed databases and in multiprocessor database machines in order to reduce the communication costs. This class of algorithms is referred to as the hybrid class of algorithms. The next sections describe some of the algorithms proposed for each class.

2.2. Relation-Scan Class of Algorithms

Relation-scan algorithms perform join on all the tuples of the joining relations without any attempt to reduce the size of the relations. Therefore, when only a few tuples of the joining relations participate in the join, we show that the algorithms of this class are more costly than the algorithms of the index-scan and the hybrid classes. This situation occurs if a restriction or another join operation is applied to at least one of the joining relations before the join is performed. In this section some of the proposed algorithms in this class are considered.

2.2.1. Nested-Loop Join Algorithm

The simplest of all join algorithms in terms of implementation is the nested-loop algorithm [Gotl75, Blas77, Kort86]. The nested-loop join algorithm for two relations R and S is as follows:

a) Read R or as much as possible of R into the buffer.
b) For each record in S check if it has the same join values as any record of R stored in the buffer. If they match, join the two records and output the result.
c) Repeat this process for all parts of R.

We see that the complexity of this algorithm is in the order of O(|R| × |S|), where |R| and |S| represent the number of data pages in relations R and S, respectively. However, the nested-loop can be an efficient join algorithm if at least one of the joining relations is small enough to fit in the main memory buffer. This algorithm also takes advantage of different relation sizes, in contrast to the sort-merge join algorithm which is described later.
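The following sketch shows steps (a) through (c) in Python, under the simplifying assumption that each relation is given as a list of pages and each page as a list of tuples; the buffer size and the join attribute name A are illustrative.

```python
def nested_loop_join(R, S, buffer_pages, A="A"):
    # R and S are lists of pages; each page is a list of tuples (dicts).
    result = []
    for start in range(0, len(R), buffer_pages):
        # (a) Read as much of R as fits in the buffer.
        chunk = [r for page in R[start:start + buffer_pages] for r in page]
        # (b) Scan S once against the buffered part of R.
        for page in S:
            for s in page:
                for r in chunk:
                    if r[A] == s[A]:
                        result.append({**r, **s})
        # (c) The outer loop repeats this for all parts of R.
    return result
```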
Since this algorithm has a high degree of parallelism, it has been implemented in multiprocessor database machines [Bitt83, Vald84]. A database machine is a device which implements the operations of the database by means of hardware. The nested-loop algorithm when implemented on a multiprocessor machine with P processors proceeds in the following manner: Distribute the smaller relation (called external) among the P processors. Next, the other relation (called internal) is broadcast page by page to these processors. Each processor joins each page in the memory of the external relation with the entire internal relation. If the external relation cannot fit into the P processors' local memories, the same operation must be performed for each part of the external relation.

2.2.2. Sort-Merge Join Algorithm

Another widely used join algorithm is the sort-merge technique [Blas77, DeWi84]. This algorithm requires initial sorting and multiway merging of both joining relations on their join attributes, and then merging the two sorted relations. The sort process itself can be optimized (see, for instance, Knuth [Knut73]). Initial sorting takes several passes over the relations and requires a considerable amount of CPU time. Sorting does not take advantage of different relation sizes. A detailed description of this algorithm follows.

Assume that both joining relations R and S are sorted in ascending order with respect to their join attribute A. Let us also assume that attribute A has no duplicate values in relation S. In order to join the two relations, two pointers r and s are used over relations R and S, respectively. Each pointer identifies a tuple in a relation. Both pointers are initially positioned on the first tuple of their relations. The merging of the two sorted relations starts by reading a page from each relation and comparing the join attribute values. If r.A = s.A, then r concatenated with s is output, and r is advanced to the next tuple. (r.A represents the value of attribute A in the tuple pointed at by the pointer r.) If r.A > s.A, s is advanced; otherwise, r is advanced. The process is iterated until the end of one of the relations. In the case discussed above there is no looping, and the join I/O cost is exactly |R| + |S|. However, the I/O cost of m-way merge-sorting the two relations prior to merging them is O(|R| × log_{m-1}|R| + |S| × log_{m-1}|S|). Therefore, the complexity of this algorithm is in the order of O(n × log n), where n is the number of the data pages to be sorted.
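A sketch of the merge phase just described follows, assuming both inputs are already sorted on the join attribute A and that A has no duplicate values in S (the case analyzed above). For simplicity the relations are given as flat lists of tuples rather than pages.

```python
def merge_join(R, S, A="A"):
    result = []
    r, s = 0, 0                    # the two pointers over R and S
    while r < len(R) and s < len(S):
        if R[r][A] == S[s][A]:
            result.append({**R[r], **S[s]})  # output r concatenated with s
            r += 1                           # advance r; A is unique in S
        elif R[r][A] > S[s][A]:
            s += 1
        else:
            r += 1
    return result
```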
If attribute A has duplicate values in relation S, then the algorithm can be modified by adding another pointer s1 on S, positioned on the first tuple of the current run of tuples having the same value for attribute A. Suppose that r is to be advanced; let t represent the new tuple identified by the pointer. If t.A = s.A, the pointer s is set to the position of the s1 pointer. In this way a loop occurs on the run of equal values of attribute A in S. Note that if the join attribute of relation S can have replicated values, then the cost of merging scans is not guaranteed to be linear.

2.2.3. Hash-Based Join Algorithms

The main idea for hash-based algorithms is to partition the joining relations into smaller subrelations where the nested-loop algorithm or any other algorithm can be employed. This is an example of a divide and conquer technique. Bratbergsengen [Brat84] presented join algorithms based on hashing. The performances of the proposed algorithms are compared against the nested-loop and the sort-merge. He has shown that for small relations the nested-loop method is preferable to sort-merge and hashing algorithms. However, for large buffers, partitioning the relations based on hashing is always better than the sort-merge join algorithm. DeWitt et al. [DeWi84] and Shapiro [Shap86] have also proposed three hash-based join algorithms, namely the simple-hash, GRACE-hash and the hybrid-hash. They analyzed and compared the performances of these algorithms with the sort-merge algorithm. Their results were the same as Bratbergsengen's [Brat84]. They have also shown that the hybrid-hash join algorithm performs better than the simple-hash and GRACE-hash over a wide range of parameter values. In the next three sections each of these algorithms and a recent hashing method proposed by Sacco [Sacc86a] will be described briefly.

2.2.3.1. Simple-Hash Join Algorithm

Let us assume that the hash table for the smaller of the two relations R and S fits in the main memory. Then the simple-hash join algorithm proceeds as follows:

1) Use a hashing function h to build a hash table for R in memory. Assume that R is smaller than S. Scan S and for each tuple t of S, repeat the following steps.
2) Compute the hash value of the tuple t using the hashing function h.
3) Use the hashed value for t obtained in step 2 to search the hash table of R for a match. In case of a match output the result.

If the hash table for R will not fit in memory, the simple-hash join algorithm proceeds as follows: Fill the memory with a hash table for part of R. Scan S against that hash table and if there is a match, output the resulting tuple. Build a hash table for another part of R in the memory and scan the remainder of S against it. Repeat this process until a hash table for all parts of relation R has been created.
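A sketch of the in-memory case (steps 1 through 3 above) follows; the dictionary plays the role of the hash table for R, and the attribute name A is an assumption of the example.

```python
def simple_hash_join(R, S, A="A"):
    # Step 1: build a hash table for the smaller relation R in memory.
    table = {}
    for r in R:
        table.setdefault(r[A], []).append(r)
    # Steps 2-3: hash each tuple of S and probe the table for a match.
    result = []
    for s in S:
        for r in table.get(s[A], []):
            result.append({**r, **s})   # match found: output the result
    return result
```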
2.2.3.2. GRACE-Hash Join Algorithm

The GRACE-hash join algorithm has been implemented on the GRACE relational database machine [Kits83, Moto83, Kits84]. The GRACE architecture is one of the candidate relational database machines for the Japanese fifth-generation computer project sponsored by ICOT†. The GRACE database machine is organized for join-intensive applications. It utilizes the concepts of associative disks with filtering, hash-based data partitioning, and pipeline sort-merge in its processing modules.

† Institute for New Generation Computer Technology.

The basic premise of the architecture relies on the fact that the tuples of relations can be distributed into hash buckets determined by the range of values of the attribute to be hashed, which is the join attribute. The join complexity would then be simplified from the O(|R| × |S|) complexity of the brute force nested-loop join algorithm if we hashed the join attributes of these relations so that

|R| = Σ_{i=1}^{s} n_i  and  |S| = Σ_{i=1}^{s} m_i,

where n_i and m_i are the sizes of the ith buckets of the respective relations and s is the number of buckets in each relation. This is because we would compare only those tuples of relations which occupy compatible buckets (i.e., for i ≠ j, we do not process n_i with m_j). Accordingly, the overall complexity would be

O(Σ_{i=1}^{s} n_i × m_i).

The GRACE-hash join algorithm is executed in two phases as outlined in DeWitt et al. [DeWi84] and Shapiro [Shap86]. In the first phase the relations R and S are partitioned into |M| sets, where |M| is the number of pages in the main memory. The algorithm uses one page of memory as an output buffer for each of the |M| sets in the partition of R and S. The partitioning of the relations is done by using a hashing function. In the second phase the join is performed using a hardware sorter to execute a sort-merge algorithm on each pair of sets in the partition. To provide a fair comparison between different algorithms, DeWitt et al. [DeWi84] have used the hashing technique to perform join during the second phase.

2.2.3.3. Hybrid-Hash Join Algorithm

In this algorithm a large main memory buffer is used to minimize disk I/O. In contrast, the GRACE-hash join algorithm uses all the main memory as a buffer to partition the relations. The hybrid-hash algorithm [DeWi84, Shap86], however, uses only as many pages (B) as are necessary to partition R into sets that can fit in the memory. The rest of the memory is used for a hash table that is processed at the same time that R and S are being partitioned. The algorithm proceeds as follows:

1) Assign the ith output buffer page to partition R_i, for i = 1,...,B. Scan R and hash each tuple of it using a hashing function h. If the hashed tuple belongs to R_0, place it in memory in the hash table for R_0. Otherwise, place it in its appropriate output buffer page. When this step is finished, we have a hash table for R_0 in memory, and R_1, R_2,...,R_B are on the disk. Note that the hash table for R_0 has |M| - B pages and R_1,...,R_B are of equal size.

2) Assign the ith output buffer page to set S_i, for i = 1,...,B. Scan S, and use the hashing function h to hash every tuple of S. If the hashed tuple belongs to S_0, search the hash table for R_0 in memory for a match. If there is a match, output the result tuple; otherwise ignore the tuple. On the other hand, if the hashed tuple does not belong to S_0, then place the tuple in the appropriate output buffer page, S_i for some i > 0. Now R_1,...,R_B and S_1,...,S_B are on disk. Repeat steps (3) and (4) for i = 1,...,B.

3) Read R_i and build a hash table for it in the memory.

4) Scan the partition S_i and hash each tuple of it. Search the hash table of R_i, which is in memory, for a match. If there is a match, output the result. Otherwise, toss the S tuple.
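The sketch below compresses steps 1 through 4 into Python. Partition 0 is joined on the fly while R and S are being partitioned; the remaining B partition pairs, modeled here as in-memory lists standing in for disk files, are joined afterwards. B, the partitioning function and the attribute name A are illustrative assumptions.

```python
def hybrid_hash_join(R, S, B, A="A"):
    def part(t):
        # h: bucket 0 is processed in memory; buckets 1..B go to "disk".
        return hash(t[A]) % (B + 1)

    result, table = [], {}
    R_parts = [[] for _ in range(B)]        # R_1,...,R_B ("on disk")
    for r in R:                             # step 1
        i = part(r)
        if i == 0:
            table.setdefault(r[A], []).append(r)   # hash table for R_0
        else:
            R_parts[i - 1].append(r)
    S_parts = [[] for _ in range(B)]        # S_1,...,S_B ("on disk")
    for s in S:                             # step 2
        i = part(s)
        if i == 0:
            result += [{**r, **s} for r in table.get(s[A], [])]
        else:
            S_parts[i - 1].append(s)
    for R_i, S_i in zip(R_parts, S_parts):  # steps 3 and 4, for i = 1,...,B
        table = {}
        for r in R_i:
            table.setdefault(r[A], []).append(r)
        for s in S_i:
            result += [{**r, **s} for r in table.get(s[A], [])]
    return result
```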
As mentioned earlier, DeWitt et al. [DeWi84] and Shapiro [Shap86] have shown that the hybrid-hash and simple-hash outperform the sort-merge and the GRACE-hash join algorithms under the assumption that a large main memory buffer is available. Their comparisons were based on the execution time of the algorithms (i.e., I/O time and CPU usage).

2.2.3.4. Join Algorithms Based on Fragmentation Technique

Fragmentation [Sacc86a] joins two relations by recursively partitioning two unordered relations into fragments until each fragment of the smaller relation is no larger than |M| - 1. The basic idea of fragmentation is to partition the joining relations R and S into n disjoint fragments in such a way that

1) for 1 ≤ i ≤ n, the ith fragments of both relations contain tuples whose join attribute values are drawn from a set T_i of values (which identifies a bucket), and
2) for any given i, j (i ≠ j), S_i and S_j are disjoint.

The partitioning phase logically partitions the join attribute range into (|M| - 1)^j disjoint buckets by using a hashing function. Each bucket identifies a pair of fragments, one for each relation. At each partitioning step a bucket is partitioned into |M| - 1 subbuckets. The problem of joining two relations then reduces to the problem of joining (|M| - 1)^j independent pairs of fragments of the two relations. Since each fragment of the smaller relation fits in the buffer, the final join phase has a linear cost. It has been shown in Sacco [Sacc86a] that the cost of fragmentation is O((|R| + |S|) × log_{|M|-1}|S|), with the assumption that S is a smaller relation than R.

Sacco [Sacc86a] has compared the performance of the fragmentation technique against the sort-merge join algorithm for various relation sizes. He has shown that if at least one of the joining relations (i.e., the larger relation) is already sorted then the sort-merge is better than the fragmentation. However, if the larger relation or both relations need to be sorted then the fragmentation technique outperforms the sort-merge.

2.3. Index-Scan Class of Algorithms

The algorithms of this class attempt to reduce the size of the joining relations before performing the join. The reduction is done by using the indices on the join attributes. Therefore, these types of algorithms require the existence of the indices on either one or both relations on their join attributes in order to be operational. If indices exist on both relations then the join is first performed on the indices, and a set of pointer pairs to the tuples (tuple ids) that will be joined is obtained. A pointer consists of a page number and an offset within the page. Using the pointer pairs, the selected pages of the relations are fetched and the appropriate tuples within these pages are then joined. When only the index on one of the two relations, say S, exists, then the algorithm proceeds as follows. Repeat the steps below for all the tuples of the smaller relation R:

1) Scan relation R and for each tuple of R fetch the join attribute value, r.
2) Scan the index on the join attribute of relation S, with r as the search key. If r exists then fetch the corresponding tuples in S and join them with the tuple of R.

For algorithms of this class Fotouhi and Pramanik [Foto88] have shown that if more than 50% of the tuples of the joining relations participate in the join, then the selected pages of the relations may be reaccessed several times.
Reaccessing of the pages can be more costly than accessing the entire tuples of the joining relations as in the algorithms of the first class. Therefore, the algorithms of this class are good only when a prior selection or join is performed on the joining relations.

Blasgen and Eswaran [Blas76, Blas77] have studied the properties of various join algorithms in considerable detail. In order to be operational some of their proposed algorithms require the existence of indices on the join attributes. They have considered queries which contain the join operation as well as selections and projections. They examined the cases where various indices were present, where the restriction had various effects, and for various sizes of the relations. The cost of the algorithms is computed in terms of disk operations, and it is shown that many different algorithms are best under particular sets of conditions. Goodman [Good80] has also proposed a few modifications of these algorithms. It has been shown in [Blas77] that in the absence of indices, the sort-merge join algorithm performs better than the other join algorithms. However, if clustering indices on the join attributes exist, then it is more cost effective to use the indices to perform the join.

This research developed several graph models for some of the index-scan algorithms to determine an ordered page or block access sequence which requires the smallest size main memory buffer. The graph is formed from the page or block connectivity of the joining relations. In these models the join-participating pages of the relations are accessed only once. I will show that the problem of determining an ordered list of pages or blocks which requires a minimum size buffer is NP-hard. Therefore, a heuristic algorithm which determines an ordered page access sequence requiring a near optimal buffer size is presented. The graph models are discussed in the next chapter.

There have also been join algorithms which use a precompiled join index [Miss82, Vald85]. A precompiled join index is an index which has been constructed on the domain of the join attributes. Therefore, a join of two relations, R and S, in these algorithms is done only by scanning the index and finding which tuple of R is joined with a tuple or a set of tuples of S. This is a very fast method. However, its maintenance cost is very high and this type of index may be very large. Such join algorithms are not considered here.

2.4. Hybrid Class of Algorithms

For these types of algorithms the join of relations is replaced by the join of their semijoins [Ullm82]. The semijoin of relations R and S, written R ⋉ S, is π_R(R ⋈ S), where π denotes the projection operation. Note that, in general, R ⋉ S ≠ S ⋉ R. The expression R ⋉ S selects those tuples of R that participate in the join of R and S. To perform join using semijoin, the algorithm proceeds as follows:

1) Scan relation R and save its join attribute values in a data structure A.
2) Scan relation S and compare the join value of each tuple with the ones in A. In case of a match save the tuple in a temporary file S' and store its join value in a data structure B.
3) Scan relation R again and compare the join value of each tuple with the ones in B. In case of a match save the tuple in a temporary file R'.
4) Join the two files R' and S' using any of the existing algorithms proposed for the first class of join algorithms.
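The four steps above translate directly into the following sketch. The data structures A and B of the algorithm become Python sets; step 4 reuses a hash join as its relation-scan subroutine. The attribute name is an assumption of the example.

```python
def semijoin_based_join(R, S, A="A"):
    values_R = {r[A] for r in R}                  # step 1: data structure A
    S_prime = [s for s in S if s[A] in values_R]  # step 2: temporary file S'
    values_S = {s[A] for s in S_prime}            #         data structure B
    R_prime = [r for r in R if r[A] in values_S]  # step 3: temporary file R'
    # Step 4: join the reduced files R' and S' (here with a hash join).
    table = {}
    for s in S_prime:
        table.setdefault(s[A], []).append(s)
    return [{**r, **s} for r in R_prime for s in table.get(r[A], [])]
```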
In this thesis, several semijoin algorithms which are based on preprocessing the Partial-Relation scheme are presented. A Partial-Relation scheme is a projection of the join and restriction attributes of the queries. Furthermore, the values of an attribute in a Partial-Relation scheme are obtained by applying a transformation function to the original values. The objective of this transformation function is to reduce the size of the attribute values. Partial-Relations are suitable for processing join-only queries as well as queries involving selections, projections and joins. The Partial-Relation scheme is described in chapter 4.

Babb [Babb79] has implemented semijoin operations using boolean arrays. The idea is to hash the join attribute and then use the result as an address into the boolean array. The presence of a marked bit in the array means that matching tuples exist. The value of the boolean arrays is the elimination of most of the data not needed in the result.
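A sketch of the boolean-array idea follows: the join attribute value is hashed and the result used as an address into a bit array, so a marked bit signals that a matching tuple may exist. Hash collisions can mark a bit for more than one value, so the filter can pass tuples that the final join later discards; the array size and names below are assumptions.

```python
ARRAY_SIZE = 1 << 16

def build_boolean_array(R, A="A"):
    bits = [False] * ARRAY_SIZE
    for r in R:
        bits[hash(r[A]) % ARRAY_SIZE] = True   # mark the hashed address
    return bits

def filter_with_array(S, bits, A="A"):
    # Eliminates most of the tuples of S not needed in the result.
    return [s for s in S if bits[hash(s[A]) % ARRAY_SIZE]]
```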
Specialized hardware has also been proposed by Pramanik [Pram86c] to perform the semijoin operation. In order to support join as well as semijoin operations, the method is improved and adapted for multiprocessor database machines [Vald84, Shul84, Rich87].

Many specialized architectures have been proposed for high performance relational database management. These architectures include logic-on-disk machines [Smit75, Schu79, Su79], VLSI-based special purpose processors [Kits83, Shib84], and loosely- and tightly-coupled multiprocessor architectures [DeWi79, Gard81, Hsia83, Tera83, DeWi86]. In this thesis, only join algorithms for the von Neumann architecture are considered. This is because most of the algorithms proposed in the past as well as those discussed here may be modified for efficient implementation on various database machine architectures [Good80, Bitt83, Shul84].

CHAPTER 3

OPTIMAL ACCESS SEQUENCE FOR THE INDEX-SCAN CLASS OF JOIN ALGORITHMS

3.1. Introduction

In order to optimize the performance of the index-scan class of algorithms in a paging environment, this research developed two graph models, namely the block connectivity and the page connectivity models. The term paging environment means that the unit of buffer size is a page. These models will help to show that the index-scan class of algorithms performs better than the other two classes of algorithms when the join factor† is low. The performance measurement is based on the number of pages of main memory which are needed to guarantee one access per block or page. The block connectivity model assumes that a block is a unit of secondary storage access, and the page connectivity model assumes a page is a unit of secondary storage access.

† The join factor for a relation R is defined to be the ratio of the number of tuples of R participating in the join operation over the number of tuples in R [Blas77].

Pramanik and Ittner [Pram85b] have considered a similar problem; however, their model is based on saving only the tuples within pages that participate in the join. They have shown that the problem of determining the least upper bound on the size of the buffer, when only the joining tuples are saved, is NP-hard. The least upper bound on the buffer size implies that there is at least one page access sequence that needs this much buffer, but no page access sequence needs any more than this. Merrett, et al. [Merr81] have considered the page scheduling which requires the least page-swapping counts. They have shown that this problem can be represented as a special case of the Hamiltonian path problem. Thus, it is shown to be NP-complete. Two sufficient conditions for the existence of the optimal solutions were shown, which are based on the Hamiltonian path condition and the Euler path condition. Using these conditions, Merrett, et al. have presented heuristic procedures for near optimum solutions. Sacco and Schkolnick [Sacc86b] have developed a general model of buffer management, called the hot set model, which can be used to minimize the page fault rate for different buffer sizes.

Here, a model in which the entire page is saved in the buffer is considered, as in Merrett, et al. [Merr81], to avoid reaccesses, thus guaranteeing only one access per page. The objective here is to determine an ordered list of blocks (or pages) which requires the smallest size buffer among all possible ordered lists of blocks (or pages). We refer to an ordered list of blocks (or pages) as a block (or page) access sequence. The following assumptions are made in the analysis:

1) A join of only two relations is considered. Note that an m-way join operation (m > 2) can be decomposed into a series of binary joins.
2) The database is disk resident. Therefore, the selected pages of the relations need to be saved in the main memory buffer before being processed.
3) A data page may contain tuples of more than one relation.
4) A block may contain the data pages of more than one relation.

The remainder of this chapter is organized as follows: in Section 3.2 the graph models are presented. In Section 3.3 the problem of determining a block access sequence which requires the least amount of buffer will be shown to be NP-hard. In Section 3.4 the problem of computing the least upper bound on the main memory buffer size for a given set of joining pages will be considered and will be shown to be an NP-hard problem. In Section 3.5 a heuristic procedure is given. In Section 3.6 the performance of the heuristic is compared against the optimal page access sequence. Finally, in Section 3.7 the performances of the fragmentation technique, the sort-merge join algorithm and an algorithm based on the graph model are compared.

3.2. The Graph Models

The index-scan algorithms use the indices on the join-participating attributes of the relations to determine a list of pointer pairs to the tuples that are to be joined. Thus, from each pointer pair one can determine the corresponding block pairs or page pairs that need to be accessed. Here, three types of graph models are developed from these pairs. These graph models are the page connectivity, the block connectivity and the tuple connectivity [Pram85b]. These models differ from each other by the unit of secondary storage access and the unit of buffer storage. Table 3.1 summarizes the differences between the three models.

Table 3.1. The Difference Between the Graph Models.

Graph Model                 Unit of Disk Access    Unit of Buffer Storage
Page Connectivity Model     Page                   Page
Block Connectivity Model    Block                  Page
Tuple Connectivity Model    Page                   Tuple

3.2.1. Page Connectivity Model

For this model the assumption is that a pointer consists of a page number and an offset within the page. Here, the connectivity of all the pages in a join operation is represented by the page connectivity graph. Two nodes a and e of the page connectivity graph have an edge (a,e) between them if a tuple in the page corresponding to a joins with a tuple in the page corresponding to e. Figure 3.1 gives an example of a page connectivity graph.

Figure 3.1. An Example of a Page Connectivity Graph.

A component of a graph G is defined to be the maximal connected subgraph of G. For example, in Figure 3.1 there are three components G1, G2 and G3. Components are disjoint; that is, only the tuples within one component may need to be joined, independent of the tuples in other components.
As a result, in order to avoid reaccessing the pages of the joining relations one needs to save, at most, only the pages of a single component in the buffer. It has been shown by Pramanik and Ittner [Pram85b] that as the selectivity factor† increases, the number of components in the page connectivity graph approaches one very quickly (i.e., the graph becomes a connected graph). Note that this does not imply that as the selectivity factor increases the page connectivity graph becomes a complete graph. A graph is said to be complete only if every node has an edge with every other node.

† The selectivity factor is defined to be the ratio of the number of tuples in the result of a join to the sum of the tuples in the joining relations.

Figure 3.2. A Page Connectivity Graph.

Using the page connectivity graph concept, the problem of determining a page access sequence (PAS) which requires the least amount of buffer can be defined as follows:

Given: A page connectivity graph G(V,E), and a positive integer K ≤ |V|, where V is a non-empty set of nodes, E is the set of edges of the graph G, and |V| represents the number of elements in the set V.

Question: Does there exist a page access sequence which requires a buffer of size K or less?

The page access sequence which requires the least amount of buffer (i.e., the smallest K value) is called the Optimal Page Access Sequence and is denoted by OPAS. Here, the problem of determining such an access sequence is referred to as the OPAS-problem. For example, for the page connectivity graph of Figure 3.2 there are 5! possible page access sequences. The idea here is to look for the page access sequence which requires the least amount of buffer. If the nodes of this graph are fetched in the order abced then the required buffer size is 5. On the other hand, if the pages are fetched in the order acdbe then the required buffer size is 3, as shown in Table 3.2.

Table 3.2. The OPAS for the Graph of Figure 3.2.

Content of the Buffer    Size of the Buffer
a                        1
a,c                      2
a,c,d                    3
c,d,b                    3
d,b,e                    3

This is the smallest buffer size one can expect for this page connectivity graph. Thus, the OPAS for this graph is acdbe. In Section 3.4 I show that the problem of computing the least upper bound for K is NP-hard. In Section 3.5 a heuristic procedure with O(n²) time complexity is given. The performance of this heuristic procedure is compared against the optimal page access sequence, which has the complexity of O(n!), where n is the number of data pages in the page connectivity graph.
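The buffer requirement of a page access sequence can be simulated directly from the page connectivity graph, as the sketch below does. Figure 3.2 itself is not recoverable from this copy, so the edge set used here is a hypothetical one, chosen to be consistent with the frontier sets reported for this graph in Section 3.4; under it the sequence acdbe indeed needs a 3-page buffer, as in Table 3.2.

```python
def buffer_demand(edges, sequence):
    # A fetched page stays buffered until every join involving it is done.
    pending = {frozenset(e) for e in edges}     # joins still to be performed
    live, needed = set(), 0
    for page in sequence:
        live.add(page)                          # fetch the page
        needed = max(needed, len(live))         # buffer occupied at this step
        pending -= {e for e in pending if e <= live}   # join what is possible
        live = {p for p in live if any(p in e for e in pending)}
    return needed

EDGES = [("a", "c"), ("a", "d"), ("c", "b"), ("d", "b"), ("d", "e")]  # hypothetical
print(buffer_demand(EDGES, "acdbe"))   # 3
print(buffer_demand(EDGES, "abced"))   # 4 with this edge set (the text's Figure 3.2 yields 5)
```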
3.2.2. Block Connectivity Model

For this model the assumption is that a pointer consists of a block number and an offset within the block. Therefore, the unit of secondary storage access is a block and only the appropriate pages of a block are saved in the main memory. Two blocks are said to be connected if a data page in one is joined with a data page in another. The connectivity of all blocks in a join operation is represented by a block connectivity graph. Figure 3.3 gives an example of a block connectivity graph. The circles in the figure represent the data pages and the rectangles represent blocks. Note that if one page per block is assumed, then the block connectivity graph becomes a page connectivity graph.

Figure 3.3. An Example of a Block Connectivity Graph H(B,E).

The idea here is to look for a Block Access Sequence (BAS) which requires the least number of pages to be saved in the buffer. Such a sequence is referred to as an Optimal Block Access Sequence (OBAS). The problem of determining such a block access sequence is referred to as the OBAS-problem. For example, in Figure 3.3 the block access sequence ABCDE requires 4 pages to be saved in the buffer while the sequence CADEB requires only 2 pages to be saved. An OBAS for this graph is CADEB. Table 3.3 shows the size of the buffer as blocks are fetched in the sequence CADEB.

Section 3.3 shows that the OBAS-problem is NP-hard. In Section 3.4 the problem of computing the least upper bound on the main memory buffer size for the block connectivity model is also shown to be NP-hard.

Table 3.3. The OBAS for the Graph of Figure 3.3.

Block Accessed    Content of the Buffer    Size of the Buffer
C                 c1                       1
A                 a2, a3                   2
D                 a3                       1
E                 e2                       1
B                 --                       0

3.2.3. Tuple Connectivity Model

The tuple connectivity model was first presented in [Pram85b]. For this model, the assumption is that the unit of secondary storage access is a page as in the page connectivity model. However, the size of the buffer is determined in terms of the number of tuples. The problem associated with this model is how to determine, for a given tuple connectivity graph, an ordered list of pages (PAST) which requires the minimum number of tuples to be saved in the buffer. Such a sequence is denoted as OPAST and the problem of determining such a sequence is referred to as the OPAST-problem. A tuple connectivity graph is an edge-weighted graph, where the weights represent the number of tuples of one page to be joined with the corresponding tuples in another page. This implies that the relations have no duplicate values on the joining domains. Figure 3.4 gives an example of a tuple connectivity graph. The OPAST for this graph is ecdba, which requires a buffer storage of only 5 tuples as shown in Table 3.4.

When the join domains have duplicates, the tuple connectivity graph becomes a directed graph. The arc weights represent the number of tuples from the tail page to be joined with tuples in the head page.

Table 3.4. The OPAST for the Graph of Figure 3.4.

Page Accessed    # of Tuples in the Buffer
e                5
c                5
d                3
b                4
a                0

Figure 3.4. An Example of a Tuple Connectivity Graph.

As mentioned earlier, the problem of computing the least upper bound on the buffer size for this model (regardless of duplicate values on the joining domains) is NP-hard [Pram85b]. The next section shows that the OPAST-problem is also NP-hard.

3.3. The OBAS-problem and the OPAST-problem are NP-hard

In this section I show that a special case of the OBAS-problem is NP-hard. Therefore, the general OBAS-problem is NP-hard. This result is then used to show that the OPAST-problem is also NP-hard.

The special case of the OBAS-problem is defined as the problem of determining the OBAS for a block connectivity graph such that a page within a block of the graph may be joined with at most one other page of another block and no other page in any other block. Here, this problem is referred to as the OBAS1-problem. Figure 3.3 shows an example of such a block connectivity graph. Note that for this block connectivity graph, the degree of a node is the same as the number of pages in the block corresponding to that node. The next theorem shows that the OBAS1-problem is NP-hard.

Theorem 1. The OBAS1-problem is NP-hard.

Proof: I present a polynomial time reduction of a known NP-complete problem, minimum cut linear arrangement, to the OBAS1-problem.
The minimum cut linear arrangement problem is defined as follows: Given a graph G(V,E) and a positive integer K, one has to decide if there exists a one-to-one function f: V → {1,2,...,|V|} such that for all i, 1 ≤ i < |V|,

ξ_i = |{(u,v) ∈ E : f(u) ≤ i < f(v)}| ≤ K,

that is, at most K edges cross the cut between the first i nodes of the arrangement and the remaining nodes.

Given an instance G(V,E) of this problem, construct a block connectivity graph H(B,E) as follows: each node of G becomes a block of H whose label l(·), the number of pages in the block, equals the degree of that node, and H has the same edge set as G (i.e., B ≡ V). Let A'_1 A'_2 ... A'_n be a block access sequence of H and let γ_i, for i > 1, be the number of pages that need to be saved in the buffer when block A'_1 is in the buffer and the blocks A'_2, A'_3,...,A'_i are fetched. Note that this is the required buffer size prior to fetching block A'_{i+1}. Also note that when a block is accessed, all the pages of that block are brought into the buffer, and only those pages of the block that are needed for future joins are kept in the buffer. Suppose two consecutive blocks A'_i and A'_j of a block access sequence are accessed; then the number of pages that need to be saved from these two blocks is

l(A'_i) + l(A'_j) - Y_ij - Y_ji,    (1)

where Y_ij = Y_ji = 1 if (A'_i, A'_j) ∈ E, and otherwise Y_ij = Y_ji = 0. The first two terms in the above formula represent the total number of pages in the blocks A'_i and A'_j, respectively, and the last two terms represent the total number of pages that need to be joined and removed from the buffer. In general, when A'_1 is in the buffer and the next i-1, for i > 1, blocks of the access sequence A'_1 A'_2 ... A'_n are fetched, the value of γ_i can be represented as follows:

γ_i = Σ_{j=1}^{i} l(A'_j) - Σ_{j=1}^{i} Σ_{k=1}^{i} Y_jk,    (2)

where Y_jk = 1 if (A'_j, A'_k) ∈ E, and otherwise Y_jk = 0. Since the label on a node of H represents the degree of the node, one can replace l(A'_j) in (2) with p_j, the degree of the node of G placed at position j. Also, since the edge set E is the same for both graphs G and H, and B ≡ V, then Y_jk = X_jk, the corresponding indicator defined on G. Therefore, γ_i = ξ_i ≤ K for i = 2,...,n. This implies that the required buffer size for the block access sequence A'_1 A'_2 ... A'_n is equal to K.

Now I show that if the block access sequence A'_1 A'_2 ... A'_n is an optimal sequence, then the required buffer size for this sequence is also equal to K. If the buffer size for the sequence A'_1 A'_2 ... A'_n is considered, one of the following three cases may occur:

Case 1) l(A'_1) ≥ 1, l(A'_2) ≥ 1, and (A'_1, A'_2) ∉ E. For this case (refer to Figure 3.5a), when block A'_2 is accessed the size of the buffer is ξ_2 = γ_2 = l(A'_1) + l(A'_2). Therefore, l(A'_1) < ξ_i = γ_i ≤ K for i > 1. This implies that the required buffer size for the sequence A'_1 A'_2 ... A'_n is K.

Case 2) l(A'_1) ≥ 1, l(A'_2) > 1, and (A'_1, A'_2) ∈ E. For this case (refer to Figure 3.5b), when block A'_2 is accessed the size of the buffer is ξ_2 = γ_2 = l(A'_1) + l(A'_2) - 2. Since l(A'_2) has at least a value of 2, the size of the buffer for the sequence is equal to K, where K ≥ ξ_2 = γ_2 ≥ l(A'_1).

Case 3) l(A'_1) ≥ 1, l(A'_2) = 1, and (A'_1, A'_2) ∈ E. For this case (refer to Figure 3.5c), when block A'_2 is accessed the size of the buffer is γ_2 = l(A'_1) + l(A'_2) - 2 = l(A'_1) - 1, which is less than l(A'_1). If l(A'_1) ≤ K then K is the buffer size for the sequence. However, when l(A'_1) > K then the buffer size for the sequence is l(A'_1). But for this case I show that there exists another block access sequence which requires a buffer of size less than l(A'_1); this implies that A'_1 A'_2 ... A'_n is not an optimal sequence. We obtain a block access sequence which requires less buffer than the sequence A'_1 A'_2 ... A'_n by exchanging only the ordering of the nodes A'_1 and A'_2 in the sequence. For this new sequence, A'_2 A'_1 ... A'_n, the required buffer size is l(A'_1) - 1 (i.e., K = γ_2 = ξ_2 = l(A'_1) - 1).

Therefore, one can determine the sequence in G(V,E) which gives the minimum cut (i.e., the minimum K value) by determining the block access sequence which requires the minimum size buffer in H(B,E). □

Figure 3.5. The Cases for when the Buffer Size for A'_1 A'_2 ... A'_n is Computed.

Following is an example of the minimum cut linear arrangement problem for a given graph and the construction of the block connectivity graph from it. Figure 3.3 shows the block connectivity graph, H(B,E), constructed from the graph, G(V,E), of Figure 3.6. In the graph of Figure 3.6, for K = 4, the minimum cut linear arrangement can be defined by the function f as follows: f(A') = 1, f(B') = 2, f(C') = 3, f(D') = 4, and f(E') = 5. Table 3.5 shows the value of ξ_i for i = 2, 3, 4.

Table 3.5. Value of ξ_i for the Graph of Figure 3.6.

i    ξ_i
2    |{(A',C'), (A',D'), (A',E'), (B',E')}| = 4
3    |{(A',D'), (A',E'), (B',E')}| = 3
4    |{(A',E'), (B',E')}| = 2

Given any ordered sequence of nodes in graph G one can replace every node in the sequence with its corresponding node of graph H. For example, the sequence A'B'C'D'E' of the graph G can be represented by the block access sequence ABCDE in graph H. Note that the K value for the sequence in G is the same as the buffer size for the corresponding sequence in H. For the sequence A'C'D'E'B' in G the value of K is equal to 2. However, the corresponding block access sequence ACDEB requires a buffer of size 3. Therefore, as mentioned in the proof of Theorem 1 (case 3), by exchanging the ordering of A and C in the above sequence we obtain the sequence CADEB, which requires a buffer of size K = 2.

Figure 3.6. A Graph G(V,E).
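The example can be checked mechanically. The edge set of G in Figure 3.6 is recoverable from Table 3.5, and the sketch below recomputes the cut values ξ_i for the arrangement f(A') = 1, ..., f(E') = 5.

```python
EDGES = {("A'", "C'"), ("A'", "D'"), ("A'", "E'"), ("B'", "E'")}

def cut_values(arrangement, edges):
    # xi_i: edges crossing between the first i nodes and the remaining ones.
    for i in range(1, len(arrangement)):
        head = set(arrangement[:i])
        yield i, sum(1 for (u, v) in edges if (u in head) != (v in head))

f_order = ["A'", "B'", "C'", "D'", "E'"]
for i, xi in cut_values(f_order, EDGES):
    print(i, xi)        # i = 2, 3, 4 give 4, 3, 2, matching Table 3.5
```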
Figure 3.5. The Cases for when the Buffer Size for A'_1 A'_2 ... A'_n is Computed. (Panels (a), (b) and (c) correspond to Cases 1, 2 and 3.)

The following is an example of the minimum cut linear arrangement problem for a given graph and the construction of the block connectivity graph from it. Figure 3.3 shows the block connectivity graph, H(B,E), constructed from the graph, G(V,E), of Figure 3.6. In the graph of Figure 3.6, for K = 4, the minimum cut linear arrangement can be defined by the function f as follows: f(A') = 1, f(B') = 2, f(C') = 3, f(D') = 4, and f(E') = 5. Table 3.5 shows the value of xi_i for i = 2, 3, 4.

Table 3.5. Value of xi_i for the Graph of Figure 3.6.

    i    xi_i
    2    |{(A',C'), (A',D'), (A',E'), (B',E')}| = 4
    3    |{(A',D'), (A',E'), (B',E')}| = 3
    4    |{(A',E'), (B',E')}| = 2

Given any ordered sequence of nodes in graph G one can replace every node in the sequence with its corresponding node of graph H. For example, the sequence A'B'C'D'E' of the graph G can be represented by the block access sequence ABCDE in graph H. Note that the K value for the sequence in G is the same as the buffer size for the corresponding sequence in H. For the sequence A'C'D'E'B' in G the value of K is equal to 2. However, the corresponding block access sequence ACDEB requires a buffer of size 3. Therefore, as mentioned in the proof of Theorem 1 (Case 3), by exchanging the ordering of A and C in the above sequence we obtain the sequence CADEB, which requires a buffer of size K = 2.

Figure 3.6. A Graph G(V,E).
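The cut values xi_i of a given arrangement can be checked directly. A small sketch reproducing Table 3.5; the edge list of Figure 3.6 is read off from the table's cut sets:

    # Cut values of a linear arrangement f, as in Table 3.5.
    E = [('A', 'C'), ('A', 'D'), ('A', 'E'), ('B', 'E')]   # edges of Figure 3.6
    f = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5}

    def cut_values(E, f):
        # xi_i = number of edges (u, v) with f(u) <= i < f(v)
        return {i: sum(1 for u, v in E
                       if min(f[u], f[v]) <= i < max(f[u], f[v]))
                for i in range(1, len(f))}

    print(cut_values(E, f))   # {1: 3, 2: 4, 3: 3, 4: 2}: every cut is <= K = 4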
In the next theorem I use the result obtained in Theorem 1 to show that a special case of the OPAST-problem, call it the OPAST1-problem, is also NP-hard. Therefore, the general OPAST-problem is NP-hard. The OPAST1-problem is defined as the problem of determining the OPAST for a tuple connectivity graph such that there is only one tuple in each page of the tuple connectivity graph that needs to be joined with a tuple in another page. Therefore, the weights on the edges of the corresponding tuple connectivity graph are all 1's. For example, consider the graph of Figure 3.6. An edge (A',C') in this graph represents the need for joining a tuple in A' with a tuple in C'. Note that the weights on the edges are all 1's and are not shown in the figure. Here, the case where the values of the joining attributes are unique is considered. Therefore, when page A' in Figure 3.6 joins with pages C', D' and E', there are three tuples with unique join values in A'. Now I show that the OPAST1-problem is NP-hard.

Theorem 2. The OPAST1-problem is NP-hard.

Proof: I present a polynomial time reduction of the OBAS1-problem to the OPAST1-problem. Note that here when I refer to a block connectivity graph I mean the graph which was considered for the OBAS1-problem. Let a block with p join-participating pages in a block connectivity model represent a data page with p join-participating tuples in a tuple connectivity model. The join of two blocks in a block connectivity model then represents the join of two tuples in their corresponding pages of the tuple connectivity model. Therefore, given a block connectivity graph, one can use the above transformation to construct a tuple connectivity graph with the weights of all edges assumed to be 1. Now given a block access sequence in the block connectivity graph, it is easy to see that the required buffer size, in terms of the number of tuples that need to be saved, is the same for the corresponding page access sequence in the tuple connectivity graph. Therefore, one can determine the optimal block access sequence in a block connectivity graph by determining an optimal page access sequence in the corresponding tuple connectivity graph when the edges have weights of 1. □

3.4. Complexity of Computing the Least Upper Bound on the Buffer Size

In this section I show that the problem of computing the least upper bound for the block connectivity model, denoted by LUBB, is NP-hard. Let us assume that the size of a block in a block connectivity graph is one. Then the block connectivity graph is the same as a page connectivity graph. Therefore, the page connectivity model is a special case of the block connectivity model. To show that the LUBB problem is NP-hard, I will first show that the problem of determining the least upper bound for the page connectivity model, denoted by LUBP, is NP-hard. The LUBP problem is defined as follows:

Given: A page connectivity graph G(V,E) and a positive integer K <= |V|.
Question: Does there exist a page access sequence which requires a buffer of size >= K?

Note that by a page access sequence I mean a sequence where a page is accessed only once. I will show that the LUBP problem is NP-hard. Before proving this, the following definitions and propositions are needed.

A cut of the graph G(V,E) is defined to be two disjoint subsets of nodes X and Y, represented by (X,Y), such that X union Y = V. Figure 3.7 shows a possible cut of the page connectivity graph of Figure 3.2, where X = {b,e} and Y = {a,c,d}.

Figure 3.7. A Possible Cut of the Graph of Figure 3.2.

The frontier nodes for a subset of the cut, say X, are defined to be the set {x : (x,y) in E, where x in X and y in Y}. I define the non-frontier nodes for a subset X of the cut to be the nodes of X all of whose adjacent nodes are also in X. Let F_X and NF_X represent the sets of frontier and non-frontier nodes in the subset X of the cut, respectively. Note that F_X union NF_X = X. In Figure 3.7, F_X = {b,e}, NF_X is empty, F_Y = {d,c}, and NF_Y = {a}. A prefix of a page access sequence is any number of leading nodes of the sequence. A possible page access sequence for the page connectivity graph of Figure 3.2 is abcde, and a prefix for this sequence is abc, which has a length of 3.

Proposition 1. Let (X,Y) be a cut of the page connectivity graph G(V,E) and X,Y be a page access sequence which has the nodes in subset X of the cut as a prefix (regardless of ordering) followed by the nodes in subset Y of the cut (regardless of ordering). Then the X,Y page access sequence requires a buffer of size at least |F_X| + 1.

Proof: After fetching the pages of the subset X of the cut, regardless of the ordering, the only pages remaining in the buffer are the pages in the set F_X. In order to remove any one of these pages from the buffer one needs at least a node from the set F_Y. Therefore, the smallest buffer size needed is |F_X| + 1. □

For example, in Figure 3.7, let Y,X be the page access sequence cadbe. When all the nodes in the subset Y of the cut (i.e., c, a, and d) are fetched, the only nodes remaining in the buffer are the nodes in the set F_Y, which are c and d. In order to remove either of the nodes c or d from the buffer one needs at least one node from the set F_X. Therefore, the buffer size is at least 3.
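These definitions translate directly into code. A sketch computing the frontier and non-frontier sets for the cut of Figure 3.7; the graph literal again follows Figure 3.2 as implied by Table 3.6:

    # Frontier and non-frontier nodes of a cut (X, Y).
    G = {'a': {'c', 'd'}, 'b': {'c', 'd', 'e'}, 'c': {'a', 'b', 'd'},
         'd': {'a', 'b', 'c', 'e'}, 'e': {'b', 'd'}}

    def frontier(G, A, B):
        # nodes of A with at least one adjacent node across the cut in B
        return {x for x in A if G[x] & B}

    X, Y = {'b', 'e'}, {'a', 'c', 'd'}
    print(frontier(G, X, Y), X - frontier(G, X, Y))   # F_X = {b,e}, NF_X empty
    print(frontier(G, Y, X), Y - frontier(G, Y, X))   # F_Y = {c,d}, NF_Y = {a}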
Proposition 2. Let a_1 a_2 ... a_n be a page access sequence of the page connectivity graph G(V,E) with n nodes. Also, let S_i, for 1 <= i <= n, be the required buffer size for a prefix of length i of the page access sequence. Then, for every S_i, 1 <= i <= n, there exists a cut (X,Y) of G.

Proof: I need to show that there exists a cut (X,Y) of G such that S_i is the required buffer size when all the nodes of the subset X of the cut are fetched. The cut can be constructed as follows: place the nodes in a prefix of length i of the page access sequence in the subset X of the cut and the remaining nodes of the sequence in the subset Y of the cut. According to Proposition 1 the required buffer size when all the nodes of the subset X of the cut are fetched is |F_X| + 1. □

Corollary. There is a one-to-one relationship between a cut of a page connectivity graph and the buffer size required for a prefix of a page access sequence.

Proposition 3. For a page connectivity graph G(V,E) and a cut (X,Y) of G, when a node in NF_Y (or NF_X) is placed into X (or Y), the size of F_X (or F_Y) increases by 1.

Proof: Here I show that by placing a node from the set NF_Y into X, the size of F_X increases by 1. Let us assume that there is an edge (a,b) in E such that a is in NF_Y and b is in Y. Then placing the node a into the set X causes a to become a frontier node of X. Therefore, |F_X| increases by 1. □

For example, in Figure 3.7, by placing node a of the subset Y into the subset X of the cut, the size of the set F_X increases by 1. In the next two propositions I show that the maximum |F_X| gives the least upper bound on the size of the buffer. For example, in Figure 3.7, if the two nodes a and c from the subset Y are placed in subset X of the cut, then F_X has the maximum size. Therefore, the least upper bound on the size of the buffer for the page connectivity graph of Figure 3.2 is |F_X| + 1 = 4 + 1 = 5.

Proposition 4. Let (X,Y) be a cut of the page connectivity graph G(V,E). Then F_X union F_Y = V when |F_X| is maximum.

Proof: Our assumption is that |F_X| is maximum. I will show the following:

1) NF_Y is empty. If NF_Y were nonempty, then the placement of some of the nodes in NF_Y into the subset X of the cut would increase |F_X| by at least one, which is a contradiction.

2) NF_X is empty. If NF_X were nonempty, then placing all of the nodes in NF_X into the subset Y of the cut may cause one of the following two events to occur:

a) |F_Y| increases and NF_Y is empty. For this case one might have |F_Y| > |F_X|, which requires switching the labeling of the two subsets of the cut.

b) |F_Y| increases and NF_Y is nonempty. In this case I place the nodes in NF_Y into the cut subset X. According to Proposition 3, this increases the size of F_X by at least one, which is a contradiction.

Note that by placing the pages in the set NF_X into the set Y, the size of F_X cannot be decreased. □

Proposition 5. Let G(V,E) be a page connectivity graph. Then the least upper bound on the size of the buffer is max(|F_X|) + 1, where max gives the maximum value of |F_X| over all cuts of G.

Proof: This follows directly from Proposition 1. □

The following theorem shows that the LUBP problem is NP-hard.

Theorem 3. The problem LUBP is NP-hard.

Proof: I present a polynomial time reduction of a known NP-complete problem, the dominating set problem, to the LUBP problem. The dominating set problem is formulated as follows: Given a graph G(V,E) and a positive integer K <= |V|, one has to decide if there exists a dominating set of size K or less for G, i.e., a subset V' of V with |V'| <= K such that for all u in V - V' there is a v in V' for which (u,v) is in E. K is called the domination number of the graph G if K is minimum among all possible dominating sets for G. The dominating set problem is known to be NP-complete [Gare79]. Suppose that the nodes of the graph G were data pages of the page connectivity graph and the edges in graph G were the edges in the page connectivity graph. Then one can determine the upper bound on the buffer size for the page connectivity graph by determining the domination number of the graph. □
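Proposition 5 suggests a direct, exponential-time check for small graphs: enumerate all cuts and take the largest frontier. A brute-force sketch only; the hardness results above are precisely why no efficient version should be expected:

    from itertools import combinations

    # Least upper bound on the buffer size: max |F_X| + 1 over all cuts.
    G = {'a': {'c', 'd'}, 'b': {'c', 'd', 'e'}, 'c': {'a', 'b', 'd'},
         'd': {'a', 'b', 'c', 'e'}, 'e': {'b', 'd'}}

    def least_upper_bound(G):
        nodes = set(G)
        best = 0
        for r in range(1, len(nodes)):
            for subset in combinations(sorted(nodes), r):
                X = set(subset)
                Y = nodes - X
                F_X = {x for x in X if G[x] & Y}   # frontier of X
                best = max(best, len(F_X))
        return best + 1

    print(least_upper_bound(G))   # 5, matching the example above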
Theorem 4. The problem LUBB is NP-hard.

Proof: Since the problem of computing the least upper bound for the page connectivity model is NP-hard, and the page connectivity model is a special case of the block connectivity model, the LUBB problem is NP-hard. □

3.5. A Heuristic Algorithm for the Page Connectivity Model

Since the problem of determining the OPAS for a given page connectivity graph is a complex combinatorial optimization problem, a heuristic procedure with O(n^2) time complexity is presented here, where n is the number of nodes in the page connectivity graph. This heuristic algorithm determines a page access sequence which requires a near optimal buffer size. This algorithm is efficient when the join factor is high. For small problem sizes, I have compared the performance of the heuristic algorithm with a branch and bound algorithm presented in the next section. The comparison shows that the average buffer size for the heuristic method is very close to optimal (i.e., the buffer size differs from the optimal buffer size by only 2 or 3 pages).

Some definitions are needed before I present the heuristic method. The resident-degree of a node v, written delta_res(v), is defined to be the number of adjacent nodes of v that are in the buffer. A node u is said to be adjacent to node v if there is an edge between them. The nonresident-degree of a node v, written delta_nonres(v), is defined to be the number of the adjacent nodes of v that are not in the buffer. Therefore, the degree of a node v is delta(v) = delta_res(v) + delta_nonres(v). Now the heuristic algorithm is described; an executable sketch of it follows this discussion.

Step 1. Initialize the buffer to empty, and set delta_res(v) = 0 and delta_nonres(v) = delta(v) for all v in V.
Step 2. Add to the buffer a node with the smallest degree. Update the resident-degree and nonresident-degree of its adjacent nodes.
Step 3. Add to the buffer a node with the largest delta_res. If there is more than one such node, choose the one with the smallest delta_nonres. Update the resident-degree and nonresident-degree of the nodes adjacent to the node added to the buffer.
Step 4. Perform the join among the nodes (data pages) in the buffer.
Step 5. Remove from the buffer the nodes having all of their adjacent nodes in the buffer.
Step 6. If the buffer is empty then STOP; otherwise go to Step 3.

Note that Steps 2 and 3 each require one scan of the list of nodes. Therefore, the complexity of this algorithm is in the order of O(n^2), where n is the number of data pages in the graph. Table 3.6 gives the trace of this algorithm for the page connectivity graph of Figure 3.2. Note that the page access sequence is ebdca.

I have applied the above heuristic algorithm to random page connectivity graphs of n nodes and e edges. For each value of n, 20 random graphs with e edges were created. The buffer size for each graph was then computed. Figure 3.8 shows the average buffer size as a function of the edge factor for n = 50, 100 and 200. The edge factor for a graph with n nodes is defined to be the ratio between the number of edges in the graph (i.e., e) and the total possible number of edges between the n nodes, which is n(n-1)/2. In this figure the assumption is that the edge factor can have values up to 1 (i.e., a complete page connectivity graph). However, there are cases in which the value of the edge factor is much less than one. For example, if n pages of relation R need to be joined with m pages of relation S, with the assumption that these pages are disjoint, then the maximum value for the edge factor would be m*n / ((m+n)(m+n-1)/2) < 1.
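A compact sketch of this heuristic follows. The graph literal encodes Figure 3.2; tie-breaking in Steps 2 and 3, which the description leaves open, is resolved lexicographically here, and the nonresident-degree is taken over the not-yet-fetched neighbours:

    # O(n^2) heuristic for a near optimal page access sequence.
    def heuristic_sequence(G):
        remaining, buffer, order, peak = set(G), set(), [], 0
        # Step 2: start with a node of smallest degree
        v = min(remaining, key=lambda u: (len(G[u]), u))
        while True:
            remaining.discard(v)
            buffer.add(v)
            order.append(v)
            peak = max(peak, len(buffer))          # Step 4: joins happen here
            # Step 5: evict pages whose joins are all done
            buffer = {u for u in buffer if G[u] & remaining}
            if not remaining:                      # Step 6
                break
            # Step 3: largest resident degree; ties by smallest nonresident degree
            v = max(remaining,
                    key=lambda u: (len(G[u] & buffer), -len(G[u] & remaining), u))
        return order, peak

    G = {'a': {'c', 'd'}, 'b': {'c', 'd', 'e'}, 'c': {'a', 'b', 'd'},
         'd': {'a', 'b', 'c', 'e'}, 'e': {'b', 'd'}}
    print(heuristic_sequence(G))   # (['a', 'c', 'd', 'b', 'e'], 3)

With a different tie-break in Step 2 the procedure produces the sequence ebdca of Table 3.6 instead; both sequences need a buffer of 3 pages on this graph.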
Table 3.6. Trace of the Heuristic Algorithm for the Graph of Figure 3.2.

    Content of the Buffer    Pages to be Joined    Size of the Buffer
    e                        --                    1
    e,b                      (e,b)                 2
    b,d                      (e,d), (b,d)          2
    b,d,c                    (b,c), (c,d)          3
    d,c,a                    (d,a), (c,a)          3

Figure 3.8. Average Buffer Size for the Heuristic Algorithm. (Average buffer size as a function of the edge factor, for n = 50, 100 and 200.)

3.6. Determining an Optimal Page Access Sequence

In this section I first give a branch and bound algorithm to determine the optimal page access sequence. The complexity of this algorithm is in the order of O(n!). I then compare the performance of the heuristic procedure, given in the previous section, with that of the branch and bound algorithm. A tree organization is used to facilitate the search for an optimal sequence. The branch and bound algorithm chooses the smallest buffer size among all possible paths which start from the root and terminate at leaves. A path of length m from node v to node w in a graph G is an edge sequence from v to w of length m in which the edges are distinct. The complexity of this algorithm, call it algorithm A, is O(n!), where n is the number of nodes in the page connectivity graph. Before I give the algorithm some definitions are needed.

Given the page connectivity graph G(V,E) with n nodes, and an ordered list of nodes v_1, ..., v_n, let alpha_i = {v_k : (v_k, v_p) in E for 1 <= k <= i-1 and i <= p <= n} for 2 <= i <= n-1. Then the size of the buffer for a prefix of length i is xi_i = |alpha_i| + 1, where |x| represents the size of the set x. The size of the buffer for a given page access sequence can now be defined to be max{xi_i : i = 2, ..., n-1}. Let the future buffer size for a prefix of length i of a given page access sequence be |alpha'_i| + 1, where alpha'_i = {v_k : (v_k, v_p) in E for 1 <= k <= i-1 and i+1 <= p <= n} for 2 <= i <= n-1. Note that to determine if a node v belongs to the set alpha_i one needs to scan n-i nodes. However, to determine if a node belongs to the set alpha'_i one needs to scan only n-i-1 nodes. Thus, by using alpha'_i instead of alpha_i in determining the OPAS one can reduce the expansion of the search tree by one level. In the algorithm given below, I first find an ordered list of nodes with the smallest future buffer size and then determine its buffer size. Now I give the algorithm; a sketch of an equivalent search appears after the steps.

Step 1. Consider all the nodes of the graph and generate their children. Determine the future buffer size for all the paths from the roots to their corresponding leaves.
Step 2. Find the path from a root to a leaf with the smallest future buffer size and let B be the last node in the path. If there is more than one such path, choose the longest.
Step 3. Generate the children of the node B, and determine the future buffer size for the new paths generated.
Step 4. If B has only one child then go to Step 5; otherwise go to Step 2.
Step 5. Compute the buffer size for the path starting from the root and ending at the child of B, and then STOP.
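Algorithm A expands an explicit search tree over prefixes. The sketch below is a reformulation rather than the tree search verbatim: it exploits the fact that the pages retained after fetching a set S of pages depend only on S and not on the internal order, so a Dijkstra-style scan over subsets (still exponential in the worst case) also yields an optimal sequence.

    import heapq

    def optimal_sequence(G):
        nodes = frozenset(G)

        def retained(S):
            # pages of S still awaiting a join partner outside S
            return sum(1 for v in S if G[v] - S)

        heap, best, tick = [(0, 0, frozenset(), [])], {frozenset(): 0}, 0
        while heap:
            cost, _, S, order = heapq.heappop(heap)
            if S == nodes:
                return order, cost
            if cost > best.get(S, float('inf')):
                continue
            for v in nodes - S:
                c = max(cost, retained(S) + 1)   # buffer while fetching v
                T = S | {v}
                if c < best.get(T, float('inf')):
                    best[T] = c
                    tick += 1
                    heapq.heappush(heap, (c, tick, T, order + [v]))

    G = {'a': {'c', 'd'}, 'b': {'c', 'd', 'e'}, 'c': {'a', 'b', 'd'},
         'd': {'a', 'b', 'c', 'e'}, 'e': {'b', 'd'}}
    print(optimal_sequence(G))   # an optimal sequence with buffer size 3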
Figure 3.9 gives the trace of algorithm A for the page connectivity graph of Figure 3.2. The number on a node of the graph represents the future buffer size for a path starting at the root and ending at that node. The dotted edges in the graph represent the optimal page access sequence, which is the sequence acdbe. The size of the buffer for this sequence is 3. As mentioned earlier, this algorithm is practical only for small problems. I have compared the average buffer size required for random graphs of 5 up to 10 nodes for both the heuristic method and algorithm A. Figure 3.10 shows the result of this comparison. As seen from the figure, the heuristic algorithm determines a page access sequence of the random graph which requires an average buffer size very close to the optimal.

Figure 3.9. Trace of Algorithm A for the Graph of Figure 3.2.

Figure 3.10. Average Buffer Size for the Heuristic and the A Algorithms. (Average buffer size as a function of the number of nodes, 5 to 10.)

3.7. Cost Models and Performance Comparisons

In this section the performances of the fragmentation technique [Sacc86a], the sort-merge join algorithm [Blas77] and an algorithm based on the page connectivity model are compared. This last algorithm, call it algorithm P, uses the heuristic method given in Section 3.5 to determine the near optimal access sequence. The following notations are used to derive the cost of each algorithm:

    R_i    Relation i
    N_i    Number of tuples in relation i
    E_i    Average number of tuples from R_i in a data page
    L      Average number of (key value, pointer) pairs in a leaf page of an index
    M      Number of pages of available main memory buffer space
    P_i    Join factor for relation i

Now I briefly describe each algorithm and compute its I/O cost.

3.7.1. Cost of the P Algorithm

For two relations R_1 and R_2, this algorithm proceeds as follows:

1) Scan the indices on the join attributes of both relations in order to obtain a set of pointer pairs. The cost for this step is (N_1 + N_2)/L.

2) Use the heuristic algorithm given in Section 3.5 to determine the near optimal page access sequence. Our assumption here is that enough memory is available to access the desired pages without any reaccesses. Therefore, the cost for this step is zero.

3) Fetch the pages according to the sequence obtained in step 2 and perform the join. The cost for this step is tau(P_1, N_1, E_1) + tau(P_2, N_2, E_2), where tau(p,n,e) is the expected number of pages which must be accessed in order to fetch pn tuples of a relation containing n tuples distributed evenly, e per page. tau(p,n,e) has been derived in [Good80] and is given below:

    tau(p,n,e) = n/e                             if 1 - e/n <= p <= 1
    tau(p,n,e) = (n/e)(1 - ((n-e)/n)^(pn))       if p < 1 - e/n

3.7.2. Cost of the Fragmentation Technique

Fragmentation computes the natural join of two relations by recursively partitioning two unordered relations into fragments of size less than or equal to M-1 pages. The partitioning is done using a hashing function. The problem of performing a join between two relations R_1 and R_2 is then reduced to the problem of joining the corresponding fragments of each relation. The cost of partitioning R_i into M-1 fragments is 2H_i, where H_i = N_i/E_i, for i = 1,2. In order for each fragment of the smaller relation, say R_2, to fit in the buffer, partitioning can be recursively applied to each fragment produced by the previous phase. Therefore, phase k produces (M-1)^k fragments, and the size of the kth fragment is H_2/(M-1)^k. The total number of partitioning phases is thus ceil(log_{M-1} H_2) - 1, and the total cost of partitioning relation R_i is 2H_i(ceil(log_{M-1} H_2) - 1). The total cost of the fragmentation join is:

    2H_1(log_{M-1} H_2) + 2H_2(log_{M-1} H_2) - H_1 - H_2
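The cost expressions of Sections 3.7.1 and 3.7.2 in executable form. The second branch of tau follows the reconstruction above and may differ in detail from the exact expression of [Good80]; the parameter values in the example are hypothetical:

    import math

    def tau(p, n, e):
        # expected pages accessed to fetch p*n of n tuples stored e per page
        if p >= 1 - e / n:
            return n / e
        return (n / e) * (1 - ((n - e) / n) ** (p * n))

    def p_algorithm_cost(N1, E1, P1, N2, E2, P2, L):
        # step 1: index scans; step 2 is free; step 3: fetch the data pages
        return (N1 + N2) / L + tau(P1, N1, E1) + tau(P2, N2, E2)

    def fragmentation_cost(H1, H2, M):
        k = math.ceil(math.log(H2, M - 1))   # ceil(log_{M-1} H2)
        return 2 * H1 * k + 2 * H2 * k - H1 - H2

    print(p_algorithm_cost(N1=10000, E1=10, P1=0.1, N2=10000, E2=10, P2=0.1, L=100))
    print(fragmentation_cost(H1=1000, H2=1000, M=50))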
3.7.3. Cost of the Sort-Merge Algorithm

This algorithm requires the initial sorting and multiway merging of both joining relations on their join attributes, and finally the merging of the two sorted relations. The cost for this algorithm is:

    H_1 + 2H_1(log_{M-1} H_1) + H_2 + 2H_2(log_{M-1} H_2)

3.7.4. Cost Comparisons

Figure 3.11 shows the cost of the three algorithms as a function of chi for two different join factors. chi is defined to be the ratio of the buffer size to the number of pages of the joining relations, M/(H_1 + H_2). For join factors of 0.5, algorithm P and the fragmentation technique perform almost the same, because for algorithm P almost all the pages of the joining relations have to be fetched.

Figure 3.11. Comparisons of the Sort-Merge, Fragmentation, and P Algorithms. (Number of disk accesses as a function of chi; (a) join factors P_1 = P_2 = 0.1, (b) join factors P_1 = P_2 = 0.5.)

CHAPTER 4
JOIN AND SEMIJOIN ALGORITHMS BASED ON THE PARTIAL-RELATION SCHEME

4.1. Introduction

An index on an attribute of a relation permits rapid access to a single tuple that has a desired value in that attribute. However, for multi-attribute queries involving join and restrictions several indices may need to be accessed. Hashing is another method for fast access to the desired tuples of a relation. In addition, this technique has been used for multi-attribute search [Burk76, Ullm82] and for join-only queries [DeWi84, Shap86]. However, one may need to scan several hash tables in order to process a query involving restrictions and join. In [Tana79] a new access method called a Relational Inverted Structure (RIS) has been proposed, which is a combination of indexing and hashing techniques. An RIS maintains the cross references of the joining attributes as well as the hash values of the remaining attributes of the relations. However, an RIS may be needed for every set of attributes corresponding to the same domain. Furthermore, no extensive analysis has been done on their performance. File compression techniques have also been proposed for partial match retrieval [Sack83]. These techniques are not suitable for the join operation because the join values have to be compressed uniformly across the files. Thus, coding of join attribute values requires global statistics, which is more difficult to apply for file compression. Here we propose a new access path to a relation, called a Partial-Relation scheme, or PR in short, in order to speed up the relational join operations.

A Partial-Relation scheme is a projection of the relation on the join and restriction attributes of the queries [Pram86a, Foto88]. Further, the values of an attribute in a Partial-Relation scheme are obtained by applying a transformation function to the original values. The objective of this transformation function is to reduce the size of the attribute values. As a result, the amount of storage required by a Partial-Relation is also reduced. Known transformation functions such as hashing and/or compression techniques may be used to reduce the attribute values. Note that a Partial-Relation is created based on the statistics of the queries. As I will show, the PRs facilitate the implementation of queries involving restrictions and join. Two join algorithms are presented, for the index-scan and the hybrid classes. These algorithms preprocess the Partial-Relations first and then join the selected tuples of the relations.
The performances of these algorithms are compared with the sort-merge and the hybrid-hash join algorithms (the two best known algorithms of the relation-scan class). The analysis is based on the cost of accesses to the secondary storage and the CPU usage. It is shown that for a wide range of restriction factors† and/or join factors the proposed algorithms perform better than the sort-merge and hybrid-hash join algorithms. We have considered join-only queries and queries involving restrictions, projections and join.

The organization of this chapter is as follows: Section 4.2 describes the Partial-Relation schemes. In Sections 4.3 and 4.4 the join and the semijoin algorithms based on Partial-Relation schemes are presented. The performances of these algorithms are compared with those of the sort-merge and the hybrid-hash.

† The restriction factor for a relation is defined to be the ratio between the number of tuples in the relation that satisfy the restrictions and the size of the relation.

4.2. An Access Path Based on a Partial-Relation Scheme

In this section I first describe the Partial-Relation scheme and then discuss the advantages and disadvantages of using PRs in performing join and restrictions over using multiple secondary indices or hashing techniques.

4.2.1. Description of Partial-Relation Schemes

Let R(A_1, A_2, ..., A_n) be a relation scheme with n attributes, A_1, A_2, ..., A_n, such that the first k attributes of R are participating in the joins or restrictions. Let |R| be the total number of tuples in R, and //R// be the amount of storage required by R. A Partial-Relation scheme for R, or PR(R), is defined to be a relation scheme with k+1 attributes where the first k attributes are those of R and the (k+1)th attribute is an access attribute, or AA, which points to the corresponding tuple in R. The attribute values in PR(R) are derived by applying a Value-Reducer function which, for example, can select a few characters out of all the characters of an attribute value. Here, a set of Value-Reducer functions, one for each attribute of the PR, is defined. Figure 4.1 gives an example of a Partial-Relation scheme for the EMPLOYEE relation.

Relation EMPLOYEE

    rec.#  ename       address               telephone  dept         manager
    1      Buckner,J.  162 Haslett, E.Lan    332-3860   Pharmacy     Brown,D.
    2      Buckman,J.  299 Cleveland, E.Lan  227-1759   Shoe Repair  Smith,A.
    3      Buckman,P.  160 Kalamazoo, E.Lan  335-2343   Photography  Clare,R.
    4      Buckley,J.  452 Garfield, E.Lan   353-1759   Pharmacy     Brown,D.
    5      Blerth,H.   684 Cedar, E.Lan      350-1775   Photography  Clare,R.

Value-Reducer functions:
    F(ename) = H(manager) = the first three characters of the attribute value
    G(dept) = the first two characters of the attribute value

PR(EMPLOYEE)

    ename  dept  manager  AA
    Buc    Ph    Bro      1
    Buc    Sh    Smi      2
    Buc    Ph    Cla      3
    Buc    Ph    Bro      4
    Ble    Ph    Cla      5

Figure 4.1. An Example of a Partial-Relation.

The problem of determining an optimal set of attributes for a PR is similar to the secondary index selection problem [Come78], which is known to be NP-complete. Low cost heuristics for near optimal solutions have been given in [Schk75]. These algorithms are based on the statistical properties of the queries. To add a new attribute to an existing PR one may scan the original relation and then create a new PR by applying the Value-Reducer functions to the values of the new set of attributes. I show in Appendix A that the cost of creating an index is more than the cost of creating a PR.
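A sketch of PR construction with truncating Value-Reducer functions, mirroring Figure 4.1; the two-tuple relation here is an abbreviated stand-in, and real reducers could also hash or compress the values:

    # Build a Partial-Relation from a relation and per-attribute reducers.
    def make_pr(relation, reducers):
        pr = []
        for aa, t in enumerate(relation, start=1):
            row = {attr: f(t[attr]) for attr, f in reducers.items()}
            row['AA'] = aa   # access attribute: points back to the tuple
            pr.append(row)
        return pr

    employee = [
        {'ename': 'Buckner,J.', 'dept': 'Pharmacy',    'manager': 'Brown,D.'},
        {'ename': 'Buckman,J.', 'dept': 'Shoe Repair', 'manager': 'Smith,A.'},
    ]
    reducers = {'ename':   lambda v: v[:3],   # F: first three characters
                'dept':    lambda v: v[:2],   # G: first two characters
                'manager': lambda v: v[:3]}   # H: first three characters
    print(make_pr(employee, reducers))
    # [{'ename': 'Buc', 'dept': 'Ph', 'manager': 'Bro', 'AA': 1}, ...]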
Processing a query using PRs is done in two phases. In the first phase, one performs restriction, and join or semijoin operations, on the PRs. This produces a set of pointers to the resulting tuples of the original relations. In the second phase, the final join and restriction operations are performed on these resulting tuples. The PRs contain partial values and as a result operations on them may produce a few false tuples†. Therefore, at the end of the first phase a superset of the tuples to which the join operation should be applied is obtained. The cost of processing a query is the sum of the cost to preprocess the PRs and the cost to process the resulting tuples. The cost of preprocessing the PRs depends on //PR//. In order to minimize the overall cost a set of Value-Reducer functions is defined which (a) minimizes //PR// and, at the same time, (b) decreases the number of false tuples. In general, these two objectives conflict with each other. By defining an appropriate set of Value-Reducer functions one may be able to avoid the problem of condition (b). However, this may not satisfy condition (a). Note that allowing false tuples in a PR may be advantageous in reducing the size of the PR. As we will show in Section 4.4.1, for a small percentage of false tuples (up to 40%) the performance of the proposed algorithms remains almost the same if the restriction factor is low.

† False tuples are tuples which seem to satisfy the query when viewed in the reduced space, but actually are not qualified tuples [Lum70].

The number of false tuples depends on the ratio between the number of unique attribute values in a PR and in the original relation. An analytical model is developed to compute this ratio, M, for N randomly generated unique attribute values. The ratio is given by

    M = (A^X / N) (1 - C(A^L - A^(L-X), N) / C(A^L, N))

where C(n,r) denotes the binomial coefficient, A is the alphabet size, and X characters of each value of length L are selected by a Value-Reducer function. The derivation of this expression is given in Appendix B. To validate the analytical result, we have computed the percentage of false tuples (i.e., 1 - M) for 1000 randomly selected persons' names of up to 20 characters. I have also computed the percentage of false tuples for 600 randomly selected telephone numbers of 7-digit length. The results of these experiments as well as those of the analytical model are given in Figure 4.2. The graph of the figure shows that the percentage of false tuples approaches 0 rapidly with an increasing value of X. Thus, by choosing only a few characters of the attribute values one can guarantee few false tuples. The graph also shows that the percentage of false tuples for the experimental data is less than that of the analytical model. This is because the adjacent characters of the experimental data are statistically dependent. The effect of a set of Value-Reducer functions on //PR// is also analyzed in [Pram85a, Pram86b].

Figure 4.2. Number of Characters Selected vs. Fraction of False Tuples. (Curves: A = experimental, alphabet size 26, N = 1000; B = analytical, alphabet size 26, N = 1000; C = experimental, alphabet size 10, N = 600; D = analytical, alphabet size 10, N = 600.)
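The expression for M evaluates directly; a small sketch, where the incremental product avoids forming the huge binomial coefficients explicitly:

    # Expected ratio M of unique reduced values to original unique values:
    # alphabet size A, value length L, X characters kept, N unique values.
    def unique_ratio(A, L, X, N):
        D, K, G = A ** L, A ** (L - X), A ** X
        if D - K < N:
            p_missed = 0.0      # every reduced value is certain to occur
        else:
            p_missed = 1.0      # probability a fixed reduced value is absent
            for i in range(N):
                p_missed *= (D - K - i) / (D - i)
        return G * (1 - p_missed) / N

    for X in range(1, 7):       # false-tuple fraction 1 - M, as in Figure 4.2
        print(X, round(1 - unique_ratio(A=26, L=20, X=X, N=1000), 3))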
4.2.2. Partial-Relation Scheme vs. Indexing and Hashing

As mentioned earlier, the primary function of the PRs is not to perform an efficient restriction on them, as it is in conventional indexing, but to perform an efficient join. Therefore, for simple restriction queries with a very low restriction factor it is advantageous to use conventional indexing or hashing techniques. This is because PRs are scanned sequentially even for a low restriction factor. In general, PRs are suitable for applications where complex queries are processed. To process a query of this type, one needs to scan only one PR per relation. However, in methods using secondary indices, one may need to scan several indices and perform an intersection to obtain the set of keys or pointers to the relation.

Lum [Lum70] proposed a compound indexing scheme in order to retrieve all the tuples satisfying a partial-match query in one access. A compound indexing scheme consists of a set of secondary index files with a common set of index fields in each file, but each file is ordered on different index fields. Therefore, the implementation of compound indexing requires a lot of storage. On the other hand, if a serial filing method is used, only one index file is created for all the attributes to be indexed. General retrieval through this index file, however, may require a large number of accesses because the index is scanned sequentially. Note that a PR is a special form of a serial filing method. However, the size of a PR is much smaller than that of an index because each value in a PR is reduced by a Value-Reducer function. Therefore, sequential scanning of a PR is not as costly as sequential scanning of an index in a serial filing method. At the same time, because of the reduction, false tuples may result in processing a PR.

To process join-only queries with low join factors it is more cost effective to preprocess the PRs than to use the hashing techniques suggested in [DeWi84, Shap86]. This is because in hash-based join methods the original relations are scanned several times, while preprocessing PRs requires at most one scan of the original relations. Further, the preprocessing also reduces the cost of the second phase considerably because the sizes of the relations are reduced due to the low join factors. On the other hand, for very high join factors preprocessing PRs is not as effective as the hash-based join algorithms. Since PRs are centralized resources, the degree of inter- and intra-query concurrency is decreased as compared with methods using multiple secondary indices. However, the cost of updating a single PR is much less than that of maintaining several indices. The overall storage requirements for a database will increase when PRs are used. However, the size of a PR is much less than that of the corresponding indices because of the duplicate sets of tuple pointers which appear in each index. Further, the storage utilization for a PR is close to 100% (a PR does not have to be ordered), whereas in a B-tree the average storage utilization is about 69%.

4.3. Algorithms Based on the Partial-Relation Schemes

In this section two algorithms for a relational equijoin operation between two relations, R and S, are presented. The first algorithm, PRJ, is based on joining PRs, and the second algorithm, PRS, performs a semijoin between the two PRs. The performances of the above algorithms are analyzed using the following form of the query:

    PROJECT(JOIN(RESTRICT(R), RESTRICT(S)))

4.3.1. Algorithm PRJ of the Index-Scan Class

This algorithm first applies the restrictions on the PRs and then joins the selected tuples of the two PRs. The result of this join is a set of AA-value pairs. Using these AA-values, the tuples of the original relations are fetched and joined. The details of this algorithm are as follows (a sketch follows the steps):

a) Apply restrictions and projections on PR(R) by scanning it sequentially. The projection is done only on the join and the access attributes. Store the result in a file R'. Perform similar operations on PR(S) and store the result in a file S'.

b) Join R' and S' and obtain a set of AA-value pairs which point to the joining tuples of R and S. The joining of R' and S' is done by first sorting them and then merging the two sorted files.

c) Fetch the tuples of R and S using these AA-values. Then perform the final restriction, join and projections on these tuples.
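A sketch of PRJ over in-memory lists; files, on-disk sorting, and the final projection are elided, relations are lists of dicts, the AA-values index them 1-based as in Figure 4.1, and step (b) is written as a nested scan for brevity where the text merges two sorted files:

    # Algorithm PRJ: join the restricted PRs, then fetch by AA-values.
    def prj(pr_r, pr_s, R, S, restrict_r, restrict_s, attr):
        # (a) restrict each PR; project on the join attribute and AA
        r1 = sorted((t[attr], t['AA']) for t in pr_r if restrict_r(t))
        s1 = sorted((t[attr], t['AA']) for t in pr_s if restrict_s(t))
        # (b) join the two sorted files on the reduced values -> AA pairs
        pairs = [(ra, sa) for rv, ra in r1 for sv, sa in s1 if rv == sv]
        # (c) fetch the original tuples; the final join on the full values
        # discards the false tuples introduced by the reduction
        return [(R[ra - 1], S[sa - 1]) for ra, sa in pairs
                if R[ra - 1][attr] == S[sa - 1][attr]]

    R = [{'a': 'apple'}, {'a': 'cherry'}]
    S = [{'a': 'cherry'}, {'a': 'grape'}]
    pr = lambda rel: [{'a': t['a'][:2], 'AA': i} for i, t in enumerate(rel, 1)]
    print(prj(pr(R), pr(S), R, S, lambda t: True, lambda t: True, 'a'))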
4.3.2. Algorithm PRS of the Hybrid Class

This algorithm uses the semijoin and sort-merge techniques to process the above query. The semijoin operation is first performed on the PRs in order to reduce the size of the relations. The reduced relations are then joined using the sort-merge technique. For the semijoin operation a hashing function as well as Boolean arrays are used. The PRS algorithm is good for join-only queries. Also, as explained in Section 4.4.1, for the general queries given above, the PRS algorithm performs better than the PRJ algorithm for high restriction or join factors. This is because the PRS algorithm reduces the relations by means of the semijoin operation and not by using AA-values. On the other hand, the PRJ algorithm uses the AA-value pairs and, therefore, for high restriction factors the number of accesses may be more than the number of pages in the relations. The details of the PRS algorithm are as follows (a sketch follows the steps):

a) Apply the restrictions on the PRs and perform the semijoin on them. Store the result of this semijoin in a bit array.

b) Scan relation R, and for each tuple hash its join attribute value and probe the bit array for a match. In case of a match store the subtuple in a temporary file R'. Create S' from S in the same way as R'. Note that R' and S' contain the entire length of the key values rather than partial key and AA-values, as for the previous algorithm.

c) Join R' and S' by first sorting them on their join attribute values and then merging the two sorted files.
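A sketch of PRS in the same style; a Python set of hash buckets stands in for the Boolean bit array, the reducer is applied to the full values when the relations are scanned so that they hash like the PR entries, and the bit-array size is an illustrative constant:

    # Algorithm PRS: semijoin the PRs through a bit array, then sort-merge.
    def prs(R, S, pr_r, pr_s, restrict_r, restrict_s, reduce_, attr, bits=1 << 16):
        # (a) restrict both PRs; the semijoin survivors set bits
        b_r = {hash(t[attr]) % bits for t in pr_r if restrict_r(t)}
        b_s = {hash(t[attr]) % bits for t in pr_s if restrict_s(t)}
        match = b_r & b_s
        # (b) scan each relation and probe the bit array with the reduced
        # join value; keep matching tuples in sorted temporary files
        r1 = sorted((t for t in R if hash(reduce_(t[attr])) % bits in match),
                    key=lambda t: t[attr])
        s1 = sorted((t for t in S if hash(reduce_(t[attr])) % bits in match),
                    key=lambda t: t[attr])
        # (c) merge the two sorted files on the full join values
        return [(r, s) for r in r1 for s in s1 if r[attr] == s[attr]]

    R = [{'a': 'apple'}, {'a': 'cherry'}]
    S = [{'a': 'cherry'}, {'a': 'grape'}]
    red = lambda v: v[:2]
    pr = lambda rel: [{'a': red(t['a'])} for t in rel]
    print(prs(R, S, pr(R), pr(S), lambda t: True, lambda t: True, red, 'a'))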
Finally ‘81 the cost under condition C is 2 FNi/EfiUiNi/Eiai'lozu-r(JiNi/mM -1)Ei)))- II 82 APPENDIX D Appendix D Execution Time of the Hybrid-Hash, SM, PRS and MPRS Algorithms Herewegivethecostofthehybrid-hash,SM.PRSandMPRSjoinalgorithms basedonthenumberofdiskIlOsandthecpuusage. Thefollowingpar'ameterawhich aregivenin[DeWi84],areusedforexpressingthecostoftheabovealgorithms: comp timetocomparetwokeyvalues hash timetohashakey move timetomoveatuple swap timetoswaptwotuples [0359 time for sequential IIO operation 10m timeforrandomIIOoperation F universal ”fudge" factor. This factor is used for expressing for example, the exu'astoragethatarelationneedtoholditshashtableinthebufl'er. a For the hybrid-hash the cost of joining two relations R and S is 2’15, where ill 111: (IR I+IS ”£0339. Thecostofreadingtherelations. ll'2= (//R //+//S/I)xllaslr. The cost for partitioning the relations. h3= (IIR//+//S//)x(1-q)xmove. The cost of moving the tuples to the output buffers. ”to! ‘3 IR 1 ‘ In: (R I+ IS I)x(l-q)xlom. The cost of writing from output buffers. Its: (l/R//+//S//)x(l-q)xllaslr. The cost for building hash tables for R and finding where to probe for S. llg= l/S/lexcomp. The cost for moving tuples to hash tables for S. 83 34 In: //R//xmove. The cost for moving tuples to hash table for R. M: (IR I+IS I)x(1-q)time10359. The cost for reading the partitions into the memory. 6 The cost of the SM join algorithm is 2a, where i=1 s1= (IR l+ IS I)xlOsg-Q. The cost of accessing the relations. s2: (//R l/log2 yp-gH/S/Ilogz-L—goxkomp +swap ). The cost of inserting tuples into priority queue to form initial runs. MR and MS specify the number of tuples from Rand S, respectively, that can fit in main memory at once. S3= (IR I+IS I)>