.\_ ‘ 1‘," ‘v HE. ‘ Inn V 'x 4-. w I I ‘ ""«s~ Viv- ”nun.” . u ”.1" \ va a??? ‘ 1;;‘7‘0‘7" ‘. u ‘ . “~ In _\, .V, x 1 Iv 5-"in ( ' - ,M‘u ‘1 \‘ 3mm!" ’1- l f ‘1 w' . 31" ' ' 1 4 ’ "V ‘ . .V‘Y'Sv. ‘If-‘fllfi‘l'f. l' r w ‘ Alli! 1 H x \I, ”1'..- ("A I if; ,I . H ‘fwyfé l ‘ f ‘w‘. 1 S (44‘s't‘J-XU4‘MI’.‘ ‘ . 7‘ #3914014: ' I 7 ‘l I ' -,. u l . dual,” ‘ H' .s’l 2t :2‘Ijt‘,’t ‘" a, 4‘“: ‘ E W Lib. mid] w». ‘ pxr‘.‘ . a'fg’A‘Q' ;. , MM" 44‘ ' w- i" . ._ . , ,‘(Q‘N‘Kb' ‘ '4 ‘4'} ,r v l «’14 ‘ $741. , 4» II 2 , J l ( :r‘ 3‘3: . ”(Hr ’5’! 'Y' {My {WAY I u" n Gill‘v‘}! ' 4 § ( E . "I ' ‘ v‘: fly} 3)} ‘ V» .v .gv“: ‘ 0”}qu I.“ {(435 ‘ d l‘ 4 1 . ‘ n ; {5.}??? 5: ' '3',“ fin???“ J , . ‘ M“ l“’ r: .... "um. ‘ N’fl'fl .a 14%: ‘aw 7AA," ~ K." '1'; u "\ J‘ ~ I.“~\‘ \ a‘. V ‘I " ‘ . ‘ x , h 3:9533‘2’!‘ ‘1‘ 4h V}; v . v ; i . ‘6‘ 5:.“ ’3‘ ‘, vx‘ l‘q» ~ 1?in .mva‘v ,.v i» i V ' ‘s 2‘». I; . 4V; 4 v" 0". no! ‘ . \.;. 31’» ‘I. ‘4 ‘ 1“ ‘4 Ma: 1'" .1 {.1}, y ,I , 43‘s“ .,. 31,1!44- '; J. \V,‘ . r‘. a .‘ ‘ .. 4 .4-, A1). ’ I , :- w“ “(Ma 15‘1”" “"5?” ‘ ‘ 'Tt‘qf."~;‘,¢‘u, (,Hu‘lp‘. l.,“l(‘l . ' ,l (K; “I. "act”; '9' Q6,“ \ A345. 34} 0 7“,"; 1- y’k‘ (Fig: "' ‘1‘. 4‘ FHA]. ". fla- 1‘ l.- J [Willilllfl/lllllllllHill!!!llIll/IllH/ll/lll 93 00552 6797 This is to certify that the dissertation entitled DATA DISTRIBUTION STRATEGIES FOR PARALLEL DATABASE ACCESSES presented by Myoungho Kim has been accepted towards fulfillment of the requirements for #degree in om u er ience MM Major professor Date ML MSU is an Affirmative Action/Equal Opportunity Institution 0—12771 LIBRARY Michigan State University A UM,...n._ 4‘_4 MSU LIBRARIES m RETURNING MATERIALS: Place in book drop to remove this checkout from your record. FINES will be charged if book is returned after the date stamped below. DATA DISTRIBUTION STRATEGIES FOR PARALLEL DATABASE ACCESSES By Myoungho Kim A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Computer Science 1989 ‘- , If -' , Z» ”N i 57;: ABSTRACT DATA DISTRIBUTION STRATEGIES FOR PARALLEL DATABASE ACCESSES By Myoungho Kim With the advent of commercially available general purpose multiprocessing sys- tems, the need for developing appropriate information processing systems are increas- ingly recognized. Since many database applications require a large number of data accesses with relatively less computation, exploiting parallel data accesses is important to improve performance in parallel processing database systems. In this thesis we investigate data distribution strategies for parallel processing of database systems. The primary objective is to maximize throughput and minimize response time through con- current data accesses. We propose database processing models as a general framework, and then present data distribution strategies for three common types of database applica— tions. Two of these applications are on multikey hash files and the third application is on B-tree accesses. First, we present data distribution strategies for partial match queries. The main contribution here is the development of new data distribution methods called Fieldwise eXclusive-or (FX) distribution methods to achieve optimal file distribution. An algebraic property of exclusive-or Operation along with field transfor- mation techniques are fundamental to these data distribution methods. We show that the proposed data distribution methods perform better than the others proposed in the past. We also present efficient data construction methods based on the usage of multi- key hash directory. Second, optimal distribution for parallel processing of multiattri- bute range queries is investigated. Here, we show that for various types of multiattribute range queries there are inherent limitations in achieving optimal distribu- tion. We extend FX distribution methods to achieve optimal distribution for many use- ful multiattribute range queries. For both partial match queries and multiattribute range queries, sufficient conditions for optimal distribution by the proposed distribution methods are given. Finally, we present node partitioning schemes for B-tree type indexes. The objective here is to develop a new parallel processing scheme for B-tree type indexes stored in parallel disks. We show that parallel processing of the proposed partitioned node B-trees performs better than parallel processing of conventional B- trees. This work presents a new basis on which parallel processing systems for other database applications can be designed. To my parents Han Cheol Kim and Kye Soon Lee iv ACKNOWLEDGEMENTS I wish to express my appreciation to my advisor, Dr. Sakti Pramanik, for his gui- dance over the years. Dr. Pramanik provided valuable comments and constant support in every phase of the preparation of this work. I would also like to thank Dr. Lionel M. Ni and Dr. Abdol Esfahanian for their many valuable suggestions and comments. I also wish to express my appreciation to Dr. Dorian Feldman for his encouragement and sup- port. Without their intellectual advice and inspiration, this dissertation would not have been possible. I would like to thank all the people who helped me during my stay at Michigan State University. Finally, I would like to thank my parents, my wife and my son for their constant support, patience, and love. TABLE OF CONTENTS List of Tables ............................................................................................................ ix List of Figures .......................................................................................................... x Chapter 1. Introduction .......................................................................................... 1 1.1. Database Machines ......................................................................................... 1 1.2. Main Memory Databases ................................................................................ 3 1.3. Concurrent Dynamic Search Structures ......................................................... 3 1.4. Classification of Database Queries ................................................................. 4 1.5. Problem Statement .......................................................................................... 5 1.6. Thesis Overview ............................................................................................. 6 Chapter 2. Database Processing Models for Parallel Processing ........................ 8 2.1. General Parallel Processing Model for Database Systems ............................. 8 2.2. Two Stage Parallel Processing Model for Database Systems ........................ 10 Chapter 3. Parallel Processing Strategies for Partial Match Queries ................ 14 3.1. Introduction .................................................................................................... 14 3.2. Definitions and Terminology .......................................................................... 16 3.3. Basic FX Distribution Method ....................................................................... 19 3.4. Extended FX Distribution Methods ................................................................ 24 3.4.1. Field Transformation Functions for Partial Match Queries .................... 25 3.4.2. I and U Field Transformation Functions ................................................. 27 3.4.3. I and I Ux Field Transformation Functions .............................................. 29 vi 3.4.4. U and I Ux Field Transformation Functions ............................................. 35 3.4.5.1U1, IU2, . . . , [Ux Field Transformation Functions .............................. 40 3.4.6.1, U and I Ux Field Transformation Functions ......................................... 46 3.5. Performance Comparison with Other Distribution Methods .......................... 53 3.5.1. Probability of Strict Optimality ............................................................... 54 3.5.2. Average Response Time .......................................................................... 55 3.6. Data Construction Methods ............................................................................ 62 3.6.1. Data Construction Based on Real Global Directory ................................ 63 3.6.2. Data Construction Based on Virtual Global Directory ............................ 66 Chapter 4. Optimal Data Distribution for Multiattribute Range Queries .................................................................................................................................... 68 4.1. Introduction .................................................................................................... 68 4.2. Definitions and Terminology .......................................................................... 69 4.3. Limitations of Perfect Optimal Distribution ................................................... 73 4.3.1. Type (0 - 2) Range Queries ..................................................................... 74 4.3.2. Type (0 - 1) Range Queries ..................................................................... 78 4.3.3. Type 0 Range Queries ............................................................................. 82 4.4. Optimal Data Distribution Methods for Range Queries ................................. 82 4.4.1. Optimal Distribution by Basic FX Distribution Methods ....................... 83 4.4.2. Field Transformation Functions for Range Queries ................................ 86 4.4.3.1 and UR Field Transformation Functions ............................................... 90 4.4.4.1 and UM Field Transformation Functions .............................................. 92 4.4.5. UR and UM Field Transformation Functions .......................................... 95 4.4.6.1, UR and UM Field Transformation Functions ....................................... 98 Chapter 5. Node Partitioning Schemes for B-trees .............................................. 100 5.1. Introduction .................................................................................................... 100 5.2. The PNB-tree .................................................................................................. 101 5.2.1. The PNB-tree Operations ........................................................................ 103 5.2.2. Parallel Disks Configuration for PNB-trees ............................................ 104 vii 5.3. Motivations of PNB-trees ............................................................................... 106 5.3.1. Compressed Height .................................................................................. 106 5.3.2. Reduced Frequency of Tree Restructuring .............................................. 107 5.4. Performance Comparison ............................................................................... 107 5.4.1. Performance Models ............................................................................ 108 5.4.2. Average Data Retrieval Time .............................................................. 109 5.4.3. Threshold Points .................................................................................. 114 5.4.4. Update Performance ............................................................................ 115 5.5. Other Strategies for The PNB-tree Organization ......................................... 115 Chapter 6. Conclusions ........................................................................................... 118 List of References ..................................................................................................... 121 viii LIST OF TABLES Table 3.1. Response Time for F 1=F2=F3=F4=2, F 5=F 6% and M = 16 ............... 59 Table 3.2. Response Time for F 1=F 2=F 3:2, F 4=F 5=F 6:4 and M = 32 ............... 59 Table 3.3. Response Time for F 1=F 2=F3=F4=F 5=F 6:8 and M = 32 ................... 59 Table 3.4. Response Time for F 1=F 2=F3=F4=F 5=F 6:8 and M = 64 ................... 60 Table 3.5. Response Time for F1=2, F 2=F 3:4, F 4=F 5=F 6:8 and M = 128 ....................... . ..................................................................................................... 60 Table 3.6. Response Time for F 1=F2=F3=F4=4, F 5=F 6:8 and M = 256 ............. 60 Table 3.7. Response Time for F1=F2=F3=4, F4=F5=F6=8 and M = 512 ............. 61 Table 3.8. Response Time for F1=F2=F3=8, F4=F5=F6=16 and M = 512 ........... 61 ix Figure 2.1. Figure 2.2. * Figure 2.3. Figure 3.1. Figure 3.2. Figure 3.3. Figure 3.4. Figure 3.5. Figure 3.6. Figure 3.7. Figure 3.8. Figure 3.9. Figure 4.1. Figure 4.2. Figure 4.3. LIST OF FIGURES General Parallel Processing Model ........................................................ Two Stage Parallel Processing Model .................................................... Two Level Implementation of H1 for The Two Stage Model ............... Basic FX Distribution ............................................................................ FX Distribution with I And U Transformation ...................................... FX Distribution with I And I U 1 Transformation ................................... FX Distribution with I And I U 2 Transformation ................................... FX Distribution with U And I U 1 Transformation ................................ FX Distribution with U And I U 2 Transformation ................................. FX Distribution with I, U And I U 2 Transformation .............................. Strict Optimality When for Any Fields r and s, F ,F s 2 M .................... Strict Optimality When for Any Fields r, s and t, F ,F SF, 2 M ............. Explicit Representation of Multikey Hash Functions ............................ Example Bucket Distributions ............................................................... Nonexistence of Perfect Optimal Distribution for Type (0 - 2) Range Queries ............................................................................................................ Figure 4.4. Figure 4.5. Bucket Distribution When F 1 = 4, F 2 = 4 and M = 8 ........................... Nonexistence of Perfect Optimal Distribution for Type (O - 1) Range Queries ............................................................................................................ Figure 4.6. Figure 4.7. Figure 4.8. Figure 4.9. Figure 5.1. Basic FX Distribution When F 1 = F 2 = 4 and M = 4. ......................... FX Distribution with I And UR Transformation .................................... FX Distribution with I And UM Transformation ................................... FX Distribution with UR And UM Transformation ............................... Partitioned Node B-tree .......................................................................... 9 11 13 19 29 35 36 39 4O 52 56 57 70 72 75 77 80 83 91 96 98 103 Figure 5.2. Figure 5.3. Figure 5.4. Figure 5.5. Figure 5.6. Figure 5.7. Figure 5.8. PNB-tree After Insertion of Key Value 57 ............................................ 105 PNB-tree After Deletion of Key Value 71 ............................................. 105 Data Response Time with Various Data Request Rates ........................ 111 Data Response Time with Various Disk Speeds .................................... 113 Threshold Points ..................................................................................... 114 Tree Restructuring Cost ......................................................................... 116 Insertion Time ........................................................................................ 116 xi CHAPTER 1 INTRODUCTION There will be many applications for large databases that cannot be performed in an acceptable response time by current database systems. Since one of the most significant ways of improving performance is through parallelism, the importance of parallel pro- cessing in database systems has been increasingly recognized. Parallel processing in database systems can increase throughput and minimize response time through con— current data accesses as well as processing data in parallel. However, parallel process- ing by itself does not necessarily lead to high performance. Some of the reasons are attributed to overhead due to interprocessor communication, remote memory accesses and data access conflicts. For parallel database operations external I/O also causes a serious performance bottleneck [Bor83]. Stone [Sto87] shows that parallel query algo- rithms in a multiprocessor system may perform poorly than efficient serial algorithms on a single processor system. The advantage of indexing to reduce external I/O traffic was also emphasized in that paper. There have been numerous works on improving perfor— mance of database systems by specialized software/hardware techniques. These include database machines, main memory resident databases and. concurrent dynamic search Sti'llCtllI’CS. 1.1. Database Machines Many proposals have been made in the past for machine architectures for efficient parallel processing of database operations. Early designs such as CASSM [Su79] and RAP [Ozk75] exploit the logic per track idea which was first proposed in [81070] as an alternative to pure associative memories. They are SIMD architectures and focus on 1 parallel scanning of data on secondary storage devices. DIRECT [Dew79] is a logical extension of the SIMD type associative processors which supports both intra-query and inter-query concurrency. It assigns the number of processors dynamically to a query and so it allows users to share the processing power simultaneously. These machines can be characterized by functional replication approach because processors (or memories) are basically similar to each other and capable of performing the same set of functions in parallel. On the other hand, database machines such as RDBM [chh83] and SiDBM [Le185] have used functional specialization approach which means that dif- ferent functional units are used for different classes of database operations. Some database machines such as DBMAC M583] and Delta [Kak85] use an attri- bute based data model which employs vertical fragmentation concept. They use a verti- cal fragmentation of a relation to minimize the cost of processing and transferring unnecessary attributes from disks. Other database machines which use some form of vertical fragmentation are DBC [Ban79] and a database computer proposed in [Tan83]. Due to the similarity between relational algebra tree and data flow graph, the data driven computation concept has been proposed as an effective operational method for relational database machines [Bor82]. GAMMA, proposed in [Dew86], exploits dataflow query processing techniques. In GAMMA, queries are compiled into a tree of operators, and each operator is executed by one or more operator processes. With the exception of a few control messages at the time a query is initiated, execution of an operator is self-scheduling, i.e., when the process terminates execution, data flows between the processes without centralized control. There are also other proposals of database machines using data flow concepts [Bic83, Gli83]. Performance of various database machines are discussed in [Haw82, Hi186]. The results show that the performance improvement depends on the query type as well as the database machine architecture. l.2. Main Memory Databases Since inexpensive large main memory is becoming available, and many database applications are I/O bound, some work has been done on main memory database sys- tems to minimize data access time. ESP, presented in [GarL84], consists of multiple processors, but it operates in SISD mode. ESP uses single data stream with all proces- sors operating in lock step. The main focus of this machine is to reduce the global memory access time. SiDBM presented in [Le185], on the other hand, has an MIMD architecture in which the processors are functionally specialized. Both machines use shared memory with common bus architecture. Storage structure for main memory databases has been studied in the past. Dewitt et al. discussed the access time of AVL trees and B-trees for various memory residency factors of databases [Dew84]. They have also shown that hash based join algorithms are very effective for large main memory database systems. Lehman et al. give perfor- mance results of main memory databases using various types of access methods [Leh86]. In that paper they have also proposed a T tree as an alternative to other tree structures. The T tree is a balanced binary tree with many elements per node. It exploits binary search nature of AVL trees, and good update and storage characteristics of B-trees. Several other issues of main memory databases have also been investigated in [Amm85]. 1.3. Concurrent Dynamic Search Structures There have been several proposals on concurrent dynamic search structures to sup- port parallel processing. Many of them are based on tree structures such as B-trees and binary search trees. For concurrent B-tree operations, the entire subtree of the highest affected node is exclusively locked in [Sam76]. Bayer et al. present the algorithms which Only lock out other writers until the actual modifications must be performed [Bay77]. This approach distributes exclusive locks mostly in lower sections of the B- tree and hence it increases the concurrency. Lehman et al. also described algorithms for concurrent B-tree operations which use only a small number of locks for each tree mani- pulating process [Leh81]. Concurrent search and insertion algorithms for AVL trees and 2-3 trees are given in [E1180-a, E1180-b]. There are also several other work on con- current manipulation of binary search trees [Kun80, Man84]. 1.4 Classification of Database Queries Parallel processing strategy which is appropriate for one database application may not be appropriate for another. For example, the granularity of parallel processing which is suitable for one application may not be so for others. Thus, it may be neces- sary to develop different parallel processing strategies for various types of applications. We classify database queries based on their parallel processing characteristics, as fol- lows : (A1) Single query with multiple hits (A2) Single query with a single hit (A3) Single complex query (A4) Multiple queries accessing the same relation (A5) Multiple queries accessing different relations Examples of queries of type A1 are partial match queries and range queries. Here, intra-query parallel processing is advantageous because a single query requiring many data records can be processed in parallel [Kim88, Pra88-b, Pra88-d]. Rosenau et al. have also applied this type of parallel processing for projection operation on a relation [R0587]. It is rather difficult to exploit parallel processing of type A2 queries because apply- ing parallelism for this type of query may require finer granularity which may result in lower throughput of the system. On the other hand, parallel processing may achieve the lower bound on access time for these types of queries when appropriate software and hardware architecture is used. Achieving and guaranteeing this lower bound are impor- tant for many real-time critical applications. Parallel processing strategies for this type of applications can be found in [Pra86, Pra87, Pra88-a, Pra88-c]. Queries of type A3 include join functions, sorting of files, and complex qualifications. Several database machines which use functionally distributed architec- tures have been proposed for this type of queries Hxl85, Sch83]. Intra-query parallel- ism is also advantageous for this type of applications. For A4 and A5 type queries, many independent queries can be processed in paral- lel. Transaction processing is an example for these types. The throughput of the system for these types of applications can be improved by maximizing concurrency among the queries. Multidirectory hashing proposed in [Cev88] has been shown to be effective for certain applications of these types. 1.5. Problem Statement The primary objective of this work is to investigate data distribution strategies for various types of database applications in parallel processing systems. In this thesis we will mainly focus on the data distribution strategies for A1 and A2 type queries because these queries are in fact the most basic types, and data distribution strategies for other types of applications may be developed based on the strategies used in A1 and A2 type queries. For queries of type A1, we will investigate optimal data distribution for multi- key search queries such as partial match retrieval and multiattribute range queries. Since multikey search queries access a set of records, appropriate data distribution is very important to facilitate parallel data accesses. Though much work has been done in the past on designing efficient file structures for this type of applications, data distribu- tion to enhance concurrency has not been considered much. We will investigate optimal data distribution and appropriate data construction for this type of applications. For type A2 queries, we will investigate parallel processing Of tree type indexes for external databases. Since the B-tree is a common storage structure for this type Of applications, node partitioning schemes to exploit parallel data accesses tO B-trees will be investigated. 1.5. Thesis Overview The remainder of this thesis is organized as follows. In chapter 2 we describe high level abstraction Of database processing for parallel processing systems. The objective Of this abstraction is to define a framework which can be used as a basis for more specific implementation. In chapter 3 we present Optimal data distribution for partial match retrieval type queries. The main focus Of this chapter is the development Of new data distribution methods, called Fieldwise eXclusive-or (FX) distribution methods, for maximizing data access concurrency. FX distribution methods use bitwise exclusive-or Operation on the field values which are computed by multikey hashing. We show several useful charac- teristics Of exclusive-or Operation for Optimal file distribution. Field transformation techniques are presented to extend the scope Of Optimality in FX distribution methods. We give sufficient conditions for Optimal distribution by the proposed distribution methods and show that these methods are Optimal for most partial match queries. We describe the performance improvement by FX distribution methods over Other methods proposed earlier. Efficient data construction methods for FX distribution approach are also discussed. In chapter 4 we investigate file distribution problems for parallel processing of multiattribute range queries. Optimal data distribution methods as well as the existence and nonexistence Of perfect Optimality are investigated. It will be shown that there are inherent limitations tO achieve Optimal distribution for various types Of range queries. We give sufficient conditions for the nonexistence Of perfect Optimal distribution for certain types Of range queries. We extend the FX distribution methods for several use- ful multiattribute range queries. Sufficient conditions for Optimal distribution for mul- tiattribute range queries will be described. It will be shown that the proposed data dis— tribution methods are Optimal for a large class Of multiattribute range queries. In chapter 5 we investigate the performance of various parallel processing methods for B-trees. The main focus Of this chapter is a node partitioning scheme for B-trees to exploit parallel data accesses. The proposed B-tree node partitioning scheme is based on synchronized disks. Parallel processing Of partitioned node B-trees on asynchronous disks are also discussed. We also show the performance improvement Of partitioned node B-Uees over conventional B-trees. Chapter 6 contains concluding remarks. CHAPTER 2 DATABASE PROCESSING MODELS FOR PARALLEL PROCESSING In this chapter we describe high level abstraction of database processing for paral- lel processing systems. The Objective of this abstraction is to define a framework which can be the basis of more specific implementation. 2.1. General Parallel Processing Model for Database Systems We propose an abstract model for parallel processing of database systems. This is shown in Figure 2.1. The model is based on distributing data and access structures to enhance concurrency. In the figure, Q5,S represent a set Of parallel access nodes. These can be main memory modules or disks, depending on the parallel processing environment. Ai’s denote a set of access structures. As shown in the figure, there are three mappings that are important for concurrent processing. They are (1) storage allo- cation for data (2) storage allocation for access structures and (3) key to access structure mapping. Storage allocation for data, and storage allocation for access structures determine the amount of physical access concurrency among the nodes. In practice, these issues should be considered based on the architectural types Of parallel processing systems. In other words, a number of factors such as the interconnection structure between the pro- cessors and the access nodes may contribute to actual system performance. In this thesis we assume that the parallel access nodes are logically single shared memory. DATA Storage Allocation Key-tO-Access Structure for Data Mapping Storage Allocation for Access Structures A: Figure 2.1. General Parallel Processing Model In the parallel processing system, the speed mismatch between computation and data access time becomes a more serious problem than in the uniprocessor system due to data access conflict if data are not evenly distributed over access nodes. Generally, database applications require many data accesses with relatively less computation, and so degradation of system performanc due to inappropriate storage allocation scheme may be more significant in database applications than in others. In addition to physical memory (or device) access conflict, there are other sources of conflict for concurrent accessing of data. This means that even though two data are stored in different access nodes, access concurrency can still be reduced. One such source is the lock contention because the data in the same locking entity cannot be accessed concurrently when that entity is locked. It has been shown in [Cev88] that lock conflict causes more serious overhead than the physical memory access conflict in certain cases. This is because each resolution for a lock conflict takes at least a lock duration time which is perhaps much longer than one remote memory access time. 10 Each lock duration time may be reduced by using finer locking granularity. However, lock conflict for critical shared variables, if exist, cannot be avoided. Unfortunately, most access structures have critical parts which frequently need to be locked, e.g., root node in the tree type index and directory size dependent variables in dynamic hashing. Based on the above Observation, in order to achieve high concurrency multiple independent access structures need to be constructed for each relation to reduce the amount Of sharing for critical shared variables. By multiple access structures for each relation, we mean that a set of records pertaining to one relation is partitioned horizon- tally and each subset Of partition contributes to each independent access structure. Thus, we need an appropriate mapping from a set of keys to a set of access structures. Key to access structure mapping determines logical access concurrency while storage allocation scheme for data and access structure determines physical access concurrency. For many applications we can simplify the general model such that only a group of data which are stored in the same access node, say Qi, contributes to a unique subset Of access structures, say A j to A, which are also constructed in the same access node. By this approach storage allocation strategy for data and access structures can be treated integratedly, and the complexity of processing models can also be reduced. Two stage parallel processing model shown in Figure 2.2 represents this idea. In the two stage pro- cessing model the first stage corresponds to data storage allocation and the second stage corresponds to key to access structure mapping. The two stage model provides a more systematic design procedure for developing parallel database systems. 2.2. TWO Stage Parallel Processing Model for Database Systems The basic idea Of this model is to partition data mapping into two stages. As shown in the figure, the first stage, H1, is called Data Distribution algorithm and the second stage, H2, is called Data Construction algorithm. ll QM—l Figure 2.2. Two Stage Parallel Processing Model The data distribution algorithm, which is the same as data storage allocation in the general model, determines how the data is appropriately distributed to the parallel access nodes so that maximum concurrency is achieved between the access nodes. The data construction algorithm, on the other hand, determines the appropriate data structure to minimize the access time. It receives data from the data distribution algorithm and then create local access structures such as hashed or indexed files. In general, the following strategies can be employed for data distribution : (B 1) Declustering based on query’s data reference pattern (B2) Random distribution (B3) Objective specific declustering (B4) Clustering based on data reference pattern In method B 1, the data distribution technique takes advantage of the data reference pattern of a query. For example, if a query references numerous records, the strategy may be to distribute the data so that these records are stored uniformly among the nodes. 12 This approach is useful for A1 type applications discussed in chapter 1. In method B2, records are randomly distributed between the nodes. This method is simple, but may not guarantee a good distribution. In the objective specific method, records are allo- cated to optimize certain objective functions. For example, in [Pra86] Pramanik et al. propose a data distribution technique to construct multiple directories for a single rela- tion, where a record is allocated to the node which has the smallest directory size. It has been shown in that paper that this approach gives the minimum total directory size. However, declustering of data may not be always beneficial. For example, if the com- munication cost between access nodes is large, clustering may give better performance than declustering. SO, B4 type strategies may depend on the interconnection network. Data distribution algorithms can be a functional mapping which depends only on data values. It maps a set of data values into a set of nodes. For example, if node addresses are determined by hashed values of input data, the distribution algorithm is a mapping which is independent of time or other system parameters. On the other hand, the distribution algorithm need not be functional. For example, random distribution may map the same record to different nodes at different time. Since data accesses are content-based for most database applications, it is advantageous to make a data distribu- tion algorithm functional depending only on data values. Let D be a set of data and ZM = [0, 1, . . . , M-l} be a set of parallel access nodes. Let a data distribution algorithm be a function from D to ZM. Since actual data are usu- ally unevenly distributed in the data domain, data distribution algorithms are commonly designed based on the hashed values of data which are evenly distributed in hashed address space. Thus, we define data distribution algorithm H1 as a composition of two functions, H 1(1) and H 1(2), such that H 1‘” is a mapping from D to T and H 1(2) is a mapping from T to ZM, where T is the set of hashed values. Figure 2.3 shows two level implementation of H1 for the abstract model in Figure 2.2. This model can be thought of as one class of the two stage model. (2) H1 Q0 H10) DATA - T QM— Figure 2.3. Two Level Implementation Of H1 for The Two Stage Model Let H2 be a hash-based data construction algorithm and LD be a set of entries in all local directories generated by H2 for a given file system. If there exists one-to-One correspondence between T and LD, T is called a real global directory. Otherwise, it is called a virtual global directory. When T is a real global directory, the set of all the local directories can be thought of as a partition of T. H 1(1) is usually static because the use of dynamic hashing for H 1(1) will cause significant overhead due to intemode data movement. However, static hashing scheme for H 1(1) may result in very sparse local directories or long overflow chains. These problems can be avoided by using a virtual global directory, where the actual local directory is determined by H2. When T is a real global directory, the ratio lTl/IDI directly affects the data retrieval time as well as storage utilization. On the other hand, it is more flexible that T is used as a virtual global directory. The comparison Of these two approaches will be described in more detail in chapter 3. Functional distribution, and real/virtual global directory con- cepts are used for parallel processing of partial match queries and range queries presented in chapter 3 and 4. CHAPTER 3 PARALLEL PROCESSING STRATEGIES FOR PARTIAL MATCH QUERIES 3.1. Introduction Many information processing systems involve the application which accesses the records in a file having the values of a specific set Of attributes in common. A file is a collection of records each of which is defined as an ordered n-tuple (r0, . . . , r,,_1) of n values which are the keys or attributes of the records. When a query is allowed to specify conditions on more than one attributes, the search performed for this query is called multikey searching, or associative searching. Multikey searching is used in queries such as partial match retrieval and multiattribute range queries. A partial match query is a query where some of the attributes are specified and the rest of them are unspecified. For example, query q = [Age = *, Department = "mathematics", State = "Ohio"] is a partial match query, where * denotes don’t care condition. Since each attri- bute in a file may or may not be specified in a partial match query, a set of records need to be retrieved. When a file is constructed on parallel devices, it is important to store these records to maximize concurrency. In this chapter we investigate parallel process- ing strategies for partial match queries. Optimal data distribution methods and appropriate data construction methods are described. It has been shown in [Rot74] that multikey hashing is effective for partial match retrieval type applications. Multikey hash function, H, for a file consisting of 11 fields is an ordered n functions such that given a record r = , H(r) = . H(r) is called a bucket. Rivest [Riv76] and Rothnie, et a1. [Rot74] have independently proposed the use Of multikey hashing, as an alternative to 14 15 inverted files, to reduce the total search time for partial match retrieval type queries. The design of multikey hash functions was considered in [Bur76]. The determination of each field size for minimum search time based on query statistics was also investi- gated by [Aho79, B0179]. In [4] it has been shown that the problem of finding the optimal field sizes for multikey hashing scheme is NP-hard. The main focus Of those research is on minimizing the total number of bucket accesses. Our objective in this chapter is to achieve maximum parallelism by distributing buckets in multikey hashing. There are a few heuristic methods for distributing data in partial match retrieval type queries. Du, et al. have proposed Disk Modulo (DM) distribution method [DuS82]. The DM distribution method is simple and elegant but does not work well in many cases. For example, it may not give Optimal distribution if some of the field sizes are less than the given number of devices. SO, for a large number of parallel access nodes, the DM distribution method may not be appropriate. Generalized Disk Modulo (GDM) method has also been proposed in [DuSSZ] to overcome this problem. This method gives a sufficient condition to achieve optimal distribution. However, no gen- eral method has been given to find the optimal distribution parameters. In fact, the problem of finding the optimal parameter values could be very complex [DuS82]. Since DM distribution method does not work well for binary cartesian product file (binary cartesian product file is a cartesian product file in which each field contains only two elements), several heuristics have been proposed by [Du86, Sun85] to distribute buckets in binary cartesian product files. These heuristics are also special cases of the GDM distribution method. Several useful properties of these modulo based distribution methods have also been given in [Sun87]. Other approaches such as data distribution methods based on minimal spanning trees and short spanning paths have been proposed in [Fan86]. We propose Fieldwise eXclusive-or (FX) distribution methods which give better performance for a wider range of partial match queries than existing methods. The main l6 idea of FX distribution methods is the use Of bitwise exclusive-or Operation on the field values which are computed by multikey hashing. Here, we show several useful charac— teristics of exclusive-or operation for Optimal data distribution. Field transformation techniques have been used to increase the scope Of Optimality in FX distribution methods. The rest Of this chapter is organized as follows. In section 3.2, we describe definitions and terminology. In section 3.3, we define Basic FX distribution method and its optimality conditions. In section 3.4, we present field transformation techniques to increase the scope of Optimality over the Basic FX distribution method. The extended optimality conditions for these field transformation schemes are also described. In section 3.5, we compare the performance of FX distribution methods with those of the other distribution methods proposed in the past. We discuss efficient data construction methods in section 3.6. 3.2. Definitions and Terminology Before describing FX distribution method, it is necessary to introduce some nota- tions as well as relevant definitions and assumptions. Definition 3 .1 . (a) f,- = {0, l, , Fi—l }, a set of hashed values of field i by the i-th hashing function in multikey hashing. (b) F ,- denotes lfi I . (c) M denotes the number of parallel devices. ((1) N is the set of all natural numbers including 0. (e) ZM is the set of all integers from 0 to M-l. (f) (am_1 a0)3 is a binary notation of an integer, where a,- is a binary digit. 17 I f,- | is assumed to be a power of 2 which is common for hash directory files for parti- tioned hashing schemes [Aho79]. The number of devices M is also assumed to be a power of 2. ZM will be frequently used to denote the set of devices. Definition 32. Let f 1 xfzx xf,, be the set of all buckets, where x denotes the carte- sian product of sets. When the given number of devices is M, data distribution method is a function fromflezx xf,l to ZM. Definition 3 .3 . Let R(q) be the set of buckets which satisfy qualifications for a partial match query q. The distribution method is called strict optimal for a partial match query in a given file system if each device has no more than [IR (q) l/M] number of buckets. Definition 3.4. When the distribution method is strict optimal for all possible partial match queries in a given file system, it is called perfect optimal for that file system. Definition 3 .5. The distribution method is called k-optimal, 0 S k S n, for a given file system consisting of n fields, when it is strict optimal for all partial match queries which have exactly k unspecified fields. Thus, the distribution method is perfect optimal, if it is k-optimal for all k = 0, , n. Note that some authors exclude cases where the number of unspecified fields is 0 (i.e., exact match) and the number of unspecified fields is n (i.e., retrieval Of whole file) from partial match queries. Definition 3 .6. [+] denotes exclusive-or operation between two bits. We will use the same notation [+] to denote exclusive-or Operation between integers and sets of integers as follows. When X = (am_1 a0)3 and Y = (b,,,_1 b0)3 are two integers, X [+] Y = (am_1 [+] bm_1 a0 [+] b0)3. IfX is an integer and Y = {y1, , yL} is a set of integers, X [+] Y is defined as {X [+] y,- | y,- e Y} . If both X = {x1, ,xK} and Y = {y1, ,yL} are sets of integers, X [+] Y is defined as {x; [+] yj l x,- e X, yj e Y} 18 Forexample, ifX1 =2and Y1 =3, thenX1 [+] Y1 = 1. Isz =2and Y2 = {0, 1,2,3}, thean [+] Y2 = {0, 1,2, 3}. Definition 3.7. [EKYD = Y1 [+] Y2 [+] Y3 - ' - [+] Y". II Note that [+] Operator is associative and [+] is a shorthand notation for performing i=1 exclusive-or operation between sets of integers Y1, Y2, , Y,,. We assume that precedence of multiplicative Operator * is higher than that Of exclusive-or operator [+], which means that in the absence of parentheses, multiplication is done before exclusive-or operation. But the precedence of [+] is assumed to be higher than that of + (addition). We will leave out the multiplication operator * when- ever there is no ambiguity (e.g., AB instead of A*B). When d is a power of 2, and J1, J 2 e N, we will use the following relations impli- citly between exclusive-or Operator and arithmetic Operators + and *. (1) IfJ2 ZM as a function such that TM(l) = (am_1 a0)3. Thus, function TM returns only the rightmost logzM bits of domain values. Since we can add arbitrarily large number of 0’s to the left of the given binary number of an 19 integer 1, function T M is always defined for any power of 2 integer M. It is also easy to see that for any Jl , 12 e N, TM(JI [+] 12) = TM(TM(JI) [+] 12) = TM(J1) [+] TM(JZ). 3.3. Basic FX Distribution Method Let f 1 xfzx xfn be a set of all buckets. The Basic FX distribution method allo- n cates bucket <11, , Jn> into device TM [[flUfi], where Jj e f,- forj = l, , n. 1: Example 3.1. Figure 3.1 shows the bucket distribution by the Basic FX distribution method, where f1 ={0,1},f2 = {0, 1, 2, 3, 4, 5, 6, 7} and M = 4. In the figure, binary numbers are used for field values and decimal numbers are used for Device No. (This convention will be used in all the examples of FX distribution). Here, Device N0 = TM [J 1 [+] J 2], where J 1 6 f1, J 2 6 f2 and TM returns the rightmost two bits of the result of J 1 [+] J 2. f1 f2 Device No 000 000 0 000 001 l 000 010 2 000 011 3 000 100 0 000 101 1 000 110 2 000 111 3 001 000 1 001 001 0 001 010 3 001 011 2 001 100 1 001 101 0 001 110 3 001 111 2 Figure 3.1. Basic FX Distribution 20 As shown in Figure 3.1, the Basic FX distribution method is strict optimal for any partial match query in this file system. For example, if f1 value is (001)}; and f2 is unspecified, then we have to access eight buckets <(001)B,(000)3>, , <(001)B,(111)B>. Since each device has two qualified buckets for this partial match query, the Basic FX distribution method is strict Optimal for this query. Lemma 3.1. ZM is a set which contains M different nonnegative integers from 0 to M- 1. Let k be some integer such that 0 S k 5 M-1. Then ZM [+] k = ZM. Proof: Clearly, for any nonnegative integer Aa which is less than M, 0 S k [+] A, 3 M—1. Thus, it is sufficient to show that for any two different elements of ZM, the results of exclusive-or with k are different. Let A0 = (a,,,_la,,,_2 - ~ . a0)3 and Ab = (b,,,_1b,,,_2 . - - b0)3, where m = logzM and at least one a,- ¢ bi, i.e., Aa¢ Ab. Let Aa [+] k = (c,,,_1 --'c0)B and Ab [+] k = (d,,,_1 . ' «10)3. Assume Aa [+] k = Ab [+] k. Then c,- = dj for all j = 0,, m—l. Since cj = d,- implies d} = bj, it follows that aj = b,- for allj = O, , m-l. This contradicts the assumption of Ad ¢ Ab. Since 0 S k [+] A,- S M—l for all A,- e ZM and k [+] A, at k [+] Ab for any two different A, and Ab in ZM, ZM [+] k is the same as ZM. 1:] Example 3.2. LetZs = {0,1, 2, 3, 4,5, 6, 7} and k = 3. Then Zg [+] k = {3, 2, l, 0,7, 6, 5, 4} = 23. In the proof of Lemma 3.1, it is shown that for any two different nonnegative integers I, J and any nonnegative integer k, I [+] k is different from J [+] k. We call this XOR uniqueness property. We can observe that the property described in Lemma 3.1 (this also implies XOR uniqueness property) is very useful for optimal file distribution of partial match retrieval type applications. This property is fundamental, and will be used in various places Of this chapter to develop further techniques for Optimal distribution. 21 In the following Theorem 3.1 and Theorem 3.2 we will describe the types of partial match queries for which the Basic FX distribution method gives Optimal distribution. Theorem 3.1. The Basic FX distribution method is always O-optimal and l-Optimal. Proof: (1) O—Optimal : This is trivially true. (2) l-optimal : Let only one field i be unspecified and T M [[J:}(J 1)] = h, where J j is the specified value of field j. Thus, h gives the projection of the rightmost logzM bits of the value obtained by the exclusive-or of all the specified values of the query. There are two cases, F,- SM and F,- > M. When F,- S M, for all I e f,~, TM(I) [+] h is different from each other by XOR uniqueness property. Therefore, the distribution is optimal. (Note that TM(A [+13 ) = TM(A) [+] TM(B ) = TM(A [+] Tu(B)) = TM(TM(A) [+] TM(B)))- When F; > M, let F,- = A*M. By Lemma 3.1 1‘, [+] h = ZAIM. Since #{ae ZArM ITM(Ot) = z} = A for any 2 e ZM, the Basic FX distribution method allocates A number of buckets to each device. (Note that # denotes the cardinality of a set.) Therefore, the distribution is Optimal. El Theorem 3.1 says that the Basic FX distribution method is strict optimal for any partial match query in which the number of unspecified fields is O or 1. Note that in the above expression, #{Ote ZArM ITM(0t) = z} = A, A denotes the number of qualified buckets that correspond to a particular device 2. Theorem 3.2. For any partial match query which has two or more unspecified fields, the Basic FX distribution method is strict optimal, if there exists at least one unspecified field i such that F ,- 2 M. Proof: For partial match query q, let q(f) = {i1, i 2, , ik} be the set of unspecified fields in which the size of at least one Of these fields is greater than or equal to M. Without loss of eneralit , assume F - 2M, i.e., F ~ = A- *M for some nonne ative g y ll ‘1 ll g integer Ai,- Let h = TM [lel-2f)“ 1)], where J 1- denotes the specified value of field j 22 which is not in q(f). Thus, h is the projection of the rightmost logzM bits of the value Obtained by the exclusive-or of all the specified field values of the query. By Lemma 3.1, h [+] fl, = ZA,‘*M and hence #{Jilefil | TM(h [+] 11",) =2} = Al, for all z e ZM. Here, we have done exclusive-or of h with the set of values of the unspecified field i 1. At, gives the number of unique values of the unspecified field i 1 that correspond to a particular value 2. Now we will exclusive—or the set of values h [+] ft, with a value of the unspecified field i2. Let J; G fiz. By Lemma 3.1 #{Jilefi1 | TM(h [+]J,-1 [+]J,~2)=z} = At, for all z e ZM. Note that this will not change the number of unique values of field i1 that correspond to a particular 2. We also used the property, TM(h [+] J}! [+] J52) = TM(h [+] J11 [+] TM(J,~2)). Thus, #{(J,~l , Ji2)e fi, xf,2 I TM(h [+]];1 [+]J,-2)=z} = AilFi2 for all z e ZM. Here, the number of unique values for each 2 is more by a factor F [2 because the size of the domain has increased by a factor F ,2. By continuing this argument, #{(J,-l, ' - - Jik)e k k k filx xfik {TM h [+] [+}(J,'P) =2} =Aill—IFL', = (HE-p) / M for all z e ZM. P: p= p=l CI Note that Theorem 3.1 works for partial match queries with zero or one unspecified field while Theorem 3.2 applies to partial match queries with more than one unspecified fields. Corollary 3.1. When all field sizes are no less than the given number of devices M, the Basic FX distribution method is perfect optimal. Proof: This is a direct consequence of Theorem 3.2. CI Theorem 3.1 and 3.2 show general characteristics of exclusive-or operation for optimal distribution. However, the Basic FX distribution method does not give optimal distribution for partial match queries with 2 or more unspecified fields, when the size Of none of the unspecified fields is greater than or equal to M. For example, when M = 16 23 and all others are the same as in Example 3.1, the distribution is not optimal if both fields are unspecified. Proposition 3.1 gives the conditions for Optimal distribution for these cases. Proposition 3.1. Let q(f) = {i1, i 2, , ik} be the set of unspecified fields for partial match query q, where F j < M, for all j e q(f). FX distribution is strict Optimal for par- tial match query q, if there exist a set of fields {[1, , ij} ; q(f) such that j Ifilx >< xfijl TM [[+}(Jip)]=z} = p: lfilx xfijl/Mforallze ZM. Proof: Let FilX ' ' ' >',Jj 6 fl forj = 1, , n, into device TM , where [+](Xj(Jj)) j=l i) if Ifjl 2 M, X j is the identity function, ii) if IfJ-l < M, X 1' is an element of set of injective (one-to—one) functions whose domains are fl- and ranges are ZM. X j- is called a field transformation function. When X j is the identity function for all j = 1, ..., n, Extended FX distribution methods reduce to the Basic FX distribution method. It is easy to see that all the 25 lemmas and theorems that hold for the Basic FX distribution method also hold for Extended FX distribution methods. By the above definition, a field transformation function can be any one-tO-one func- tion. However, we are only interested in the field transformation functions by which we can achieve strict Optimal distribution for various types of partial match queries for which the Basic FX distribution method does not give strict optimal distribution. Since the fields whose sizes are no less than the given number of devices M, do not cause any problem (whether it is specified or not), in this section we will focus only on the fields whose sizes are less than M. From now on, we will simply call FX distribution methods instead of Extended FX distribution methods. In the following subsections we describe field transformation functions which are used for the fields whose sizes are less than the given number of devices. 3.4.1. Field Transformation Functions for Partial Match Queries The field transformation functions developed for FX distribution methods are as follows. Definition 3.9. Letfi = {0, . . . , Fl—l}, and let If, I and Marc power of 2. (l) I z N —> N is an identity function. (2) When Ifil ZM is a function such that x IUXM’ 'fi' (1) =1 [+] [,[c+}ldkM’ '4'] where dkM’ 'f" = M/ If, l". The function IUQI' 'f’ ' is a general notation for functions IUIIW’ 'f’ I , IUIZW‘ If’ I , . . . , etc. For example, functions I UIIW’ If" and 103" W are defined as follows : 26 (a) When I fit < M, IUlM’If’l : f, —> 2,, is a funCtion such that IUIM’ 'f"(l)=1[+]1dM"f",wheredM"f" = M/lf,|. (b) When I fil2 2,, is a function such that IUZM' 'f"(1)=z[+] 1d,“ 'f" [+]1d2“"f", where dlM’W = M/Ifil, dzM’W = M/IfIIZ. Example 3.3. Letf1={0,1, 2, 3} andfg = {0,1} andM =16. (a) 016.4%): {0. 4. 8. 12}. and IU116'4tn>= {0[+10. 1[+l4. 2[+]8, 3[+]12} = {0, 5. 10, 15}. (b) 111216202) = {0, 1[+]8[+]4} = to, 13} and [(1316202) = {0, 1[+]8[+]4[+]2} = [0. 15}. We have defined basic transformation functions, I, U M' If" , 1U IM’ W, , M,|f,l 1U, which will be used in various combinations for optimal data distribution. For example, for any values of If,- I, lfjl and M, it will be shown that FX distribution methods distribute ordered 2-tuples in [(73) x U M’ W (fj) optimally. When X {Ml’lf‘ ' and XZMZ’If’I functions are applied to fields 1', and j, two transformation methods are said to be the same, if X 1 = X 2. For example, U M 1’ 'f‘ ' M2.If,-I M1,!f,l and U M2,lf,-I 1 s are said to be the same transformation methods. But U and M1,lf}l M2,!fjl [U or IU1 and IUZ are said to be difi’erent transformation methods. Because of notational complexity, we will leave out the superscripts M, l f,| from transformation functions and their parameters whenever there is no ambiguity. It is easy to see that for any proper subset f, of ZM where lf,| is some power of 2, U transformation function satisfies the requirement of field transformation functions (i.e., one-to-one function whose domain is f; and range is ZM). We will show later that I U 1, . . . , I UJr transformation functions also satisfy this requirement. From the definition of U 27 transformation function we see that the transformed domain elements are equally spaced. In other words, when they are linearly ordered, any two adjacent elements have the same distance. Several useful properties of IU1, . . . , IUJr transformation functions will be described in Lemma 3.7, Lemma 3.9, Lemma 3.11 and Lemma 3.12. In the following sections, we will present various combinations of field transforrna— tion functions by which FX distribution methods give optimal distribution. Note that we focus only on the fields whose sizes are less than the given number of devices. 3.4.2. I and U Field Transformation Functions In this section we show in Theorem 3.3 that for any values of F ,-, F j and M, FX dis- tribution methods distribute ordered 2-tuples in I (fi) x U (fj) optimally. Theorem 3.3 uses Lemma 3.2 which is an extended version of Lemma 3.1. Lemma 3.2. For a nonnegative integer L, let L = aw + b, where w is a power of 2, and a, b are some nonnegative integers such that 0 S b S w-l. When Z, = {0, 1, , w-l }, Zw [+] L = {aw, aw+1, , (a+1)w-1}. Proof: Let (1”,-1 - - - lk lk_1 - ~ - [0)3 be a binary notation of L, where [k has weight w. Then, (1”,-1 [k)B = a. All elements in ZW has the form (0 ...0 bk_1 ' ~ ' b0)3. When L [+] or for any or e Zw, the result should have the form (l,,,_1 - - ~ [k ck_1 .. ° c0)3 which is aw + (ck_1 c0)B, where (ck_1 co)3 is equal to (lk_1 10)B [+] (bk_1 - . - b0)3. Thus the proof follows by Lemma 3.1. E] Theorem 3.3. When there are only two fields i, j whose sizes are less than the given number of devices M, the FX distribution method with I ()2) and U (fl) is perfect optimal. Proof: There are two cases, i.e., FiFj < M and FiFj 2 M. Let dj = M/Fj. (case 1) FiFj < M (i.e., dj >Fi) I(f,-) [+] U(fj) = {0, 1,, Fi-1}U{dj, dj+1, , dj+F,--1} U {M-dj, M-dj+1, ..., ‘1! 28 M-dJ-+F,--1} by Lemma 3.2. In the right hand side all sets are disjoint and largest ele- ment is less than M because dj > F}. So, for all z e ZM, #{(J,-,Jj) e fixfj | I(J,-) [+] U(Jj) = 2} S 1. Therefore, it is 2-optimal. 0 and l-optimal come from Theorem 3.1. (case 2) FiFJ-ZM (i.e., dj SF,-) Let FiFj =A*M (i.e., dj = FilA). In order to be 2-optimal, for all z e ZM, #{(J;,Jj) e fixfj l 10,-) [+] U(Jj) = 2} should be equal to A. (1) 0 S U(Jj) S (A-1)dj For each U(Jj) in this range, I(f,-) [+] U(Jj) = {0, 1,, Fi-l} by Lemma 3.2. Let this set be So. Since there are A number of such U (J j) in this range, for all s 6 So #{(J,-,Jj) e fixfj l OSU(Jj)S(A—l)dj, [(J,) [+] U(Jj) :3} = A. (2) Adj S U(Jj-) S (2A-l)dj For each U(Jj) in this range, I(f,-) [+] U(Jj) = {F}, Fi+1, --- , 2F,-—l} by Lemma 3.2. Let this set be SI. Since there are A number of such U(Jj) in this range, for all s e 51 #{(J,-,Jj) e fixfj | AdjSU(Jj)S(2A —1)dj, l(J,-) [+] U(Jj)=s} =A. (M/Fi) (M/F;—1)Adj S U(Jj) S M-dj For each U(Jj) in this range, [(fi) [+] U(Jj) = {(M/F;—1)F,-, (M/Fi—1)F,~+1, . M—l}. Let this set be SM/F,—1- Since there are A number of such U (J j) in this range, for all s e SM,p‘._1, #{(J,-,Jj)e flxfj | (M/Fi—1)AdjSU(Jj)SM-dj, I(J,~) [+] U(Jj)=s} = A. M/F,—1 Since U Sp = ZM and there are A repetitions for each element in ZM through 1 (ft) [+] p=0 U(Jj), Jj = O, - - - , Fj—l, it is 2—optimal. 0 and 1-optimal come from Theorem 3.1. C] Example 3.4. Letfl = {0, 1, 2, 3},f2 = {0, l, 2, 3} and M =16. Figure 3.2 shows the bucket distribution by FX and DM distribution methods. Note that I (f 1) = [0, l, 2, 3} 29 and U(fz) = {0, 4, 8, 12} are I and U transformed values of f1, f2 and denoted by binary numbers. Here, Device N0 = TM(I (J 1) [+] U (12)) for FX distribution methods, and Device N0 = (J 1 +12) mod 16 for the DM distribution method, where J 1 6 f1, J 2 6 f2. f1 f2 I (f 1) U (f 2) Device No (FX) Device No (DM) 0 0 0000 0000 0 0 0 1 0000 0100 4 l 0 2 0000 1000 8 2 0 3 0000 1100 12 3 1 0 0001 0000 l 1 1 1 0001 0100 5 2 l 2 0001 1000 9 3 1 3 0001 1100 13 4 2 0 0010 0000 2 2 2 1 0010 0100 6 3 2 2 0010 1000 10 4 2 3 0010 1100 14 5 3 0 0011 0000 3 3 3 l 0011 0100 7 4 3 2 0011 1000 11 5 3 3 0011 1100 15 6 Figure 3.2. FX Distribution with I And U Transformation The FX distribution in Figure 3.2 is perfect optimal. But in DM distribution, the distribution is skewed. The GDM method can also give optimal distribution by multi- plying 3 to the first field values and by 4 to the second field values. However, these parameters should be found by trial and error. On the other hand, FX distribution tech- niques give a specific method. 3.4.3. I and IUx Field Transformation Functions In this section we show in Theorem 3.4 that for any values of F ,-, F k, and the given number of devices M, FX distribution methods distribute ordered 2—tuples in I ()2) x IUx(fk) optimally. Lemma 3.3 shows IUX transformation functions are injective. 30 Lemma 3.7 shows a useful property of IUX transformation functions which is used to prove Theorem 3.4. Lemma 3.7 is derived based on Lemma 3.4, 5.3 and 5.4. Lemma 3.3. Let fk = {0, , Fk-l} such that F k" < M for some positive integer x. Then, IUx(fk) is a set of Fk elements between 0 and M-1 (i.e., IU,(fk) is an injective function). Proof: (case 1 ) x = 1. Assume otherwise. Let d1 = M /F k. Then there exist two different elements K 1,K2 e fk such that K1 [+]K1d1= K2 [+] K2d1. Let K’ = K1 [+] K2. (Note that for any two different nonnegative integers K 1 and K 2 there always exists positive integer K’ such that K’ = K1 [+] K2). By substituting K1 [+] K’ for K2 we have K1d1 = K’ [+] K2d1. Let K1 = (0 O a, a0)3, K2 = (O 0 b5 b0)3 and K’ = (O 0 e1e0)3, where r, s, t denote leftmost bit positions in which the binary value is "1" in K1 , K 2, K’ , respec— tively. This implies a, at b, because K ’ = K 1 [+] K 2. Since d 1 is a power of 2, d 1 = 26 for some positive integer c. Then, K1d1 = (O O a,+c a1+c ac 0)B, K2d1 = (0 0 bs+c b1+c bc 0)B, where a1+c at b,+c. Since e1+c is zero, a1+c at e,+C [+] b1+c. This contradicts K 1 d 1 = K’ [+] K 2d 1. (case 2) x > 1 Assume otherwise. Then there exist two different elements K 1, K 2 e fk such that (K1[+]K1d1[+]...[+]K1dx)=(K2[+]K2d1[+]...[+]K2dx), where d1=M/Fki, i = l, . . . , x. Since K1d1 and K 2d1 are the only ones which can have binary value "1" between bit position logzd 1 and logZM (we considered the rightmost bit position to be zero), the only way the above equation can be satisfied is K 1 = K 2. Therefore, contrad- iction. El Lemma 3.4. When there are only two fields i, k whose sizes are less than the given number of devices M and F ,- 2 F k, the FX distribution method with I (f1) and I U 1 (fk) is perfect optimal. 31 Proof: There are two cases, i.e., F1Fk < M and EB, 2M. Let dk = M/F;C and d,- = M/F1. (case 1) F1-F,c < M (i.e., dk >F1) For all K e fk, 1(fi) [+] K [+] de={0, 1, Fi_1}[+]de = {de, de-i-l, ' ‘ ' , de'i-Fi—l] The first equality holds by Lemma 3.1, and the second equality holds by Lemma 3.2. Clearly, no element in ZM is repeated through I (f1) [+] IU1(K), K=O, F k—l because dk > F1. (case 2) FiFk 2 M (i.e., Ci}; 5 F1) Let FiFk =AM. Then, F1=Adk and Adidk =M. Let K6 fk i) OSK F,- 2 Pk and FiF,‘ =M (we can always find such F,- and M). For such M, F,- and Fk, I (f1) [+] IU1(fk) gives perfect optimal distribution by Lemma 3.4. For all such F k, by Lemma 3.5 there is exactly one element of IU1(fk) at each interval from O to M with interval size M /F k. Thus the lemma follows C] For example, whenf1={0, 1, 2, 3},f2 = {0, l, . . .7} and M: 16,1U1(f1) = {0, 5,10, 15} and IU1(f2) = {0, 3, 6, 5, 12, 15, 10, 9}. We can see that there is exactly one ele- ment of [U 1 (f1) and I U 2 (fz) at each interval with size 4 and 2, respectively. Lemma 3.7. Let f; = {0, , F j—l }, where F 1.16 < M for some positive integer x. Then, there is exactly one element of IUx(fj-) at each interval from O to M with interval size M/Fj. Proof: When x =1, this is true by Lemma 3.6. Whenx 2 2, let d,- =M/F-i, i = 1, . . ., x. Since (Jd2 [+] . . . [+] de) < d1 for all I e fj, the proof is clear. [3 Theorem 3.4. When there are only two fields i, j and the given number of devices is M such that F ,- < M and F j" < M, for some positive integer x, the FX distribution method with 1 (f1) and I Ux(fj) is perfect optimal. Proof: Let d 1=M/Fj. (case 1) F117} S M (i.e., d1 2F1) By Lemma 3.7, there is exactly one element of IUx(fj) at each interval from O to M with interval size d1. Thus, by Lemma 3.2 I (f1) [+] IUx(fj) is a set of all different F1F j ele— ments between 0 and M-1. 34 (case 2) F1FI- > M (i.e., d1 d1 , let K 1 mod d = 1|! 37 K2 mod d = (aa_2 a0)3. Then, (K1[+]K1d1) mod d = (aa_2[+]aa_2_g aa_3[+]aa_3_13 - - ' a0)3. Clearly, (K2 [+] K2d1) mod d is also (act-21+]act—2—s aa—3[+laa—3—B ' ‘ ‘ 00hr. If : Assume otherwise. Then there exist K 1, K 2 such that (K 1 [+] K 1d1) mod (1 = (K2 [+]K2d1) mod d, and K1 mod d at K; mod d. Let K1 mod d = (a01_2 -~ a0)3 and K2 mod d = (ba_2 - - . bo)3. Let i be the rightmost bit position such that a,- ¢ b1. When i < B, clearly, contradiction. When i 2 B, a1[+]a1_15 = b1[+]b,-_B because (K1 [+] K1d1) mod d = (K2 [+] K2d1) mod d by the assumption. Since i is the right- most bit position such that a; at b1, this implies that a,-_13 = b1_13. Therefore, a contradic- tion. (2) Induction Step. Suppose this is true for x = p. We want to show that the lemma is also true for x = p + 1. (Note that we consider only the case where F1” H < M). Let (K1[+]K1d1[+]...[+]K1dp)=L1, and (K2 [+] K2d1 [+] . . . [+] szp)=L2. Only If : K1dp+1 mod d is equal to szp+1 mod d. (It is easy to see that K1 mod d= K2 mod d if and only if K1d’ mod d = sz’ mod d for any (1’ which is a power of 2.) Since L1 mod d = L2 mod d by induction hypothesis, (L1 [+] K 1dp+1) mod d should be equal to (L2 [+] K 2dp+1) mod d. Note that when d is a power of 2, (L1 [+] K1dp+1) mod d: (L1 mod d) [+] (K1dp+1 mod d). If : Assume otherwise. Then there exist K 1, K 2 such that (K1[+]Kidil+] ' ' ' [+]Kidp+i) mod d = (K2[+1K2d1[+] ‘ ' ' [+]K2dp+1)m0d 4’ and K1 mod d at K2 mod d. Let 0t =1og2d and B = logzdp+1. Let K1 mod (1 = (aa_2 ' - - a0)3 and K2 mod d = (b 13-2 - - - b0)3. Let i be the rightmost bit position such that a,- ¢ b1. The remainder of the proof is similar to that of case 1, i.e., x = 1. This completes the induction. [1 Corollary 3.2. Let f,- = {0, , F j—l} such that F j" < M for some positive integer x. Let S, i = O, . . . , d—l be a subset of IUx(fj) whose residue by modulus d is i, where d is a power of 2 which is less than M. Then, for any such d, ISO! = |S1| = . . . = 38 [511.1 I. Proof: This is a direct consequence of Lemma 3.9. [:1 Theorem 3.5. When there are only two fields j, k and the given number of devices is M such that F j < M and F I“ < M for some positive integer x, the FX distribution method with U (f,) and IUz(fk) is perfect optimal. Proof: Let dj = M/Fj. (case 1) Fij > M (i.e., Fk > dj) Let Fij =AM. Then A =Fk/dj, i.e., Fk =Adj. Let K C- fk. (1) K mod d1: 0 (i.e., K = de for some I 6 fl) By Lemma 3.9, all IUx(K) elements such that K mod dj = 0 has the same residue co by modulus dj. So, all such IUx(K) can be represented as adj + co for some variable or e ij and some fixed nonnegative integer (:0 which is less than d j. Since F k =Adj, there are A number of IUx(K) such that K mod dj = 0. For such IUx(K), U(fj) [+] IUx(K) = U (fj) + co by Lemma 3.8. Let this set be S 0. Then there are A repetitions for each element in So through U(fj) [+] IUX(K), K = 0, d1, ~~ , (A—l)dj. (2) K mod dj =1 (i.e., K =de + 1, for some I e fj) By Lemma 3.9 all IUx(K) elements such that K mod dj = 1 has the same residue c1 by modulus dj. For all such K U(fj) [+] IUX(K) = U(fj) + c1 by Lemma 3.8. Let this set be S 1. Since there are A number of IU,(K) such that K mod d,- = 1, there are A repeti— tions for each element in S1 through U(fj) [+] IUX(K), K =1, dj+1, -°~ , (A—1)a’j+1. (dj-l) K mod dj = dj -1 By Lemma 3.8 and Lemma 3.9, for any IUx(K) such that K mod dj = dj—l, U(f,-) [+] IUx(K) = U(fj) + Cdrl for some cdl._1 which is less than dj. Let this set be $111.1. Then, there are A repetitions for each element in de_1 through U (fj) [+] I Ux(K ), d—l K = dj—l, - . - ,Adj-l. By Lemma 3.9, all c,-’s are different each other. So, (J51 = 1:0 39 ZM. Since there are A repetitions for each element in ZM through U (fj) [+] IUx(K), K = O, - ~ - , F k—l, it is 2-optimal. 0 and l-optimal come from Theorem 3.1. (case 2) Fij < M (i.e., Fk < dj) The proof is almost the same as that of case 1. Here, enumeration ends at F k—l instead of d j—l and there exists only one I Ux(K ) element at each step. El Example 3.7. Letf1 = {0, 1, 2, 3},f2 = {0, 1, 2, 3} and M =16. Figure 3.5 shows the FX distribution with U(f1) and IU1(fz). Here, U(f1): {0,4, 8, 12},IU1(fz)= {0,5, 10,15} and Device N0 = TM(U(J1) [+] IU1(12)), 116 f1, J2 6 f2. We can see that the distribution of Figure 3.5 is perfect optimal. U(fl) IU1(fZ) Device No 0000 0000 0 0000 0101 5 0000 1010 10 0000 1111 15 0100 0000 4 0100 0101 1 0100 1010 14 0100 1111 11 1000 0000 8 1000 0101 13 1000 1010 2 1000 1111 7 1100 0000 12 1100 0101 9 1100 1010 6 1100 1111 3 Figure 3.5. FX Distribution with U And [U 1 Transformation Example 3.8. Let f1 = {0, 1, 2, 3, 4, 5, 6, 7},f2 = {0, 1} and M = 16. Figure 3.6 shows the FX distribution with U(f1) and IU2(fz). Here, U(f1) = {0, 2, 4, 6, 8, 10, 12, 14}. 1U2(fz) = {0, 13} and Device N0 = TM(U(Jr) [+] “1202», Jr 6 f1, 12 6 f2. We can verify that the distribution of Figure 3.6 is perfect optimal. 40 U(fl) [U2(f2) DeviceNo 0000 0000 0 0000 1101 13 0010 0000 2 0010 1101 15 0100 0000 4 0100 1101 9 0110 0000 6 0110 1101 11 1000 0000 8 1000 1101 5 1010 0000 10 1010 1101 7 1100 0000 12 1100 1101 1 1110 0000 14 1110 1101 3 Figure 3.6. FX Distribution with U And IU2 Transformation 3.4.5. IU1 , IU2 , . . . , IUx Field Transformation Functions In this section we will show that the set of IU1 transformation functions can give optimal distribution for the case when there are several fields whose sizes are much less than the given number of devices. Definition 3.12. Let and <11,...,l,,>betwo ordered n-tuples. Then, =<[1,...,ln>,lfk1=11,...,kn=ln. Otherwise,¢. Lemma 3.10. When there are only two fields i 1, i 2 such that F12 2 F11 and F122 < M for the given number of devices M, the FX distribution method with IU 1(f1-1) and [U 2(f1-2) is perfect optimal. Proof: Assume otherwise. Then, there exist {J11,J12} ; f11, {121,122} ; fig such that <111er> ¢ 41211222 and (J11[+1111d11)[+1(121[+11216121[+]er6122) = (112[+]Jrzdlr)[+](122[+1122d21[+1122d22), Where 6111 = M/Fila 6121 = M/Fi2, 6122 41 (121/F12. (Note that F11F12 < M and d11 2 d21.) The above equation can be rewritten as [(111 [+]-[12M 11 [+1021 [+1122)d21] [+] [(er [+]J22)d22[+1111 [+11 121+11211+11221 = 0. Since the second term (delimited by bracket) is less than d 21 , and the first term does not have binary value "1" between bit position 0 and log2d21 - 1, the only way the above equation can be satisfied is that each term (delimited by bracket) should be equal to zero. Thus, we have following two equations. (J11[+1112)d11[+1(121[+1122)d21 = 0 (12111‘1122)€1221+1(111[+]J 121+11211+1122) = 0 (case 1) d11=d21 Then, J11[+]J12[+]Jz1[+]J22 = 0 from the first equation. So, from the second equation 121 is equal to J22 which results in J11 = J12. This contradicts ¢ <121,J22>. (case 2) d11 > d21 Let d 11 = pd 21, where p is a power of 2. From the first equation, J 21 [+]J 22 = 12 (J 11 [+]J 12). Substituting this in the second equation, (1111+1112)Pd22[+1(111[+1112)[+](111[+1112)P = 0- Thus, the only way this equation can be satisfied is J 11 = J 12 which implies 121 = J 22. Therefore, contradiction. C1 Definition 3.13. We define LMB(J), J e N, to be the leftmost bit position whose value is "l" in the binary notation of J, and define RMB(J), J e N, to be the rightmost bit posi— tion whose value is "1" in the binary notation of J. When J = 0, we define LMB(1) = RMB(J)=-1. For example, LMB(1) = O, LMB(3) = 1, LMB(10)= 3, RMB(1)= 0, RMB(3) = 0, and RMB(lO) = 1. Here, we considered the rightmost bit position to be the bit position ZCI'O. Theorem 3.6. When there are only 1 fields, i 1, i2, il such that F11 .>_ F1(1_1) 2 2 F11 and F111 < M for the given number of devices M, the FX distribution method with IU1(f,~1), , IU1(f,-1) is perfect optimal. 42 Proof: The proof is an induction on 1. (1) When I: 2, this is true by Lemma 3.10. (2) Suppose this is true until I = k—l. We want to show that this is true for l = k. For convenience, let these k fields be 1,2, , k. (Note that we only consider the case where sz 21:1, 11",," < M, and so F1F2...Fk < M.) Let F1, =p,,_1F,,_1 = pk-1pk_2Fk_2 = . .. =pk-1pk-2 . - °p2F2 =pk_1pk_2 -~p2p1F1, where each pi, i = 1, , k-l, is either 1 or some power of 2. Let d11 = M/F1, and d21 = M/Fz, dz2 = M0722, and . . . , and d11 = M/Fk, . . . , d1,}, = M/Fk". Then, dll=P1d21=P1P2d31=P1P2P3d41=---=p1P2°°'Pk—2d(k—l)l =P1P2 "'Pk—rdkr 6122 =P22432 =P22P32442 =- - -=P22P32~-Pk-22d(k—1)2 =P22P32---Pk—12dk2 433 =P33d43 =P33P43dss =. - - =P33P43---Pk—23d(k-1)3 =P33P43---Pk—13dk3 d (k —2)(k-2) = Pk-zk-Zd (k—1)(k —2) = Pk-Zk-ZPk—l k_2dk(k—2) d(k-1)(Ic-1) =Pk—1k_l deC—l) Assume otherwise, i.e., there exist {J11,J12} ; f1, . . . , [Jk1,Jk2} ; fk such that d11, . . . ,Jk1>¢<]12, . . . , Jk2>and (111[+1111d11)[+](121[+1121421[+1121d22)[+1 . . - [+](Jk1dk1[+] ---1+1Jk1dkk)= (1121+11126111)[+](122[+1122421[+]Jzzd22)[+] - - . [+] (szdk2[+] - . . [+1Jk2dkk) Here, Jkl it La. This is because if Jkl = sz, then ¢ and hence the equality cannot be satisfied by the induction hypothesis and XOR uniqueness property. The above equality can be rewritten as [(111[+1712)d11[+](121[+]J22)421[+] . . - [+1 (J(k—1)1[+]J (k-1)2)d(k—1)1[+](Jkrl+l/kz)dk1][+] [(121 [+1122)dzz [+] ~ . . [+](J(k-1)1 [+1J(k—1)2)d(k-1)2 [+](Jk1 [+1Jk2)dkzl [+1 43 [(J(It-2)1 [+11 (k—2)2)d(k-2)(k-2) [+] (J(k-1)1 1+1J(k—1)2)d(k—l)(k—2) [+1 (111 [+1Jk2)dk(k—2)1 [+] [(J(k—1)l [+] J(k-l)2)d(k-l)(lz—l) [+] (Jkr [+] Jk2)dk(k-1)1 [+1 [(1111 [+] 112M111: [+1 (111 H1112) [+] (121 [+1122)[+]. - 1 [+1011 [+1 J1:2)1 =0- Now, we will show that each term delimited by bracket (i.e., each line in the above equation) should be equal to zero. (1) The k-th term (i.e., the last term) should be equal to zero because the last term can- not be greater than dk(k_1), and all other terms do not have binary value ”1" between bit position 0 and logzdk(k_1) — 1, (2) Suppose (k-1)-th term is not zero. Then, (J(k_1)1 [+] J(k_1)2)d(k_1)(k_1) 2 dk(k_2). This is because all other terms do not have binary value "1" between bit position logzdk(k_1) and lngdk(k_2) — 1. So, the only way this inequality can be satisfied is LMB ((J(k—1)1 [+1 J(k-1)2)d(k—1)(k—1)) 2 RMB((Jkl [+] Jk2)dk(k-2))- (If not we can 566 immediately either (k-l)—th term should be zero or there is no way the whole equation can be satisfied.) Since RMB ((J(k_1)1 [+] J(k_1)2)d(k_1)(k_1)) should be equal to RMB((J/cr [+]Jk2)dk(k—1)). and RMB((Jkl [+] Jk2)dk(k—2)) -RMB((Jk1 [+]Jk2)dk(k—1)) = longk, it follows that (J (k_1)1 [+] J (k_1)2) > F k. This contradicts Fk_1 S Fk. Hence (k-1)-th term should be equal to zero. (3) Suppose (k-2)-th term is not zero. Then, (J (k_2)1 [+] J (k_2)2)d (k_2)(k_2) should be greater than or equal to dk(k_3). This is because first, all the other terms do not have binary value "1" between bit position logzdk(k_2) and lngdk(k_3) — 1, and second, from (2) (J (k—1)1 [+] J(k—1)2)d(k—1)(k—1) < dk(k—2) and this implies that (J(k_1)1 [+]J(k_1)2)d(k_1)(k_2) is less than dk(k_3). (If we multiply Fk_1 to the left- hand-side, and multiply F k to the right-hand-side of the inequality (J (k_1)1 [+] J (k_1)2)d (k_1)(k_1) < dk(k_2), this statement follows). The remainder of the proof is almost the same as that of case (2) 44 By continuing this argument, all the terms delimited by bracket should be equal to zero. Thus, we have the following k equalities. (1111+1112)P1 " 'Pk—l [+] (1211+1Jz2)P2 ' ' 'Pk-l [+1 - - - [+](1(k—1)1[+1J(k-1)2)Pk—1 [+1(Jk1[+11k2)= 0 (~1211‘1'1-722)I722 ' ' 'PIc-r2 [+1 - . . [+] (Jot-m[+11(Ic—1)2)Pk—12 [+1 (Jkl [+1J12) = 0 011-211 [+1J(k—2)2)pk—2k_2Pk—lk—z [+] (Jot—m[+]J(k—1)2)Pz-1k-2 [+1 (111114112): 0 (Jot—1n [+]J(k-1)2)pk-lk-1 [+] (11:1 [+1Jk2) = 0 (Jk1[+1Jk2)dkk [+1 (11114-1112) [+1 - - . [+1 (111 [+1Jk2) = 0 Now, assume pk_1 = 1. Then, from (k-1)-th equality (J (k_1)1[+]J (k_1)2) [+] (Jk1[+]Jk2) =0, and so from the (k-2)—th equality J (k_2)1 = J (k_2)2. This results in J(k_3)1 = J(k_3)2 from (k-3)-th equality, and results in J(k_4)1 = J(k_4)2, . . . , J11 = J 12. So, from the last equality Jk1 = sz, and hence J(k_1)1 = J (k_1)2. This contradicts at . Thus,pk_1 should be greater than 1. In the (k-1)—th equality, since pk_1 is a power of 2, LMB (Jk1[+]Jk2) should be greater than LMB (J 01-111 [+]J (k_1)2). (Otherwise, (k-l)—th equality cannot be satisfied.) By the similar reason, it is easy to see that LMB (Jk1[+]Jk2) > LMB(J1-1[+]J,~2), i = 1, k-l. Now, from the last equality, since LMB (Jk1[+]Jk2) > LMB (J11[+]J1-2), i = 1, k-l and dkk is a power of 2, there is no way the last equality can be satisfied. We have shown that there do not exist {J11,J12} ; f1, . . . , {Jk1,Jk2} ; fk such that ¢ and IU1(J11) [+] . . . [+] IUk(Jk1) =IU1(J12) [+] . . . [+] IUk(Jk2). This completes the proof for l = k. Thus the theorem follows by the principle of induction. [:1 The following example shows the proof of Theorem 3.6 when l = 3, i.e., for [U 1, [U2 andIU3. 45 Example 3.9. Let a file consist of three fields 1, 2, 3 such that F 3 2 F 2 2 F 1 and F 33 < M for the given number of devices M. Suppose FX distribution with [U 1(f 1), I U 2(fz) and IU3(f3) is not perfect optimal. Then, there exist {J11,J12} gf1, {121,122} gfz, {J31,J32} gf3 such that at and (1111+lfrrd11)l+1(121 [+]ledz1 [+1121422)[+](131 [+1131d31 [+1131d32[+11316133) = (1121+1J12411)[+](1221+]J22421[+1122d22)[+](132[+1132431[+1132432[+1132433) where d11 = M/F1, d21 = M/Fz, an = d21/F2, (131 = M/F3, (132 = d31/F3, (133 = d32/F3. Here, J31 at J32. This is because if J31 = J32, then ¢ and hence the equality cannot be satisfied by Lemma 3.10 and XOR uniqueness pro- perty. The above equality can be rewritten as, [(111[+1112)d11 [+1 (121[+1122)¢121 [+] (131[+1132)d31] [+] [(121 [+1122)d 22 [+] (J31 [+1132)432] [+] [(131[+1132)d33 [+] (J 11 [+1112) [+] (121[+1122) [+] (J31[+1132)] = 0 Let F3 = p2F2 = p1p2F1, where each pi, i = 1,2 is either one or some power of 2. Then, 411 =Prdzr =prpzdsri and 422 =P22432- Now, we will show that each term delimited by bracket (i.e., each line) should be equal to zero. (1) The third term (i.e., the last term) should be equal to zero because the last term is less than d32, and the first and the second term do not have any binary value "1" between bit position 0 and log2d32 -— 1. (2) Suppose the second term is not equal to zero. Then, (J 21 [+]J 22)d 22 > d31 because the first term does not have binary value "1" between bit position log2d32 and log2d31 — 1. Hence, LMB ((J21[+]J22)d22) 2 RMB ((J 31 [+]] 32)d 31). Otherwise, there is no way the equality can be satisfied. (Here, note that (121 [+]122)d22 < d21 S d 11). Since RMB ((J21[+]J22)d22) should be equal to RMB ((J31[+]J32)d32)a and '1! 46 RMB ((131 [+1132 )4 31) - RMB ((J 31 [+11 32)d 32) = 10g2F 3. it follows that (1211+1122) > F3. This contradicts F 2 S F3. Thus, the second term should be equal to zero, and hence the first term is also zero. Now, we have the following three equalities. (1111+1112)P1P2d31 [+] (1211+1122)P2d31 [+] (1311+1132)d31 = 0 (121 [+1122)P22(132 [+1 (1311+1132)432 = 0 (1311+1132)433 [+] (111144112) [+] (12114-1122) [+1 (13114-1132) = 0 Here, assume p2 = 1. Then from the second equality, (J 21 [+]J 22) [+] (J 31 [+]J 32) = 0, and so from the first equality J 11 = J 12 which results in J31 = J 32 in the last equality. This contradicts J 31 at J 32. Thus, P2 should be greater than one. In the second equality, since p2 is a power of 2, LMB (J 31 [+]] 32) should be greater than LMB (J 21 [+]] 22). Thus, from the first equality, it is easy to see that LMB (J 31[+]J 32) > LMB (J 11[+]J 12). Then, there is no way the third equality can be satisfied except J31 = J 32 which is contradiction. Therefore, the FX distribution method with IU1 (f1), [U2(fz) and IU3 (f3) is perfect optimal. Corollary 3.3. Let a file consist of 1 fields, 1, 2, , . . ,I such that F1 2 F1_1 2 2 F1 and F11 < M for the given number of devices M. Then, for any set of fields {i 1, ..., ik} g {1, 2, , . . ,1}, the FX distribution method with IU1-1(f1-1), ... ,IU1k(f,-k) is perfect optimal. Proof: This is a direct consequence of Theorem 3.6. 1] 3.4.6. I, U and [UK Field Transformation Functions In this section we show that FX distribution methods can always give perfect optimal distribution as long as the number of fields, whose sizes are less than the given number of devices, is no greater than 4. Lemma 3.11 and Lemma 3.12 shows useful properties of 1U, transformation functions in proving Theorem 3.7. 47 Lemma 3.11. Let fk = {0, . . . , F k-l} such that F k‘ < M for some positive integer x. Let d,- =M/Fki, i = 1, . . . , x. Let K1, K2 6 fk, and or be some power of 2 which is less than F k. Then, K1 and K 2 are in the same interval of size or from 0 to F k if and only if IUX(K1) and IU,(K2) are in the same interval of size ord1 from O to M. Proof: Only If : When K 1 and K 2 are in the same interval of size or, (K 1 [+] K 2) S or-l and hence (K1 [+] K2)d1 S (a—1)d1. Now, 1Ux(K1)[+]1Ux(K2) =(K11+]Krdr [+]---l+]K1dx)[+](K2[+lK2dl[+]---[+]K2dx) =(K1[+1K2)d1[+][(K1[+1K142[+1 - - . [+]K1dx)[+1(K2[+1K2d2[+] - - - [+]K2dx)1 <(K1[+]K2)d1+d1 The last inequality holds because [(K1 [+]K1d2 [+]. . . [+] K1dx) [+] (K2 [+]K2d2 [+] . . . [+] szx)] < d1 by the construction of di, i = 1, , x. Thus, IUx(K1) [+] IUx(K2) < ord1. This implies that IUx(K1) and IUx(K2) are in the same interval of size ad 1. If : When IUx(K1) and IUx(K2) are in the same interval of size ord1, (K1[+]K1d1[+]. . - [+]K1dx)[+l(K21+]K2dl[+1- - . [+]K2dx) < 041- 50. (K1 [+]K2)d1 + [(K1[+]K1d2[+] - - . [+]K1dx)[+] (K2[+]K2d2[+1 - - -[+1K2dx)] < (1411. Thus,(K1[+]K2)d1<0td1,andsoK1[+]K2 F11dj> F1, and d,- > (11,1 and d11 > F,-_dk1>F,-. (case 1) FiFij 2 M Let FiFj-Fk =AM, and let (1, = Bdk1 and dk1 = CF}, where A,B and C are some positive integers. Then, FiM/dek =AM or F1=Adlek =ABdk2. Since dj > F1, 10",) [+] U(fj)=So US1U...USFJ,_1,where s0 = {0, 1, ...,F1—1} $1 = {dj, dj'i'l, , dj+F;—1} Si = {ldj,idj+1, , (i+1)dj+F1~—1} SFJ-l = ((Fj—1)dj, (Fj—1)dj+1, , (Fj—1)dj+F1'—1} Clearly, all 51’s are disjoint. By definition 3.11 S,-+c = {idj+c, id j+l+c, , id j+F ,-—l+c} for any positive integer c which is less than d j. Let K e fk. By Lemma 3.7 there is exactly one element of K [+]de2[+] . . . [+]deJr at each interval from 0 to dk1 with interval size dk3 (Note that this can be obtained by substituting dk1 for M in 49 Lemma 3.7). (1)0 S (K [+] K dkz [+] . . . [+]Kd1a) < F,- Since F,- = ABdkz, there are AB number of IUx(K) such that US (K [+] K dkz [+] . . . [+]Kd1a) < F1. (1-1) Kmod B = O (i.e., de1 mod dj = 0) F~-l For each IUx(K) within this range, [(f,-) [+] U(fj) [+] IUX(K) = OS,- The equality i=0 holds due to Lemma 3.1, and Lemma 3.2 (Note that K[+]de1[+] - - - [+]de2 = de1[+](K [+] de2[+]...[+]Kd/a)). Let this set be T11. Since there are A number of such I Ux(K )’s within this range by Lemma 3.12 (this statement follows by substitute dk1 for M in Lemma 3.12), there are A repetitions for each element in T11. (1-2) KmodB =1 (i.e., de1 mod (11 =dk1) F-—1 For each IUX(K) within this range, I(f,-) [+] U(f1) [+] IU,(K) = (J S; + dk1. The equal- 1- ity holds due to Lemma 3.1 and Lemma 3.2. (Note that dk1F1.) Let this set be T23. By the same reason as previous, there are A repetitions for each element in T23. (C) (C-1)F,- SK [+1de2 < CF,- (C-l) KmodB =0 For each IUx(K) within this range, I(f,-) [+] U(f1) [+] IUx(K) = U S,- + (C—1)F,-. (Note i=0 that CF ,- = dk1.) Let this set be T51. By the same reason as previous, there are A repetitions for each element in TC 1. (C-B) KmodB=B-1 For each IU1(K) within this range, I(f,-) [+] U(f1) [+] IU1(K) = F.—1 US,- +(B—1)dk1 + (C —1)F,-. Let this set be TCB. By the same reason as previous, i=0 there are A repetitions for each element in TCB. Now, it is not difficult to see that i=C,j=B . . . . . U T1,- : ZM and all T1,- ’s are d1s301nt. Srnce we already show that there are A repen- i=1.j=1 _ i=C,j=B . trons for every element in U T,-- , it is 3-opt1mal. 2-optimal come from Theorem i=1,j=1 51 3.3, 7, 8. 0 and l-optimal come from Theorem 3.1. (case 2) F117ij < M The proof is almost the same as that of case 1. Cl Corollary 3.4. Let L be the number of fields whose sizes are less than the given number of devices M. FX distribution methods can be always perfect optimal, if L S 3. Proof: When L = 0, 1, 2, it follows from Theorem 3.1, 3.2, 3.3, 3.4, 3.5 and Proposi- tion 3.1. When L = 3, let i, j, k are fields whose sizes are less than M and F,- 2 Pk 2 F1. Ika2 2 M, apply I (f1), U (f,) and IU1(fk) transformation. Since F 1F k 2 M, the corollary follows by (i) of Theorem 3.7. If F12 < M, apply 1(f1), U (f,-) and IU2(fk) transforma- tion. Then the corollary follows by (ii) of Theorem 3.7. [:1 Example 3.10. Let f1 = {0, l, 2, 3},f2 = {0, 1},f3 = {0, 1} and M = 8. Figure 3.7 shows the FX distribution with I(f1), U(fz) and IU2(f3). Here, U(fz) = {0, 4} and 1U2(fs) = {0. 7}. and Device N0 =TM(I(J1)[+]U(12)[+]1U2(13)).Jr E f1.12 6 f2. J3 6 f3. (Here, we can also verify that the FX distribution with I (f1), U (f;) and IU1(f3) gives perfect optimal distribution because the file system of this example also satisfies condition (i) of Theorem 3.7). We have shown that by various combinations of field transformation functions FX distribution methods give strict optimal distribution for many types of partial match queries. Here, it should be emphasized that these field transformation techniques along with Proposition 3.1 increase the scope of optimality considerably. This is because by Proposition 3.1 optimal distribution for a subset of fields guarantees strict optimal distri- bution for many partial match queries in which those fields are unspecified. For exam- ple, let a partial match query unspecify fields 1, 2, 3, where I f 1 l = 8, l f2 I = 4, I f3 I = 8, and M = 32. When field 1 is U transformed and field 2 is I-transformed, then regard- less of field 3, the distribution is strict optimal for this query by Theorem 3.3 and Propo- sition 3.1. 52 I(f1) U(fz) I(1203) DeviceNO 000 000 000 0 000 000 111 7 000 100 000 4 000 100 111 3 001 000 000 1 001 000 111 6 001 100 000 5 001 100 111 2 010 000 000 2 010 000 111 5 010 100 000 6 010 100 111 1 011 000 000 3 011 000 111 4 011 100 000 7 011 100 111 0 Figure 3.7. FX Distribution with I, U And I U 2 Transformation Now, the summarized results of FX distribution methods are as follows (here, all the theorems, lemmas and corollary in section 3.3 also hold for (Extended) FX distribu- tion methods) Let a file consist of 11 fields and there be M parallel devices. Let L be the number of fields whose sizes are less than the given number of devices M. FX distribution methods can be always perfect optimal, when L S 3. Let L be greater than or equal to 4. Let (1qu ) be the set of fields which are unspecified for partial match query q. Then, FX distribution methods are strict optimal for partial match query q, if at least one of the following conditions holds. (1) lq,,(f)|=0 or 1 (2) there is at least one field i e qu(f ) such that F1 2 M. (3) | qu(f ) I = 2 and transformation methods of two fields in qu(f ) are different. (ref. section 3.4.1.) 53 (4) I q“(f) I = 3 and either (a) there are at least two fields i, j e qu(f ) such that F 1F ,- .>_ M and transformation methods of two fields i and j are different, or (b) transformation methods of three fields in qu(f) are I, U, IUJr for some x 2 2, and the size of IUx transformed field is not less than the size of U transformed field. (5) | qu(f) | 2 4 and either (a) there are at least two fields i, j e qu(f ) such that F 1F j 2 M and transformation methods of two fields i and j are different, or (b) there are at least three fields i, j, k e qu(f ) such that F 1F jF k 2 M and transfor- mation methods of fields i, j and k are I, U, I UJr for some x 2 2, and the size of I U, transformed field is not less than the size of U transformed field. (c) the fields in qu(f ) satisfy Theorem 3.6 Note that these are only sufficient and are not necessary conditions. It is unfortunate that FX distribution methods do not always guarantee perfect optimal distribution when the number of fields whose sizes are less than M, is greater than or equal to 4 in general. In fact, it has been shown in [Sun87] that when the number of fields whose sizes are less than the given number of devices, is greater than or equal to 4, there is no method which always gives perfect optimal distribution. How- ever, even for these cases we will show through performance experiments that FX distri- bution methods still gives near optimal distribution for most queries. 3.5. Performance Comparison with Other Distribution Methods In this section we compare the performance of FX distribution methods with those of the DM and GDM distribution method. The performance comparisons are based on the probability of strict optimality and the average response time for a given partial 54 match query. In section 3.5.1 the probability of strict optimal distribution for partial match queries is given. In section 3.5.2 we compare the average response time for the FX, DM and GDM distribution method. For both section 3.5.1 and 3.5.2, it is assumed that the probability of each field being specified is same for all fields and some field being specified is independent of each other. 3.5.1. Probability of Strict Optimality In this section we show that the probability of strict optimality for FX distribution methods is much higher than the DM distribution method. Even for the worst case the decrease of probability of strict optimality for FX distribution is not much. On the other hand, in the DM distribution the decrease is quite large. Since no general method has been given to determine the existence of parameter values for strict optimal distribution in the GDM method, we compare FX distribution methods to DM distribution method only. Let the file consist of 7 fields and M = 32. Let F1: 2, F2 =F3 = 4, F 4 = F5 =F6 = 8, F7 = 16. In the DM distribution method the probability of strict optimal distribution for some partial match query is 0.0547 (computed from the optimality conditions given in [DuS82]). On the other hand, in FX distribution methods with IU1(f1), IU2(fz), U(f3), I(f4), U(fs), IU1(f6), 1(f7), the probability of strict optimal for a partial match query is 0.9531 (computed from the sufficient conditions in section 3.3 and 3.4). Therefore, in this example FX distribution methods give much higher probability of strict optimality than the DM distribution method. Figure 3.8 and 3.9 show the percentage of strict optimal distribution for all possible partial match queries in a given file system. In these figures DM denotes the results of the DM distribution method and FX denotes the results of FX distribution methods. Here, the results are computed from sufficient conditions given for each method. Figure 55 3.8 shows the case where for any two fields r and 3 whose sizes are less than the given number of devices, F ,F s 2 M. We used files with six and ten fields. In this figure FX distribution methods used I, U and I U 1 transformation methods. Figure 3.9 shows the percentage of strict optimal distribution when for any two fields r and s whose sizes are less than the given number of devices, F ,F s < M, but for any three fields r, s and t whose sizes are less than the given number of devices, F,F,F1 2 M. Here, in FX distribution methods I, U and I U 2 transformation methods are used. These results show that FX distribution methods give high probability of strict optimal distribution for partial match queries in typical file systems. However, since there is no convenient way to compute probability of strict optimal distribution in more general file systems, in the next section we give results of performance experiments based on average response time. 3.5.2. Average Response Time Definition 3.14. For a given partial match query q, r,-(q) is defined as the number of qualified buckets in device i for a partial match query q. We call this the response size of device i for a partial match query q. Then, the largest response size for a partial match query q is defined as MAX(r 1 (q), r2(q), - . - rM_1(q)). For the response time of a partial match query, we will consider two factors, namely, largest response size and CPU computation time for bucket address calculation. In parallel disks environment, largest response size is the most important factor, while in main memory databases, CPU computation time is more important. When systems are configured such that the data retrieval time for any device is almost the same, the response time for a partial match query is determined by the device which has the largest number of qualified buckets. For example, parallel disks con- nected to one shared bus, or some of the multiprocessors based on multistage intercon- nection networks are considered to be such systems. Ill Percentage of 60 _ Strict Optimal DM Distribution 40 — 20 — 0 I I I I I I O 1 2 3 4 5 Number of Fields Whose Sizes Are Less Than M (a) Number of Fields = 6 100 FX 80— Percentage of 60— DM Strict Optimal Distribution 40- 20—- 0 I I I I I I I I I I 012345678910 Number of Fields Whose Sizes Are Less Than M (b) Number of Fields = 10 Figure 3.8. Strict Optimality When for Any Fields r and s, F ,F s 2 M 57 FX Percentage of 60 _ Strict Optimal Distribution 40- DM 20 — O I I I I I I O 1 2 3 4 5 6 Number of Fields Whose Sizes Are Less Than M (a) Number of Fields = 6 1 00 FX 80 — Percentage of 60 _ DM Strict Optimal Distribution 40 “ 20— 0 I I I I I I I I I I 012345678910 Number of Fields-Whose Sizes Are Less Than M (b) Number of Fields = 10 Figure 3.9. Strict Optimality When for Any Fields r, s and t, F ,F SF 1 2 M 58 Table 3.1 through 3.8 show the largest response size of the DM, GDM and FX dis- tribution methods for various file sizes with various number of parallel devices. The number of fields is six for all these experiments. In all these tables, the first column denotes the number of unspecified fields. For the GDM method, in order for comparison to be fair, we used seven different sets of multiplication parameters. These sets are GDMl : 3, 11, 23,37, 49, 53 and GDM2 : 5, 9, 31, 37, 53,59 and GDM3 :41, 43, 47, 51, 53,57 and GDM4 : 3, 5, 7, 11, 13,17 and GDM5 :3, 7,13, 43, 51, 57 and GDM6 : 2, 3, 5, 7,11,13 and GDM7 :2, 5, 11, 43, 51, 57. Here, the sets of parameters in GDMl, GDM2, GDM3, GDM4 and GDM5 are chosen based on [DuS82], i.e., relative prime to the given number of devices. GDM6 and GDM7 are used to include other cases. For FX distribution methods, field transformation functions applied in each experi- ment are as follows. Table 3.1:, Table 3.2: , Table 3.3 : , Table 3.4: , Table 3.5 : , Table 3.6: , Table 3.7 : and Table 3.8 : . Here, the sequence of transformation functions denotes the sequence of fields to which these transformation functions are applied. Heuristics based on the theorems in section 3.3 and 3.4 are used to choose these transformation functions. In all these tables, each entry is computed as an average value of largest response sizes from all possible partial match queries for that entry. The tables show that except for first row out of table 3.4 and 3.8, FX distribution methods give smaller largest- response-size than all the other methods. FX distribution is also very close to optimal. It should also be noted that there may exist a set of multiplication parameters by which the GDM method can give better performance than those of GDMl, GDM2, GDM3, 59 Table 3.1. Response Time for F 1=F2=F3=F4=2, F5=F6=4 and M = 16. DM GDMl GDM2 GDM3 GDM4 GDM5 GDM6 GDM7 FX Optimal 2 2.1 1.3 1.7 1.4 1.3 1.5 1.3 1.5 1.1 1.0 3 4.4 2.2 3.1 2.2 2.2 2.4 2.3 2.6 1.6 1.2 4 10.3 4.4 6.2 4.3 4.2 4.7 4.5 5.1 3.0 2.7 5 22.3 8.7 12.3 8.2 8.2 9.2 9.0 10.7 6.7 6.7 6 52.0 18.0 26.0 17.0 18.0 19.0 20.0 22.0 16.0 16.0 Table 3.2. Response Time for F 1=F 22F 3:2, F 4=F 5=F 6:4 and M = 32. DM GDMr GDM2 GDM3 GDM4 GDM5 GDM6 GDM7 FX Optimal 2 2.4 1.2 1.5 1.3 1.2 1.3 1.2 1.3 1.0 1.0 3 5.7 1.9 2.9 2.1 1.9 2.2 1.9 2.2 1.5 1.1 4 14.8 3.8 6.0 3.8 3.5 4.2 3.5 3.9 2.9 2.2 5 36.0 8.2 13.3 7.8 7.5 8.8 7.8 8.7 6.6 6.0 6 92.0 18.0 28.0 18.0 18.0 20.0 19.0 20.0 16.0 16.0 Table 3.3. Response Time for F 1=F2=F3=F4=F5=F6=8 and M = 32 DM GDMl GDM2 GDM3 GDM4 GDM5 GDM6 GDM7 FX Optimal 2 8.0 3.8 4.5 3.7 3.4 3.9 3.3 3.6 3.2 2.0 3 48.0 19.2 21.9 18.9 18.3 20.2 18.1 18.9 16.0 16.0 4 344.0 133.8 143.3 132.5 131.6 138.0 130.5 132.7 128.0 128.0 5 2460.0 1034.7 1058.3 1031.7 1029.3 1045.3 1026.3 1029.7 1024.0 1024.0 6 18152.0 8210.0 8292.0 8202.0 8200.0 8224.0 8196.0 8198.0 8192.0 8192.0 Table 3.4. Response Time for F 1=F 2=F 3=F4=F 5=F 6:8 and M = 64. DM GDM1 GDM2 GDM3 GDM4 GDM5 GDM6 GDM7 FX optimal 2 8.0 2.4 2.8 2.4 2.1 2.9 2.1 2.2 2.4 1.0 3 48.0 10.7 11.8 10.6 9.9 12.4 10.2 10.3 8.0 8.0 4 344.0 69.0 73.3 67.5 67.1 75.4 68.3 68.1 64.0 64.0 5 2460.0 522.3 531.3 517.3 516.2 538.2 520.5 517.0 512.0 512.0 6 18152.0 4115.0 4148.0 4102.0 4102.0 4146.0 4114.0 4102.0 4096.0 4096.0 Table 3.5. Response Time for F1=2, F2=F3=4, F4=F5=F6=8 and M = 128. DM GDM1 GDM2 GDM3 GDM4 GDM5 GDM6 GDM7 FX Optimal 2 4.1 1.2 1.3 1.3 1.2 1.2 1.0 1.4 1.0 1.0 3 17.8 2.8 3.1 2.7 2.5 3.0 2.6 2.9 1.9 1.5 4 81.9 9.6 9.2 8.7 8.2 9.9 8.9 8.8 6.5 6.3 5 351.3 35.3 33.5 32.8 32.2 35.0 35.8 31.8 29.3 29.3 6 1456.0 142.0 134.0 133.0 132.0 135.0 145.0 131.0 128.0 128.0 Table 3.6. Response Time for F1=F2=F3=F4=4, F5=F5=8 and M = 256. DM GDM1 GDM2 GDM3 GDM4 GDM5 GDM6 GDM7 FX Optimal 2 4.3 1.1 1.1 1.2 1.0 1.1 1.1 1.1 1.0 1.0 3 17.6 1.8 2.0 1.9 2.2 1.8 2.6 1.9 1.4 1.0 4 79.2 5.2 5.1 4.8 6.7 5.0 8.2 4.9 3.7 2.7 5 352.0 18.2 17.3 16.8 26.7 17.5 34.2 17.3 14.7 13.3 6 1592.0 73.0 72.0 70.0 119.0 71.0 155.0 69.0 64.0 64.0 61 Table 3.7. Response Time for F 1=F2=F3=4, F4=F5=F6=8 and M = 512. GDM3 GDM4 GDM5 GDM6 GDM7 FX Optimal ONUIshWN DM GDM1 GDM2 4.8 1.0 1.0 1.1 1.0 1.0 1.1 1.0 1.0 1.0 22.8 1.6 1.8 2.0 2.6 1.6 3.2 1.6 1.4 1.0 114.8 4.4 4.8 6.2 9.6 4.5 12.3 4.3 3.5 2.2 569.0 15.8 17.3 25.8 44.8 16.0 58.3 15.5 13.3 12.0 2848.0 70.0 75.0 122.0 220.0 70.0 289.0 69.0 64.0 64.0 Table 3.8. Response Time for F 1=F2=F3=8, F4=F5=F6=16 and M = 512. DM GDM1 GDM2 GDM3 GDM4 GDM5 GDM6 GDM7 FX Optimal OUIJKWN 9.6 1.3 1.4 1.4 1.3 1.3 1.7 1.3 2.3 1.0 91.2 5.3 5.7 5.6 7.7 5.5 10.0 5.5 5.1 3.2 911.2 39.9 40.1 42.2 70.2 40.5 90.3 40.5 37.3 35.2 395.5 392.7 408.67 700.2 397.7 909.5 397.3 384.0 384.0 9076.0 9176.0 4144.0 4096.0 4096.0 90404.0 4129.0 4112.0 4313.0 6969.0 4139.0 62 GDM4 and GDM5 in Table 3.1 through Table 3.8. However, even though such set of parameters exists, those can only be found by trial and error method. In disk based database systems the computation time is not significant compared to disk access time. But in main memory databases CPU computation time is important. Thus, we will discuss CPU computation time for the FX and GDM distribution method. We use optimized instruction codes for comparing CPU computation time. In the GDM method we use AND operation to implement modulo function. This is possible because the number of devices is assumed to be a power of 2. In FX distribution method, since the multipliers for U and [UK transformation are always power of 2, we can substitute multiplication by shift operation. Note that we cannot do this in the GDM method because multipliers in the GDM method are usually chosen from prime or odd numbers. Function TM is done by AND operation. In MC68000 processor, computation time of FX methods take much less than that of the GDM method. (In MC68000, XOR takes 8 cpu clock cycles, ADD takes 4 clock cycles, AND takes 4 clock cycles, n bit shift takes 6 + 2n clock cycles. But multiplica- tion takes 70 clock cycles). In intel 80286/80386 processor the ratios of clock cycles between different operations are almost similar to those of MC68000. For main memory databases FX distribution methods are much faster than the GDM distribution method. The computation time of the DM distribution method is less than that of FX distribution methods, but as is shown in the Table 3.1 through Table 3.8, the DM distribution method is not suitable for a large number of parallel devices. 3.6. Data Construction Methods In this section we discuss data construction methods for the file which is distributed by FX distribution methods. Data distribution methods determine the amount of access concurrency while data construction methods affect storage characteristics and time for each bucket access. We will present two approaches of data construction based on the 63 usage of multikey hashed directory. Multikey hashing for a given file with n fields pro- duce a subset of T, where T = f1 xf2>< xfn. As discussed in chapter 2, T can be used as either a real global directory or a virtual global directory. 3.6.1. Data Construction Based on Real Global Directory We will describe methods of using T as a real global directory. Let GD = [0..F1—l, . . . , O..F,,—1] be a multi-dimensional array in which the range of the i-th dimension is 0.171-1. This is the same range of multikey hashing for field i. Here, GD serves as a real global directory. Each element of GD contains the address of a bucket. When the directory is centralized, the problem is trivial. However, for maximum access concurrency the directory also needs to be distributed among the access nodes. Data construction methods for this case will be investigated in the rest of this section. FX distribution methods partition multi-dimensional array GD into M subsets, where M is the number of access nodes. Since a directory is also distributed among the nodes, we have to have efficient storage rule to locate the elements of this multi- dimensional array. In other words, for each element of GD we have to define the local address in each device. The similar ideas for distributing the elements of an array have been used in MDA memory [Bat77] and the prime memory system [Law82]. In order to determine the local address efficiently, we need a few techniques which are described below. Definition 3.15. In a given file system, a Minimal Pivot Set (MP8) is any set of fields in which the size of the cartesian product of those fields is no less than the given number of devices M, and any subset of an MP8 is not an MP8. Let there exist an MP8 whose cardinality is no greater than 4 in a given file system (if not, no efficient method of using T as a real global directory has been found). Let p be the one of MPS’s whose cardinality is minimum, and p’ be the set of all the remaining 64 fields. In multidimensional array GD, rearrange fields such that those fields in p become rightmost dimensions. When IpI = 2, let n-1 and n be two fields in p. Here, field n-1 and n denote the fields which correspond to (n-1)-th and n-th dimensions in GD, respectively. Then, apply I—transformation for field n-1 and U-transformation for field n. When Ipl = 3, let n-2, n-1 and n be three fields in p such that Fn_2 2 F,, 2 Fn_1. Then, apply I, U and [U 2 transformation to field n-2, n-1 and n, respectively. Note that the sequence of fields is important in above two cases. When |p| = 1, apply I transformation to the field in p. Let the sequence of elements in GD is based on row-major ordering, i.e, index ele- ments of high dimensions change first. Then, the storage rule of this multi-dimensional array is as follows : For a bucket < J 1, . . . , J,, > produced by multikey hashing (the sequence of fields in the given bucket is the same as that in GD, i.e., i-th field denotes i—th dimension in GD), (1) Device No is determined by the FX distribution methods. (2) The local address for this bucket in the device is Z J1m- + l [2 J 1K1]/MJ, where 71 iep’ iep n n = H/H Fj]/M, andM: HF-,ifi¢n and7t,-=1,ifi=n. =i+1 j=i+l The correctness of the local address calculation can be shown by the following proposi- tion. Proposition 3.2. Let GD’ = [0..F1—1, . . . , 0..F ,1—1] be a multi-dimensional array, n where the content of element [J 1, , J,,] is TM [[+](X 1-(J j))] which is the same as the i=1 device number computed by the FX distribution method. Let S = (a0, . . . , aM_1, aM, . . , am, . . ) be a linear sequence of the contents of GD’ by row- major ordering. Let S,- be a subsequence of S such that S,- = F1 x . . . Xpn (am, am“, . . . ,a(,-+1)M_1). Then, for any S,-, i =0, ~-- , —M———l , a set of 65 elements in S,- is 2114. (It is assumed that there exists an MP8 whose cardinality is no greater than 4.) Proof: When a cardinality of p is one (i.e., there exists a field whose size is no less than the given number of devices), or the size of the cartesian product of fields in p is M, it simply follows by Lemma 3.2. For other cases, the arrangement of fields in p ensures that sequence as shown below. (case 1) Ipl = 2 (i.e., I(f,,_1) [+] U(fn)) The proof immediately follows by Lemma 3.8. (case 2) 'PI = 3 (i.eo 1(fn—2) [+] U(fn—i) [+11U2(fn)) Since F,,_2 2F" 2F,,_1, F,,2 < M. Let Fn_2F,,_1F,, =AM and F,,-1F,, = %. Then, F,,_2 =AD. Let U(fn_1) [+] IU2(f,,) = R. By Theorem 3.5, R has F,,_1F,, distinct ele- ments which are between 0 to M-l. Let d,,_1=M/F,,_1, i.e., d,,_1 = F,,D. Since Fn—l d,,_1 > F,,, by Lemma 3.8, R = U [U (f,,_1) +c i] for F,, different number of ci’s which i=0 are between 0 to d,,_1. Now it is sufficient to show that there exists exactly one c,- for each interval of size D between 0 to d,,_1. This guarantees that for any set W of D con- secutive elements in f,,_2 which are all in the same interval of size D, R[+]W = ZM by Lemma 3.2. Now, let d,,1 =M/F,, and d,,2 = M/Fnz. We want to show that for any K1, K2 5 f,, such that K2 > K1, c,- = (K1[+]K1d,,1[+]K1d,,2) mod d,,_1 and cj = (K2[+]K2d,,1[+]K2d,12) mod d,,_1 are in different intervals of size D. Let dn_1 = Bdn1. When K 1 mod B at K 2 mod B, it is easy to see that c,- and cj cannot be in the same inter- val of size D because d,,1>D. Let K1 mod B = K2 mod B. This implies that K1 and K 2 are in the different intervals of size B. Therefore, by Lemma 3.12, K1[+]K1d,,2 and K 2[+]K 2dn2 should be in different intervals of size Bdnz between 0 to dn_1. Since Bdnz = D, the proof follows. E] 66 When a multikey hash function (91 is used for a file V, using T as a real global directory is advantageous if for most t e T, (91—10) 6 V with reasonable average chain length. However, an appropriate (91 may not be easily determined for dynamic files because the file size is not known in advance. Hence, the real global directory may turn out to be very sparse or to have long overflow chain. Here, it should be noted that applying dynamic hash function for 81 will cause significant overhead due to intemode data movement. The virtual global directory approach described in the next section can avoid this problem. 3.6.2. Data Construction Based on Virtual Global Directory In the previous section we described the method of using T as a real global direc- tory. In this section we describe the method of using T as a virtual global directory. ' The idea of the virtual global directory is to use one more hash function (92 which is local in each device. Let < J 1, . . . , J,, > be an ordered n-tuple produced by multikey hash function (91 for some record. The local hash function @2 uses this ordered n-tuple as an input key for its local directory. Since (92 does not affect data distribution, dynamic hash functions [Fag79, Lar78, Lit80] can be used as (92. When T is used as a virtual global directory, only the local directories physically exist. Each local directory can dynamically grow and shrink while the virtual global directory is static. This two—level mapping data construction is much more flexible than the real glo— bal directory because the storage utilization of directories is not affected by @1, and dynamic hash function for local directories can handle dynamic files. Thus, a virtual global directory approach is appropriate for dynamic files. One disadvantage of the virtual global directory approach is that it may cause more probings to find qualified records than in the real global directory. This is because dif— ferent ordered n-tuples produced by (91 can be mapped into the same local directory entry by @2. 67 Let V1 be the set of ordered n-tuples produced by @1 for a given file which are allocated into the same device, and V2 be the range of local hash function (92 in that device. Let p1 = NH and p2 = IV2l. Let t be an average number of elements in V1 which are mapped into the same v 6 V2 by 62. Then, the probability P that @2_1(v) = 1.11 . . l l forveV2,1s 1venb P= 1—— . Letc= /1. Then,t=—-—. 111 This can be derived by using lim (1 — 1 ) = e‘l’c, and t = L . For exam- p1_,°. c111 (l—P)p2 ple, 1: = 1.58 when c = l, and so the number of probings increases. (Detailed description about average chain length for various hashing algorithms can be found in [Knu73].) However, since T consists of cartesian product of all fields, many elements in T may not correspond to any record. On the other hand, the two level hashing scheme (i.e., the virtual global directory approach) can always achieve efficient storage utiliza- tion of the directories because the input of 62 is only those n-tuples which have corresponding records. Even for the case when a real global directory has many empty entries, the two level hashing scheme can always guarantee efficient storage utilization with slightly increased number of probings. CHAPTER 4 OPTIMAL DATA DISTRIBUTION FOR MULTIATTRIBUTE RANGE QUERIES 4.1. Introduction In this chapter we describe optimal data distribution for multiattribute range queries. A multiattribute range query, also called a region query or orthogonal range query, is an intersection query in which a multiplicity of the attributes are allowed to be range specified in a query’s qualification. The multiattribute range query differs from the partial match query in that the range specification is allowed in the multiattribute range query while it is not allowed in the partial match query. Several file structures have been proposed for handling multiattribute range queries (hereafter, a range query denotes a multiattribute range query). These include various types of specialized tree structures and hash based accesses. In [Ben75] the multidi- mensional binary search tree, also called k-d tree, has been proposed, which is an exten- sion of the standard binary search tree. Worst case analyses of these trees can be found in [Lee77]. There are other types of the tree structure for range searching [Ben79]. Hash based accesses for range queries has also been investigated in the past. In [Bol81], Bolour proposed box-array addressing functions which is a composition of ran- domizing function and order-preserving function. It has been shown in that paper that this hashing scheme is effective for answering small range queries. Several other types of the file structure for range searching can also be found in [Ben79, Knu73]. A lower bound on the complexity of range queries has been presented in [Fre81]. The main focus of those research is on minimizing the number of bucket accesses. However, they did not consider data distribution to enhance access concurrency. Though there are 68 69 some heuristic data distribution methods for multikey search queries [Du882, Kim88], range specification in a query’s qualification is not considered in those methods. In this chapter we will mainly focus on optimal data distribution for multiattribute range queries to facilitate parallel processing of this type of applications. A partial match query can be thought of as a special case of a range query because a range query reduces to a partial match query when each specified range size is only one. Thus, a partial match query will be treated as a special case of a range query in the rest of this chapter. The rest of this chapter is organized as follows. In section 4.2 we describe under- lying file structure, basic definitions and assumptions. In section 4.3 we show the inherent limitations for optimal data distribution for some types of range queries. We describe optimal data distribution methods for range queries in section 4.4. The optimality conditions for these methods are also given. 4.2. Definitions And Terminology In this section we describe underlying file structures as well as relevant definitions and assumptions. We use partitioned hashed directory as a file structure. In other words, the access structure for a file with n secondary keys is a partitioned hashed directory consisting of n fields. This directory is based on multikey hashing scheme, where each hash function is order-preserving. Multikey hash function is described in chapter 3. Several order- preserving hash functions have been proposed in [Gar86,Hsi88]. In the following para- graph we also give a method for implementing order-preserving hashed accesses by small tables. One way of implementing order-preserving hash functions is to represent the func- tion explicitly by a table. Figure 4.1 shows the explicit representation of n functions for a file consisting of n fields, where hi denotes the function (i.e., table to show the function 70 values) applied to i-th attribute values. Each table of the figure the first column denotes the interval of key values and the second column denotes the image for any domain value in that interval. hi hn Figure 4.1. Explicit Representation of Multikey Hash Functions In this structure a file consisting of n fields (i.e., n secondary keys) needs n tables. Let f,- be the projection of the second column of table i, i.e., f1 = {0, 1, . . . , 7} in Figure 4.1. A bucket is defined as an ordered n-tuple, , such that b,- is an element of f1. Then, the cartesian product f1 x . . . x f,, represents the multikey hash directory. As we discussed in chapter 3, this directory can be used as either a global directory or a virtual global directory. This representation requires additional storage for the tables as in Figure 4.1. Let us consider the storage overhead of these tables for the file with five fields. Suppose the directory consists of thirty thousand (about 215) entries (if a bucket contains 33 records in average, the file contains about one million records). When all the field sizes are the same, the size of each field is 8. If each entry of the tables takes ten bytes, the storage overhead is only 400 bytes because we need 5 * 8 = 40 entries altogether. Thus, the storage overhead is negligible and these tables can be easily stored in main memory. P 71 We use a bucket as a unit of data distribution. The main focus of this chapter is to investigate optimal bucket distribution for range queries in multikey hashing. Since we are dealing with multikey hash file which is the same as that in chapter 3, all the termi- nology defined in chapter 3 are also used in this chapter. In the rest of this section we describe definitions which are necessary in this chapter. Definition 4.1. r(u, v) denotes the range specification between it and v, where both boundaries are closed. We assume that a range query can have three types of field value specification which are single value, range, and don’t care in which case it is denoted by *. We will not use r(u, v) to denote a single value (i.e., range size is equal to one) and don’t care (i.e., unspecified). We also do not allow cyclic ranges like r(5,2), i.e., 5, 6 , . . . , F ,~—l, 0, l, 2. In other words, whenever r( u, v) is a range specification for field i, it is implied that u < v and, either it =00rv¢F,—l. Definition 4.2. A range query for files with n fields is denoted by [A 1, A2, . . . , An], where for each i = l, . . . , n, A,- = *, or r(u, v), or w, where u, v, w 6 f1 and u < v. We define types of range queries based on the number of range specified fields in a query’s qualification. Definition 4.3. When the number of range specified fields is or in a query q, query q is called type or range query. Example 4.1. Let q1, q2 and q3 be queries in the file (DEPT, AGE, STATE) such that q1 = [Math, r(20, 27), Ohio], q2 = [*, r(20, 27), *] and q3 = [r(Math, Physics), r(20, 27), *]. Here, r(20, 27), r(Math, Physics) denote range specification, and "*" denotes don’t care condition, i.e., the field value is unspecified. Then, q1 and q2 are type 1 range queries, and q3 is type 2 range query. By the above definition a partial match query such as [Math, *, Ohio] is a type 0 range query. 72 Definition 4 .4. Let R (q) be the set of buckets which satisfy the qualification for a range query q. The distribution method is called strict optimal for a range query q in a given file system if each device has no more than [IR (q) I/M] number of buckets. Definition 45. When the distribution method is strict optimal for all queries in type 0 through type or range queries in a given file system, it is called perfect optimal for type (0 - 0t) range queries in that file system. Example 4.2. Let f1 = {0, l, 2, 3}, f2 = {0, l, 2, 3} and M = 8. Figure 4.2 shows three bucket distributions denoted by Distlibution-l, Distribution-2 and Distribution-3. For example, the bucket <3, 1> is stored in device 5 for Distribution-l, device 2 in Distribution—2 and device 7 in Distribution-3. f1 f 2 Distribution-l Distribution-2 Distribution-3 0 0 0 0 0 0 1 1 1 4 0 2 2 2 2 0 3 3 3 6 1 0 4 4 1 1 1 5 5 5 1 2 6 6 3 1 3 7 7 7 2 0 0 7 2 2 1 l 6 6 2 2 2 5 0 2 3 3 4 4 ' 3 0 4 3 3 3 1 5 2 7 3 2 6 l 1 3 3 7 0 5 Figure 4.2. Example Bucket Distributions Let q1 = [r(l, 2), *] and q2 = [*, r(O, 1)]. Then, the Distribution-1 is strict optimal for query q 1 , but is not strict optimal for query q2. The Distribution-2 is strict optimal for query qz, but is not strict optimal for query q1. The Distribution-3 is perfect 73 optimal for type (0 - 1) range queries, and hence it is strict optimal for both queries q1 and q2. However, this distribution is still not perfect optimal for type (0 - 2) range queries because it is not strict optimal for query q3 = [r(O, 2), r(O, 2)]. Definition 4.6. Let a file consists of n fields, and S1, i = 1, . . . , n, be a subset of f1. Then, M (S 1 , . . . , 5,.) denotes a set of devices at which bucket is stored, where b,- 6 5;. For convenience, when S,- = {s1}, i.e., S1 has a single element, we will use s,- instead of {s1}. For example, in Distribution-1 of Figure 4.2, M(0, 0) = {0}, M({0, l}, 0) = {0, 4} and M({0,1, 2}, O) = {0, 4}. Since the perfect optimal distribution is the most desirable, it is worth investigating whether a perfect optimal distribution always exists. It will be shown that perfect optimal distribution for certain types of range queries does not exist inherently in many cases. In the following section we discuss nonexistence of perfect optimal distribution for certain types of range queries. We give data distribution methods for optimal bucket distribution for various types of range queries in later sections. It will be shown that the proposed data distribution methods are perfect optimal for certain types of range queries, and strict optimal for a large class of range queries. 4.3. Limitations of Perfect Optimal Distribution In this section we will show that there are inherent limitations for achieving perfect optimal distribution for certain types of range queries. In section 4.3.1 we show that for files with two or more fields, perfect optimal distribution for type (0 - 2) range queries does not exist in many cases. In section 4.3.2 it will be shown that for files with three or more fields, perfect optimal distribution for type (0 - 1) range queries is not always possible. In both sections the sufficient conditions for the nonexistence of perfect optimal distribution will be given. In section 4.3.3 we discuss perfect optimal 74 distribution for type 0 range queries. 4.3.1. Type (0 - 2) Range Queries In this section we show that perfect optimal distribution for type (0 - 2) range queries does not exist in many cases. Lemma 4.1. When a file consists of n fields (n 2 2) and the given number of devices is M, perfect optimal distribution for type (0 - 2) range queries does not exist if there are at least two fields i and j, and two integers a and b such that 2 S a < F1, 3 S b S Fj and ab = M. Proof: It is sufficient to prove the lemma for files consisting of only two fields. Assume perfect optimal distribution for type (0 - 2) range queries exists. Thus, we have a distribution for this file system which is perfect optimal for type (0 - 2) range queries. LetS= {S1, . . . ,sa} beasubsetoffi,wherea_>.2ands1 >0, s2 =s1+1, . . . ,sa = sa_1+1, and T = {t1, . . . , tb} be a subset offj, where b 2 3 and t2 = t1+1, . . . , tb = tb_1+1 such that ab = M. In other words, S and T are a and b consecutive values in field i and j, respectively, where IS x Tl = M. Clearly, when A and B are sets of con- secutive elements in field i and j, respectively, such that IA x Bl < M, M(A, B) is a set whose size is IA x Bl. Thus, the sets M(s1, T), . . . , M (sa, T) are mutually disjoint and M (S, T) = ZM (otherwise, the distribution is not strict optimal for query q1 = [r(s1, sa), r(t1, tb)]). Since the sets M(s1—1, T), . . . , M(sa—l, T) are also mutually disjoint, and the size of cartesian product of these sets is M, M (s1—1, T) = M (so, T) (Figure 4.3 gives a pictorial view of these steps). We will show a contradiction for either case when the size of T is even, or odd. (case 1) b is even, (i.e., b 2 4) Let T11 = {t1, . . . , rm}, and T12 = {t1,/2.11,. . . , tb}. The sets M(s1—l, T11) and M (sa, T11) are disjoint (otherwise, the distribution is not strict optimal for query q2 = [r(s1-1, Sa), r(t1, t1, 0)]. Note that the number of qualified buckets for qz is less than 75 1". fi Device No 51-1 _____ _ _T_11 _ F _ _> _ _T_12 _ : Sl _ _ _ E . _T_11 _ _T_12 _ : _ 5 Do not have _ : qualified 1 E bucket for Sr ————— I 43 _ -731 _ i _ _T_12 _ E s. _____ I _ _T_1_1 _ L _T_12 Figure 4.3. Nonexistence of Perfect Optimal Distribution for Type (0 - 2) Range Queries 76 M). Thus, M(s1—1,T11) is equal to M(sa, T12). and M(s1-1, T12) is equal to M(sa’ T11)' Now, let us consider range query q3 = [r (s1-1, so), r(t1, $12.11)]. Suppose the bucket is stored at device mo, and the bucket is stored at device m1. Since tb,2+1 e T12 and M(s1—1, T12) = M(sa, T11), mo is an element of M(sa, T11). Similarly, m1 is an element of M(s1-1, T11). Thus, device mo and m1 have at least two qualified buckets for query q3. However, the devices in a-l UM (sk, {t1,/2+2, , tbj) do not have any qualified buckets for query q3. This contrad- k=l lets the assumption. (case 2) b is odd Let T11, = {11, . . . , tb’} and T12, = {tb’+1: . . . , tb}, WhCI‘C b, = 'b—:1—. T116 8618 M (s1—1,T11’) and M (sa,T11’) are disjoint (otherwise, the distribution is not strict optimal for query q4 = [r(s1—l, sa), r(t1, tbr)]. Note that the number of qualified buck- ets for (14 is less than or equal to M because (a +1)(P;2*i) S ab when a 2 2 and b 2 3). However, M(s1—1, T) = M(sa,T) and M(s1—1,T11’) n M(sa,T11’) = 0 cannot be satisfied at the same time because |T11’I > ITI/2. This is a contradiction. CI Example 4.3. This example explains the proof of Lemma 4.1 through an example file system. Let f1 = {0, 1, 2, 3}, f2 = {0, l, 2, 3} and M = 8. Note that this file system satisfies the conditions of Lemma 4.1. Let buckets be distributed as in Figure 4.4, where mk, k = 0, , 15, denotes an element in Z 3. Since we have only eight devices, m1, at m, is not implied by u at v in the table. Suppose Figure 4.4 is a perfect optimal distribution for type (0 - 2) range queries. Then, (mo, m1, . . . , m7} = {m4, m5, . . . , m11} =Z3. Otherwise, the distribution is not strict optimal for at least one of the range queries q1 = [r(O, 1), *] and q2 = [r(l, 2), 77 f1 f2 DeviceNo 0 0 m0 0 1 m1 0 2 m2 0 3 m3 1 0 m4 1 1 m5 1 2 m6 1 3 "l7 2 0 m3 2 1 m9 2 2 mm 2 3 m11 3 0 M12 3 1 "113 3 2 m14 3 3 "215 Figure 4.4. Bucket Distribution When F 1 = 4, F 2 = 4 and M = 8 *]. This implies that {m0, m1, m2, m3} = {m3, m9, m11), m11}. Since {m0, m1} and {m3, m9] are disjoint (otherwise, the distribution is not strict optimal for range query q3 = [r(O. 2).r(0. 1) l). {mo.m1} = (mm. mm} and {mama} = {m81m9}. Let us con- sider the range query q4 = [r(O, 2), r(O, 2)]. Then, the devices which have the qualified buckets for this range query are m0, m1, m2, m4, m5, m6, m3, m9, m11). Here, m2 is one of mg or 1719, and m11) is one of mg or m1. Thus, at least two devices have more than one qualified buckets for this range query. However, m7 does not have any qualified bucket for q3. This contradicts the assumption. Lemma 4.2. When a file consists of n fields (n 2 2) and the given number of devices M is equal to four, perfect optimal distribution for type (0 - 2) range queries does not exist if there are at least two fields i and j such that F,- 2 3 and F J- 2 3. Proof: It is sufficient to prove the lemma for files consisting of only two fields. Sup- pose perfect optimal distribution for type (0 - 2) range queries exists. Thus, we have a 78 distribution for this file system which is perfect optimal for type (0 - 2) range queries. Let (mg, m 1 , m2, m3} be the set of four devices. Since any two of buckets <0, 0>, <0, 1> and <0, 2> should not be stored at the same device, let M (0,0) = {mg}, M(O, 1) = {m1} and M(0,2) = {mg}. Since M(O, {0, 1}) and M(l, {0, 1}) are disjoint (otherwise, the distribution is not strict optimal for query q1 = [r(O, l), r(O, l)]), and M(l, {0, 1}) and M(2, {0, 1}) are disjoint, M(O, {0, 1}) = M(2, {0,1 }). Since M(O, 0) is not equal to M(2, 0), M(O, 0) = M(2, l) and M(O, l) = M(2, 0). Thus, bucket <2, 0> is stored at device m1, and bucket <2, l> is stored at device m0. Since bucket <2, 2> cannot be stored at any one of the devices m0, m1 and m2 (otherwise, the distribution is not strict optimal for at least one of the range queries q2 = [2, r(O, 2)] and q3 = [r(O, 2), 2]), M(2, 2) = {m3}. Now, for any allocation of bucket <1, l> to any one of the devices mg, m 1, m 2 and m3, the distribution is not strict optimal for at least one of the queries q4 = [r(O, l), r(l, 2)] and q5 = [r(l, 2), r(l, 2)]. This is a contradiction. El Theorem 4.1. When a file consists of n fields (n 2 2) and the given number of devices is M, there does not exist a perfect optimal distribution for type (0 — 2) range queries if ( 1) there are at least two fields i and j, and two integers a and b such that 2 S a < F1, 3 S b S F1 and ab = M, or (2) there are at least two fields i and j such that F,- 2 3, F1 2 3 and M = 4. Proof: This is a direct consequence of Lemma 4.1 and Lemma 4.2. [3 Thus, there does not exist a data distribution method which guarantees perfect optimal distribution for type (0 - 2) range queries even for every file with only two fields. 4.3.2. Type (0 - 1) Range Queries In this section we show that for files with three or more fields, there does not exist perfect optimal distribution for type (0 - 1) range queries in many cases. 79 Theorem 4.2. When a file consists of n (n _>. 3) fields and the given number of devices is M, there does not exist a data distribution which is strict optimal for any type 0 range query that has at most two unspecified fields, and for any type 1 range query that has at most one unspecified field if (1) there are at least three fields i, j and k such that F ,- S F ,- is stored at device m0, bucket <0, 0, 1> is stored at m1, . . . , and bucket <0, 0, F 3—1> is stored at mp3_1. Clearly, for any u 6 f1, v 6 f2 and a set W which contains some con- secutive elements in f3 such that IW | < M, M(u,v, W) is a set whose size is IW I, and hence m0, . . . , m1.-3-1 are all different. We will show that buckets <0, 0, b> and should be stored at device mo. At first, M(O, 0, f3) is equal to M(O, 0, f3) (other- wise, the distribution is not strict optimal for at least one of the range queries q1 = [0, r(O, c-l), *] and q2 = [0, r(l, c), *]). Thus, the possible set of devices at which bucket <0, c, b> can be stored is (mo, . . . , mp3_1}. Bucket <0, 0, b> cannot be stored at any one of the devices m 1, . . . , mb. Otherwise, for range query q3 = [0, *, r(l, b)], at least one of the devices m1, . . . , m1, has two or more qualified buckets while there exist a device which does not have any qualified bucket. Note that the number of qualified buckets for q3 is M. Bucket <0, 0, b> cannot also be stored at any one of the devices mb+1, . . . , mp3_1. Otherwise, the distribution is not strict optimal for range query q4 = [0, *, r(b, F 3—1)]. Note that the number of qualified buckets for q 4 is less than or equal to M. Thus, bucket < O, c, b> should be stored at device m0. (Figure 4.5 gives a pic- torial view of these steps.) 80 f1 f2 f3 Device No 0 0 0 b 0 0 Fir-1 0 1 I) c-1 0 c 0 0 c b 0 c F3-1 1 . c’-1 : c 0 O . o P c 0 173—1 Figure 4.5. Nonexistence of Perfect Optimal Distribution for Type (0 - 1) Range Queries Ill 81 Now, in order for the distribution to be strict optimal for queries q5 = [r(O, c-l), 0, *] and 46 = [r(l, c), 0, *], M(c, 0, f3) should be the same as M(O, 0, f3). Thus, the pos- sible set of devices at which bucket can be stored is {m0, . . . , m1.-3-1}. Bucket cannot be stored at any one of the devices m 1 , . . . , mp3_1. Otherwise, the distribution is not strict optimal for at least one of the queries 47 = [*, 0, r(1,b)] and qg = [*, 0, r(b,F3—1)]. Thus, bucket should be stored at device m0. How- ever, this bucket distribution is not strict optimal for query q9 = [*, r(O, c), b] because device m0 has at least two qualified buckets for q9 while there exists a device which does not have any qualified bucket for q9. Note that cF 3 = M and F3 > F 2 + F 2F 3 / M imply that the number of qualified buckets for qg is less than or equal to M. This completes the proof. III Corollary 4.1. When a file consists of n (n 2 3) fields and the given number of devices is M, perfect optimal distribution for type (0 - 1) range queries does not exist if the same conditions of Theorem 4.2 are satisfied. Corollary 4.2. When a file consists of n (n 2 3) fields and the given number of devices is M, there does not exist a data distribution which is strict optimal for any range query that has at most one range specified and at most one unspecified field if (1) there are at least three fields i, j and k such that F,- < Fj < Fk, and M < F1Fk S Fij S 2M, and F1, 2 F j + F 1F], / M, and (2) there are two integers b and e such that bFJ- = CF 1, = M. Proof: It can be observed that Theorem 4.2 is proved by considering range queries (q1 through qg) which have at most one unspecified field. The only one possible exception is query qo which may have two unspecified fields if F j = c + 1. When the condition F ,- S F, < Fk is replaced by F,- < F j < F k from those of Theorem 4.2, the query q9 is guaranteed to be a type 1 range query (i.e., r(O, c) is a range specification) because F j > F; 2 c + 1. Thus, the corollary follows. III By Corollary 4.2, there does not exist a data distribution method which guarantees strict 82 optimal distribution for any range query that has at most one range specified and at most one unspecified field even for every file with only three fields. This also implies that there is no data distribution method which guarantees perfect optimal distribution for type (0 - 1) range queries for every file with only three fields. 4.3.3. Type 0 Range Queries When each field size and the given number of devices are power of 2, by Corollary 3.4 perfect optimal distribution for type 0 range queries is always possible if the number of fields whose sizes are less than the given number of devices, is no greater than four. The data distribution methods for type 0 range queries are also presented in chapter 3. In [Sun87] it has been shown that perfect optimal distribution does not exist for binary cartesian product files with n fields, if M 2 4 and n 2 [logZMJ + 2. This result implies that perfect optimal distribution for type 0 range queries is not always possible for files with four or more fields. However, for general file systems, sufficient condition for either existence or nonexistence of perfect optimal distribution for type 0 range queries has not been found. We have shown through Theorem 4.1 and Theorem 4.2 that there are inherent limi- tations to achieve perfect optimal distribution for range queries even for a file consisting of small number of fields. In the following sections we present optimal data distribution methods for range queries. These methods are the extended version of FX distribution methods presented in chapter 3. 4.4. Optimal Data Distribution Methods for Range Queries In this section we present FX distribution methods for range queries. We describe basic optimality conditions in section 4.4.1. In section 4.4.2 and the following sections we will present field transformation functions developed for range queries. The condi- tions to achieve optimal distribution by these field transformation techniques will also be 83 described. For convenience we assume, from now on, that each field size and the given number of devices M are power of 2. 4.4.1. Optimal Distribution by Basic FX Distribution Method In this section we describe conditions for optimal distribution for various types of range queries by using the Basic FX distribution method. The definition of the Basic FX distribution method is given in section 3.3. Example 4.4. Figure 4.6 shows the bucket distribution by the Basic FX distribution method, where f1 = {0, l, 2, 3}, f2 = {0, 1, 2, 3} and M = 4. Here, Device N0 = TM [J1 [+] J 2], where J1 6 f1, J 2 6 f2 and TM returns the rightmost two bits of the result of J 1 [+] J 2. f1 f2 Device No 000 000 0 000 001 1 000 010 2 000 011 3 001 000 1 001 001 0 001 010 3 001 011 2 010 000 2 010 001 3 010 010 0 010 011 1 011 000 3 011 001 2 011 010 1 011 011 0 Figure 4.6. Basic FX Distribution When F 1 = F 2 = 4 and M = 4. We can verify that the distribution of Figure 4.6 is perfect optimal for type (0 - 1) range queries but is not perfect optimal for type (0 - 2) range queries because it is not 84 strict optimal for query [r(O, 1), r(O, 1)]. In fact, by Theorem 4.1 there does not exist a perfect optimal distribution for type (0 - 2) range queries for this file system. Since we have shown that FX distribution methods are strict optimal for mosr class of type 0 range queries in chapter 3, we will not discuss the cases when only type 0 range queries are involved. Lemma 4.3. For any type 0 range query which has at least one unspecified field whose size is greater than or equal to M, all the devices have equal number of qualified buck- ets by the Basic FX distribution method. Proof: The proof is almost the same as that of Theorem 3.2. III Theorem 4.3. The Basic FX distribution method is strict optimal for any range query, for which there exists at least one unspecified field whose size is greater than or equal to the given number of devices M. Proof: Since a range query is a set of partial match (i.e., type 0 range) queries, and by Lemma 4.3 we know that equal number of qualified buckets are distributed in all the devices for each partial match query, the proof immediately follows. I] Corollary 4.3. The Basic FX distribution method is strict optimal for any range query, for which there exists at least one range specified field such that the specified range size is an integral multiple of the given number of devices M. Proof: When the specified range size is F1, it is easy to see that the effect of the Basic FX distribution method for this range is the same as that of the Basic FX distribution method for an unspecified field whose size is F, (note that for any integers J 1 and J 2, TM(J 1 [+] J 2) = T M(J 1) [+] T M(J 2)). Thus, the corollary follows. III Theorem 4.4. The Basic FX distribution method is strict optimal for any range query, for which there is at most one range specified field and all the other fields are specified as single values. Proof: The proof is almost the same as that of Theorem 3.1. E] 1. 85 Theorem 4.5. When all the field sizes are greater than or equal to the given number of devices M, the Basic FX distribution method is perfect optimal for type (0 - 1) range queries. Proof: When all the fields are specified as a single value, the proof is trivial. When at least one field is unspecified, the proof follows by Theorem 4.3. The only remaining case is when one field is specified as a range and all the other fields are specified as sin- gle values. The proof for this case follows by Theorem 4.4. III Theorem 4.3, 4.4 and 4.5 show sufficient conditions for optimal distribution by the Basic FX distribution method. However, the Basic FX distribution method does not give optimal distribution for many types of range queries. The following proposition gives the conditions for optimal distribution for these cases. Proposition 4.1. Let q(f) = {i1, i2, , ik} be the set of range specified or unspecified fields for a range query q. Let S1} be the set of elements in the range when field i J- is range specified, or be f1}. when field i ,- is unspecified. Then, FX distribution methods are strict optimal for a range query q, if there exists a set of fields {1' 1, , i j} c; q(f) such that |S1l>< >< x51}, | TM [H1019] :2} = |S,-lx x51). I/Mfor allze ZM. p: Proof: The proof is similar to that of Theorem 3.2. [I Let S,- be the set of elements in the range when field i is range specified, or be f1 when field i is unspecified. Proposition 4.1 says that we can guarantee strict optimal distribu- tion for a given range query, if (1) there exists a subset of the range specified or unspecified fields {i 1, . . . , i I} such that |S1l x - - - x51]. | is an integral multiple of M, and (2) the records projected on these sets of fields are distributed uniformly among the M devices. l!I 86 In other words, optimal distribution for a subset of fields guarantees strict optimal distri- bution for many queries in which those fields are unspecified or range specified. However, when the size of none of the fields is greater than or equal to M, the con- ditions given in Proposition 4.1 are not satisfied in the Basic FX distribution method. Thus, in the next section we propose field transformation techniques for the fields whose sizes are less than the given number of devices M. Note that the definition of field transformation functions is given in section 3.4. These field transformation techniques increase the scope of optimality over that of the Basic FX distribution method. By the definition of field transformation functions (i.e., one—to—one function), it is easy to see that all the lemmas, theorems and proposition that hold for the Basic FX distribution method also hold for FX distribution methods (i.e., the Basic FX distribution method with field transformation functions). 4.4.2. Field Transformation Functions for Range Queries In the previous section conditions for optimal distribution have been described when the Basic FX disribution method is used. In this section we present field transfor- mation techniques which can improve the performance over the Basic FX distribution method significantly. The following paragraph exemplifies the idea. Whenf1 = {0, 1, 2, 3},f2 = {0, 1, 2, 3} and M = 8, the disribution by the Basic FX distribution method is not strict optimal for many range queies (we can easily see that the distribution in Figure 4.6 is not strict optimal for many range queries when M = 8). Suppose X is an one-to-one mapping such that X (0) = O, X (1) = 4, X (2) = 2, X (3) = 6. When each bucket in the file f1 x f2 is stored at the device it [+] X (v), the dis- tribution is strict optimal for any range query in which field 1 is not specified as a range. (It can be easily verified by substituting (100)B, (010),; and (110)19 for (0003, (010)]; and (011)B, respectively in f2 column of Figure 4.6. In fact this distribution is perfect optimal for type (0 - 1) range queries.) Thus, our objective is to find such mapping, X '1: 87 in general, for the given file system. It will be shown that for any values of F 1, F 2 and M, such mapping can be easily found. The following proposition gives sufficient condi- tion for the mapping X. Proposition 4.2. Let a file consist of two fields i, j whose sizes are less than the given number of devices M, and X be an injective function from N to ZM. Then, FX distribu— tion method with I-transformation for field i and X-transformation for field j is strict optimal for any range query in which field i is not range specified, if for any L consecu- tive valuesJ1, . . . ,JL in fieldjsuch thatLSM/Fi, {X(J1), . . . ,X(JL)} = {kxFi +0): I x=0, . . . ,L-l, andforallx,k,, e N,kx ZM is a function such that UMM’ '4' (1) = URMa) [+] (1 mod dM‘ 'f" ), where dM' '1’" = M/ I f, I . Example 4.5. When f1 = {0, 1, 2, 3],f2 = {0, 1, 2, 3, 4, 5, 6, 7} and M =16, (a) 08160.) = {0 8. 4. 121 and UMW‘tfi): {0.9.6.15}. (b) 01216082) =-{0, 8, 4, 12,2, 10, 6, 14}, and (11416302) = {0,9, 4, 13,2, 11,6, 15}. Lemma 4.4. When there are only two fields i, j such that I fjl is less than the given number of devices M, functions URM and UM M‘F’ satisfy the condition of the function X in Proposition 4.2. Proof: The proof of the lemma is straightforward from the definition of URM and MP, UM CI 89 From the definition of the function UR, we can observe the following relations. UR-property 1. Let f1 = {0, . . . , Fl—l} such that |f1l < M. Then, for any I 6 f1, URM(1) = URF‘(1)* M/F1. The proof of UR—property 1 is straightforward from definition. Since UR F’(f,) = fl, UR-property 1 says that when the domain size of a function UR is less than the range size, the values of function UR are the multiples of the range size / the domain size. UR-property 2. When f1 = {0, . . . ,F1—1} such that F, < M, URM(f1) = UM‘F’(f1). Since for any I 6 f1 such that If, I < M, URM(l) is a multiple of M/FI, and for any two different 11 and 12 6 f1, URM(I1) :t URM(12), it is easy to see UR-property 2. Thus, by UR-property 2, all the lemmas and theorems in [12, 20] which hold for U—transformation functions also hold for UR-transformation functions. Several useful characteristics of the function UM will be described in section 4.5.4 and 4.5.5. Because of notational complexity we will use the following conventions : When the parameter M of a function UR denotes the given number of devices, we will leave out the parameter by default. When the parameter of a function UR does not mean the given number of devices, we will use the notation BI instead of UR. For example, UR F’ in UR-property 1 will be denoted by BIF'. The parameter M and |f1I of function UM will be left out whenever there is no ambiguity. Proposition 4.2 defines the class of function X such that FX distribution methods with I (f1) and X (fj) gives strict optimal distribution for any range query in which field i is not range specified. That proposition is only for the case when field i is [- transformed. In the following proposition we generalize Proposition 4.2 to include the case when other transformation functions are applied to field i. Proposition 4.3. When a file consists of two fields 1', j whose sizes are less than the given number of devices M, let X ,- and X j be transformation functions applied to field i 90 and j, respectively. Let d,- = M/F1. Then, FX distribution methods with X1(f,-) and X j(fj) are strict optimal for any range query in which field i is not range specified if (1) X1(f,-) [+]Xj(J1) and X1(f,-) [+]X1-(Jz) are disjoint for any J1, J2 e f,- such that J1[+]J2 M, and (2) perfect optimal for type (0 - 2) range queries when F 1F }- S M. Proof: (case 1) F1FJ- >M When one field is range specified and the other field is specified as a single value, the proof follows from Theorem 4.4. When field i is range specified and field j is 91 unspecified, the proof follows from UR-property 2 (i.e., UR (fj) = U (fj)) and Lemma 3.8. When field j is range specified and field i is unspecified, the proof follows by Lemma 4.4 and Proposition 4.2. When there is no range specification, the proof follows by UR-property 2 and Theorem 3.4. (case 2) HF S M From UR-property 2 and Lemma 3.8, each device has at most one bucket, and therefore the proof follows. C] Example 4.6. Let f1 = {0, 1, 2, 3},f2 = {0, l, 2, 3} and M = 8. Figure 4.7 shows the bucket distribution by the FX distribution method with I (f1) and UR (fz). Here, UR (fz) = {0, 4, 2, 6} and Device N0 = TM(I(J1) [+] UR (12)). J16 f1, 12 6 f2. I (f 1) UR Ifz) Device No 000 000 0 000 100 4 000 010 2 000 110 6 001 000 1 001 100 5 001 010 3 001 110 7 010 000 2 010 100 6 010 010 0 010 110 4 011 000 3 011 100 7 011 010 1 011 110 5 Figure 4.7. FX Distribution with I And UR Transformation We can verify that for any type 0 and type 1 range query the distribution of Figure 4.7 is strict optimal. 92 Before we describe the next section, it is worth considering the following question. Does there exist three field transformation functions X 1, X 2 and X3 such that for any two fields whose sizes are less than M, FX distribution methods with any two of X 1, X 2, X3 transformation functions are perfect optimal for type (0 - 1) range queries ? Unfor- nately, there do not exist such transformation functions because otherwise, it immedi- ately contradicts Theorem 4.2. For example, let a file consist of three fields 1, 2 and 3 such that F 1 = 4, F 2 = 4 and F3 = 8, and let M = 16. Suppose the above such functions X 1, X 2 and X3 exist, and let us apply X 1, X 2 and X3 transformation functions to field 1, 2 and 3, respectively. Then, it is easy to see that the distribution for this example file system should be strict optimal for any type 0 range query which has at most two unspecified fields, and strict optimal for any type 1 range query which has at most one unspecified field. However, by Theorem 4.2 we know that there does not exist such dis- tribution for this file system. This implies there does not exist a transformation function Y such that FX distribution methods with I and Y-transformation as well as UR and Y~ transformation is perfect optimal for type (0 - 1) range queries for every file with two fields. Thus, it is inevitable to have some restrictions for perfect optimal distribution for type (0 - 1) range queries when I and UM-transformation, and UR and UM- transformation are considered. 4.4.4. I and UM Field Transformation Functions In this section we show that for any values of F,-, F k and the given number of dev- ices M such that F1 S F k < M, FX distribution methods are perfect optimal for type (0 - 1) range queries when I—transformation is applied to field i and UM-transformation is applied to field k. Lemma 4.5. When there are only two fields i and k whose sizes are less than the given number of devices M, the FX distribution method with I—transformation for field i and UM-transformation for field k is strict optimal for any range query in which field i is not 93 range specified. Proof: This is a direct consequence of Lemma 4.4 and Proposition 4.2. III Lemma 4.6. When the size of field k is less than the given number of devices M, let d1, = M /F 1,. Then, for any two different nonnegative integers J 1 and J 2 such that J 1 and J; are within the same interval of size dk (i.e., J1 [+] J2 < dk), UM (fk) [+] J1 and UM (fk) [+] J 2 are disjoint. Proof: LetJ1=J’dk[+]c1andJ2 =J’dk [+] c2, where 1’, c1, c2 6 N, and c1 and c2 are less than dk. Assume the lemma is not true. Then, there exist two different K 1, K 2 e fk such that UR (K1) [+] (K1 mod dk)[+1JIdk [+101 = UR (K2) [+] (K2 mod dk)I+lJ’d/r [+102 Note that UM (K 1) = UR (K 1) [+] (K 1 mod dk) by definition. After removing J ’dk from both sides of the above equality, we have UR (K1) [+] (K1 mod dk)[+lc1 = UR (K2) [+1 (K2 mod dk)[+102 This implies that UR (K1) [+] UR (K2) = [(K1 mod dk) [+] c1] [+] [(K2 mod dk) [+] c2]. Since UR (K1) [+] UR (K2) > dk by UR-property 1, and (K1 mod dk) [+] c1, (K 2 mod dk) [+] c2 is less than dk, there is no way the above equality can be satisfied. This is a contradiction. CI Lemma 4.7. When the size of field k is less than the given number of devices M, UM (fk) [+] UM (K) = UM (fk) for any K e fk. Proof: Since UM (fk) [+] UM (K) are a set of F k different nonnegative integers, it is sufficient to show that for any two different K 1, K 2 in fk, (UM (K 1) [+] UM (K 2)) e UM(fk). Let dk=M/Fk. Now, UM (K1) [+] UM (K2) = [UR (K1) [+] (K1 mod 41.)] [+] [UR (K2) [+] (K2 mod 40] = UR (K1) [+] UR (K2) [+] [(K1 [+] K2) mod dk] 94 By UR-property 1, UR (K1) [+] UR (K2) is equal to [3150(1) [+] 31’” * (K2)]dk (note that we defined BIK to denote URK when the parameter K of function UR does not mean the given number of devices M). This implies that UR (K1) [+1 UR (K2) =BIF"(K1 [+] K2)dk = UR (K1 [+] K2) Thus, UM(K1) [+] UM(Kz) = UR (K1 [+] K2) [+] [(KlI+lK2) mod dk] = UM (K 1 [+]K2). Since K 1 [+] K 2 e fk by Lemma 3.2, the proof follows. E] Lemma 4.8. Let the size of field k be less than the given number of devices M. Then for any K e fk, UM (fk) [+] UR (K) = UM (fk) [+] (K mod dk), Where d1, = M/Fk. Proof: Let K e fk. Then, UM (fit) [+] UR (K) = UM (fk) [+] UR (K) [+] (K mod dk) [+] (K mod dk) = UM (fr) [+] UM (K) [+] (K mod dk) Since UM (fk) [+] UM (K) = UM (fk) by Lemma 4.7, the proof follows. E] Lemma 4.9. When there are only two fields i and k whose sizes are less than the given number of devices M and F; S F k, the FX distribution method with I-transformation for field i and UM-transformation for filed k is strict optimal for any range query in which field k is not range specified. Proof: Let dk = M/Fk. By Lemma 4.6, for any two different K 1 and K 2 which are in the same interval of size dk, UM (fk) [+] K1 and UM (fk) [+] K 2 are disjoint. Thus, by Proposition 4.3 it is sufficient to show that for any I 6 f1, UM (fk) [+] J = UM (fk) [+] (J [+] 0111,, ), where Otis any nonnegative integer such that ordk < F1. Now, UM (ft) [+1 out = UM (it) [+] UR (31h (00) = UM (ft) [+] (31%.) mod do The first equality holds because UR (31“ (01)) = BIF * (31F * (q))dk by UR-property 1, and BI F‘ (81 F‘ (01)) = or. The second equality holds by Lemma 4.8. Since at < F k/d;c II! 95 (because F1- SFk), this implies that BI F‘ (at) is a multiple of dk for all at = 0, . . . , F1/dk—l. Thus, BIF‘ (0t) mod dk = 0 and therefore UM (fk) [+] ordk = UM (fk) for all or = 0, . . . , F1/dk—1. This completes the proof. [I Theorem 4.7. When there are only two fields i and k whose sizes are less than the given number of devices M, the FX distribution method with I-transformation for field i and UM-transformaton for field k is (l) strict optimal for any range query in which field i is not range specified if F; > F k, and (2) perfect optimal for type (0 - 1) range queries if F ,- S Fk, and (3) perfect optimal for type (0 - 2) range queries if F 1F k S M. Proof: For (1) and (2), the theorem is a direct consequence of Lemma 4.5 and Lemma 4.9. When F 1F k S M, since F 1—1 < M /F k, the theorem is a direct consequence of Lemma 4.6. E] Example 4.7. Let f1 = {0, 1, 2, 3},f2 = {0, 1, 2, 3} and M = 8. Figure 4.8 shows the bucket distribution by FX distribution method with 1 (f1) and UM (fz). Here, UM (fz) = {0, 5, 2, 7} and Device No=TM(I(J1)[+]UM(J2)),J1e f1,J2 6 f2. We can verify that for any type 0 and type 1 range query the distribution in the figure is strict optimal. 4.4.5. UR and UM Field Transformation Functions In this section we discuss the cases of optimal distribution when UR and UM field transformation functions are used. Lemma 4.10. Let the size of field k be less than the given number of devices M. Then, for any K e fk, K mod (1 = UM (K) mod d, where d is a power of 2 which is less than or equal to M /Fk. Proof.‘ Let dk = M/Fk. UM (K) mod d = [UR (K) [+] (K mod dk)] mod d = [UR (K) mod d] [+] [(K mod dk) mod d] 96 [(fl) UM (fz) Device No 000 000 0 000 101 5 000 010 2 000 111 7 001 000 1 001 101 4 001 010 3 001 111 6 010 000 2 010 101 7 010 010 0 010 111 5 011 000 3 011 101 6 011 010 l 011 111 4 Figure 4.8. FX Distribution with I And UM Transformation Since UR (K) is a multiple of dk by UR-property 1, and dk is a multiple of d, UR (K) modd = 0 and (K mod d1,) modd = K mod d. Thus, the proof follows. [:1 Lemma 4.11. When there are only two fields j and k whose sizes are less than the given number of devices M, the FX distribution method with UR-transformation for field j and UM-transformation for field k is strict optimal for any range query in which field j is not range specified, if (1) F ,- 2 F k or (2) F ,F k S M. Proof: When one field is range specified and the other field is specified as a single value, the proof follows from Theorem 4.4. Thus, we have only to consider the case when field j is unspecified, and field k is range specified or unspecified. (case 1) F,- ZFk Let dj = M/Fj and dk = M/Fk. Then, dj S dk. By UR-property 2, UR(fj) = U(f1), and by Lemma 4.10, UM (K) mod dk = K mod dk for any K in fk. Since d ,- S dk, by Lemma 97 3.8 the proof follows. (case 2) F ij S M This is a direct consequence Lemma 4.10 and Lemma 3.8. CI Lemma 4.12. When there are only two fields j and k whose sizes are less than the given number of devices M, the FX distribution method with UR-transformation for field j and UM-transformation for field k is strict optimal for any range query in which field k is not range specified, if (1) F ,- S F1, or (2) F in S M. Proof: ’ (case 1) F j S Fk Let dk = M /F 1,. By Lemma 4.6, we know for any two different nonnegative integers J1 and J 2 such that J 1 and J 2 are within the same interval of size dk (i.e., J 1 [+] J 2 < dk), UM (fk) [+] J1 and UM (fk) [+]12 are disjoint. Thus, by Proposition 4.3 it is sufficient to show that for any J e fj, UM (fk) [+] UR (J) = UM (fk) [+] UR (J [+] adk), where or is any nonnegative integer such that ordk < Fj. Now, since J is also an element of fk because F J- S Fk, by Lemma 4.8 UM (fk) [+] UR (J) = UM (fk) [+] (J mod dk) for any J e fj. This implies that UM (fit) [+] UR (J [+] adk) = UM (fir) [+] [(J [+] (Id/c) mod dk] = UM(f/t) [+] (J mod dk) Note that J [+] ord,c is also an element of fk. Thus, by Proposition 4.3 the proof follows. (case 2) Fij S M This is a direct consequence of Lemma 4.10 and Lemma 3.8. CI Theorem 4.8. When there are only two fields j and k whose sizes are less than the given number of devices M, the FX distribution method with UR-transformation for field j and UM-transformation for field k is (1) strict optimal for any range query in which field j is not range specified if F j 2 F k, and (2) strict optimal for any range query in which field k is not range specified if F J- S F k, and (3) perfect optimal for type (0 - 1) range queries if F ,- =F k, and (4) perfect optimal for type (0 - 2) range queries if I! 98 F ij S M. Proof: The theorem is a direct consequence of Lemma 4.11, Lemma 4.12 and Lemma 3.8. [:1 Example 4.8. Let f1 = {0, 1, 2, 3},f2 = {0, 1, 2, 3} and M = 8. Figure 4.9 shows the bucket distribution by FX distribution with UR (f1) and UM (fz). Here, UR (f1) = {0, 4, 2, 6}, UM(f2) = {0, 5, 2, 7} and Device N0 = TM(UR(J1)[+]UM(12)), J16 f1, J2 6 f2. UR (f1) UM (fz) Device No 000 000 0 000 101 5 000 010 2 000 111 7 100 000 4 100 101 1 100 010 6 100 111 3 010 000 2 010 101 7 010 010 0 010 111 5 110 000 6 110 101 3 110 010 4 110 111 1 Figlire 4.9. FX Distribution with UR And UM Transformation We can verify that the distribution is perfect optimal for type (0 - 1) range queries. 4.4.6. 1, UR and UM Field Transformation Functions In this section we discuss the cases of optimal disribution when 1, UR and UM- transformation functions are used. 99 Theorem 4.9. When there are only three fields i, j and k such that F ,- SF j SF 1, < M, FX distribution methods can be always (1) perfect optimal for type 0 range queries, and (2) strict optimal for any type 1 range query which has at most one unspecified field if F j =Fk, orFiFk SM. Proof: (1) is true by Corollary 3.4. Thus, let us consider only the case (2). When F,- S F j =Fk, let field i be I-transformed, field j be UR-transformed and field k be UM- transforrned. When F1Fk SM, let field i be UR-transformed, field j be I-transformed and field k be UM-transformed. Then, for both cases, the theorem is a direct conse- quence of Theorem 4.6, Theorem 4.7 and Theorem 4.8. CI Note that by Theorem 4.2 there does not exist a data distribution method which guaran— tees strict optimal distribution for any range query in Theorem 4.9 for every file with three fields. We have shown through lemmas and theorems that FX distribution methods along with various combinations of field transformation functions give optimal distribution for many types of range queries. Here, it should be emphasized that the scope of optimality is increased significantly by these field transformation techniques along with Proposition 4.1. This is because by Proposition 4.1 optimal distribution for a subset of the fields guarantees strict optimal distribution for many range queries in which those fields are range specified or unspecified. CHAPTER 5 NODE PARTITIONING SCHEMES FOR B-TREES 5.1. Introduction In chapter 3 and 4 we have described data distribution strategies for multikey search queries which access a set of records for each query. These are type A1 applica- tions based on the classification in section 1.4. In this chapter we will investigate data distribution strategies for parallel processing of type A2 applications, i.e., the database query which accesses a record based on primary key. The object is to enhance con- currency by parallel processing of tree type index structure for the database stored in the external storage. Parallel processing strategy for key based access main memory data- bases can be found in [Cev88]. B-trees are the most commonly used tree type index structure for external data- bases [Bay72, Com79]. For the database stored in the secondary storage, the height of an index tree is the most important parameter in data retrieval time. One approach to reduce the height of a B-tree is to partition a file horizontally, and construct a B—tree for each subset of the partition. These B-trees are searched in parallel. When the database size is D and is partitioned into p subsets, the height of each B-tree is approximately logf(D /p), where f is the fan-out of an index node. Another approach to reduce the height of a B-tree is to use a large node B-tree for the whole file. When the size of index nodes is increased p times, the height of a B-tree becomes about longD. The second method gives smaller height because D > pf for most practical cases. However, this approach alone may not improve the overall performance because large nodes result in long block transfer time and more main memory processing time. 101 We propose a node partitioning scheme for large node B-trees for parallel process- ing. In the proposed scheme, each node is partitioned into multiple subnodes to be dis- tributed for parallel processing. We also propose a new approach to process these sub- nodes. We will call the proposed B-tree structure Partitioned Node B-trees (PNB- trees), while the standard B-trees will be called conventional B-trees. With respect to the two stage processing model in Figure 2.2, PNB-trees are based on partitioned node B-tree data construction, and random or object specific data distribution. The main results presented in this chapter are that the parallel processing of PNB- trees reduces access time, increases throughput and minimizes the frequency of tree res- tructuring. The basic structure of PNB-trees is based on B-trees. However, we use a different approach to process and search PNB-trees. The PNB-tree approach exploits parallel scanning by distributing multiple subnodes of a B-tree node among parallel disks. The rest of this chapter is organized as follows. In section 5.2 we present the basic structure of PNB-trees, and the search and update algorithms. The hardware environ- ment for the PNB-tree construction is also discussed in this section. The important parameters of the PNB-tree are described in section 5.3. Section 5.4 presents a perfor- mance comparison between PNB-trees and conventional B-trees. Finally, PNB-tree construction for various disk technology is discussed in section 5.5. 5.2. The PNB-tree PNB-trees are constructed on synchronized disks where read/write heads of all the disks are located at the same position. Synchronizing disks for disk interleaving was proposed in [Kim86] where multiple disks are interleaved like main memory interleav- ing. It has been shown in that paper that the response time improves considerably by using disk interleaving techniques when block size is large. This is because synchron- ized disk interleaving increases data transfer rate significantly. In PNB-trees, PI} 102 synchronized disks are used to exploit searching in parallel. The PNB-tree is a B—tree with each node partitioned into s subnodes, and all the subnodes of a node are stored on s different disks. The format of an index node is (¢,P0)(K1,P1)(K2,P2) - - - (Km,P,,,) where K,- is the key value and P,- is the address of the child subnode in each disk. This index node is partitioned into 3 subnodes, and all the subnodes of a node are accessed by the same address pointer. All the subnodes have the same format except the first one, where the key value in the first record is omitted. The leaf nodes which contain the data are also partitioned into s subnodes. However, PNB-trees differ from conventional B-trees in that the key values within a node in PNB-trees are not required to be in sorted order. This needs extra computation to deter- mine the child address, but the amount of data movement between disks is reduced con- siderably because a subnode does not overflow. The rule of locating a data record is defined recursively as follows : Let R = {(4) , Po),(K1, P1), (K2 , P2), ~° , (Km , Pm)} be a current index node. Then the data record with a key value K is in the descendant node pointed by P , if (Ky , P,) e R - {(1) ,P0)} and (K —Ky) is the minimum non- negative value among all (K —K,-) , i = l,...,m . If there is no nonnegative (K —- K1) , i = l,...,m, the desired record is in the descendant node pointed by P0. An example of a PNB-tree with a set of key values is given in Figure 5.1. It shows that each node is partitioned into two subnodes being stored on disks D1 and Dz. Note that the values in a node are not kept in sorted order. When the key value e.g., 25 needs to be searched, we first compute the differences 25 —- Ky for all the key values Ky in the root node. These values are <—7, 4, -66, -46>. By following the pointer associated with Ky such that 25 - Ky = 4, the node <21, 25, 23, 29> is obtained. Here it is assumed that all data values are stored in leaf nodes. jl 103 Dr D2 3 2 9171 \O D1 D2 '01 02101 DZID1 D I , II IIIII IIII IIIII Eltlll IIIII IIEI 2 IIII 12l- Figure 5.1. Partitioned Node B-tree The reason for the use of unsorted nodes is to minimize the inter-device data move— ment. Suppose that a node is in sorted order, and is partitioned and stored as above. Whenever a subnode overflows, data movement between disks is required. However, this does not happen in unsorted node construction because inserted records can be placed in any subnode of a node. 5.2.1. The PNB-tree Operations (1) LOOKUP The lookup of a record starts from the root node and continues until the leaf. All the subnodes of a node are searched in parallel. Let P be the address of the node currently being accessed. Initially, P is the address of the root. The lookup procedure consists of two phases. At first, (K , P) is broadcast to all the disks, where K is the desired key value. If P is the address of a leaf node, find the record and stop. If not, each disk finds an index record (K1, P1) in its subnode such that (K —K,-) is the minimum nonnegative value. In the second phase, find the minimum of these minimum nonnegative values at each disk. If no nonnegative value is found at any disk, repeat the first and the second phase by replacing P with P0. Otherwise, repeat the same by 104 replacing P with P j, where (K j , P j) is an index record and K J- is the overall minimum nonnegative value. (2) INSERTION To insert a record, apply the lookup procedure to find the desired leaf node. We then insert the record into any place within the node, if it has empty room. If the node is already full, it is sorted and then split into two nodes. The effect of node splitting on its parent node is handled recursively in the same way as in standard B-trees. Figure 5.2 shows the PNB-tree configuration after the key value 57 has been inserted into the tree of Figure 5.1. (3) DELETION The node underflow in PNB_trees is handled the same way as in standard B-trees. PNB-trees have an advantage over B-trees because the holes created by deletion can be easily filled by any records inserted. This is because the values within a node need not be kept in sorted order. In conventional B-trees the holes can also be used but only at the cost of sorting the node each time. Figure 5.3 shows the PNB—tree configuration when key value 71 is deleted from the tree of Figure 5.2. 5.2.2. Parallel Disks Configuration for PNB-trees For each disk we assume a simple microprocessor with a few thousand bytes of memory. All the subnodes of a node are processed locally by these processors. Only the matched record is sent to the host processor. For the second phase of a lookup pro- cedure, we use a minimum finding hardware module to which all the disks are directly connected. This module is a simple extension of a multiple input comparator. As soon as each disk finds an index record (K1-,P1) such that "the desired key value — K1" is minimum nonnegative value of that disk, it sends that index record to this minimum finding hardware. If a disk does not find a nonnegative value, it sends (—1, null). The 105 Dr 1132 III 9 III A D ‘D D, .D 91 - 98 — 7 - 87 - Figure 5.2. PNB-tree After Insertion of Key Value 57 Dz . 51 III D2 D 1 02 D1 1)2 I 87 II —II II21II25II 23II29I| IIII EIEI Figure 5.3. PNB-tree After Deletion of Key Value 71 106 minimum finding hardware determines the overall minimum nonnegative value and then broadcasts the corresponding address of the child node to all the disks. When all the disks reply with (—1, null), the address of a child node is the first pointer in the first sub- node of a node. Since the communication involved is very simple and communication distance is short, we expect that the time for these operations is negligible compared to disk access time. 5.3. Motivations of PNB-trees The motivations of PNB-trees are to reduce the tree height and to reduce the fre- quency of tree restructuring. This is because the height of index trees and the frequency of tree restructuring are very important parameters for the performance of external data- bases. 5.3.1. Compressed Height Let s be the number of disks, n be the number of records, r be the size of a data record, and k be the key size. Let the block size be b bytes and the pointer size be p bytes, where a block is a unit of data transfer. This block corresponds to a subnode in PNB—trees and a node in conventional B-trees. Let h be the height of a PNB-tree, where h includes the level of data nodes. To compute the worst case height we assume every block to be half-full except the root node. Then, the maximum number of leaf nodes is approximately shb b "'2 k+p 2(k+p) Since we need about 2nr/b data blocks (leaf nodes), the relation shb b h—z > _2n_r k+p 2(k+p) _ b should be satisfied [Knu73]. For all the examples in this chapter we will use r = 200 107 bytes, k = 30 bytes andp = 4 bytes. Then, b 2 2500 if h = 3, s = 4 and n = 1 M. There- fore, 3 Kbyte block size is enough for making 2 level indices in PNB-trees. Even for large files with 60 M records, the block size of 5 Kbyte guarantees 2 level indices in PNB-trees when eight disks are used. On the other hand, 4 level indices are needed for this file when the conventional B-tree of 5 Kbyte block size is used. To guarantee 2 level indices in conventional B-trees we need to use the block size of about 48 Kbyte. In this case the block transfer time is quite significant. The node size of PNB-trees is large, but I/O and computation time improves significantly because the large node is split into smaller subnodes and these subnodes are processed in parallel. Since the height of a B-tree decreases with increasing node size, the number of disk accesses in PNB-trees is minimized. Furthermore, the block transfer time is reduced because only the subnode contributes to the block transfer time. Note that the height can also be reduced in conventional B-trees by making the node size large. But this results in a long block transfer time and more main memory processing time. 5.3.2. Reduced Frequency of Tree Restructuring The B-tree restructuring due to node overflow is expensive. One way to reduce the frequency of tree restructuring is to use a large node. When the node size increases k times, the frequency of tree restructuring due to node overflow reduces by a factor of l/ck, where c is slightly larger than one. This is because the number of nodes in a B- tree decreases almost linearly with node sizes. The detailed analysis for computing the average number of nodes in a B-tree is given in [Yao79]. 5.4. Performance Comparison In this section we compare the response time of PNB-trees with conventional B- trees which are stored on parallel disks. 108 5.4.1. Performance Models For conventional B-trees each data request is directed to the disk which contains the corresponding B-tree. Thus, a request queue is formed in front of each disk. In PNB—trees all data requests form one queue. Let X be a random variable for the disk service time of a single disk. Then, X = S(seek time) + L(rotational latency) + T(block transfer time). For conventional B-trees on parallel disks Rotational Positioning Sensing (RPS) miss delay should be added for disk access time [Kim86]. RPS miss delay happens due to the channel contention by concurrent disk I/Os. Let X ,- be the disk service time for disk i in conventional parallel disks. Then, X1 = S + L + T + RPS 1. We define the disk response time to be the disk service time plus the queue waiting time for the disk service. The queue waiting time is computed as follows. The arrival process for disk access requests is assumed to be Poisson process with mean R. L is uniforrrrly distributed with mean one half of one disk rotation time, S is assumed to be exponentially distributed and the number of RPS,- rrrisses is assumed to be geometrically distributed [Kim86]. Since the distributions of S, L ,T and RPS,- are known, the mean and the variance of X 1 can be computed. We choose M/G/l queueing model to compute average queue waiting time. In M/G/l queueing model, the queue waiting time for disk i, denoted W1, is given by ammmwmfl 2(1-Ri*E 1X11) where R,- is the request rate for disk i [Ross85]. Thus, the average disk response time for disk i is E [X1] + W,-. Let Z ,- be the disk response time for a j—th level node of a B-tree. We define the data response time Z to be Z 1 +22 + . - . + 2),, where h is the height of the B-tree. Thus, the average data response time E [Z] = hE [Z I]. Here, for all j = 1, ..., h, s E%P%2@WHWD i=1 109 where sis the number of disks. For the PNB—tree model, R (Var (X)+E [X 12) E[Zj]=E[X]+ 2(1—R*E[X]) S for all j = l, . . . , h, where R is equal to 2R1. Note that there is only one queue and i=1 there is no RPS rrriss delay in the PNB-tree model. As described in section 5.2, PNB-trees use two phase processing to locate a record. In the first phase, computation time is overlapped with data transfer time because pro- cessing can be done concurrently with the data transfer between a disk and a local buffer. For the second phase, we assume that it takes less than one millisecond to find the overall minimum value. 5.4.2. Average Data Retrieval Time In this section we use IBM 3380 disk parameters which are E[S] = 7.2 ms, E[L] = 8.3 ms and block transfer rate = 3 Mbyte/sec. We will compare the performance of conventional B—trees on parallel disks with PNB-trees on synchronized disks. As typical examples, four disks and eight disks are used. In conventional parallel disks, data requests are usually skewed for some disks. The following skewed data request patterns are used in the performance experiments. 1) 2) DI D2 D3 D4 DI DZ D3 D4 D5 06 D7 D8 .458 .371 .153 .018 .388 .225 .153 .102999 .000001 .010 .054 .068 Here, Di represents disk i and the number below Di represents the probability that some request is a disk i request. Note that there is no skewed effect in the PNB-tree model because there is only one queue. Let the number of records be 1 M and all other param- eters be the same as in section 5.3.1. When 4 disks are used, the data response time for conventional B-trees is 70.07 ms while data response time for PNB-tree is 54.09 ms. 110 We have used 4 Kbyte block size with a data request rate of 2 per second. When 8 disks are used, block size is 2 Kbyte and the data request rate is 2 per second, the data response time of conventional B-trees is 77.85 ms and that for the PNB-trees is 51.85 ms. The root node is assumed to be resident in main memory for all response time com- putation. The height of the B-tree and the PNB-tree is computed with the assumption that every node is half-full. We see that the use of small block size for conventional B- trees may increase the data response time due to the increased height. In PNB-trees we can have smaller block size with the same height by using more disks. Thus, the PNB-tree takes advantage of less block transfer time. In conventional disk systems with small block size, the block transfer time is negligible but the time due to increased disk accesses is significant. Many parameters need to be considered in comparing the response time of conven- tional B-trees on parallel disks and PNB-trees on synchronized disks. This is because the height of B-trees depends on the block size, the number of records, record size and key field size. We will use the same values for these parameters as in section 5.3.1 for the following experiments. We compare three different B-tree processing models. The first is conventional B-trees with height 4. The second is conventional B—trees with height 3, and the third is PNB-trees with height 3. The second model uses an increased block size to reduce the height of the conventional B-tree. In the figures B, LB and PNB denote the response time of the first, second and third model, respectively. Figure 5.4 shows response time for various data request rate when IBM 3380 disk parameters are used. In Figure 5.4.(a), we used a database containing 1 M records with eight parallel disks. The block sizes for conventional B-trees are 4 Kbyte and 14 Kbyte. The block size for the PNB-trees is 2 Kbyte. In the figure PNB-trees are the most efficient, until 12 data requests/sec. Conventional B-trees of height 3 by using a large block size is always better than the conventional B-trees of height 4. But for very large 111 100 — 9O — 80 — Data Response 70 _ Time (ms) 60 — B 50 — 40 — 30 LB PNB IIIIIIIIIjj 0 2 4 6 810121416182022 Data Request Rate (a) l M Records 200 a 180 — 160 —~ 140 — Data Response 120 — 100 — Time (ms) 80 _ LB 60 — 40 — 20 a B PNB I I I I I I I I I I j 0 2 4 6 8 10 12 1416 18 2022 Data Request Rate (b) 60 M Records Figure 5.4. Data Response Time with Various Data Request Rates 112 files this is not the case. Figure 5.4.(b) shows the case for 60 M records with eight disks. Here, we use 10 Kbyte and 48 Kbyte block sizes for the conventional B-trees, and 6 Kbyte block size for the PNB-trees. The PNB-trees are the most efficient, until 21 data requests/sec. Using the large block size in the conventional B—trees turns out to be the worst. This is because for 48 Kbyte block size the block transfer time is comparable to one disk access time. Furthermore, it suffers severe RPS miss delay due to the long block transfer time. On the other hand, PNB-trees perform better than conventional B- trees for both small and large block sizes. Figure 5.5 shows the response time comparison for various disk speeds, when data request rate is 20/sec. Disk speed i in these figures denotes M x "average disk access time of IBM 3380 disk". Figure 5.5.(a) and 5.5.(b) show the results for 4 parallel disks with 1 M records and for 8 parallel disks with 60M records, respectively. In Figure 5.5.(a) 4 Kbyte and 14 Kbyte block sizes are used for conventional B-trees, and 4 Kbyte block size is used in PNB-trees. In Figure 5.5.(b) we used 10 Kbyte and 48 Kbyte block sizes for conventional B-trees and 6 Kbyte block size for PNB-trees. In this figure we also see that PNB-trees perform better than conventional B-trees for both small and large block sizes. The saturation request rate is defined to be a data request rate which makes the queue length infinite. The PNB-tree achieves a larger saturation request rate than the conventional B-tree in the worst case, where every request goes to the same disk. Note that in PNB-trees the worst case is the same as the average case. When the number of records is 1 M and 8 disks are used with 4 Kbyte block size, the PNB-tree is saturated at 21 data requests/sec, but the conventional B-tree is saturated at 14 data requests/sec. Therefore, PNB-trees can handle larger data request rate in the worst case. One problem of PNB-tree organization is an increased queue length. This is because all disk requests go to the same queue. For normal data request rate, the queue length is usually very small(less than 0.1). But for high data request rate(e.g., more 113 35 — 30 — 25 -— Data Response 20 ._ B Time (ms) 15 - 10 — 5 __ PNB 0 | | I I I | I I I 1 2 3 4 6 7 10 Disk Speed (a) 1 M Records 50 —1 40 — Data Response 30 - Time (ms) 201 LB 10— B PNB O I I I I I I I I I 1 2 3 4 5 6 7 8 9 10 Disk Speed (b) 60 M Records Figure 5.5. Data Response Time with Various Disk Speeds 114 than 15 data requests per second which are more than 45 disk access requests per second, if the height of a B—tree is 3), the queue waiting time is no longer negligible and the PNB-tree suffers from long queue waiting time. In the next section the range of data request rates until which the benefit of PNB- trees is larger than the increased queue waiting time will be given. Note that the range of data request rates resulting in long queue waiting time is quite dependent on the disk speed. 5.4.3. Threshold Points We define the threshold point to be the maximum data request rate until when the response time of the PNB—tree is smaller than the conventional B-tree. Figure 5.6 shows threshold points for various disk speeds. l M records and 4 parallel disks are used. Threshold Points 0 I I I I I 0 2 4 6 8 10 Disk Speed Figure 5.6. Threshold Points In the figure we can observe that threshold limit increases almost linearly with the disk speed. 115 5.4.4. Update Performance As we discussed in section 5.3.2, PNB-trees have an advantage of reduced number of tree restructuring for insertion/deletion operation. Figure 5.7 shows the total number of tree restructuring and Figure 5.8 shows the total amount of time for creating files of different file sizes for conventional B-trees and PNB-trees. Here, eight IBM 3380 disks with the conventional B—tree of 4 Kbyte and 32 Kbyte block size, and the PNB-tree of 4 Kbyte block size are used. In both figures, B denotes the conventional B-tree of 4 Kbyte block size, LB denotes the conventional B-tree of 32 Kbyte and PNB denotes the PNB— tree of 4 Kbyte block size. In Figure 5.7 the conventional B-tree of 32 Kbyte block size has almost the same shape as that of the PNB-tree. When the insertion/deletion request rate is high, the advantage of the reduced number of tree restructuring for the PNB-tree is quite significant. 5.5. Other Strategies for The PNB-tree Organization In synchronized disks, discussed so far, we use one pointer to locate all the sub- nodes of a node. When asynchronous disks are used for PNB-trees, multiple pointers are needed to locate multiple subnodes. Note that asynchronous disks do not use syn- chronizing disk-arm. In spite of a little bit extra storage to maintain multiple pointers, this approach gives much better overall storage utilization. This is because the records can be stored atany place within a node since key values need not be kept in sorted order. Therefore, we can store the data in a compact form with variable number of sub— nodes for each node. For this implementation, all the tree operation algorithms are the same as those in section 5.2. One problem of this approach is the increased disk access time because the disk access time is determined by the worst case disk position (note that the second phase of lookup operation requires responses of all disks). Another approach is to use synchronized disks and keep the key values ordered within a node. By keeping ordered key values in a node we can eliminate associated 116 500 1 450 — 400 — Number of 350 _ 300 — Restructuring 250 _ B 200 — 150 — 100 —- 50 — PNB 0 (* 10“) IIIIIIIIIIII 051015202530354045505560 Number of Records (* 106) Figure 5.7. Tree Restructuring Cost 50 _ 45 _ 40 _ 35 — B 30 — Time 25 _ LB 20— 3 (* 10 SEC) 15_ PNB 10— 5_ 0 Insertion I I I I I I I I 0 10 20 30 40 506070 80 Number of Records (* 104) Figure 5.8. Insertion Time 117 minimum finding hardware for each disk. This approach has the problem of increased number of tree restructuring due to subnode overflow. Note that restructuring operation is expensive. We can also use asynchronous disks and keep the key values ordered within a node. The disk access time for a successful search does not increase in this method. This is because for a successful search the decision can be made immediately in one disk without collecting responses from all the disks. Therefore, for a successful search the disk access time is the same as that of synchronized disks. However, this approach requires multiple pointers and has the same problem as synchronized disks with ordered key values. [It CHAPTER 6 CONCLUSIONS We have investigated data distribution strategies for various database applications. We have proposed a general database parallel processing model and a two stage data— base parallel processing model which can be used as a framework for more specific implementation. The general database parallel processing model shows important issues of parallel database systems. The two stage database parallel processing model is derived from the general database parallel processing model and is suitable for many database applications. The objective of these models is to facilitate the design of paral- lel processing database systems. We have applied these abstract models to three specific database applications. These are partial match queries, multiattribute range queries and key based accesses on B-tree type index structures. We have investigated concurrent data accesses for partial match queries. Much research has been done on designing efficient multikey hashing schemes for partial match retrieval type applications. We have focused on optimal bucket distribution in multikey hashing to achieve maximum access concurrency for partial match queries. We have presented FX distribution methods which are based on exclusive-or operation. Field transformation techniques have been used to increase the scope of optimality in FX distribution methods. Performance of FX distribution methods has been compared with those of the other existing methods for typical file systems. We have shown that FX distribution methods give higher probability of strict optimality than the existing methods. We have also shown that the query response time of FX distribution methods is better than that of the others. 118 119 Concurrent data accesses on multiattribute range queries has also been investi- gated. Though much work has been done in designing file systems for multiattribute range queries to reduce the number of bucket accesses, appropriate data distribution among the access nodes has not been considered for these applications to enhance access concurrency. There are several proposals for data distribution methods for multikey accesses. However, they did not consider range specification in a query. We have investigated optimal data distribution for multiattribute range queries in parallel process- ing file systems. Since perfect optimal distribution is the most desirable, the existence of perfect optimal distribution has been investigated. We have shown for various types of range queries that there are inherent limitations to achieve optimal distribution. The results show that optimal distribution does not exist in many cases even for files with two fields. We have given sufficient conditions for the nonexistence of perfect optimal distribution for certain types of range queries. We have also developed data distribution methods for several useful multiattribute range queries. Sufficient conditions for optimal distribution by these proposed methods have been given. It has been shown that the proposed data distribution methods are perfect optimal for certain types of mul- tiattribute range queries, and strict optimal for a large class of multiattribute range queries. We have proposed data distribution strategies for B—tree nodes to improve the response time of file accesses. This approach is based on the partitioned node B-tree called the PNB-tree. We compared the performance of PNB-trees with conventional B- trees on parallel disks. PNB-trees reduce the height of B-trees and exploit intra-node parallel processing. We have shown that the height reduction of the B-tree along with parallel processing within a node gives better performance than parallel processing of conventional B-trees on parallel disks. We have also shown that the frequency of tree restructuring due to node overflow is considerably reduced in PNB-trees. 120 This thesis investigated optimal data distribution for multikey hash files and node partitioning schemes for B-trees to enhance access concurrency. Some areas of future research that extend the work in this thesis are described below. (1) (2) (3) (4) We have proposed a general model and two stage processing model. Though these models are independent of the parallel processing architecture, they need to be tuned and further subdivided into several prototypes to reflect hardware architec- tures such as cube type interconnection. In many cases we do not know what the best achievable distribution is for a given file systems. It is worth while to investigate the existence of optimal distribution for these file systems. Knowledge of the existence of optimal distribution will help to deve10p methods for optimal distribution for these file systems. FX distribution methods have been developed for file systems in which each field size and the number of parallel access nodes are power of 2. FX distribution methods need to be extended for non power of 2 file systems. PNB-trees are developed for files on secondary storage devices. The application of PNB-tree approach to main memory databases needs to be investigated. Perfor- mance analyses of PNB-trees are mainly based on disk access time because disk access time is the most important parameter for the performance of external data- bases. However, performance analyses of PNB-trees in main memory databases will require different set of parameters. LIST OF REFERENCES [Aho79] [Amm85] [Ban79] [Bat77] [Bay72] [Bay79] [Ben75] [Ben79] [B1083] [B0179] [B0181] LIST OF REFERENCES Aho, A.V. and Ullman, J.D., "Optimal Partial-Match Retrieval When Fields Are Independently Specified," ACM Trans. on Database Systems, Vol. 4 No.2 , June 1979, pp. 168-179. Ammann, A., Hanrahan, M. and Krishnamurthy, R.,"Design of a Memory Resident DBMS,’ Proc. IEEE COMPCON, 1985, pp. 54-57. Banerjee, J., Hsiao, D.K. and Kannam, K., "DBC - A Database Computer for Very Large Databases," IEEE Trans. on Computers, Vol. c-28, No. 6, June 1979. PP. 414-429. Batcher, K.E., "The Multidimensional Access Memory in STARAN", IEEE Trans. on Computers, Feb. 1977, pp. 174-177. Bayer, R. and McCreight, E.M., "Organization and Maintenance of Large Ordered Indices," Acta Informatica, 1:3, 1972, pp. 173-189. Bayer, R. and Schkolnick, H., "Concurrency of Operations on B-trees," Acta Informatica, Vol.9, 1977, pp. 1-21. Bentley, J.L., "Multi-dimensional Binary Search Trees Used for Associa- tive Searching," Commu. of the ACM Vol. 18, No. 9, 1975, pp. 509-517. Bentley, J.L. and Friedman, J.H., "Data Structures for Range Searching," ACM Computing Surveys, Vol. 11, No. 4, Dec. 1979, pp. 397-409. Bic, L. and Hartman, R.L., "A Network Oriented Dataflow System," Data- base machines, Leilich, HO. and Missikoff, M., eds., Springer-Verlag, 1983, pp. 166—187. Bolour, A., "Optimality Properties of Multiple-key Hashing Functions," JACM, Vol.26, No.2, April 1979, PP. 196-210. Bolour, A., "Optimal Retrieval Algorithms for Small Region Queries," SIAM J. Comput. Vol. 10, No. 4, Nov. 1981, pp. 721-741. 121 [Bor82] [Bor83] [Bur76] [Cev88] [Com79] [Dew79] [Dew86] [Dew84] [Du85] [Du86] [DuS82] [E1180-a] 122 Boral, H. and Dewitt, D.J., "Applying Dataflow Techniques to Database Machines," IEEE Computer, August 1982, pp. 57-63. Boral, H. and Dewitt, D.J., "Database Machines : An Idea Whose Time Has Passed? A Critique of the Future of Database Machines," Database machines, Leilich, HO. and Missikoff, M., eds., Springer-Verlag, 1983, pp. 166-187. Burkhard, W.A., "Hashing and Trie Algorithms for Partial Match Retrieval," ACM Trans. on Database Systems, Vol. 4, No. 2, June 1976, pp. 175-187. Ceverance, C. and Pramanik, S., "A High Speed KDL-RAM File System For Parallel Computers," Techinical Report, Computer Science Depart- ment, Michigan State University, Sept. 10, 1988. Comer, D., "The Ubiquitous B-trees," ACM Computing Surveys, vol. 11, no. 2, June 1979, pp. 121-136. Dewitt, D.J., "DIRECT-A Multiprocessor Organization for Supporting a Relational Database Management System," IEEE Trans. on Computers, June 1979, pp. 395-406. Dewitt, D.J., Gerber, R.H., Graefe, G., Heytens, M., Kumar, KB. and Muralikrishna, M., "GAMMA : A Performance Dataflow Database Machines," Proc. Conf. on Very Large Data Bases, 1986, pp. 228—237. Dewitt, D.J., Katz, R., Olken, F., Shapiro, L, Stonebraker, M. and Wood, D., "Implementation Techniques for Main Memory Database Systems," Proc. ACM SIGMOD, 1984. PP. 1-8. Du, H.C., "On the File Design Problem for Partial Match Retrieval," IEEE Trans. on Software Eng. Vol. SE-l 1, No. 2, Feb. 1985, pp. 213-222. Du, H.C., "A Heuristic Disk Allocation Method for Binary Cartesian Pro- duct Files," BIT 1986, pp.138-147. Du, HQ and Sobolewski, J.S., "Disk Allocation for Cartesian Product Files on Multiple-Disk Systems," ACM Trans. on Database Systems, Vol. 7 No. 1, March 1982, pp.82-101. Ellis, C., "Concurrent Search and Insertion in AVL Trees," IEEE Trans. on Computers, Vol. c-29, Sept. pp. 811-817. [Ell80-b] [Fag79] [Fan86] [Fre81] [Gar86] [GarL84] [G1183] [Haw82] [Hil86] [Hsi78] [Hsi88] [Kak85] 123 Ellis, C., "Concurrent Search and Insertion in 2-3 Trees," Acta Infonna- tica, Vol. 14, 1980, pp. 63-86. Fagin, R., Nievergelt, J., Pippenger, N. and Strong, H.R., "Extendible Hashing - A Fast Access Method For Dynarrric Files,"ACM Trans. on Database Systems, Vol. 4 No. 3, Sept. 1979, pp.315-344. Fang, M.T., Lee, R.C.T. and Chang, CO, "The idea of De-clustering and Its Applications," Proc. Conf. on Very Large Data Bases," Aug. 1986, pp.181-188. Fredman, M.L., "A Lower bound on the Complexity of Orthogonal Range Queries," JACM, Vol. 28, No. 4, Oct. 1981, pp. 696-705. Garg, AK. and Gotlieb, C.C., "Order-Preserving Key Transformations," ACM Trans. on Database Systems, Vol. 11, No. 2, June 1986, pp. 213 -234. Garcia-Molina, H., Lipton, R]. and Valdes, J., "A Massive Memory Machine," IEEE Trans. on Computers, Vol. c-23, No. 5, May 1984, pp. 391-399. Glinz, M., "A Dataflow Retrieval Unit for a Relational Database Machine," Database machines, Leilich, HO. and Missikoff, M., eds., Springer-Verlag, 1983, pp. 166-187. Hawthorn, PB. and Dewitt, D.J., "Performance Analysis of Alternative Database Machine Architecture," IEEE Trans. on Software Eng, Vol. SE- 8, Jan. 1982, pp. 61-75. Hillyer, B., Shaw, DE. and Nigam, A., "NON-VON’s Performance on Certain Database Benchmarks," IEEE Trans. on Software Eng. Vol. SE- 12, No. 4, April 1986. PP. 577-583. Hsiao, D.K. and Baum, R.I., "Concepts and Capabilities of a Database Computer," ACM Trans. on Database Systems, 1978, pp. 347-384. Hsiao, Y.-S. and Tharp, A.L., "Adaptive hashing," Inform. Systems, Vol. 13, No. l, 1988. Pp. 111-127. Kakuta,T., Miyazaki,N., Shibayama,S., Yokota,H. and Murakami,K.,"The Design and Implementation of Relational Database Machine Delta," Data- base Machines, Fourth International Workshop, 1985, pp. 13-34. [Kh088] [Kim86] [Kim8 8] [Knu73] [Kun80] [Lar7 8] [Law82] [Lee77] [Leh81] [Leh86] [Lei78] [Le185] [LitSO] [Mis83] 124 Khoshafian, S., Valduriez, P. and Copeland, G., "Parallel Query Processing for Complex Objects," Int’l Conf. on Data Engineering, 1988, pp. 202 - 209. Kim, M. Y., "Synchronized Disk Interleaving", IEEE Trans. on Computers Vol.c-35, No.11, Nov. 1986, pp. 978-988. Kim, M. and Pramanik, 8., "Optimal File Distribution For Partial Match Retrieval," Proc. ACM SIGMOD, 1988, pp. 173-182. Knuth, DE. The Art of Computer Programming, Vol. 3 : Sorting and Searching, Addison-Wesley, 1973, pp. 517-537. Kung, H.T. and Lehman, P.L., "Concurrent Manipulation of Binary Search Trees," ACM Trans. on Database Systems, Vol. 5, No. 3, Sept. 1980, pp. 354-382. Larson, P., "Dynanric Hashing," BIT, 1978, pp. 184-201. Lawrie, DH. and Vora, OR, "The Prime Memory System for Array Access," IEEE Trans. on Computers, May 1982, pp. 435-442. Lee, D.T. and Wong, C.K., "Worst-Case Analysis for Region and Partial Region Searches in Multidimensional Binary Search Trees and Balanced Quad Trees," Acta Informatica Vol. 9, 1977, pp. 23-29. Lehman, PL. and Yao, S.B., "Efficient Locking for Concurrent Operations on B-Trees," ACM Trans. on Database Systems, Vol. 6, No. 4, Dec. 1981, pp. 650-670. Lehman, TJ. and Carey, M.J., "Query Processing in Main Memory Data- base Management Systems," Proc. ACM SIGMOD, 1986, pp. 239-250. Leilich, H.O., Stiege, G. and Zeidler, H.Ch., "A Search Processor for Data- base Management Systems," Proc. of the 4th Int’l Conf. on Very Large Data Bases, 1978, pp. 280-287. Leland, MD. and Roome, W.D., "The Silicon Database Machine" Data- base Machines, Fourth International Workshop, 1985, pp. 169-189. Litwin, W., "Linear Hashing : A New Tool for File and Table Addressing," Proc. of the 6th Int’l Conf. on Very Large Data Bases, 1980, pp.212-223. Missikoff, MJ. and Terranova, M., "The Architecture of Relational Data- base Computer Known as DBMAC," Advanced Database Machine [Ozk75] [Pra86] [Pra87] [Pra88-a] [Pra88-b] [Pra88-c] [Pra88-d] [Riv76] [R0587] [Ross85] [Rot74] [Sam76] 125 Architecture, D.k. Hsia0(ed), Prentice Hall, 1983, pp. 109-129. Ozkarahan, C.A., Schuster,S. A., and Smith, K. C., "RAP-An Associative Processor for Database Management," Proc. NCC, AFIPS, 1975, pp. 379- 387. Pramanik, S. and Davis, H., ”Multi Directory Hashing," Technical Report, Computer Science Department, Michigan State University, Oct. 15, 1986. Pramanik,S. and Kim, M., "HCB_tree : A B_tree Structure for Parallel Processing," Proc. Int’l Conf. on Parallel Processing, 1987, pp. 140-146. Pramanik, S. and Kim, M., "HCB_tree : A Height Compressed B_tree for Parallel Processing," Information Processing Letters, November 1988, pp. 213-220. Pramanik, S. and Kim, M.,"Generalized Parallel Processing Models for Database Systems," Proc. Int’l Conf. on Parallel Processing, 1988, pp. 76-83. Pramanik, S. and Kim, M., "Parallel Processing of Large Node B-trees," IEEE Trans. on Computers (to be published). Pramanik, S. and Kim, M., "Optimal Data Distribution for Parallel Pro- cessing of Multiattribute Range Queries," Technical Report, Computer Science Department, Michigan State University, Nov. 15, 1988. Rivest, R.L., "Partial-Match Retrieval Algorithms," SIAM J. Computing, Vol.5, No.1, March 1976, pp. 19-50. Rosenau, T. and Jajodia, S., "Parallel Relational Database Operations on the Butterfly Parallel Processor : Projection Results," Technical Report, Naval Research Laboratory, July 1987. Ross, S.M., Introduction to probability Models, Academic Press, 1985, 3rd Edition, pp. 355-399. Rothnie, J.B.Jr. and Lozano,T.,"Attribute based file organization in a paged memory environment," Comm. ACM, Vol.17, No.2, 1974, pp. 63- 69. Samadi, B., "B-trees in A System with Multiple Users," Information Pro- cessing Letters, October 1976, pp. 107-112. [Sch83] [S1070] [Sto87] [Su79] [Sun85] [Sun87] [Yao78] 126 Schweppe,H., Zeidler,H.Ch., Hell,W., Leilich,H.O., Stiege,G. and Teich,W.,"RDBM-A Dedicated Multiprocessor System for Database Management," Advanced Database Machine Architecture, Hsiao, D.K. ed., Prentice Hall, 1983, pp. 36-86. Slotnick, D.L., "Logic Per Track Devices," Advances in Computers, V01. 10, Academic Press, 1970, pp. 291-296. Stone, H., "Parallel Querying of Large Databases : A Case Study," IEEE Computer, Oct. 1987, pp. 11-21. Su, S.Y.W., Nguyen, L.H., Eman, A. and Lipovski, G]. "The Architec- tural Features and Implementation Techniques of multicell CASSM," IEEE Trans. on Computers, June 1979, pp. 430-445. Sung, Y.Y., "Parallel Searching for Binary Cartesian Product Files," Proc. ACM CSC Conf., 1985, pp. 163-172. Sung, Y.Y., "Performance Analysis of Disk Modulo Allocation Method for Cartesian Product Files," IEEE Trans. on Software Eng., Vol. SE-13, No. 9, Sept. 1987, pp. 1018-1026 . Yao, A.,"Random 2-3 trees," Acta Informatica, Vol. 9, 1978, pp. 159-170. “llllllllllllllllllll