DIRECT PERFECT HASHING FUNCTIONS FOR EXTERNAL FILES

By

Yuichi Bannai

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Computer Science

1988

ABSTRACT

DIRECT PERFECT HASHING FUNCTIONS FOR EXTERNAL FILES

By Yuichi Bannai

A composite perfect hashing scheme for large external files which guarantees single disk access retrieval has been proposed in [RL88]. It was suggested that direct perfect hashing functions be found by trial and error methods. In this thesis, we explore systematic methods of finding direct perfect hashing functions. Extensions of Sprugnoli's Quotient Reduction and Remainder Reduction methods are proposed. The experimental results indicate that these methods are practical. Also, the performance of different universal_2 classes of hashing functions is studied.

To my loving wife, Namiko

ACKNOWLEDGEMENTS

I wish to express my thanks to my thesis advisor M.V. Ramakrishna for his guidance and encouragement. The other members of my thesis committee, S. Pramanik and M. Chung, provided valuable comments. My thanks are due to E. Vincent, who carefully read this thesis and made many suggestions for improving its readability. I would like to thank my wife Namiko for her patience and love, which made it possible for me to complete this thesis. Finally, I would like to thank my company, Canon Inc., for their constant support during my stay in the United States.

TABLE OF CONTENTS

1. Introduction
2. Background
3. External Perfect Hashing Scheme
4. Quotient Reduction Method
5. Remainder Reduction Method
6. Universal Classes of Hashing Functions
7. Dynamic Behavior
8. Conclusions and Future Work
Appendix
Bibliography

LIST OF TABLES

Table 4.1 Hash table of Example 4.2
Table 4.2 Average load factor of nine groups of file (A)
Table 4.3 Load factor for file (B)
Table 5.1 Probability of having no collisions for a given M and n
Table 5.2 Average load factor of nine groups of file (A) obtained using RR
Table 5.3 Load factor for file (B) obtained using RR
Table 6.1 A hash table (example of failure)
Table 6.2 A hash table (example of success)
Table 7.1 Probability of an insertion causing rehashing using RR
Table 7.2 Probability of an insertion causing rehashing (n = 250)
Table 7.3 Comparison of probability of rehashing

LIST OF FIGURES

Figure 4.1 Admissible increments when b = 2
Figure 4.2 Set of keys for Example 4.1
Figure 4.3 Solution space
Figure 4.4 Solution space for Example 4.2
Figure 6.1.1 Observed and computed probability of a trial succeeding
Figure 6.1.2 Observed and computed probability of a trial succeeding
Figure 6.2.1 Probability of a randomly chosen function being perfect (b = 10)
Figure 6.2.2 Probability of a randomly chosen function being perfect (b = 20)
Figure 7.1.1 Load factor of a group (b = 10)
Figure 7.1.2 Load factor of a group (b = 40)
Figure 7.2.1 Overall load factor (b = 10)
Figure 7.2.2 Overall load factor (b = 40)

1. Introduction

Hashing is an important and widely used technique for organizing files. The storage area is partitioned into a finite number of buckets. Hashing involves computing the address of a data item directly, by evaluating a function (the hashing function) on the search key of the desired record.
The record is then stored in the corresponding bucket. If the bucket is already full, we say an 'overflow' has occurred. A hashing function is said to be 'perfect' if no overflows occur for the given data. There is no need to handle overflows in a perfect hashing scheme, and hence we obtain ideal retrieval performance.

A composite perfect hashing scheme for large external files was proposed in [RL88]. An ordinary hashing function is used to divide the records of the file into a number of groups. The records in each group are stored using direct perfect hashing functions. A simple trial and error method for finding perfect hashing functions is proposed and analyzed. In this scheme any record can be retrieved in a single disk access, and the analysis shows that records can be inserted and deleted at an average cost comparable to that of many traditional hashing schemes. One drawback of this scheme is that there is no guarantee of finding a direct perfect hashing function for a given set of keys, which motivated this thesis. The methods proposed here are extensions of Sprugnoli's Quotient Reduction and Remainder Reduction methods [SP77].

In [RL88], Ramakrishna et al. experimentally showed that the universal_2 class H_1 of hashing functions could be used for making trials: the relative frequency of perfect hashing functions within universal_2 class H_1 is the same as that predicted by the analysis for the set of all functions. However, they did not study the other universal_2 classes, H_2 and H_3. In this thesis, we examine whether classes H_2 and H_3 behave similarly to class H_1.

Outline of the Thesis

The next section presents the background of research into perfect hashing. The external perfect hashing scheme proposed in [RL88] is introduced in section 3. In section 4, the Quotient Reduction method for generating perfect hashing functions is introduced and analyzed. The Remainder Reduction method, which is superior to the Quotient Reduction method, is presented in section 5. The performance of the universal_2 classes H_2 and H_3 is studied in section 6. Section 7 discusses the dynamic behavior of the proposed scheme.

Definitions and Assumptions

A hash table consists of m buckets, and each bucket has a capacity of b records. A set I = {x1, x2, ..., xn} of n search keys is to be stored in the table. A hashing function h : I -> [0 .. m-1] assigns an integer 0 through m-1 to each key. We will consider functions of integer keys in this thesis. A hashing function h is a perfect hashing function for a given key set and table if no bucket receives more than b keys under the function h. We use [p .. q] to denote the interval {p, p+1, p+2, ..., q-1, q} (p < q), the length of which is q-p+1. The load factor (α) is the ratio of the number of keys to the total capacity of the hash table, α = n/(mb).
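These definitions translate directly into a check. The following is a minimal Python sketch (with illustrative names, not from the thesis) that tests whether a given function is perfect for a key set and computes the load factor:

    from collections import Counter

    def is_perfect(h, keys, b):
        # h is perfect for this key set and table iff no bucket
        # receives more than b keys under h
        return max(Counter(h(x) for x in keys).values()) <= b

    def load_factor(n, m, b):
        # alpha = n / (m*b): number of keys over total table capacity
        return n / (m * b)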
2. Background

The perfect hashing schemes dealt with in the literature may be grouped into two classes [RL88]: a) direct perfect hashing; b) composite perfect hashing. Direct perfect hashing functions map a given key into the address where the key and its associated record are stored. A composite perfect hashing scheme involves an additional table look-up to find the desired address, which means that a retrieval involves more than one level of access.

Direct Perfect Hashing

Sprugnoli, who was the first to formally define perfect hashing, proposed a systematic method of finding perfect hashing functions, called the Quotient Reduction method [SP77]. The functions are of the form h(x) = ⌊(x+s)/N⌋, where s and N are constants called the admissible increment and the quotient, respectively. He gave a constructive proof of the existence of a Quotient Reduction perfect hashing function for any given key set. His algorithm finds s and N for a given set of keys I = {x1, x2, ..., xn}. However, his algorithm was restricted to the special case of the bucket size being one. He also proposed the Remainder Reduction method, which yields better storage utilization, especially when the keys are not uniformly distributed. The functions are of the form h(x) = ⌊((xq+d) mod m)/N⌋.

Jaeschke [JS81] proposed Reciprocal Hashing functions of the form h(x) = ⌊C/(Dx+E)⌋ mod n, where C, D, and E are constants. He proved that there always exists a perfect hashing function of this form for any given I. Chang [CH84] proposed a similar scheme based on the Chinese Remainder Theorem. However, these schemes are not of practical value. Please refer to [RM86, MR84] for further details.

Composite Perfect Hashing

Composite hashing is necessary for handling large sets of data. Sprugnoli suggested the idea of segmentation to handle large sets of keys [SP77]: the keys are divided into a number of groups by an ordinary hashing function, so that each group can be handled by direct perfect hashing. Fredman et al. [FR82] constructed perfect hash tables using Sprugnoli's idea. Their scheme guarantees constant time retrieval with O(n) space requirement. However, they did not discuss how to update the table. Cormack et al. [CR85] proposed a similar scheme including algorithms for insertion and deletion.

3. External Perfect Hashing Scheme

The external perfect hashing scheme proposed by Ramakrishna et al. [RL88] attempts to achieve single disk access retrieval at the cost of a small header table in internal memory. They used a composite data structure similar to that proposed by Cormack et al. [CR85]. Direct perfect hashing functions are found by a simple trial and error method. One drawback of this method is that there is no guarantee that perfect hashing functions can always be found for a given key set, although the expected number of trials is small and can be controlled. We investigate systematic methods for finding direct perfect hashing functions for external files, and study properties of the files constructed by the proposed methods, such as storage utilization and frequency of rehashing.

The details of the external perfect hashing scheme are as follows. The set of keys is divided into several groups. The keys in each group are stored in a number of contiguous pages in the secondary memory using a perfect hashing function. An ordinary hashing function H is used to accomplish the grouping. Let key group t denote the set of keys {x | x ∈ the given key set and t = H(x)}. Each entry of the header table in the internal memory is of the form <p_t, m_t, R_t>, where p_t is a pointer to the starting address of group t, m_t is the group size, and R_t denotes the parameters of the perfect hashing function for group t. Any record can be retrieved in a single access of the secondary memory in this scheme. Algorithms for retrieval, insertion, and deletion are given below.

Retrieval of a record with key x:
- Compute the group t to which x belongs by t := H(x).
- Extract <p_t, m_t, R_t> from entry t of the header table.
- Compute the page address of x by A_t := p_t + h(x, R_t).
- Read in page A_t.
- Search page A_t for key x. If key x is not found, the search has failed.
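The retrieval path is short enough to state as executable pseudocode. A minimal Python sketch, with illustrative names that are not from the thesis (the header table is a dict of <p_t, m_t, R_t> tuples, pages are dicts keyed by record key, and read_page stands in for the single secondary-memory access):

    def retrieve(x, H, h, header, read_page):
        t = H(x)                      # group to which x belongs
        p_t, m_t, R_t = header[t]     # entry t of the header table
        A_t = p_t + h(x, R_t)         # page address of x
        page = read_page(A_t)         # the single disk access
        return page.get(x)            # None: key not found, search failed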
Insertion of a record with key x:
- Compute A_t as above and read in page A_t.
- If the page is not full, then insert the record into page A_t and write back the page;
- else {rehashing}:
  - Read in all pages of the group t. The number of pages is m_t.
  - Find a new perfect hashing function for all the keys in the group, including the new key x. The number of pages obtained by the new perfect hashing function, m_t', may or may not be equal to m_t; let R_t' be the parameters of the new perfect hashing function.
  - Find an address p_t' in the secondary storage having space for m_t' contiguous pages (p_t' may be equal to p_t).
  - Store the records on pages p_t', p_t'+1, p_t'+2, ..., p_t'+m_t'-1 using the new perfect hashing function.
  - Update the header table entry of group t to <p_t', m_t', R_t'>.

Deletion of a record with key x:
- Compute A_t as above and read in the page.
- Search page A_t for key x. If key x is found, then delete the record and write back the page; else the desired record is not in the file.
- It may be necessary to rehash the group to avoid low storage utilization. The procedure of rehashing is similar to that for insertion.

4. Quotient Reduction Method

Quotient Reduction hashing functions are of the form h(x) = ⌊(x+s)/N⌋, where s and N are constants. We will show that perfect functions of this form exist for any given bucket size, by giving a constructive proof. The basic idea of this method is to divide the given set of keys into a number of intervals of length N, and to move the boundaries between the intervals by adding the admissible increments, so that each interval, corresponding to a bucket, contains no more than b keys. The following is an extension of Sprugnoli's construction to the general case b > 1.

Admissible increments

Let I = {x1, x2, ..., xn} be the set of keys to be hashed, in sorted order. When the bucket size is b, keys x_i and x_{i+b} must not fall in the same bucket. The set of admissible increments J_i of the key x_i for a given N is the set of integers t_i for which the condition

  ⌊(x_i + t_i)/N⌋ < ⌊(x_{i+b} + t_i)/N⌋,  1 ≤ i ≤ n-b,  t_i ∈ J_i

holds. In other words, an admissible increment for x_i is any translation value which adjusts x_i and x_{i+b} into two different intervals [(p-1)N .. pN-1] and [(q-1)N .. qN-1], where p and q are integers with p < q.

Figure 4.3 depicts the solution space of the pairs (s, N): each row represents the solution space for a different N, with (x1+s) mod N on the horizontal axis. A point for which (x1+s) mod N* > (N*-1) - r mod N* holds lies on the right-hand side of the boundary; the values of s and N (= N*) in this space yield m* buckets.

Figure 4.3 Solution space ((x1+s) mod N versus N = N*, N*-1, ..., N*-6; N* = ⌊r/(m*-2)⌋ denotes the upper bound of N, and m* = ⌈n/b⌉ the minimum number of buckets)

Consider the range of possible N which gives a fixed number of buckets m. Let NL1 and NU1 denote the lower and upper bounds of N, respectively, when {r mod N + (x1+s) mod N} ≤ N-1. They can be obtained as follows. In this case m = ⌊r/N⌋ + 1 and r = (m-1)N + r mod N; since 0 ≤ r mod N ≤ N-1,

  (m-1)N ≤ r ≤ (m-1)N + N-1,
  (r+1)/m ≤ N ≤ r/(m-1).

Since N is an integer, the lower bound NL1 and upper bound NU1 are

  NL1 = ⌈(r+1)/m⌉,  NU1 = ⌊r/(m-1)⌋.

Similarly, when {r mod N + (x1+s) mod N} ≥ N, we have m = ⌊r/N⌋ + 2 and (r+1)/(m-1) ≤ N ≤ r/(m-2), and hence the lower bound NL2 and upper bound NU2 are

  NL2 = ⌈(r+1)/(m-1)⌉,  NU2 = ⌊r/(m-2)⌋.
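The admissibility condition can be checked directly. A small Python sketch (names illustrative; the thesis derives these sets in closed form, and shifts outside [0, N-1] repeat modulo N):

    def admissible_increments(w, i, b, N):
        # shifts t that put sorted keys w[i] and w[i+b] into
        # different length-N intervals
        return {t for t in range(N)
                if (w[i] + t) // N < (w[i + b] + t) // N}

    # Keys of Example 4.2 below, with N = 48 and b = 3: intersecting
    # the sets over all i leaves the increments 17, 18, 19 among the
    # candidates (the boundary condition then selects four buckets).
    w = [31, 58, 67, 123, 142, 146, 154, 187, 198, 220]
    J = set.intersection(*(admissible_increments(w, i, 3, 48)
                           for i in range(len(w) - 3)))
    assert {17, 18, 19} <= J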
Algorithm QR (Quotient Reduction)

We are now ready to introduce our algorithm QR to generate a perfect hashing function for a given set of keys I. The basic idea of this algorithm is to search the solution space of N and s to obtain m* buckets, then m*+1 buckets, and so on.

Algorithm QR

STEP 1 {Initialization}
  Compute the range of the keys: r := xn - x1.
  Compute the minimum number of buckets: m := ⌈n/b⌉.
STEP 2
  Compute the upper and lower bounds of N corresponding to m buckets:
    NU2 := ⌊r/(m-2)⌋,  NL2 := ⌈(r+1)/(m-1)⌉,  NU1 := ⌊r/(m-1)⌋,  NL1 := ⌈(r+1)/m⌉.
  for N := NL1 to NU1 do
    Compute the sets of admissible increments J_i (as described in Proposition 1) with N, including only those elements t in the J_i's such that the condition r mod N + (x1+t) mod N ≤ N-1 holds.
    Take the intersection of the J_i's. If it is not empty, then goto STEP 3.
  end for
  for N := NL2 to NU2 do
    Compute the sets of admissible increments J_i (as described in Proposition 1) with N, including only those elements t in the J_i's such that the condition r mod N + (x1+t) mod N ≥ N holds.
    Take the intersection of the J_i's. If it is not empty, then goto STEP 3.
  end for
  m := m + 1; goto STEP 2.
STEP 3
  s := a value chosen from the intersection set. The perfect hashing function is h(x) = ⌊(x+s)/N⌋.

The range of N that corresponds to m buckets is covered as follows: when NU1 ≥ N ≥ NL1, the condition {r mod N + (x1+t) mod N} ≤ N-1 holds, and when NU2 ≥ N ≥ NL2, the condition {r mod N + (x1+t) mod N} ≥ N holds. If the intersection is not empty, any element of the set is acceptable (all give the same load factor). However, to obtain a more uniform distribution of the hashed keys, choose s* from the intersection set such that

  s* = arg min_j |N - (x1+j) mod N - (xn+j) mod N|.

This choice of s* makes the interval covered by the first bucket, [x1 .. pN-1], as close as possible in length to that of the last bucket, [qN .. xn], where p = h(x1)+1 and q = h(xn). Finally, we need the transformation s = s* - N{(x1+s*) div N}, so that x1 hashes to bucket zero, i.e., ⌊(x1+s)/N⌋ = 0.

Example 4.2

Consider the key set I = {31, 58, 67, 123, 142, 146, 154, 187, 198, 220} with b = 3, n = 10, and the range of keys r = x10 - x1 = 189. The minimum number of buckets is m* = ⌈n/b⌉ = 4. The bounds are:

  NU2 := ⌊r/(m-2)⌋ = ⌊189/(4-2)⌋ = 94,
  NL2 := ⌈(r+1)/(m-1)⌉ = ⌈(189+1)/(4-1)⌉ = 64,
  NU1 := ⌊r/(m-1)⌋ = ⌊189/(4-1)⌋ = 63,
  NL1 := ⌈(r+1)/m⌉ = ⌈(189+1)/4⌉ = 48.

The solution space is shown in Figure 4.4.

Figure 4.4 Solution space for Example 4.2 ((x1+s) mod N versus N)

The search starts with the value N = NL1 = 48, and the set of admissible increments J = {17, 18, 19} is found with N = 48 in STEP 2. We move to STEP 3, and the element 18 is chosen, which is transformed as s = 18 - 48{(31+18) div 48} = -30, so that h(x1) = 0. We thus have the Quotient Reduction perfect hashing function

  h(x_i) = ⌊(x_i - 30)/48⌋.

The hash table for this example is shown below.

Table 4.1 Hash table of Example 4.2

  bucket#   0           1     2              3
  x_i       31 58 67    123   142 146 154    187 198 220
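The search can be prototyped compactly. The Python sketch below is a brute-force stand-in for Algorithm QR: it scans the same bucket counts and quotient ranges derived above, but tries every shift s directly instead of intersecting admissible-increment sets, so it is suitable only for small key ranges (or for the reduced ranges of section 5). On Example 4.2 it recovers N = 48 and a shift equivalent to one of the increments {17, 18, 19}; it does not apply the balancing rule for choosing s*.

    import math
    from collections import Counter

    def qr_search(keys, b):
        # returns (s, N) with h(x) = (x + s) // N perfect and h(w[0]) = 0
        w = sorted(keys)
        r = w[-1] - w[0]
        if len(w) <= b:
            return -w[0], r + 1                   # one bucket suffices
        for m in range(math.ceil(len(w) / b), len(w) + 1):
            lo = math.ceil((r + 1) / m)           # NL1
            hi = r // (m - 2) if m > 2 else r     # NU2 covers both sub-ranges
            for N in range(lo, hi + 1):
                for s in range(N):                # shifts modulo N suffice
                    idx = [(x + s) // N for x in w]
                    if idx[-1] - idx[0] + 1 == m \
                       and max(Counter(idx).values()) <= b:
                        return s - N * ((w[0] + s) // N), N
        return None

    s, N = qr_search([31, 58, 67, 123, 142, 146, 154, 187, 198, 220], 3)
    assert N == 48                                # four buckets, as in Table 4.1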
Efficient QR Algorithm (EQR)

We can see from (4.8) that the upper bound N* of the quotient N is proportional to the range of keys r. The algorithm QR is obviously impractical for a large r, due to the large computation time: every N is examined sequentially. In this section, we aim to reduce the computation time. The basic idea is to consider only the values N = ⌊r/(m-1)⌋ when determining whether an admissible increment exists. We begin with m*+1 buckets. If the search fails, we proceed with m*+2 buckets, and so on. If the search succeeds with N = N_t and m_t buckets, we proceed with m_t - 1 buckets and N = N_t+1 to ⌊r/(m_t-2)⌋. (This is to achieve better storage utilization due to fewer buckets.) Also, an attempt is made to find the smallest N for the value of m obtained, so that we have as uniform a distribution as possible.

Algorithm EQR

STEP 1 {Initialization}
  Compute the range of the keys: r := xn - x1.
  Compute the minimum number of buckets: m := ⌈n/b⌉. {the value m*}
  Compute the upper bound for the search: NU := ⌊r/(m-1)⌋.
  m := m + 1. {starting with m*+1}
STEP 2
  Compute the lower bound for the search: NL := ⌊r/(m-1)⌋.
  Compute the sets of admissible increments J_i with NL, including only those elements t in the J_i's such that the condition r mod NL + (x1+t) mod NL ≤ NL-1 holds.
STEP 3
  Take the intersection of the J_i's.
  if it is not empty then {next searches}
    Store the result temporarily.
    for N := NL+1 to NU do
      Compute the sets of admissible increments J_i with N, including only those elements t in the J_i's such that the condition r mod N + (x1+t) mod N ≤ N-1 holds.
      Take the intersection of the J_i's. If it is not empty, then {search complete} goto STEP 4.
    end for
    {another set of searches}
    Compute the lower bound for these searches: NL2 := ⌈(r+1)/m⌉.
    for N := NL2 to NL-1 do
      Compute the sets of admissible increments J_i with N, including only those elements t in the J_i's such that the condition r mod N + (x1+t) mod N ≤ N-1 holds.
      Take the intersection of the J_i's. If it is not empty, then {search complete} goto STEP 4.
    end for
    Use the temporarily stored result of the intersection and NL; goto STEP 4.
  else
    m := m + 1. {next number of buckets}
    NU := ⌊r/(m-2)⌋. {next upper bound}
    goto STEP 2.
STEP 4
  s := a value chosen from the intersection set. The perfect hashing function is h(x) = ⌊(x+s)/N⌋.

In STEP 2, NL is set to the maximum value of N corresponding to the current number of buckets m. The range of the admissible increments to be examined is calculated by the same method as in algorithm QR. In STEP 3, a search is made for the quotient NL. If admissible increments exist, they are stored temporarily and further searches are made in the range NL+1 through NU, aiming to save one bucket. If the first search fails, the next number of buckets and the next upper bound NU are set, and control is transferred to STEP 2. If one of the searches succeeds, we go to STEP 4 and its result is used there. If none of them succeeds, another set of searches is made for N = NL2 to NL-1. (Although we have found admissible increments for NL, we want the smallest N that yields the same number of buckets as NL.) The lower bound of these searches, NL2, which corresponds to m buckets, is computed as NL2 := ⌈(r+1)/m⌉. If one of these searches succeeds, the result of the first successful search of this set is used in STEP 4; otherwise, the result of the very first successful search (N = NL) is used. In STEP 4, the value of s is chosen by the same method as in algorithm QR.

Example 4.3

For the same set of keys as in Example 4.2, EQR examines whether admissible increments exist with NL = ⌊r/(m-1)⌋ = ⌊189/(4-1)⌋ = 63. The search fails for NL = 63, and the next value of NL is set to ⌊189/(5-1)⌋ = 47. This search succeeds, and we obtain a perfect hashing function which requires five buckets. The next set of searches, over the range of N corresponding to four buckets, is made from N = 48 to 63. A set of admissible increments J = {17, 18, 19} is found in STEP 3, and the perfect hashing function

  h(x) = ⌊(x - 30)/48⌋

is obtained. (This hashing function is the same as that in Example 4.2.)
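The search order is easier to see in code. The Python sketch below mirrors EQR's two main stages under the same brute-force simplification as the QR sketch above (it omits EQR's extra downward pass for the smallest N and the balanced choice of s; all names are illustrative). As in Example 4.3, stage 1 succeeds with five buckets at N = 47, and stage 2 then finds the four-bucket function with N = 48.

    from collections import Counter
    from math import ceil

    def eqr_search(keys, b):
        w = sorted(keys)
        r = w[-1] - w[0]

        def fits(N, s, m):                # exactly m buckets, none over b keys
            idx = [(x + s) // N for x in w]
            return idx[-1] - idx[0] + 1 == m and max(Counter(idx).values()) <= b

        def norm(N, s):                   # shift so that h(w[0]) == 0
            return s - N * ((w[0] + s) // N)

        m = ceil(len(w) / b) + 1          # stage 1: start with m* + 1 buckets
        while True:
            N = r // (m - 1)              # probe only this single quotient
            hit = next((s for s in range(N) if fits(N, s, m)), None)
            if hit is not None:
                break
            m += 1
        for N2 in range(N + 1, r // (m - 2) + 1):   # stage 2: save one bucket
            s2 = next((s for s in range(N2) if fits(N2, s, m - 1)), None)
            if s2 is not None:
                return norm(N2, s2), N2
        return norm(N, hit), N

    assert eqr_search([31, 58, 67, 123, 142, 146, 154, 187, 198, 220], 3)[1] == 48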
Complexity of the Algorithms QR and EQR

Taking the intersection in QR and EQR is rather straightforward, and its complexity is O(n^2) (please refer to the Appendix). In both algorithms the time complexity is determined by the loop over N; in the worst case, the total number of executions of these loops is proportional to the upper bound of N, ⌊r/(⌈n/b⌉-2)⌋. Thus the complexity of these algorithms is O(rbn); the difference between the two algorithms is the constant of proportionality. Although the constant is small, the complexity is not practically acceptable in many cases. In the Remainder Reduction method discussed next, we control r, using a modulo operation, and hence the complexity. It is clear that the space requirement of these algorithms is O(n).

Results of the experiments

To test our algorithms, we conducted several experiments using the following real life files:

(A) Userids from the IBM System at the University of Waterloo (12,000 keys);
(B) Userids from the msudoc system at MSU (600 keys).

All the data was in alphanumeric form, and the length of individual keys varied from 2 to 25 characters. The keys were converted into unique two-byte integers by the RADIX_Convert method; details of this method are described in [RM86]. The keys in each file were divided into g groups using hashing functions of the form H(x) = (((cx+d) mod p) mod g), where p is a large prime number. The hashing function used to separate the groups is

  H(x) = (((314559x + 27182) mod 65521) mod 9).   (4.9)

For each of the nine groups obtained, direct perfect hashing functions were generated using algorithms QR and EQR. The average load factors obtained for the nine groups of file (A) are shown in Table 4.2. The third row of each sub-table gives the ratio of the average load factor obtained by EQR to that obtained by QR.

Table 4.2 Average load factor of nine groups of file (A) (percentage)

1) n = 100

  Algorithm   b=10    b=20    b=30
  QR          69.5    82.0    83.3
  EQR         67.8    82.0    83.3
  Ratio       97.6   100.0   100.0

2) n = 250

  Algorithm   b=10    b=20    b=30    b=40    b=50
  QR          55.1    69.6    80.1    80.6    82.0
  EQR         52.1    68.1    78.5    80.6    82.0
  Ratio       94.6    97.8    98.0   100.0   100.0

3) n = 500

  Algorithm   b=10    b=20    b=30    b=40    b=50
  QR          50.3    70.5    78.0    81.0    84.1
  EQR         47.9    69.7    77.6    80.4    83.3
  Ratio       95.2    98.9    99.5    99.3    99.0

Perfect hashing functions were generated for the keys in file (B) without any partitioning. Table 4.3 shows the load factors for these keys.

Table 4.3 Load factor for file (B) (percentage)

1) n = 100

  Algorithm   b=10    b=20    b=30
  QR          71.4    83.3    83.3
  EQR         71.4    83.3    83.3
  Ratio      100.0   100.0   100.0

2) n = 250

  Algorithm   b=10    b=20    b=30    b=40    b=50
  QR          61.0    65.8    69.4    78.1    83.3
  EQR         61.0    65.8    69.4    78.1    83.3
  Ratio      100.0   100.0   100.0   100.0   100.0

3) n = 500

  Algorithm   b=10    b=20    b=30    b=40    b=50
  QR          42.4    75.8    83.3    83.3    83.3
  EQR         42.0    75.8    83.3    83.3    83.3
  Ratio       99.1   100.0   100.0   100.0   100.0

These tables show that a higher b gives better storage utilization: we obtained load factors over 80% for large bucket sizes, versus about 40% to 70% for small bucket sizes. For large bucket sizes the load factor remains almost the same as the group size increases, whereas for small bucket sizes it falls clearly as the group size increases. The ratio of the average load factor obtained by EQR to that obtained by QR is close to 100%. However, the computation time of EQR is not practical, even though EQR is faster than QR.
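For reference, the grouping step (4.9) used to prepare these experiments amounts to the following Python sketch (the key list name is a placeholder for the converted integer keys, not an identifier from the thesis):

    from collections import defaultdict

    def H(x, c=314559, d=27182, p=65521, g=9):
        # grouping function (4.9): ((c*x + d) mod p) mod g, with p prime
        return ((c * x + d) % p) % g

    groups = defaultdict(list)            # groups[t] is key group t
    for x in converted_keys:              # 'converted_keys' is assumed given
        groups[H(x)].append(x)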
5. Remainder Reduction Method

The Quotient Reduction method described in the previous section has two major problems:

a) low storage utilization when the keys are not uniformly distributed;
b) large computation time.

The Remainder Reduction method overcomes these two problems. The basic idea of this method is to scramble the set of keys to obtain a more uniform distribution within a narrow range, followed by Quotient Reduction perfect hashing. Since we can control r in the Remainder Reduction method, the computation time can be controlled. Remainder Reduction perfect hashing functions are of the form

  h(x_i) = ⌊((q·x_i) mod M + s)/N⌋,   (5.1)

where q, M, s, N are constants to be determined. The transformation

  x_i' = (q·x_i) mod M   (5.2)

accomplishes the scrambling mentioned above, by choosing q and M appropriately.

The transformation (5.2) may cause a collision, i.e., two or more distinct primary keys may be transformed into identical keys. The probability p of no collision occurring under the transformation can be obtained as follows [FL68]:

  p = M(M-1)(M-2)···(M-n+1) / M^n = (1 - 1/M)(1 - 2/M)···(1 - (n-1)/M).

For small positive x we have log(1-x) ≈ -x, and 1 + 2 + ··· + (n-1) = n(n-1)/2, so log p ≈ -n(n-1)/(2M). We obtain

  p ≈ e^{-n(n-1)/(2M)}.   (5.3)

The probabilities of having no collisions for n = 500, 250, and 100 are shown in Table 5.1.

Table 5.1 Probability of having no collisions for a given M and n

  M                 n=500   n=250   n=100
  2^16 (65536)      0.15    0.62    0.93
  2^17              0.39    0.79    0.96
  2^18              0.62    0.89    0.98
  2^19              0.79    0.94    0.99
  2^20              0.89    0.99    1.00
  2^21 (2097152)    0.94    0.99    1.00

If we want to avoid collisions, we have to select a suitably large M; for example, M should be greater than 2^21 when n = 500. However, with such a large M we do not gain the advantages of the Remainder Reduction method mentioned above. Instead, we can tolerate some collisions caused by the transformation (5.2): if the maximum number of identical keys is less than or equal to the bucket size b, then in general there still exists a Quotient Reduction perfect hashing function. On the other hand, the probability that more than b keys collide under this transformation is almost zero. Let P(α, m, b) denote the probability that none of the urns contains more than b balls when n (n = αbm) balls are tossed into m urns. An approximate formula is derived in [RL88]:

  P(α, m, b) ≈ e^{-m·Pov(α, b)},

where

  Pov(α, b) = Σ_{i=b+1}^{n} e^{-bα}(bα)^i / i! ≈ ((bα)^{b+1} e^{-bα} / (b+1)!) · (b+2)/(b(1-α)+2).

For example, when n = 500, m = 500, and b = 10, the probability that more than b keys collide is given by

  1 - P(0.1, 500, 10) ≈ 1 - e^{-500·Pov(0.1, 10)} ≈ 1 - e^{-5.03×10^-6},

a negligibly small quantity.

Now we are ready to introduce our Remainder Reduction algorithm. To implement this method, it is necessary to apply the transformation x_i' = (q·x_i) mod M to each x_i, 1 ≤ i ≤ n, and to sort the results before applying the Quotient Reduction method. Let {w1, w2, ..., wn} denote the sorted keys obtained from {x1', x2', ..., xn'}. q and M are chosen to be prime numbers.

Algorithm RR

STEP 1
  Transform the keys {x1, x2, ..., xn} into {x1', x2', ..., xn'} using x_i' = (q·x_i) mod M.
  Sort the keys {x1', x2', ..., xn'} in non-decreasing order, obtaining the sorted keys {w1, w2, ..., wn}.
STEP 2
  Apply algorithm QR to the sorted keys: QR(n, {w1, w2, ..., wn}); or
  apply algorithm EQR to the sorted keys: EQR(n, {w1, w2, ..., wn}).
  The perfect hashing function is h(x_i) = ⌊((q·x_i) mod M + s)/N⌋.

The complexity of the Remainder Reduction algorithm is O(Mbn), where M is the modulus of the transformation (5.2): since the range of the keys is given by M in this method, r in the analysis of QR is replaced by M. The space requirement is O(n).
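Combined with the brute-force QR sketch from section 4, the whole pipeline fits in a few lines of Python. This is a sketch, not the thesis implementation: it assumes qr_search as defined earlier, and it raises an error in the rare case that more than b keys collide under (5.2):

    from collections import Counter

    def rr_search(keys, b, q, M):
        # scramble into the narrow range [0, M-1], then quotient-reduce
        scrambled = sorted((q * x) % M for x in keys)
        if max(Counter(scrambled).values()) > b:
            raise ValueError("more than b keys collide: change q or enlarge M")
        s, N = qr_search(scrambled, b)
        return q, M, s, N       # parameters of h(x) = ((q*x) mod M + s) // N

    # e.g. the keys of Example 4.2 with q = 37, M = 2039 (values of Table 5.2):
    # q, M, s, N = rr_search([31, 58, 67, 123, 142, 146, 154, 187, 198, 220],
    #                        3, 37, 2039)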
Results of experiments

For each of the nine groups obtained by the hashing function (4.9), perfect hashing functions were generated using algorithm RR with the QR and with the EQR. The average load factors for the nine groups of file (A) are shown in Table 5.2. Several values of q were used in this experiment, and we observed that the value of q hardly affected the results.

Table 5.2 Average load factor of nine groups of file (A) obtained using RR (percentage)

1) n = 100, q=37, M=2039

  Algorithm       b=10    b=20    b=30
  RR using QR     70.0    78.3    83.3
  RR using EQR    69.3    78.3    83.3
  Ratio           99.0   100.0   100.0

2) n = 250, q=71, M=4093

  Algorithm       b=10    b=20    b=30    b=40    b=50
  RR using QR     65.2    78.9    80.8    84.3    85.2
  RR using EQR    64.1    77.4    80.8    84.3    85.2
  Ratio           98.3    98.1   100.0   100.0   100.0

3) n = 500, q=101, M=8191

  Algorithm       b=10    b=20    b=30    b=40    b=50
  RR using QR     56.7    72.0    78.7    81.7    81.9
  RR using EQR    56.1    71.8    78.2    81.7    81.9
  Ratio           98.9    98.7    99.4   100.0   100.0

Table 5.3 shows the load factors for file (B).

Table 5.3 Load factor for file (B) obtained using RR (percentage)

1) n = 100, q=37, M=2039

  Algorithm       b=10    b=20    b=30
  RR using QR     76.9    83.3    83.3
  RR using EQR    76.9    83.3    83.3
  Ratio          100.0   100.0   100.0

2) n = 250, q=71, M=4093

  Algorithm       b=10    b=20    b=30    b=40    b=50
  RR using QR     71.4    78.1    83.3    89.3    83.3
  RR using EQR    71.4    78.1    83.3    89.3    83.3
  Ratio          100.0   100.0   100.0   100.0   100.0

3) n = 500, q=101, M=8191

  Algorithm       b=10    b=20    b=30    b=40    b=50
  RR using QR     60.2    69.4    79.4    83.3    83.3
  RR using EQR    57.5    69.4    79.4    83.3    83.3
  Ratio           95.5   100.0   100.0   100.0   100.0

We observe that the load factors are almost all above 70%. The average computation time for generating a perfect hashing function is reduced to only a few seconds on a VAX 8600. Thus we conclude that the RR method of finding perfect hashing functions is a practical technique.

6. Universal Classes of Hashing Functions

In order to compare the proposed method with other methods, such as the simple trial and error method, we will examine the probability of a randomly chosen function being perfect. Let P(n, m, b) denote the probability that a randomly chosen function is perfect, where n is the number of keys, m is the number of buckets, and b is the bucket size. A way of computing P(n, m, b) using the following simple recurrence relation is described in [RL88]:

  P(n+1, m, b) = P(n, m, b) - C(n, b) · ((m-1)^{n-b} / m^n) · P(n-b, m-1, b).   (6.1)

However, choosing a function for a trial at random from the set of all functions is clearly impractical. Hence, universal_2 classes of hashing functions were used in [RL88].

Let H be a class of functions which map a set of integers A = {0, 1, 2, ..., a-1} into the set of integers B = {0, 1, 2, ..., m-1}, where a > m. The set A corresponds to the set of keys and B corresponds to the set of addresses. Let x and y be distinct integers, x, y ∈ A, and let h_i ∈ H be a hashing function. We define

  δ_{h_i}(x, y) = 1 if x ≠ y and h_i(x) = h_i(y), and δ_{h_i}(x, y) = 0 otherwise.

If δ_{h_i}(x, y) = 1, then x and y are said to collide under h_i. The class H is said to be a universal_2 class of hashing functions if, for all x, y ∈ A,

  Σ_{h ∈ H} δ_h(x, y) ≤ |H| / m.

This means that H is a universal_2 class of hashing functions if every pair x, y ∈ A collides under no more than |H|/m of the functions. This implies that if a function is chosen randomly from a universal_2 class, the probability of a pair of keys colliding is less than or equal to 1/m.
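Recurrence (6.1) is straightforward to evaluate; the experimental curves in this section are compared against it. A minimal Python sketch (the base cases are our reading of the definition, and the ratio is rearranged to avoid huge intermediate integers):

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def P(n, m, b):
        # probability that a uniformly random function of n keys into
        # m buckets of capacity b is perfect, via recurrence (6.1)
        if n <= b:
            return 1.0            # even a single bucket can hold all keys
        if m <= 1:
            return 0.0            # n > b keys cannot fit in a single bucket
        k = n - 1                 # apply (6.1) with n+1 -> n
        # (m-1)^(k-b) / m^k  ==  ((m-1)/m)^(k-b) / m^b
        return P(k, m, b) - comb(k, b) * ((m - 1) / m) ** (k - b) / m ** b \
               * P(k - b, m - 1, b)

    assert abs(P(2, 2, 1) - 0.5) < 1e-12   # two keys, two unit buckets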
In [RM86], Ramakrishna performed a series of experiments using several test files to determine whether the universal_2 class H_1 functions could be used for selecting functions for trials. He showed that the relative frequency of perfect hashing functions within universal_2 class H_1 is the same as that predicted by the analysis for the set of all functions. However, he did not study the other universal_2 classes, H_2 and H_3. Since class H_2 and H_3 functions do not require multiplication (they need only exclusive-or), they may be convenient for long keys. The experiments here are aimed at examining whether class H_2 and class H_3 functions behave similarly to class H_1.

The universal_2 class H_3 is defined as follows by Carter and Wegman [CW79]. Let A be the set of keys, i-bit binary numbers, and let B be the set of j-bit binary numbers. The set B corresponds to 2^j buckets; the number of buckets has to be a power of two. Let Mt be the set of arrays of length i whose elements are from B; we can regard Mt as i-by-j boolean matrices. For mt ∈ Mt, let mt(k) be the bit string which is the k-th element (row) of the matrix mt, and for x ∈ A, let x_k be the k-th bit of x. Define

  f_mt(x) = x_1·mt(1) ⊕ x_2·mt(2) ⊕ ··· ⊕ x_i·mt(i),

where ⊕ denotes the exclusive-or operation. In other words, this function takes the exclusive-or of those rows of mt that correspond to the 1-bits of the key x. The class H_3 is defined as the set of functions {f_mt | mt ∈ Mt}, over all possible matrices mt. An example of class H_3 is shown below.

Example 6.1

Let the number of buckets be m = 4 and the bucket size b = 3, and consider the set of ten 8-bit keys given in Figure 4.2. A hashing function is found as follows. An 8-by-2 boolean matrix mt is obtained by generating eight random numbers in the range 0 to 3. The following is one such matrix:

  mt(1) 01
  mt(2) 11
  mt(3) 10
  mt(4) 00
  mt(5) 10
  mt(6) 11
  mt(7) 00
  mt(8) 01

Keys 67 and 123 are converted to binary form (01000011)_2 and (01111011)_2, respectively. The hash addresses of these keys are given by

  f_mt(67) = mt(2) ⊕ mt(7) ⊕ mt(8) = 11 ⊕ 00 ⊕ 01 = 10,
  f_mt(123) = mt(2) ⊕ mt(3) ⊕ mt(4) ⊕ mt(5) ⊕ mt(7) ⊕ mt(8) = 11 ⊕ 10 ⊕ 00 ⊕ 10 ⊕ 00 ⊕ 01 = 10.

Keys 67 and 123 are therefore both stored in bucket 2. Similarly, after computing all the hash addresses, we obtain the following hash table.

Table 6.1 A hash table (example of failure)

  bucket#   0               1         2        3
  keys      31 58 142 187   146 198   67 123   154 220

Since the bucket size b is 3, this trial has failed. However, the following mt gives a perfect hashing function:

  mt(1) 01
  mt(2) 00
  mt(3) 10
  mt(4) 11
  mt(5) 00
  mt(6) 01
  mt(7) 10
  mt(8) 11

Using this matrix mt, we obtain the hash table given below.

Table 6.2 A hash table (example of success)

  bucket#   0             1        2         3
  keys      123 146 154   67 187   142 198   31 58 220

The definition of the class H_2 is similar to that of H_3, except that a key is first mapped into a longer bit string. Let A be the set of keys of i-digit numbers written in base α, and for x ∈ A let x_k denote the k-th digit of x. B has the same definition as in H_3. Define g to be the function which maps x into the bit string of length iα which has 1's in positions x_1+1, x_1+x_2+1, x_1+x_2+x_3+1, ..., etc. Then Mt is the set of arrays of length iα whose elements are from B; Mt can be regarded as iα-by-j matrices. We can apply to the iα-bit keys defined above the same operation as in H_3. If H_3 is the class defined above for iα-bit keys, then H_2 = {f∘g | f ∈ H_3}. Although H_2 needs more space than H_3, the computation time of H_2 is less than that of H_3, because the converted keys in H_2 have fewer 1-bits.
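A class H_3 trial is only a few lines of code. The Python sketch below (names illustrative) draws a random matrix as a list of j-bit integers and evaluates f_mt by XOR-ing the rows selected by the 1-bits of the key, bit 1 being the most significant; the final lines replay the first matrix of Example 6.1:

    import random

    def random_h3(i, j, rng=random.Random(1)):
        # a random member of H3: one random j-bit row per key bit
        return [rng.randrange(1 << j) for _ in range(i)]

    def h3(x, rows):
        i = len(rows)
        addr = 0
        for k in range(i):                 # k-th bit of x, MSB first
            if (x >> (i - 1 - k)) & 1:
                addr ^= rows[k]            # XOR in row k+1 of the matrix
        return addr

    # The first matrix of Example 6.1, rows mt(1) .. mt(8):
    mt = [0b01, 0b11, 0b10, 0b00, 0b10, 0b11, 0b00, 0b01]
    assert h3(67, mt) == 2 and h3(123, mt) == 2   # both land in bucket 2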
Example 6.2

When the base is α = 4, the same keys as in Example 6.1 are mapped into strings of length 16. For example, keys 67 and 123 are converted as follows:

  g(67) = g((1003)_4) = 0100100000000000,
  g(123) = g((1323)_4) = 0100101001000000.

We can apply to these converted keys the same operation as in Example 6.1; mt is now a 16-by-2 boolean matrix and is generated similarly.

Results of experiments

A set of experiments was conducted as follows. A set of 100 hashing functions was created randomly by generating random numbers in the range 0 to 2^j - 1. The first n keys, n = αmb, in each of the nine groups of file (A) were hashed by each of the 100 hashing functions, and the number of perfect hashing functions was recorded. n was selected so that the load factor varies from 50% to 100%. This experiment was repeated for various values of m and b. Figure 6.1.1 shows the results of one set of experiments with b = 10 and m = 16, using group 8 of file (A). The solid line is a plot of P(n, 16, 10), computed using the recurrence relation (6.1). The symbols 2 and 3 represent the experimental probabilities obtained with class H_2 and class H_3 hashing functions, respectively. Figure 6.1.2 shows the results of another set of experiments with b = 40 and m = 8, using the fourth group of file (A); the solid line is a plot of P(n, 8, 40). The number of keys varied from 160 to 320. We used the same set of 100 hashing functions for all load factors and each of the nine groups.

Figure 6.1.1 Observed and computed probability of a trial succeeding (bucket size b = 10, number of buckets m = 16, key set: group 8 of file (A); probability versus load factor, 50% to 100%)

Appendix

The intersection procedure of algorithms QR and EQR operates on each set of intervals Ij = {Ij_i = [s_i .. e_i]} and one or two target intervals T, [TS1 .. TE1] and [TS2 .. TE2]:

case of the number of T's being one:
  for each interval Ij_i := [s_i .. e_i] do
    if (TS1 > e_i) or (TE1 < s_i) then
      delete the interval Ij_i := [s_i .. e_i]
    else
      change the interval Ij_i to [max(TS1, s_i) .. min(TE1, e_i)].
    if the set of intervals Ij = ∅ then return (exist := false)
  end for

case of the number of T's being two:
  for each interval Ij_i := [s_i .. e_i] do
    if (TS1 > e_i) or (TE2 < s_i) or ((TE1 < s_i) and (TS2 > e_i)) then
      delete the interval Ij_i := [s_i .. e_i]
    else if s_i ≤ TE1 then
      if e_i ≥ TS2 then
        divide the interval Ij_i into two intervals, [max(TS1, s_i) .. TE1] and [TS2 .. min(TE2, e_i)]
      else
        change the interval Ij_i to [max(TS1, s_i) .. min(TE1, e_i)]
    else
      change the interval Ij_i to [max(TS2, s_i) .. min(TE2, e_i)].
    if the set of intervals Ij = ∅ then return (exist := false)
  end for

return (Ij, exist)
end
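An executable rendering of this intersection step, generalized to any number of interval sets, is sketched below in Python (intervals are assumed to be inclusive integer pairs; names are illustrative):

    def intersect_interval_sets(interval_sets):
        # each element is a list of disjoint inclusive intervals (s, e)
        result = interval_sets[0]
        for js in interval_sets[1:]:
            new = []
            for (s1, e1) in result:
                for (s2, e2) in js:
                    lo, hi = max(s1, s2), min(e1, e2)
                    if lo <= hi:            # the overlapping part survives
                        new.append((lo, hi))
            if not new:
                return []                   # empty: no admissible increment
            result = new
        return result

    # [10..25] intersected with [0..12] u [20..30] -> [10..12] u [20..25]
    assert intersect_interval_sets([[(10, 25)], [(0, 12), (20, 30)]]) \
           == [(10, 12), (20, 25)]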
BIBLIOGRAPHY

[CC80] Cichelli, R.J. Minimal perfect hash functions made simple. Comm. of the ACM 23, 1 (1980), 17-19.

[CH84] Chang, C.C. The study of an ordered minimal perfect hashing scheme. Comm. of the ACM 27, 4 (1984), 384-387.

[CR85] Cormack, G.V., Horspool, R.N.S., and Kaiserswerth, M. Practical perfect hashing. The Computer Journal 28, 1 (1985), 54-58.

[CW79] Carter, J.L. and Wegman, M.N. Universal classes of hash functions. Journal of Computer and System Sciences 18, 2 (1979), 143-154.

[FL68] Feller, W. An Introduction to Probability Theory and Its Applications, Vol. 1. New York: John Wiley, 1968.

[FR82] Fredman, M.L., Komlos, J., and Szemeredi, E. Storing a sparse table with O(1) worst case access time. Proc. 23rd Symposium on Foundations of Computer Science, IEEE Computer Society, 1982, 165-168.

[JS81] Jaeschke, G. Reciprocal hashing: A method for generating minimal perfect hashing functions. Comm. of the ACM 24, 12 (1981), 829-833.

[MR84] Mairson, H.G. The Program Complexity of Searching a Table. Ph.D. Thesis, Department of Computer Science, Stanford University, 1984.

[RL88] Ramakrishna, M.V. and Larson, P.A. File organization using composite perfect hashing. ACM Trans. on Database Systems (to appear). (An earlier version appeared in Proc. 1985 ACM-SIGMOD International Conference on Management of Data, 190-200.)

[RM86] Ramakrishna, M.V. Perfect Hashing for External Files. Ph.D. Thesis, Department of Computer Science, University of Waterloo, Research Report CS-86-25, 1986.

[SP77] Sprugnoli, R. Perfect hashing functions: A single probe retrieving method for static sets. Comm. of the ACM 20, 11 (1977), 841-850.