DIRECT PERFECT HASHING FUNCTIONS FOR EXTERNAL FILES

By

Yuichi Bannai

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Computer Science

1988

ABSTRACT

DIRECT PERFECT HASHING FUNCTIONS FOR EXTERNAL FILES

By Yuichi Bannai

A composite perfect hashing scheme for large external files which guarantees single disk access retrieval has been proposed in [RL88]. It was suggested that direct perfect hashing functions be found by trial and error methods. In this thesis, we explore systematic methods of finding direct perfect hashing functions. Extensions of Sprugnoli's Quotient Reduction and Remainder Reduction methods are proposed. The experimental results indicate that these methods are practical. Also, the performance of different universal_2 classes of hashing functions is studied.

To my loving wife, Namiko

ACKNOWLEDGEMENTS

I wish to express my thanks to my thesis advisor M.V. Ramakrishna for his guidance and encouragement. The other members of my thesis committee, S. Pramanik and M. Chung, provided valuable comments. My thanks are due to E. Vincent, who carefully read this thesis and made many suggestions for improving its readability. I would like to thank my wife Namiko for her patience and love, which made it possible for me to complete this thesis. Finally, I would like to thank my company, Canon Inc., for their constant support during my stay in the United States.

TABLE OF CONTENTS

1. Introduction
2. Background
3. External Perfect Hashing Scheme
4. Quotient Reduction Method
5. Remainder Reduction Method
6. Universal Classes of Hashing Functions
7. Dynamic Behavior
8. Conclusions and Future Work
Appendix
Bibliography

LIST OF TABLES

Table 4.1 Hash table of Example 4.2
Table 4.2 Average load factor of nine groups of file (A)
Table 4.3 Load factor for file (B)
Table 5.1 Probability of having no collisions for a given M and n
Table 5.2 Average load factor of nine groups of file (A) obtained using RR
Table 5.3 Load factor for file (B) obtained using RR
Table 6.1 A hash table (example of failure)
Table 6.2 A hash table (example of success)
Table 7.1 Probability of an insertion causing rehashing using RR
Table 7.2 Probability of an insertion causing rehashing (n = 250)
Table 7.3 Comparison of probability of rehashing

LIST OF FIGURES

Figure 4.1 Admissible increments when b = 2
Figure 4.2 Set of keys for Example 4.1
Figure 4.3 Solution space
Figure 4.4 Solution space for Example 4.2
Figure 6.1.1 Observed and computed probability of a trial succeeding
Figure 6.1.2 Observed and computed probability of a trial succeeding
Figure 6.2.1 Probability of a randomly chosen function being perfect (b = 10)
Figure 6.2.2 Probability of a randomly chosen function being perfect (b = 20)
Figure 7.1.1 Load factor of a group (b = 10)
Figure 7.1.2 Load factor of a group (b = 40)
Figure 7.2.1 Overall load factor (b = 10)
Figure 7.2.2 Overall load factor (b = 40)

1. Introduction

Hashing is an important and widely used technique for organizing files. The storage area is partitioned into a finite number of buckets. Hashing involves computing the address of a data item directly, by evaluating a function (the hashing function) on the search key of the desired record.
The record is then stored in the corresponding bucket. If the bucket is already full, we say an 'overflow' has occurred. A hashing function is said to be 'perfect' if no overflows occur for the given data. There is no need to handle overflows in a perfect hashing scheme, and hence we obtain ideal retrieval performance.

A composite perfect hashing scheme for large external files was proposed in [RL88]. An ordinary hashing function is used to divide the records of the file into a number of groups. The records in each group are stored using direct perfect hashing functions. A simple trial and error method for finding perfect hashing functions is proposed and analyzed. In this scheme any record can be retrieved in a single disk access, and the analysis shows that records can be inserted and deleted at an average cost comparable to that of many traditional hashing schemes. One drawback of this scheme is that there is no guarantee of finding a direct perfect hashing function for a given set of keys, which motivated this thesis. The methods proposed here are extensions of Sprugnoli's Quotient Reduction and Remainder Reduction methods [SP77].

In [RL88], Ramakrishna et al. experimentally showed that the universal_2 class H_1 of hashing functions could be used for making trials: the relative frequency of perfect hashing functions within universal_2 class H_1 is the same as that predicted by the analysis for the set of all functions. However, they did not study the other universal_2 classes, H_2 and H_3. In this thesis, we examine whether classes H_2 and H_3 behave similarly to class H_1.

Outline of the Thesis

The next section presents the background of research into perfect hashing. The external perfect hashing scheme proposed in [RL88] is introduced in section 3. In section 4, the Quotient Reduction method for generating perfect hashing functions is introduced and analyzed. The Remainder Reduction method, which is superior to the Quotient Reduction method, is presented in section 5. The performance of the universal_2 classes H_2 and H_3 is studied in section 6. Section 7 discusses the dynamic behavior of the proposed scheme.

Definitions and Assumptions

A hash table consists of m buckets, and each bucket has a capacity of b records. A set I = {x1, x2, ..., xn} of n search keys is to be stored in the table. A hashing function h : I -> [0 .. m-1] assigns an integer 0 through m-1 to each key. We will consider functions of integer keys in this thesis. A hashing function h is a perfect hashing function for a given key set and table if no bucket receives more than b keys under the function h. We use [p .. q] to denote the interval {p, p+1, p+2, ..., q-1, q} (p < q), the length of which is q-p+1. The load factor (α) is the ratio of the number of keys to the total capacity of the hash table, α = n/(mb).
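These definitions translate directly into a check. The following is a minimal Python sketch (with illustrative names, not from the thesis) that tests whether a given function is perfect for a key set and computes the load factor:

    from collections import Counter

    def is_perfect(h, keys, b):
        # h is perfect for this key set and table iff no bucket
        # receives more than b keys under h
        return max(Counter(h(x) for x in keys).values()) <= b

    def load_factor(n, m, b):
        # alpha = n / (m*b): number of keys over total table capacity
        return n / (m * b)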
2. Background

The perfect hashing schemes dealt with in the literature may be grouped into two classes [RL88]: a) direct perfect hashing; b) composite perfect hashing. Direct perfect hashing functions map a given key into the address where the key and its associated record are stored. A composite perfect hashing scheme involves an additional table look-up to find the desired address, which means that a retrieval involves more than one level of access.

Direct Perfect Hashing

Sprugnoli, who was the first to formally define perfect hashing, proposed a systematic method of finding perfect hashing functions, called the Quotient Reduction method [SP77]. The functions are of the form h(x) = ⌊(x+s)/N⌋, where s and N are constants called the admissible increment and the quotient, respectively. He gave a constructive proof of the existence of a Quotient Reduction perfect hashing function for any given key set. His algorithm finds s and N for a given set of keys I = {x1, x2, ..., xn}. However, his algorithm was restricted to the special case of the bucket size being one. He also proposed the Remainder Reduction method, which yields better storage utilization, especially when the keys are not uniformly distributed. The functions are of the form h(x) = ⌊((xq+d) mod m)/N⌋.

Jaeschke [JS81] proposed Reciprocal Hashing functions of the form h(x) = ⌊C/(Dx+E)⌋ mod n, where C, D, and E are constants. He proved that there always exists a perfect hashing function of this form for any given I. Chang [CH84] proposed a similar scheme based on the Chinese Remainder Theorem. However, these schemes are not of practical value. Please refer to [RM86, MR84] for further details.

Composite Perfect Hashing

Composite hashing is necessary for handling large sets of data. Sprugnoli suggested the idea of segmentation to handle large sets of keys [SP77]: the keys are divided into a number of groups by an ordinary hashing function, so that each group can be handled by direct perfect hashing. Fredman et al. [FR82] constructed perfect hash tables using Sprugnoli's idea. Their scheme guarantees constant time retrieval with O(n) space requirement. However, they did not discuss how to update the table. Cormack et al. [CR85] proposed a similar scheme including algorithms for insertion and deletion.

3. External Perfect Hashing Scheme

The external perfect hashing scheme proposed by Ramakrishna et al. [RL88] attempts to achieve single disk access retrieval at the cost of a small header table in internal memory. They used a composite data structure similar to that proposed by Cormack et al. [CR85]. Direct perfect hashing functions are found by a simple trial and error method. One drawback of this method is that there is no guarantee that perfect hashing functions can always be found for a given key set, although the expected number of trials is small and can be controlled. We investigate systematic methods for finding direct perfect hashing functions for external files, and study properties of the files constructed by the proposed methods, such as storage utilization and frequency of rehashing.

The details of the external perfect hashing scheme are as follows. The set of keys is divided into several groups. The keys in each group are stored in a number of contiguous pages in the secondary memory using a perfect hashing function. An ordinary hashing function H is used to accomplish the grouping. Let key group t denote the set of keys {x | x ∈ the given key set and t = H(x)}. Each entry of the header table in the internal memory is of the form <p_t, m_t, R_t>, where p_t is a pointer to the starting address of group t, m_t is the group size, and R_t denotes the parameters of the perfect hashing function for group t. Any record can be retrieved in a single access of the secondary memory in this scheme. Algorithms for retrieval, insertion, and deletion are given below.

Retrieval of a record with key x:
- Compute the group t to which x belongs by t := H(x).
- Extract <p_t, m_t, R_t> from entry t of the header table.
- Compute the page address of x by A_t := p_t + h(x, R_t).
- Read in page A_t.
- Search page A_t for key x. If key x is not found, the search has failed.
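The retrieval path is short enough to state as executable pseudocode. A minimal Python sketch, with illustrative names that are not from the thesis (the header table is a dict of <p_t, m_t, R_t> tuples, pages are dicts keyed by record key, and read_page stands in for the single secondary-memory access):

    def retrieve(x, H, h, header, read_page):
        t = H(x)                      # group to which x belongs
        p_t, m_t, R_t = header[t]     # entry t of the header table
        A_t = p_t + h(x, R_t)         # page address of x
        page = read_page(A_t)         # the single disk access
        return page.get(x)            # None: key not found, search failed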
Insertion of a record with key x:
- Compute A_t as above and read in page A_t.
- If the page is not full, then insert the record into page A_t and write back the page;
- else {rehashing}:
  - Read in all pages of the group t. The number of pages is m_t.
  - Find a new perfect hashing function for all the keys in the group, including the new key x. The number of pages obtained by the new perfect hashing function, m_t', may or may not be equal to m_t; let R_t' be the parameters of the new perfect hashing function.
  - Find an address p_t' in the secondary storage having space for m_t' contiguous pages (p_t' may be equal to p_t).
  - Store the records on pages p_t', p_t'+1, p_t'+2, ..., p_t'+m_t'-1 using the new perfect hashing function.
  - Update the header table entry of group t to <p_t', m_t', R_t'>.

Deletion of a record with key x:
- Compute A_t as above and read in the page.
- Search page A_t for key x. If key x is found, then delete the record and write back the page; else the desired record is not in the file.
- It may be necessary to rehash the group to avoid low storage utilization. The procedure of rehashing is similar to that for insertion.

4. Quotient Reduction Method

Quotient Reduction hashing functions are of the form h(x) = ⌊(x+s)/N⌋, where s and N are constants. We will show that perfect functions of this form exist for any given bucket size, by giving a constructive proof. The basic idea of this method is to divide the given set of keys into a number of intervals of length N, and to move the boundaries between the intervals by adding the admissible increments, so that each interval, corresponding to a bucket, contains no more than b keys. The following is an extension of Sprugnoli's construction to the general case b > 1.

Admissible increments

Let I = {x1, x2, ..., xn} be the set of keys to be hashed, in sorted order. When the bucket size is b, keys x_i and x_{i+b} must not fall in the same bucket. The set of admissible increments J_i of the key x_i for a given N is the set of integers t_i for which the condition

  ⌊(x_i + t_i)/N⌋ < ⌊(x_{i+b} + t_i)/N⌋,  1 ≤ i ≤ n-b,  t_i ∈ J_i

holds. In other words, an admissible increment for x_i is any translation value which adjusts x_i and x_{i+b} into two different intervals [(p-1)N .. pN-1] and [(q-1)N .. qN-1], where p and q are integers with p < q.

Figure 4.3 depicts the solution space of the pairs (s, N): each row represents the solution space for a different N, with (x1+s) mod N on the horizontal axis. A point for which (x1+s) mod N* > (N*-1) - r mod N* holds lies on the right-hand side of the boundary; the values of s and N (= N*) in this space yield m* buckets.

Figure 4.3 Solution space ((x1+s) mod N versus N = N*, N*-1, ..., N*-6; N* = ⌊r/(m*-2)⌋ denotes the upper bound of N, and m* = ⌈n/b⌉ the minimum number of buckets)

Consider the range of possible N which gives a fixed number of buckets m. Let NL1 and NU1 denote the lower and upper bounds of N, respectively, when {r mod N + (x1+s) mod N} ≤ N-1. They can be obtained as follows. In this case m = ⌊r/N⌋ + 1 and r = (m-1)N + r mod N; since 0 ≤ r mod N ≤ N-1,

  (m-1)N ≤ r ≤ (m-1)N + N-1,
  (r+1)/m ≤ N ≤ r/(m-1).

Since N is an integer, the lower bound NL1 and upper bound NU1 are

  NL1 = ⌈(r+1)/m⌉,  NU1 = ⌊r/(m-1)⌋.

Similarly, when {r mod N + (x1+s) mod N} ≥ N, we have m = ⌊r/N⌋ + 2 and (r+1)/(m-1) ≤ N ≤ r/(m-2), and hence the lower bound NL2 and upper bound NU2 are

  NL2 = ⌈(r+1)/(m-1)⌉,  NU2 = ⌊r/(m-2)⌋.
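The admissibility condition can be checked directly. A small Python sketch (names illustrative; the thesis derives these sets in closed form, and shifts outside [0, N-1] repeat modulo N):

    def admissible_increments(w, i, b, N):
        # shifts t that put sorted keys w[i] and w[i+b] into
        # different length-N intervals
        return {t for t in range(N)
                if (w[i] + t) // N < (w[i + b] + t) // N}

    # Keys of Example 4.2 below, with N = 48 and b = 3: intersecting
    # the sets over all i leaves the increments 17, 18, 19 among the
    # candidates (the boundary condition then selects four buckets).
    w = [31, 58, 67, 123, 142, 146, 154, 187, 198, 220]
    J = set.intersection(*(admissible_increments(w, i, 3, 48)
                           for i in range(len(w) - 3)))
    assert {17, 18, 19} <= J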
Algorithm QR (Quotient Reduction)

We are now ready to introduce our algorithm QR to generate a perfect hashing function for a given set of keys I. The basic idea of this algorithm is to search the solution space of N and s to obtain m* buckets, then m*+1 buckets, and so on.

Algorithm QR

STEP 1 {Initialization}
  Compute the range of the keys: r := xn - x1.
  Compute the minimum number of buckets: m := ⌈n/b⌉.
STEP 2
  Compute the upper and lower bounds of N corresponding to m buckets:
    NU2 := ⌊r/(m-2)⌋,  NL2 := ⌈(r+1)/(m-1)⌉,  NU1 := ⌊r/(m-1)⌋,  NL1 := ⌈(r+1)/m⌉.
  for N := NL1 to NU1 do
    Compute the sets of admissible increments J_i (as described in Proposition 1) with N, including only those elements t in the J_i's such that the condition r mod N + (x1+t) mod N ≤ N-1 holds.
    Take the intersection of the J_i's. If it is not empty, then goto STEP 3.
  end for
  for N := NL2 to NU2 do
    Compute the sets of admissible increments J_i (as described in Proposition 1) with N, including only those elements t in the J_i's such that the condition r mod N + (x1+t) mod N ≥ N holds.
    Take the intersection of the J_i's. If it is not empty, then goto STEP 3.
  end for
  m := m + 1; goto STEP 2.
STEP 3
  s := a value chosen from the intersection set. The perfect hashing function is h(x) = ⌊(x+s)/N⌋.

The range of N that corresponds to m buckets is covered as follows: when NU1 ≥ N ≥ NL1, the condition {r mod N + (x1+t) mod N} ≤ N-1 holds, and when NU2 ≥ N ≥ NL2, the condition {r mod N + (x1+t) mod N} ≥ N holds. If the intersection is not empty, any element of the set is acceptable (all give the same load factor). However, to obtain a more uniform distribution of the hashed keys, choose s* from the intersection set such that

  s* = arg min_j |N - (x1+j) mod N - (xn+j) mod N|.

This choice of s* makes the interval covered by the first bucket, [x1 .. pN-1], as close as possible in length to that of the last bucket, [qN .. xn], where p = h(x1)+1 and q = h(xn). Finally, we need the transformation s = s* - N{(x1+s*) div N}, so that x1 hashes to bucket zero, i.e., ⌊(x1+s)/N⌋ = 0.

Example 4.2

Consider the key set I = {31, 58, 67, 123, 142, 146, 154, 187, 198, 220} with b = 3, n = 10, and the range of keys r = x10 - x1 = 189. The minimum number of buckets is m* = ⌈n/b⌉ = 4. The bounds are:

  NU2 := ⌊r/(m-2)⌋ = ⌊189/(4-2)⌋ = 94,
  NL2 := ⌈(r+1)/(m-1)⌉ = ⌈(189+1)/(4-1)⌉ = 64,
  NU1 := ⌊r/(m-1)⌋ = ⌊189/(4-1)⌋ = 63,
  NL1 := ⌈(r+1)/m⌉ = ⌈(189+1)/4⌉ = 48.

The solution space is shown in Figure 4.4.

Figure 4.4 Solution space for Example 4.2 ((x1+s) mod N versus N)

The search starts with the value N = NL1 = 48, and the set of admissible increments J = {17, 18, 19} is found with N = 48 in STEP 2. We move to STEP 3, and the element 18 is chosen, which is transformed as s = 18 - 48{(31+18) div 48} = -30, so that h(x1) = 0. We thus have the Quotient Reduction perfect hashing function

  h(x_i) = ⌊(x_i - 30)/48⌋.

The hash table for this example is shown below.

Table 4.1 Hash table of Example 4.2

  bucket#   0           1     2              3
  x_i       31 58 67    123   142 146 154    187 198 220
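The search can be prototyped compactly. The Python sketch below is a brute-force stand-in for Algorithm QR: it scans the same bucket counts and quotient ranges derived above, but tries every shift s directly instead of intersecting admissible-increment sets, so it is suitable only for small key ranges (or for the reduced ranges of section 5). On Example 4.2 it recovers N = 48 and a shift equivalent to one of the increments {17, 18, 19}; it does not apply the balancing rule for choosing s*.

    import math
    from collections import Counter

    def qr_search(keys, b):
        # returns (s, N) with h(x) = (x + s) // N perfect and h(w[0]) = 0
        w = sorted(keys)
        r = w[-1] - w[0]
        if len(w) <= b:
            return -w[0], r + 1                   # one bucket suffices
        for m in range(math.ceil(len(w) / b), len(w) + 1):
            lo = math.ceil((r + 1) / m)           # NL1
            hi = r // (m - 2) if m > 2 else r     # NU2 covers both sub-ranges
            for N in range(lo, hi + 1):
                for s in range(N):                # shifts modulo N suffice
                    idx = [(x + s) // N for x in w]
                    if idx[-1] - idx[0] + 1 == m \
                       and max(Counter(idx).values()) <= b:
                        return s - N * ((w[0] + s) // N), N
        return None

    s, N = qr_search([31, 58, 67, 123, 142, 146, 154, 187, 198, 220], 3)
    assert N == 48                                # four buckets, as in Table 4.1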
Efficient QR Algorithm (EQR)

We can see from (4.8) that the upper bound N* of the quotient N is proportional to the range of keys r. The algorithm QR is obviously impractical for a large r, due to the large computation time: every N is examined sequentially. In this section, we aim to reduce the computation time. The basic idea is to consider only the values N = ⌊r/(m-1)⌋ when determining whether an admissible increment exists. We begin with m*+1 buckets. If the search fails, we proceed with m*+2 buckets, and so on. If the search succeeds with N = N_t and m_t buckets, we proceed with m_t - 1 buckets and N = N_t+1 to ⌊r/(m_t-2)⌋. (This is to achieve better storage utilization due to fewer buckets.) Also, an attempt is made to find the smallest N for the value of m obtained, so that we have as uniform a distribution as possible.

Algorithm EQR

STEP 1 {Initialization}
  Compute the range of the keys: r := xn - x1.
  Compute the minimum number of buckets: m := ⌈n/b⌉. {the value m*}
  Compute the upper bound for the search: NU := ⌊r/(m-1)⌋.
  m := m + 1. {starting with m*+1}
STEP 2
  Compute the lower bound for the search: NL := ⌊r/(m-1)⌋.
  Compute the sets of admissible increments J_i with NL, including only those elements t in the J_i's such that the condition r mod NL + (x1+t) mod NL ≤ NL-1 holds.
STEP 3
  Take the intersection of the J_i's.
  if it is not empty then {next searches}
    Store the result temporarily.
    for N := NL+1 to NU do
      Compute the sets of admissible increments J_i with N, including only those elements t in the J_i's such that the condition r mod N + (x1+t) mod N ≤ N-1 holds.
      Take the intersection of the J_i's. If it is not empty, then {search complete} goto STEP 4.
    end for
    {another set of searches}
    Compute the lower bound for these searches: NL2 := ⌈(r+1)/m⌉.
    for N := NL2 to NL-1 do
      Compute the sets of admissible increments J_i with N, including only those elements t in the J_i's such that the condition r mod N + (x1+t) mod N ≤ N-1 holds.
      Take the intersection of the J_i's. If it is not empty, then {search complete} goto STEP 4.
    end for
    Use the temporarily stored result of the intersection and NL; goto STEP 4.
  else
    m := m + 1. {next number of buckets}
    NU := ⌊r/(m-2)⌋. {next upper bound}
    goto STEP 2.
STEP 4
  s := a value chosen from the intersection set. The perfect hashing function is h(x) = ⌊(x+s)/N⌋.

In STEP 2, NL is set to the maximum value of N corresponding to the current number of buckets m. The range of the admissible increments to be examined is calculated by the same method as in algorithm QR. In STEP 3, a search is made for the quotient NL. If admissible increments exist, they are stored temporarily and further searches are made in the range NL+1 through NU, aiming to save one bucket. If the first search fails, the next number of buckets and the next upper bound NU are set, and control is transferred to STEP 2. If one of the searches succeeds, we go to STEP 4 and its result is used there. If none of them succeeds, another set of searches is made for N = NL2 to NL-1. (Although we have found admissible increments for NL, we want the smallest N that yields the same number of buckets as NL.) The lower bound of these searches, NL2, which corresponds to m buckets, is computed as NL2 := ⌈(r+1)/m⌉. If one of these searches succeeds, the result of the first successful search of this set is used in STEP 4; otherwise, the result of the very first successful search (N = NL) is used. In STEP 4, the value of s is chosen by the same method as in algorithm QR.

Example 4.3

For the same set of keys as in Example 4.2, EQR examines whether admissible increments exist with NL = ⌊r/(m-1)⌋ = ⌊189/(4-1)⌋ = 63. The search fails for NL = 63, and the next value of NL is set to ⌊189/(5-1)⌋ = 47. This search succeeds, and we obtain a perfect hashing function which requires five buckets. The next set of searches, over the range of N corresponding to four buckets, is made from N = 48 to 63. A set of admissible increments J = {17, 18, 19} is found in STEP 3, and the perfect hashing function

  h(x) = ⌊(x - 30)/48⌋

is obtained. (This hashing function is the same as that in Example 4.2.)
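The search order is easier to see in code. The Python sketch below mirrors EQR's two main stages under the same brute-force simplification as the QR sketch above (it omits EQR's extra downward pass for the smallest N and the balanced choice of s; all names are illustrative). As in Example 4.3, stage 1 succeeds with five buckets at N = 47, and stage 2 then finds the four-bucket function with N = 48.

    from collections import Counter
    from math import ceil

    def eqr_search(keys, b):
        w = sorted(keys)
        r = w[-1] - w[0]

        def fits(N, s, m):                # exactly m buckets, none over b keys
            idx = [(x + s) // N for x in w]
            return idx[-1] - idx[0] + 1 == m and max(Counter(idx).values()) <= b

        def norm(N, s):                   # shift so that h(w[0]) == 0
            return s - N * ((w[0] + s) // N)

        m = ceil(len(w) / b) + 1          # stage 1: start with m* + 1 buckets
        while True:
            N = r // (m - 1)              # probe only this single quotient
            hit = next((s for s in range(N) if fits(N, s, m)), None)
            if hit is not None:
                break
            m += 1
        for N2 in range(N + 1, r // (m - 2) + 1):   # stage 2: save one bucket
            s2 = next((s for s in range(N2) if fits(N2, s, m - 1)), None)
            if s2 is not None:
                return norm(N2, s2), N2
        return norm(N, hit), N

    assert eqr_search([31, 58, 67, 123, 142, 146, 154, 187, 198, 220], 3)[1] == 48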
Complexity of the Algorithms QR and EQR

Taking the intersection in QR and EQR is rather straightforward, and its complexity is O(n^2) (please refer to the Appendix). In both algorithms the time complexity is determined by the loop over N; in the worst case, the total number of executions of these loops is proportional to the upper bound of N, ⌊r/(⌈n/b⌉-2)⌋. Thus the complexity of these algorithms is O(rbn); the difference between the two algorithms is the constant of proportionality. Although the constant is small, the complexity is not practically acceptable in many cases. In the Remainder Reduction method discussed next, we control r, using a modulo operation, and hence the complexity. It is clear that the space requirement of these algorithms is O(n).

Results of the experiments

To test our algorithms, we conducted several experiments using the following real life files:

(A) Userids from the IBM System at the University of Waterloo (12,000 keys);
(B) Userids from the msudoc system at MSU (600 keys).

All the data was in alphanumeric form, and the length of individual keys varied from 2 to 25 characters. The keys were converted into unique two-byte integers by the RADIX_Convert method; details of this method are described in [RM86]. The keys in each file were divided into g groups using hashing functions of the form H(x) = (((cx+d) mod p) mod g), where p is a large prime number. The hashing function used to separate the groups is

  H(x) = (((314559x + 27182) mod 65521) mod 9).   (4.9)

For each of the nine groups obtained, direct perfect hashing functions were generated using algorithms QR and EQR. The average load factors obtained for the nine groups of file (A) are shown in Table 4.2. The third row of each sub-table gives the ratio of the average load factor obtained by EQR to that obtained by QR.

Table 4.2 Average load factor of nine groups of file (A) (percentage)

1) n = 100

  Algorithm   b=10    b=20    b=30
  QR          69.5    82.0    83.3
  EQR         67.8    82.0    83.3
  Ratio       97.6   100.0   100.0

2) n = 250

  Algorithm   b=10    b=20    b=30    b=40    b=50
  QR          55.1    69.6    80.1    80.6    82.0
  EQR         52.1    68.1    78.5    80.6    82.0
  Ratio       94.6    97.8    98.0   100.0   100.0

3) n = 500

  Algorithm   b=10    b=20    b=30    b=40    b=50
  QR          50.3    70.5    78.0    81.0    84.1
  EQR         47.9    69.7    77.6    80.4    83.3
  Ratio       95.2    98.9    99.5    99.3    99.0

Perfect hashing functions were generated for the keys in file (B) without any partitioning. Table 4.3 shows the load factors for these keys.

Table 4.3 Load factor for file (B) (percentage)

1) n = 100

  Algorithm   b=10    b=20    b=30
  QR          71.4    83.3    83.3
  EQR         71.4    83.3    83.3
  Ratio      100.0   100.0   100.0

2) n = 250

  Algorithm   b=10    b=20    b=30    b=40    b=50
  QR          61.0    65.8    69.4    78.1    83.3
  EQR         61.0    65.8    69.4    78.1    83.3
  Ratio      100.0   100.0   100.0   100.0   100.0

3) n = 500

  Algorithm   b=10    b=20    b=30    b=40    b=50
  QR          42.4    75.8    83.3    83.3    83.3
  EQR         42.0    75.8    83.3    83.3    83.3
  Ratio       99.1   100.0   100.0   100.0   100.0

These tables show that a higher b gives better storage utilization: we obtained load factors over 80% for large bucket sizes, versus about 40% to 70% for small bucket sizes. For large bucket sizes the load factor remains almost the same as the group size increases, whereas for small bucket sizes it falls clearly as the group size increases. The ratio of the average load factor obtained by EQR to that obtained by QR is close to 100%. However, the computation time of EQR is not practical, even though EQR is faster than QR.
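For reference, the grouping step (4.9) used to prepare these experiments amounts to the following Python sketch (the key list name is a placeholder for the converted integer keys, not an identifier from the thesis):

    from collections import defaultdict

    def H(x, c=314559, d=27182, p=65521, g=9):
        # grouping function (4.9): ((c*x + d) mod p) mod g, with p prime
        return ((c * x + d) % p) % g

    groups = defaultdict(list)            # groups[t] is key group t
    for x in converted_keys:              # 'converted_keys' is assumed given
        groups[H(x)].append(x)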
5. Remainder Reduction Method

The Quotient Reduction method described in the previous section has two major problems:

a) low storage utilization when the keys are not uniformly distributed;
b) large computation time.

The Remainder Reduction method overcomes these two problems. The basic idea of this method is to scramble the set of keys to obtain a more uniform distribution within a narrow range, followed by Quotient Reduction perfect hashing. Since we can control r in the Remainder Reduction method, the computation time can be controlled. Remainder Reduction perfect hashing functions are of the form

  h(x_i) = ⌊((q·x_i) mod M + s)/N⌋,   (5.1)

where q, M, s, N are constants to be determined. The transformation

  x_i' = (q·x_i) mod M   (5.2)

accomplishes the scrambling mentioned above, by choosing q and M appropriately.

The transformation (5.2) may cause a collision, i.e., two or more distinct primary keys may be transformed into identical keys. The probability p of no collision occurring under the transformation can be obtained as follows [FL68]:

  p = M(M-1)(M-2)···(M-n+1) / M^n = (1 - 1/M)(1 - 2/M)···(1 - (n-1)/M).

For small positive x we have log(1-x) ≈ -x, and 1 + 2 + ··· + (n-1) = n(n-1)/2, so log p ≈ -n(n-1)/(2M). We obtain

  p ≈ e^{-n(n-1)/(2M)}.   (5.3)

The probabilities of having no collisions for n = 500, 250, and 100 are shown in Table 5.1.

Table 5.1 Probability of having no collisions for a given M and n

  M                 n=500   n=250   n=100
  2^16 (65536)      0.15    0.62    0.93
  2^17              0.39    0.79    0.96
  2^18              0.62    0.89    0.98
  2^19              0.79    0.94    0.99
  2^20              0.89    0.99    1.00
  2^21 (2097152)    0.94    0.99    1.00

If we want to avoid collisions, we have to select a suitably large M; for example, M should be greater than 2^21 when n = 500. However, with such a large M we do not gain the advantages of the Remainder Reduction method mentioned above. Instead, we can tolerate some collisions caused by the transformation (5.2): if the maximum number of identical keys is less than or equal to the bucket size b, then in general there still exists a Quotient Reduction perfect hashing function. On the other hand, the probability that more than b keys collide under this transformation is almost zero. Let P(α, m, b) denote the probability that none of the urns contains more than b balls when n (n = αbm) balls are tossed into m urns. An approximate formula is derived in [RL88]:

  P(α, m, b) ≈ e^{-m·Pov(α, b)},

where

  Pov(α, b) = Σ_{i=b+1}^{n} e^{-bα}(bα)^i / i! ≈ ((bα)^{b+1} e^{-bα} / (b+1)!) · (b+2)/(b(1-α)+2).

For example, when n = 500, m = 500, and b = 10, the probability that more than b keys collide is given by

  1 - P(0.1, 500, 10) ≈ 1 - e^{-500·Pov(0.1, 10)} ≈ 1 - e^{-5.03×10^-6},

a negligibly small quantity.

Now we are ready to introduce our Remainder Reduction algorithm. To implement this method, it is necessary to apply the transformation x_i' = (q·x_i) mod M to each x_i, 1 ≤ i ≤ n, and to sort the results before applying the Quotient Reduction method. Let {w1, w2, ..., wn} denote the sorted keys obtained from {x1', x2', ..., xn'}. q and M are chosen to be prime numbers.

Algorithm RR

STEP 1
  Transform the keys {x1, x2, ..., xn} into {x1', x2', ..., xn'} using x_i' = (q·x_i) mod M.
  Sort the keys {x1', x2', ..., xn'} in non-decreasing order, obtaining the sorted keys {w1, w2, ..., wn}.
STEP 2
  Apply algorithm QR to the sorted keys: QR(n, {w1, w2, ..., wn}); or
  apply algorithm EQR to the sorted keys: EQR(n, {w1, w2, ..., wn}).
  The perfect hashing function is h(x_i) = ⌊((q·x_i) mod M + s)/N⌋.

The complexity of the Remainder Reduction algorithm is O(Mbn), where M is the modulus of the transformation (5.2): since the range of the keys is given by M in this method, r in the analysis of QR is replaced by M. The space requirement is O(n).
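Combined with the brute-force QR sketch from section 4, the whole pipeline fits in a few lines of Python. This is a sketch, not the thesis implementation: it assumes qr_search as defined earlier, and it raises an error in the rare case that more than b keys collide under (5.2):

    from collections import Counter

    def rr_search(keys, b, q, M):
        # scramble into the narrow range [0, M-1], then quotient-reduce
        scrambled = sorted((q * x) % M for x in keys)
        if max(Counter(scrambled).values()) > b:
            raise ValueError("more than b keys collide: change q or enlarge M")
        s, N = qr_search(scrambled, b)
        return q, M, s, N       # parameters of h(x) = ((q*x) mod M + s) // N

    # e.g. the keys of Example 4.2 with q = 37, M = 2039 (values of Table 5.2):
    # q, M, s, N = rr_search([31, 58, 67, 123, 142, 146, 154, 187, 198, 220],
    #                        3, 37, 2039)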
Results of experiments

For each of the nine groups obtained by the hashing function (4.9), perfect hashing functions were generated using algorithm RR with the QR and with the EQR. The average load factors for the nine groups of file (A) are shown in Table 5.2. Several values of q were used in this experiment, and we observed that the value of q hardly affected the results.

Table 5.2 Average load factor of nine groups of file (A) obtained using RR (percentage)

1) n = 100, q=37, M=2039

  Algorithm       b=10    b=20    b=30
  RR using QR     70.0    78.3    83.3
  RR using EQR    69.3    78.3    83.3
  Ratio           99.0   100.0   100.0

2) n = 250, q=71, M=4093

  Algorithm       b=10    b=20    b=30    b=40    b=50
  RR using QR     65.2    78.9    80.8    84.3    85.2
  RR using EQR    64.1    77.4    80.8    84.3    85.2
  Ratio           98.3    98.1   100.0   100.0   100.0

3) n = 500, q=101, M=8191

  Algorithm       b=10    b=20    b=30    b=40    b=50
  RR using QR     56.7    72.0    78.7    81.7    81.9
  RR using EQR    56.1    71.8    78.2    81.7    81.9
  Ratio           98.9    98.7    99.4   100.0   100.0

Table 5.3 shows the load factors for file (B).

Table 5.3 Load factor for file (B) obtained using RR (percentage)

1) n = 100, q=37, M=2039

  Algorithm       b=10    b=20    b=30
  RR using QR     76.9    83.3    83.3
  RR using EQR    76.9    83.3    83.3
  Ratio          100.0   100.0   100.0

2) n = 250, q=71, M=4093

  Algorithm       b=10    b=20    b=30    b=40    b=50
  RR using QR     71.4    78.1    83.3    89.3    83.3
  RR using EQR    71.4    78.1    83.3    89.3    83.3
  Ratio          100.0   100.0   100.0   100.0   100.0

3) n = 500, q=101, M=8191

  Algorithm       b=10    b=20    b=30    b=40    b=50
  RR using QR     60.2    69.4    79.4    83.3    83.3
  RR using EQR    57.5    69.4    79.4    83.3    83.3
  Ratio           95.5   100.0   100.0   100.0   100.0

We observe that the load factors are almost all above 70%. The average computation time for generating a perfect hashing function is reduced to only a few seconds on a VAX 8600. Thus we conclude that the RR method of finding perfect hashing functions is a practical technique.

6. Universal Classes of Hashing Functions

In order to compare the proposed method with other methods, such as the simple trial and error method, we will examine the probability of a randomly chosen function being perfect. Let P(n, m, b) denote the probability that a randomly chosen function is perfect, where n is the number of keys, m is the number of buckets, and b is the bucket size. A way of computing P(n, m, b) using the following simple recurrence relation is described in [RL88]:

  P(n+1, m, b) = P(n, m, b) - C(n, b) · ((m-1)^{n-b} / m^n) · P(n-b, m-1, b).   (6.1)

However, choosing a function for a trial at random from the set of all functions is clearly impractical. Hence, universal_2 classes of hashing functions were used in [RL88].

Let H be a class of functions which map a set of integers A = {0, 1, 2, ..., a-1} into the set of integers B = {0, 1, 2, ..., m-1}, where a > m. The set A corresponds to the set of keys and B corresponds to the set of addresses. Let x and y be distinct integers, x, y ∈ A, and let h_i ∈ H be a hashing function. We define

  δ_{h_i}(x, y) = 1 if x ≠ y and h_i(x) = h_i(y), and δ_{h_i}(x, y) = 0 otherwise.

If δ_{h_i}(x, y) = 1, then x and y are said to collide under h_i. The class H is said to be a universal_2 class of hashing functions if, for all x, y ∈ A,

  Σ_{h ∈ H} δ_h(x, y) ≤ |H| / m.

This means that H is a universal_2 class of hashing functions if every pair x, y ∈ A collides under no more than |H|/m of the functions. This implies that if a function is chosen randomly from a universal_2 class, the probability of a pair of keys colliding is less than or equal to 1/m.
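Recurrence (6.1) is straightforward to evaluate; the experimental curves in this section are compared against it. A minimal Python sketch (the base cases are our reading of the definition, and the ratio is rearranged to avoid huge intermediate integers):

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def P(n, m, b):
        # probability that a uniformly random function of n keys into
        # m buckets of capacity b is perfect, via recurrence (6.1)
        if n <= b:
            return 1.0            # even a single bucket can hold all keys
        if m <= 1:
            return 0.0            # n > b keys cannot fit in a single bucket
        k = n - 1                 # apply (6.1) with n+1 -> n
        # (m-1)^(k-b) / m^k  ==  ((m-1)/m)^(k-b) / m^b
        return P(k, m, b) - comb(k, b) * ((m - 1) / m) ** (k - b) / m ** b \
               * P(k - b, m - 1, b)

    assert abs(P(2, 2, 1) - 0.5) < 1e-12   # two keys, two unit buckets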
In [RM86], Ramakrishna performed a series of experiments using several test files to determine whether the universal_2 class H_1 functions could be used for selecting functions for trials. He showed that the relative frequency of perfect hashing functions within universal_2 class H_1 is the same as that predicted by the analysis for the set of all functions. However, he did not study the other universal_2 classes, H_2 and H_3. Since class H_2 and H_3 functions do not require multiplication (they need only exclusive-or), they may be convenient for long keys. The experiments here are aimed at examining whether class H_2 and class H_3 functions behave similarly to class H_1.

The universal_2 class H_3 is defined as follows by Carter and Wegman [CW79]. Let A be the set of keys, i-bit binary numbers, and let B be the set of j-bit binary numbers. The set B corresponds to 2^j buckets; the number of buckets has to be a power of two. Let Mt be the set of arrays of length i whose elements are from B; we can regard Mt as i-by-j boolean matrices. For mt ∈ Mt, let mt(k) be the bit string which is the k-th element (row) of the matrix mt, and for x ∈ A, let x_k be the k-th bit of x. Define

  f_mt(x) = x_1·mt(1) ⊕ x_2·mt(2) ⊕ ··· ⊕ x_i·mt(i),

where ⊕ denotes the exclusive-or operation. In other words, this function takes the exclusive-or of those rows of mt that correspond to the 1-bits of the key x. The class H_3 is defined as the set of functions {f_mt | mt ∈ Mt}, over all possible matrices mt. An example of class H_3 is shown below.

Example 6.1

Let the number of buckets be m = 4 and the bucket size b = 3, and consider the set of ten 8-bit keys given in Figure 4.2. A hashing function is found as follows. An 8-by-2 boolean matrix mt is obtained by generating eight random numbers in the range 0 to 3. The following is one such matrix:

  mt(1) 01
  mt(2) 11
  mt(3) 10
  mt(4) 00
  mt(5) 10
  mt(6) 11
  mt(7) 00
  mt(8) 01

Keys 67 and 123 are converted to binary form (01000011)_2 and (01111011)_2, respectively. The hash addresses of these keys are given by

  f_mt(67) = mt(2) ⊕ mt(7) ⊕ mt(8) = 11 ⊕ 00 ⊕ 01 = 10,
  f_mt(123) = mt(2) ⊕ mt(3) ⊕ mt(4) ⊕ mt(5) ⊕ mt(7) ⊕ mt(8) = 11 ⊕ 10 ⊕ 00 ⊕ 10 ⊕ 00 ⊕ 01 = 10.

Keys 67 and 123 are therefore both stored in bucket 2. Similarly, after computing all the hash addresses, we obtain the following hash table.

Table 6.1 A hash table (example of failure)

  bucket#   0               1         2        3
  keys      31 58 142 187   146 198   67 123   154 220

Since the bucket size b is 3, this trial has failed. However, the following mt gives a perfect hashing function:

  mt(1) 01
  mt(2) 00
  mt(3) 10
  mt(4) 11
  mt(5) 00
  mt(6) 01
  mt(7) 10
  mt(8) 11

Using this matrix mt, we obtain the hash table given below.

Table 6.2 A hash table (example of success)

  bucket#   0             1        2         3
  keys      123 146 154   67 187   142 198   31 58 220

The definition of the class H_2 is similar to that of H_3, except that a key is first mapped into a longer bit string. Let A be the set of keys of i-digit numbers written in base α, and for x ∈ A let x_k denote the k-th digit of x. B has the same definition as in H_3. Define g to be the function which maps x into the bit string of length iα which has 1's in positions x_1+1, x_1+x_2+1, x_1+x_2+x_3+1, ..., etc. Then Mt is the set of arrays of length iα whose elements are from B; Mt can be regarded as iα-by-j matrices. We can apply to the iα-bit keys defined above the same operation as in H_3. If H_3 is the class defined above for iα-bit keys, then H_2 = {f∘g | f ∈ H_3}. Although H_2 needs more space than H_3, the computation time of H_2 is less than that of H_3, because the converted keys in H_2 have fewer 1-bits.
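A class H_3 trial is only a few lines of code. The Python sketch below (names illustrative) draws a random matrix as a list of j-bit integers and evaluates f_mt by XOR-ing the rows selected by the 1-bits of the key, bit 1 being the most significant; the final lines replay the first matrix of Example 6.1:

    import random

    def random_h3(i, j, rng=random.Random(1)):
        # a random member of H3: one random j-bit row per key bit
        return [rng.randrange(1 << j) for _ in range(i)]

    def h3(x, rows):
        i = len(rows)
        addr = 0
        for k in range(i):                 # k-th bit of x, MSB first
            if (x >> (i - 1 - k)) & 1:
                addr ^= rows[k]            # XOR in row k+1 of the matrix
        return addr

    # The first matrix of Example 6.1, rows mt(1) .. mt(8):
    mt = [0b01, 0b11, 0b10, 0b00, 0b10, 0b11, 0b00, 0b01]
    assert h3(67, mt) == 2 and h3(123, mt) == 2   # both land in bucket 2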
Example 6.2

When the base is α = 4, the same keys as in Example 6.1 are mapped into strings of length 16. For example, keys 67 and 123 are converted as follows:

  g(67) = g((1003)_4) = 0100100000000000,
  g(123) = g((1323)_4) = 0100101001000000.

We can apply to these converted keys the same operation as in Example 6.1; mt is now a 16-by-2 boolean matrix and is generated similarly.

Results of experiments

A set of experiments was conducted as follows. A set of 100 hashing functions was created randomly by generating random numbers in the range 0 to 2^j - 1. The first n keys, n = αmb, in each of the nine groups of file (A) were hashed by each of the 100 hashing functions, and the number of perfect hashing functions was recorded. n was selected so that the load factor varies from 50% to 100%. This experiment was repeated for various values of m and b. Figure 6.1.1 shows the results of one set of experiments with b = 10 and m = 16, using group 8 of file (A). The solid line is a plot of P(n, 16, 10), computed using the recurrence relation (6.1). The symbols 2 and 3 represent the experimental probabilities obtained with class H_2 and class H_3 hashing functions, respectively. Figure 6.1.2 shows the results of another set of experiments with b = 40 and m = 8, using the fourth group of file (A); the solid line is a plot of P(n, 8, 40). The number of keys varied from 160 to 320. We used the same set of 100 hashing functions for all load factors and each of the nine groups.

Figure 6.1.1 Observed and computed probability of a trial succeeding (bucket size b = 10, number of buckets m = 16, key set: group 8 of file (A); probability versus load factor, 50% to 100%)

Appendix

The intersection procedure of algorithms QR and EQR operates on each set of intervals Ij = {Ij_i = [s_i .. e_i]} and one or two target intervals T, [TS1 .. TE1] and [TS2 .. TE2]:

case of the number of T's being one:
  for each interval Ij_i := [s_i .. e_i] do
    if (TS1 > e_i) or (TE1 < s_i) then
      delete the interval Ij_i := [s_i .. e_i]
    else
      change the interval Ij_i to [max(TS1, s_i) .. min(TE1, e_i)].
    if the set of intervals Ij = ∅ then return (exist := false)
  end for

case of the number of T's being two:
  for each interval Ij_i := [s_i .. e_i] do
    if (TS1 > e_i) or (TE2 < s_i) or ((TE1 < s_i) and (TS2 > e_i)) then
      delete the interval Ij_i := [s_i .. e_i]
    else if s_i ≤ TE1 then
      if e_i ≥ TS2 then
        divide the interval Ij_i into two intervals, [max(TS1, s_i) .. TE1] and [TS2 .. min(TE2, e_i)]
      else
        change the interval Ij_i to [max(TS1, s_i) .. min(TE1, e_i)]
    else
      change the interval Ij_i to [max(TS2, s_i) .. min(TE2, e_i)].
    if the set of intervals Ij = ∅ then return (exist := false)
  end for

return (Ij, exist)
end
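An executable rendering of this intersection step, generalized to any number of interval sets, is sketched below in Python (intervals are assumed to be inclusive integer pairs; names are illustrative):

    def intersect_interval_sets(interval_sets):
        # each element is a list of disjoint inclusive intervals (s, e)
        result = interval_sets[0]
        for js in interval_sets[1:]:
            new = []
            for (s1, e1) in result:
                for (s2, e2) in js:
                    lo, hi = max(s1, s2), min(e1, e2)
                    if lo <= hi:            # the overlapping part survives
                        new.append((lo, hi))
            if not new:
                return []                   # empty: no admissible increment
            result = new
        return result

    # [10..25] intersected with [0..12] u [20..30] -> [10..12] u [20..25]
    assert intersect_interval_sets([[(10, 25)], [(0, 12), (20, 30)]]) \
           == [(10, 12), (20, 25)]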
BIBLIOGRAPHY

[CC80] Cichelli, R.J. Minimal perfect hash functions made simple. Comm. of the ACM 23, 1 (1980), 17-19.

[CH84] Chang, C.C. The study of an ordered minimal perfect hashing scheme. Comm. of the ACM 27, 4 (1984), 384-387.

[CR85] Cormack, G.V., Horspool, R.N.S., and Kaiserswerth, M. Practical perfect hashing. The Computer Journal 28, 1 (1985), 54-58.

[CW79] Carter, J.L. and Wegman, M.N. Universal classes of hash functions. Journal of Computer and System Sciences 18, 2 (1979), 143-154.

[FL68] Feller, W. An Introduction to Probability Theory and Its Applications, Vol. 1. New York: John Wiley, 1968.

[FR82] Fredman, M.L., Komlos, J., and Szemeredi, E. Storing a sparse table with O(1) worst case access time. Proc. 23rd Symposium on Foundations of Computer Science, IEEE Computer Society, 1982, 165-168.

[JS81] Jaeschke, G. Reciprocal hashing: A method for generating minimal perfect hashing functions. Comm. of the ACM 24, 12 (1981), 829-833.

[MR84] Mairson, H.G. The Program Complexity of Searching a Table. Ph.D. Thesis, Department of Computer Science, Stanford University, 1984.

[RL88] Ramakrishna, M.V. and Larson, P.A. File organization using composite perfect hashing. ACM Trans. on Database Systems (to appear). (An earlier version appeared in Proc. 1985 ACM-SIGMOD International Conference on Management of Data, 190-200.)

[RM86] Ramakrishna, M.V. Perfect Hashing for External Files. Ph.D. Thesis, Department of Computer Science, University of Waterloo, Research Report CS-86-25, 1986.

[SP77] Sprugnoli, R. Perfect hashing functions: A single probe retrieving method for static sets. Comm. of the ACM 20, 11 (1977), 841-850.