This is to certify that the dissertation entitled PARALLEL HASH JOIN WITH SKEW HANDLING ON MULTIPROCESSOR SYSTEMS presented by Walid Rifaat Tout has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

Major professor

Date: 9/8/94

PARALLEL HASH JOIN WITH SKEW HANDLING ON MULTIPROCESSOR SYSTEMS

By
Walid R. Tout

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Computer Science
1993

ABSTRACT

PARALLEL HASH JOIN WITH SKEW HANDLING ON MULTIPROCESSOR SYSTEMS

By Walid R. Tout

The sizes of databases have been growing at an exceedingly fast rate in the past years and this trend is expected to continue for years to come. Much research has focused on applying parallel processing to these increasingly large databases. Of all database operations, the relational join is considered the most time-consuming operation. Hence, it has been used in the research community as the standard measure of parallel database system performance.

The parallel database architectures that have been extensively studied are the Shared-Everything, where processors share all memory and disks, and the Shared-Nothing, where each processor has its own memory and disk and all inter-processor communication is done by message passing. Shared-Everything systems have the ability to efficiently perform load balancing but were traditionally limited to twenty or thirty nodes. Shared-Nothing systems can have hundreds of nodes but perform poorly with uneven loads.

Multiprocessor systems based on the Non-Uniform Memory Architecture (NUMA) feature up to hundreds of nodes and provide a shared global memory. We investigate this architecture for the join operation and propose two main memory join algorithms that exploit this shared architecture well. We then introduce a dynamic load balancing scheme for the join algorithms to deal with uneven loads. We model both join algorithms and the load balancing scheme analytically and perform experiments on commercially available NUMA multiprocessors.

We investigate a multiprocessor system based on the NUMA architecture where nodes represent clusters of processors. We present a join algorithm with load balancing for this system and show that the architecture is well suited for parallel join processing. The performance of the system is evaluated by analytical modeling, simulation and actual experimentation.
With this in hand, we study the effects of key system parameters, such as network and I/O bandwidths, processor speed and cluster size, on performance. We show that the network bandwidth is a key limiting factor to increasing the system size, especially with small cluster sizes. We also show that systems based on this architecture perform better than Shared-Nothing systems. The reason is mainly that the clusters maintain a high level of locality and that load balancing is facilitated by the presence of shared memory.

In the name of God, most Merciful, most Beneficent

ACKNOWLEDGMENTS

First and foremost, my thanks go to my parents Rifaat Tout and Hafsa Hassan for all the love and patience they have shown through my long years of study. I just hope that I will be able to repay a small fraction of what I owe them. I owe a great deal to my wife, Nada Hoblos, for all that she put up with during the course of my study, and also to her family for their understanding and patience and for the precious gift they honored me with. My thanks go also to the rest of my family: my brothers Hicham and Samir, for being by my side when I needed them (and when I did not), and my sisters Aman, Mouna and Zeina, for their encouragement. And last but not least, to my baby girl Wala, whose coming into this world may have finally given me the boost I needed to wrap things up.

I would not have been able to complete the Ph.D. program if it were not for the help and guidance of my thesis advisor, Dr. Sakti Pramanik. My sincere thanks go also to my co-chair, Dr. Carl V. Page. I owe both of them for their time and patience. I would also like to thank the members of the committee, Dr. Lionel M. Ni, Dr. Matt Mutka, Dr. Jacob Plotkin and Dr. Gerald Ludden, for their valuable comments on the thesis.

Over the years, many friends, officemates and many others not listed here have all encouraged me and given me advice on various issues. Thanks go to all of them. Finally, I would like to thank all my brothers at the Center for being there when I needed them. I would especially like to thank Omar Soubani, Amr Azim, Amin Maklai, Iyad Saad and Naji Al-Arfaj for their support, and Mohamad Naja and Khaldoun Rayes for their friendship and advice.

TABLE OF CONTENTS

LIST OF TABLES  ix
LIST OF FIGURES  x
1 INTRODUCTION  1
  1.1 The need for parallelism in database systems  1
  1.2 The relational join operation  3
    1.2.1 Nested-Loops Join Algorithm  3
    1.2.2 Sort-Merge Join Algorithm  4
    1.2.3 Hash-based Join Algorithm  4
    1.2.4 General Scheme for Parallel Join  5
  1.3 Data skew types and effects  6
  1.4 Dissertation Outline  8
2 PREVIOUS WORK ON PARALLEL JOIN ALGORITHMS  10
  2.1 Early work  10
    2.1.1 DIRECT  12
    2.1.2 MICRONET  13
    2.1.3 GRACE  14
    2.1.4 Teradata's DBC/1012  15
  2.2 Recent work on parallel database systems  16
    2.2.1 XPRS  18
    2.2.2 Volcano  18
    2.2.3 BUBBA  19
    2.2.4 GAMMA  19
  2.3 Parallel join and load balancing  21
    2.3.1 Bucket-converging and bucket-spreading join  22
    2.3.2 Adaptive load balancing join  22
    2.3.3 Scheduling-hash join  23
    2.3.4 Load balancing hash join  24
    2.3.5 Load balancing in Shared Virtual Memory  25
    2.3.6 A hybrid system  26
3 MAIN MEMORY HASH-BASED JOIN ALGORITHMS FOR MULTIPROCESSORS WITH NUMA ARCHITECTURE  30
  3.1 Introduction  30
  3.2 The Multiprocessor Architecture Model  31
  3.3 Parallel Join on NUMA Architecture  34
    3.3.1 Distributed Hash Join on NUMA Architecture  36
    3.3.2 Full Replication Join  39
  3.4 Analytical Models  40
    3.4.1 Modeling of processors' loads  43
    3.4.2 Cost Formulation for DHJ  50
    3.4.3 Cost Formulation for FRJ  51
  3.5 Performance Evaluation  52
    3.5.1 Benchmarking relations  52
    3.5.2 Performance comparison  53
  3.6 Data Skew  57
    3.6.1 Zipfian Distribution  58
    3.6.2 Effects of Skew on the Performance of DHJ  60
4 DISTRIBUTED LOAD BALANCING FOR PARALLEL MAIN MEMORY HASH JOIN  62
  4.1 Introduction  62
  4.2 Proposed Load Balancing Scheme  63
    4.2.1 Scheduling by Sequential Probing  64
    4.2.2 Scheduling by Random Probing  66
  4.3 Analytical Model for Load Balancing based on Random Probing  69
  4.4 Performance Evaluation  71
    4.4.1 Results of Varying Data Skew  73
    4.4.2 Analytical vs. Experimental Models  77
5 THE NUMA WITH CLUSTERS OF PROCESSORS ARCHITECTURE  79
  5.1 Introduction  79
    5.1.1 The NUCOP Architecture  81
    5.1.2 The Proposed Join Algorithm  82
  5.2 Analytical Model  84
  5.3 Model Validation by Simulation  88
    5.3.1 Simulator Input  90
    5.3.2 Simulator Output  91
  5.4 Model Validation on the KSR1  91
    5.4.1 KSR1 Architecture  92
    5.4.2 Comparison of Results  93
  5.5 Projections for Variant Architectures  96
    5.5.1 Effect of I/O bandwidth  97
    5.5.2 Effect of network bandwidth  98
    5.5.3 Effect of CPU speed  99
    5.5.4 Effect of large cluster sizes  100
    5.5.5 Effect of skew rate  101
    5.5.6 Comparison to other work  102
  5.6 Concluding Remarks  104
6 CONCLUSION AND FUTURE RESEARCH  105
A SYSTEM PARAMETERS FOR THE NUCOP MODEL  108
B THE NUCOP SIMULATOR  110
  B.1 Simulator Validation  112
BIBLIOGRAPHY  114

LIST OF TABLES

2.1 Summary of previously proposed join algorithms with load balancing.  29
3.1 Notations and parameters values. All values that pertain to the specific architecture were measured on the BBN TC2000 multiprocessor system.  41
3.2 Number of partitions of x into y parts.  45
3.3 Example of various combinations for p = 3.  46
A.1 Various system parameters with their default values.  109
B.1 The major system parameters used to control the behavior of the NUCOP simulator.  113
LIST OF FIGURES

1.1 A simple example of the relational join operator.
2.1 Architecture of GRACE.
2.2 A typical Shared-Everything multiprocessor system. Each node has access to all other nodes in the system. The access includes memory and disks.
2.3 The Shared-Nothing architecture. Here, nodes do not share memory nor disk. All interactions between the nodes are done via message passing.
3.1 A typical multiprocessor with UMA architecture. Nodes in the system access the physically shared memory through the network, typically a bus.
3.2 A typical multiprocessor with NUMA architecture. Each node in the system has a processor and local main memory. The interconnection network provides each node with shared access to the physically distributed memory.
3.3 A multiprocessor with COMA architecture. Each cluster-node can have a number of processors, each with a local cache (subcache) and a large second-level cache (main memory).
3.4 Distributed Hash Join algorithm for main memory databases on NUMA architecture.
3.5 Full Replication Join for main memory database systems on NUMA architecture.
3.6 Pseudo-code of the computation of M_R using the Normal variables to approximate the Multinomial distribution.
3.7 Comparison of the analytical and experimental results for DHJ. The experimental runs were conducted on the BBN TC2000 with up to 25 nodes, the maximum number of public nodes available on the system we had access to.
3.8 Comparison between the Distributed Hash Join and the Full Replication Join algorithms.
3.9 Comparing analytical and experimental results for the Full Replication Join algorithm.
3.10 Execution times for individual steps of the Full Replication Join algorithm.
3.11 Zipf-like distributions with various θ values.
3.12 Typical distribution of tuples to buckets under skew.
3.13 Performance of Distributed Hash Join with skewed input relations.
4.1 Distributed load balancing scheme.
4.2 Percentage of overloaded nodes vs. interval size.
4.3 Effect of random probing on network traffic.
4.4 Total time for the join algorithm with and without load balancing.
4.5 Total time with skew in the probing relation S.
4.6 Total time with skew in the building relation R.
4.7 Total time with skew in both relations.
4.8 Comparison of analytical and experimental results.
4.9 Effect of block size on load balancing.
5.1 Architecture of NUCOP.
5.2 Diagram of the NUCOP simulator.
5.3 An example of an input file for the NUCOP simulator.
5.4 Validation of analytical and simulation models with experimental results. Data points in the experimental results (obtained on the KSR1) represent the total execution time averaged over 3 runs.
5.5 Effect of I/O bandwidth.
5.6 Effect of network bandwidth with various number of clusters.
5.7 Effect of CPU speed.
5.8 Effect of the cluster size on system performance.
5.9 Total cost with constant N but varying n and m.
5.10 Total join cost for various degrees of data skew.

CHAPTER 1

INTRODUCTION

1.1 The need for parallelism in database systems

As databases expand beyond the capabilities of most mainframe systems, the interest in parallel architectures and algorithms for supporting and processing relational queries has risen sharply. The ability of these architectures to deliver much higher performance than uniprocessor-based mainframes has been validated in both research and commercial parallel database systems [1, 2, 3, 4]. Most recently, Oracle, a company providing commercial relational database systems, demonstrated up to 2000 Transactions Per Second (TPS) on a new HP multiprocessor system [4]. According to Oracle, this is far beyond the transaction processing capability of any mainframe system.

There is evidence of high demand for the processing capabilities of parallel systems to deal with the increasingly large amounts of information available currently and in the near future. In preparation for the coming of the Information Highway, Oracle is also experimenting with a 16-processor nCUBE machine to provide the Video on Demand service [5]. The nCUBE machine has 15 disks that can hold up to 40 digitized movies and can provide close to 42 simultaneous streams of video. A number of other companies such as Time Warner Inc. are looking to provide similar services and are thus investigating the use of parallel machines from other vendors such as Silicon Graphics and Digital Equipment Corporation [5].

Recent statistics about current large databases indicate sizes ranging from hundreds of Megabytes to a few Terabytes. The Customer Information and Package Tracking database of United Parcel Service is estimated currently at 700 Gigabytes and is expected to reach 1.2 Terabytes by mid 1995. The TRW, Equifax and Trans Union credit agencies have compiled over 500 million records on more than 160 million people throughout the United States [6]. With an average record size of only 5K per person, this amounts to more than 800 Gigabytes of data.

In a large number of applications, queries may require an examination of all tuples in a database relation. Given the sizes of the input relations in millions or even billions of tuples, this processing can take excessively long times. In relational database systems, queries consist of uniform operations applied to uniform streams of data. By partitioning the input data among multiple processors, a relational operator can be split into a number of independent operators, each working on a part of the data in parallel. With the sheer volumes of data mentioned above, relational queries seem ideally suited for parallel execution.

The relational join is considered to be one of the most time consuming operations in relational database systems because it requires the scanning and processing of all the tuples of the corresponding relations. Most of the work in parallel database research has focused on the study of this operation. Consequently, the join has been used, by most researchers, as the standard performance measure for database systems. Most of the early work on parallel join was done in the context of database machines, where the design of the parallel processing system relied heavily on specialized hardware components.
Later and more recent work has shifted to general-purpose multiprocessor systems. Next, we start with a brief introduction of the join operator and present the time complexity of various algorithms implementing the join. We then give a general outline of parallel join algorithms.

Suppliers
S#  SNAME  CITY
S1  Smith  London
S2  Jones  Paris
S3  Blake  Chicago
S4  Clark  London
S5  Adams  Detroit

Parts
P#  PNAME  COLOR  CITY
P1  Nut    Red    London
P2  Bolt   Green  Paris
P3  Screw  Blue   Rome
P4  Screw  Red    London
P5  Cam    Blue   Paris
P6  Cog    Red    Chicago

The join of Suppliers and Parts where Suppliers.CITY = Parts.CITY
S#  SNAME  CITY     P#  PNAME  COLOR
S1  Smith  London   P1  Nut    Red
S1  Smith  London   P4  Screw  Red
S2  Jones  Paris    P2  Bolt   Green
S2  Jones  Paris    P5  Cam    Blue
S4  Clark  London   P1  Nut    Red
S4  Clark  London   P4  Screw  Red
S3  Blake  Chicago  P6  Cog    Red

Figure 1.1. A simple example of the relational join operator.

1.2 The relational join operation

The relational join operator, applied to the two relations R and S, is closely related to the Cartesian product of these relations. The Cartesian product concatenates each tuple from R with every tuple from S, while the join will combine only those pairs with a specified relationship. Formally, the relational join operator combines two relations, R and S, to produce a third relation, J, containing all tuple pairs from R and S with matching attribute values. Figure 1.1 provides an example of a join on two relations, Suppliers and Parts.

1.2.1 Nested-Loops Join Algorithm

The simplest algorithm to compute the join is called the nested-loops join. With one of the relations designated as the inner relation and the other as the outer relation, the algorithm can be described as follows. For each tuple of the outer relation, all tuples of the inner relation are read and compared. Whenever the join condition is satisfied between the two tuples, they are concatenated to form a new tuple in the output relation J. Therefore, the implementation of this algorithm requires O(|R| × |S|) execution time, where |R| and |S| represent the total number of tuples in the relations R and S, respectively. Note that this is the same order of complexity as the Cartesian product.

1.2.2 Sort-Merge Join Algorithm

Another way to compute the join is to sort both relations, R and S, based on the join attribute values and then compare the sorted relations on matching join attribute values to form the output relation. This algorithm is known as sort-merge join. Since the merge phase takes linear time, the total execution time for this algorithm is dominated by the time of the sort phase. Hence the execution time of sort-merge has an order of O(|R| log(|R|) + |S| log(|S|)).

1.2.3 Hash-based Join Algorithm

Hash-join is an alternative method that offers a linear execution cost instead of the O(|R| × |S|) cost of nested-loops join or the O(|R| log(|R|) + |S| log(|S|)) cost of sort-merge join. In this method, for each tuple of relation R (S), a hash value will be computed by applying a hash function H1 to the value of the join attributes of that tuple. The hash value determines the bucket where the tuple should be stored. This is referred to as hash-partitioning the relation R, where the tuples in different buckets are totally disjoint with respect to join relationships. The relation S is hashed into a different set of buckets using the same hash function H1. After both relations have been partitioned, corresponding buckets of R and S may be joined independently.
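To make the bucket formation concrete, here is a minimal sketch of hash-partitioning a relation with a single hash function H1; the dictionary-based tuples, the simple modulus hash and the sample rows (borrowed from Figure 1.1) are illustrative assumptions rather than part of any particular system.

    from collections import defaultdict

    def hash_partition(relation, join_attr, num_buckets):
        # H1: map each tuple to a bucket based on its join-attribute value,
        # so tuples with equal join values always end up in the same bucket.
        buckets = defaultdict(list)
        for tup in relation:
            buckets[hash(tup[join_attr]) % num_buckets].append(tup)
        return buckets

    # A few rows from Figure 1.1, used here only as sample input.
    suppliers = [{"S#": "S1", "CITY": "London"}, {"S#": "S2", "CITY": "Paris"}]
    parts = [{"P#": "P1", "CITY": "London"}, {"P#": "P2", "CITY": "Paris"}]

    # Both relations are partitioned with the same H1, so matching tuples
    # can only meet in buckets that carry the same index.
    r_buckets = hash_partition(suppliers, "CITY", 4)
    s_buckets = hash_partition(parts, "CITY", 4)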
This method breaks a large join into a number of smaller joins since only corresponding buckets need to be checked. In order to join a pair of buckets, the R bucket is read and its tuples are organized into an in-memory hash table, using a second hash function, H2. Next, the corresponding bucket of S is scanned, each tuple is compared against the hash table of R, and matching tuples form the output (join) relation. In this case, R is called the building relation, referring to building the in-memory hash table, and S is termed the probing relation.

As presented earlier, the sizes of individual relations in current and future databases may be in Megabytes and even Gigabytes. Even the most efficient uniprocessor implementations of the best join algorithms can not keep up with these extremely large amounts of data. By partitioning the data among multiple processors, the join can often be split into many independent joins, each operating on a single partition in parallel. This can greatly improve the execution time of the algorithm. Next, we present the general scheme for parallel join algorithms.

1.2.4 General Scheme for Parallel Join

Most of the work on parallel join algorithms has concentrated on hash-based joins [7, 8, 9, 10]. This section presents a general scheme that illustrates the basic workings of parallel joins. Chapter 2 will provide detailed description and analysis of a number of parallel join schemes proposed in the literature.

In parallel join algorithms, each relation is assumed to be initially partitioned among the N processors available in the system, i.e., each processor has a portion of both relations. This partitioning, which is based on a set of attributes, may be done using a number of schemes, such as hashing or range partitioning [11, 12]. This is also known as declustering the given relation.

Parallel join aims to break up a large join of R and S into smaller joins that can be performed by different processors in parallel. To achieve this, the partitions of R and S, local to each processor, have to be totally disjoint from all other partitions, based on the join attributes. This is true if the join attributes of a relation are the same as the attributes used in the initial partitioning of the relation. This may not always be the case, however. Hence, in general, the join algorithm requires the redistribution of R and/or S to guarantee that all partitions are disjoint. To distribute the tuples of processor P_i, a destination processor P_j (1 ≤ j ≤ N) is computed for each tuple based on its join attributes. The computation of P_j may be done using either hashing or range partitioning. If P_j = P_i, then the tuple belongs to the local processor and consequently, it stays locally. Otherwise, the tuple is sent to the destination processor P_j.

Once redistribution is complete, each processor will have a set of tuples consisting of the tuples that were kept locally and those received from all other processors. More specifically, node N_i will contain all the tuples of R and S whose destination node evaluated to N_i. Next, the join can be computed directly on the sets of tuples in each processor. The local join at each processor may be done using any of the schemes presented earlier in Section 1.2.

1.3 Data skew types and effects

In multiprocessor systems, relations are horizontally partitioned and distributed across all nodes. In performing the join, these partitioned data are processed in parallel by all the nodes in the system.
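As a point of reference for the discussion of skew, a minimal sketch of the per-node redistribution step from Section 1.2.4 is given below; the hash-based routing and the send() placeholder are assumptions made for illustration, not the notation of any specific system.

    def redistribute(local_tuples, join_attr, my_id, num_nodes, send):
        # For every local tuple, compute its destination processor P_j by
        # hashing the join attribute; keep the tuple if P_j is this node,
        # otherwise ship it.  send() stands in for whatever message-passing
        # or shared-memory copy the underlying machine provides.
        kept = []
        for tup in local_tuples:
            dest = hash(tup[join_attr]) % num_nodes
            if dest == my_id:
                kept.append(tup)
            else:
                send(dest, tup)
        return kept

How evenly the destination computation spreads the join-attribute values over the N processors is precisely what determines each node's share of the work.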
If all nodes handle equal amounts of data, then maximum performance improvement can be achieved. When the distribution of loads on the various processors is unbalanced, it is likely that most processors in the system will stay idle waiting for a few overloaded processors to finish. This load imbalance results in poor overall performance.

It has been shown that the major source of load imbalance in parallel processing of database applications is data skew [13, 14, 15, 16]. Data skew is defined as the imbalance in the distribution of tuples to processors. This skewness in data distribution may be attributed to the following reasons:

- Horizontal partitioning, or how the data is initially partitioned among the processors, is usually based on the values of the join attributes in the given relation. The values of these join attributes may not distribute uniformly over the tuples of the relation.

- For most queries involving a join, a general query optimization is to perform all select operations before the join. The selection results are likely to vary at different processors. Thus, even if the processors start with evenly distributed tuples, the results produced by the selection operations are likely to be skewed.

- In hash-join algorithms, tuples from each relation are first hashed and distributed among all processors and then the join is performed. Lakshmi and Yu [15, 17] stated that in data sets belonging to real-life databases, the join values are inherently skewed. For example, the field part.origin for the part 'VCR' is likely to contain the value 'Japan' more than any other value.

In the presence of such skew, an unbiased partitioning scheme, such as hashing, will result in unequal loads on the various processors in the system. Even within the same processor, the sizes of the various buckets can be quite different. Depending on the rate of skew, a few buckets may overflow while others contain only a small number of tuples.

Since relation partitioning has been the primary method of distributing the load for the join operation, the effects of skew on the performance of the join may be severe. The nonuniformity in the sizes of the partitions, where some partitions may be significantly larger than others, means that the site with the larger load dictates the performance of the join as it becomes the main performance bottleneck.

1.4 Dissertation Outline

The rest of the dissertation is organized as follows:

Chapter 2 provides essential background on the evolution of parallel database systems and join algorithms. It outlines some of the early work on database machines and traces the transition of the research to parallel database systems for general-purpose multiprocessors. The chapter also presents a review of the recent work on load balancing schemes to deal with data skew in parallel join algorithms. A summary of the major advantages and disadvantages of each scheme is presented toward the end of the chapter. We elaborate further on this in Chapter 4, where we introduce our load balancing scheme for parallel hash join.

Chapter 3 introduces a new main-memory join algorithm that is suited for NUMA architectures. In some cases, where the initial selection step may greatly reduce the size of one of the input relations, the distribution of the larger relation may no longer be justified. Hence, we investigate the performance of the join algorithm where the smaller relation is fully replicated to all processors in the system.
We analytically model both join algorithms and compare the analytical results to actual experiments.

A dynamic load balancing scheme for main-memory multiprocessor systems that effectively deals with data skew is presented in Chapter 4. The load balancing scheme is applied to the Distributed Hash Join introduced in Chapter 3 and an analytical model is developed. The quantitative performance of the balancing scheme is investigated both analytically and experimentally. Experiments were conducted on both the BBN GP1000 and the BBN TC2000 multiprocessor systems.

It is suggested that parallel database systems on architectures with hierarchical network organization exhibit better performance for executing joins on very large relations [18]. Chapter 5 investigates this idea for dealing with data skew in parallel join. The analysis is based on various system parameters such as CPU speed and network and I/O bandwidth. We extend our load balancing join algorithm for the new architecture and develop an analytical model to investigate the performance of the system. We develop a simulator for the system using the CSIM process-oriented simulation language. We also perform actual experimentation on the KSR1 and validate both the analytical model and the simulator. The architecture of the KSR1 is modeled by adjusting the values of certain system parameters. We project the performance of the system, using both the analytical model and the simulator, by varying a number of key system parameters.

The dissertation is concluded in Chapter 6. This chapter also reiterates the contributions of the research and outlines possible future research directions.

CHAPTER 2

PREVIOUS WORK ON PARALLEL JOIN ALGORITHMS

Parallel processing of the relational join has been a very active area of research and a large number of algorithms have been proposed [7, 19, 9, 20, 21]. These algorithms range widely from those highly dependent on specialized hardware [10] to algorithms designed for general-purpose multiprocessor systems. Next, we briefly describe some of the early works on parallel database systems. We then present some of the recent works that are more closely related to this thesis.

2.1 Early work

During the late 1970s through the early 1980s, database machines were the main focus of research on parallel database systems. Thus, most of the early work on parallel join algorithms was done mainly in the context of these machines. Database machines consist of the hardware and software designed to operate on very large database systems efficiently. These machines are most notable for their use of specialized hardware components to efficiently implement, totally or partially, a number of database operations.

Parallelism was generally achieved by one of two approaches: reduction and pipelining [22]. Many complex computational tasks may be decomposed into a number of parallel subtasks with approximately equal computational complexity. Thus, a complex task can be reduced to a number of subtasks that are assigned to different processors and executed in parallel. This approach, namely reduction, exploits the horizontal parallelism inherent in most database operations. For example, a complex database query may be decomposed into a tree structure of primitive database operations such as Selection, Projection and Join. The primitive operations at the same (horizontal) level of the tree can be executed in parallel.
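As an illustration of this reduction approach, the sketch below evaluates the two same-level leaves of a small query tree concurrently before feeding their results to a join; the thread pool and the toy operators are assumptions made purely for illustration and do not correspond to any particular database machine discussed in this chapter.

    from concurrent.futures import ThreadPoolExecutor

    def select(relation, pred):
        # Primitive selection operator.
        return [t for t in relation if pred(t)]

    def join(r, s, attr):
        # Primitive nested-loops join on a common attribute.
        return [{**a, **b} for a in r for b in s if a[attr] == b[attr]]

    def run_query(suppliers, parts):
        # The two selections sit at the same level of the query tree, so
        # they can be evaluated in parallel; the join then consumes both.
        with ThreadPoolExecutor() as pool:
            sel_s = pool.submit(select, suppliers, lambda t: t["CITY"] == "London")
            sel_p = pool.submit(select, parts, lambda t: t["COLOR"] == "Red")
            return join(sel_s.result(), sel_p.result(), "CITY")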
In addition, the data for a single primitive operation may be horizontally partitioned across a number of processors, where each processor operates on its partition in parallel.

The second approach, namely pipelining, uses a linear array of specialized processors to perform a number of different functions required to complete a complex database query. A portion of the data required by the task is given to the first processor, which transforms it in some way and passes the result to the next processor in the pipeline. While the first processor is working on the next portion of the data, the processor next to it may work on the data portion it received from the first processor.

While reduction may be completely implemented in software [23, 24, 25, 26], pipelining, as conceived by the early research, relied heavily on specialized hardware, such as hardware filters and sorters [23, 27, 28, 29]. More recent work on parallel databases implements the pipelining of operations between the different levels of the query tree and does not rely on specialized hardware support for the various operations. Instead, intermediate results that are produced by a set of processors may be routed to another set, or even the same set of processors, in order to perform the next level of primitive operations within the query tree [1, 3].

We now describe some of the specialized hardware components used in the design of a number of database machines and note the systems that used them. Disk cache was used to accelerate the effective bandwidth of disk storage devices as in DIRECT [30, 31, 32], CADAM [33], IBM [34, 35], DDM [36, 37], GRACE [27, 28, 38, 39], SABRE [40, 41, 42, 43] and DELTA [44, 45, 46]. Today, disk cache is provided with
This design exhibits a number of desirable features: 13 1. Broadcasting of relation pages, since all QPs in the system can read the target page from the same CCD by simply switching to that CCD (using the crossbar switch). 2. Parallel accesses to relation pages, as individual QPs can simultaneously access different CCDS. 3. Fine—grain accesses to tuples at both page and tuple boundaries, since QPs can switch between CCDs at any given interval of time. The weak point in the architecture of DIRECT is that relation pages must be moved very frequently between the disk, the CCDS and the QPs. The overhead of such constant data movement degrades the performance sharply and limits the scalability of the design [61]. Memory contention is also possible when moving data between the disk and the CCDs and between the CCDs and the QPs. Another problem with this design is its reliance on the specialized crossbar switch. The crossbar switch becomes prohibitively expensive when the total number of nodes in the system increases. This reliance severely limits the scalability of the system. 2.1.2 MICRONET MICRONET is a multiprocessor system designed and prototyped at the University of Florida [53, 54, 55, 56]. The system consists of a number of microcomputers intercon- nected via a custom—built bus called the MICRONET. One of the microcomputers is designated as the control computer while all others are considered to be data com- puters. The MICRONET bus provides simple hardware facilities such as: 1. Interprocess communication, synchronization and control, 2. Resolving bus access contention, 3. Broadcasting data and commands. 14 The MICRON ET system achieves a number of desirable features such as reliability and scalability. However, its performance and capability to implement most of the database operations, such as sorting and data filtering, rely heavily on the custom- built hardware. Staging Ring P: Processor C: Control Module M: Memory Module D: Disk Module Figure 2.1. Architecture of GRACE. 2.1.3 GRACE GRACE [28, 62, 63, 39] is one of the few database systems actually implemented and was part of the Japanese Fifth Generation Computer Initiative. The system was designed and implemented at the University of Tokyo, Japan. This database machine was one of the first multiprocessor database systems to apply hashing techniques to the join operation. The system is organized into two rings: the staging ring and the processing ring. The disks filter and hash the data into the appropriate memory modules on the staging ring. Next, the data is partitioned into buckets by using dynamic hash clustering and the buckets are distributed to the memory modules. Each processor is assigned a bucket and the extra buckets (in case the number of buckets exceeds the number of processors) are processed serially. When tuples arrive into a processing node, they are sorted in a pipeline merge sort unit. The join executes in 0 (1151:1512) time, where [RI and [SI are the sizes of the first and second relations m and m is the number of disks involved, respectively. 2.1.4 Teradata’s DEC/1012 The DBC/ 1012 [24] database machine was developed by Teradata Corporation and consists of a set of Interface Processors (IFP), Access Module Processors (AMP) and Disk Storage Units (DSU). The lFPs send and receive results from the AMPS which perform the storage and retrieval of data on the DSUs. 
All relations are partially declustered across multiple AMPS, that is each relation is horizontally partitioned across a number of the AMPS available in the system. The IFPs and AMPS are connected via a redundant, tree—structured network, the Ynet which is an active bus providing logic circuitry to perform selection and sorting on the data as it travels. Requests are broadcast to the corresponding AMPS and results are are sorted when they return up the Ynet and then broadcast to the requesting IFP. The BBC/1012 provides alternative ways to execute the join in order to facilitate performing the join for different cases. The most common alternative, however, in- volves hashing and distributing the source relations to the involved AMPS. Tuples are then sorted in each AMP and the join is performed using the sort—merge join algorithm. Teradata has installed many systems containing over 100 processors and several hundred disks. These systems have demonstrated near—linear speedup and scaleup on relational queries and have been able to efficiently maintain and process very large (Terabyte) databases. [ Global Shared Memory [ Figure 2.2. A typical Shared—Everything Multiprocessor system. Each node has access to all other nodes in the system. The access includes memory and disks. The major disadvantage of database machines is their reliance on specialized hard- ware components. This reliance may render a database machine useless in a short period of time as new general purpose systems are able to provide the same function- ality and performance for a much lesser price. Specialized hardware can benefit large database systems in terms of performance gains for special operations, as evidenced by DBC/1012 and GRACE. The BBC/1012 is still available commercially and its Y— net has been recently upgraded [5]. GRACE is still under investigation as part of the Super Database Computer project at the University of Tokyo [39, 10, 64]. However, in most recent works, this hardware reliance has been abandoned and most of the functionalities were implemented in software. Database machines have been replaced by parallel database systems running mostly on top of general—purpose multiproces- sor systems. In the next section, we present a representative set of these parallel database systems and discuss critical design issues relevant to the work of this thesis. 2.2 Recent work on parallel database systems The two paradigms that have dominated parallel database research in the recent past are the Shared—Nothing (SN) and the Shared—Everything (SE) architectures. In SE systems (Figure 2.2), processors share all memory and disks. Data is mainly stored l7 Interconnection Network . . ‘‘‘‘‘‘‘‘ .- ....... Figure 2.3. The Shared—Nothing architecture. Here, nodes do not share memory nor disk. All interactions between the nodes are done via message passing. on disk and transferred to the common memory in fixed size pages. This paradigm is the basis for the XPRS [65] database system. In the SN architecture, each processor has its own main memory and disks and all interactions between processors are done via message passing (see Figure 2.3). An example of this architecture is the GAMMA database system [66]. These two paradigms are defined more in terms of inter— processors communications and sharing of resources than in terms of the underlying hardware. 
For example, the BUBBA database prototype [3] is considered a shared— nothing system even though it was implemented on top of a FLEX / 32 multiprocessor system, a shared—memory machine. Next, we present a number of parallel database systems developed on top of general—purpose multiprocessors and classify the into their corresponding paradigm. 18 2.2.1 XPRS The XPRS system, developed at the University of Berkeley is a high performance multiprocessor database system. The goal is to demonstrate that high performance for transaction processing and complex ad—hoc queries can be provided by a next gen- eration database system running on top of a general—purpose multiprocessing system. XPRS is based on a shared-everything architecture. The system features a query op- timizer that can choose a good access plan based on the available buffer space and available processors. In addition, the access plan generated by the optimizer must be amenable to parallelism. 2.2.2 Volcano The Volcano system [67] is being developed at the University of Colorado. Volcano is a dataflow query processing system that is extensible and incorporates parallelism. The system provides an operator model of parallelism, more specifically, it has an ea:- change operator that parallelizes all other operators. The encapsulation of parallelism in Volcano allows for new query processing algorithms to be coded for single—process execution but run in a parallel environment without modifications. Volcano allows processors to operate on different data. subsets in parallel and allows pipelining be- tween different operations. Some experimental results were. reported concerning the overhead of the exchange operator when executed on a shared memory multiprocessor system [67]. The results show that the operator is very fast. The granularity of data exchange between various processors show that small packet sizes produced a severe performance penalty. 19 2.2.3 BUBBA The BUBBA prototype was implemented using a 40—node F LEX/ 32 multiprocess- ing system with 40 disks [68, 3]. Although the Flex/32 is a shared—memory ma- chine, BUBBA was designed as a shared—nothing system. The shared—memory was only used for message passing. The system contained three groups of processors: Interface Processors (IP) for communicating with external host processors and co- ordinating query execution; Intelligent Repositories (IR) for data storage and query execution; and Checkpoint/ Logging Repositories (CLR). Relations were partitioned among the various IRS using either hashing or range partitioning. Redistribution was applied based on collected statistics about the usage of each relation. A unique feature of BUBBA was its use FAD instead of SQL as the interface language. FAD is an extended-relational persistent programming language. It provides support for complex objects via several type constructors. The FAD compiler is responsible for detecting operations that can be executed in parallel. This is based on the way the data objects being accessed are partitioned. It was noted that the task of compiling and parallelizing a FAD program is significantly more difficult than parallelizing a relational query. 2.2.4 GAMMA GAMMA [1, 2, 19] is a fully operational prototype of a shared—nothing database system developed at the University of Wisconsin, Madison. GAMMA is based on the experience from the earlier multiprocessor database machine prototype, DIRECT. It originally consisted of 17 VAX 11/750 processors, each with two Megabytes of main memory. 
The processors were connected together, with another VAX serving as the host machine, by a token ring. Only eight of the processors had disk drives attached to them. More recently, GAMMA has been ported to a 32 node iPSC/2 20 Intel hypercube [19]. Each node is configured with an Intel 386 CPU, 8 Megabytes of memory, and a 330 megabyte MAXTOR 4380 disk drive. The nodes are connected to form a hypercube using custom VLSI routine modules. Each module supports eight full—duplex, serial, reliable communication channels operating at 2.8 Megabytes/ sec. All relations in GAMMA are fully declustered, i.e. horizontally partitioned across all disks in the system. A number of declustering strategies are provided by the system to maximize the performance of different queries in the system. The Hybrid—Range partitioning scheme was introduced by Ghandeharizadeh and DeWitt [12] and was shown to provide support for small relations and for relations with skewed distribu- tions. The same authors [1 1] also investigated the performance of various declustering strategies based on round—robin, hash and range partitioning. Given T tuples and N nodes, round—robin partitions a relation according to the following: place tuple number i in node mod(i, N), where mod(i, N) represents the modulus of i and N. Hash partitioning, decides the location of a tuple based on the value of a hash func- tion applied to a set of attributes of that tuple. And finally, in range partitioning, the total range of tuples is divided into N subranges, and each subrange is assigned to the corresponding processor. The study concluded that each of these strategies outperform the others for certain types of queries. This work is based on the work by Kim and Pramanik in [69] where optimal declustering was studied for various sets of queries. The major difference between GRACE, BBC/1012 and other SN systems is the re- liance of these two systems on special—purpose hardware, e.g. the Ynet in DBC/1012 and the Omega network in GRACE. As mentioned earlier, the majority of SN sys- tems are implemented on top of general—purpose multiprocessors systems. In their article on future database systems, DeWitt and Gray [70] note that while GRACE and DBC/1012 have demonstrated good performance, only time will tell whether special— purpose components can offer better price/performance than SN systems that use 21 conventional hardware Most of the previously mentioned algorithms were designed with the assumption of uniform distribution of tuples. In general, this assumption is not valid and data skew is common in real-life database systems. The effect of data skew on four join algorithms was explored in [19] on a 8 processor version of the GAMMA database system. The study concluded that the parallel hash—based join algorithms, namely Hybrid, Grace and Simple hash, are sensitive to data skew especially when the skew rate is high. This study, however, did not consider any additions or modifications to the join algorithms that would allow them to deal with the problem of skew. The reason may be related to the fact that load balancing, on SN architectures, causes a large overhead as the number of control message increases with increasing number of nodes [18]. The next section surveys some of the recent work on dealing with data skew in parallel join algorithms. 2.3 Parallel join and load balancing Ideally, nodes participating in the join operation would handle approximately equal amounts of data. 
This, however, is not generally the case and most real—life data exhibit some degree of skewness, as noted in Section 1.3. More recent work on parallel joins have. developed various schemes to deal with the effects of data skew. Next, we briefly present and discuss a number of these schemes. Table 2.3 presents a quick overview of the main highlights of each scheme. The third column of this table summarizes what may be considered as the major disadvantages and potential problems of each scheme. 22 2.3.1 Bucket—converging and bucket-spreading join Kitsuregawa and Ogawa [10] describe two algorithms, bucket—converging parallel hash—join and bucket—spreading parallel hash join. The first algorithm is a paral- lelization of the GRACE join algorithm and works as follows. All subrelations of R are read from disk in parallel and are partitioned into p buckets (where p is much larger than N, the number of nodes in the system). The size of each bucket is ex- amined and, if necessary, enough buckets are redistributed so that the sums of the buckets’ sizes at each processor are balanced. Next, S is processed similarly and in the last phase, all respective buckets of R and S in each node are joined locally. The first phase of this algorithm is very susceptible to data skew that may occur during the distribution of tuples. The bucket—spreading parallel hash join algorithm is designed to deal with this problem. In this algorithm, both relations are partitioned into p buckets, but instead of statically assigning buckets to nodes, each bucket is horizontally partitioned across all nodes during the first phase. During the second phase of the algorithm, the buckets are redistributed onto the nodes using a very sophisticated network, the Omega network. This network contains logic to balance the load during the bucket redistribution. Simulation results show the algorithm to be effective in limiting the effect of distribution skew. Note that both algorithms require redistributing the relations twice in order to produce a balanced load across all nodes. Also, the bucket—spreading algorithm relies heavily on the built—in balancing capabilities of the Omega network hardware. 2.3.2 Adaptive load balancing join Hua and Lee [71] proposed three parallel join algorithms for dealing with skewed data. The first algorithm, tuple interleaving parallel hash join is mainly based on the bucket— spreading hash—join [10]. The major difference between the two algorithms is that 23 the tuple—interleaving algorithm does not require the Omega network that is needed by the bucket—spreading hash-join algorithm. The second algorithm, Adaptive Load Balancing parallel hash-join, is basically identical to the bucket~converging algorithm of [10]. The third algorithm, Extended Adaptive Load Balancing parallel hash-join, is designed mainly for the severe cases of date skew. After the relations R and S are partitioned and distributed, each nodes reports the sizes of its local R and S buckets to a central coordinator. Based on the reported information, the coordinator will decide on the allocation of buckets to nodes. The decision is broadcast to all the nodes in the system and the buckets are redistributed accordingly. All three join algorithms are compared using analytical models. The basic results indicate that the tuple—interleaved and extended adaptive load balancing algorithms are unaffected by skew in the size of the partitions. 
Such skew does however affect the performance of the adaptive load balancing algorithm which eventually becomes much worse as the skew increases. 2.3.3 Scheduling—hash join Wolf et al. [9], propose the scheduling—hash join algorithm for dealing with the cases of severe data skew. The relations R and S are hashed into local buckets and statistics are gathered while building each bucket. Next, a scheduling phase is run where a central coordinator collects all the statistics and computes an allocation strategy of buckets to nodes. This allocation strategy is broadcast to all nodes and the relations are accordingly redistributed over the network. A number of heuristics are proposed for the computation of the allocation strategy including, longest processing time first, first fit decreasing, and skew. The different strategies were compared analytically and were shown to be highly effective in bal- ancing the load especially as the number of odes becomes larger. Like Hua and Lee’s 24 schemes, these strategies require extra scanning steps to be added to the join algo- rithm for collecting statistics about the various buckets of R and S. These extra steps increase the actual computation time for the algorithms and exhibit a severe effect especially with a moderate to low skew rate. Walton et al. [72] presented a taxonomy of data skew for parallel database sys- tems and a modified version of the the Scheduling Hash-join, to deal with the data skew. The algorithm used gathered statistics to schedule parts of the join on different processors in order to balance the load. An analytical model was used to compare the performance of Scheduling Hash—join with that of Hybrid Hash—join. It was found that Scheduling Hash handles skewed data effectively while Hybrid Hash degrades and becomes eventually worse than Scheduling Hash. However, unless the data skew rate is significant, Hybrid Hash stays significantly better than Scheduling Hash. This is because of the overhead needed by the latter to gather the needed statistics used in making the load balancing decisions. 2.3.4 Load balancing hash join Omiecinski [20] proposed a load balancing hash—join algorithm for shared—memory multiprocessor systems. The algorithm is based mainly on the bucket—spreading al- gorithm of Kitsuregawa and Ogawa [10]. The major difference is that, like Hua and Lee’s algorithms, it does not require any special hardware support for the redistribu- tion of tuples. Analytical and limited experimental results on a 10—processor Sequent Balance multiprocessor were presented. It was shown that the algorithm is effective in limiting the effect of data skew. The algorithm does, however, suffer from the extra statistics collection steps needed for the redistribution in the cases of moderate and low skew rates. Most of the above mentioned schemes attempt to balance the load by scanning the relations to derive statistical information. Both relations may have to be com- pletely scanned before the join begins. While this may succeed in minimizing skew, performance may suffer greatly when the data is not skewed. The time to perform the parallel hash join is usually a small multiple of the time required to scan the two relations. Thus, the preprocessing of relations results in a substantial overhead that may only be justified for the cases of extreme data skew. 
There is little evidence, however, that such extreme levels of skew occur commonly in practice [21] and conse- quently, it may not be justified to penalize the normal case in order to benefit a few extreme cases. 2.3.5 Load balancing in Shared Virtual Memory Shatdal and N aughton investigated the use of Shared Virtual Memory (SVM) [73] on top of shared—nothing to deal mainly with the problem of data skew. Shared virtual memory is commonly used on top of physically distributed memory to provide a single virtual address space shared by all processors in the system. This is accomplished by using memory mapping managers responsible for providing the processors with access to all available memory. These memory managers are also responsible for keeping the system—wide memory coherent. The protocols for such access and coherency control are beyond the scope of this thesis and can be found in more details in [74]. A new join algorithm, SVM—join, is introduced where the shared virtual mem- ory is used to efficiently balance the load across all processors in the system. The algorithm was carefully designed to avoid any network or disk thrashing caused by multiple nodes updating shared memory pages or by too many pages being sent to the same node, respectively. In contrast to previous algorithms, such as those described earlier in this section, the SVM-join does not incur any overhead in the case of small or no skew. Its performance is virtually identical to that of hybrid hash under these conditions. The performed simulations show the join algorithm to be effective in dealing with data skew and to improve on SN schemes. Their model, however, 26 assumes the network to have infinite bandwidth. Such assumption oversimplifies the model since the bandwidth of the interconnection network is considered to be one of the primary limiting factors of scalability in parallel systems. 2.3.6 A hybrid system More recently, Hua and Lee [18] proposed a hybrid system where Shared—Everything multiprocessors are connected by a network to form a Shared—Nothing system. The authors cited an example system consisting of multiple IBM 3090 / 600 multiprocessors linked together with an interconnection network. These multiprocessors can have up to six nodes and provide shared memory. The performance of this hybrid system was studied using the join operator as the workload model. The model assumes that only the relation S needs to be redistributed during the first phase of the join and that this distribution will result in data skew on the receiving nodes. The skew was modeled as a fraction 0 of the total relation size being assigned to one node while the rest of the relation is distributed evenly among all other nodes. The join algorithm used in the model is a modified version of the work by DeWitt and Gerber [7]. Assuming N to be the total number of nodes in the system, the algorithm works as follows: 1. Distribution phase: Each node i reads its subrelation R,- from disk. A hash function H1 is applied to each tuple of R,- and the result, p (l S p S N), becomes the destination node for that tuple. Tuples are sent to their destination nodes using the local bus, for nodes that belong to the same multiprocessor, or using the interconnection network for all other nodes. All received tuples are written to disk. Repeat the same procedure for .S,. 2. Join phase: Let R:- and .5,’ be, respectively, the subsets of relations R and S on nodes i after the distribution phase. 
Each node reads R: from disk and hash it 27 using hash function H2 to produce B buckets, R21, ..., RIB. At the end of this I step, all resulting buckets are on disk and processing of S,’ starts similarly. In order to complete the join, each node reads a pair of corresponding buckets RI,- and SI,- and join them separately. Assuming there is no overlap between the two phases of the join algorithm, a num- ber of cost evaluation functions are constructed using a set of workload parameters. The cost model includes the individual costs for the CPU, the memory, the network and the I / 0. Each cost is computed separately, for each phase, and the maximum cost is taken as the cost of the corresponding phase. Using the cost functions, the system is investigated for different values of CPU speed and network, memory and I/O bandwidth. With the stated assumptions, the I/O bandwidth per cluster is determined to have the most effect on performance. Based on this conclusion, an optimal cluster size of 7 nodes is computed. Note that, in the description of the algorithm, incoming tuples are written to disk immediately after they are received. These tuples are hashed into buckets during the join phase of the algorithm. This requires an extra pair of read and write per tuple which can explain the heavy contribution of I / O to the overall cost. Also, the role of the network is limited to only the distribution phase, since during the join phase, no tuples are exchanged between nodes. However, since the clusters are connected to form a SN architecture, the problem of balancing the load across clusters still exists. If load balancing is to be applied during the join phase, the network cost will contribute more to the total time of the algorithm. We show later, in Chapter 5, that the network bandwidth has a strong impact on the total number of clusters. Another problem is that the cost of memory accesses was computed separately from the processing cost, thus ignoring the effect of memory access time on processing cost. This computation is not accurate for most systems as a processor normally waits for 28 requested data to arrive from either main or cache memory into local registers before it resumes execution. 29 Algorithm Description Comments Bucket - Partition relations into p buckets. - Relations are Converging - Examine each bucket and balance distributed twice by redistributing enough buckets. - Suseptible to skew Bucket -Parition relations into p buckets but - Relations are Spreading partition each bucket across all nodes. distributed twice - Redistribute using Omega Network. - Relies on Omega Network Tuple - Similar to bucket—spreading but - Relations are Interleaving redistribution is done in software. distributed twice Adaptive. Load - Similar to bucket—converging. - Relations are Balancing distributed twice Extended - Designed for severe data skew. - Centralized Adaptive - Buckets sizes are reported to control Load central coordinator (CC). — Reliability of Balancing - CC computes a bucket allocation allocation plan and broadcast it to all nodes. plan Scheduling - Gather buckets statistics. - Cost of bucket Hash Join - A CC computes and broadcast statistics a bucket allocation plan. - Centralized - The Longest processing time first, control first fit decreasing and skew hueristics are used for plan computation. Load Balancing Hash Join - Similar to Bucket—Spreading. - Does not rely on hardware. - Cost of statistics collection Load Balancing in Shared Virtual Memory - Partition and distribute relations. 
- Nodes that are done find busy nodes to help. - Busy nodes decide, for each tuple, whether to join it locally or send it to the helping node(s) - Can generate high network traffic. - If busy node is processing then helping node idles Hybrid System - System consists of clusters of SE connected to form a SN system - Perform load balancing inside each cluster Efficient balancing inside clusters - No cross clusters balancing Table 2.1. Summary of previously proposed join algorithms with load balancing. CHAPTER 3 MAIN MEMORY HASH—BASED JOIN ALGORITHMS FOR MULTIPROCESSORS WITH N UMA ARCHITECTURE. 3. 1 Introduction In some cases, the size of the database is limited or is growing at a slower rate than current trends in memory capacities [75, 76]. For a number of real time applications, such as telecommunications or radar tracking, the data must be memory resident in order to meet real—time constraints. In such cases, the database is necessarily smaller than the amount of available main memory [75]. Examples of main memory database systems are IBM’s QBE database project [77, 78, 79], the MM-DBMS system at the University of Wisconsin [80], the MARS MMDB at Southern Methodist University [81, 82], and IBM’s IMS/VS Fast Path [83], which is commercially available. New Shared—Memory multiprocessor systems, such as the commercially available KSRI [84] or the Stanford DASH research prototype [85, 86], are scalable far beyond the capabilities of earlier shared—memory systems. These multiprocessors provide a large number of nodes and very large main memories and as such, offer a suitable architecture for parallel database systems. Next, we present the current trends in shared—memory multiprocessors and briefly discuss the issues relevant to our work on parallel database systems. We then proceed 30 31 to introduce the parallel join algorithms for NUMA architecture. 3.2 The Multiprocessor Architecture Model Shared memory multiprocessor systems can be divided into the following three groups: Memory L Network Cache Cache o o 0 Processor Processor Figure 3.1. A typical Multiprocessor with UMA architecture. Nodes in the system access the physically shared memory through the network, typically a bus. 1. Uniform Memory Architecture (UMA): Systems in this group are usually based on a single bus connecting a number of processors, each having a local cache, to a large physically shared memory. The UMA architecture is presented in Figure 3.1. The bus provides uniform access time from all processors to the shared memory. Contentions for the common memory and the common bus limit the scalability of these systems. Multiprocessors made by Encore and Sequent are based on UMA and can provide up to 30 processors. 2. N onuniform Memory Architecture (NUMA): These systems consist of a number of nodes, each with a local memory, linked together by an intercon- nection network. Nodes may contain local caches and have shared access to all memories in the system through the network. Figure 3.2 shows a typical N UMA multiprocessor system. As the shared memory in these systems is distributed 32 [ Network Cache Cache o o 0 Processor Processor Figure 3.2. A typical Multiprocessor with NUMA architecture. Each node in the system has a processor and local main memory. The interconnection network provides each node with shared access to the physically distributed memory. across all nodes, access times to different parts of the memory can vary and are thus, nonuniform. The BBN GP1000, TC2000 [87] and the IBM. 
RP3 represent commercial N UMA multiprocessors where each node contains a single processor and the network consists of a Multistage Interconnection Network. The total number of processors in these systems can be much higher than in UMA based systems‘I87]. Example research prototypes based on N UMA architecture are the Stanford Directory Architecture for SHared memory (DASH) [85, 86] and Paradigm [88] projects and the Encore GigaMax. In these prototypes, nodes are clusters consisting of a local memory and a number of processors, each with a local cache, connected by a bus. The clusters are connected by an network (which may also be a bus) that provides shared access to the distributed memories. 3. Cache—Only Memory Architecture (COMA): The memory organization of COMA systems is similar to that of NUMA in that each node has a portion of the total memory space. However, there is no notion of main memory as all local memories are organized like large (second level) caches. A typical system based on COMA architecture is shown in Figure 3.3. The KSRI [84] from Kendall ‘The BBN GP1000 has up to 250 nodes and the TC2000 can support up to 504 nodes 33 [ Network ] Attraction Attraction Memory Memory Cache Cache O O O Processor Processor Figure 3.3. A Multiprocessor with COMA architecture. Each cluster—node can have a number of processors, each with a local cache (subcache) and a large second—level cache (main memory). Square Research is the first commercial system based on COMA architecture. This currently available system supports up to 34 cluster—nodes, each having 32 processors for a total of 1088 processors. The Swedish DDM [89] is a COMA-based research prototype developed at the Swedish Institute of Computer Science. In DDM, each node represents a single—bus multiprocessor system with physically distributed memories. These memories form a second—level cache called, Attraction Memories. The DDM features a hierarchical design where a number of nodes can be connected via a bus to form a cluster. A number of these clusters can be connected by a bus and grouped into a larger cluster and so on. Most of the previous research on parallel join algorithms for shared memory mul- tiprocessors has focused mainly on UMA—based systems [90, 91, 92]. As noted earlier, contention for the shared bus and memory as well as cache coherency problems limit the scalability of these systems. This restriction on scalability is one of the major 34 arguments by DeWitt and Gray in [70] to select Shared—Nothing as the architecture of choice for parallel database systems. N UMA multiprocessor systems, however, provide a large number of nodes and a very large, shared main memory [87, 86]. With new cache coherence schemes such as the Directory based coherency or partial snooping schemes, these systems can have a large number of processors with little overhead. In N UMA—based systems, the globally shared memory is the sum of the memo- ries local to all processors. Thus, the size of available main memory increases with increasing number of nodes in the system. The BBN TC2000 system can have up to 504 processing nodes with each having up to 32 MBytes of local memory. This sums up to a maximum of 16 Gigabytes of available memory for a fully configured system [87]. Thus, multiprocessors based on N UMA architecture seem well suited for parallel main memory database processing. To design efficient algorithms for NU MA—based systems, certain architectural is- sues have to be taken into consideration. 
Since access times from a given processor to different memories can vary, locality of reference becomes important. Algorithms should be designed to maintain a high degree of locality in order to minimize both the access times and the network load. Another important issue is regarding the use of globally shared locks for synchronization purposes. Locking may cause a high rate of memory access conflicts due to the Hot—Spot effect [93, 94]. Thus, the use of the globally shared locks should be carefully designed to minimize the adverse effects of hot—spots. 3.3 Parallel Join on N UMA Architecture In this section we present two parallel algorithms for implementing the relational join operation on N UMA architecture. The first algorithm, Distributed Hash Join (DHJ), is designed specifically for main memory database systems on N UMA—based multiprocessors. The second algorithm, Full Replication Join (FRJ), is designed for the specific cases when one of the relations is much smaller than the other. FRJ is based on the replication of all the tuples of the smaller relation. In previous works [91, 90], a global hash table of buckets is built in the shared memory and then probed by all processors in parallel during the hashing and the joining phases. Access synchronization is provided by using globally shared locks. While this may be acceptable for UMA—based architectures (where the number of nodes is moderate), performance in the presence of locks on N UMA—based machines degrades severely with increasing number of nodes. We provide distributed data structures to minimize the access conflicts and hot— spots associated with processing the hash buckets. The approach taken here is to build the buckets locally in each node. This has the effect of localizing the accesses to the buckets and thus, eliminating the centralized structure (the global hash table) and the locks associated with it. We also distribute and localize the processors syn- chronization between the different phases of the join algorithm. This minimizes the number of remote accesses performed by the various processors in order to test for synchronization. In order to analytically predict the performance of the join algorithm, we used modified multinomial distribution to model the behavior of the hash function in distrilimting the tuples across the nodes. This allows us to accurately compute the loads at different processors in the system. We derive analytical models for both join algorithms and compare their results to actual experimentations performed on a BBN multiprocessor system. Provided some parameters values are known apriori, the analytical models can be used as a basis for an optimizer to decide which algorithm should be used. For the proposed join algorithm and in the rest of this thesis, we assume that, 36 unless otherwise noted, the input relations are initially distributed uniformly across all nodes. 3.3.1 Distributed Hash Join on N UMA Architecture This section presents the Distributed Hash Join (DHJ) algorithm for implementing relational join on N UMA architecture. In this algorithm, we provide distributed data structures and special mechanisms to minimize the impact of locking and synchro- nization and to maximize the locality of reference within the nodes. The DHJ algorithm proceeds as follows. Each node hashes its local tuples based on the join attributes to determine their destination nodes. Tuples that hash into the local node are further hashed and organized into local buckets. 
These local buckets are used for probing during the second phase of the algorithm. All other tuples, i.e., those that hashed to remote destination nodes, are grouped in order to be sent to these nodes. When a node has completed distributing R, processing of S proceeds similarly. Tuples are hashed and sent to the corresponding nodes, where they are joined with the tuples of R.

In order to improve the performance of the algorithm, tuples that are found to belong to a remote node are not sent directly to that node, since that would cause a large number of remote accesses and possibly a large number of conflicts. Instead, they are marked with that node's number (N_i). Later, a list of each node's tuples is constructed and bulk-transferred to that node. The algorithm is presented in Figure 3.4.

The DHJ algorithm was designed to avoid locking of the remote access structures when transferring tuples. Each node, N_i, maintains a list, P_i, of N - 1 (the number of remote nodes) transfer structures. The structure P_i[j] contains a variable indicating the total number of tuples sent to node N_i by node N_j, as well as a pointer to where these tuples have been stored on node N_i by node N_j. As each node has exclusive access to the structures indexed by its node number, no locks are required to synchronize the accesses.

For each node (in parallel) do
  Step 1: Hash each tuple of R to determine its destination node, N_i.
          If N_i is the local node then
            Organize the tuple locally
          else
            Mark the tuple with the destination node number, N_i.
          Bulk-transfer tuples to their destination nodes.
          Increment Counter1.
          If (Counter1 >= N) then Set Done1 to TRUE in all nodes.
  Step 2: Hash each tuple of S and determine N_i.
          If N_i is not the local node then
            Mark the tuple with the destination node number, N_i.
          Bulk-transfer tuples to their destination nodes.
          Increment Counter2.
          If (Counter2 >= N) then Set Done2 to TRUE in all nodes.
  Step 3: Do
            Hash R's tuples received from remote nodes and organize them locally.
          Until (local Done1 is TRUE)
          Perform one more check for additional R tuples received from remote nodes
          and organize them locally.
  Step 4: Hash and join the local tuples of S.
          Do
            Hash S's tuples received from remote nodes and join them locally.
          Until (local Done2 is TRUE)
          Perform one more check for additional S tuples received from remote nodes
          and join them locally.
Done

Figure 3.4. Distributed Hash Join algorithm for main memory databases on NUMA architecture.

Note that before starting the join phase, nodes have to wait for all the tuples of R to be organized in the local buckets. This is achieved in Step 1 of Figure 3.4. The same is true for Step 2, where it is required that all nodes be done before any node can finish execution. Typically, this waiting is done using barrier synchronization, where nodes update a shared variable, the barrier, and then continuously check that barrier until it reaches a predetermined value. For NUMA architecture, however, this type of synchronization can cause severe performance degradation, as a large number of nodes need to frequently and concurrently check the current value of the barrier. The solution employed by DHJ is to distribute the synchronization process and localize the checking within each node. This is done by giving each node "Done1" and "Done2" flags that are initially set to FALSE. Done1 and Done2 correspond to the synchronizations for Step 1 and Step 2 respectively. We only describe the synchronization process for Step 1. The process for Step 2 is done similarly.
When a node is done executing Step 1, it increments a global counter, Counter1, that is initially set to zero. Next, the node compares the value of Counter1 to N, the total number of nodes in the system. If the value of Counter1 is greater than or equal to N, then the node is the last node to finish Step 1. Consequently, this node will set the Done1 flags in all the nodes (including its own flag) to TRUE and then proceed to Step 2. If the value of the counter is less than N, the node proceeds immediately to Step 2. The cost of this type of synchronization is just N sequential remote accesses to the shared global counter during the entire execution of the algorithm.

Next, we motivate and describe the Full Replication Join algorithm.

3.3.2 Full Replication Join

In a typical Selection-Projection-Join (SPJ) type query, the Selection-Projection operations may reduce the size of some of the relations significantly. In such cases, Full Replication Join (FRJ) fully replicates the smaller relation to all nodes so as to avoid all the remote processing associated with processing the tuples of the larger relation. In FRJ the smaller relation, R, is first replicated to all nodes and hashed into local buckets, and then each node hashes its part of S to join the tuples locally.

An advantage of full replication is that, for any given node, the execution of the two steps of the algorithm is totally independent of all other nodes in the system. When full replication of the first relation is done, the processing of the second relation, within each node, can start immediately. The replication saves the cost of remotely accessing the tuples of the larger relation, S, at the expense of replicating and hashing all of the smaller relation, R. The algorithm is shown in Figure 3.5.

For each node (in parallel) do
  Step 1: Bulk-transfer all tuples of R from all the other nodes.
  Step 2: Hash the tuples of R and organize them locally.
  Step 3: Hash the tuples of S and perform the join.
Done

Figure 3.5. Full Replication Join for main memory database systems on NUMA architecture.

Note that in this algorithm all hashing and joining are executed locally and no remote accesses are needed, except for the initial replication.

3.4 Analytical Models

In this section we develop analytical models for the two join algorithms. Performance results derived from these analytical models are then compared with those of the experiments in the next section. First, we present a number of simplifying assumptions made in developing the analytical models for DHJ and FRJ. We then develop the models for analysis.

• Both relations R and S are horizontally partitioned and their tuples are uniformly distributed across all nodes in the system. In multiprocessor systems based on NUMA architecture, the amount of main memory increases linearly with the number of nodes. Hence, |R| and |S|, the total sizes of the relations R and S respectively, are taken to be proportional to the number of nodes in the system. Therefore, |R| = N × N_R and |S| = N × N_S, where N_R and N_S denote the average number of tuples of R and S per node respectively. For example, in a system with a hundred processing nodes and N_R = 2,500 tuples, the size of the relation R is |R| = 250,000 tuples.

• The average number of comparisons needed to probe a bucket for a given tuple is F.

• All local and remote memory allocations needed for the distribution, organization and joining of tuples are in blocks of size Sb.
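Before turning to the cost models, the counter-and-flag synchronization of Section 3.3.1 can be illustrated with a short sketch. This is a minimal, thread-based illustration in which Python threads stand in for NUMA nodes; the class and method names are ours and it is not the implementation used in the experiments.

```python
import threading

class DistributedDone:
    """Minimal sketch of the Counter1/Done1 synchronization between Steps 1 and 3 of DHJ."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.counter = 0                      # globally shared counter
        self.counter_lock = threading.Lock()  # protects the single counter update
        self.done = [False] * num_nodes       # one locally checked flag per node

    def finish_step(self, node_id):
        # Each node increments the global counter exactly once per step.
        with self.counter_lock:
            self.counter += 1
            last = (self.counter >= self.num_nodes)
        if last:
            # Only the last node writes the Done flags of all nodes, so the
            # remote traffic is N counter updates plus N flag writes in total.
            for i in range(self.num_nodes):
                self.done[i] = True

    def is_done(self, node_id):
        # Every other node spins on its own local flag instead of the shared counter.
        return self.done[node_id]


if __name__ == "__main__":
    sync = DistributedDone(num_nodes=4)
    for node in range(4):
        sync.finish_step(node)
    print(all(sync.is_done(node) for node in range(4)))  # True
```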
Table 3.1 contains a glossary of the terms and parameters used in formulating the cost functions of the analytical models. The table provides a brief description of each parameter as well as the corresponding value used in the computations. The values of the system parameters were measured on a BBN TC2000 multiprocessor by executing the corresponding operations inside tight loops for a large number of iterations. The results were then averaged over the number of iterations while accounting for loop and system overheads.

Symbol   Meaning                                       Value
F        Average number of comparisons per probe       1.2
Sb       Block size (in number of tuples)              50
Th       Time to hash a key                            .003
Tc       Time to compare two keys                      .003
Tl       Time to access local memory                   .005
Tr       Time to access remote memory                  .013
Tml      Time to move a block locally                  .23
Tmr      Time to move a block remotely                 .27
Tal      Time to allocate a local block                .05
Tar      Time to allocate a remote block               .09
Taj      Time to actually join two tuples              2 × Tml + (1/Sb) × Tal
Ti       Time to insert a tuple                        F × Tc + Tl
Tj       Time to join two tuples                       F × Tc + Taj

Table 3.1. Notations and parameter values. All values that pertain to the specific architecture were measured on the BBN TC2000 multiprocessor system.

The network and memory bandwidths, for the NUMA system discussed in this chapter, increase linearly with increasing number of nodes. While concurrent accesses to the same memory location must be synchronized, it has been shown that, in the absence of hot-spots, networks such as the Multistage Interconnection Network (MIN) of the BBN multiprocessor system maintain a close to linear bandwidth with increasing number of nodes [95, 96, 87]. The DHJ algorithm distributes all data uniformly and uses a synchronization scheme that minimizes locking. Hence, network conflicts are minimized through this uniform distribution of data and accesses.

Conflicts may occur during the different phases of DHJ when a number of nodes try to concurrently access the same target node. To maintain data consistency, all such accesses must be serialized. If a given node N_i attempts to access a target node N_j, N_i may have to wait for the other nodes that are already involved in accessing N_j. Thus, in order to compute the time for accesses with conflicts, we need to determine the number of nodes trying to access N_j. With N nodes active in the system, the time to concurrently access the target node is ξ(R,N) × Ta, where Ta is the time to access the node without any conflicts, |R| is the number of data items and ξ(R,N) is the expected number of nodes involved in the access.

To determine the expected number of concurrent accesses, ξ, we formulate the problem as follows: given |R| tuples and N nodes (N << |R|), if N tuples (N ≤ |R|) are randomly selected from the |R| tuples, find the expected number of nodes with at least one tuple to be written to. This is the same as the problem of characterizing the number of granules (blocks) accessed by a transaction, as determined by Yao [97, 91]. The solution to the above problem is given by Yao's theorem [97], which states that the expected number of blocks hit is

ŷ = N × [1 - Π_{i=1}^{N} (|R| × D - i + 1) / (|R| - i + 1)]     (1)

where |R| is the number of tuples in the relation R and

D = 1 - 1/N.

Therefore, the expected number of nodes involved in the access is ξ(R,N) = N / ŷ. In this model, conflicts are assumed and studied at each node's access path to the network, and we assume a single path at each node. If multiple (x) physical or virtual paths exist for each node, then x accesses may be executed at the same time and, consequently, the conflict formulation should be updated accordingly.

3.4.1 Modeling of processors' loads

As noted in Section 3.3.1, before a node can enter Step 4 of the algorithm in Figure 3.4, all nodes must have finished executing Step 1. Thus the node that handles the largest number of tuples of R will dominate the time for Steps 1 and 3. The same is true for Steps 2 and 4 with regard to S. Thus, the analytical model for DHJ determines the performance based on the time for the node that handles the maximum number of tuples, M_R and M_S, for R and S respectively.

If the hash functions randomly distribute the tuples across the nodes, some nodes will receive more tuples than others during the distribution phases of R and S. These nodes, having to process more tuples, tend to dominate the performance of the algorithm in the corresponding phases. Thus, a node that receives the maximum number of R tuples will dominate both Steps 1 and 3 of Figure 3.4. In order to accurately measure the time for each phase, we need to determine the maximum number of tuples, M_R and M_S (for R and S respectively), that a node may receive. We first define the term stochastic increase and then conjecture about the behavior of the value of M_R.

Definition 1. Given two variables X and Y, X is said to be stochastically greater than or equal to Y (X ≥st Y) if and only if, for all r in Domain(X,Y), P(X ≥ r) ≥ P(Y ≥ r), where P(A) is the probability of event A.

Conjecturing on the behavior of M_R

Assuming a uniform hash function and the tuples to be uniformly distributed in the tuple space, we can formulate our model as follows. Let x be the number of balls (tuples) and y be the number of urns (nodes), where x = y × p for some constant factor p (p represents the average number of tuples per node, N_R). In this model, both balls and urns are indistinguishable. This is to reflect the fact that, from a processing point of view, all processors are equivalent and only the number of tuples is important, in contrast to their individual identities. For example, if x = 12 and y = 4, then (8,3,1,0), (8,1,3,0) and (8,0,1,3) are equivalent, and so on. Since the urns are also indistinguishable, all the combinations in the above example reduce to just one. For notational convenience, we use the combination with the numbers in descending order, e.g., for the above example we use (8, 3, 1, 0).

Given x balls and y urns and allowing for empty urns, the number of possible combinations is equivalent to the number of ways to partition an integer into different sums of integers [98, 99, 100]. Integer partitioning is a well known problem in Number Theory. For example, for x = 4 we have 4, 3 + 1, 2 + 2, 2 + 1 + 1 and 1 + 1 + 1 + 1. To view this example with 4 urns, we have (4,0,0,0), (3,1,0,0), (2,2,0,0), (2,1,1,0), (1,1,1,1). Now let us consider P_x^y, the number of partitions of x into exactly y parts. For x = 4 and y = 2 in the above example, the combinations are {(4,0,0,0)} and {(3,1,0,0), (2,2,0,0)}, so that P_4^1 = 1 and P_4^2 = 2. It is easy to see that the total number of combinations of x balls into y urns, W_x^y, is equivalent to the number of partitions of x into a maximum of y integers. Hence, W_x^y can be computed as the sum of the number of ways to partition x into y or fewer parts:

W_x^y = Σ_{i=1}^{y} P_x^i.

For instance, W_4^2 = P_4^1 + P_4^2 = 1 + 2 = 3. It has been shown in Number Theory that the numbers P_x^y satisfy the following recurrence relation [98, 101, 102]:

P_x^1 + P_x^2 + ... + P_x^y = P_{x+y}^y, with P_x^1 = P_x^x = 1.
This relationship enables a recursive computation of the numbers P_x^y row by row, as shown in Table 3.2.

 x\y    1    2    3    4    5    6    7    8    9   10   11   12
  1     1    0    0    0    0    0    0    0    0    0    0    0
  2     1    1    0    0    0    0    0    0    0    0    0    0
  3     1    1    1    0    0    0    0    0    0    0    0    0
  4     1    2    1    1    0    0    0    0    0    0    0    0
  5     1    2    2    1    1    0    0    0    0    0    0    0
  6     1    3    3    2    1    1    0    0    0    0    0    0
  7     1    3    4    3    2    1    1    0    0    0    0    0
  8     1    4    5    5    3    2    1    1    0    0    0    0
  9     1    4    7    6    5    3    2    1    1    0    0    0
 10     1    5    8    9    7    5    3    2    1    1    0    0
 11     1    5   10   11   10    7    5    3    2    1    1    0
 12     1    6   12   15   13   11    7    5    3    2    1    1

Table 3.2. Number of partitions of x into y parts.

We now proceed with the conjecture that the value of M_R increases stochastically with increasing values of N. Note that the probability P(X ≥ r) mentioned in Definition 1 can be computed as the number of combinations whose maximum entry is greater than or equal to r, divided by the total number of combinations W_x^y. Given y urns, we let M_r^y be the number of combinations having r as their maximum.

First, we provide an example to illustrate and simplify the presentation of the various formulations. Let p = 3; Table 3.3 presents the possible combinations for (x,y) = (6,2), (9,3) and (12,4) respectively. We note that when r ≥ x/2, M_r^y = W_s^{y-1}, where s = x - r. The reason is that the number of combinations with y urns and r as maximum is equivalent to the number of combinations of x - r balls in y - 1 urns. For example, in Table 3.3 with x = 12, M_7^4 = 5 = W_5^3 and M_8^4 = 4 = W_4^3, but M_5^4 = 6 ≠ W_7^3 = 8.

(x,y)    Combinations
(6,2)    (6,0) (5,1) (4,2) (3,3)
(9,3)    (9,0,0) (8,1,0) (7,2,0) (7,1,1) (6,3,0) (6,2,1) (5,4,0) (5,3,1) (5,2,2)
         (4,4,1) (4,3,2) (3,3,3)
(12,4)   (12,0,0,0) (11,1,0,0) (10,2,0,0) (10,1,1,0) (9,3,0,0) (9,2,1,0) (9,1,1,1)
         (8,4,0,0) (8,3,1,0) (8,2,2,0) (8,2,1,1) (7,5,0,0) (7,4,1,0) (7,3,2,0)
         (7,3,1,1) (7,2,2,1) (6,6,0,0) (6,5,1,0) (6,4,2,0) (6,4,1,1) (6,3,3,0)
         (6,3,2,1) (6,2,2,2) (5,5,2,0) (5,5,1,1) (5,4,3,0) (5,4,2,1) (5,3,3,1)
         (5,3,2,2) (4,4,4,0) (4,4,3,1) (4,4,2,2) (4,3,3,2) (3,3,3,3)

Table 3.3. Example of the various combinations for p = 3.

When r < x/2, s becomes larger than r. But, since r is the maximum number, all combinations of the remaining s balls that contain a number larger than r have to be dropped. These combinations are exactly the ones containing s, s-1, ..., r+1 as their maximums. Hence,

M_r^y = W_s^{y-1} - Σ_{i=0}^{s-r-1} P_{y-2+i}^{y-2}.

For example,

M_5^4 = W_7^3 - Σ_{i=0}^{1} P_{2+i}^2 = 8 - (1 + 1) = 6

and

M_4^4 = W_8^3 - Σ_{i=0}^{3} P_{2+i}^2 = 10 - (1 + 1 + 2 + 2) = 4.

Hence, the number of combinations with r ≥ k, for some variable k, can be expressed as

T_k^y = Σ_{r=k}^{x} [ W_{x-r}^{y-1} - Σ_{j=0}^{x-2r-1} P_{y-2+j}^{y-2} ].

The probability of obtaining a combination with r ≥ k is

P(k,y) = T_k^y / W_x^y.

That is to say, the probability is the number of combinations whose maximum is at least k over the total number of combinations with x balls and y urns. From the example in Table 3.3, we have P(5,3) = 9/12 ≤ P(5,4) = 29/34 and P(6,3) = 6/12 ≤ P(6,4) = 23/34. We can see from the above example (and we have verified using numerous other examples) that, for all k and x,

P(k,y) ≤ P(k,y+1).

Thus, we conjecture that the value of the maximum (M_R) increases stochastically, since the corresponding probability is increasing. It is important to note that the above states only that the value of M_R increases stochastically with increasing number of nodes. That is to say, for all K,

P(M_R ≥ K given N = X + 1) ≥ P(M_R ≥ K given N = X).

This does not show, however, that M_R increases monotonically, as the value of M_R can actually decrease, as shown in Figure 3.7.
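The recurrence and the probabilities above are easy to check numerically. The short sketch below (function names are ours) recomputes the entries of Table 3.2 and the example values P(5,3) = 9/12 and P(5,4) = 29/34 under the stated balls-and-urns model.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def P(x, y):
    """Number of partitions of x into exactly y positive parts (entries of Table 3.2)."""
    if y <= 0 or x < y:
        return 0
    if y == 1 or y == x:
        return 1
    # Either some part equals 1 (drop it) or every part is >= 2 (subtract 1 from each).
    return P(x - 1, y - 1) + P(x - y, y)

def W(x, y):
    """Partitions of x into at most y parts; equals P(x + y, y) by the recurrence."""
    return sum(P(x, i) for i in range(1, y + 1))

@lru_cache(maxsize=None)
def count_with_max(x, y, r):
    """Partitions of x into at most y parts whose largest part is exactly r."""
    if x == 0:
        return 1 if r == 0 else 0
    if r <= 0 or y <= 0 or r > x:
        return 0
    # Fix one part equal to r, then partition the rest with parts no larger than r.
    return sum(count_with_max(x - r, y - 1, m) for m in range(0, min(r, x - r) + 1))

def prob_max_at_least(k, y, p=3):
    """P(k, y): probability that the fullest of y urns holds at least k of the x = y*p balls."""
    x = y * p
    hits = sum(count_with_max(x, y, r) for r in range(k, x + 1))
    return hits / W(x, y)

if __name__ == "__main__":
    assert P(12, 2) == 6 and W(12, 4) == 34                   # Table 3.2 / Table 3.3 totals
    print(prob_max_at_least(5, 3), prob_max_at_least(5, 4))   # 0.75 and 29/34 = 0.8529...
```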
Computing the value of M_R

In this section, we compute the actual value of the maximum based on a modified Multinomial probabilistic distribution. Assuming the hashing function to randomly distribute the tuples among all nodes, the number of tuples per node, X, is a random variable with a Multinomial distribution. When n_N, the average number of tuples per node, is large compared to N, we have

E(X) = μ = n_N and VAR(X) = σ² = n_N × N × (1/N) × (1 - 1/N) ≈ n_N,

and the Normal distribution may be used for approximation.

Let Z = (Z_1, Z_2, ..., Z_N) be a vector of N Independent and Identically Distributed (IID) random variables. Let Y = σ × (Z - Z̄) + μ, where Z̄ is the mean of Z. Then the maximum number of tuples a node may receive for the parameter values n_N and N is M_{N,nN} = max(Y). In order to determine M_{N,nN}, we compute

Q_N = (M_{N,nN} - n_N) / sqrt(n_N) = max(Z_i - Z̄), for i = 1, ..., N,

and then solve for M_{N,nN}:

M_{N,nN} = Q_N × sqrt(n_N) + n_N.

Given N and N_R, the number of nodes and the average number of tuples per node respectively, the computation of M_R proceeds by computing N Normal variables with a mean of 0 and a standard deviation of 1 [103, 104, 105, 106]. Next, we compute the average of these variables, Z̄, and then the value of max(Z_i - Z̄) for i = 1, ..., N. Now, the value of M_R can be computed as ⌈max(Z_i - Z̄) × sqrt(N_R) + N_R⌉. Figure 3.6 provides pseudo-code for the actual algorithm used to compute the value of M_R.

/* Compute N Normal variables */
For i = 1 to ceiling(N/2) do
  Repeat
    Generate u1 and u2 (uniformly over [0,1])
    Let v1 = 2 × u1 - 1 and v2 = 2 × u2 - 1
    Let s = v1² + v2²
  Until (s < 1)
  Let t = sqrt(-2 × ln(s) / s)
  Let Z_{2i-1} = t × v1 and Z_{2i} = t × v2.
Done
/* Compute value of M_R */
Let Z̄ = average of the Z variables
Let Y = max(Z_i - Z̄), i = 1, ..., N
M_R = Y × sqrt(N_R) + N_R

Figure 3.6. Pseudo-code of the computation of M_R using Normal variables to approximate the Multinomial distribution.

Next, we derive the cost functions for the DHJ algorithm using the derived values of M_R and M_S.

3.4.2 Cost Formulation for DHJ

Given the maximums derived above, the cost for the Distributed Hash Join algorithm is derived as follows:

• Step 1: Hash each tuple of R and determine its destination node, N_i. With the assumption of uniform distribution, 1/N of the tuples will hash to the local node while the other (N-1)/N of the tuples will hash to remote nodes. A tuple that hashes locally is organized into its bucket at a cost of Ti, while a tuple destined to a remote node is appended to the corresponding transfer list at a cost of (1/Sb) × Tal + Ti, the allocation of list blocks being amortized over Sb tuples. Thus, the total cost of this is:

M_R × [Th + (1/N) × Ti + ((N-1)/N) × ((1/Sb) × Tal + Ti)] × ξ(R,N) × ξ(S,N).

Next, for each remote destination node, we allocate space on the corresponding node and move the lists. The cost is:

M_R × ((N-1)/N) × (1/Sb) × (Tar + Tmr) × ξ(R,N) × ξ(S,N).

• Step 2: The cost for this step is similar to that of Step 1. Each tuple of S is hashed to determine its destination node, N_i. The tuple is added to the corresponding transfer list if N_i is not the current node. The cost for this is:

M_S × [Th + ((N-1)/N) × ((1/Sb) × Tal + Ti)] × ξ(S,N) × ξ(R,N).

Next, allocate space on the corresponding nodes and move the lists. The cost is:

M_S × ((N-1)/N) × (1/Sb) × (Tar + Tmr) × ξ(S,N) × ξ(R,N).

• Step 3: Hash all the tuples of R received from the remote nodes in Step 1 and organize them locally. As in Step 1, with the assumption of uniform distribution, the fraction of tuples received from remote nodes is (N-1)/N. Thus, the cost of this step is:

M_R × ((N-1)/N) × [Th + Ti] × ξ(S,N) × ξ(R,N).

• Step 4: Finally, the cost of joining the local and remotely received tuples of S is:

M_S × [Th + Tj] × ξ(S,N).

3.4.3 Cost Formulation for FRJ

The cost for FRJ is analyzed as follows:

• Step 1: Copy all tuples of R from all other nodes (using bulk-transfer). The cost is:

(N - 1) × Tbt(N_R, N) × ξ(R,N),

where Tbt(N_R, N) is the cost of transferring a block of N_R tuples when N nodes are active.

• Step 2: Hash the tuples of R and organize them locally. The cost is:

N_R × N × [Th + Ti × ξ(R,N)].

• Step 3: Hash the tuples of S and perform the join. The cost is:

N_S × [Th + Tj × ξ(R,N)].

A quantitative performance evaluation of these models is presented in the next section, along with actual experimentation on a commercial multiprocessor system.

3.5 Performance Evaluation

In this section we study the performance of both join algorithms using the cost formulas developed in the previous section. We present actual experimental results obtained from implementing these algorithms on a BBN TC2000 multiprocessor system and show that these results validate our analytical models with varying degrees of accuracy. Measurements for the experimental results were taken for varying relation sizes, and the total cost of the algorithm was computed as the maximum time over all nodes in the system. Each plotted value represents the average of a number of runs with the same parameter values. A brief explanation concerning the actual relations used in the experiments follows.

3.5.1 Benchmarking relations

The relations used in the experiments are based on the standard Wisconsin Benchmark [107]. Each relation consists of two integer identifiers and three strings, each of length 40 bytes. The initial partitioning of the relations is based on the first integer attribute. Each string is 40 letters long, with three letters (the first, middle and last) being varied, and two separating substrings that contain only the letter x. The three significant letters are chosen in the range (A, B, ..., V) to allow up to 10,648 different string combinations. If a larger relation size is desired, as is the case in most of this thesis, more significant letters are added within the separating substrings to create additional combinations.

In generating the tuples for the experimental runs, only the first integer attribute and one of the strings in each tuple were used. Other attributes in the benchmark relations were designed to test operators other than join, such as the projection and aggregate operations, and thus the values they contained were not used in our experiments.

3.5.2 Performance comparison

When the sizes, in number of tuples, of the relations increase, more tuples need to be transferred and processed by the remote nodes. In Figure 3.7, N_R was fixed at 500 tuples and N_S was set to 1,000 and 2,500 tuples, while N varied from 1 to 25 nodes.

[Plot: total time versus number of nodes (1 to 25), showing the analytical and experimental curves for N_S = 1K and N_S = 2.5K.]

Figure 3.7. Comparison of the analytical and experimental results for DHJ.

The experimental runs were conducted on the BBN TC2000 with up to 25 nodes, the maximum number of public nodes available on the system we had access to. The figure shows an initial increase in cost when the number of nodes increases from 1 to 5. This is due to the increase, in this range, of the number of conflicts generated by concurrent accesses (as presented in the discussion about conflicts in Section 3.4).
When the number of nodes and consequently the total number of tuples increase, the maximum number of tuples a node may receive also increases. Note however, that this maximum number of tuples increases stochastically and not mono- tonically. This explains the fact that, for certain number of nodes in the figure, this value does actually decrease as evident in our experimental results. The value of the maximum was approximated using the formulation in section 3.4.1. This increase will contribute to more delay in the processing of each phase and consequently, cause more conflicts. This is shown in Figure 3.7 where the cost of the algorithm does increase slightly with increasing number of nodes. Nodes that receive more tuples and thus lag behind in Step 1, are subjected to remote accesses by those processing tuples in Step 2. On average, this will only occur for the period of time needed to process the extra tuples. The analytical model handles this by including the conflict rate €(S.N) in the cost formula for Step 1. While this helps in accounting for the conflicts generated by the faster nodes (those in Step 2), it overestimates the effects of these conflicts. The same occurs when f(R,N) is used in Steps 2 and 3. However, because the model computes the cost of the algorithm based 011 the node with the maximum number of tuples, it closely approximates the actual experimental results. It can be noted from Figure 3.7 that the analytical model does slightly overestimate the cost of the algorithm but that for the most part, it closely models the experimental results. Figure 3.8 illustrates both DHJ and FRJ with a small ratio of R to S (R is 100 and S is 2, 500 tuples) and shows that for up to 23 nodes, in this particular case, FRJ outperforms DHJ. The reason is that, as mentioned earlier, while DHJ has to transfer lists of tuples from remote nodes, all tuples accesses in F RJ, following the replication step, are local. This locality of references is the main idea behind the F RJ algorithm. When the size of S increases, the cost of remotely accessing the tuples, in DHJ, increases much more rapidly than the cost of accessing them locally in FRJ. Hence, larger sizes of S favor F RJ over DHJ. On the other hand, the costs of replicating and locally hashing all of R increase very rapidly with the size of R. When the size of R increases, the benefits of the locality of references in F RJ decrease against the rapidly growing costs of replication and local hashing. Thus, FRJ outperforms DHJ when the size of R is very small compared to that of S. In Figure 3.8, the cost of FRJ increases beyond that of DHJ when the size of R is approximately 2, 300 tuples (at 23 nodes). Given the accuracy of the analytical models in predicting the performance of both join algorithms, the models can be used in order to decide which algorithm to apply. The information needed to make such a decision is the total number of nodes in the system and the sizes of both relations. This information can be readily available when the execution of the join operation is attempted. NR = 100, N5 = 2.5K 16000 1 1 1 1 I 1 14000 12000 10000 8000 I FWlJ -—- IDHJ -—— 990% 6000 4000 2000 O 1 l l I J l 10 15 20 25 30 Number of nodes Ids—Hf? .— ~ d (b {5‘ Figure 3.8. Comparison between the Distributed Hash Join and the Full Replication Join algorithms. Figure 3.9 shows the performance of FRJ when NR: 50 and NS: 1, 000. 
Note that as the number of nodes increases, the number of tuples to be replicated and locally organized increases linearly, and so the costs of replication and local organization increase linearly (but with different rates, see Figure 3.10). When the size of R increases, these costs dominate the performance of FRJ. The figure compares the analytical and simulation results produced with the same parameter values. It can be noted that, for the most part, the analytical model reasonably approximates the simulation results.

[Plot: total time versus number of nodes (10 to 50), comparing the analytical and experimental curves for FRJ with N_R = 50 and N_S = 1K.]

Figure 3.9. Comparing analytical and experimental results for the Full Replication Join algorithm.

Figure 3.10 shows the individual costs of the replication, local hash and join steps in FRJ when N_R = 25 and N_S = 1,000 tuples. As noted earlier, FRJ eliminates all remote processing of S, thus improving the performance when S is large, at the expense of fully replicating and locally organizing R. When the number of nodes increases, the cost of joining the local tuples of S increases slightly because the size of the local hash table for R gets larger. The local organization of R increases linearly with the number of nodes, but since the size of R is very small for this algorithm, the effect of this step on the overall performance is minimal. The major factor affecting performance is the cost of replication, which increases more rapidly than the cost of joining and local hashing. This is shown in Figure 3.10.

[Plot: execution time versus number of nodes (1 to 20) for the replication, hash, join and total costs of FRJ with N_R = 25 and N_S = 1K.]

Figure 3.10. Execution times for individual steps of the Full Replication Join algorithm.

3.6 Data Skew

When the distribution of tuples is not uniform, some processors may get a much larger share of work than others. This data skewness can severely affect the performance of join algorithms [17, 72]. We next present the Zipfian distribution used to model data skew and then examine the effects of skew on the performance of DHJ.

3.6.1 Zipfian Distribution

In order to model data skew in the input relations, the uniform distribution of tuples in the tuple space is often replaced with the parameterized Zipfian distribution [108, 109]. The Zipfian distribution, applied to the input relations' tuple space, states that a large number of tuples, tunable by a given parameter, are concentrated within a small region of the space. In the context of parallel database systems, it means that, given an unbiased partitioning strategy such as hashing, a small number of nodes will get most of the tuples. As noted in Chapter 1, the findings of Lakshmi and Yu state that many real-life data sets exhibit this kind of data skew [17]. We next introduce the formulation of the Zipfian distribution in the context of parallel database systems.

[Plot: probability versus tuple value (0 to 100) for Zipf-like distributions with θ = 0, 0.5 and 0.8.]

Figure 3.11. Zipf-like distributions with various θ values.

Assume the domain of the join attribute has D distinct values. The probability that the join attribute of a given tuple takes on the ith value in the domain is

p_i = C / i^(1-Z), where 1 ≤ i ≤ D and

C = 1 / Σ_{i=1}^{D} (1 / i^(1-Z))

is a normalization constant.
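A short sketch of this formulation, with names of our own choosing, is given below. It computes the probabilities p_i and draws skewed join-attribute values; it illustrates the distribution only and is not the tuple generator used in the experiments.

```python
import random

def zipf_probabilities(D, Z):
    """Probabilities p_i = C / i**(1 - Z) for i = 1..D, as in Section 3.6.1.

    Z = 1 gives the uniform distribution and Z = 0 the pure Zipfian
    (equivalently, theta = 1 - Z is the skew knob used in the text).
    """
    weights = [1.0 / (i ** (1.0 - Z)) for i in range(1, D + 1)]
    C = 1.0 / sum(weights)          # normalization constant
    return [C * w for w in weights]

def sample_join_values(num_tuples, D, Z, seed=0):
    """Draw skewed join-attribute values; a sketch, not the thesis' generator."""
    rng = random.Random(seed)
    probs = zipf_probabilities(D, Z)
    return rng.choices(range(1, D + 1), weights=probs, k=num_tuples)

if __name__ == "__main__":
    probs = zipf_probabilities(D=100, Z=0.2)             # theta = 0.8, heavily skewed
    print(round(probs[0], 3), round(sum(probs), 3))      # mass on value 1, total 1.0
```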
The value of Z ranges between 0, the pure Zipfian which is highly skewed and 1 which corresponds to the Uniform distribution. In order to make the value of the skew parameter more. intuitive, i.e., a higher value means more skew, we opted to use 9 = 1 — Z in our discussions regarding data skew. Figure 3.11 presents a typical Zipfian distribution where the tuple domain is of size 100 and the Zipfian parameter is set to 0, 0.5 and 0.8 respectively. N321000, N :5, 0:0.7 500 ' I 1 I I I I I I 450 Node 1 '0— '1 400 _ 13:31:; -A— T T 350 7 Node 4 — " u 300 Node 5 'X' - _. '1’ 2501 2 g 200 >Z - 150 r _ 100 - e 50 - 0 l I I l I I I l 1 2 3 4 5 6 7 8 9 10 Bucket number Figure 3.12. Typical distribution of tuples to buckets under skew. 60 Figure 3.12 shows a typical distribution of tuples, in the presence of skew, to buckets within the various nodes. Note the effect of skew on the sizes of the buckets where few buckets are quite large due to highly skewed input values while others are quite small. The results presented in this figure were computed experimentally where a total of 5,000 tuples were distributed among 5 nodes each with 10 buckets. The skew rate was fixed at 0 = 0.7. The buckets in each node were sorted by size to simplify the presentation of the results in the figure. NR = 500, N3 = 2,500 T T T I _ 6:0— «.2 18000 0213+ ”0.6.0 T 16000— 0:.89- .0.» ., O t a 1 t i In 6. 4000, 1 1 1 1 5 10 15 20 25 Number of nodes Figure 3.13. Performance of Distributed Hash Join with skewed input relations. 3.6.2 Effects of Skew on the Performance of DHJ Figure 3.13 compares the performance of DHJ with various degrees of data skew. The figure shows that with increasing values of 0, Le. higher degrees of data skew, the total execution time for DHJ increases. This is because the few nodes, overloaded 61 with a large number of tuples, will take longer to process these tuples. Other, more lightly loaded nodes, may be idle during this time and thus resource utilization can be very poor in these cases. The next chapter investigates load balancing for the join algorithm in order to effectively deal with skewed data. CHAPTER 4 DISTRIBUTED LOAD BALANCING FOR PARALLEL MAIN MEMORY HASH JOIN 4.1 Introduction In Selection—Projection—Join (SPJ) type queries, the selection and projection steps may largely alter the distribution of tuples between nodes. Consequently, the join operation will have to deal with an unbalanced load. Hashing of the tuples of R and S, during the initial distribution phase of the join algorithm, may also result in data skew, as explained in Chapter 1 and shown in [17, 72]. In Chapter 2, we introduced a number of proposed schemes to deal with the prob- lem of load balancing. Based on our examination of these schemes according to both, the balancing decisions and the actual data transfers, we can classify them into differ- ent categories. The first category uses a centralized scheduler to make the balancing decisions. Algorithms that fall into this category were mainly designed for UMA— based systems and do not scale well with increasing number of nodes. Algorithms in the second category add a separate phase for collecting statistical distribution i11- formation about the tuples. This information is later used to make the decisions concerning load balancing. The major problem with the algorithms in this category comes from the added cost incurred by the information gathering phase. 
Thus, the performance of the load balancing scheme suffers in the cases of moderate to low skew 62 63 rates as the statistics collection phase has to be executed anyhow. In this chapter, we propose a distributed dynamic load balancing scheme [110]. This scheme does not rely on prescanning the input relations and thus, it avoids the extra overhead associated with collecting statistics about the distribution of tuples. The balancing scheme is completely dynamic where each helping node, i.e. a processor that is done processing its local data, decides which other nodes to help. In order to make the helping decision, a helping node checks the status of the other nodes and selects an overloaded node to help. The helping node then gets a block of tuples from the overloaded node and processes these tuples locally. In a N UMA multiprocessor, such interactions between the nodes may cause conflicts. We show that random probing for overloaded nodes eliminates the hot—spot effects associated with sequential probing. We develop an analytical model that accurately predicts the performance of the balancing scheme and compare its results to those obtained from actual experiments on a BBN multiprocessor system. We show the load balancing scheme to be effective in dealing with various degrees of data skew. The next section presents the load balancing scheme in details and addresses the problem of conflicts. 4.2 Proposed Load Balancing Scheme When load imbalance arises, each helping node will transfer an appropriately sized block of tuples from a selected overloaded node and process the block locally. The selection of the overloaded node is made by each helping node independently in order to prevent any bottlenecks that may result from a centralized scheduling scheme. The general scheme for distributed load balancing is shown in Figure 4.1. The termination check in Step 2 of the figure is similar to that of DHJ in Section 3.3.1. Each node in the system contains a local Done flag that is initially set to FALSE. A 64 node that is done with Step 1 increments a global counter, initially set to 0. The node then determines whether it is the last node to finish Step 1. This is done by simply comparing the current value of the global counter to N, the total number of nodes in the system. If the values are different, then the node can go into the helping mode. Otherwise, if the values are equal, this node will set the Done flags to TRUE in all the nodes, thus signaling the end of execution. While locking is necessary in updating the global counter, there are only N such updates throughout the execution of the algorithm. All helping nodes will check for termination of execution using their local flags. Thus, termination check will not cause any additional network contention. Each node PL. in the system, maintains an index ([1,) to its data and a lock (Lk) to synchronize simultaneous accesses to 11,. A node, wishing to copy a block from node Pk, has to lock Lk, update 1,, with the size of the block, unlock L, and then transfer the block. The size of the block to be transferred from a given node is taken as the minimum of Sb and the number of tuples left at that node. This means that all nodes must use the locks in order to process their own tuples. On N UMA architecture, locks can be very costly if not handled properly. The next two sections discuss strategies for selecting overloaded nodes while considering the effects of locking. 
4.2.1 Scheduling by Sequential Probing Probing, to find an overloaded node, can be done either sequentially or randomly. In sequential probing, helping nodes sequentially check the status of other nodes and select the first overloaded node to help. Since the helping nodes are searching for over- loaded nodes sequentially, all the helping nodes between two consecutive overloaded nodes will concurrently try to help one of these overloaded nodes. These concurrent accesses have to be serialized and thus, will form long chains at the overloaded nodes. This phenomenon may cause an uneven distribution of helping to helped nodes and can result in hot spots. 65 For each node (in parallel) do Step 1: Process local data. Increment Global Counter. If (Global Counter = N) then Set the Done flags to TRUE in all nodes. Step 2: Repeat A. Select an overloaded node. B. Transfer a block of size Minimum(Un—processed data items, Sb) C. Process the block locally. Until (local Done flag = TRUE). Done Figure 4.1. Distributed load balancing scheme To illustrate this phenomenon, let us assume that the system has 16 nodes and at time t, all nodes have become helping nodes except for nodes number 13 and 16 that are still overloaded. With sequential probing, node number 13 may be probed and helped by all the 12 nodes before it, while node number 16, which may have many more tuples than number 13, will only get help from nodes 14 and 15. Figure 4.2 displays the percentages of large intervals when the number of nodes in the system is N = 100. We define an interval as the number of helping nodes between two successive overloaded nodes and we let n1 be the number of overloaded nodes in the system. The figure shows that the percentage of large intervals increases sharply with the skew rate. For example, less than 6 % of the intervals are more than 40 when n] = 6. This percentage increases sharply to about 20 % when n; is decreased to 4. Thus, the probability of forming long chains increases (which results in hot spots) with decreasing number of overloaded nodes and consequently with higher skew rates. According to Lakshmi and Yu [17], only very few nodes will have 66 N = 100 100 I I I I 80 n] = 4 _ _ n1 2 6 +— % 60 " n 0 d e 40 ‘ S 20 ....... ........ 0 10 20 30 4O 50 60 70 80 90 Distance Figure 4.2. Percentage of overloaded nodes vs interval size. most of the tuples. Hence, sequential probing is likely to degrade the performance of the balancing scheme in the presence of data skew. The data in Figure 4.2 was obtained experimentally where 4 and 6 nodes were selected randomly to be overloaded nodes and the intervals between them were mea- sured. A count of each interval size was kept and the experiment was run for 10, 000 iterations. The final results were then averaged and their cumulative probabilities were computed and plotted. 4.2.2 Scheduling by Random Probing As the number of nodes increases, the chain effect in sequential probing causes more performance degradation. In order to avoid this problem, helping nodes probe ran- domly for overloaded nodes. A node is selected at random and its status is checked. If it is overloaded, the helping node will get a block of data from this node, process it locally and then randomly select another overloaded node. 67 Random selection was shown in [111] to be as good as other more complex methods for implementing load sharing in a homogeneous distributed environment. Kumar et al. 
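The following single-threaded simulation sketches the random-probing idea; all names and the simplified termination test are ours, not the thesis' implementation. Idle nodes repeatedly pick a random node, and whenever the probed node still has unprocessed tuples they claim one block of at most Sb tuples from it and process it locally.

```python
import random

class SimpleQueue:
    """Stand-in for a node's leftover tuples (no locking; single-threaded sketch)."""
    def __init__(self, n_tuples, block_size=50):
        self.remaining_tuples = n_tuples
        self.block_size = block_size

    def reserve_block(self):
        size = min(self.block_size, self.remaining_tuples)
        self.remaining_tuples -= size
        return size

def random_probe_helpers(loads, block_size=50, seed=0):
    """Simulate the random-probing scheme of Section 4.2.2.

    'loads' gives each node's leftover tuples once the other nodes are idle;
    helpers repeatedly probe a random node and take one block from it.
    Returns how many blocks each helper ends up processing.
    """
    rng = random.Random(seed)
    queues = [SimpleQueue(n, block_size) for n in loads]
    helped = [0] * len(loads)
    while any(q.remaining_tuples > 0 for q in queues):
        helper = rng.randrange(len(loads))       # an idle node doing the probing
        target = rng.randrange(len(loads))       # randomly selected node to check
        if queues[target].remaining_tuples > 0:
            queues[target].reserve_block()       # claim min(Sb, remaining) tuples
            helped[helper] += 1                  # and process them "locally"
    return helped

if __name__ == "__main__":
    # One heavily overloaded node (data skew) and nine idle ones.
    print(random_probe_helpers(loads=[1000, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
```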
[112, 113] studied the scalability of various load balancing schemes including a scheme called random polling. Under this scheme, an idle node randomly polls other nodes for work and sends requests to these nodes. Upon receiving a work request, a node will reject the request if it is an idle node, or grant it and thus, transfer an appropriately sized block to the requesting node. A background process monitors the state of each node and will terminate the whole program when all nodes become idle. This random scheme was compared with a large number of other balancing schemes that relied 011 statistics collection and/or redistribution of loads and was shown to be the most scalable [112]. In our scheme, idle nodes probe randomly and perform the transfers directly and thus, do not require any extra processing or lookups on the part of the overloaded nodes. When a node is done joining its local buckets, it becomes idle and increments a globally shared variable. The program may terminate when the value of this vari- able reaches N. Accessing the shared variable is efficiently handled by the scheme presented earlier in this section. The effects of random probing on block transfers were tested experimentally on a BBN multiprocessor system. A pair of processors, P,- and Pj were designated as source and destination nodes respectively. During execution, processor Pj transfers a block of tuples from P,- and processes the block locally. This transfer process from P,- to PJ- is repeated 1000*times while all other nodes in the system are engaged in random probing. A probing node would repeatedly execute a number of local operations and then randomly probe a remote node. The number of local operations, Y determines the frequency of random probing. The effects of varying the value of Y are shown in ‘It is important to have a large enough number of transfer to allow all processors enough time to probe and ensure the measurements’ accuracy. 68 Figure 4.3 and discussed next. :0 0' II N ~11 0.8 I I I 1‘ _ -: --4 .1 0.75 Tight Loop -A— — Y = 2 .0.. 0.7 — Y=5*' - 0.65 0.6 (DE—nee —mnOi—] 05 l l 1 l 1 1 1 l 10 20 30 40 50 60 70 80 Number of nodes Figure 4.3. Effect of random probing on network traffic. The results of the experiments are shown in Figure 4.3. The figure illustrates the effects of random probing on the general network traffic and, in particular, on the block transfer operations. Note the effects of the tight loop, i.e. no local operations, where an increase in transfer time can be seen. This is because the processors are constantly generating remote accesses as fast as they could. Despite being very small, single data—item accesses, the effects of this very high frequency of requests on the transfer cost are evident with increasing number of nodes in the system. In our load balancing algorithm, a helping node repeatedly executes a number of local operations (such as checking the local Done flag, a loop index or various local variables) and then probes for an overloaded node. The number of local operations executed, Y is equal to five. The figure shows that for this case, the effect of random probing on the 69 cost of transfers is minimal even for large values of N (up to 80 nodes). 4.3 Analytical Model for Load Balancing based on Random Probing In this section we develop the analytical model for the load balancing scheme. The cost of processing all the tuples, without load balancing, can be computed as the total time taken by the processor that finishes last, i.e. 
However, with the load balancing scheme, it is not the node with the maximum number of tuples that finishes last but rather the helping node that gets the last block of tuples. In the rest of this chapter, this node will be referred to as the last helping node. The cost formulas for the total computation time derived in this section correspond to this node.

To obtain the total time for load balancing, we add up the processing costs of each block processed by the last helping node. Thus, if T(i) is the time to process the i-th block by the last helping node, then the total time for load balancing is

T = \sum_{i=1}^{L} T(i)

where L is the total number of blocks processed by the last helping node. Let n_0(i) and n_1(i) be, respectively, the average number of helping nodes and overloaded nodes during the time interval needed to process the i-th block. The cost T(i) can be computed as the sum of the following components:

• For a selected overloaded node N_k, lock L_k, determine the block size, update I_k and then unlock L_k. The contention for the locks is modeled using the same formulation as in Section 3.4. Hence, the cost is T_lock × ξ(M, N) + T_unlock. The costs of locking and unlocking are system dependent, and the costs of determining the block size and updating I_k may be ignored (as these are local operations executed only once for each block of tuples).

• Transfer a block to be processed locally, only if this is a helping node. With n_0(i) helping nodes scanning among all N nodes in the system, the probability of transferring a block is n_0(i)/N. Thus, the total cost of this step is (n_0(i)/N) × T_bt(S_b, N), where T_bt(S_b, N) is the cost of transferring a block of size S_b when N nodes are active. This parameter is system dependent.

• Process the data block locally. The cost for this is T_b = S_b × T_p, where T_p is the time to process a single data item. The parameter T_p is application dependent; for the join algorithm, it corresponds to the cost of organizing the tuples of R into the local buckets, or the cost of joining the tuples of S.

While random probing eliminates the chain effect, conflicts may still occur when more than one node randomly selects and accesses the same overloaded node. Using the same conflict formulation described in Section 3.4, the number of nodes involved in these accesses is ξ(M, N), where M is the number of data blocks. It is important to note that remote accesses through the network are only performed by the helping nodes. Hence, a node being accessed, i.e., an overloaded node, can only be processing locally and will not conflict with incoming requests.

During any time step, the number of concurrent transfers through the network is necessarily smaller than the minimum of n_0(i) and n_1(i). This is because only helping nodes perform transfers and these transfers are synchronized at the overloaded nodes. When the helping and overloaded nodes are almost equal in number, it is very likely that only a small fraction of the helping nodes will be attempting transfers through the network. The main reason is that the helping nodes are probing randomly for overloaded nodes and the probability of each helping node locating a distinct overloaded node is very small. We showed earlier that random probing by the helping nodes does not cause network conflicts.
It is also clear from the above discussion that the number of concurrent transfers is almost always small compared to the network bandwidth (which increases with the number of nodes). Hence, the major conflicts that occur are those caused by an overloaded node being selected by more than one helping node. These conflicts are modeled using the ξ formulation as mentioned above. Therefore, the cost of reserving and processing a block of data is

T(i) = T_b + (n_0(i)/N) × T_bt(S_b, N) + T_lock × ξ(M, N) + T_unlock.

Initially, the system starts with all nodes processing their local tuples and thus n_0(0) = 0 and n_1(0) = N. Computing the value of L in the cost formula for T is not necessary; we evaluate T iteratively as follows. At each step i, the value of i is compared with the initial number of blocks in each node; once i exceeds a node's initial number of blocks, that node has finished its own data and becomes a helping node. Hence, n_0(i) is incremented and n_1(i) is decremented accordingly. When n_0(i) reaches N, this signals the end of processing. Thus, the summation in the cost formula for T is evaluated iteratively until n_0(i) is equal to N.
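The following C sketch illustrates this iterative evaluation of T. The contention factor xi() and the block transfer cost t_bt() stand in for the system-dependent quantities ξ and T_bt of Section 3.4, and the bookkeeping of helping nodes follows the description above; all names here are assumed placeholders rather than the model's actual code.

    /* Sketch: iterative evaluation of the total load balancing time T. */
    #define N_NODES 25

    extern double xi(double m, int n);   /* conflict factor of Section 3.4 (assumed) */
    extern double t_bt(int sb, int n);   /* cost of transferring one block of S_b tuples */

    double total_balancing_time(const int blocks[N_NODES], int sb, double t_p,
                                double t_lock, double t_unlock)
    {
        double t_b = sb * t_p;           /* local processing cost of one block, T_b */
        double M = 0.0, T = 0.0;
        for (int k = 0; k < N_NODES; k++)
            M += blocks[k];              /* total number of data blocks */

        int n0 = 0;                      /* helping nodes, n_0(0) = 0 */
        for (int i = 1; n0 < N_NODES; i++) {
            n0 = 0;                      /* nodes whose own blocks are exhausted now help */
            for (int k = 0; k < N_NODES; k++)
                if (blocks[k] < i)
                    n0++;
            T += t_b
               + ((double)n0 / N_NODES) * t_bt(sb, N_NODES)
               + t_lock * xi(M, N_NODES) + t_unlock;
        }
        return T;                        /* summation stops once n_0(i) = N */
    }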
Next, we present and discuss the analytical and experimental results.

4.4 Performance Evaluation

In this section we compare the performance of the original Distributed Hash Join (DHJ) of Chapter 3 with that of DHJ with load balancing (DHJLB). The section also compares the results of the analytical model with those of actual experiments. For each plotted value representing experimental results in the figures, a number of runs were made with the same parameter settings. The time for each run is measured as the maximum time over all the nodes in the system. The results from these runs were then averaged, disregarding the first three runs to eliminate the overhead of system startup.

The experiments were run on a BBN TC2000 with up to 25 nodes available for use. The analytical model, however, preserves the same trends shown in this section for an even larger number of nodes. As mentioned in the previous chapter, the total size of available main memory is proportional to the number of nodes, and so are the sizes of the input relations. For the performance figures in the rest of this section, N_R and N_S are set to 500 and 2500 respectively unless specified otherwise. In order to analyze the performance of the load balancing scheme, the input data was skewed using the parameterized Zipfian distribution introduced in Section 3.6.1. To compare the performance of the analytical model to that of the experiments, the values of the system-dependent parameters of Section 4.3 were taken from Table 3.1 of Chapter 3.

Figure 4.4 shows the total processing times for both the DHJ and DHJLB algorithms. By transferring blocks of tuples from overloaded nodes and processing them locally, the helping nodes improve the performance of DHJLB over that of DHJ as shown in the figure. As the number of nodes increases, the total number of tuples in each relation also increases and so does the skew rate. Consequently, DHJLB has to deal with more load imbalance and thus the balancing cost increases. Note, however, that DHJLB maintains a near-constant improvement rate over DHJ as we increase the number of nodes in the system. For example, with a skew rate of θ = 0.8 and for the whole range of nodes, the performance of DHJ is about 50% worse than that of DHJLB.

Figure 4.4. Total time for the join algorithm with and without load balancing (DHJ and DHJLB curves for skew rates 0.5 and 0.8; N_R = N_S = 1000; 5 to 25 nodes).

4.4.1 Results of Varying Data Skew

Figure 4.5 shows the performance with no skew in the building relation, R, while varying the skew in the probing relation, S. In the skew range 0-0.2, all nodes finish processing their tuples at almost the same time and thus only minimal or no helping is done. The figure shows that, in this range, both algorithms exhibit almost identical performance. This supports our claim that, in the case of no or low skew, DHJLB always performs as well as the original DHJ algorithm and thus outperforms most load balancing schemes that rely on collected statistics. The reason is that the balancing scheme does not incur any overhead, since it does not require any prescanning or presampling of the input relations. At high skew rates, a few buckets within the nodes will have a large number of tuples. While the load balancing scheme is able to help with the other buckets, processing these large buckets takes longer. The figure shows that the balancing scheme is able to maintain good performance with increasing skew rates.

Figure 4.5. Total time with skew in the probing relation S (total time vs. skew rate θ_S from 0 to 0.8).

The effect of varying the skew rate in the building relation is presented in Figure 4.6. The figure shows the load balancing scheme to be effective in dealing with the problem of skew, even for large skew rates. The figure also shows the size, in number of tuples, of the largest bucket, i.e., the bucket that corresponds to the attribute value with the highest skew in Figure 3.11. This size increases sharply with increasing skew rate, and at high skew, e.g. θ = 0.8, the size of the corresponding bucket may become even larger than N_R. Note that the effect of skewing the building relation on the performance of the join algorithm is more profound than that of skewing the probing relation. This is because tuples in the building relation have to be distributed among the local buckets and then hashed and organized within each bucket. When the skew rate of R is high, these steps take even longer to execute.

Figure 4.6. Total time with skew in the building relation R (DHJ and DHJLB vs. skew rate θ_R from 0 to 0.8; N_R = N_S = 1000, N = 20; largest bucket sizes annotated).

Figure 4.7 presents the effect of varying the skew in both relations. From the figure, it is clear that computing the join with this kind of skew becomes very expensive. The main reason is that when both relations are skewed on the same attribute, the result, which is quadratic in nature, becomes excessively large at higher skew rates. Consider, for example, the case where θ = 0.8. We know that the sizes of the largest buckets are approximately 1000 tuples each (see Figure 4.6). The join of only these two buckets will generate close to 1,000,000 tuples. Such joins take an excessive amount of time to compute and, as noted in [21], are not practical. The figure shows that the load balancing scheme improves the performance of the join for these cases of double skew, even for high skew rates (θ = 0.7). Results for skew rates higher than 0.7 are not reported, since the amount of space needed for the results, as well as the computation time, was excessive.

Figure 4.7. Total time with skew in both relations (DHJ and DHJLB vs. skew rate (θ_R, θ_S) from 0 to 0.7; N_R = N_S = 1000, N = 20; largest bucket sizes annotated).
4.4.2 Analytical vs. Experimental Models

The performance of the analytical model is compared with the experimental results in Figure 4.8. Since the total number of tuples is N × (N_R + N_S), when the number of nodes increases, the number of tuples to be processed increases and the data becomes more skewed. The figure shows this trend as the number of nodes is increased up to 25. It should be noted from the figure that the analytical results closely approximate those of the experiments. The maximum difference between the model and the experimental results is 9%. This slight difference can be explained by the same reasoning used to explain the differences in Figure 3.7.

Figure 4.8. Comparison of analytical and experimental results (total time vs. number of nodes, 5 to 25, for θ = 0.8 and N_R = N_S = 1000).

Before transferring a block of tuples for local processing, a helping node has to lock L_k, update I_k (as defined in Section 4.3) and then unlock L_k. Given a fixed number of tuples (during a specified run of the algorithm), a small block size means a larger number of blocks and consequently more frequent lock accesses. As mentioned earlier, lock operations are expensive on NUMA systems and a large number of simultaneous lock operations can degrade performance severely. The effects of different block sizes are shown in Figure 4.9, where it can be seen that for small block sizes the increased lock conflicts degrade the performance severely. The figure also shows that the performance degradation for small block sizes is more profound for larger values of θ, i.e., higher skew rates. This is because higher skew rates mean that more blocks are transferred by the helping nodes during load balancing and thus more lock conflicts occur. Note that, as expected, varying the block size does not have much effect on the uniform distribution case (θ = 0), since the number of blocks to move in this case is either zero or insignificantly small.

Figure 4.9. Effect of block size S_b on load balancing (total time vs. block size, 50 to 300 tuples, for θ = 0.8, 0.6 and 0; N_R = 500, N_S = 2500).

CHAPTER 5

THE NUMA WITH CLUSTERS OF PROCESSORS ARCHITECTURE

5.1 Introduction

Most of the proposed parallel database systems have been designed following either the Shared-Nothing or the Shared-Everything architecture. While SN systems are supposed to scale up to hundreds of nodes [70] in the absence of data skew, their scalability becomes limited if moderate or severe data skew is present. Shared-Everything systems are able to deal efficiently with data skew but have been mostly limited to a small number of nodes.*

*Only up to 30 nodes in most systems.

As mentioned in Chapter 2, Hua and Lee [18] have proposed combining the two architectures into a hybrid system where clusters of SE systems are interconnected to form a SN system.
According to their analysis, the network does not have a significant impact on the overall system performance. They show that I/O is the major factor and, based on that, compute an optimal cluster size. However, their join algorithm for the hybrid system incurs extra I/O steps for writing the tuples to disk during the distribution phase and reading and writing them again during the first part of the join phase. By constructing the buckets as soon as the tuples are received, these extra I/O costs can be eliminated. In addition, load balancing across clusters was not considered and perfect balancing within each cluster was assumed.

Load balancing in SN systems can cause high network contention and may consequently lead to performance degradation [91]. Parallel database systems with a large number of nodes and the ability to efficiently handle skewed data are necessary. Current NUMA-based multiprocessor systems can combine the high scalability of the Shared-Nothing architecture with the low-cost load balancing of Shared-Everything systems. These multiprocessors are able to provide very large I/O bandwidth by connecting more than one disk to a single processor or to a cluster of processors. With such features, these systems seem very desirable for parallel processing of database systems.

In this chapter, we investigate a NUMA-based multiprocessor system where each node is a cluster containing a number of processors. We term this architecture NUMA with Clusters Of Processors, or NUCOP. We present a parallel hash join algorithm for the NUCOP architecture that provides load balancing between all the nodes in the system. In previous work on SN systems [66, 73], the I/O was considered the main bottleneck in the system and the network was assumed to have infinite bandwidth. We demonstrate a larger cluster size than that determined by Hua and Lee in [18] for the same set of system parameter values. Our results show that the network bandwidth is an important parameter that affects the system performance, especially when the number of nodes or clusters is large. By fixing the total number of nodes in the system and varying the sizes of the clusters, we show that NUCOP systems perform better than SN systems and can support a larger number of nodes than SE systems.

We develop an analytical model for the algorithm and validate it using both simulation and actual experimentation. The simulator developed for the NUCOP architecture has a modular design that allows it to model a variety of systems. This can be accomplished by replacing a specific component, e.g. the global network, with a new one that may possess different properties such as topology and speed. Our results show that NUCOP systems are able to deal efficiently with skewed data while providing enough resources to process very large relations. An illustrative example of this is given in the last section of this chapter.

Figure 5.1. Architecture of NUCOP (clusters of processors, each with its own shared memory and disks and a local interconnection network, connected by a global interconnection network).

5.1.1 The NUCOP Architecture

This section presents our model of the NUCOP architecture. The system is shown in Figure 5.1. NUCOP consists of clusters of processors connected by an interconnection network. Each cluster contains a number of processors, a physically or logically shared main memory and a number of disks.
All processors within the same cluster share the memory and disks. The network provides a globally shared memory by allowing processors in any cluster to access the memories of all other clusters. Clusters in NUCOP maintain a higher degree of locality than that provided by shared-nothing systems. This is because many of the remote references may be satisfied within the cluster itself, resulting in a lighter overall network load. This is essential for maintaining good performance with an increasing number of nodes.

With a high degree of data skew in Hua and Lee's hybrid architecture [18], many clusters can be idle while the overloaded cluster is still processing its data. Since the clusters are connected to form a shared-nothing system, load balancing across clusters will limit the total number of clusters [91]. The main difference between our model and the hybrid architecture is that we use a shared-memory NUMA multiprocessor system. The globally shared memory allows for more efficient load balancing between all nodes in the system [73]. Another key difference is that the NUCOP system does not require the individual clusters to be UMA-based multiprocessors. This allows for a large number of nodes within each cluster.

In the NUCOP architecture, we assume the system to have parallel CPU, network and I/O capabilities. That is to say, the only CPU cost for executing a network or I/O request is the cost of initiating such a request. Commercial multiprocessors already provide various forms of these parallel capabilities. The KSR1 multiprocessor [84], for example, has separate I/O co-processors and provides parallel I/O operations through asynchronous system calls. Parallel data transfers over the interconnection network are also provided through the pre-fetch and post-store operations [84, 114]. These operations will be described in more detail later in this chapter. We next describe the join algorithm designed for this architecture and outline some of the key issues in its design.

5.1.2 The Proposed Join Algorithm

In order to describe the join algorithm, let m be the number of clusters in the system and n be the number of nodes per cluster. We assume that the attributes used to make the initial partitioning of both relations R and S are different from the join attributes. This means that both relations have to be distributed before the join can be performed. The join algorithm for the NUCOP architecture may now be described, at a high level, by the following two phases (a code sketch follows the description):

• Distribution Phase: Each node N_i reads and scans the local part of R. Each tuple is hashed and a destination node N_j is computed. Tuples are sent to their destination nodes in blocks of size S_b. When a node receives a tuple, it hashes it using a second hashing function to determine the local bucket for that tuple. Tuples that fall into bucket zero are organized into an in-memory hash table that will be joined with bucket zero of S. All other buckets are written to disk.

• Join Phase: When all nodes are done with the distribution of R, the processing of S starts. The process is similar to the distribution phase for R, except that when a tuple hashes into bucket zero of S, it is immediately joined with the matching tuples in bucket zero of R. All other buckets are written to disk. Once processing of bucket zero is done, processing of the other buckets starts. Each node will read the corresponding pair of R and S buckets from disk and perform their join.
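The following C-style sketch summarizes the two phases from a single node's point of view. It is a simplified illustration under assumptions: the Tuple type and all helper routines (hash_dest, hash_bucket, send_to_node, insert_bucket_zero and the rest) are hypothetical, and the sketch omits the blockwise sending, synchronization and load balancing described in the text.

    /* Two-phase NUCOP hash join, one node's view (illustrative sketch in C). */
    typedef struct { long key; char payload[120]; } Tuple;

    extern int   hash_dest(long key);             /* first hash: destination node     */
    extern int   hash_bucket(long key);           /* second hash: local bucket 0..B-1 */
    extern void  send_to_node(int node, const Tuple *t);
    extern void  insert_bucket_zero(const Tuple *t);    /* in-memory hash table of R  */
    extern void  write_to_disk_bucket(int b, const Tuple *t);
    extern void  join_disk_buckets(int b);        /* read R_b and S_b, join them      */
    extern Tuple *next_local_tuple(int relation); /* scan local partition; NULL at end */

    #define REL_R 0
    #define REL_S 1
    #define B     25                              /* number of buckets (Appendix A)   */

    void distribute_R(void)
    {
        Tuple *t;
        while ((t = next_local_tuple(REL_R)) != NULL)
            send_to_node(hash_dest(t->key), t);   /* sent in blocks of S_b in practice */
    }

    void receive_R_tuple(const Tuple *t)
    {
        int b = hash_bucket(t->key);
        if (b == 0) insert_bucket_zero(t);        /* stays in memory                   */
        else        write_to_disk_bucket(b, t);
    }

    void join_phase(void)
    {
        Tuple *t;
        while ((t = next_local_tuple(REL_S)) != NULL)
            send_to_node(hash_dest(t->key), t);
        /* Received S tuples that hash to bucket zero are joined immediately with the
           in-memory R table; all other buckets go to disk and are joined pairwise. */
        for (int b = 1; b < B; b++)
            join_disk_buckets(b);
    }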
We use the load balancing scheme of Chapter 4 to extend this hash join algorithm to deal with data skew. This extension works as follows:

• When a node is done performing the join on all its buckets, it checks for overloaded nodes within its cluster to help balance the load. Load balancing within each cluster can be done efficiently since the clusters provide shared access to the memory. This balancing phase is restricted to the current cluster and thus does not generate any traffic on the global network connecting the various clusters.

• Once all tuples within a cluster have been processed, the nodes of this cluster probe other clusters and help balance the load across clusters. The selection of an overloaded node uses random probing as in Chapter 4. Probing remote nodes directly, as made possible by the shared memory, allows efficient load balancing between the various nodes in the system [73].

5.2 Analytical Model

In this section, we develop the analytical model for the hash-based join algorithm presented for the NUCOP architecture. In order to model the data skew, we assume that the distribution of S will result in a skew on the receiving nodes. The amount of skew, θ, is defined as the fraction of the total relation size that is assigned to a single node. Based on published measures of partition skew, Walton et al. [16, 72] state that typical values of θ range between 0.1 and 0.3. The rest of the relation is divided evenly among all the other nodes in the system. According to DeWitt et al. [21], this skew distribution captures the spirit of the Zipfian distribution (see Section 3.6.1) while being much simpler to manipulate analytically.

To compute the execution time of the algorithm, we assume that there is no overlap between the two phases of the join. This is necessary to ensure that all tuples of R that hashed to a certain node have actually been received by that node before proceeding with the join phase. Hence, C_total = C_1 + C_2, where C_1 and C_2 are the costs of the first and second phases respectively. The synchronization between the two phases of the algorithm is similar to that described in Section 3.3.1. Since the NUCOP system provides parallel CPU, network and I/O capabilities, I/O and network data transfers may be overlapped with CPU processing. Hence, the total cost for each of the phases may be computed as the maximum of its components:

C_1 = MAX(C_CPU_1, C_IO_1, C_Net_1) and C_2 = MAX(C_CPU_2, C_IO_2, C_Net_2),

where C_CPU_i, C_IO_i and C_Net_i are, respectively, the CPU, I/O and network costs for phase i. Note that we do not include a separate cost function for memory accesses as in [18]. Memory access is not usually executed in parallel with other computations. Hence, we feel the right approach is to include the costs of memory accesses in the CPU costs, to account for the various read/write accesses to memory.
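As an illustration of this skew model and cost structure, the small calculation below simply plugs the default parameter values of Appendix A (N_S = 8000 tuples per node, θ = 0.15) into the definitions above for a 128-node system; the numbers are not taken from the original experiments.

\[
T_S = N_S \times N = 8000 \times 128 = 1{,}024{,}000 \text{ tuples}
\]
\[
\text{skewed node: } \theta\, T_S = 0.15 \times 1{,}024{,}000 = 153{,}600 \text{ tuples},
\qquad
\text{every other node: } \frac{(1-\theta)\, T_S}{N-1} = \frac{870{,}400}{127} \approx 6{,}854 \text{ tuples}
\]
\[
C_{total} = C_1 + C_2, \qquad C_i = \max\big(C_{CPU_i},\, C_{IO_i},\, C_{Net_i}\big)
\]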
The performance evaluation of the dual-paradigm join algorithm in [73] did not consider the network cost, but rather assumed the network bandwidth to always be large enough, based on previous experimentation. The network, however, is a resource shared by all clusters in the system and can thus form a performance bottleneck. We do consider the network cost in the analytical model, and our results show that the network bandwidth has a serious impact on the performance of the system.

During the processing of the join algorithm, conflicts may occur on the network as a number of clusters attempt to access other clusters. Conflicts can also occur within a cluster, as the various nodes within that cluster access each other or compete for I/O. These conflicts are modeled using the same ξ formulation as in Section 3.4 with the appropriate parameters. A brief explanation of the various system and algorithm parameters used in the model, along with their default values, is presented in Appendix A. Next, we derive the costs for both join phases by formulating the individual costs mentioned above.

In the distribution phase:

• I/O: R_i is initially read from disk; then all tuples received from other nodes are written to disk except those in bucket zero. Thus, the number of tuples involved is N_IO_1 = R_i + R_i × (B − 1)/B, where B is the number of buckets. The cost is

C_IO_1 = T_IO × N_IO_1 × ξ((R_i × n)/B_S, n).

• Network: The tuples of R that hash to remote clusters are sent over the network. Assuming a uniform hashing function, each cluster will send N_Net_1 = ((m − 1)/m) × R_i tuples to remote destinations. The total number of tuples to be sent over the network is then N_Net_1 × m. Hence,

C_Net_1 = T_net × m × N_Net_1 × ξ((m × N_Net_1)/B_S, m).

• CPU: Tuples of R are hashed to determine their destination nodes. Incoming tuples are also hashed into buckets, and those that hash to bucket zero are further hashed and organized into an in-memory hash table. The cost of this is T_CPU × I_hash × R_i × (2 + 1/B). Each cluster will also write (N_IO_1 − R_i) tuples to disk and send N_Net_1 tuples over the network. Thus, the CPU cost is

C_CPU_1 = T_CPU × I_hash × R_i × (2 + 1/B) + ((N_IO_1 − R_i) × I_IO)/(n × W_CPU × B_S) + (N_Net_1 × I_net)/(n × W_CPU × B_S).

For the join phase, all cost computations that involve S use K_s instead of S_i, where K_s represents the number of tuples received by the overloaded cluster. The costs are as follows:

• I/O: Read S_i from disk and store all received S tuples except those that fall into bucket zero. Join bucket zero of S with that of R and store the result to disk. Next, read each pair of R and S buckets, join them and store the results:

C_IO_2 = T_IO × [S_i + ((B − 1)/B) × K_s + (K_s + R_i)] × ξ((S_i × n)/B_S, n) + T_IO × [R_i × (S_i + E_s/m) × J_s] × ξ((S_i × n)/B_S, n).

• Network: Each cluster will initially send ((m − 1)/m) × S_i tuples. During load balancing, the network will carry at most ((m − 1)/m) × E_s tuples to be processed in the non-overloaded (helping) clusters:

C_Net_2 = T_net × ((m − 1)/m) × [S_i + (E_s + R_i)] × ξ(((m − 1) × (S_i + E_s))/(m × B_S), m).

• CPU: Read S_i from disk and hash each tuple to decide whether to keep it locally or send it to a remote node. Tuples that stay locally have to be hashed again and organized into buckets. Receive the incoming K_s tuples, hash them into buckets and, except for bucket zero, write all buckets to disk. Join the tuples that hash to bucket zero of S with those in bucket zero of R and write the result to disk. Read each pair of buckets, join them and write the result to disk. During load balancing, transfer a pair of buckets, join them locally and write the result to disk:

C_CPU_2 = T_CPU × ([S_i + K_s + K_s/B] × I_hash + [(R_i/B) × (K_s/B)] × J_s × I_join) + (((B − 1)/B) × S_i × I_IO)/(n × W_CPU × B_S) + (((m − 1)/m) × S_i × I_net)/(n × W_CPU × B_S) + ([((B − 1)/B) × (n × S_i + R_i + E_s/m) + (S_i × R_i × J_s)/m] × I_IO)/(n × W_CPU × B_S).
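To show how these per-phase cost functions combine, the small routine below evaluates C_1, C_2 and C_total from already computed component costs. It is only a sketch of the MAX/sum structure above; the component functions (c_cpu1(), c_io1() and so on) and the Params structure are assumed placeholders, not the model's actual code.

    /* Sketch: combining the per-phase component costs of the NUCOP model. */
    typedef struct {
        int    n, m, B, Bs;          /* nodes per cluster, clusters, buckets, block size */
        double theta, Js;            /* skew rate and join selectivity                   */
        double t_io, t_net, t_cpu;   /* per-tuple / per-instruction costs                */
    } Params;

    extern double c_cpu1(const Params *p), c_io1(const Params *p), c_net1(const Params *p);
    extern double c_cpu2(const Params *p), c_io2(const Params *p), c_net2(const Params *p);

    static double max3(double a, double b, double c)
    {
        double m = (a > b) ? a : b;
        return (m > c) ? m : c;
    }

    double total_cost(const Params *p)
    {
        /* CPU, I/O and network overlap within a phase -> take the maximum;
           no overlap between the two phases -> add them. */
        double c1 = max3(c_cpu1(p), c_io1(p), c_net1(p));
        double c2 = max3(c_cpu2(p), c_io2(p), c_net2(p));
        return c1 + c2;
    }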
5.3 Model Validation by Simulation

In order to verify and validate the analytical model, we developed a simulator for the NUCOP architecture. The simulator is written in CSIM, which is a process-oriented simulation language [115]. Figure 5.2 presents a high-level diagram of the simulator.

The main components of the system are defined as separate objects. At the lower level are the processor objects, the I/O objects and the local network (lNet) objects. The cluster object includes one lNet object and a number of I/O and processor objects. The lNet object serves to connect the various processor and I/O objects within a cluster. Another main component of the simulator is the global interconnection network (gNet) object. The gNet object is used to connect the various cluster objects in the system.

Figure 5.2. Diagram of the NUCOP simulator (cluster objects, each containing an lNet, processor and I/O objects, connected through the gNet object).

The modular design of the simulator allows it to model different system configurations with various types of network, I/O and CPU components. By defining a new object with different characteristics, such as configuration and speed, the simulator should be capable of modeling completely new systems. The components currently available with the simulator model only the KSR architecture. For example, the lNet object models a uni-directional slotted ring bus. It should not be hard to add new objects designed for different network topologies and have the simulator correctly model the new systems.

The relations R and S in the simulator are assumed to be initially partitioned across all nodes according to a uniform distribution. In order to model data skew, the relations are drawn from a distribution approximating the Zipfian, as discussed earlier in Section 5.2. All relations come from our synthetic database, which is designed according to the Wisconsin Benchmark guidelines [107] (see Section 3.5.1).

In the simulator, tuples are represented by a pair of numbers: the tuple's key attribute and the size of the tuple in bytes. After hashing a tuple and deciding that it should be sent to another node, a counter representing the number of tuples that have so far hashed to that node is incremented. This counter is compared with the value of S_b, the maximum number of tuples per block. If the counter exceeds the value of S_b, then the block of tuples is sent to the corresponding node and the counter is reset to zero. Otherwise, processing of the next tuple starts.

The actual sending of a block of tuples is simulated as follows. The block of tuples is divided into a number of packets (this number depends on the size of a packet, the size of a tuple and the value of S_b). The CPU object initiates a network request to attempt the delivery of these packets to the lNet object. The cost of initiating a network request is added to the CPU time of the current node for each packet that the CPU delivers to the lNet. Next, the lNet object processes the received packets and determines their destination nodes. The cost for the lNet is updated accordingly. If the destination node belongs to another cluster, the lNet delivers the packets to the gNet object. Each packet is then routed by the gNet to the correct cluster, where it is delivered to the corresponding lNet. This lNet finally routes the packet to its destination node.

5.3.1 Simulator Input

The simulator accepts command-line parameters but can also accept values included in an input file that is specified on the command line. The input syntax of this file is a single parameter per line: the parameter name, as defined in the glossary (Appendix B), followed by the desired value. This format is shown via an example in Figure 5.3.

    ; Comment lines start with the character ';'
    cpu 25
    io 40
    gnet 150
    n 24
    m 4
    nr 8000
    ns 8000

Figure 5.3. An example of an input file for the NUCOP simulator.
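A minimal reader for this input format might look like the following C sketch. The parameter names, the Settings structure and the handling of defaults are illustrative assumptions based on the example above and on Appendix A/B, not the simulator's actual parsing code.

    /* Sketch: reading "name value" pairs from a NUCOP simulator input file. */
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        double cpu, io, gnet;    /* CPU speed, I/O and global network bandwidths */
        long   n, m, nr, ns;     /* nodes per cluster, clusters, tuples per node */
    } Settings;

    /* Built-in defaults (Appendix A/B); values in the input file override them. */
    static const Settings defaults = { 25, 40, 100, 32, 4, 8000, 8000 };

    int read_input_file(const char *path, Settings *s)
    {
        char line[256], name[64];
        double val;
        FILE *f = fopen(path, "r");
        if (f == NULL) return -1;

        *s = defaults;
        while (fgets(line, sizeof line, f) != NULL) {
            if (line[0] == ';' || line[0] == '\n') continue;   /* comment or blank */
            if (sscanf(line, "%63s %lf", name, &val) != 2) continue;
            if      (strcmp(name, "cpu")  == 0) s->cpu  = val;
            else if (strcmp(name, "io")   == 0) s->io   = val;
            else if (strcmp(name, "gnet") == 0) s->gnet = val;
            else if (strcmp(name, "n")    == 0) s->n    = (long)val;
            else if (strcmp(name, "m")    == 0) s->m    = (long)val;
            else if (strcmp(name, "nr")   == 0) s->nr   = (long)val;
            else if (strcmp(name, "ns")   == 0) s->ns   = (long)val;
        }
        fclose(f);
        return 0;
    }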
System parameters that are not specified either in the input file or on the command line default to the values given in Appendix B. These default values are built into the simulator by defining them in the program's header files. Since the input file is read last, parameter values that are specified in the file override any other values built into the program itself or supplied on the command line.

5.3.2 Simulator Output

The main outputs of the simulator are the timings for the various components, e.g., the total I/O, CPU and network time for each node. Optional debugging and tracing information for many of the CSIM objects is provided by invoking the program with the -1 option. The user is warned, however, that this can and actually does generate excessively large output files. For example, during some of the debugging sessions of this simulator, output files of sizes up to 20 or 30 MB were not uncommon.

In order to compute the total timings for the join, the values were extracted from the columns corresponding to the CPU, I/O and network costs. The maximum value was then obtained for each stage of the join algorithm and the total time was computed as the sum of these values. Appendix B presents more details concerning the implementation and current limitations of the NUCOP simulator. The appendix also contains a glossary of the various parameters used in the simulator along with their default values. Performance figures obtained using the simulator are compared with those of the KSR1 experiments and the analytical model in Sections 5.4.2 and 5.5.

5.4 Model Validation on the KSR1

The KSR1 multiprocessor from Kendall Square Research is used for the purpose of testing and validating our analytical model. The KSR1 is a shared-memory multiprocessor that can have up to hundreds of nodes.* Next we briefly describe the architecture of the KSR1, discuss the experimental results and compare them to the results obtained from the analytical and simulation models.

*The current commercial version has 128 nodes but the architecture supports up to 1088 nodes.

5.4.1 KSR1 Architecture

The KSR1 is the first multiprocessor in a family of shared-memory systems designed by Kendall Square Research. It has up to 128 nodes, each with 32 MBytes of memory (which is actually a cache memory) and a 0.5 MByte sub-cache.† Up to 32 nodes are connected to slotted, pipelined rings called Ring:0. Larger systems may be obtained by connecting a number of rings of type Ring:0 to a larger ring called Ring:1. The current system‡ consists of a Ring:1 connecting 4 rings of type Ring:0 and thus has a total of:

• 128 nodes, 32 in each ring of type Ring:0,
• 4 Gigabytes of cache, 32 MBytes local to each node, and
• 50 Gigabytes of disk space.

†The sub-cache is divided equally into instruction and data sub-caches of 256 KBytes each.
‡The KSR1 we used for the experiments resided at the Cornell Theory Center.

The amount of disk space we were allowed to use was much more modest. That was the main reason behind running the experiments with only 8,000 tuples per node for each relation.

What is novel about the KSR1 is the notion of ALLCACHE memory. The main memory local to each node is treated as a large local cache and is considered part of the single cached address space available in the system. Data items referenced by a cell migrate automatically to its local cache. Thus, when a block of memory is accessed sequentially, only the first reference is considered remote while subsequent references are local.
As support for the parallel I/O and network capabilities in our model, the KSR1 provides asynchronous read/write operations to disk and post-store and pre-fetch operations over the network. Up to 3 concurrent network operations may be in progress for each node. During these operations, the processor at the requesting node is not stalled and can continue execution. Hence, the CPU cost for such operations is only the actual time needed to initiate them.

In order to make the experimental runs with up to 128 nodes on the KSR1 and have full, uninterrupted access to the whole system, the machine had to be set to single-user mode (where only one user is allowed to log in during a certain period of time). The KSR1 at the University of Michigan runs the NQS queueing system. Users submit their jobs to NQS and expect the results the next day. The system is reserved overnight for batch operations and the jobs are executed in single-user mode. However, this system is limited to only 64 nodes, and that is why most of the initial small runs were executed there. The KSR1 at Cornell also ran NQS, but its largest group of nodes was likewise 64. The full 128-node system was only available in single-user mode in two-hour slots every Thursday morning, and only by reservation. The staff at the Theory Center generously allocated us five of these slots, over a period of two months, during which all the reported runs were made.

Next, we briefly discuss the implementation and compare the results of the experiments to those of the analytical and simulation models.

5.4.2 Comparison of Results

The join algorithm presented in Section 5.1.2 was implemented on the KSR1 multiprocessor system. The results of the analytical model presented in this section were obtained by using parameter values suitable for the KSR1 architecture. These parameter values were measured individually using a number of facilities on the KSR1, including the pmon monitoring tool, to obtain information about the program at the hardware level. These values were also corroborated by other researchers who were using the KSR1 [116] and in a number of published works [117, 114]. We first list these parameters, discuss the experimental runs and then compare the performance results. The main system parameters for the KSR1 multiprocessor are:

• I/O bandwidth = 40 MBytes/sec per cluster,
• Network bandwidth = 150 MBytes/sec,
• CPU speed = 25 MIPS.

To run the experiments on the KSR1, the relations were partitioned across all disks in the system and were generated to cause a distribution skew according to the value of the specified parameter, θ. Performance results were obtained while increasing the number of nodes per Ring:0 from 4 to 32, for a total of 16 to 128 nodes.
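Before turning to the measured results, it may help to translate these bandwidth figures into the per-tuple costs used by the model. This small calculation is illustrative only; it applies the definitions T_IO = t/W_IO and T_net = t/W_Net from Appendix A with the 128-byte tuple size:

\[
T_{IO} = \frac{t}{W_{IO}} = \frac{128 \text{ bytes}}{40 \text{ MB/sec}} \approx 3.2\ \mu s \text{ per tuple},
\qquad
T_{net} = \frac{t}{W_{Net}} = \frac{128 \text{ bytes}}{150 \text{ MB/sec}} \approx 0.85\ \mu s \text{ per tuple}.
\]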
As presented in most figures, the total size of the data (the number of tuples in each relation) increases linearly with the number of nodes. Each plotted data point in Figure 5.4 represents the total execution time averaged over 3 runs.

Figure 5.4 compares the analytical, simulation and experimental results for N_R = N_S = 8K tuples per node. When the number of nodes per cluster increases, there are more I/O requests and transfers and the I/O system starts to saturate. The same is true for the global network connecting the various clusters and the local network connecting the individual processors within each cluster. However, because of the high network bandwidth on the KSR1, the I/O cost dominates the performance. With higher skew rates, the amount of data (K_s) that has to be accessed from disk in the skewed cluster increases sharply with an increasing number of nodes. For the KSR1 configuration tested in this work, the performance starts to degrade around 15 nodes per cluster for θ = .1, and around 9 nodes for θ = .5.

Figure 5.4. Validation of the analytical and simulation models with experimental results (total time vs. number of nodes per cluster, 5 to 30, for θ = .1 and θ = .5; Net = 150 MB/sec, I/O = 40 MB/sec, CPU = 25 MIPS). Data points in the experimental results (obtained on the KSR1) represent the total execution time averaged over 3 runs.

By increasing the I/O bandwidth per cluster, the performance can be improved up to the limitation of the network bandwidth. This could not be tested experimentally since we do not have any control over the static configuration of the KSR1. The next section, however, experiments with changing the values of the parameters and studies the performance of the system using both the analytical and simulation models. Note that both the analytical and simulation models closely approximate the experimental results, with a maximum difference of 8%.

5.5 Projections for Variant Architectures

In this section we study the performance of the algorithm by evaluating the cost functions under different models. We project the performance of the system by varying the values of key system parameters, such as the I/O bandwidth, CPU speed or cluster size. The figures shown in this section illustrate the impact of these parameters on the performance of the overall system. We show that NUCOP provides good performance and can accommodate a large number of nodes. Unless noted otherwise, all figures in this section were generated using the parameter values given in Appendix A. These values were used by both the simulator and the analytical model.

5.5.1 Effect of I/O bandwidth

Figure 5.5 shows the performance of the system with increasing I/O bandwidth. The various curves represent different values of n, the number of nodes per cluster. As shown in the figure, when n increases, C_IO increases accordingly, as the amount of data per cluster to be accessed on disk increases. Results from the analytical model and the simulation show that increasing the I/O bandwidth results in significant performance improvements in the lower bandwidth range, i.e. 5 to 20 MB/sec.

Figure 5.5. Effect of I/O bandwidth (total time vs. I/O bandwidth, 5 to 50 MB/sec; analytical and simulation curves for n = 6, 16 and 32; Net = 150 MB/sec, CPU = 25 MIPS, θ = .1).
The figure shows, however, that while larger I/O bandwidths do improve the overall performance, further increases in I/O speed do not yield similar improvements. For example, the bandwidth increase from 40 to 50 results in a very small improvement compared to that from 10 to 20.

5.5.2 Effect of network bandwidth

The network connecting the clusters is a shared resource that can form a performance bottleneck. Figure 5.6 shows the effect of increasing the number of clusters while keeping the network speed constant at 60 MB/sec. With increasing values of m, the total sizes of R and S, in number of tuples, increase. Consequently, the CPU, I/O and network costs also increase. However, what is important to note here is that while C_IO increases with the higher volume of data per cluster, C_Net is more severely affected because the network is shared among all nodes and not localized to individual clusters as is the case for C_IO. Note that, on average, the amount of data to be sent over the network is proportional to (m − 1)/m. Hence, increasing the value of m has a much more severe effect on C_Net than on C_IO.

Figure 5.6. Effect of network bandwidth with various numbers of clusters (total time vs. number of clusters, 5 to 30; analytical and simulation curves for n = 10, 16 and 32).

The figure also presents the performance for different values of n. Increasing the value of n results in more nodes and consequently larger relations. This, in turn, increases the load on the network and results in higher network costs. It may seem, from Figure 5.6, that large cluster sizes (large n) degrade the system's performance by adding to the total load on the network. Section 5.5.6 focuses in more detail on the effects of varying the cluster size and the corresponding effects on performance. That section also compares the NUCOP architecture to previously proposed systems.

5.5.3 Effect of CPU speed

CPU speed is an important parameter for good performance but, as Figure 5.7 shows, it is not as critical to the overall system performance as the network and I/O bandwidths. With small CPU speeds, e.g. 5 and 10 MIPS, C_CPU dominates the performance of the algorithm up to large values of n (28 nodes for 10 MIPS). However, the figure indicates that, for moderate CPU speeds, e.g. 20 MIPS, the performance is mostly dominated by either network or I/O costs.

Figure 5.7. Effect of CPU speed (total time as a function of the number of nodes per cluster and the CPU speed).

The technological trend indicates that CPU speeds have improved, and continue to improve, at a faster rate than network and I/O bandwidths. With currently available CPU speeds of 50 MIPS and above, the contribution of C_CPU becomes almost negligible even for very small values of n, as shown in the figure.

5.5.4 Effect of large cluster sizes

Figure 5.5 showed that the I/O bandwidth poses a limitation on the total number of processors per cluster. Another important system parameter is the bandwidth of the local network (or bus) connecting the nodes within each cluster.
This should also pose limitations on the total number of nodes that can be used effectively within the clusters. However, the KSR1 multiprocessor system has a maximum of 32 nodes per cluster and the bandwidth of Ring:0 is 1 GB/sec. With such a configuration [84], the bandwidth of Ring:0 does not pose any performance limitations [118]. Thus, we could not conduct any experiments to check the performance effects of the local network bandwidth with various cluster sizes. Here, we rely primarily on the simulator to investigate this aspect of the system's performance. The results are shown in Figure 5.8.

To concentrate on the effects of the local network bandwidth, we changed that bandwidth to 50 MB/sec and also changed the I/O bandwidth to 100 MB/sec. The figure shows that, given such a large I/O bandwidth per cluster, increasing the number of processors within the cluster beyond a certain limit sharply degrades the performance of the local network. The performance is shown for 3 different skew rates. Note that with higher skew rates, the performance degrades sharply. This is because the local network in the overloaded cluster has to transport the skewed data to the local nodes as well as to nodes in other clusters. Hence, the amount of data transported becomes too large for this local network and the performance starts to degrade. However, with reasonable values for the different components, such as the values used in Figure 5.4 for the KSR1, we note that the I/O is the more restrictive component. Note that the analytical model does not exhibit this effect and only reflects a larger I/O load as the sizes of the clusters increase. This is because the bandwidth of the local network could not be incorporated into the formulation of the analytical model.

Figure 5.8. Effect of the cluster size on system performance (total time vs. number of nodes per cluster, up to 60; analytical and simulation curves for θ = 0.2, 0.5 and 0.8).

5.5.5 Effect of skew rate

Data skew is a very important factor that can severely affect the performance of parallel hash-based join algorithms [17]. Figure 5.10 shows the total cost of the join for various degrees of data skew. The figure shows that increasing the degree of skew increases the total cost accordingly. Note that for moderate skew rates, between 0.1 and 0.25, the cost function increases only slightly. A larger skew rate means a larger value of K_s, which translates into:

• a higher local network cost at the (overloaded) cluster that received K_s,
• a higher I/O cost for writing and then reading K_s during the second phase of the algorithm, and
• slightly more CPU processing, since there are more tuples to process in K_s.

With severely skewed data (i.e. θ = .8), the I/O cost in the skewed cluster for storing and reading this very large number of tuples totally dominates the overall performance, as shown in the figure.

Figure 5.10. Total join cost for various degrees of data skew (total time vs. number of nodes per cluster, 5 to 30, for skew rates from θ = .1 to θ = .8).

5.5.6 Comparison to other work

Figure 5.9 shows the performance of the algorithm when the total number of nodes, N, and consequently the sizes of R and S, are kept constant while the size of the clusters, n, is varied from 2 to 64. The figure attempts to cover the spectrum of values for m and n when N is fixed at 128 nodes. The case of m = N (n = 1) corresponds to the Shared-Nothing with Shared Virtual Memory system introduced by Shatdal and Naughton [73], while higher values of n correspond to larger SE systems within each cluster.

Figure 5.9. Total cost with constant N = 128 but varying n and m (total time vs. number of nodes per cluster, up to 60; analytical and simulation curves for I/O bandwidths of 20, 30 and 40 MB/sec).
The case of n = 8 approximates the performance of the system studied in [18].*

*Using 7, as in [18], instead of 8 would make m a fractional number.

Shatdal and Naughton [73] assumed the network to be of infinite bandwidth. This was based on results obtained from runs on the GAMMA database machine with a small number of nodes (typically 4 to 16), where the network bandwidth did not form a bottleneck. Note that since N = m × n, increasing the cluster size means decreasing the number of clusters in the system. With a large number of clusters, the amount of data to be read from disk in each cluster is small and may easily be handled by the available I/O bandwidth. However, the amount of data sent across the network by each cluster (C_Net) and the number of network requests generated by each node during distribution and load balancing are large. This is especially true in the presence of large data skew.

Figure 5.9 suggests that with a large number of clusters (i.e., clusters of very small size) in the system, the network does form a bottleneck and the cost of network accesses may completely dominate the performance of the algorithm. An important conclusion here is that while the network bandwidth does place a limitation on the scalability of the system in terms of the total number of clusters, as is the case in pure shared-nothing systems, the NUCOP system has the potential of providing more nodes since each individual cluster can have a large number of nodes.

As the figure shows, by grouping a number of nodes into clusters and hence decreasing the number of clusters, the amount of data and the number of requests to be sent over the network are minimized and the performance improves up to n = 16 (for I/O = 30 MBytes/sec). This shows that the NUCOP system is preferable to that introduced in [73]. Another major advantage is that load balancing within a cluster is handled locally inside the cluster and hence does not interfere with across-cluster processing, thus reducing the total network load.

In the NUCOP system, load balancing across clusters improves the overall performance of the join algorithm, as shown earlier in this section. However, as the cluster sizes increase, the I/O bandwidth can saturate, as more data is accessed per cluster, and start to dominate the performance, as shown in Figure 5.9. It was shown earlier (Section 5.5.1) that the I/O bandwidth is a limiting factor for the sizes of clusters in NUCOP systems. However, the figure suggests a larger optimal cluster size, 16 nodes, than what was previously determined in [18] for the same set of system parameters.

5.6 Concluding Remarks

In this chapter, we introduced a parallel join algorithm with load balancing for the NUCOP architecture. The analytical model developed for this architecture was validated with both simulation and actual experimentation on the KSR1 multiprocessor. The model and the simulator were also used to project the performance of the system under various parameter settings.

We investigated the effects of various key parameters on the performance of the system. We showed that the network bandwidth becomes a limiting factor when the number of clusters increases. We demonstrated a large cluster size and consequently showed that NUCOP systems can have a larger number of nodes than both SN and SE systems.

Given the large number of processors and the I/O and network bandwidths provided by the NUCOP architecture, systems based on this architecture are able to efficiently handle very large relations.
Consider, for example, the processing of a database system that belongs to one of the credit agencies mentioned in Chapter 1. Assume that the database holds approximately 200 million records and has a total of five relations of equal sizes. This yields around 40 million tuples per relation. With 40 MB/sec and 160 MB/sec as the values of the I/O and network bandwidths respectively, we can have m = 16 and n = 14, for a total of 224 nodes in the system. Assuming an initially uniform distribution (using an appropriate partitioning strategy), each node will have around 174K tuples for each of the relations. By extrapolating from the results in Figures 5.5 and 5.6, the join can be computed in approximately six minutes.

CHAPTER 6

CONCLUSION AND FUTURE RESEARCH

We conclude that the proposed parallel hash join algorithms perform well on NUMA-based multiprocessor systems. These systems are more suited for load balancing than systems based on the shared-nothing architecture. We also conclude that the NUCOP architecture can successfully meet the demands of current and future database systems, based on the following facts:

• The proposed system scales better than previously proposed parallel database systems for performing relational joins.
• Very large relations are processed efficiently in terms of both processing time and I/O speed.
• Even with a very large number of processors in the system, load balancing can be performed efficiently.

The contributions of this research can be summarized as follows:

• We introduced and studied two parallel hash join algorithms for main memory database systems on the NUMA architecture. The first algorithm, Distributed Hash Join, uses distributed data structures where separate buckets are built local to each node to minimize access conflicts. Processor synchronization is also partially localized: processors increment the global counter and then wait until a local flag is set by the last processor to reach the barrier. This eliminates the remote accesses that would otherwise be needed to test the current value of the global counter.
Random probing was shown to minimize the severe conflicts associated with probing sequentially and to have only a minimal effect on the activities of other helping nodes, e.g., transferring of blocks from overloaded nodes. A11 analytical model was derived that accurately captures the performance of the balancing scheme. The join algorithm proposed in Chapter 5 was implemented on the KSRl mul- tiprocessor system. Section 5.4.2 showed that with the KSRl configuration 107 and for certain skew rates, only up to 15 nodes per cluster can be used effec- tively. By varying the system parameters, such as I / O and/or network speed, the analytical model for N U COP architecture is able to accurately predict the performance of the system. The model was shown to be able to expose the scalability limitations of the given architecture for parallel database processing. 0 The network bandwidth was shown to restrict the total effective number of processors when the cluster size is small. With larger cluster sizes, a larger proportion of the data is processed within the cluster itself and hence, the amount of data sent over the network decreases. This shows that for a given network bandwidth the total number of effective processors for N UCOP systems is larger than that of systems based 011 shared—nothing architecture. 0 Simulation studies allowed us to evaluate the performance of the N UCOP ar- chitecture across a wide range of parameter values. The simulator was used to verify the analytical results and its output closely matched that of the KSRI. The simulator was designed in a modular fashion and has the ability to work with various configurations and different system parameters values. Because of the integration of functionalities within each component, the interactions be- tween the workings of different components are kept down to a minimum and that allows for easier replacements. The main components of the system, i.e., the CPU, local and global networks and I / O can be replaced by new components with different topologies and speeds. Several issues mentioned in this dissertation, but not studied thoroughly, provide the basis for future research. These include extending the simulator with new network and I / O configurations in order to model new systems with different configurations. Also of interest is the impact of multiple joins as well as multi—join operations on the load balancing scheme and 011 the overall performance of the system. APPENDICES APPENDIX A SYSTEM PARAMETERS FOR THE NUCOP MODEL The following table (A.l) presents the system and algorithm parameters used in the formulation of the analytical model for the N UCOP architecture in Chapter 5. The table presents each parameter along with a brief explanation and a default value. Unless otherwise noted, this is the value used in evaluating the various cost functions of the analytical model. 
Parameter    Default Value                Explanation
n            32                           Number of nodes per cluster
m            4                            Number of clusters
N            n × m                        Total number of nodes in the system
N_R          8000                         Number of R tuples per node
N_S          8000                         Number of S tuples per node
T_R          N_R × N                      Total number of tuples in R
T_S          N_S × N                      Total number of tuples in S
J_s          0.00001                      Join selectivity
θ            0.15                         Skew rate
t            128                          Tuple size in number of bytes
b            300000                       Bucket size in bytes
b_c          131072                       Bucket size that fits in cache (128K)
B            25                           Number of buckets
B_s          128                          Block size in number of tuples
W_io         40                           I/O bandwidth (in MB/sec)
W_cpu        25                           CPU speed (in MIPS)
W_net        100                          Network bandwidth (in MB/sec)
I_join       1000                         Cost of join (in number of instructions)
I_io         500                          Cost of initiating an I/O request
I_net        500                          Cost of initiating a network request
I_hash       300                          Cost of hashing a tuple
R_i          T_R / m                      Initial portion of R in cluster i
S_i          T_S / m                      Initial portion of S in cluster i
S_n                                       Portion of S received by each node (except the skewed node)
K_s          θ × T_S + S_n × (n - 1)      Number of tuples in the skewed cluster
E_s          θ × T_S - S_n                Extra tuples in the skewed cluster
T_io                                      Cost of I/O
T_net                                     Cost of network transfer
T_cpu                                     Cost of CPU processing

Table A.1. Various system parameters with their default values.

APPENDIX B

THE NUCOP SIMULATOR

In the following, we provide a detailed description of the simulator for the NUCOP architecture. The simulator is written in CSIM, a process-oriented discrete-event simulation package for use with the C programming language. CSIM is implemented as a library of routines that the user may call from a regular C program. In CSIM, a system is modeled as a collection of predefined CSIM structures. Processes, which represent the active part of the system, interact with each other by visiting the CSIM structures. The structures provided by CSIM are:

1. Facility: Servers that are either reserved or currently in use by some process.

2. Storage: A resource that can be partially allocated and deallocated by a process.

3. Event: The main resource used for synchronization.

4. Mailbox: A resource for interprocess communication.

5. Table: Resources for collecting detailed information and statistics about the runs. There are a number of table types, and each one collects a different type of information.

Following are the definitions of the main components of the simulator:

1. CPU: An array of N facilities, where N is the number of processors.

2. IO: An array of m facilities, where m is the number of clusters. This represents a single I/O source per cluster.

3. lNet: An array of m multiple-server facilities. Each multiple-server facility simulates a slotted-ring bus with the specified number of slots.

4. gNet: A single facility through which all communication between the various lNets takes place.

5. nReq: An array of m multiple-server facilities. This is not a component of the actual system, but it had to be incorporated for more accurate results. On the KSR1, each processor can have at most three pending network requests; nReq is used to enforce this restriction.

In simulating the networks in NUCOP, a simplifying assumption was made concerning the locations of the empty slots on the ring. More precisely, it is assumed that if there is an empty slot on the ring, it can be used immediately, regardless of its current position on the ring. While this assumption simplifies the implementation, our aim is not to model the slotted ring exhaustively in every aspect. Instead, the objective is to understand the effect of the network bandwidth on the performance of the system. By comparing the results from a number of simulator runs with actual experimental data, we found that the assumption does not seem to have a great effect on the performance of the network.
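To make the simplification concrete, the following minimal C sketch (our illustration, not code taken from the simulator; the slot count and all identifiers are hypothetical) treats a local ring as a pool of interchangeable slots: a packet simply occupies whichever slot becomes free earliest, ignoring slot positions, which is exactly the assumption described above.

/* Illustrative sketch of the slotted-ring simplification: slots are treated
 * as an anonymous pool, so a packet takes whichever slot frees up first,
 * regardless of its position on the ring.  Not the simulator's CSIM code. */
#include <stdio.h>

#define RING_SLOTS 8                      /* assumed number of slots on the ring */

typedef struct {
    double busy_until[RING_SLOTS];        /* simulated time at which each slot frees */
} ring_t;

/* Occupy one slot for xfer_time seconds, starting no earlier than now;
 * returns the simulated time at which the packet transfer completes. */
static double ring_send_packet(ring_t *ring, double now, double xfer_time)
{
    int best = 0;
    for (int i = 1; i < RING_SLOTS; i++)      /* pick the earliest-available slot */
        if (ring->busy_until[i] < ring->busy_until[best])
            best = i;

    double start = (now > ring->busy_until[best]) ? now : ring->busy_until[best];
    ring->busy_until[best] = start + xfer_time;
    return ring->busy_until[best];
}

int main(void)
{
    ring_t lnet = { { 0.0 } };
    double done = 0.0;
    /* 20 back-to-back 128-byte packets on a 100 MB/sec ring (1.28 us each) */
    for (int p = 0; p < 20; p++)
        done = ring_send_packet(&lnet, 0.0, 128.0 / 100e6);
    printf("last packet completes at %g seconds\n", done);
    return 0;
}

With eight slots, the twenty packets occupy at most three consecutive transfer times per slot, so the last one completes after about 3.84 microseconds; a position-aware ring model would add only a small, variable wait for a free slot to rotate by.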
In order to simulate the actual processing of tuples by a given processor, the corresponding CPU facility is held for the corresponding amount of time. Holding a facility is done by calling the function use(f,t), where f is the facility and t is the amount of time required to process the current request. While a facility is in use by some process, all other processes have to wait until the current process releases the facility before they can use it.

Sending a block of tuples over the network involves first reserving the CPU for an amount of time equivalent to that of issuing a network request. Next, we check the nReq facility to see whether there are already three pending requests and, if there are, we wait until one of the requests completes. This waiting time is added to the CPU time, since the network request has thus far been unsuccessful. When the request succeeds, the data reaches the local network lNet. Here, the first step is to calculate how many packets are needed and to check whether the destination node, dn, is within the same cluster as the source node, sn. Next, the local lNet is reserved for each packet to simulate the sending. If sn and dn belong to the same cluster, we are done sending the data. Otherwise, the lNet has to send the packets to the cluster of dn over gNet. Thus, gNet is reserved for each of these packets, and then the lNet corresponding to the cluster of dn is reserved for the duration of the receive. Since the network models a slotted-ring bus, reserving the network for a packet reserves only one slot; all other slots may still be available and can be used simultaneously by other processors. A simplified sketch of this accounting is given below.

The overall design of the simulator is modular enough to allow adding new network topologies and I/O configurations. Time limitations did not, however, allow us to investigate configurations other than that of the KSR1 family of multiprocessor systems. Table B.1 presents the parameters used in the simulation along with a brief description of each parameter.

B.1 Simulator Validation

In addition to being validated against the analytical and experimental results, the simulator was also validated with the following static test. The system consists of only two clusters with two nodes each. All parameters are assigned their default values (see Table B.1), except the sizes of R and S, which were made smaller: N_R = N_S = 500. We measured the number of requests submitted to gNet and to each lNet and I/O object and compared them against the expected values, which were computed by hand. This is to make sure that all requests are handled correctly by the different components of the system.

The main reason for choosing such a small system is to make the hand computations feasible. Even for a small system of this scale, and with modest values for N_R and N_S, the hand computations proved to be challenging.
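The sketch below is a simplified, serialized illustration of the cost accounting described above for sending one block of tuples; it is not the simulator's CSIM code. It assumes that MB/sec means 10^6 bytes per second, charges gNet and the receiving lNet the same per-packet time as the sending lNet, and uses hypothetical names; in the simulator these costs are incurred by holding the corresponding facilities, so transfers from different processors overlap rather than add up.

/* Serialized approximation of the costs charged when a node sends one block
 * of tuples.  Parameter values follow Table B.1; names are hypothetical.
 * The nReq limit (three pending requests per processor) is not modeled here;
 * in the simulator any wait on nReq is added to the CPU time. */
#include <math.h>
#include <stdio.h>

#define TUPLE_BYTES     128       /* t                       */
#define PACKET_BYTES    128       /* S_pkt                   */
#define BLOCK_TUPLES    128       /* S_b                     */
#define CPU_MIPS        25.0      /* B_cpu                   */
#define NET_MB_PER_SEC  100.0     /* B_lNet and B_gNet       */
#define I_NET           500.0     /* instructions to initiate a network request */

/* Simulated time, in seconds, to deliver one block from sn to dn. */
static double send_block(int sn_cluster, int dn_cluster)
{
    double t = 0.0;

    /* 1. CPU cost of issuing the network request. */
    t += I_NET / (CPU_MIPS * 1e6);

    /* 2. Packetize the block and send every packet on the local ring (lNet). */
    int packets = (int)ceil((double)(BLOCK_TUPLES * TUPLE_BYTES) / PACKET_BYTES);
    double per_packet = PACKET_BYTES / (NET_MB_PER_SEC * 1e6);
    t += packets * per_packet;

    /* 3. If the destination lies in another cluster, each packet also crosses
     *    gNet and is received on the destination cluster's lNet. */
    if (sn_cluster != dn_cluster)
        t += packets * per_packet * 2.0;

    return t;
}

int main(void)
{
    printf("intra-cluster block: %.2f us\n", send_block(0, 0) * 1e6);
    printf("inter-cluster block: %.2f us\n", send_block(0, 1) * 1e6);
    return 0;
}

Under the Table B.1 defaults this reports roughly 184 microseconds for an intra-cluster block and 512 microseconds for an inter-cluster block, which gives the flavor of the per-block costs that the facilities account for.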
Parameter    Default Value      Description
n            32                 Number of processors per cluster
m            4                  Number of clusters
N            n × m              Total number of nodes in the system
N_R          8000               Average number of R tuples per processor
N_S          8000               Average number of S tuples per processor
J_s          0.00001            Join selectivity
θ            0.15               Skew rate
t            128                Tuple size in number of bytes
S_pkt        128                Packet size in bytes
B            100                Number of buckets
S_b          128                Block size in number of tuples
S_page       16384              I/O page size (16K bytes)
B_IO         40                 I/O bandwidth (in MB/sec)
B_cpu        25                 CPU speed (in MIPS)
B_lNet       100                Local network bandwidth (in MB/sec)
B_gNet       100                Global network bandwidth (in MB/sec)
I_io         500                Cost of an I/O request (number of instructions)
I_net        500                Cost of a network request (number of instructions)
I_hash       300                Cost of hashing a tuple (number of instructions)
I_join       1000               Cost of a join (number of instructions)

Table B.1. The major system parameters used to control the behavior of the NUCOP simulator.
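To give a sense of scale, the defaults above imply the following per-operation service times (our own unit conversions, assuming MB/sec denotes 10^6 bytes per second; these are illustrations, not cost formulas taken from the dissertation):

\[
t_{hash} = \frac{I_{hash}}{B_{cpu}\cdot 10^{6}} = \frac{300}{25\cdot 10^{6}} = 12\ \mu\mathrm{s}, \qquad
t_{page} = \frac{S_{page}}{B_{IO}\cdot 10^{6}} = \frac{16384}{40\cdot 10^{6}} \approx 410\ \mu\mathrm{s}, \qquad
t_{pkt} = \frac{S_{pkt}}{B_{lNet}\cdot 10^{6}} = \frac{128}{100\cdot 10^{6}} = 1.28\ \mu\mathrm{s}.
\]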