:7...

 

 

 

 

 

O.

y .

..
;‘ 1.1

 

H

 

 

 

 

li‘nD-szlxﬂvm
‘0}.‘1. I!
up D (00:0 1!

It
In

'4‘,
2'3 ‘
.

J
I

5;,
L

5
i

310:0 nl
llvluluﬂrﬂmm

I ”Judah“..- .

 

I rll’LQl‘

RSITYUBRM

’Illllllillu‘tlﬁllllmm\Ilugmj’l “

3 1293 00794 4

This is to certify that the

dissertation entitled

A Framework for Multiprocessor Performance
Characterization and Calibration

presented by
Arun K. Nanda

has been accepted towards fulﬁllment
of the requirements for

Ph.D. degree in Computer Science

 

hmtg \‘l by;

Major professor

Date 10/12/92

MS U is an Afﬁrmative Action/Equal Opportunity Institution 0- 12771

 

 

 

 

l’ ‘1

LIBRARY
Michigan State
University

x I

 

 

PLACE IN RETURN BOX to remove thin checkout from your record.
TO AVOID F INES return on or before date due.

DATE DUE DATE DUE DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

:Tl—j

MSU Is An Affirmative Action/Equal Opportunity Institution
emote”.-

 

 

 

 

A Framework for Multiprocessor Performance

Characterization and Calibration

By

Ann: K. Nanda

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements

for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1992

ABSTRACT

A Framework for Multiprocessor Performance

Characterization and Calibration
By

Arun K. Nanda

In parallel programs using the shared-variable paradigm, run-time communica-
tion overhead manifests itself along three principal dimensions, namely, shared data
accesses (including memory contention, cache misses and non-local memory access
latencies), inter-process synchronization operations, and global barrier synchroniza-
tions. Performance measurements to quantify the rate at which communication costs
for an algorithm increases as more processors are used is integral to the study of an
algorithm’s efﬁciency and scalability. In this thesis, we explore the problem of per-
formance characterization of a multiprocessor in the context of the shared-variable
programming model with emphasis on characterizing the dynamic run-time behavior.

We have developed a hierarchical model to characterize multiprocessor system per-
formance using a multi-phase computation structure with concurrent asynchronous
execution within a phase. Two sets of system characterization parameters have been
proposed that completely describe the static and dynamic behavior of a given in-
put workload on a target multiprocessor system. The characterization parameters
are calibrated by experimental measurements on the input workload. A series of
loss functions are formulated to describe the performance degradation resulting from

static and dynamic overheads, thus providing realistic estimates of performance loss.

Since the characterization of performance is tied inextricably with the input work-
load, we have presented a ﬂexible technique for benchmark workload generation, that
can be tailored to fit a user’s preference for selective workload characteristics. A fam-
ily of workload emulation kernels, namely, the MAD, SAD and BAD kernels, have
been designed to isolate and measure the incremental impact of memory contention,
critical sections and barrier synchronization on performance, repectively, to calibrate
the hierarchical performance model. We have demonstrated the applicability of the
system characterization methodology and the effectiveness of the workload emulation
kernels by evaluating the performance of several synthetic workloads on the Sequent
Symmetry and BBN TC2000 commercial multiprocessors.

The proposed methodology is independent of any particular architecture or appli-
cation. We believe that our approach to performance characterization will serve to
model performance with greater fidelity than exists in the current state of art, since it
incorporates the effect of both static and dynamic influences in a workload execution.
Since a shared-variable programming paradigm is only assumed with no assumptions
made about the organization of the shared address space, our framework can be used
equally effectively to evaluate multiprocessors that provide a physical shared memory

or highly-parallel systems that support a shared virtual memory.

Copyright © by
Arun K. Nanda
1992

To my parents

ACKNOWLEDGEMENTS

I would like to take this opportunity to express my appreciation to those who
have contributed to the completion of this dissertation. I will always be indebted to
my advisor, Lionel Ni. He has been my mentor, my colleague, and my friend. His
guidance has helped me mature as a researcher, and his respect for my ideas has made
working with him very rewarding. I look forward to many fruitful interactions with
him in the future.

I am very grateful to the other members of my dissertation committee: Richard
Enbody, for his invaluable discussions on numerous occasions and comments to im-
prove the readability of this thesis, his perpetual willingness to listen to whatever I
had to say, be it research related or otherwise, and offer friendly advice; Abdol Es-
fahanian, for being my faculty advisor for two years, his critical suggestions on some
aspects of this thesis, and for his time and support; V. Mandrekar, for his continuous
encouragement and always accommodating me in his schedule at short notice.

I would like to thank the members of the Advanced Computing Research Facility
at Argonne National Laboratory, especially Dave Levine, for providing me access to
their computer systems and their help in arranging my special job scheduling requests.

My thanks to Honda Shing and Ten-Hwan Tzen for many enlightening discussions
on research issues.

A person cannot accomplish anything without the help and understanding of fam-
ily members. My mother’s constant encouragement, in spite of her personal hardships,
inspired me to do my best. My brother and sister always stood behind all my deci-
sions. My father- and mother-in-law offered their patient understanding throughout
the course of my doctorate work. I proudly share this accomplishment with them all.

Finally, my very special thanks to my wife Susmita, for sustaining me with her
continuous love and understanding, and spending many a sleepless nights with me

during my work to keep me company.

vi

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 INTRODUCTION
1.1 Multiprocessor Performance Evaluation .................

1.2 Survey of Benchmarks ..........................

1.2.1 Synthetic Benchmarks ......................
1.2.2 Kernel Benchmarks ........................
1.2.3 Application Benchmarks .....................
1.3 Motivation and Problem Definition ...................
1.4 Objective and Scope of Research ....................
1.5 Thesis Outline ...............................

2 BACKGROUND

2.1 Multiprocessor Memory Organization ..................
2.2 Limitations to Parallelism ........................
2.2.1 Memory Access Contention ...................
2.2.2 Spin Locks and Mutual Exclusion ................
2.2.3 Synchronization Barriers .....................
2.3 Target System Architectures .......................
2.4 Summary .................................

3 PERFORMANCE CHARACTERIZATION METHODOLOGY

3.1 The Parallel Computation Model ....................
3.2 Workload Characterization ........................
3.2.1 The Unit Grain ..........................
3.2.2 Workload Classiﬁcation ......................
3.3 Experimental Framework .........................
3.3.1 Measurement Structure ......................
3.3.2 Workload Generation .......................

vii

xi

(0(1)me

10
10
14
20

21
21
24
25
31
38
40
45

46
47
50
51
53
55
55
59

3.4 Performance Characterization Parameters ............... 60

3.4.1 Static Parameters ......................... 61
3.4.2 Dynamic Parameters ....................... 67
3.4.3 Performance Metrics ....................... 72
3.4.4 Aggregate Multiphase Performance ............... 74
3.5 The Workload Emulation Kernels .................... 75
3.5.1 Measurement of Incremental Overheads ............. 76
3.5.2 Kernel Structure ......................... 78
3.5.3 Minimization of Experimental Errors .............. 80
3.6 Summary ................................. 83

MAD KERNELS AND MEMORY ACCESS PERFORMANCE 85

4.1 Preliminary Studies ............................ 86
4.1.1 Workload Parameters ....................... 86
4.1.2 Quantities Measured ....................... 88
4.1.3 Memory Access Overhead Factors ................ 88
4.1.4 Experimental Results ....................... 94

4.2 MAD Workload Parameters ....................... 97
4.2.1 Unit Grain Characterization ................... 98
4.2.2 Output Metrics .......................... 102

4.3 Concurrent-Access Workloads ...................... 103
4.3.1 Homogenous Workloads ..................... 103
4.3.2 Heterogenous Workloads ..................... 111

4.4 Dual-Mode Access Workloads ...................... 114

4.5 Summary ................................. 115

SAD KERNELS AND SYNCHRONIZATION PERFORMANCE 117

5.1 Preliminary Studies ............................ 118
5.1.1 Synchronization Overhead Factors ................ 119
5.1.2 Experimental Results ....................... 121

5.2 SAD Workload Parameters ........................ 130
5.2.1 Unit Grain Characterization ................... 130
5.2.2 Output Metrics .......................... 132
5.2.3 Lock Implementations Studied .................. 133

5.3 Exclusive-Access Workloads ....................... 137

5.4 Dual—Mode Access Workloads ...................... 144
5.4.1 Homogenous Workloads ..................... 144
5.4.2 Heterogenous Workloads ..................... 147

viii

5.5 Summary ................................. 150

6 BAD KERNELS AND BARRIER PERFORMANCE 152
6.1 BAD Workload Parameters ....................... 153
6.1.1 Phase Characterization ...................... 153

6.1.2 Output Metrics .......................... 154

6.1.3 Barrier Implementations Studied ................ 155

6.2 Embarrassing Workloads ......................... 158
6.2.1 Scalability of Barrier Implementations ............. 162

6.2.2 Balanced Load and Simultaneous Arrivals ........... 162

6.2.3 Unbalanced Load and Staggered Arrivals ............ 164

6.3 Dual-Mode Access Workloads ...................... 166
6.4 Summary ................................. 169

7 CONCLUSIONS 171
7.1 Research Contributions .......................... 171
7.2 Directions for Future Research ...................... 174
BIBLIOGRAPHY 176

ix

1.1

2.1

3.1

3.2
3.3
3.4
3.5

4.1
4.2

4.3
4.4

5.1
5.2
5.3
5.4
5.5
5.6
5.7
5.8

6.1

LIST OF TABLES

Performance level comparisons for three classes of multiprocessors . . 5
Summary of target system architectures ................ 44
An example of weights assigned to different types of ﬂoating-point

operations to normalize their execution time to ﬂoating—point addition

time .................................... 53
Summary of average shared data access time tm ............ 63
System characterization parameters ................... 72
Application parameters used in the performance model ........ 72
Summary of access degradation kernel measurements ......... 79
Basic time measurements for the overhead factors model ....... 90

Parameter settings for different workload types used in the preliminary
studies ................................... 95

Unit grain attributes for studying memory access behavior ...... 99

Static characterization parameters for a homogenous workload with

M 2 128K, G; = G6 = (9m = (0/1,0,.<§',1),gc = ¢,g, = 45) ........ 106
Actual execution times (M = N + 1,w = 500, a: = 50us) ....... 128
Actual overhead times (M = N + 1,w = 500, x = 50us) ........ 128
Unit grain attributes for studying synchronization behavior ...... 131
Native lock support on each machine .................. 134
Pseudo-code for the TAS lock ...................... 134
Latency of locks used in the SAD experiments ............. 138
Half-performance lock factor c1/2 for different lock implementations . 139
Static characterization parameters for workloads used in incremental

overhead measurements .......................... 146
Workload parameters for studying barrier performance ........ 153

1.1
1.2

2.1
2.2

2.3

2.4
2.5

3.1
3.2
3.3
3.4
3.5
3.6
3.7

4.1
4.2
4.3
4.4
4.5
4.6
4.7

LIST OF FIGURES

Performance measurement levels ..................... 4
Steps in the experimental performance characterization method . . . 17
Organization of memory hierarchy in shared-memory multiprocessors 23
Tree saturation as a result of hot spot accesses over a multistage inter-

connection network ............................ 27
Memory address interleaving techniques: (a) Fine interleaving with se-
quential assignment across modules (one bank per module); (b) Coarse
interleaving with sequential assignment within module (one bank per
module); (c) Mixed scheme with fine interleaving among banks of a

module and coarse interleaving among modules (multiple banks per

module) .................................. 29
Sequent Symmetry system architecture ................. 41
BEN TC2000 system architecture .................... 42
Structure of parallel program execution ................. 49
Structure of a unit grain ......................... 52
Structure of a single computational phase ................ 54
Structure of the measurement framework ................ 56
Incremental measurement of dynamic overheads ............ 77
The concurrent loop structure of the kernels .............. 80
Normalized 90 percent conﬁdence intervals for three workload mea-

surements on the Sequent Symmetry for Nrcpeat = 5,10, 20 ...... 82
Efficiency vs. N (M = 1,7» = 100,3: = 0) ................ 96
Efﬁciency vs. N (M = N + 1,w = 100,3: = 0) ............. 97
Creation of memory access patterns using attributes d and s ..... 100
Effect of spatial distribution of memory access stream on performance 105

Effect of temporal distribution of memory access stream on performance107
Effect of contention for a memory location (hot-spot) on performance 108

Effect of length of computation on hot-spot write performance . . . . 109

xi

4.8 Effect of shared-data size on read performance ............. 111
4.9 Random access performance expressed in MegaWARPS ......... 112
4.10 Interaction between read and write memory—access streams ...... 113

4.11 Effect of length of computation on interference between read and write

streams .................................. 114
5.1 Generic structure of program executed by every processor ....... 119
5.2 Efﬁciency vs. N (M = N +1,w = 100,p = 0) ............. 121
5.3 Efﬁciency vs. N (M = N + 1,1) = 0.1,2: = 30ps) ............ 122
5.4 Efﬁciency vs. N (M = N + 1,0.) = 100,3: = 100113) ........... 123
5.5 Efﬁciency vs. N (M = N + 1,w = 100,p = 0.3) ............ 124
5.6 Overhead components vs. N (M = N + 1,w = 500,p = 0.1, a: = 30us) 126
5.7 Overhead components vs. :1: (M = N + 1,w = 500,p = 0.1) ...... 127
5.8 Overhead components vs. p (M = N + 1,0.) = 500,3: = 50us) ..... 129
5.9 Critical section structure ......................... 132
5.10 Pseudo-code for the MCS list-based queuing lock ........... 136
5.11 Working of the MCS list-based queuing lock .............. 137
5.12 Effect of frequency of CS on performance ................ 141
5.13 Effect of non—CS to CS computation ratio on performance ...... 143
5.14 Effect of non-CS to CS shared data access ratio on performance . . . 145
5.15 Incremental interference measured with stride of access 3 = 1 ..... 147
5.16 Incremental interference measured with stride of access 3 = 23 . . . . 148
5.17 Impact of non-CS memory accesses on CS execution performance . . 149
5.18 Impact of CS spin-lock on non-CS memory accesses .......... 150
6.1 Pseudo-code for a sense reversing centralized barrier .......... 155
6.2 Pseudo-code for a distributed dissemination barrier .......... 157
6.3 Time to achieve barrier vs. N ...................... 161
6.4 Time to achieve DSM barrier on the TC2000 .............. 163
6.5 Barrier performance of a perfectly balanced load ............ 164
6.6 Barrier performance of an unbalanced load ............... 165
6.7 Performance of staggered arrivals at the barrier ............ 167
6.8 Cumulative interferences unit stride workload on the Symmetry . . . 168
6.9 Cumulative interferences unit stride workload on the TC2000 ..... 169

xii

CHAPTER 1

INTRODUCTION

The ever increasing need for faster and more powerful computers, coupled with the
advent of fairly cheap microprocessors, has prompted considerable interest in mas-
sively parallel processor systems. Computational power has reached a plateau at the
current state of technology for single processor systems [23], due to certain funda-
mental limits (i.e., the speed of light and the width of the atom) being approached.
In an effort to sustain increases in the peak speed of new computer systems so as to
bridge the discrepancy between computational needs and available computing power,
designers have turned to multiple processors, vector arithmetic units, and other ar-
chitectural innovations. Using a large number of low-cost processors for achieving
supercomputing performance is attractive indeed. Unfortunately, it is much more
difﬁcult for a programmer or a compiler to take advantage of multiple processors
than of a faster clock speed. As a result, many machines with complex architectures
are able to deliver only a small fraction of their theoretical peak performance on all
but the most ideal problems.

The purpose of this dissertation research is to develop a ﬂexible approach to char-
acterize multiprocessor systems for general purpose parallel programming that can
measure and quantify the expected losses in parallel execution performance and deter-
mine performance bottlenecks for any selected workload. The proposed methodology
provides a framework for customized benchmark workload generation and yields a
set of parameters which characterize the target system. These parameters spotlight

the strong and weak points of a machine and, hence, aid in the design of efﬁcient

algorithms for it. It should be emphasized that it is not the intent of this thesis
to address the issue of performance prediction of application programs. We have
chosen the shared-memory programming model as the focus of our study. In this
model, processes communicate with each other through shared-variables residing in
globally accessible memory. The shared-memory programming model is widely be-
lieved to be easier to use than the message-passing model. The conceptual simplicity
of the shared-memory model derives from similarities with sequential programming.
Evidence in favor of the shared-memory model is the overwhelming dominance of
shared-memory multiprocessors for general purpose parallel programming, and the
considerable effort in software development designed to provide the illusion of shared
memory on multicomputers.

In this introductory chapter, we elaborate some of the pertinent issues in mul-
tiprocessor performance evaluation, provide a brief survey of the commonly used

multiprocessor benchmarks, and describe the objective and scope of this research.

1.1 Multiprocessor Performance Evaluation

The goal of computer performance evaluation is to identify opportunities for spe-
cific performance improvements throughout the life of a computer system and to
guide the design of more effective architectures. The requirements of target appli-
cations motivate the development of new systems; the development of novel systems
creates the need and the basis for performance evaluation research. Effective perfor-
mance evaluation of highly-parallel systems is essential because these systems must
function at the limits of their computing potential in order to meet the overwhelm-
ing demands of large scientiﬁc applications. However, analyzing the performance of
multiple-processor systems is a very complex task since many factors jointly deter-
mine system performance, and the modiﬁcation of some factors affects others. Since
many different tradeoffs are involved, it is crucial to carefully tune various parameters
such that a system achieves its peak performance.

Traditionally, three common approaches are used to evaluate multiprocessor per-

formance: analytical, simulation and experimental [56]. All three approaches are
necessary because each has its own advantages and limitations. Analytical mod-
els are extremely powerful in the sense that they allow the analytical correlation
of performance with organizational parameters. However, their applicability is not
universal. In order to be tractable, they typically have to make many simplifying as-
sumptions about the architecture and application characteristics that may not reﬂect
an accurate representation of reality. For example, memory interference models for
multiprocessors based on queueing theory often assume a randomly distributed (both
in time and space) memory request stream. This assumption fails for many scientific
and engineering applications that exhibit very regular data access patterns. If vector
instructions are used to implement these codes they must exploit, and hence empha-
size, this regularity in the temporal and spatial distribution of requests. Simulations
can generally approximate reality more closely, but they are expensive to run and
still do not replace real measurements. Moreover, interactions may be present on a
real system that affect performance and are difﬁcult to capture in a model.

The advantage of experimental performance analysis is, of course, that the per-
formance of the real system is obtained as opposed to the performance of a model
of the system. The drawback of such a solution is its experimental nature which
limits the number of codes analyzed and generally does not provide any methodology
for extrapolating the performance of an arbitrary code from the performance of the
benchmark codes. Furthermore, even when using very simple benchmarks, there is
no general method of correlating code characteristics with the performance observed.

Analytical and simulation modeling techniques find maximum applicability at the
system design phase where they facilitate prediction of system behavior long before
the actual hardware implementation. This helps in making judicious design decisions
that can avoid considerable investment of resources in an inefficient design. For
example, analytical models of processor—memory interconnection have been studied
in [86, 11, 18, 19]. Analytical models of application (or algorithm) execution on a
given architecture can also aid in asymptotic scalability studies [47, 42]. However,

hardware related parameters in such models need to be calibrated by experimental

measurements.

Owing to the diversity of architectural approaches of a multiprocessor, the develop-
ment of working models that can provide a true measure of the “actual” performance
of these machines under workloads of interest can be an extremely complex, if not
impossible, problem. Since a multitude of architectural and application parameters
jointly determine system performance and the modiﬁcation of some factors affects
others, it is not feasible to construct an elegant yet tractable analytical model that
encompasses all performance effects. Nondeterminism present in parallel program
execution on multiprocessors introduces an additional degree of complexity into the
performance measurement phenomenon. The dynamic run-time behavior of multi-
processor programs is impossible to capture accurately in analytical models.

In the face of the above difﬁculties, empirical results are the only reliable perfor-
mance measures [29]. This has led to the use of benchmark programs to characterize
and evaluate parallel computer performance (benchmarking). Although benchmark-
ing is widely acknowledged to be a difﬁcult and often controversial process [87, 97], it
also provides one of the few recognized means of acquiring useful performance infor-
mation about complex systems running complex tasks. The methodologies commonly

used in computer benchmarking and the associated pitfalls encountered are described

in [35].

 

Applications

, .............
I
I
l
: Hardware
I
I
I
I

.----------

 

 

 

Figure 1.1. Performance measurement levels

There are four levels in the hierarchy of performance measurements [85] as illus-
trated in Figure 1.1. The answer to the oft-asked question, “How fast is it?” depends
on the intended use of the performance data. At the lowest level lies the performance
of the hardware design. Determining this performance provides both a validation of
and directives for system software design. Only by understanding the strengths and
weaknesses of the hardware can system software designers develop an implementation
and user interface that maximizes the raw hardware potential available to the end
user. Given some characteristics of the available processing resources and the services
provided by the system software, users can develop algorithms that are best suited
to the computer system’s capabilities. Finally, the best mix of key algorithms will
maximize the performance of user applications.

A complete performance characterization requires not only an analysis of the sys-
tem’s constituent levels, it also requires both static and dynamic characterizations.
Static or average behavior analysis may mask transients that dramatically alter sys-
tem performance. A combination of static and dynamic characterizations is also
needed to understand the interactions between performance levels. Table 1.1 shows
a subset of the important performance measurements at three levels for three classes

parallel processing architectures.

Table 1.1. Performance level comparisons for three classes of multiprocessors

 

 

 

 

 

 

Level Vector Shared-memory Message-passing
processors multiprocessors multicomputers
Hardware Vector startup Memory contention Processor speed
Memory conﬂicts Network contention Communication
Memory-cache latency and
interaction bandwidth
System software Compiler Compiler OS support
Algorithm Vectorization Shared-memory access Communication
Inter-processor pattern
synchronization

 

 

 

 

 

Historically, benchmarking has been employed for system procurements. It will
certainly maintain its value in that arena as it expands to become the experimental
basis for a developing theory of supercomputer and multiprocessor performance eval-
uation. The number of benchmarks currently used is growing day by day. Every new
benchmark is created with the expectation that it will become the standard of the
industry and that manufacturers and customers will use it as the definitive test to
evaluate the performance of computer systems with similar architectures. A survey
of the common benchmarks in use today is provided in Section 1.2.

One of the key questions in benchmarking has to do with what kind of unit consti-
tutes the benchmark set. A number of general benchmarks such as Livermore Fortran
Kernels, NAS Kernels and the Linpack Benchmark have emerged during the past two
decades that are based on a collection of computation-intensive kernels extracted
from a. range of real application domains. Another benchmark called Whetstones, on
the other hand, is based on a collection of synthetic kernels. All these benchmarks
perform measurements at the “algorithms” level of Figure 1.1 and have one thing in
common—each component kernel in the benchmark is designed to stress a particular
aspect of system performance.

Discussions of benchmarking [35, 60, 117, 125] have lead to a growing recognition
that the most accurate information on a system’s aggregate performance is obtained
by making measurements on complete applications (applications-based benchmark-
ing). The underlying assumption here is that real engineering and scientiﬁc codes
stress and evaluate machines in a way that kernels and algorithms cannot. Efforts in
this direction include the Perfect, SPLASH and SLALOM benchmarks. Performance
measurements at the applications level capture and reﬂect the interactions that occur
within and between all the lower levels (Figure 1.1). Although this is indeed true,
these benchmarks provide useful measures of performance only to the particular set of
users that are represented by the benchmark applications. Because of the complexity
of designing a complete application program, when tests are done at this level rather
than on simpler units, the skill of the programmer may be a significant factor in the

performance. Some of the limitations with this approach are:

0 Complete applications are difﬁcult to port to a new architecture. Unless the
existing applications are modiﬁed and tuned to the new architecture, they may

not yield optimal performance.

0 The software technology for writing parallel programs is immature. It is unclear
how well programs written with today’s constructs will represent those that
might be written in the future, and what the implications of this are for the

effectiveness of evaluation studies performed today.

0 The available programs might not represent the best parallelization of the prob-
lem they solve, but only one that is reasonable and convenient to implement.
More signiﬁcantly, large-scale parallel processing may call for very different a1-

gorithms than those implemented on smaller machines today.

0 The relationships between applications and architectures take on new dimen-
sions with parallelism. The number of architectural variables is much larger,
making careful correlation of performance with code characteristics more difﬁ-

cult. -

Empirical studies based on carefully deﬁned benchmark experiments at all the levels
in Figure 1.1 can provide a hierarchical path to a complete deﬁnition of system per-
formance by extending our understanding of the incremental contributions made by
architecture, technology, compilers, operating systems, algorithms, and programming
implementations of physical problems.

Finally, there is the question of appropriate metrics to represent multiprocessor
performance. A single ﬁgure of merit such as MIPS (Millions of Instructions Per
Second) is meaningless in the context of the diverse CPU architectures available to-
day. The single number metric MFLOPS (Millions of Floating point Operations Per
Second) is more appropriate for scientiﬁc computations, but yet insufﬁcient. From
the end user’s standpoint, perhaps the desirable metric would be MRPS (Millions of
Results Per Second), although this metric would have no universal meaning. Usually,
different benchmark program measurements are summarized in order to ﬁnd the “av-

erage” performance of a computer. How to calculate these averages has been one of

the most confusing issues in performance evaluation [41, 117]. Siegel et al. provide

a detailed discussion of other metrics used for multiprocessor performance in [115].

1.2 Survey of Benchmarks

Benchmarks are standard programs used to evaluate the performance of a wide range
of computer systems. What distinguishes a benchmark from an ordinary program
is a general consensus of opinion within the industry and research circles that the
benchmark exercises a computer well. Common benchmarks fall into one of sev-
eral categories. Synthetic benchmarks are small programs especially constructed for
benchmarking purposes with the underlying assumption that the average character-
istics of real programs can be statistically approximated by a small program. They
do not perform any useful computation. Kernel benchmarks are code fragments ex-
tracted from real programs in which the code fragment is believed to be responsible
for most of the execution time. Application benchmarks go with the assumption that
complete real applications stress and evaluate machines in a way that kernels and code
fragments cannot. The most important advantage of reducing benchmarks to kernels
is that they may be rapidly ported to new computer architectures, whereas porting
a mature application would need a lot more effort. However, complete applications
provide the most accurate indication of performance.

The ﬁeld of multiprocessor benchmarking is still evolving and not yet mature.
The methodologies commonly used in supercomputer benchmarks and some of the
pitfalls encountered are examined by Dongarra et al. in [35]. Although there are a
wide variety of benchmarks available, some very site-specific, there is no consensus
yet on the most effective and acceptable multiprocessor benchmarks. We summarize

some of the more commonly used benchmarks in this section.

1.2.1 Synthetic Benchmarks

Whetstone. The Whetstone benchmark [27] was the ﬁrst program in the literature

explicitly designed for benchmarking. It is a synthetic program constructed with

nine small loops each containing statements of a particular type (integer arithmetic,
ﬂoating-point arithmetic, “if” statements, calls, etc.). It uses mostly global data and
has a high percentage of ﬂoating-point operations. Most of its execution time is spent
in mathematical library functions. The benchmark results are reported as MWIPS
(mega Whetstone instructions per second).

Dhrystone. This is another synthetic benchmark [123] that consists of 12 procedures
included in one measurement loop with 94 statements. It contains no ﬂoating-point
operations and a considerable percentage of its execution time is spent in string
functions. Unlike Whetstone, it uses very little global data and emphasizes data

locality. The benchmark results are given in Dhrystones per second.

1.2.2 Kernel Benchmarks

Linpack. This is a numeric benchmark [33] with a high percentage of ﬂoating-
point operations and no mathematical functions at all. More than 75 percent of its
execution time is spent in a 15-line subroutine (called saxpy in the single-precision
version and daxpy in the double-precision version). The results of this benchmark
are reported in MFLOPS.

Livermore Fortran Kernels. Also called the Lawrence Livermore Loops, this
benchmark [88] consists of 24 kernels (inner loops) of numeric computations from
different areas of physical sciences. The individual loops range from a few lines to
about one page of source code. They contain many ﬂoating-point computations and
a high percentage of array accesses. The program computes MFLOPS rate for each
kernel, for three different vector lengths.

NAS Kernels. This benchmark program [10] consists of approximately 1000 lines
of Fortran code, organized into seven separate tests each containing a loop that it-
eratively calls a subroutine. The subroutines have been extracted from a variety of
computational ﬂuid dynamics problems currently being worked on the NASA Ames
supercomputers. They all emphasize the vector performance of a computer system.

The performance is measured in MFLOPS.

10

1.2.3 Application Benchmarks

Perfect Benchmarks. Prompted by Kuck and Sameh’s proposal [69] and initiated
by a group of academic and industrial collaborators, the goals of this effort were to
deﬁne an applications-based methodology for supercomputer performance evaluation.
The Perfect Benchmarks [29, 17] consist of 13 programs drawn from a variety of
scientiﬁc and engineering ﬁelds with over 60,000 lines of Fortran source listing. The
methodology requires a set of baseline measurements followed by any number of
optimized measurements of each code.

SPLASH. Similar to the Perfect benchmarks, the Stanford Parallel Applications for
Shared-Memory (SPLASH) [116] is a suite of seven applications drawn from several
scientiﬁc and engineering problem domains. The applications are intended as a de-
sign aid for architects and software people working in the area of shared-memory
multiprocessing.

SLALOM. The SLALOM benchmark [5] solves a complete problem dealing with
“optical radiosity on the interior of a box”. It times input, problem setup, solution,
and output, not just the solution. It is the ﬁrst benchmark based on ﬁxed time rather
than ﬁxed problem comparison.

SPEC Benchmarks. Probably the most important current benchmarking effort is
SPEC [120] — the systems performance evaluation cooperative effort. Its goal is to
collect, standardize, and distribute large application programs that can be used as
benchmarks. The SPEC suite consists of 10 benchmark programs. The results are
given as performance relative to a VAX 11 / 780 using VMS compilers. A compre-
hensive number, the “SPECmark”, is deﬁned as the geometric mean of the relative

performance of the 10 programs.

1.3 Motivation and Problem Deﬁnition

There are two distinct activities [110] in evaluating any computer that are often not
distinguished in practice: system characterization and performance evaluation. The

goal of system characterization is to obtain a set of parameters that fully describes

11

the system behavior at some level of abstraction. The characterization parameters
spotlight the strong and weak points of the system they represent. Performance
evaluation, on the other hand, is the measurement of some number of properties
during the execution of a given workload. The properties measured may be the total
execution time to complete some job steps, the utilization of system resources, the
amount of parallel execution overhead, etc. It is important to note that the results
depend on, and are only valid for, the workload used in the evaluation.

Accurate performance characterization of a computer is crucial to the design of
effective algorithms for the system as it offers information on the sensitivity of the sys-
tem to various workload attributes. By providing a validation suite for performance
trends, it can guide the selection of appropriate values and tuning of important algo-
rithmic parameters. Characterization of uniprocessor systems have been undertaken
in [103] using a low-level machine architecture model and in [110] using a higher-level
Abstract Fortran Machine model.

The performance characterization of a multiprocessor system introduces a num-
ber of new considerations due to the presence of interactions between concurrently
executing processes. Inter-process communication, synchronization and contention
for shared resources are the primary sources of interference that inﬂuence a concur-
rently executing process. Therefore, in addition to describing the static behavior of a
single processor in isolation, multiprocessor performance characterization must also
incorporate some mechanism to represent the dynamic execution behavior of multi-
ple processors in the presence of these interactions. Further, the magnitude of the
interference encountered is a function of not only the number of processors but also
the parallel program structure and behavioral characteristics.

The well-known Amdahl’s Law [4] is one of the earliest attempts to address the
fundamental issue of parallel program performance. He qualitatively described the
gross features of a typical performance spectrum arising in supercomputers. He con-
sidered the overall performance of a machine that has two modes of computing (one

relatively slow, the other relatively fast) as a function of the time spent in each mode.

12

Ware [122] quantiﬁed the idea in the following model of multiprocessor performance:

t,+t,,

Speedup = m
8 P

(1.1)

where t, is the amount of time spent on serial parts of a program, t, is the amount
of time spent on parts of the program that can be executed in parallel, and p is the
number of processors used. The numerator in Eq. 1.1 denotes the execution time on a
single processor whereas the denominator denotes the execution time on p processors.
Buzbee [25] has pointed out that this model neglects the effect of multiprocessor
synchronization overhead. To correct this inadequacy, he proposed the additional
term o(p) in the parallel execution time, which is usually a monotonically increasing
function. However, he did not suggest any method for quantifying 0(p). Gustafson
[54] has recently demonstrated that the assumptions underlying Amdahl’s Law are
inappropriate for the current approach to ensemble parallelism and has reformulated
the law. Gelenbe [48, 49] has given a set of formulae that provide insight into the
effective speedup of parallel programs by taking into account the capacity of a program
to use its parallel structure effectively.

A three parameter (roo,n1/2,sl/2) description, introduced by Hockney [61, 63],
characterizes the performance of vector multiprocessors in terms of its vector startup
overhead and multiple instruction stream synchronization overhead. The parameter
r00 is the asymptotic rate of the vector operation for large vectors, n1/2 is the vector
length at which half the asymptotic rate is achieved, and 31/2 is the amount of useful
arithmetic that could have been done during the time taken for synchronization.
These three parameters were measured experimentally on a 2-CPU CRAY X-MP
machine in [62].

All the above models ignore the dynamic effects of communication and synchro-
nization on parallel program execution. More recently, Zhang [127] has presented a
timing model based on a modiﬁed Ware model that incorporates the various shared-
memory multiprocessor program execution effects into the sequential time component

t, of Eq.1.1. He calibrated t , and tp using experimental measurements on some matrix

13

computations. Although this study demonstrates the various multiprocessor effects,
it does not offer any method to deduce system behavior under other workloads. An-
alytical models for predicting multiprocessor performance on iterative algorithms in
terms of the speed of the processor, memory and the interconnection network have
been developed in [121, 28]. Statistical models for synchronous parallel algorithms
have also been proposed in [84]. But these models do not include the effect of memory
contention as a result of access patterns and mutual exclusion synchronization effects.

An experimental characterization technique for multiprocessor memory system
behavior was developed by Gallivan et al. [45] using a set of “load/ store” ker-
nels to deﬁne memory access patterns. This method was used to study the relation
of the Alliant FX/ 8 vector instruction set to its memory hierarchy. Although this
technique is very effective for observing the dynamic behavior of concurrent mem—
ory access streams, it is limited in scope and does not address the other sources of
performance degradation on a multiprocessor. Experimental study of memory access
contention has also been reported in [24]. Numerous comparative studies of multi-
computer/ supercomputer performance on speciﬁc application programs exist in the
literature [34, 82, 57, 32]. These studies, although interesting to read, frequently
provide only anecdotal information.

Using standard benchmarks to evaluate machine performance is a widely used
practice. Considerable effort has been expended to develop benchmark suites, as de-
scribed in Section1.2, that are considered to reﬂect real workloads [69]. Although
benchmarking is an excellent vehicle for “performance evaluation” (as deﬁned ear-
lier), there are a number of limitations to using it as an approach to “performance

characterization” :

0 Each benchmark is itself a mixture of characteristics, and doesn’t relate to a

speciﬁc aspect of machine performance.

0 They provide no insight as to which components of a given program workload

have the potential of being the bottlenecks and to what extent.

From the standpoint of the person engaged in the performance measurement activ-

14

ity, the use of a standard benchmark program suffers from one signiﬁcant limitation—
the lack of control over the benchmark characteristics. Selecting any standard bench-
mark as the basis for performance evaluation automatically establishes an associated
program workload that is built into the benchmark structure. Hence, it is not pos-
sible to experiment with changing individual parameters in the workload that affect
performance so as to determine optimal settings for such parameters for a given ar-
chitecture/ application combination. Such selective characterization of performance
along controlled performance dimensions is integral to the design and implementa-
tion of better algorithms. Upon identifying the most important parameters that have
signiﬁcant inﬂuence on system performance, we need to develop a simple model to
understand and a method to quantify the incremental effect of each of the parameters
on performance when they are observed separately. The method should also provide
means for observing how different parameters interact. Based on these results, we
can identify critical parameters and recognize performance bottlenecks.

Essentially, what is needed is a performance evaluation and characterization

methodology that includes the following functional components:

0 A ﬂexible benchmark workload generator that can be tailored to highlight the

performance of a system along selected dimensions.

0 A measurement framework that can incrementally capture and quantify both
the static and dynamic aspects of program behavior along the selected perfor-

mance dimensions.

0 A system characterization method that uses the measured quantities in a global

timing model to help predict performance trends.

In this research, we address the above problem and present a new approach to selective

performance evaluation and characterization for multiprocessor systems.

15

1.4 Objective and Scope of Research

The goal of this research is to explore the use of algorithm characteristics as an
abstraction that can help in designing benchmark sets that measure the effect of those
parameters which most signiﬁcantly inﬂuence multiprocessor performance. The ﬁnal
objective of such an exercise is to evolve a “system characterization” of the system
under test that can effectively guide the design of efﬁcient algorithms. The impact
of changing algorithmic parameters on algorithm performance can be predicted and
validated using the characterization data suite. Knowledge of expected performance
degradation of a multiprocessor program in advance, before actually writing it, helps
support an efﬁcient design and implementation methodology. The insight thus gained
helps users (and eventually compilers) understand why a given computation runs
slowly and how to redesign the algorithms to optimize performance.

We have focused on evaluations at the algorithm level, which means that the types
of conclusions that may be drawn relates to how well the structure of an algorithm
matches the capabilities of an architecture. Thus, the evaluations at this level do
not address the question of how the algorithm ﬁts into a complete task. However,
algorithms are more often readily available than complete tasks, and solutions to
complete applications are often constructed from a library of key algorithms. It will
therefore be of interest to understand what is being learned from architecture evalua-
tions performed at the algorithm level. Our approach will be to propose abstractions
by which this sort of evaluation can be facilitated. The objective is to make more sys-
tematic the way in which benchmark sets are selected. The approach proposed in this
research is intended to complement applications-based benchmarking as a method for
performance evaluation.

We have restricted the scope of our studies to multiprocessors supporting a shared
address space. The hardware architecture of the machine need not furnish a com-
mon shared-memory. The underlying programming model is assumed to be one using
shared-variables. This programming model is widely used and is evident from the over-

whelming dominance of shared-memory multiprocessors for general purpose parallel

16

programming both in the commercial and academic sectors. Examples of commercial
multiprocessors include the Encore Multimax, the Sequent Balance and Symmetry,
and the BBN GP1000 and TC2000 systems; among research prototypes are the NYU
Ultracomputer [51], the IBM RP3 [104], and the Illinois Cedar [44] machines. Fur-
thermore, a considerable effort in software development is designed to provide the
illusion of shared memory on multicomputers [26, 20, 79, 78, 108, 40, 22]. By re-
stricting our attention to a given class, we ﬁlter out some of the strong differences,
allowing ourselves to understand the performance within a class more precisely.

The execution time of a task on a multiprocessor may be nondeterministic on
account of queueing delays due to contention for shared resources such as memory or
communication channels, or to data-dependent computation times. Variations in exe-
cution times generally result in synchronization delays where one task has to await the
completion of other tasks. These synchronization delays are inherent in the structure
of the algorithm and limit the potential speedup of the parallel algorithm over a serial
algorithm. We distinguish between implicit and explicit synchronizations. Implicit
synchronization is caused by the contention for shared resources (shared memory, crit-
ical sections). Algorithms exhibiting only implicit synchronizations have been called
asynchronous [71]. Explicit synchronization mechanisms are normally used to enforce
precedence relations in synchronized algorithms. This thesis speciﬁcally addresses the
effect of implicit synchronizations in parallel algorithm execution.

Communication cost, synchronization overhead and the contention for shared re-
sources are recognized as the main sources of overhead present in multiple-processor
systems. The performance of a parallel program using shared-variables and exhibiting
only implicit synchronizations is strongly inﬂuenced along three major dimensions:
the distribution of shared-data over the memory hierarchy and the concurrent mem-
ory reference patterns to access them, mutually exclusive access to shared-data to
preserve consistency, and the presence of global synchronization barriers. Along each
performance dimension, the behavior of a given program is a complex function of a
number of architectural as well as application parameters. It is important to be able

to isolate and determine the effects of each of these components on overall system per-

17

formance. By increasing our ability to measure the pieces, combine their effects, and
relate their contributions to architectural and algorithmic characteristics, we enhance
our ability to model and predict performance in complex systems.

As discussed earlier, standard benchmark programs are not suitable for performing
the task of system characterization since we cannot isolate the effects of each of the
three performance factors when executing the benchmark workload. Although they
provide good indication of the overall system performance, a user does not have any
control on the benchmark characteristics. We need a ﬂexible benchmark workload
generator and a systematic measurement methodology to capture the incremental

contribution of each performance factor to the total parallel execution overhead. We

 

Experimental Performance
Characterization
. \

V

 

O

0' o
.0. ' e

_Q ° ' 9

 

 

 

 

 

 

 

 

 

 

 

 

   
 
   

Computation Benchmark Characterization
model ‘ workload parameters and
selection ,' generation calibration

 

Benchmark Workload
workload execution and
characterization measurement

 

 

 

 

 

 

Figure 1.2. Steps in the experimental performance characterization method

have developed an experimental performance characterization method based on the

18

construction of synthetic executable workloads. These workloads have the advan-
tage that they can be made parametric and hence ﬂexible in representing workload
characteristics. Although they have the disadvantage of possible lack of realism at
the applications level, they can be made to reﬂect the algorithm characteristics quite

accurately [121]. Our characterization technique consists of ﬁve distinct steps (Fig—

ure 1.2):

1. Parallel computation model selection.
To be universally applicable, the system characterization measurements must
be based on a uniform model of execution so that the results of an experiment
can be related to previous and future experiments. We consider a class of
structured multi-phase [91] iterative algorithms as our basis for characterizing
multiprocessor performance. Many engineering and scientiﬁc applications are
most frequently characterized as being highly iterative and adhere to this phase-

and-transition model.

2. Benchmark workload characterization.
The benchmark workload characterizer uses a hierarchical approach to construct
a variety of artiﬁcial workloads of interest using the parameters that most in-
ﬂuence the behavior of concurrent program execution. At the lowest level, it
uses a single grain of computation, called a unit grain, as the unit of parallel
workload speciﬁcation. The unit grains are assembled into the multi-phase par-
allel computation structure at a higher level thus incorporating the algorithmic

component into the workload.

3. Benchmark workload generation.
Assigning appropriate values to the attributes used to characterize a unit grain
creates synthetic workloads that are used as benchmarks for the characterization
process. Values assigned to the attributes may be constant quantities thus
creating invariant deterministic unit grain characteristics, or the attributes may
be treated as random variables of known probabilistic distributions thereby

producing stochastic unit grain behavior. The unit grain attributes are varied

19

in a controlled fashion to create parameter families that systematically traverse

the input parameter space.

4. Workload execution and performance measurement.
A family of workload emulation programs has been developed that use the
workload speciﬁcation to mimic the execution behavior of an actual program
that would demonstrate the same workload characteristics. Three sets of such
emulation programs have been designed corresponding to the three major per-
formance dimensions described earlier; each measures and quantiﬁes the perfor-

mance degradation resulting from overheads along its associated dimension.

0 Memory Access Degradation (MAD) kernels measure the overheads result-

ing from memory contention while accessing shared-data.

e Synchronization Access Degradation (SAD) kernels measure the overheads
resulting from synchronization operations and mutually exclusive access to

shared-data.

e Barrier Access Degradation (BAD) kernels measure the overheads result-
ing from the presence of synchronization barriers in parallel program exe-

cution.

The measurement framework allows for observation of interference between both

homogenous and heterogenous concurrent processes.

5. Performance characterization parameters.
Two performance metrics, unit grain eﬂiciency and interference, are introduced
to measure the relative performance of a workload as the number of parallel
processes increases. The performance of a given workload as the number of

processors vary is completely described by a set of six parameters — three

constants (R00,c1/2,f1/2) and three functions (t/Jm(N),¢,(N),i/)5(N)).

The usefulness of this methodology lies in its ability to selectively assess and char-
acterize a shared-memory multiprocessor using synthetic benchmarks whose char-

acteristics can be controlled by the person performing the evaluation. This is of

20

great practical importance to computer manufacturers as well as system and applica-
tion programmers alike. For researchers, it is an important exercise if lessons are to
be learned, particularly in the area of scalability. From a computer manufacturer’s
viewpoint, its use lies in evaluating a new system as soon as a prototype is running,
using the measured values to determine performance bottlenecks and making architec-
tural reﬁnements. The measurements also provide performance data for competitive
bidding. The goal for system and application programmers, on the other hand, is
understanding how the characteristics of an algorithm relate to the constraints of
an architecture. Further, most compilers for multiprocessor systems available today
which feature automatic vectorization and / or parallelization incorporate explicitly or
implicitly an econometric model of the processor for which they are targeted [112].
This model is used to evaluate when particular optimization choices should be invoked.

The performance data obtained can be used to calibrate such models accurately.

1 .5 Thesis Outline

The rest of this thesis is organized as follows. In Chapter 2, an overview of the
organization of shared-memory multiprocessors is presented. The factors that limit
parallelism on these machines along the three performance dimensions discussed in
the previous section are examined in detail. A summary of the architectural features
of the multiprocessor systems used for our experiments is also provided. The perfor-
mance characterization framework and its components are described in Chapter 3. In
Chapter 4, the use of the MAD kernels to evaluate the performance of shared-memory
accesses and quantify the losses in parallelism due to memory contention is addressed.
In Chapter 6, the study of performance losses due to inter-process synchronization
using the SAD kernels is presented. The measurement of the impact of synchroniza-
tion barriers on parallel execution performance using the BAD kernels is described in
Chapter 6. Finally, Chapter 7 summarizes the major contributions of this research

and provides directions for future research.

CHAPTER 2

BACKGROUND

Although multiprocessor systems do hold the potential for solving problems with vast
computational requirements, it is by no means obvious that a particular algorithm will
perform well on a given machine. Access to common memory is one of the key factors
in the performance of shared-memory multiprocessors. Large-scale multiprocessors
can encounter signiﬁcant performance degradation due to a number of factors related
to memory sharing. Contention for shared resources such as interconnection networks,
memory modules and shared-variable locations, serialization of execution due to mu-
tually exclusive access to shared-writable data, and synchronization barriers are all
factors that limit the performance of parallel program execution on shared-memory
multiprocessors. It is important to understand how these performance penalties de-
pend on the various architecture and algorithm design parameters.

In this chapter we review the shared memory organization, the primary factors
limiting parallel execution performance and the techniques used to reduce the impact
of contention in shared-memory multiprocessors. A summary of the architectural

features of the multiprocessor systems used in our experiments is also given.

2.1 Multiprocessor Memory Organization

In multiprocessors with global shared memory, parallel memory modules must be
used to provide sufﬁcient bandwidth for the processors. Furthermore, a suitable

interconnection network must establish the effective sharing of the memory modules

21

22

between the processors. Memory access latency can become a critical problem in large
systems when the distance between parts of the system is such that the time required
for data transfer is excessive. In small-scale multiprocessors such as the Alliant FX / 8
[102] and the Sequent Symmetry [80], all processors are attached to a single bus which
connects them to a global memory. Memory latency is reduced by associating private
caches with each processor. Cache coherence is enforced by protocols relying on a
fast broadcast mechanism.

For large-scale multiprocessors, a single bus fails as an effective interconnect as its
ﬁxed bandwidth limits its scalability. Technology limitations make it too expensive
to provide full hardware connectivity between all processors and memory modules.
Therefore, large-scale multiprocessors are built with intermediate connectivity using
interconnects such as multi-stage interconnection networks as in the BBN TC2000 [15]
and the IBM RP3 [104] systems, point-to—point connections as in the Intel Touchstone
DELTA [64], and hierarchical interconnects as in the Kendall Square Systems KSRl
[66] and the DASH multiprocessor [77]. Since broadcasting for cache coherence on
these interconnects is cumbersome, larger systems either provide cache consistency
using a directory-based protocol (as in the DASH project) or provide caching in
a restricted fashion under software control (as in the BBN TC2000 and IBM RP3
systems).

One solution to the memory latency problem on large multiprocessors is to build
a system in which not all memories are equally distant from all processors, thus
allowing data of special interest to a particular processor to be proﬁtably located near
it. Distributing a variety of memories around the system (hierarchical organization)
can minimize the average data access time and thereby improve system performance.
Other approaches to reducing memory latency where the interconnection network
itself is a component of the memory hierarchy have been explored in [90]. The number
of shared memory modules has a great impact on memory contention. If the number
of memory modules is less than the number of processors, memory contention will
occur if all processors issue a shared memory request at the same time.

Multiprocessor systems differ in their design as to how the shared memory mod-

23

Global Memory

   
 

..........................
..........
...........

 

 

 

 

 

 

 

 

 

   

 

 

 

 

 

 

L Global Interconnection Network
[Local interconnection ] [ Local interconnection
processor node processor node

Figure 2.1. Organization of memory hierarchy in shared-memory multiprocessors

ules are distributed over the memory hierarchy and how they provide hardware and
software so that each processor sees a single address space in this hierarchy. Typically,
a memory module is either local, meaning that it attaches to one processor, or global,
meaning that it is only accessible by sending requests through the interconnection
network. A request from a processor to its local memory does not cause any network
trafﬁc. This kind of memory organization is depicted in Figure 2.1. Such a memory
organization is motivated by price/ performance reasons similar to the cache/ main
memory hierarchy prevalent in uniprocessors. Note that in small-scale multiproces-
sors, the memory local to each processor includes only its cache. Each processor node
in Figure 2.1 could also be a cluster of nodes with each node having access to some
local memory and cluster-global memory in addition to the system-global memory
modules represented by G. The Illinois Cedar [44] system, for instance, has such a
cluster organization.

Local memory modules (such as M in Fig. 2.1) can be divided further into shared

24

and private modules. Shared memory modules are accessible by all processors,
whereas a private memory module is accessible only by the processor to which it
is attached. Global memory modules (such as G in Fig. 2.1) are implicitly shared and
private memory modules are implicitly local. Consequently, there are three types of
memory modules: local/ private, local/shared, and global/shared. For example, the
BBN TC2000 has only local/ shared memory modules, and the IBM RP3 can be set
up to have both local / private and local / shared memory modules.

Private memory provides a means for reducing network trafﬁc. Allocating private
data structures to private memory means that requests for such data structures do
not cause network trafﬁc and occur with minimum latency. However, memory latency
incurred in accessing shared data structures depends on where the data is located with
respect to the requesting processor. The location-dependent variation in the latencies
of shared-memory modules results in a non-uniform memory access time thus making
the issue of data distribution over the memory hierarchy a critical consideration for
performance. As an example, a remote memory access takes four times longer than a
local memory access on the BBN TC2000. Architectures such as the KSRl support

dynamic migration of data to the point of demand.

2.2 Limitations to Parallelism

Communication, synchronization and contention for shared resources are recognized
as the three primary sources of overhead in parallel program execution on multiple-
processor systems. We consider only multi-phase asynchronous parallel algorithms
constructed using the shared-memory programming model in this research. Since
all communication between concurrent processes in such algorithms occurs through
globally shared variables, the memory conﬂicts encountered in accessing the shared
variables is critical to overall performance. The amount of memory contention, and
the consequent performance degradation, depends not only upon the characteristics
of the memory hierarchy and the distribution of shared-data over the hierarchy, but

also on the characteristics of the data reference patterns and the interaction between

25

the two.

To ensure the consistency of concurrent updates to shared data, conﬂicting ac-
cesses must be protected within critical sections. In other words, a fundamental form
of synchronization necessary for asynchronous parallel algorithms is mutual exclu-
sion. Another form of synchronization commonly used by multi-phase algorithms
to demarcate the individual phases is barrier synchronization. Barriers enforce the
arrival of all participating processes at a point before any one of them can proceed
further. Both critical sections and barriers induce sequential components into the
execution proﬁle of an asynchronous parallel algorithm thus resulting in loss of par-
allelism. Moreover, inefﬁcient implementations of the mutual exclusion and barrier
operations (in hardware or software) could also lead to performance degradation. In
the following paragraphs, we discuss how each of these potential sources of loss in
parallelism is affected by design choices and what techniques have been developed to

minimize their impact.

2.2.1 Memory Access Contention

Distance is one reason for memory reference delays. A second reason is contention,
which consists of both network contention and memory contention. Multiprocessor
applications usually require shared data areas appropriately distributed over the mem-
ory modules. Memory conﬂicts may occur when two or more processors attempt to
gain access to a shared resource along the processor—to—memory path simultaneously.
The effect of memory conﬂicts, referred to as memory interference, may decrease the
execution rate of the processors. We describe below the factors that cause memory

access conﬂicts and contribute to performance penalties.

Contention for processor—to—memory path

Processors executing concurrently contend not only for memory, but also for the path
to memory. There are three principal ways of interconnecting processors and mem-
ory modules: bus, crossbar and multistage network. The bus, by its very nature,

provides a common route shared between all processors to gain access to a global

26

memory space, thus enforcing sequential-access to the shared memory. The high-
performance bus systems of today (e.g., the Sequent System Bus [114], the Encore
Nanobus [38]) employ a split-transaction protocol whereby multiple memory access
requests are pipelined onto the bus before a single memory transaction proceeds to
completion. As a result, the bus capacity can be fully utilized if the memory refer-
ence pattern can constantly keep the bus busy. The data transfer capacity between
processors and memories is determined by the bandwidth of the bus, and is therefore
constant. This limits the number of processors that can be usefully incorporated into
such a system, and hence ﬁxes an upper limit on performance. Crossbars scale up
linearly in terms of performance, but their major shortcoming is the cost and size
which is proportional to the square of the number of interconnected components.
Multistage networks provide multiple parallel paths to memory, but processors
may contend for paths through the network. Such paths consist of switches at each
stage of the network and links between switches in different stages. The switches of a
multistage network may be blocking or nonblocking. Blocking switches have buffers
to hold messages waiting while some other message is using the switch. Nonblocking
switches reject all but one of the conﬂicting requests so that no queues are formed.
This distinction has important implications for system performance as shown by sim-
ulation studies conducted as part of the IBM RP3 project. These studies [105] show
that small nonuniformities in memory reference patterns can lead to severe degrada-
tion of overall system performance due to some memory modules becoming hot. Such
nonuniform patterns resulted in a phenomenon called tree saturation, where trafﬁc to
the hot memories queued up in the switches and interfered with all other trafﬁc. This
saturation effect propagates back through the network, as shown in Figure 2.2, fan-
ning outward from the hot memory module in a tree-like fashion. This problem can be
partially resolved by combining networks [105, 75]. On the other hand, nonblocking
switches, by rejecting all but one of the conﬂicting memory requests, avoid the phe-
nomenon of tree saturation [119] so that degraded performance is experienced only by
the processors that access the hot memories. Thus, the design and implementation of

the interconnection network have a profound effect on the processor—to—memory-path

27

  
  
  

   

 

 

   
 

   
 

   
 

      
         
      
   

 
 

  
        
   

    

 

 
  
   

 

 

Processors Maid-stage switch network Memories
P v E: E} '9' [:3 E 7 = M
p M - M = = [A = M
P 'g' = — = v = M
P .5“: ”L1! = ”LA! = M
P s: = l = =- = l 7 "°‘:
ul-I‘ Habit-IN ._. ”9°
P {3 7; = v = M
P [AS ’A t: a a M
Cl Queue - Saturated queue
CI Switch buffer I Saturated buffer

Figure 2.2. Tree saturation as a result of hot spot accesses over a multistage inter-
connection network

delay experienced by memory accesses in large-scale multiprocessor systems.

Contention for memory module

Even if the interconnection network meets the bandwidth requirements of the
processor-memory trafﬁc, memory contention can still cause a problem if the
processor-memory trafﬁc concentrates on a small number of memory modules. There-
fore, it is essential to consider how data structures are allocated to the shared memory
modules. A memory module can service only one request at a time (assuming multi-
port memories are not used). This causes multiple simultaneous requests to the same
memory module to be serialized resulting in loss of parallelism.

Memory address interleaving is a technique [73, 99] used to reduce the effective
memory access time and, hence, increase memory bandwidth by attempting to dis-
tribute the concurrent memory request streams from multiple processors evenly across

multiple memory banks. Two broad classes of interleaving schemes used are modulo-

28

interleaving and random-interleaving. In the former scheme, a word with physical
address ,6 is mapped to the bank address 3 (modulo M), where M is the number
of memory modules (assuming a single bank per memory module) and is called the
degree of interleaving. The address format and address distribution for such ﬁne
interleaving is shown in Figure 2.3(a).

Usually processors access memory in the form of blocks (or cache lines if processor
cache is present). With ﬁne interleaving, the transfer of each word requires the
establishment of its own path from the processor to each memory module. In order
to maximize the amount of data transferred from a memory module during an access,
many of the multiprocessors today increase the granularity of interleaving from a
single word to several consecutive words, say g (equal to the cache line size). Thus,
every successive block of g words are now interleaved across the memory modules
instead of a single word, as shown in Figure 2.3(b). If multiple banks are used
per memory module, then addresses can be ﬁnely interleaved across the banks of a
memory module and consecutively among modules (shown in Figure 2.3(c)). Each
module can now transfer a block at a time thus increasing memory bandwidth. This
kind of coarse interleaving works quite well when most reference sequences address
successively numbered memory modules.

If the ratio between the time required to issue a request and the time required to
service a request is r, then a factor of f = min(M, r) increase in memory performance
is obtained by allowing all the memory modules to operate in parallel. However, when
the sequence of addresses does not access successive memory modules, as is the case
in many scientiﬁc applications, then the gain in performance can be signiﬁcantly
less. The random interleaving techniques [107, 124, 74, 98] attempt to overcome
this drawback by employing various methods to randomize the consecutive memory
addresses issued by a processor. Most of these approaches involve logical operations
on carefully selected address bits to effect the randomization.

Address skewing is yet another technique [55] that has been used in improving the
memory bandwidth in applications involving arrays. In these methods, the starting

positions in memory of successive rows of an array are displaced relative to one another

29

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

— — r— —l T T "—3" T
0 1 2 3
1 5 9 13
4 5 6 7 2 6 10 14
T Li— T T LJr ”r ‘11— ‘f‘
I - Memory Controller - I L - Memory Controller - I
Memory address distribution Memory address distribution
I Offset I Module number I I Module number I Offset I
Address Format Address Format
(8) (b)
module module module module
0123 8910111617181 242227
4567 1211415 20212223 282 31
I - - Memory Controller - l [

 

Memory address distribution

 

I Module number I Offset I Bankii I

Address Format
(C)

Figure 2.3. Memory address interleaving techniques: (a) Fine interleaving with se-
quential assignment across modules (one bank per module); (b) Coarse interleaving
with sequential assignment within module (one bank per module); (c) Mixed scheme
with ﬁne interleaving among banks of a module and coarse interleaving among mod-
ules (multiple banks per module)

30

by a ﬁxed distance (skewed) such that several subvectors of the array can be accessed
without conﬂict.

The ratio of the memory cycle time to the interconnection network cycle time is
a critical factor in the service demand placed on the memory modules. Address and
data buffers are, sometimes, used locally in each memory module to hold pending
memory requests thus eliminating them from the contention for the interconnection
network. Buffering is also used so that transient nonuniformities which occur in some
access patterns do not degrade performance [55]. The depth of buffering provided at

each module determines the extent to which memory access performance suffers.

Contention for memory location

In multiprocessor with a single address space, there are situations in which many
processors must access a single memory location. One typical example is the case of
memory locations holding synchronization variables like mutual exclusion locks which
are used to ensure exclusive access to a shared data or to a critical section of code.
If many processors need to access the resource controlled by the lock at about the
same time, there is a high degree of contention for the memory location of the lock
due to the highly repetitive access to the lock caused by busy-wait spinning [6, 52].
Depending on the implementations used for the busy-waiting mechanism, differing
degrees of memory and interconnection network contention may result introducing
performance bottlenecks that become markedly more pronounced as architectures
scale.

Both hardware and software techniques have been explored to reduce the im-
pact of such contentions for a shared memory location. Proposals for multistage
interconnection networks that combine concurrent accesses to the same memory lo-
cation [109, 104, 51], software combining techniques [126], multistage networks that
have special synchronization variables embedded in each stage of the network [65],
and special-purpose cache hardware to maintain a queue of processors waiting for the
same lock [76, 50] are among the many hardware solutions suggested for this problem.

Software solutions for scalable synchronization in shared-memory multiprocessors us-

31

ing carefully designed data structures and their appropriate placement in the memory

hierarchy have also been implemented and tested [6, 52, 89].

Maintenance of data coherence

Memory incoherence (inconsistent copies of data) is another serious problem in mul—
tiprocessors with global memory and memory (or cache) that is local to each cluster
or processor. The coherence problem is caused by the existence of replicated copies of
a shared memory block at different levels of the shared memory hierarchy. This can
introduce inconsistency if special arrangements are not provided to detect when one
copy is modiﬁed. Note that inconsistency can occur only for shared, writable mem-
ory blocks. Read-only or nonshared data can always be safely cached or replicated
without precautions. Many multiprocessor systems (such as the Encore Multimax,
the Sequent Symmetry) provide additional hardware to automatically enforce data
coherency among multiple shared copies of a datum.

Stenstrém [118] has surveyed a number of proposed cache coherence schemes for
maintaining data consistency in shared-memory multiprocessors. The two most popu—
lar approaches are the snoopy cache protocols that rely on a broadcast interconnection
medium such as a bus, and directory-based protocols [3] used on other general intercon-
nection networks. The data coherency mechanism may add an overhead component

to the access time of shared-writable data thus degrading performance.

2.2.2 Spin Locks and Mutual Exclusion

Synchronization is a fundamental concept in parallel programming because it pro-
vides the basis for cooperation of tasks in a program and controls access to shared
resources. In the shared-variable programming model, processors communicate by
sharing data structures. Since each processor has equal access to the shared memory,
some method for ensuring mutual exclusion—the logically atomic execution of oper-
ations (critical section) on a shared data structure—is required. Consistency of the
data structure is guaranteed by serializing the operations done on it. Synchronization

constructs can be divided into two classes: blocking constructs that deschedule wait-

32

ing processes, relinquishing the processor to do other work, and busy-wait constructs
in which processors repeatedly test shared variables to determine when they may
proceed. Busy-wait synchronization is preferred over scheduler-based blocking when
scheduling overhead exceeds expected wait time, when processor resources are not
needed for other tasks, or when blocking is inappropriate or impossible (for example
in the kernel of an operating system).

One of the most widely used busy-wait synchronization constructs is a spin lock.
Spin locks provide a means for achieving mutual exclusion and are a basic build-
ing block for synchronization constructs with richer semantics, such as semaphores
and monitors. Spin locks are ubiquitously used in the implementation of parallel
operating systems and application programs. Since pure software mutual exclusion
is expensive [72], virtually all shared-memory multiprocessors provide some form of
hardware support for making mutually exclusive accesses to shared data structures.
This support usually consists of instructions that atomically read and then write a sin-
gle memory location. Atomic instructions serve two purposes. First, if the operations
on the shared data are simple enough, they can be encapsulated into single atomic
instructions [59]. Mutual exclusion is directly guaranteed in hardware. If a number
of processors simultaneously attempt to update the same location, each waits its turn
without returning control back to software. Second, if the critical section requires
more than one instruction, then a spin lock is used to guard the critical section and
atomic instructions are used to arbitrate between simultaneous attempts to acquire
the lock. If the lock is found busy, then waiting is done in software.

Spin locks are generally employed to protect very small critical sections, and may
be executed an enormous number of times in the course of a computation. Unfortu-
nately, simple approaches to busy-waiting tend to produce large amounts of memory
and interconnection network contention thus exhibiting very poor performance. With
an ill-designed spin lock, spinning processors can slow other processors doing useful
work including the one holding the lock by consuming communication bandwidth.
As a consequence, the overhead of busy-waiting synchronization, referred to as lack

interference, is widely regarded as a serious performance problem.

33

When many processors busy-wait on a single synchronization variable, they cre-
ate a hot spot that is the target of a disproportionate share of the network trafﬁc.
Pﬁster and Norton [105] showed that the presence of hot spots can severely degrade
performance for all trafﬁc in multistage interconnection networks, not just trafﬁc due
to synchronizing processors. Agarwal and Cherian [2] have investigated the impact
of synchronization on overall program performance by simulations of benchmarks
on a cache coherent multiprocessor. Their study indicates that memory references
due to synchronization cause cache line invalidations much more often than non-
synchronization references. In order to alleviate these performance concerns, modern
multiprocessors generally incorporate sophisticated atomic operations into their archi-
tectures, permitting faster and more efﬁcient implementation of synchronization prim-
itives. Particularly common are "various Fetch-And-Q operations [67] which atomi-
cally read, modify, and write a memory location. Fetch-And-d) operations include
Test-And-Set, Fetch-And-Store (swap), Fetch-And-Add, and Compare-And-Swap.

More recently, there have been proposals for multistage interconnection networks
that combine concurrent accesses to the same memory location [51, 104, 109], multi-
stage networks that have special synchronization variables embedded in each stage of
the network [65], and special-purpose cache hardware to maintain a queue of processes
waiting for the same lock [50, 76]. The principal purpose of these hardware primitives
is to reduce the impact of busy waiting. Several software techniques developed of late
have also achieved a similar result. By distributing the synchronization data struc-
tures over the shared-memory hierarchy appropriately, it can be ensured that each
processor spins only on locally accessible locations, locations that are not the target
of spinning references by any other processor. All software approaches to efﬁcient
spin lock implementation have adopted this philosophy in one form or the other. The
implication of these software techniques is that efﬁcient synchronization algorithms
can be constructed in software for shared—memory multiprocessors of arbitrary size.
Special-purpose synchronization hardware can offer only a small constant factor of
additional performance for mutual exclusion [89].

We describe brieﬂy several spin lock implementations that have been proposed.

34

Each lock implementation uses a hardware supported atomic operation to invoke
mutual exclusive access to the shared lock variable. However, they differ in the
frequency with which the shared lock variable is polled, and the amount of network

trafﬁc generated as a result of busy-waiting.

Simple Locks

The simplest mutual exclusion lock employs a polling 100p to access a shared variable
that indicates whether the lock is held. Based on what operation is used to poll the

shared lock variable there are two possible implementations:

0 Spin on Test-And-Set: Each processor to repeatedly executes a
Test-And-Set instruction until it succeeds at acquiring the lock. The principal
shortcoming of the test-and-set lock is contention for the shared lock variable.
Each waiting processor accesses the single shared ﬂag as frequently as possible,
using relatively expensive read-modify-write (Test-And-Set) instructions. The
result is degraded performance not only of the memory bank in which the lock

resides but also of the processor-memory interconnection network.

0 Spin on Read (Test-And-Test-And-Set): Fetch-And-d> instructions can be
particularly expensive on cache-coherent multiprocessors since each execution
of such an instruction may cause many remote invalidations. To reduce this
overhead, waiting processors poll with read requests during the time that the
lock is held. As a result, spinning is done in the cache without consuming bus or
network cycles. Once the lock becomes available, some fraction of the waiting
processors detect that the lock is free and perform a test-and-set operation of

which exactly one attempt succeeds.

Collision Avoidance Locks

The primary factor responsible for the poor performance of the simple lock approaches
is the high degree of collisions among concurrent lock acquisition attempts. Thus, if

each waiting process delays an amount of time before rechecking and attempting to

35

obtain the lock, then the number of unsuccessful Test-And-Set instructions and the
resulting reads by other waiting processes can be reduced. There are two possible

variations:

0 Delay-after-release: This variation waits for the lock to be released before
delaying. If some other processor acquires the lock during this delay, then the
processor can resume spinning; if not, then the processor can try the test-and-
set, with a greater likelihood that the lock will be acquired. Polling for the
lock release is only practical for systems with per-processor coherent caches.
On other systems, processors would consume communication bandwidth if they

were to spin reading memory while waiting for the lock to be released.

0 Delay-between-reference: An alternative approach is to insert a delay be-
tween successive polls of the lock. This can be used on architectures without
coherent caches or with invalidation-based coherence to limit the communica-

tion bandwidth consumed by the spinning processors.

The mean delay can be set statically or dynamically using exponential backoff
techniques (similar to the Ethernet exponential backoff for CSMA networks) to adapt

to varying conditions.

Ticket Locks

Ticket locks reduce the number of Fetch-And-Q operations to exactly one per lock
acquisition. They ensure FIFO service by granting the lock to processors in the
same order in which they ﬁrst requested it. The lock consists of two counters, one
containing the number of requests to acquire the lock, and the other the number
of times the lock has been released. A processor acquires the lock by performing a
Fetch-And-Increnent operation on the request counter and waiting until the result
(its ticket) is equal to the value of the release counter. Contention due to polls of
the release counter can be reduced by introducing a delay on each processor between
consecutive probes of the counter. In this case, however, exponential backoff is clearly

a bad idea. Since processors acquire the lock in FIFO order, overshoot in backoff by

36

the ﬁrst processor in line will delay all others as well, causing them to backoff even
farther. Experiments conducted by Mellor-Crummey and Scott [89] suggest that a
reasonable delay can be set proportional to the difference between a newly-obtained

ticket and the current value of the release counter (proportional backoff).

Tournament Locks

Another approach to reducing contention for a single shared lock variable is to have
a tree of locks of radix b and height h. The tree forms a tournament wherein winners
of leaf lock contests become contestants at the next level. The winner of the root
lock has permission to enter the critical section protected by the tree of locks. Each
process uses its process identity to choose a random path from the root to a leaf
lock. The process may contend only for locks along that path. While every process
may contend for the root lock, the number of processes eligible to contend for a lock
decreases by the radix of the tree at each level (bh) as we proceed towards the leaves.
Thus, contention at the leaf locks can be made arbitrarily small as the number of

leaves approaches the number of processes.

Queuing Locks

In a queue lock, each arriving processor enqueues itself and then spins on a separate
ﬂag. When the processor ﬁnishes with the critical section, it dequeues itself and
nudges the next processor in the queue. This permits the hand-off of the lock to
be free of contention. The trick is for each processor to use an atomic operation to
obtain the address of a location on which to spin. This class of locks is characterized
by FIFO ordering of lock acquisitions and, if the spin location of each processor is
selected properly, then, a constant bound on the number of network transactions per
lock acquisition.

The best implementation varies somewhat among architectures. With distributed-
write cache coherence, processors can all spin on a single counter. To release the lock,
a processor simply writes its sequence number into the counter. Each processor’s

cache is updated, directly notifying the next processor in line with a single network

37

transaction. With invalidation-based coherence, each processor should wait on a ﬂag
in a different cache block. Only two bus or network transactions (an invalidation and
a read miss) are needed to signal the next processor in line. Similarly, on a multistage
network without coherent caches, each ﬂag should be placed in a separate memory
module.

Based on the data structure chosen for the queue of spinning processors, the

queuing locks can be classiﬁed as array-based or list-based.
e Array-based queuing locks
o List-based queuing locks

Anderson [6] has developed an array-based method of queuing busy-waiting proces-
sors in shared memory that requires only a single atomic operation per execution
of the critical section. The queue is implemented as a circular array of ﬂags on
which busy-waiting processors can spin. Each arriving processor does an atomic
Fetch-And-Increment to obtain a unique sequence number, which determines a lo-
cation in the array (flags) on which it can spin thus enqueuing itself. When a
processor ﬁnishes with the lock, it taps the processor with the next highest sequence
number; that processor now owns the lock. Since processors are sequenced, no atomic
read-modify-write instruction is needed to pass control of the lock. A similar array-
based queuing lock has also been proposed by Granuke and Thakkar [52].

A queuing lock wherein the queue of spinning processors is structured as a linked-
list was devised by Mellor-Crummey and Scott [89]. Their technique works equally
well, requiring a constant number of network transactions per lock acquisition, on
machines with and without coherent caches. It requires an atomic Fetch-And-Store
(swap) instruction and beneﬁts from the availability of the Compare-And-Swap in-
struction. Without Conpare-And-Swap, the guarantee of FIFO ordering of lock ac-
quisitions is lost introducing the theoretical possibility of starvation, although the

lock acquisitions are likely to remain very nearly FIFO in practice.

38

2.2.3 Synchronization Barriers

In addition to the spin-lock, barrier synchronization is the other most important
mechanism for coordinating parallel processes. A barrier deﬁnes a logical point in the
control ﬂow of an algorithm at which all processes must arrive before any is allowed
to proceed further. Barriers are commonly employed when an algorithm consists
of several distinct stages, each of which has internal parallelism but which must be
performed in strict sequence without overlap. A barrier is clearly one of the most
deleterious forms of synchronization, since it requires in effect that every process
communicate with every other process. Additionally, since all processes must wait at
the barrier until the last arrives, the effects of ﬂuctuations in process execution time
or imperfect load balancing are maximized.

A processor typically performs the following three steps at a barrier:
1. Marks itself as present at the barrier (entry phase).
2. Waits for all other participating processors to arrive at the barrier.

3. After all participating processors have arrived, it proceeds past the barrier (exit

phase).

Many algorithms exist for performing barrier synchronization in software [81, 21,
58]. Careful implementation of some of these algorithms are found to scale well
to large-scale multiprocessors without the contention for synchronization operations,
referred to as barrier interference, becoming a signiﬁcant problem [89]. Barrier algo-
rithms can be distinguished [8] by three features: the depth of the barrier (linear or
logarithmic), the barrier scheduling mechanism (static or dynamic), and the type of
exit phase (symmetric entry and exit phases, or broadcast exit).

Linear barriers are most commonly implemented using centralized counters to
keep track of the number of processors that have arrived at the barrier. Each pro-
cessor incurs a ﬁxed amount of overhead accessing the shared counter, so the total
overhead of such barriers is linear in the number of processors. Logarithmic barriers

include the software combining tree barrier [126], the butterﬂy barrier [21] and the

39

dissemination barrier [58]. In the butterﬂy and dissemination barriers, synchronizing
P processors is accomplished in [log2 P] .stages of [P/ 2] two-processor synchroniza-
tions each. In a tree barrier, groups of processors synchronize with each other, and
one processor out of each such group goes on to synchronize with the next higher level
group, and so on. Although in a logarithmic barrier each processor performs 0(log P)
synchronization operations (versus 0(1) for the linear barrier), these synchronizations
can be overlapped in machines with parallel processor—memory networks, resulting in
total barrier overhead that is only logarithmic in the number of processors. In bus—
based machines, linear barriers are more efﬁcient than logarithmic barriers because
fewer total bus accesses need to be performed (assuming the bus is the limiting factor
on performance).

In statically scheduled barriers, processors update synchronization variables in an
order predeﬁned at compile or load time, whereas in dynamically scheduled barriers,
processors proceed in the order that they arrive at the barrier. Therefore, dynami—
cally scheduled barriers require either explicit software locks (such as Test-And-Set),
or more complex atomic read-modify-write operations such as Fetch-And-Add. Stat-
ically scheduled barriers do not incur the overhead of software locks, but also cannot
take advantage of the “skew” in processor arrival times where some processors can
start synchronizing early.

In the entry phase of a barrier, processors report their arrival by updating some
shared state information. In the exit phase, processors exit the barrier after determin-
ing that all other processors have arrived. Separate entry and exit phases are required
if the barrier is to be reused, in order to properly reinitialize the synchronization vari-
ables. In barriers with symmetric entry and exit phases, similar operations are used in
both phases. In barriers with broadcast exit, the last processor to complete the entry
phase broadcasts this information to all other processors. Barriers with broadcast
exit are more efﬁcient than symmetric barriers because they require fewer memory
operations on shared variables. However, they also require more local storage at each
processor.

Many research efforts have also focused on hardware barrier synchronization tech-

40

niques on the premises that a 0(log P) growth in synchronization delay of software
approaches prevents the exploitation of ﬁne-grain parallelism. The Burroughs Corp.
proposal for the Flow Model Processor (F MP) [83] included the ﬁrst detailed de-
scription of a hardware implementation of barrier using the equivalent of a massive
“AND” gate. Another scheme developed in [106] consists of a hardware module with
bit-addressable registers R(i), (i = 1,2,...,P), one associated with each of P pro-
cessors, an enable switch, logic to test for all zeroes (all processors have reached the
barrier), and a barrier register BR. The “fuzzy” barrier scheme of [53], also sup-
ported in hardware, is basically a delayed barrier ﬁring mechanism where the actual
wait may occur several instructions after a processor indicates that it has encountered
a barrier. In all these schemes, all physical processors in the machine were considered
to be involved in each barrier synchronization. More recently, the “barrier MIMD
architectures” proposed in [100] support an arbitrary subset of the processors to be

barrier synchronized.

2.3 Target System Architectures

We have used two shared-memory multiprocessors with very different shared memory
organizations, namely a 26-node Sequent Symmetry S81 and a 45-node BBN TC2000,
to illustrate our experimental characterization methodology. Two older generation
systems, a 24-node Sequent Balance 21000 and a 96-node BBN GP1000, were also
used in some of our early experiments. These systems were selected more because of
the convenience of access than anything else. Of these, the BBN GP1000 system is
installed at Michigan State University whereas the remaining systems are installed
at the Advanced Computing Research Facility of the Argonne National Laboratory.
In this section, we brieﬂy describe and compare the salient features of these sys—
tem architectures that are relevant to the interpretation of the experimental results
obtained.

The Sequent Symmetry S81 [114] is a bus-based shared-memory multiprocessor,

belonging to the Uniform Memory Access (UMA) class, and, containing from 2 to 30

41

  
      

 

 

S...-------- -b-----------

Dual-processor board Dual-processor board

Figure 2.4. Sequent Symmetry system architecture

processors packaged on dual-processor boards and upto 240 Mbytes of main memory.
Each processor subsystem consists of an Intel 80386/ 80387 CPU / FPU combination
and a 64~Kbyte 2-way set-associative cache. Cache coherence is enforced by using a
write-invalidate copy-back caching policy on a cache line that is 16 bytes long. It can
contain upto six memory modules, each consisting of a memory controller board and
8 or 16 Mbytes of memory. It can also, optionally, contain a memory expansion board
with 24 Mbytes of memory on it. When the system contains a pair of equal-sized
memory subsystems, alternate 32-byte address blocks are interleaved between the
two modules. The Sequent System Bus (SSB) forms the heart of the system’s global
interconnection network. All the processor and memory subsystems along with other
device interfaces are directly connected to the bus. The system bus operates at 10
MHz (1 cycle = 100 ns). It can carry 64 bits of data with address and data being
time multiplexed on the bus. Multiple bus transactions are pipelined so that the bus
throughput can be maximized. The bus is rated at a peak data transfer rate of 53.3
Mbytes / second. The Symmetry provides an atomic Fetch-And-Store operation but
no Conpare-And-Swap operation.

The BBN TC2000 [14] is a large shared-memory multiprocessor that belongs to

42

Processor function board Processor function board

------------- '-------------

 

 

 

 

 

. .
e

 

'-------‘
'-------

 

 

1
switch
I] interface I] "switc interface I]
------r-----4 ..L... -----J
l Butterfly Switch I

 

Figure 2.5. BBN TC2000 system architecture

the Non-Uniform Memory Access (N UMA) class due to the distributed nature of its
shared memory modules. It is built using Motorola 88100 RISC processors. These
processors reside on a function board that also has a MC88200 Cache and Memory
Management Unit (CMMU) 16 Kbytes each of instruction and data cache, 4 or 16
Mbytes of memory, and a switch interface. The function boards are interconnected
by a multistage switching network so that they can access each other’s address space.
The network consists of 8x8 crossbar bidirectional switches arranged in a logs N -
column butterﬂy interconnection pattern, where N is the number of processor nodes.
Every remote memory reference, is sent out over the switching network, but local
memory access is performed over a direct path bypassing the network. This causes a
remote memory access to incur a higher latency in comparison to a local access.

A route speciﬁes a complete and exact path through the switch. A reply to a
given request is also returned along the same path. If a conﬂict occurs at any stage
in the network, due to the access paths of two or more concurrent requests crossing

each other, then the switch selects exactly one request at random to proceed and

43

rejects all others, which must be retried at a later time. Thus, the switches are non-
blocking in nature. Alternate paths between function boards may exist depending
on conﬁguration size. Use of these alternate paths helps reduce congestion within
the switch. However, on the TC2000, the switch interface selects a given route for
an initial message before its ﬁrst transmission into the switch, and does not change
that route during any retries of the message. Different paths may be used by separate
initial messages, but not by separate retries. There were two alternate paths available
on the system to which we had access. All shared-data, by default, are not cached
on the TC2000. A user can choose to selectively cache shared—data and manage its
coherency explicitly. The TC2000 provides a Fetch-And-Store operation via the
non instruction.

The earlier generation Sequent Balance 21000 system [113] is also a bus-based
global memory multiprocessor, much as the Symmetry, based on the NS32000 se-
ries microprocessor. The bus supports multiple pipelined memory requests. Cache
consistency is maintained by write-through with invalidation scheme. An additional
feature present on the Balance, that was later removed from the Symmetry, is a dedi-
cated lock memory (called Atomic Lock Memory or ALM) connected to the bus that
supports process synchronization primitives. However, the overhead of accessing the
ALM is sufﬁciently high that applications on the Balance may use spin-locking based
on xchg, the exchange-with-memory instruction supplied by the processor [7].

The BBN GP1000, a generation older than the TC2000, is also a NUMA mul—
tiprocessor [13] based on the Butterﬂy switch multistage interconnection network.
It incorporates upto 256 processor nodes each containing a Motorola 68020 CPU,
4 Mbytes of memory and a MC68851 paged memory management unit for virtual
memory processing. The network is composed of 4 stages of 4x4-switches. Memory
accesses over these switches is handled much the same way as on the TC2000.

The architectural features of the systems on which our experiments were conducted

are summarized in Table 2.1.

44

' Table 2.1. Summary of target system architectures

 

 

[TFgature

I] Sequent Symmeg I BBN TC2000

 

No. of Processors
Processor Type
Clock Cycle Time
Memory Size
Data Cache Size
' Cache Line Size
Cache Coherence
IN Network
Peak IN Bandwidth
Operating System
Timer Resolution

26

Intel 80386
62.5 ns

32 MB

64 KB/proc
16 bytes
copy back
Bus

53.3 MB/sec
DYNIX B3.1.2
1 ps

 

 

 

45

Motorola 88100

50 ns

720 MB (16 MB/proc)

'N/A

N / A

N / A

2-col 8x8-switch MIN
38 MB /sec/channel
nX OS release 2.0.6

1 as

 

 

Feature

JISequent Balance

1 1 BBN GP1000

l

 

No. of Processors
Processor Type
Memory Size

Data Cache Size
Cache Line Size
Cache Coherence
IN Network

Peak IN Bandwidth
Operating System
Timer Resolution

 

24

NS32000

16 MB

8 KB /proc

8 bytes
write-through
Bus

26.7 MB/sec
DYNIX

1 us

 

 

96

Motorola 68020

384 MB (4 MB/proc)
N / A

N / A

N / A

4-col 4x4-switch MIN
32 Mbits/sec/channel
Mach 1000

62.5 us

 

 

 

45

2.4 Summary

Efﬁcient access of shared data is the single most important factor in the performance of
parallel program execution on shared-memory multiprocessors. The effective memory
access latency is determined by the hierarchical organization of the shared memory
modules and the distribution of data over this hierarchy. Contention for the network,
memory modules and memory locations can all increase the memory reference delay.
Data coherence mechanisms for replicated data can also contribute to increased la-
tency due to additional network trafﬁc generated. The performance of asynchronous
parallel algorithms on multiprocessors is also inﬂuenced by the use of spin-locks for
enforcing mutual exclusion and barriers. Not only do these forms of synchronization
introduce a sequential bottleneck, but an inefﬁcient implementation of these primi-
tives can have a signiﬁcantly detrimental effect on other shared memory accesses.

In this chapter, we have enunciated the various factors that contribute to the
performance degradation of asynchronous parallel algorithms on multiprocessors using
the shared-variable programming model. The observation and quantiﬁcation of these

overheads is the object of our performance characterization study.

CHAPTER 3

PERFORMANCE
CHARACTERIZATION
METHODOLOGY

The execution performance of a parallel program using shared-variables depends on
static characteristics of the underlying algorithm such as computation granularity,
computation-to-communication ratio, data reference patterns and ﬁxed synchroniza-
tion costs. In addition, performance is also inﬂuenced by run-time overheads incurred
during parallel execution from three primary activities, namely, resource contention
during concurrent accesses to shared data, mutually-exclusive access to shared data,
and synchronization barriers. This overhead is a function of the dynamic run-time
behavior of the system. It is added to the execution time in the form of processor
latencies and busy waits. As overhead increases, the amount of parallelism that can
be exploited decreases. An accurate and complete performance characterization of
multiprocessor program execution must take into account not only the static system
behavior, but its dynamic behavior as well. Furthermore, it is important to be able
to isolate and measure the effect of each component on overall system performance.
By increasing our ability to measure the pieces, combine their effects, and relate their
contributions to architectural and algorithmic characteristics, we enhance our ability

to model and predict performance in complex systems.

46

47

In this dissertation, we have developed a hierarchical performance characteriza-
tion technique that relies on experimental calibration. The method is based on the
construction of synthetic executable workloads. These workloads have the advantage
that they can be made parametric and hence ﬂexible in representing workload char-

acteristics. Our technique consists of ﬁve distinct steps as shown in Figure 1.2 of

Chapter 1:
0 parallel computation model selection,
0 benchmark workload characterization,
e benchmark workload generation,
0 workload execution and performance measurement, and
0 performance characterization.

In this chapter, we describe each of these activities leading to the system charac-
terization objective. The characterization parameters obtained represent the static
performance of a machine as well as different aspects of dynamic interaction between

the machine architecture and the application structure.

3.1 The Parallel Computation Model

Theobjective of this thesis is to develop a set of parameters that characterize the
static and dynamic performance of a shared-memory multiprocessor, and obtain quan-
titative measures for these system characterizers in the context of a certain class of
algorithms based on the shared-variable computational paradigm. The quantiﬁcation
of the characterization parameters is performed through experimental measurements
on the target machine for a selected set of workloads.

To be universally applicable, the system characterization measurements must be
based on a uniform model of execution for parallel computations so that the results
of an experiment can be related to previous and future experiments. Besides, in

the development of a parallel program on a shared memory system, it is natural to

48

ﬁrst deal with the software structure of the program and then with the algorithmic
parameters that determine computational efﬁciency (for example, task granularity,
distribution of shared data and their access patterns, frequency of synchronization,
length of critical sections, etc.). We use a hierarchical model to characterize and
measure the incremental impact of software structure, hardware resource contention,
lock contention and synchronization barriers on the absolute rate of computation as
well as the relative computational efﬁciency.

Parallel algorithms can be classiﬁed based on the structure of their task graphs
[91]. Experience shows that most parallel algorithms belong to one of only a small
number of classes [46]. Examples of classes of task graphs are those representing asyn-
chronous, multilevel partitioned, multiphase, and pipelined parallel algorithms. We
use a phase and transition model of program execution with a multi-phase task struc-
ture as the basis of our system characterization methodology. A parallel computation
is viewed as a sequence of computational phases separated by global synchronization
points (or barriers) (as shown in Figure 3.1). The computation and communication
patterns and, hence, the program behavior are well deﬁned within a phase, but may
change from one phase to the next. Many scientiﬁc and engineering problems adhere
to this model in practice. Application examples represented by this computational
structure include the parallel PDE solver using the synchronous Jacobi method, par-
allel FFT, molecular-motion computations, weather prediction models, etc.

Each phase is comprised of a collection of asynchronous tasks without any explicit
synchronization constraints among them. They may, however, synchronize implicitly
as a result of hardware resource contention during shared-data accesses and software
resource (such as locking semaphores) contention during mutually exclusive access
to critical regions of code. Computations developed according to the popular SPMD
(Single Program Multiple Data) parallel programming paradigm ﬁt this task structure
well. At a lower level, a task may correspond to one more iterations of a parallel
DOALL loop construct [70] executing concurrently on a single processor. The iterations
of a DOALL loop are data independent and, therefore, can be assigned to different

processors and executed in any order. Parallelism at a higher level can be exploited

 

Phase2

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 3.1. Structure of parallel program execution

by high level spreading of large-grain tasks.

We focus attention on the class of structured iterative algorithms with multiple
phases. Within a phase the computation, shared data access and synchronization
patterns are very regular for each iteration. Frequently in these applications, the
computation can be uniformly distributed among processors thus assigning equal
amount of work with identical characteristics to each processor. Therefore, if we
assume that each process performs a series of identical iterations within a phase, then
the overall multi-phase performance of the complete application can be extrapolated
from measurements performed at the iteration level [92]. Since all iterations are
identical, we will measure the performance behavior of a single loop iteration when
executing concurrently with other identical iterations.

In Figure 3.1, assume that there are u computational phases. Assume that phase k
is comprised of wk identical iterations on each processor and the number of processors

employed (degree of parallelism) is NI. If tk, is the time it takes to complete one

50

iteration on processor i during phase It, then the total time to complete w). iterations
on processor i is given by wktk, since all iterations are identical. Hence, the time T],
required to complete phase k of computation is a function of N}, and is given by

Tk(Nk) = ,ggﬁkfwktkd = wk ' 11513235}th = wktk

where t), is the effective iteration time for phase It. At the end of each phase, all
processes must wait for the slowest process among them before they can continue.
The time spent in waiting for the last process to arrive is already accounted for in
the phase execution time Tk(Nk). However, the additional time penalty needed to
broadcast the event of the arrival of the last process at the barrier contributes to the
total execution time. This time, Tba,,(Nk), depends not only on the number N], of
processors involved in the barrier, but also on the implementation and the method
used to busy-wait for the arrival of the last process. If all the sequential components
of the parallel program execution, such as creation of parallel processes etc., can be
represented by the single time component Tug, then the total execution time T of the

computation is given by

T = Tgcq + 2(TkUVk) 'l' Tbarr(Nk)) = Tseq '1’ 2(wktk 'i‘ Tbarr(Nk))

k=l k=l

Using this model, if the per-iteration execution time tk and the barrier performance
Tb", can be accurately characterized for a given workload for varying degrees of

parallelism Nk, then the overall performance of the computational workload can be

estimated.

3.2 Workload Characterization

System characterization (to distinguish it from benchmarking) is a set of experiments
that isolate and measure the performance response of a system to controlled work-
load inputs. These responses describe the system and determine its performance.

The accuracy of the system characterization depends closely on the type of work-

51

loads chosen for selective assessment and how well they represent the measurement
objective. Having chosen a multi-phase program structure at the algorithm level, we

next concentrate on deﬁning the program characteristics within a phase.

3.2.1 The Unit Grain.

Measurement data about the behavior of real workloads on shared memory multipro-
cessors are scarce (examples are [1], [30], [37] and [12]). Hence a broad but abstract
model of workload speciﬁcation is adopted for system characterization. It allows the
exploration of performance over a wide spectrum of assumptions about data sharing,
locality of reference, and inter-process synchronizations.

It has been shown [121] that the performance of a parallel system in the short
term—during one iteration—for example, can also be used to model long term per-
formance. We model the computation in a single process (or thread of activity),
which is part of the parallel workload, as a sequence of loop iterations that may be
random or deterministic. Each such loop iteration represents a single grain of compu-
tation, called a unit grain and denoted as G in Figure 3.1. The sequence of iterations,
therefore, represents a string of grains constituting the execution proﬁle of a single
processor in a parallel program. The unit grain is the fundamental level at which all
performance measurements are taken.

Each unit grain G is further assumed to be composed of exactly three granules:
shared-memory access, local computation and synchronization (Figure 3.2). A shared-
memory granule, denoted as gm, is concerned with accessing globally shared data
needed for the computation. Most often, access to globally shared data within this
granule would be in concurrent—read mode, since writes to shared data must be prop-
erly guarded within critical sections in order to preserve memory access consistency.
In situations where concurrent writes are legitimate and consistency preserving, how-
ever, gm could include writes to shared data. A local computation granule, denoted as
ye, represents the portion of the execution grain that performs CPU bound compu-
tation using only process private data. We assume that any shared data needed for

the computation is ﬁrst retrieved into a process private area (possibly internal reg-

52

 

 

 

.0...

 

 

 

------’

 

 

 

 

NLOCK

 

 

 

CS: Critical Section

 

 

 

4----- e

 

8

Figure 3.2. Structure of a unit grain

isters or processor cache) before being used. A synchronization granule, denoted as
9,, represents inter-processor synchronization in the form of mutual exclusion (using
locking semaphores) to access critical sections of code wherein updates to write-shared
data are performed. It could also represent synchronization operations such as event
post/wait for synchronous algorithms. This granule imposes an ordering restraint
on the otherwise concurrent execution of a multiprocessor application. Using this

decomposition, the unit grain G is deﬁned to be a 3—tuple of granules.

G = (gmgaga)

A special characterization called null characterization, and denoted by g.- = 43, is
reserved to indicate that granule g.- is absent from the unit grain. Any component
granule in the deﬁnition of G can be null, reﬂected by the alternate bypass paths
shown around each granule in Figure 3.2.

We will characterize the unit grain G by choosing an appropriate characterization
for each of its component granules. The choice of attributes needed to characterize

each granule depends upon what aspect of the multiprocessor system performance

53

is under study and the level of abstraction at which the analysis is to be carried
out. For example, if the speed of ﬂoating-point operations were of interest, then the
computation granule gc could consist of an appropriate ﬂoating-point expression(s),
whereas the granule gm could simply be speciﬁed as a memory access frequency, and
the granule g, made null. The hardware execution times of the different ﬂoating-point
operations selected for the computation in 96 can be normalized to addition time by
assigning suitable weights to each type of ﬂoating-point operation. An example set

of weights for a sample machine are shown in Table 3.1.

Table 3.1. An example of weights assigned to different types of ﬂoating-point opera-
tions to normalize their execution time to ﬂoating-point addition time

 

 

 

F loating-point Normalizing
operation weight
+3 —I * 1
/,sqar 4

EXP, SIN, etc. 8

IF (X .REL . Y) 1

 

 

 

Similarly, the absolute performance of synchronization primitive could be mea-
sured by using null characterizations for gm and gc, while characterizing g, with the

relevant details of the implementation of the synchronization primitive.

3.2.2 Workload Classiﬁcation

Using the 3—granule decomposition of the unit grain, a single phase of computation
in our multi-phase program structure can now be represented as shown in Figure 3.3.
Each task (assigned to a separate processor) processes a string of t unit grains before
synchronizing at a global barrier. Granule gc contains the meaningful computations
performed by a task and hence represents the operations whose overall rate should

be maximized. Based on whether the granules 9m and g, are present in the unit grain

START

 

 

. l

I

i

I

o e e '

m“ -
exclusion

- --b

 

PI P2
-
I—1_

 

 

 

 

 

END

Figure 3.3. Structure of a single computational phase

definition, the range of workloads represented by this technique can be categorized

into four broad classes based on the mode of concurrent accesses to shared data.

A. Embarrassing workloads. All computation in these workloads is performed

within granule gc with no shared-data access or inter-process synchronizations

(9171 = 45,90 = ¢)

B. Concurrent-access workloads. In addition to computation performed within gc,
processes also access shared data concurrently in gm (gm 74 ¢,g, = (1)). As
an example, processes may perform local computations while accessing shared
data in concurrent read-only mode. Concurrent write-sharing is also permissible
as long as the write operations performed on the shared data are consistency

preserving.

C. Exclusive-access workloads. In workloads belonging to this class, processes ac-

cess shared data only in exclusive mode, i.e., in a mutually exclusive fashion,

55

inside g, in addition to performing local computation in gc (gm = tb, g, 75 925).
There is no concurrent sharing of any global data in this class. Write-sharing
of data between processes that requires mutually exclusive updates to ensure

data integrity belongs to this workload class.

D. Dual-mode access workloads. This is the most general class in that both con-

current sharing as well as exclusive sharing of global data is allowed in addition

to local computation (gm at d), g, 76 9b).

Workloads designed according to each of the classes above can be used to either mea-
sure a system’s performance along a particular dimension or the interactions between
different performance dimensions. This provides a means of observing how different
factors affecting performance interact. Based on these results, one can identify critical

parameters and recognize performance bottlenecks.

3.3 Experimental Framework

The deﬁnition of the unit grain provides a unit of workload speciﬁcation for the com-
putational activity in a single process (or a single thread of control). Our objective is
to measure not only the static characteristics of the execution of a speciﬁed workload
but also the dynamic characteristics that result from the run-time interactions be-
tween concurrent processes. In other words, we would like to be able to observe and
quantify the loss in performance that results from the interference between concur-
rently executing grains. The program characteristics of the interfering grains may be
identical (homogenous) or non-identical (heterogenous). With this objective in mind,
the measurement structure selected for the experimental study of the interference

behavior is now described.

3.3.1 Measurement Structure

In order to fulﬁll our goal of observing the interference between a set of simultaneously

executing homogenous or heterogenous grains under varying degrees of parallelism,

56

we have established an experimental structure for our measurements (see Figure 3.4)

that consists of:
0 one test processor (called Po),
0 a variable number, N, of competitor processors (called P1, P2, ..., PN), and

e a number, M, of data elements that are shared by the test and competitor

processors.

The test processor Po executes a unit grain called the test grain and denoted by
G; = (g,‘,,, g:, gﬁ). Each competitor processor executes a unit grain called the competi-
tor grain and denoted by G6 = (gfn, gg, gj). Every competitor processor, P1, ..., PN,
executes an identical copy of the competitor grain Ge simultaneously. The number
of competitor processors, N, can be varied to control the degree of parallelism and,

hence, the extent of interference among the concurrent grains.

BARRIER >

       
    

 

.....................
....................
......................

5°

 

 

 

test

processor competitor

processors

Figure 3.4. Structure of the measurement framework

57

We also make the following assumptions for all our experimental measurements:

0 The number of concurrently competing processes in our framework is less than or

equal to the maximum number of available processors Nmax, i.e., N +1 S Nmar.

e A process once created and attached to a processor remains stationary, i.e.,

process migration is not allowed.
0 The execution of a process is nonpreemptive.

The ﬁrst assumption ensures that all the processes in a given workload are simultane-
ously active on different processors thus participating in shared resource contention
resulting in the worst-case runtime overheads. Throughout this thesis, therefore, the
terms process and processor are used interchangeably. The second assumption helps
eliminate the context-switch overhead that would entail from process migration. The
third assumption precludes any unexpected program behavior due to unpredictable
process preemptions. Further, all measurements are performed on a quiescent system
thus enabling us to ascribe reasons for the observed losses with greater conﬁdence.

The second aspect of the grain interactions that needs to be controlled is the size
of the shared-data space within which all the grains interact. This is accomplished by
assigning a suitable value to M. The structure of the shared data is assumed to be
a one—dimensional array consisting of M elements and distributed over the memory
modules in the shared address space in some predetermined fashion. This view of
the shared data is justiﬁed by the fact that any higher dimensional data structure
will ultimately be translated into a one-dimensional sequence of memory addresses
for the purpose of storage. A hot-spot scenario results from setting M = 1.

The set of input parameters to the experiment, I, can now be consolidated and

written as
I: {N,M,GI,GC}

Note that by setting G; = Ge, we can create a homogenous workload; or by using

different characterizations for Gt and Gc (Gt 75 Ge), we can create a heterogenous

58

workload. Homogenous workloads are used to characterize the loss in processing ef-
ﬁciency ensuing from runtime overheads when multiple identical processes cooperate
to achieve a common goal (as in SPMD style computations). On the other hand,
heterogenous workloads are used to characterize the interaction between unrelated
processes (the test and competitor grains in this case). The interference in the ex-
ecution of a process of interest (the test grain) due to the simultaneous execution
of multiple “non-related” processes (the competitor grains) can be observed. By
varying N, the performance degradation under varying degrees of parallelism can be
measured.

For a given set I = {N, M, Gth} of input parameters, the average execution

time per unit grain for processor Pk (denoted by ’17,) is given by (refer to Figure 3.4)

_ Tfk _ Ti

T
k Nitr

, k = 0,1,...,N.

The effective unit grain execution time Tc;(N) for a concurrent workload with N
competitor processes active is recorded for each experiment performed. The value
recorded for TG(N) is different for homogenous and heterogenous workloads given
that the purpose behind the two types of workloads is different.

For homogenous workloads:

For heterogenous workloads:

With these deﬁnitions, it is obvious that a null characterization of the test grain
(i.e., G; = (¢,¢,¢)) is meaningless for both types of workloads. However, a null
characterization may be used for G6 for heterogenous workloads.

We also deﬁne the uncontested execution time of a granule g.-, denoted by Tg,(0),
as the time required for the unit grain G with g,- as the only non-null component
granule in it to complete its execution when executing alone on a multiprocessor (no

interference from other grains). The uncontested execution time of a unit grain is

59

the sum of the uncontested execution times of its component granules. Using this

deﬁnition, and the fact that Tg,=¢,(0) = 0, we can write

Tm = T9m(0) when G=(gm,¢,¢)
To = Tg.(0) when G=(¢,gc,¢)
T, = Tg,(0) when G=(¢.¢,ga)
T = Ta(0) when G=(gm,9c,ga)

= Tm+Tc+Ts

where Tm, Tc, 7', and r are the uncontested execution times of gm, gc, g, and G,

respectively.

3.3.2 Workload Generation

With a suitable selection of attributes characterizing the unit grain, the workload
model parameters contained in I allow a wide range of workload behaviors to be
represented. If all the points in the parameter space of I were to be tested, it would
result in an overwhelming number of experiments. This would not only be extremely
time consuming to be practically feasible, but also make it impossible to draw con-
clusions. Hence, a systematic method is adopted for traversing the input parameter
space by the creation of parameter families wherein a family of related behaviors is
obtained by ﬁxing all but one parameter. The parameters in I that remain ﬁxed
within a parameter family are said to be anchored. The changing parameter, say X,
within a family is denoted by X. If attribute y of grain G.- is varied, then the changing
parameter is denoted by Gg(y).

Using this convention, I1 = {N,M,GI,GC}, for example, denotes a parameter
family wherein M, G; and G; are anchored while the number of competitor processes
N is varied. Similarly, I2 = {N , M, Gt(s), Ge} denotes a parameter family wherein
N, M and G6 are anchored while the attribute s of G; is varied.

Assigning constant values to each attribute in the characterization of G creates an

60

instance of the unit grain. The resultant tuple is called a characterization instance of
G. If a study of the deterministic execution behavior of the workload represented by
G is desired, then the value assigned to each of the characterization attributes may
be interpreted as invariant quantities, resulting in an invariant G from one iteration
to another. On the other hand, treating each attribute value to be the mean of a
known probabilistic distribution transforms the corresponding attribute into a random
variable thus permitting a study of the stochastic execution behavior of the workload.
For the probabilistic characterization of input workloads, any input parameter X can
be associated with a spread factor f (denoted by X [ f]), 0 S f S 1, causing X to
become uniformly distributed in the interval [(1 — f )X , (1 + f )X ] In other words,
an input parameter speciﬁcation of the form X [f] is equivalent to X being selected

from a uniform distribution over the interval [(1 — f )X , (1 + f )X I

X[f] E U[(1- f)X,(1+ f)Xl

3.4 Performance Characterization Parameters

It has been recognized for years that the single parameter Mﬂop/s (megaﬂops) is in-
adequate to measure the performance of a multiprocessor system, because it takes no
account of the communication, synchronization and resource contention overhead in-
herent in the parallel execution of multiple processes. More recently, a two-parameter
(r00, 31/2) description has been used [61] to characterize the ﬂoating-point performance
in MIMD computing that is based on measuring the importance of the overhead of
synchronizing multiple instruction streams. The parameter r00 denotes the asymp-
totic ﬂoating-point performance as Mﬂop/s whereas 31/2 indicates the amount of use-
ful arithmetic that could have been done during the time taken for synchronization.
In a similar spirit, a three-parameter (r00, n1 /2, 31/2) description of MIMD vector com-
puters [62] has also been used that incorporates, in addition to the synchronization
overhead 31/2, the vector startup overhead in terms of n1 ﬂ. However, the parameters

used in these characterizations assume that the overheads are constant quantities

61

thus accounting for only the static overheads encountered. The variation of program
performance with the number of processors and the associated dynamic overheads
caused by run-time interactions between processes cannot be accurately captured by
such static parameters only.

In this dissertation, we develop a hierarchical performance model to describe the
performance of the multiphase program structure used as the basis of our studies.
Each level in the hierarchy provides a measure of the fraction of total processing
power that is lost due to inefﬁciencies at that level. In doing so, each level furnishes
a set of parameters that characterizes the importance of overhead factors that limit
performance at that level. The hierarchical performance model integrates the charac-
terization parameters from each level into a composite framework that describes the
net performance of a system as its ideal potential performance degraded successively
by overheads encountered at each level of the hierarchy.

The lowest level in the hierarchy, the granule level, focuses attention on each
component granule of the unit grain. The effect of the static distribution of work
among the granules on computational performance is captured by the three static
parameters (R00, f1/2,c1/2) measured at this level. Measurements at the next higher
level, the grain level, quantify the overheads that result from run-time interactions
between concurrent instruction streams as a function of the number of interfering
processes. The inﬂuence of these overheads on overall performance is described by
the two dynamic parameters (wm(N),tb,(N)). At the highest level, the phase level,
the loss in performance due to global synchronization at the end of each phase is

observed and quantiﬁed using the dynamic parameter wb(N).

3.4.1 Static Parameters

The decomposition of the unit grain G into the three component granules (gm, gc, g,)
signiﬁes the division of the total work performed within a unit grain into communi-
cation, computation and synchronization components. The granule gc performs all
the meaningful computation, whereas the time spent within the granules gm and g,

represents communication (through shared variables) and synchronization (mutual

62

exclusion) overheads, respectively. The relative proportion of time spent in each of
these granules during execution determines the maximum rate at which the compu-
tation in gc can progress. The static parameters characterize the dependence of a
multiprocessor performance on the static overheads inherent in the algorithm design
resulting from communication and synchronization.

Assume that the computation performed within a unit grain can be expressed in
terms of a number of basic computation units (BCUs). A BCU may simply represent
a single ﬂoating-point operation at one extreme, or it may represent a very large
computational block involving many number of ﬂoating-point operations at the other
extreme. In other words, the amount of computation that a BCU is chosen to repre-
sent is a matter of the level of abstraction at which the computational performance
is of interest. Stated another way, a single BCU produces a single result of interest
and the rate of BCU execution determines the rate at which results are generated.
Let the unit grain G contain a total of c BCUs distributed between gc and g,. Also,
let the unit grain G contain a total of m shared data references distributed between
g..., and g,. The synchronization in g, is assumed to be mutually—exclusive access to
a critical section guarded by a pair of lock/unlock operations.

For a given workload (i.e., a given characterization instance of G), deﬁne

tm = the average time per shared data access,
tc = the average time per BCU,

t, = the average time per synchronization operation.

The value of tm depends not only on the hardware characteristics of the shared
memory, but also on the distribution of shared data over the memory hierarchy and
their‘access patterns imposed by the application algorithm. In the case of UMA
(Uniform Memory Access) multiprocessors with no caches, where all memory is global
and equidistant from all processors, the shared data access time tm is equal to the time
tglow to access a data item in global memory. If per-processor caches are present on

a UMA multiprocessor (e.g., Sequent Symmetry), then the shared data access time

63

is governed by the proportion h of cache hits exhibited by the shared data access
pattern. If teach: denotes the time to fetch a data item from the cache and tgloba, to

fetch it from the global memory, then tm is given by

tm : htcache '1' (1 _ h)tglobal-

In the case of NUMA (Non-uniform Memory Access) multiprocessors, all memory
is not equidistant from all processors thus exhibiting different access latencies for
different levels in the memory hierarchy. Let tram; and tmnotc respectively denote the
times to access a data item from processor local and remote memory modules, and r
denote the proportion of shared data accesses that go out to a remote memory module.
Assuming that no per-processor cache is present (e.g., IBM RP3), the average access
time tm is given by

tm : rtremote '1' (1 _ r)tlocal-

If per-processor caches are present (e.g., BBN TC2000) and h denotes the proportion

of cache hits, then tm is given by

tm = htcache + (1 _ h)[rtrcmote '1' (1 " r)tlocalI-

The average shared data access times for the different memory organizations are

summarized in Table 3.2.

Table 3.2. Summary of average shared data access time tm

 

[Memory | no per-processor cache I with per-processor cache I

UMA tglobal htcache + (1 — h)tglobal
NUMA rtremote + (1 — r)tlocal htcache + (1 —' h)Irtremote + (1 — r)tlocal]

 

 

 

 

 

 

 

The value of tc depends upon the composition of the BCU. For instance, suppose

that the rate of ﬂoating-point operations were of interest. Let a single BCU consist of

64

a total of n arithmetic operations each involving a different number of ﬂoating-point
operations. If the n operations can be classiﬁed into t types such that there are n,-
arithmetic operations of type i that require w; ﬂoating-point operations, then the
BCU time tc can be expressed in terms of the time t f, to perform a single ﬂoating-

point operation as

t t
to = anwgtfp where Zn,- = n.

i=1 i=1
The value of t, is determined by the particular implementation chosen for the
locking primitives. If tlock and tunlock represent the latencies of the locking primitives,

then

ts = tlock + tunlock-

Using the characteristic times tm, tc and t, of a given workload, and the unit grain
parameters c and m, the single processor (no interfering processors) execution time

Ta(0) of the unit grain G can be expressed as follows.
TG(0) = T = etc + mtm + t,. (3.1)

Since a total of c BCUs are computed, we ﬁnd the average BCU computation rate

per processor, 12(0), as a function of the grain parameters to be

 

12(0) = 3 c

= . .2
T ctC + mtm + t, (3 )

With a little rearrangement of the above expression, the average computation rate

R(0) can also written as

 

Roe
12(0) = 1 £2 22 (3.3)
f c
where:
tm ts C
Roozt—, f1/2——, C1/2——, and f:—

The value R00 provides a measure of the asymptotic (i.e., maximum) performance
in BCUs / second per processor. The degradation of performance from this peak is de-

termined by the amount of computational work performed per shared data reference,

65

here measured by f, the computation granularity c, and the static parameters f1/2
and CI”. The half-performance memory factor, f1 ,2, measures the memory bottleneck
in terms the amount of lost work that could have been done during the time of the
shared data access, whereas the half-performance lock factor, c1 ,2, measures the lost
work due to synchronization. Hence, they signify the cost of shared data access and
synchronization in a currency that has a known value to the programmer.

The signiﬁcance of the half-performance factors become apparent if we consider a
unit grain G with only one kind of overhead in it. For instance, if the synchronization
granule g, is absent from G, then r, = 0 => t, = 0 => c1 ,2 = 0. This results in the

average computation rate 8(0) given by

R

R(0)=1+ f1/2/f.

It can be seen from the above expression that for f = fl l2, half the asymptotic per-
formance R.>0 is obtained. Thus fl); is the minimum computation-to—communication
ratio necessary to achieve half the asymptotic performance.

Similarly, if the shared-memory access granule gm is absent from G, then Tm =

0 => tm = 0 => f1/2 = 0. This results in the average computation rate R(0) given by

Roo

3“” = 117,7

Once again, as before, it can be seen that c1/2 is the amount of work in a unit grain
that is necessary to achieve half of the asymptotic performance.

We characterize the static performance of a multiprocessor system in terms of the
3—parameter description (R00, f1 /2, c1 /2)' The values of R00, f1]; and 61/2 are likely to
depend on hardware and application characteristics as the discussion on tm, t6 and
t, earlier in this section illustrated. The parameters (R00, f1 /2, (31/2) have been chosen
for system characterization rather than the original timing parameters tm, t0 and
t,, because they are more directly related to facts about a problem that are known
to a computer user. The parameters f1 /2 and 61/2 provide a user with a yardstick

with which to compare the computation—to—communication ratios and computation

66

granularities that occur in his problem, and Rec provides a target with which to
compare the actual performance of his program.

Eq. 3.3 gives the functional form of the approach of the average computation
rate to the maximum R00 as the computation granularity c and the computation—to—
communication ratio f change. The functional form of this approach to the asymp-
totic will occur repeatedly in the subsequent discussion of performance, and we deﬁne

it as the loss function
1

1+x°

 

loss(x) =

(3.4)

The expression for the average computation rate in Eq. 3.3 can then be written as

R(0) = Rooloss(f/f1/2 + C/Cl/g) (3.5)

which shows how the peak performance is degraded by memory bottleneck (ﬁrst term)
and inadequate granularity (second term).
We can now express the uncontested single-processor execution time of the unit

grain in terms of the static characterization parameters as
Ta(0) = r = Rgol(c + mf1/2 + Cl/g), (3.6)
and the individual timing parameters as
tc = 12;}, tm = Rgo‘fm, t, = Egg/2.

Values of (R00, f1 /2, c1 /2) can be obtained by ﬁtting a set of measurements of r for
different combinations of c and m to Eq. 3.6. As Eq. 3.6 represents an equation in three
unknowns, a set of three measurements of r with linearly-independent combinations

of c and m should, in theory, be sufﬁcient to solve for the unknown parameters.

67

3.4.2 Dynamic Parameters

If the concurrent execution of processes, represented by unit grains, on different pro-
cessors were ideal (i.e., no mutual interference), then the net computational rate
achieved with N competitor processors would be (N + 1)R°o. However, in practice,
parallel execution of cooperating processes involves contention for shared resources
in hardware (memory modules, interconnection network, etc.) and software (shared
lock variables). The result is runtime overheads that are dynamic in nature which
degrade the asymptotic performance further beyond the inefﬁciencies introduced by
the static parameters (R00, f1/2, c1 ,2). It is important to know the computation cost
of these dynamic overheads, because this will inﬂuence the way in which a particular
program is organized for parallel execution (i.e., how it is parallelized).

The multiphase algorithm structure chosen for our studies is assumed to ex-
hibit asynchronous behavior (i.e., only implicit synchronizations) of parallel processes
within a phase and global barrier synchronizations between phases. As discussed ear—
lier, there are three overhead dimensions that exert a critical inﬂuence on the perfor-
mance of such application structures, namely, overhead due to contention for shared
data and memory, overhead due to access of mutually-exclusive critical sections, and
overhead due to synchronization barriers. The measurement of the incremental contri-
bution of each of these factors to the total overhead helps identify critical parameters
in the workload and recognize potential performance bottlenecks.

The incremental overheads resulting from memory contention and shared lock
contention are characterized by measuring the interference among concurrent grains
within a phase. In other words, the performance degradation is observed at the grain
level. The incremental overhead due to synchronization barriers is obtained from
experimental measurements at the phase level. Each overhead component for a given

workload is characterized by an interference factor expressed as a function of N, the

number of competing processes.

 

68

Grain level characterization

Barring the loss in efﬁciency due to the relative proportions of the granule lengths, the
ideal parallel execution performance of a unit grain G in the absence of any external
interference is given by its uncontested execution time T 0(0) = r. If the asynchronous
execution of concurrent unit grains within a phase were free of mutual interference,
then the execution time per unit grain would still remain as 7. However, this ideal
performance is hampered by two factors: memory interference and lock interference.
Memory interference results from contention for shared hardware resources along
the processor—to—memory path, contention for memory modules and the overheads
of maintaining data coherence across the memory hierarchy (e.g., cache coherence).
Lock interference results from contention for a shared lock variable and the queuing
delay ensuing from enforcing the mutual exclusion semantics.

The total execution time of t unit grains (Figure 3.3) within a phase with N
other interfering grains present, T(N), is given by its ideal execution time T(0) = tr

augmented by memory and lock interference overheads. In other words,
T(N) = [T + Om(N) + 0,(N), (3.7)

where 0m(N) and 0,(N) are, respectively, the extra overheads due to memory and

lock interference. If the corresponding average overheads per unit grain are denoted

by Om(N) = Om(N)/t and O,(N) = 0,(N)/Z, then Eq. 3.7 can be rewritten as

OM 04M). (3...

T(N)=tr(1+——+—

T 7'

We deﬁne two grain-level dynamic characterization parameters incremental mem-

ory interference (115,.) and incremental lock interference (1b,) as follows:

em) = omuv) oer)

 

and ¢,(N) =

 

(3.9)

T

The memory interference ¢m(N) for a given workload varies with N, and depends

69

upon the distribution of shared data objects over the memory hierarchy and the
memory reference patterns. Similarly, the lock interference ¢,(N) also varies with N,
and depends on the implementation of the locking primitives, the frequency of critical
section access and the amount of computation performed in between consecutive
critical section operations. The total execution time from Eq. 3.8 can be expressed

in terms of the dynamic characterization parameters as

T(1V) = “(I + ¢m(N) + IMN»- (310)

Given that there are c BCUs computed per unit grain and 5 unit grains executed
per processor within a phase, the total number of BCUs computed within a phase is
(N + 1)t’c as there are (N + 1) processors executing concurrently. Hence, the effective

BCU rate with N competitor processes active, R(N), is given by (using Eq. 3.10)

 

(N +1)€c (N +1)c/1'
R N = —— = . 3.11
( ) T(N) 1+ ¢m(N) + z/2.(N) ( )
Substituting the rate C/T from the granule level expression in Eq. 3.3, we get
N 1 R00
R(N) = ( + l (3.12)

 

(1+ f1/2/f + 61/2/C)'(1+ lmeV) + PAN».

The computation rate R(N) can also be expressed in the functional form of the

loss function deﬁned earlier as

R(N) = (N +1)Roo loss(f/f1/2 + C/CI/z) loss(wm(N) + l/Js(N)) (3.13)

which shows how the peak performance is degraded by the static (ﬁrst loss term) and
the dynamic (second loss term) overheads.
The average unit grain execution time, TG(N), with N competitor processes

present can be expressed in terms the system characterization parameters deﬁned

70

so far as follows.
TG(N) = R;‘(c + min + 01/2)'(1+ W) + am» (3.14)

The dynamic parameters tbm(N) and ¢,(N) for a given workload can be obtained by
experimental measurements at the grain level to determine the increase in the average

execution time per unit grain G.

Phase level characterization

In addition to the increase in unit grain latencies caused by memory and lock in-
terference, the effective BCU computation rate per phase is further decreased due to
additional overhead of barrier synchronization at the end of a phase. If the additional
latency due to the barrier with N competitor processes is given by 05(N), then the
total time to complete a phase, T( N), with 3 unit grains per processor is obtained by

augmenting Eq. 3.10.

T(N) = 37(1 + ¢m(N) + ¢.(N)) + 05(N)

rr(1+ ¢m(N) + u.(N)) (1+ ,7 CM)

(1+ ¢m(N) + ¢.(N))

 

) (3.15)

We deﬁne the phase-level dynamic characterization parameter incremental barrier

interference (tbb) as follows:

MN) = 0'”). (3.16)

T

 

The barrier interference tbb(N) for a given workload varies with N, and depends upon
the implementation of the barrier and the degree of load imbalance within the phase
preceding the barrier. Using this deﬁnition of barrier interference, the execution time

of a single phase can then be expressed as

T(N) = («1+ ¢m(N) + u.(N))(1 + WNW) (3.17)

71

where

 

~ _ WIN)
MN) ‘ 1+ MN) + ¢.(N)'

The modiﬁed parameter $5(N) can be interpreted as the incremental barrier overhead
normalized with respect to the actual unit grain execution time Tg(N) under con-
tention conditions, as opposed to being normalized with respect to the uncontested
unit grain time TG(O).

The effective BCU rate per phase including the barrier and with N competitor
processes active, R(N), can then be computed from Eq. 3.17 as

T(N) (1+ ftp/f + 01/2/C)'(1+ MN) + MN» - (1 + tt(N)/€)'

 

R(N) =

Expressing the net per-phase computation rate R(N) in the loss functional form,

we get

R(N) = (N +1)Roo l083(f/f1/2 + C/Ci/2)
1033(um(N) + u.(N)) loss(zbb(N)/€). (3.18)

which shows the net performance as the peak performance degraded by all the char-
acterization (both static and dynamic) parameters.
The total execution time per phase, T(N), with N competitor processes active

then becomes (in terms the system characterization parameters)
T(N) = 12;..‘(6 + mfm + c1/2)'(1+ ¢m(N) + ¢.(N)) - (1 + ¢b(N)/€)- (3-19)

The dynamic parameter ‘i/Jb(N) for a given workload is obtained by experimental
measurements at the phase level to determine the increase in execution time of the
phase on account of the barrier being present.

The system characterization parameters described in the previous paragraphs
quantify the losses in performance that result from the static characteristics of an

algorithm and the dynamic overheads overheads encountered at run-time. For a

72

Table 3.3. System characterization parameters

 

 

 

 

Type I Parameter Description ]
Static R00 Asymptotic computation rate (BCUs/s)
parameters f1/2 Half-performance memory factor
c1/2 Half-performance lock factor
Dynamic ¢m(N) Incremental memory interference
parameters w,(N) Incremental lock interference
t/)b(N) Incremental barrier interference

 

 

 

 

given workload, the system characterization parameters (summarized in Table 3.3)
help relate the expected performance of the workload to the application parameters
(summarized in Table 3.4) as a function of the employed parallelism N (or degree of

interference) .

Table 3.4. Application parameters used in the performance model

 

Warameter Description [

 

 

Number of BCUs per unit grain

Number of shared-data accesses per unit grain
Number of unit grains per processor per phase
Degree of interference (#of processors = N + 1)

 

 

2&3 o

 

3.4.3 Performance Metrics

The performance measurements taken at either the grain or phase level in our ex-
perimental framework are quantiﬁed using the fundamental metric called cumulative
interference and denoted by \P(N). This measure answers the question: how much
longer is the expected execution time T(N) of the given workload in a conﬂicting

situation compared to its expected conﬂict-free execution time T(O). This results in

73

the following deﬁnition of the cumulative interference measure.

T(N) - T(0) _ T(N) — fr
T(O) — [1'

 

\P(N) = (3.20)
For measurements performed at the grain level, T(N) = [Tg(N) where Tg(N) is the

average unit grain execution time. Substituting this in Eq. 3.20, one can see that

Tg(N)—T.

1.

‘II(N) '2 (3.21)
In other words, the cumulative interference \II(N) for measurements performed at the
grain level can be inferred from the average execution times per unit grain. It is
apparent that \II(N) Z 0 always.

For grain level experimental measurements, we also deﬁne a second metric called
unit grain eﬂ‘iciency, denoted as 6 (N), that measures the relative performance of the
unit grain in the presence of contention with respect to its uncontested execution

time. It is given by the following ratio.

 

N = 3.22
a ) Tam) ( )
Combining equations 3.21 and 3.22 it can be seen that
{(N) — ——1 (3 23)
_1+MN) '

The value of 6 (N ), 0 < 6 (N ) S 1, for a given point in the performance space expresses
the extent of deterioration of a unit grain performance as a result of conﬂicts. A
value of {(N) = 1 indicates no degradation at all implying that the concurrently
executing unit grains in the workload do not suffer any mutual interference. This
is, obviously, the ideal situation for achieving the best possible utilization of the

processing resources for a group of concurrent tasks. The cumulative interference for

this ideal case is ‘II(N) = 0.

74

3.4.4 Aggregate Multiphase Performance

The usual parameter that is used to compare the performance of algorithms is the
speedup, which is deﬁned as S(N) = T(l)/T(N) where T(l) and T(N) are respec-
tively the times for the algorithm to run on one and N processors. Using the rate of

work notation used in our study, speedup can be written as

an

S(N) = R(1)'

However, the value of the speedup alone cannot be used to compare the execution
time of two algorithms unless the value of T(l) is the same in both cases. Put another
way, speedup is the execution speed measured in arbitrary units which change from
algorithm to algorithm if T(l) changes. It is quite possible in the comparison of two
algorithms, for the algorithm with the worst speedup to execute in the least time,
if the T(l) for the worst algorithm is the greater. We prefer therefore to measure
performance in absolute units (for example, BCUs per second), which is the variable
R(N). It should always be remembered that the objective of algorithm development
is to reduce the execution time T(N) (i.e., increase R(N)) which is not necessarily
the same as increasing the speedup. An algorithm with the greater speedup in some
sense uses the parallel hardware more intensely (e.g., there are fewer idle processors),
but it does not necessarily execute in the least time.

Since a program, in our study, is an ensemble of multiple phases (Figure 3.1), the
aggregate performance of the program may be characterized by the performance of its
component phases. The performance of each phase, in turn, is characterized by the
static parameters (R00, f1 l2, 01,2) and the dynamic parameters (wm(N), I/J,(N), wb(N))
for the workload within that phase and follows the performance model elaborated
earlier in this section.

The net computation rate of a program is simply the total number of BCUs
computed, W, divided by the total computation time, T(N). Note that T (N ) depends
on the multiprocessor used, but W is constant for a given problem. Similarly, the

computation rate of an individual phase k is R), = wk/Tk where wk is the total number

75

of BCUs computed by phase k and T), is the total time required by phase k. The net

rate of the program containing v phases is

 

2w).
(2 = l < k < v
Rn t 2T1:
or, it can also be written as
w
Rnet=2'—g: ISkSU.
Rh

Thus, the net computation rate of a program is the weighted harmonic mean of
the computation rates of the component phases (not the arithmetic average of the

rates). Note that the weights are the total computation work of each phase.

3.5 The Workload Emulation Kernels

Once an appropriate characterization for the unit grain has been selected, we have a
method of specifying different workloads of interest by assigning suitable values to the
grain attributes and the input parameters. What is needed is an emulation program
that uses the workload speciﬁcation to mimic the execution behavior of an asyn-
chronous program that would demonstrate the same characteristics, namely, memory
reference and synchronization patterns. The Memory Access Degradation (MAD),
Synchronization Access Degradation (SAD), and Barrier Access Degradation (BAD)
kernels are a family of such emulation programs. As we are only interested in measur-
ing the concurrent execution conﬂicts of the given workload, no real computation need
be performed by the emulation programs, Their only purpose is to mimic the usage
of shared resources of the speciﬁed workload keeping intact the timing relationships
between the different components of the computational structure.

Each kernel is written to use a set I of input parameters and generate a set of
performance measures, <I>(I), of interest by executing the the emulated workload in a

controlled experiment. Each experiment represents a point in the performance space

76.

of the system.

Access Degradation Kernel: I ———> <I>(I)

It should be emphasized that these kernels are different from standard benchmarks.
They are not parts of “real” computations like the Livermore loop kernels. The key
attribute of these kernels is that they are programs that do not perform any useful
computation, but rather, they are programs that model the computation, memory
access and synchronization structure of a class of workloads of interest. They generate
synthetic loads that are designed to stress a particular aspect of the target system.

The usefulness of this approach lies in the fact that:

e The measured performance is not tied to any speciﬁc application. The user
can design selective workloads, using the workload characterization technique

provided, to generate a system characterization of interest.

0 A collection of such kernels can be used to quantify and compare the perfor-

mance of existing, new, or experimental architectures.

0 They are simple and, hence, the interpretation of the observed behavior in terms

of the kernel structure is easy.

3.5.1 Measurement of Incremental Overheads

The static system characterization parameters (R00, f1/2,c1/2) can be measured by
timing the single-processor execution of a unit grain deﬁned by a given input work-
load, and ﬁtting the measured data to the timing model for uncontested execution
time dictated by Eq. 3.6. The key purpose of the workload emulation kernels (MAD,
SAD and BAD) is to facilitate the measurement of the incremental contribution of dy-
namic overheads along the three focal performance dimensions—memory contention,
lock contention and barrier synchronization—for a given input workload. In other
words, the kernels help calibrate the dynamic system characterization parameters

(t/Jm, 1b,, tbb) as functions of N and hence characterize the dynamic behavior of a given

77'

workload. The incremental measurement relationship between the three kernels is

shown in Figure 3.5.

single processor Ideal “nit grain
execution 95mm”

granule incremental memory interference

   
 
  

Memory Access
performance

 

workload incremental lock interference

 

 

 

Mutual exclusion
performance

    

SAD kernels

phase . . .
Incremental barrier Interference

BAD kernels —> SYnChfonization barrier
performance

Figure 3.5. Incremental measurement of dynamic overheads

The kernels are executed in the order MAD —+ SAD -t BAD for a given workload.
The MAD kernels measure the run-time overheads arising only due to contention for
shared memory; the SAD kernels measure the cumulative overheads arising due to
memory as well as lock contention; and the BAD kernels capture the total cumulative
overheads. Each kernel is coded so as to eliminate from its own measurements the
incremental contention overhead measured by its successor kernel.

Each kernel computes the fundamental metric cumulative interference \II(N), as
deﬁned in Eq. 3.20, by timing the execution of a given workload with varying number
of competitors N. Let us denote the cumulative interference measured by the MAD,
SAD and BAD kernels as \Ilm, \II, and W5, respectively; and the workload execution
time measured with N competitors as Tm““(N), T‘°d(N) and Tb““(N), respectively.

78

Then from the deﬁnition of \II (Eq. 3.20), we can derive the expression for incremental

memory interference It", from the MAD kernels as

_ Tm“‘(N) — T(O) _ 0..(N) _ (MN)
“(All " T(O) ‘ T ‘ ‘77"

 

= umuv). (3.24)

Similarly, for the SAD kernels we have

T"‘dUV) - T(O) (T‘““(N) - T'"““(N)) + (Tm°d(N) - T(0))

 

 

 

 

MN) T(O) E T(O)
_ o.(1v) + 0,..(N) _ O.(N) + c“)...(1v) _
_ [T = 1' = 1(’mUV) +¢,(N).

Therefore, the incremental lock interference w, can be computed from the following

expression.

(MN) = ‘1’.(N)- WW) = WAN) - 111,,(N) (3-25)
The cumulative interference measured by the BAD kernels is given by

T"°“(N) - T(O) (Tb°“’(N) - T’°"(N)) + (T‘°“(N) — T(0))
T(O) T(O)
0.,(N) MN)

[1' + \I’sUV) E 7- + WAN)

‘I’I(N)

 

 

 

Therefore, the incremental barrier interference ‘t/Jb can be computed as

IMN) = (WAN) - ‘I’.(N))- (3-26)

The workload level at which experimental evaluation is performed and the metrics

computed by each of the kernels is summarized in Table 3.5.

3.5.2 Kernel Structure

This section describes the program structure of the access-degradation kernels and
their relationship with the experiment control parameters. As seen from Figure 3.4,

every participating processor executes a unit grain (test or competitor), speciﬁed by

79

Table 3.5. Summary of access degradation kernel measurements

 

 

 

 

Workload Measurement Barrier Metrics
processed by level present? computed
I-proc execution granule no R00, [1,2, 01/2
MAD kernels unit grain no \Ilm, Ibm
SAD kernels unit grain no ‘11,, w,
BAD kernels phase yes \Ilb, tbb

 

 

 

 

 

the input parameters I, repetitively. Each processor executes a concurrent loop as
shown in Figure 3.6. All processors are synchronized at a barrier at the beginning to
ensure that they start executing their assigned grains at the same time. Two distinct
iterative regions can be identiﬁed in this concurrent loop. The code to emulate the
unit grains speciﬁed by I is enclosed within the inner loop with i as its loop control
variable, and is repetitively executed Nit, number of times. In reality, we unrolled
this loop to reduce the loop overhead per iteration. The additional code delimited by
the two invocations to the read_clock() function is what we call an observation.

The outer loop with k as its loop control variable constitutes an experiment. Thus
an experiment consists of a set of observations (controlled by the variable Nrepcat). All
the observations in an experiment are assumed to be statistically independent. The
ﬁnal step in an experiment consists of computing the arithmetic mean and variance
of the sample of recorded observations. The sample mean is used as the observed
measure of performance, <I>(I), for the input parameter set I. Conﬁdence intervals are
computed for each set of observations to ensure that the variation between extremes
is within reasonable limits.

The length of each observation run, Ni", and the size of an experiment sample,
Nam“, are selected based on the resolution of the clock available on the target system,
the overhead of the timing function and the overhead of the loop control statements.
The choice of suitable values for these two control parameters is crucial to the mini-
mization of experimental error and the conﬁdence interval of the measured quantities

[111]. A more detailed discussion of the dependence of experimental errors on these

80

 

Concurrent Loop
{
for (k s O; k < Nrepeat; k++)

{
kernel_specilic_inilialization();

harried):
t1 - read_clock();
for (l - O; i < Nitr; i++)

body of test/competitor grain
}
t2 - read_clock();
runtime[k] - (12 - t1) /Nitr;
}

compute_sample_stats (Nrepeat, runtime);

 

 

 

Figure 3.6. The concurrent loop structure of the kernels

control parameters is provided in the next section.

3.5.3 Minimization of Experimental Errors

One of the important considerations of the experimental system characterizer is to
control the accuracy and exactitude of the measurements. In this section, we dis—
cuss the sources of variability in the measurements and illustrate the importance of
the control variables N“, and Nrepea} in minimizing experimental errors and hence
conﬁdence intervals.

Referring to Figure 3.6 we see that the time recorded in each observation Oj also in-
cludes the execution time of the loop control code that controls the test (F 0Roverhead)
and the overhead incurred by the timer routine (Coverhead)- These have to be sub-
tracted from each observation 0,. These measurements have their own variance and

the subtraction of these overheads increases the variance of our measurements. The

81

mean value 0 of a sample of observations is

 

A 1 ”repeat
0 = 0'
N repeat E J
and its variance
2 1 Nrepeat 0 0 2
o 0 = ———-—— -—
N repeat - 1 12:; ( J )

Now the mean value of each experiment is the time it takes to execute the body

of the test / competitor grain Nu, times, plus the overhead of the timing function

A

0 : Nib-(T + FORoverhead) ‘l' Coverhead

where T is the mean time it takes to execute once the body of the test / competitor

grain. We can compute this value and the variance with the equations

A

0 - Coverhead
_ FORoverhead

T =
Nitr

 

and

 

2 2
_ 0' 0 + 0' Covcrhead
_ N?

:tr

0'2T ‘1‘ 02F0Roverhead

Looking at the above equations we can see that there are four factors affecting
the magnitude of variance: the resolution of the timing function; the variance of
our observations; the variance of the execution time of the timing function; and the
variance of the FOR control statements. If the execution time of each observation is

such that we have

Oj >> Crcsolution + Coverhead + FORoverhead

then the only factors that affect our measurements are the dispersion of our observa—
tions.
We have two ways of reducing the variance of our results and therefore the size of

the conﬁdence intervals—increasing the length N“, of an observation and increasing

ll].

82

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Nrepeat = '5
I I I
. M1
Nitr l J
1000 I : M2
1
10000 ‘
|
100000 —J——
1 l 1
-0 3 -0 2 -0 1 0 0 1 0 2 0 3
Nrepeat = 10
I I F
Nitr l / M1
1000 I : M2
I M3
1
10000 '
|
100000 I];
l I l
-0 3 -0 2 -0 1 0 0 1 0 2 0 3
Nrepeat = 20
I I I
1000 I < M2
10000 —__—|
|
100000 T
l l 1
-0 3 -0.2 —0 l 0 0.1 0.2 0.3

Figure 3.7. Normalized 90 percent conﬁdence intervals for three workload measure-
ments on the Sequent Symmetry for Nrepeat = 5,10, 20

83

the sample size ancat. It is important to know the values for N“, and Nrcpm that will
give a small standard deviation in our measurements. These values are system depen-
dent. In Figure 3.7 we show the normalized 90 percent conﬁdence interval of three
workload measurements (indicated as M1, M2 and M3) on the Sequent Symmetry
S81 multiprocessor system. The workload measurements were performed for values
of N“, = 1000, 10000 and 100000. We also obtained measurements for Nrcpm = 5,10
and 20. The conﬁdence intervals for T are obtained using the Student’s t distribution

and the standard error of I as follows:

1; t—gi( 021-.)1/2 T+L§( 0'2T )1/2
T Nrepeat , T Nrcpeat

and the normalized conﬁdence intervals are

_t_9_5_( 0'2T )1/2 .t_95_( 0'2T )1/2
T Nrepeat , T Nrepeat

We can see that for a ﬁxed value of Nrepeat the conﬁdence interval of our mea-

 

 

 

 

surements decrease as the time of the measurement (controlled by Na.) increases.

We obtained acceptable results on the Sequent Symmetry for Nrcpeat = 10 and
N“, = 100000.

3.6 Summary

In this chapter, we developed a comprehensive experimental performance character-
ization methodology for shared memory multiprocessors based on measurement of
the static and dynamic overheads that arise during program execution. The run-
time interference along three principal performance dimensions have been considered,
namely, memory contention, lock contention and synchronization barriers. A paral-
lel computation structure with multiple phases separated by global synchronization
barriers and asynchronous balanced task execution within each phase has been se-
lected as the basis of the performance characterization study in this dissertation.

A hierarchical workload characterization technique using the abstraction of a unit

84

grain has been proposed for the ﬂexible and parametric speciﬁcation of workloads
of interest. Three static parameters (R00, f1/2,c1/2) and three dynamic parameters
(¢m(N), ¢,(N), 1/25(N)) were deﬁned to describe the static and dynamic behavior of
a given input workload as a function of the number N of processes competing for
shared resources. The structure and semantics of of three kernel families — MAD,
SAD and BAD — was presented to facilitate the measurement of the static and the
dynamic parameters. Finally, the primary sources of experimental errors and means

to minimize them were also discussed.

CHAPTER 4

MAD KERNELS AND
MEMORY ACCESS
PERFORMANCE

On large-scale multiprocessors, access to common memory is one of the key per-
formance limiting factors due to the signiﬁcant overheads that may be encountered
related to contention for access to shared memory modules. The shared memory per-
formance depends not only on the characteristics of the memory hierarchy itself, but
also upon the characteristics of the memory address streams and the interaction be-
tween the two. The factors that cause memory access conﬂicts and the architectural
solutions adopted to minimize contention were discussed in Chapter 2. Quantitative
assessment of the contention overheads for different types of memory access workloads
promotes a better understanding of the performance of systems as they scale in size
and use newer memory technologies.

The MAD kernels and the related experimental framework described in this chap-
ter provides an effective testbed for characterizing the shared memory performance
for a variety of memory access workloads. Experimental measurements are performed
at the unit grain level with multiple unit grains executing the speciﬁed workload in
parallel without a global synchronization barrier. The performance metrics are com-

puted on a per-unit-grain basis. The MAD kernels can be used in isolation to perform

85

86

a detailed evaluation of the sensitivity of a shared memory organization to various
memory access parameters; or they can be used in conjunction with the SAD and
BAD kernels, within the hierarchical framework described in Chapter 3, to character-
ize the incremental loss in performance for a given workload resulting from memory

access conﬂicts.

4.1 Preliminary Studies

The performance studies described in this section were designed and aimed at a pre-
liminary investigation of the performance degradation experienced by multiprocessors
as a result of contention for shared memory resources. Three commercial multipro—
cessors were used as the target systems in the study: a 96-node BBN GP1000 (called
BBN-1), a 32-node BBN TC2000 (called BBN—2) and a 24-node Sequent Balance
21000 (called Balance). The architectural features of these systems were described in
Chapter 2. The performance measurements taken were used to quantify two major
sources of overhead in shared memory accesses, namely, non-local access latency and
waits due to access conﬂicts. An analytical model for these overhead factors was
formulated to explain and corroborate the observed behavior [96].

Parallel execution performance degradation in the presence of synchronization
locks were also a subject of these preliminary investigations. The presence of lock-
based mutual exclusion operations introduces two additional sources of runtime over—
heads, namely, locking latency and waits due to lock conﬂicts. The experimental
results from input workloads containing lock-based mutual exclusion operations are
reported in Chapter 5 (Section 5.1). The observed performance losses solely due to
memory access conﬂicts using workloads with no lock accesses are presented in this

section.

4.1.1 Workload Parameters

The unit grain abstraction is used as the fundamental unit of input workload spec-

iﬁcation. A very simple parametric workload model is used to create a variety of

87

program behaviors. A unit grain G is characterized by three attributes; G = (c, m, x).
The attribute c deﬁnes the number of local computational operations, including local
memory access, performed by a process between consecutive accesses to a critical
section (deﬁned to be a unit grain). This parameter controls the computational load
of each processor. Similarly, the attribute m deﬁnes the number of shared data ref—
erences, not mutually-exclusive, made by a process between successive accesses to
critical sections. The attribute a: speciﬁes the amount of time (in p secs) spent by
a process within each critical section. This attribute is speciﬁed as an absolute time
duration to highlight the inﬂuence of critical section length on performance. Since
memory contention overheads are the focus of the study described in this section, a
value of a: = 0 is used.

In addition to the unit grain attributes two more parameters, N and M, are used
to specify global characteristics of the workload. N speciﬁes the number of com-
petitor processes interfering with the execution of any grain whereas M speciﬁes the
number of shared data objects used by the concurrent processes. The M objects are
assumed to be evenly distributed over the available shared memory modules. Thus,
the complete input workload speciﬁcation includes the six parameters (N, M, c, m, 2:).

For notational convenience we deﬁne two derived parameters in terms of the basic
input parameters described above. First, the granularity w = c + m of a program
is deﬁned to be the total number of operations performed between synchronization
points (e.g., critical sections). Second, the shared-access fraction p = m/(c + m) is
the fraction of total operations devoted to shared data accesses.

In each execution of a unit grain, a processor performs to operations, each operation
being a local computation or shared data access in the proportion dictated by p. Each
shared data reference consists of a read followed by a write to the shared data location.
This is done to force the reference to actually go out to shared memory even in the
presence of data caching. If a: 76 0, then the processor acquires a lock and enters the
critical section for a duration of 33;; secs. Only homogenous workloads, with every
participating processor executing an identical copy of the unit grain G, are used in

these preliminary studies. In other words, the workload unit G is “replicated” on all

88

the N + 1 processors involved.

4.1.2 Quantities Measured

For each workload speciﬁed by a set of input parameters, a corresponding set of timing
data that essentially consists of the effective execution time per unit grain, TG(N), is

generated. The two performance metrics computed for each workload are unit grain

efficiency (5) and overhead factor (6) deﬁned as follows:

_ Tam) _ TGUV) - TG(0)
a ) — TG(N) and 9 — TG(0) '

 

Because of the replicated workload used, this deﬁnition of efﬁciency 6 of running a
program on a parallel architecture can also be interpreted as the ratio of the actual
speedup achieved to the ideal speedup achievable on that architecture.

The loss in efficiency is attributable to two key overheads arising from shared
data accesses —— non—local access latency and waiting time due to access conﬂicts.
The ﬁrst kind of overhead is an important factor for a non-uniform organization
of memory hierarchy (NUMA multiprocessors). The second type of overhead is a
result of contention for hardware resources during shared data access. If we denote
the overhead time due to non-local memory latency as 0;, and the overhead due to
hardware contention as 00, then we can rewrite the expression for overhead factor 9

O _ TG(N) — TG(0) _ 01+ 0c
— Tam) _ Tam)

 

 

:ol'i'oc

which gives the normalized overhead components 0; (latency factor) and (96 (con-

tention factor).

4.1.3 Memory Access Overhead Factors

In this section, we formulate a mathematical model to describe the behavior of con—
current unit grain execution and the resulting overheads. To facilitate the brevity of

expression, we deﬁne some basic cycle times that characterize the program execution

89

on each system. All subsequent execution and overhead times will be expressed in

terms of these fundamental time units. Deﬁne

tc 2 avg. time per local computation operation,

ta = avg. time per local memory access,

t, 2 avg. latency per remote memory access,

tw = avg. waiting time per remote memory access due to contention,
t“, = avg. time to execute the lock primitive without contention,

tu; 2 avg. time to execute the unlock primitive without contention.

The time to denotes the basic time required to access a local data object. The
time t; denotes the additional latency component incurred in accessing a remote data
object. In the BBNs, the t; component is non-zero since a remote memory reference
goes out on the interconnection network whereas a local reference does not. Thus, in
the absence of contention, the time for a remote memory access on the BBNs is given
by to + t;. However, in the Balance, the bus latency is subsumed in the basic memory
access time ta since it is an integral component of the memory access time. There is
no additional delay incurred by “remote” references, since local and remote memories
are indistinguishable, thus giving t, = 0. The time tw denotes the additional delay
over and above the components to and t1 caused as a result of contention among
concurrent memory accesses. Note that all the times deﬁned above (except tw) are
constant being the characteristic of the underlying hardware/operating system and
do not depend on the workload. The values of t“, and tug include the overhead of
function call. A comparison of these fundamental unit times for the three systems
under consideration is shown in Table 4.1.

The remaining term tw, however, is dependent on the memory reference pattern
and the communication bandwidth of the interconnection medium. It embodies the
queueing delay experienced by a memory reference that must traverse the intercon-
nection medium to be serviced. This delay arises from the interference between

concurrent memory references at the destination memory module as well as on the

90

Table 4.1. Basic time measurements for the overhead factors model

 

LSystem [I tc(ps) [ta(ps) ] t1(ps) I tug +tu1(ps) [
BBN-l 10.12 2.18 3.42 71.83
BBN-2 1.49 0.71 1.43 28.62
Balance 37.22 10.85 0.00 83.18

 

 

 

 

 

 

 

 

 

network. We need to obtain an expression for tu, that reﬂects its dependence on the
workload. Several earlier works have modeled memory interference for MINs using
Markovian models [18] and probabilistic analysis [19, 101]. Similar work done for

analyzing contention in bus-based systems include [86, 31, 43].

Contention Time on the BBNs

We use the result derived by Patel [101] using probabilistic analysis for Delta networks.
The derived results apply to a p—stage MIN using k x k switching elements. A memory-
access-cycle (MAC) is deﬁned to be the time interval from the initiation of a memory
request to the completion of the request. No distinction is made between read and
write cycles in the analysis. The primary assumptions on which this analysis is based

are as follows.

(i) The memory references generated by each processor are independent of each

other.
( ii ) The memory references are uniformly distributed over all the memory modules.

(iii) All the k” potential processors (since the system consists of a p-stage MIN with

k X k switches) in the system participate in the memory workload creation.

If each processor generates memory requests at the rate of 1' requests per MAC, then

for any input line of a switch in stage-1 of the MIN, the probability

Pr[a request arrives during a MAC] = r

91

The ﬁrst two assumptions are satisﬁed by our performance measurement frame-
work, where r is determined by the processor workload. However, since only 12. = N +1
processors (out of the total capacity It") participate in generating memory requests,
the effective request rate at each stage-1 switch must be changed in assumption (iii).
Assume that any processor in the system could be selected to participate with equal

probability. Now, for any switch input at stage-1, we have

Pr[input is active] = n / k”
Pr[a request arrives during a MAC I input is active] = r
Pr[a request arrives during a MAC | input is not active] = 0

By using Bayes’ formula [39], we obtain

Pr[a request arrives during a MAC] = r = (1%) - r
Thus, 1" becomes the effective request rate at each input line of stage-1 switches. Using
the new effective request rate, the probability PA that an arbitrary memory request

is accepted by the MIN (from [101]) is given by

 

r k” r
p _ _P = _ _P
A r n r
where
73—1 k .
7‘,‘=1—(1— k) androzr (4.1)

We do not have a closed form solution for PA, but plots [101] of PA vs. network size
(k?) indicate that PA decreases logarithmically as the network size increases.

On the BBN system, a memory conﬂict is essentially a conﬂict at the output line
of a switch in the last stage of the MIN. Hence, it is accounted for by the switch
contention analysis. The average number of wasted memory cycles per request, w, is

easily computed by noting that a request that is rejected i times consecutively before

, "m "A, l
V I v
M _

   

92

being accepted waits for i cycles.

to = :t(1-PA)iPA = l—PA n 1‘

 

= _ . _ _ 1
i=0 PA kp ’10
Hence, the average waiting time per request due to contention is
n r
tw= ta t = —-——l to t 4.2
w<+o (kw )(+,) ()

Independent, uniformly distributed references are not, however, an accurate model
in the presence of global locks, even if all non-lock references are uniformly distributed.
Hence, the expression above will not apply accurately in situations with a single
“spike” or hot-spot in the memory reference pattern. The hot spot case is analyzed

later in this chapter in Section 4.3.

Contention Time on the Balance

We use the result derived by Das and Bhuyan [31] using probabilistic analysis for
multiple—bus multiprocessors. The derived results hold for a system with n processors,
M memory modules and B buses. We have adapted the expressions for the special
case of a single bus system (B = 1) such as the Balance. Again, no distinction is
made between read and write cycles as for the BBNs, and the analysis is based on

the following assumptions.

(i) The bus operation is synchronous, i.e., all requests are issued at the beginning

of the bus cycle.

(ii) The bus is circuit-switched, i.e., the bus is held for the entire duration of a

memory access.

(iii) The requests generated during a bus cycle are random and are independent of

each other.

(iv) The requests issued in successive cycles are independent of each other.

93

The fourth assumption is unrealistic because a rejected request will indeed be
resubmitted in the next cycle. However, this assumption leads to simpler analysis,
and it does not result in a substantial difference in the actual results [18]. Let r
be the probability with which a processor generates a request in every bus cycle.
The probability that there is at least one request for a memory module M5, when 12

processors participate, is given (from [31]) by

r n
X-1“(1‘7w‘)

The number of memory services requested in a cycle is a Binomial random variable
with parameters M and X. Hence, the expected number of memory requests received
per cycle is M X . Since only one of these requests can be accepted by the bus, the
probability PA that an arbitrary request is accepted can be written as PA = 1/M X .
Now, using an argument similar to that for the BBNs, the average number of wasted

bus cycles per request, w, can be computed as

l—PA 7‘ n
= Z —1'—' ’—
w PA (M ) M(1 M)

 

Therefore, the average waiting time per request due to contention is

tw = wtbu, = ((M — 1) — M (1 — 1%)”) tbu, (4.3)
where tbu, is the bus cycle time (100 ns for the Balance). Note that all the terms
in Eq. 4.3, except n, are constant for a given system and workload. The value of
n = N +1 varies according to the input parameters speciﬁed. It should also be noted
that since successive memory requests on the Balance are pipelined onto the bus, the

above equation only provides an upper-bound for the contention time.

Overhead Factors

Recall that TG(0) was used as the unit of normalization in the deﬁnition of 9. We

can express Tg(0) = r in terms of the workload parameters and the basic time units

94

Tg(0) = r = ctc+mta+x = w(1—p)tc+wpta+x

Note that we have not added the terms (tuc + tut) in the above expression since
an application with a single process does not need the service of a lock to exclusively
access a shared resource. We can now express the overhead factors in terms of the
time units deﬁned earlier.

Latency Factor.

For the Balance, since there is no distinction between local and remote memory,
there is no additional overhead incurred by remote accesses. The bus latency con-
stituent of the memory access time is subsumed in the basic memory access time ta.
Hence, there is no additional latency overhead, resulting in 9; = 0.

In the case of the BBNs, the latency overhead is contributed by those shared-
data references that are sent out on the interconnection network. Every iteration
contains m references to shared-data, each involving two accesses (read/ write), and
one shared access each for acquiring and releasing the lock. Converting the shared-

memory reference count in terms of time, we obtain

0] (2m + 2)“ 2(wp + 1)t1
01 = — = —— = (4.4)
1' 'r w(1— p)tc + wpta + a:

 

Contention Factor.
The contention overhead is contributed by all shared—data references (in both the
BBNs and the Balance). Using the same arguments as for latency factor above, we

obtain the following result for contention factor for the BBNs as well as the Balance.

Q _ (2m + 2)tC _ 2(wp + 1)tu,

ac = _
r r w(1— p)tc + wpta + a:

 

(4.5)

4.1.4 Experimental Results

During the course of our experimentation, more than two hundred input parameter
sets were tested. Each parameter set was constructed by varying the input parameters

according to one of the workload forms shown in Table 4.2. The range of parameters

95

was selected with the goal of observing the sensitivity of 5 and O to diﬂerent types
of workloads. The data presented in this section are only a few excerpts from the
workloads created to measure pure memory contention characteristics [95] with no
synchronization. Data corresponding to workloads (C and D) with synchronization

in the unit grain are reported in Section 5.1 of Chapter 5.

Table 4.2. Parameter settings for different workload types used in the preliminary
studies

 

[ Workload [ N [ M [ w [ p [ a:(ps) [
A 0 to max 1 100 0.0 to 0.4 0
B 0 to max M = N + 1 100 0.0 to 0.4 0
C 0 to max M = N + 1 100,500 0.0 to 0.4 0 to 100
D max M = N + l 500 0.1 to 0.4 0 to 150

 

 

 

 

 

 

 

 

 

 

Value of max chosen based on the number of processors available.

Workload A

This workload represents an extreme case in that it creates a “hot-spot” memory
access pattern by forcing all processes to continually access a single memory module.
The effect of a hot-spot on the efﬁciency 6 is shown in Figure 4.1 for two different
values of shared-access fraction p.

As can be seen from Figure 4.1, the efﬁciency drops by more than 50% at N = 20,
p = 0.1 for the BBN systems. The performance degradation becomes even more
pronounced as the values of N and p increase. On the other hand, the Balance is not
affected as much by the hot-spot, since the entire shared-memory of Balance forms
one indistinguishable unit. As long as the mean time between requests is greater than
or equal to the memory-access-time ta, there is no contention at the memory module
and the Balance is able to service the requests efﬁciently.

Obviously, the deterioration in execution speed in the case of BBN-1 and BBN-2 is

primarily due to the increasing contention overhead 0c. The expression for to (Eq. 4.2),

96

 

 

 

 

 

I f I a l f
balance (p = 0.1) -e-—
1 __ __ __~ bbn—l (p = 0.1) 'O— .1
bbn-2 (p = 0.1) +—
0 8 balance (p = 0.4) -O— _
’ bbn-l (p = 0.4) -o—
bbn-2 (p = 0.4) 4—
5(N) 0.6 - ~
0.4 - -*
0.2 - -‘
0 1 M 1 l 1 ﬁi
0 10 20 30 40 50 60

No. of competitors (N)

Figure 4.1. Efficiency vs. N (M = 1,0) = 100,1: 2' 0)

however, fails to explain the phenomenon as it is based on the premise that shared
references are uniformly distributed over the memory modules. An explanation for
the observed behavior is found from [105], a communication bandwidth analysis done
for the RP3 system. It shows that the effective bandwidth of the network reduces
drastically in the presence of a memory “hot-spot”. This is true even when the
fraction of total memory references directed at the hot-spot is as low as 1%. The
severe degradation in bandwidth occurs due to the Tree Saturation E'ﬂect described
in Chapter 2, which not only deteriorates the access time for the hot-spot references,

but penalizes other references as well.

Workload B

This workload highlights the hardware overhead characteristics of each architecture.
Figure 4.2 shows the trend in efﬁciency as the number of processors executing con-
currently is varied. Since at = 0 in this case, all the overhead is due to communication
latency and contention in hardware. Clearly, both 01 and 0c depend on the shared-

access fraction p and increase linearly with p as indicated by Eqs. 4.4 and 4.5. For

97

 

(18

{(N) 0.6

balance (p = (LIV-Ae—

 

 

 

 

 

0.4 bbn-l (p = 0.1) 4— -
bbn-2 (p = 0.1) +—
0.2 ' balance (p = 0.4) Q— -
bbn-l (p = 0.4) -0—
0 1 1 1 b - =
0 10 20 30 40 50 60

No. of competitors (N)

Figure 4.2. Efﬁciency vs. N (M = N + 1,00 = 100,3 2 0)

the Balance, 0; = 0 and, again, the loss in efﬁciency is little. Notice that for the
BBNs, 5 drops initially but remains relatively ﬂat for higher values of N. This is due
to the fact that as N increases, the number of memory modules also increase and the
data references get redistributed uniformly over the memory modules. The efﬁciency
curve for BBN-2 drops off faster than the than the corresponding curve for BBN-1.
This can be inferred by examining the expression for tc (Eq. 4.2). The factor (n / kp)
in this equation signiﬁes the fraction of the network capacity that is occupied. For a
given value of N, this factor is larger for the BBN-2 (k = 8,p = 2) than for BBN-1
(k = 4, p = 4), thus yielding a larger value for tC for BBN-2. However, the relatively
ﬂat shape of the curves for higher values of N points to the fact that the systems can
be utilized better by using a larger number of processors to compensate for the loss

in efﬁciency due to latency and contention.

4.2 MAD Workload Parameters

The major consideration in memory system design for multiprocessors is that the

memory bandwidth must match the memory demand of the processors. The effec-

98

tiveness of the memory design in meeting this goal depends not only on the organiza-
tion of the memory hierarchy, but also on the distribution of the shared data in the
hierarchy, the memory reference pattern of the program, and the locality of memory
references. In addition to temporal locality and spatial locality of references, parallel
computing also makes a new type of locality, called processor locality, desirable. To
keep high processor locality, unnecessary interleaving of references by more than one
processor to the same memory data should be avoided.

It is clear that the workload used to evaluate the memory performance can have a
strong inﬂuence on the results. For example, a (perhaps artiﬁcial) workload exhibiting
little or no locality of reference will tend to favor a very simple processor—memory
interconnection network built out of fast, dumb switches over a network with smarter,
slower switches. Hence, the selection of appropriate workloads of interest is of prime

importance to the success of the experimental study.

4.2.1 Unit Grain Characterization

The domain of the parameter space for investigating the shared-memory performance
is prohibitively large. Unfortunately, measurement data about the behavior of real
workloads are scarce. So, it is not possible to make performance comparisons using “a
typical, real workload”. Therefore, we adopt a ﬂexible parametric model of unit grain
characterization that facilitates the exploration of performance over a wide spectrum
of memory access workloads. The attributes selected for the unit grain should help
probe the memory system systematically by creating diverse sets of memory address
streams to determine its sensitivity to the different workload characteristics. These
workloads not only measure the sustained memory bandwidth under different memory
demands, but also highlight potential bottlenecks. The unit grain characterization

selected for this purpose is summarized in Table 4.3.

Characterization of gm:

The shared-memory access granule gm is characterized by a 4—tuple of attributes:

gm = (p, d, s, m). The ﬁrst attribute, p, simply indicates the probability of a shared

99

Table 4.3. Unit grain attributes for studying memory access behavior

 

EGranule] Attribute Meaning [

 

 

 

 

 

 

common N number of competitor processors
M number of shared data elements
p probability of write access to shared memory
gm d initial distance of concurrent address streams
s stride of memory access
m number of shared memory accesses per granule
96 c number of basic computation units (BCUs)
g, (p non—existent

 

 

 

data reference being a write access. In other words, p = 0 implies that all accesses
are reads, and p = 1 implies all accesses to be writes. As mentioned earlier, writes
to shared data by multiple processors are typically performed within critical sections
in a mutually exclusive fashion unless the concurrent writes are guaranteed to be
consistency preserving.

The next attribute, d, determines the initial disposition of the concurrent memory
reference streams emanating from the processes executing in parallel. It denotes
the distance between the starting addresses of shared data access of each processor
expressed as number of shared data elements. In other words, if there are M shared
data elements in all, then the processor P,- begins its string of memory accesses with
element i x d (modulo M). Thus, if the shared data elements are accessed with
regular stride, then the attribute d can be used to stagger the starting addresses of
multiple processors in any desired fashion. For instance, a value of d = 0 causes all
participating processors to begin their shared data access with the 0‘“ element.

The attribute 3 represents the stride of shared data access from one memory ac-
cess to the next, thus deﬁning the spatial distribution of the memory request streams.
By manipulating the access stride, the effect on performance of the mapping strate-
gies used to assign elements of an array to the memory banks at a given hierarchy
can be evaluated. Depending on how the shared data elements are distributed over

the memory hierarchy, using different access strides will cause the memory request

100

transactions to traverse over different components of the processor—to—memory inter-
connection. Figure 4.3 illustrates the use of the attributes d and 3 together to create
a variety of memory access patterns for both one-dimensional and two-dimensional

shared data structures.

M

M

 

 

(a) One-dimensional data access

POP1P2P3 POP1P2P3

   

d=8.s=l d=8,s=9
( Assume column-major storage)
(b) Twodimensional data access

Figure 4.3. Creation of memory access patterns using attributes d and s

101

Finally, the attribute m denotes the number of memory accesses to be performed
within a single memory-access granule. The value of m determines the granularity
of shared data access within a grain. The main purpose of changing this attribute is
to control the density of memory requests, thus highlighting the interaction between

request bursts and idle periods.

Characterization of gc:

Since all the computation within granule gc operates purely on processor private data
out of a private memory space (assumed to be available locally), by our deﬁnition, the
computation granule does not alter the memory interference behavior of the shared
data access stream as it is external to the processor. Its only inﬂuence is setting
the memory access rate and, hence, the temporal distribution of the shared data
references. So we have characterized the computation granule gC by simply a I—tuple
consisting of a delay count: 96 = (c). The attribute c represents the number of
computational steps performed within a unit grain, and is expressed in terms of a
“basic computation unit” (BCU). The basic unit of computation chosen for granule
9c is a simple delay loop with a loop count of 1. Alternate BCUs such as a single

ﬂoating-point computation could be used to highlight the ﬂoating-point performance.

Characterization of gs:

As only the shared memory access performance is of interest here, the null charac-
terization was chosen for the synchronization granule, i.e., g, = ¢. When the MAD
kernels are used in the hierarchical performance framework of Figure 3.5 to measure
the incremental overheads due to memory contention, a non-null characterization of
9. could be used. The handling of a workload with g, 5:5 cl by the MAD kernels is
described in Section 4.4.

Using the individual granule characterizations, the deﬁnition for the unit grain G

can be written as the 3—tuple of tuples.

G = ((p,d,s,m), (C), ‘15)

102

Both homogenous and heterogenous workloads can be created by selecting different

attribute values for G; and Ge.

4.2.2 Output Metrics

The metric used to observe the trends in the memory contention performance of
an input workload, as a function of the degree of interference N, is the unit grain
efficiency {m(N) as deﬁned by Eq. 3.22. A value of {n.(N) = 1 would seem to indicate
that the concurrent memory access streams are independent of each other and do not
encounter any conﬂicts at all. A value of £m(N) < 1 reﬂects signiﬁcant conﬂicts with
the competitor processes leading to extremely high access latencies.

The cumulative memory interference \Ilm(N) can be computed from {m(N) using
Eq. 3.23. Also, from Eq. 3.24, it is known that the incremental memory interference
wm(N) is equal to \Ilm(N) in the case of the MAD kernels. Therefore, we have the
following relationship between the efﬁciency and interference measures.

_1_€m(N)

‘I’mUVl - m— : ¢m(N)-

It should be emphasized that the efﬁciency metric is a measure of the relative
performance of a workload with N competitors as compared to its performance 'with
no competitors. Similarly, the interference metric is also a relative measure in that
it presents the net contention overhead as a fraction of the uncontested unit grain
execution time, i.e., the number of unit grains that could have been processed during
the time lost due to overheads. Thus both the measures are scaled in terms of the
uncontested unit grain time T. The implication of this for two workloads with identical
absolute contention performance (i.e., same net overheads) is that the one with the
larger amount of work per unit grain (i.e., larger 7') will be adjudged as the more

efﬁcient of the two.

103

4.3 Concurrent-Access Workloads

Concurrent-access workloads, with no lock-based synchronization within the unit
grain (i.e., g, = 43), were designed and used [93] to characterize the impact of concur—
rent memory reference patterns on the shared memory performance. The increased
access latencies observed in this case are purely due to access conﬂicts in hardware and
the overhead of maintaining the consistency of replicated data over the memory hier-
archy. The workloads have been employed to measure and compare the performance
of the Sequent Symmetry and the BBN TC2000 systems.

The shared data, with M elements, were allocated using the shmallocO call on
each machine. On the Symmetry, the data elements are interleaved across the memory
modules with a interleaving granularity of 32-bytes. On the TC2000, the shared data
use shared, uncached memory. If the system is conﬁgured with interleaved memory,
then the shared data is interleaved. However, since the current version of the nX
operating system does not support interleaving, the shared data is scattered across
the allocated cluster instead.

We conducted experiments using a number of parameter families. Each family was
designed to measure the effect of a particular grain attribute on the resultant con-
tention and, hence, unit grain efﬁciency. The spectrum of input parameters included
both homogenous and heterogenous settings. The heterogenous parameter families
were particularly useful in revealing the interactions between concurrent read and

write streams, especially on cache-based systems such as the Symmetry.

4.3.1 Homogenous Workloads

In these experiments, the attributes for the test and competitor grains were set to be
identical, i.e., G: = Gc. Thus, the resultant performance degradation when concur-

rent grains with identical execution behavior compete was measured.

104

Spatial Distribution

By manipulating the stride s of shared-data access, and by choosing a value of M
large enough to cause a complete sweep of all the memory modules, the effectiveness
of the interleaving of the main memory system is probed. Changing the value of s, in
effect, creates different spatial distributions of the memory access stream generated
by each process. In Figure 4.4, the efﬁciency of both read and write accesses is shown.
The observed efﬁciency £m(N) of a given workload provides a measure of the potential
increase in the memory bandwidth for that workload by a factor of (N + 1)£m(N). By
examining the input parameters, it can be seen that all processors start their access
from the shared-data element 0 (since d = 0) and perform subsequent accesses with
identical strides. For read access, the Symmetry scales fairly for s = 0, 2. However, for
s 2 4, every access to a shared-data element results in a cache-miss (since the cache
line length is 16) forcing a memory read transaction over the bus. The bus, therefore,
begins to saturate at N = 14. For write access, a stride of 0 causes repeated writes
to the same location by all processes. This results in heavy cache-invalidation trafﬁc
on the bus in addition to severe memory module contention. This is reﬂected by a
steep drop in the efﬁciency of the grain as early as for N = 2. For other stride values,
there is still cache-invalidation trafﬁc on the bus, although not as severe as the s = 0
case, since every process writes to the same data locations in sequence. Hence, the
memory bandwidth saturates (reﬂected by the extremely low value of £m(N)) right
at the outset with N = 6. However, by distributing the writes so that all processes
do not trace the same sequence of addresses, the write performance could be much
improved. The TC2000 scales well for both reads and writes for all strides except
3 = 0, which is effectively a hot-spot scenario [119].

The static characterization (R00, f1/2) of the memory access performance for var-
ious stride values appears in Table 4.4. The parameter C1/2 = 0 since there is no
synchronization granule present in the concurrent-access workloads. The access pat-
terns selected have the attribute d = 0 implying that all processors start with the ﬁrst
shared data element and trace the exact same sequence of elements in their respective

memory reference streams (except in the random stride case). The standard unit of

105

 

 

 

 

 

 

 

 

 

Read Access Write Access
I I I I I

l 1
0.8 0.8
{...(N) 0.6 £m(N)0.6
0.4 0.4
0.2 0.2

0 L 1 1 1 1 0 ‘

0 4 8 12 16 20 0 4 8 12 16 20
No. of competitors (N) No. of competitors (N)

(a) Sequent Symmetry

 

 

        
  

 

 

 

 

 

 

Read Access Write Access
I I I T I T I I I I I I I
l - l
0.8 - s = 0 -><—-[ 0.8
s = 2 +-
_ s = 4 -°—s
§m(N) 0.6 s = 8 Q_ Em(N)0.6
s = 16 -°—
0.4 f- 3 ___ 23 _._'* 0.4
0.2 *- ‘ 0.2
0 l l I l I l l l 0 I l I l l l l 1
048121620242832 048121620242832
No. of competitors (N) No. of competitors (N)

(b) BBN TC2000
N, M = 128K, G, = Gc(p=0/1,d =0,§,m =1,c=0)

Figure 4.4. Effect of spatial distribution of memory access stream on performance

106

computation chosen for granule 9c is a simple delay loop with a loop count of 1. The
much higher value of the parameter R00 for the BBN TC2000 is a consequence of its
RISC instructions compared to the CISC instructions of the Sequent Symmetry as

well as its faster clock rate.

Table 4.4. Static characterization parameters for a homogenous workload with

M =128K,G¢ = Gc = (gm = (0/1,0,.§’,1),gc = ¢,g, = (1)).

 

 

 

 

 

 

 

 

 

 

 

Stride of Sequent Symmetry BBN TC2000
Access R00 = 0.6 x 106/second R00 = 4.9 x 106/second
Memory Read Memory Write Memory Read Memory Write
(3) f1/2 f1/2 f1/2 f1/2
0 0.052 0.066 11.45 11.71
1 0.288 0.150 11.47 10.75
2 0.520 0.432 11.48 10.75
3 0.777 0.753 11.48 10.76
4 1.002 1.060 11.49 10.76
6 1.012 1.053 11.50 10.76
8 1.037 1.076 11.50 10.77
16 1.032 1.083 11.53 10.80
23 1.030 1.089 11.56 10.83
random 1.267 1.295 13.31 12.68

 

 

The f; /2 parameter for all access strides is much higher for the BBN TC2000
pointing to the fact that there is a large disparity between the computation and
shared memory access speeds on that system. Another interpretation of this fact is
that for a given target rate of computation, a much larger computational granularity
per shared data access is necessary on the TC2000 as compared to the Symmetry.
Also noticeable in Table 4.4 is the fact that f1/2 is relatively insensitive to the stride
of access 3 on the TC2000. This is a consequence of the absence of data caching thus
necessitating a majority of the data accesses to go out over the network incurring
the worst—case latency. On the other hand, the parameter f1/2 on the Symmetry is

relatively lower for s = 0, 1, 2, 3 than for higher values of s. This is as a result of some

107

of the data accesses being satisﬁed by the cache for s < 4. For 3 2 4, every access

results in a cache-miss as the cache line size is 16 bytes on the Symmetry.

Sequent Symmetry

 

    

 

 

 

0.4 -
N =10 o
0.2‘ N = 20 o -
0 l L l l
0 2 4 6 8 10

No. of computation steps (C)

67, M = 128K, G, = Gc(p= 0/1,d= 2,3 =16,m =1,c= 0)

Figure 4.5. Effect of temporal distribution of memory access stream on performance

Temporal Distribution

The variation of the density of memory requests of each processor is accomplished by
altering the number of computation steps performed within the computation granule
gc. This corresponds to a shared memory access followed by a subsequent interval
of c units of delay with no memory access. Figure 4.5 shows the improvement in
unit grain efﬁciency that is achieved as a consequence of increasing the length of ye
on the Symmetry. The effect is particularly striking for write operations, since the
intervening computational delay without any bus accesses provides sufﬁcient time for

the cache-invalidation trafﬁc on the bus to reach quiescence.

108

Memory Hot Spot

The interference proﬁles generated by setting M = 1 is indicative of the performance
of the execution grain under severe contention (hot spot) conditions. In these ex-
periments, the processors not only contend for the global interconnection network,
but also for a single shared-data item. This performance is depicted in Figure 4.6.
The write performance on the Symmetry degrades severely. The reads on Symme-
try cache the shared-data item on the ﬁrst access, and operate out of the cache
on subsequent accesses, thus exhibiting no degradation. However, writes to a sin-
gle location by multiple processors cause the shared location to bounce between the
processor caches (ping-pong effect) thus generating an overwhelming amount of cache-
invalidation trafﬁc causing bus saturation. This is apparent from the extremely low

value of {m = 0.025 with just 3 processors executing concurrently, i.e., N = 2.

 

 

 

 

 

 

 

 

 

(a) Sequent Symmetry (b) BBN TC2000
I I I T I I I I I I I I
1 1 read 4*“
write +-
0.8 0.8 -
£m(N)0.6 {m(N)0.6 .
0.4 0.4 -
0.2 0.2 -
0 I 0 1 1 1 1 1 1 1 1
0 4 8 12 16 20 0 4 8 12 16 20 24 28 32
No. of competitors (N) No. of competitors (N)

1x7, M: 1, G. =Gc(p=0/1,d=0,s =0,m =1,c=0)

Figure 4.6. Effect of contention for a memory location (hot-spot) on performance

On the TC2000 both reads and writes exhibit a severe bottleneck. To analyze

the performance of hot-spot accesses, we resort to the expression for the maximum

109

 

 

 

 

 

 

 

 

 

 

(a) Sequent Symmetry (b) BBN TC2000
I I r I
N = 10 '9—
1 .- N = 20 +— . 1
0.8 0.8
5,..(N) 0.6 {m(N) 0.6
0.4 0.4
0.2 0.2
0 1 1 1 1 0 7 J 1 1 1 1 1 1 1 1
0 2 4 6 8 10 0 10 20 30 40 50 60 70 80 90100

No. of computation steps (C) No. of computation steps (c)

N,M=1,Gt=Gc(p=1,d=0,s=0,m=1,5)

Figure 4.7. Effect of length of computation on hot-spot write performance

network throughput per processor as derived in [105]. The asymptotic maximum

value rm” of per-processor network throughput as determined by the hot spot access

request rate is given by
1

rm” =1+h(P—1)

 

(4.6)

where P is the number of processors (it is assumed that there are an equal number
of memories), r is the number of network packets emitted per processor per switch
cycle (0 S r S 1), and h is the fraction of memory references directed at the hot spot
(i.e., each processor emits packets directed at the hot spot at a total rate of rh).
Using the unit grain attributes, the net memory request rate per processor is given
by r’ = m / r. If tm denotes the network switch cycle time, then the memory request
rate per processor per switch cycle becomes r = mtw/r. For the workload shown
in Figure 4.6(b), P = N + 1 and all accesses are to the hot spot making h = 1.
The maximum per-processor request rate, therefore, is limited to rm” = 1/ (N + 1)

using Eq. 4.6. In other words, the following constraint should be satisﬁed to prevent

110

network saturation.

mtm 1 r
<——— => N l<
r -N+1 + "mt“,

 

r:

 

(4.7)

Since m = 1 and r = tm (tm being the average memory access time) for the workload
of Figure 4.6(b), the limiting value of N is given by N + l S tm/tm. As can be seen
from the ﬁgure, the network begins to be saturated at N = 18.

Figure 4.7 shows the improvement in the efﬁciency of writes to a hot spot resulting
from an increase in the length of computation c within a unit grain. The increased
computation time on the Symmetry allows the cache-invalidation trafﬁc to subside
between consecutive hot spot writes. On the TC2000, increasing c results in a larger
value of r = etc + mtm in Eq. 4.7 thus increasing the limiting value of N at which

network saturation sets in.

Size of Shared Data

By manipulating the size M of shared data, all memory references on the Symmetry
can be kept in the cache, or made to ﬂush cache on each pass through the shared-
data. The TC2000, on the other hand, does not cache shared data. However, varying
the shared-data size on the TC2000 revealed some interesting facts. The efﬁciency
{m was observed to behave identically for values of M from 1 through 4. Progressive
improvement in (m was observed for each increment of 4 in the value of M (Figure 4.8).
This would imply that the scattering of shared-data by the system across cluster
memory modules was done in chunks of 4 elements (i.e., 16 bytes). Thus, going from
M = 4 to M = 16 (and so forth) increases the number of memory modules, for which

the processors contend, from 1 to 4 (and so forth) leading to a decrease in contention.

Random Memory Access

Most multiprocessor memory organizations are designed to use special techniques
(such as memory interleaving, skewing) to maximize the performance of uniform

memory-access patterns. But the performance of the memory hierarchy under condi-

111

 

 

 

 

BBN TC2000

I I I I I I I I
1 ~_ _ ‘ -
0.8 - -1
{m(N)o.6 - -
04 [- c : r _

M = 4 -e— C : 3

0.2 ~M = 16 .._ _ , , -1

M .._.. 64 +_ ‘ ' ' : : : : a

0 = 256 +1" 1 1 1 1 1 1

0 4 8 12 16 20 24 28 32
No. of competitors (N)

-o

N, 112,0, = Gc(p=0,d=1,s =1,m =1,c=0)

Figure 4.8. Effect of shared-data size on read performance

tions that do not display such uniformities in memory access is also of interest. So,
we measured the memory bandwidth under random access conditions, expressed as
Words Accessed Randomly Per Second (WARPS), to quantify this performance. This
is done using a homogenous workload consisting of only memory-access granules gm
and varying its stride attribute s randomly. The results of these tests are presented
in Figure 4.9. The read and write performance on the TC2000 are comparable and
appear to scale reasonably with the number of processors. The read performance on
the Symmetry scales (for the number of processors used in the experiment), but the
writes begin to show saturation at around 13 processors. This difference is, again,
due to the extra cache-invalidation trafﬁc injected into the bus as a result of writes

to shared—data.

4.3.2 Heterogenous Workloads

Using a heterogenous workload, we have investigated the interactions that occur be-
tween concurrent read and write memory access streams. In particular, we demon-

strate using the following two scenarios:

112

 

 

 

 

in: Same Dunc 13px 17px 211m:

Irina-m.

 

Figure 4.9. Random access performance expressed in MegaWARPS

(a) Case 1: the test grain performs read (write) accesses to shared data with uni-
form stride, while the competitor grain performs write (read) accesses with

random stride,

(b) Case 2: the test grain performs read (write) accesses to shared data with uni-
form stride, while the competitor grain performs write (read) accesses to a single

shared (hot spot) location.

The grain efﬁciencies for both these cases is shown in Figure 4.10.

Performance on the Symmetry, when G, performs read accesses, steadily deterio-
rates. It is markedly worse for Case 2 due to the heavy invalidation trafﬁc generated
by the competitor grains while repeatedly writing to one shared location. When Gt
executes write accesses on the Symmetry, Case 2 corresponds to the competitor pro-
cessors operating out of their private caches thus causing no bus trafﬁc and memory
contention. Hence, virtually no degradation is experienced by the test grain. The
interference from the competitor grains on the TC2000 is fairly small in both cases,
owing to the much higher bandwidth of the multistage network and the non-blocking
switches used.

The improvement in execution efﬁciency of G, on the Symmetry, for Case 2 above,
as a result of introducing computational delay is shown in Figure 4.11. Again, the

cache-invalidation trafﬁc on the bus reaches quiescence during the computational

0.8
§m(N) 0.6
0.4

0.2

0.8
{m(N) 0.6
0.4
0.2

0

Test grain: READ

 

I I I I I

   

Casel -9-‘
Case2 *—

I

 

 

l l l l l

 

0 4 8 12 16
No. of competitors (N)

20

113

0.8
5,...(N) 0.6
0.4

0.2

(a) Sequent Symmetry

Test grain: READ

 

I I I j I I I I

I
l

” Casel -6—-
Case2 *—

 

 

l l l l l l l l

 

0 4 8121620242832
No. of competitors (N)

0.8
§m(N) 0.6
0.4
0.2

0

(b) BBN TC2000
N, M = 128K, Gt 91$ Gc (described in text)

 

Test grain: WRITE

 

I I I I I

 

 

 

- Casel '9-u
Case2 *—

l l l l l

0 4 8 12 16 20

No. of competitors (N)

Test grain: WRITE

I I I I I

 

ITT

I
l

- Casel +-
Case2 -°—
L L l l l

0 4 8121620242832
No. of competitors (N)

 

 

l l I

 

Figure 4.10. Interaction between read and write memory-access streams

114

Sequent Symmetry
I I I I I f I I I

 

 

 

 

1'- : =¢:?"q—.__:==
0.8b d
§m(N)o.6- -
04'- N=5-6— -4
N=10-°—
02 N=15‘*_ .1
1 N=20'°—
01 I I I I I J i J

0 10 20 30 40 50 60 70 80 90 100
No. of computation steps (c)

1?, M: 128K, G.(p=o,d=0,s =4,m =1,c=0),Gc(p=1,d=0,s =0,m=1,é)

Figure 4.11. Effect of length of computation on interference between read and write
streams

delay, resulting in faster execution time. The amount of computation necessary for
a given N to restore the execution efﬁciency to a requisite level can be determined
from this characterization graph. For example, a value of c = 20 is needed with 10
competitors to reach an efﬁciency of 0.9, whereas a value of c = 50 is needed with 20

competitors to reach the same level of efﬁciency.

4.4 Dual-Mode Access Workloads

Workloads consisting of concurrent accesses to shared data (granule gm 94 (p) as well
as exclusive access to shared data within critical sections (granule g, aé gt) can be used
to characterize the combined degradation of performance resulting from memory and
lock contention. The MAD kernels, for such dual-mode access workloads, measure
the incremental overhead (and therefore incremental interference) resulting from the
dynamic nature of pure memory access conﬂicts. The overheads arising from the

locking semantics of the critical section access is precluded from the measured per-

115

formance degradation by transforming the shared lock variable in 9, into private lock
variable and replicating it into each processor’s local memory during the execution
of the MAD kernels. This leaves the memory contention behavior for shared data
accesses intact, but eliminates the performance losses due to lock contention (which
depends upon the implementation of the locking primitives) and queuing delay for
mutually-exclusive critical section access. The lock contention and queuing delay
characteristics are measured by the SAD kernels.

The incremental interference charcterization studies, including both memory and

lock interference, for dual-mode access workloads are presented in Chapter 5.

4.5 Summary

The performance of the shared memory organization of a multiprocessor depends not
only on the characteristics of the memory hierarchy itself, but also upon the character-
istics of the memory address streams and the interaction between the two. The MAD
kernels described in this chapter provide an effective testbed for characterizing the
shared memory performance for a variety of memory access workloads. These kernels
were employed to measure and compare the performance of the Sequent Symmetry
and the BBN TC2000 multiprocessors.

The static characterization parameter R00 for the TC2000 was much higher than
the Symmetery on account of its simpler RISC instruction set and faster clock rate.
With the shared data uniformly distributed over the available memory modules, the
static parameter f1 /2 was insensitive to the stride of data access on the TC2000 in the
absence of caching. However, on the Symmetry, f1/2 was related to the proportion of
the data references satisﬁed by the cache for a given stride of access. The Symmetry,
being a bus-based machine, displayed limited scalability in memory performance due
to the bandwidth saturation of the bus. The onset of saturation was much faster
when writes to shared-data were performed due to the additional cache-invalidation
and write-back trafﬁc on the bus. The degradation in performance was most severe

when continuous writes to a single shared location were performed. On the other

116

band, the TC2000 with a multistage network interconnection, was more tolerant
to increasing bandwidth demands from the concurrent grains and displayed better
scalability as long as the shared-data was distributed relatively evenly across the
available memory modules. Performance degradation in the presence of memory hot-
spots was quite severe for reads and writes alike. The read and write performance
were always comparable on the TC2000.

The MAD kernels can be used either independently to perform a detailed eval-
uation of the sensitivity of a shared memory organization to various memory access
parameters; or they can be used in conjunction with the SAD and BAD kernels to
isolate the incremental overhead contribution of memory access conﬂicts from the to-
tal performance loss experienced by an input workload. The MAD kernels have also
been used at Oak Ridge National Laboratory to perform a preliminary investigation
[36] of the memory access performance of the new KSRl multiprocessor from Kendall

Square Research.

CHAPTER 5

SAD KERNELS AND
SYNCHRONIZATION
PERFORMANCE

On shared-memory machines, processors communicate by sharing data structures. To
ensure the consistency of shared data structures, processors perform simple operations
by using hardware-supported atomic primitives, and coordinate complex operations
by using synchronization constructs and conventions to protect against overlap of
conﬂicting operations. Inter-processor synchronization can become a signiﬁcant per-
formance limiting factor on large-scale multiprocessors. For the class of asynchronous
multi-phase algorithms considered in this dissertation, the most prevalent form of
synchronization construct used within a phase is the critical section that must be ac-
cessed in a mutually-exclusive manner. Entry into critical sections is usually guarded
by spin locks and may be executed an enormous number of times in the course of a
computation. Quantitative assessment of the synchronization performance of a com-
bination of given workload and spin-lock implementation provides valuable insight
into the scalability of the synchronization technique to large-scale multiprocessors.
The critical factors affecting spin lock performance and the various design imple-
mentations commonly used have been discussed in Chapter 2. The impact of critical

section synchronization and the spin lock implementation used on the overall perfor-

117

118

mance of a workload is our focus in this chapter. The SAD kernels and the related
framework are presented as an effective testbed to characterize the synchronization
performance of a multiprocessor for a variety of workloads and spin lock implementa-
tions. The SAD kernels can be used in isolation to evaluate the sensitivity of a chosen
synchronization method to various workload parameters; or they can be used in con—
junction with the MAD and BAD kernels, as per the hierarchical model presented in
Chapter 3, to characterize the incremental loss in performance for a given workload

resulting from synchronization overheads.

5.1 Preliminary Studies

The performance studies described in this section are a part of the same suite of
preliminary studies described in Section 4.1. The results presented here describe the
parallel execution performance degradation in the presence of synchronization locks.
Besides the latency and contention overhead factors arising due to memory contention
described in Section 4.1, the presence of lock-based mutual exclusion operations in-
troduces two additional sources of runtime overheads, namely, locking latency and
waits due to lock conﬂicts. Developing a model for the lock related overheads and
measuring them for an input workload is the subject of this investigation.

The parameters used to specify input workloads are (N, M, c,m, :c), which have
the same semantics as described in Section 4.1. However, a: 74 0 for the workloads used
in these studies. An identical copy of the generic program based on these parameters,
whose structure is illustrated in Figure 5.1, is executed by each processor. The LOCK
and UNLOCK routines were implemented by us using the low-level locking primitives
provided on each system. Furthermore, the LOCK routine was instrumented to count
the amount of delay incurred by the invoking processor before acquiring the lock. This
data was used to compile the total queuing delay encountered by a workload due to
lock contention. The two performance metrics computed for each workload are unit

grain eﬁ‘iciency (5) and overhead factor (6) as before.

 

 

 

       

smarts-sagas“

':‘-§:::E=é:§-&c&fsozx ‘. . . . .. .......... sassitiaéé

   

 

Figure 5.1. Generic structure of program executed by every processor

5.1.1 Synchronization Overhead Factors

In addition to the memory access overhead factors deﬁned in Section 4.1, the loss
in workload efﬁciency also includes the lock related factors, namely, the software
overhead of executing the LOCK/ UNLOCK routines and the queuing delay due to
lock contention. If we denote the software execution overhead time 0,, and the
queuing delay due to lock contention as Oq, then the expression for the total overhead

factor 9 can now be written as

6 _ TG(N) -TG(0) _ 01+0c+0.+0q
— Tam) — Tam)

 

 

=01+0c+03+0q

which gives the two new normalized overhead components 0, (software factor) and

0,, (lock factor). Using the deﬁnitions of 6 and 9, it can easily be veriﬁed that

1 1
1+6 —1+(01+9.+0,+0.)

 

6 = (5-1)

which provides an indication of the trend in efﬁciency 6 as the overhead factor 9

varies.

120

Software Factor.
The pure software overhead arising out of a call to the LOCK and UNLOCK

routines is a constant for a given system and a given implementation of these routines.

O. in: + M

08 = — =
r w(1-p)tc+wpta+:c

 

(5.2)

Lock Factor.

This overhead arises from the contention for a global shared lock and the conse-
quent queueing delay to acquire the lock. Let q denote the probability that at any
instant of time, a process P,- is executing in region-II of Figure 5.1 (in the absence
of any lock contention). Note that a process in region-II could be in one of three
possible states: waiting to acquire the lock, executing in the critical section or trying
to release the lock. We can express the probability q as the proportion of the iteration

time spent by process P,- in region-II.

q _ a: + 2tw + (tn: + tul)
w(1— p)t. + 2(wp +1)(t. + t: + t...) + x + (ta. + tuz)

 

(5.3)

Since the workload of all the concurrent processes in our model is identical, the
probability q is the same for all of them. Now, let W be the number of processes
already in region-II when process P.- arrives at region-II. It is clear that W is a
Binomial random variable with parameters N and q, i.e., W ~ B(N,q), since there
are N other processors contending for the critical section. Hence, the expected number

of processes in region-II when P,- arrives is given by
E [W] = N q

As the implementation of our locking protocol assigns the lock to processes in a
F CFS fashion, the process P.- must wait for E [W] processes before it can acquire the

lock. Thus, the average waiting time for the lock is given by

Oq = E[W] ' (.15 + tw +tu1)

121
We can now express the lock factor as

0 _ Oq _ Nq($+tw+tu1)
q — T(l) _ w(l—p)tc+wpta+a:

 

(5.4)

5.1.2 Experimental Results

Once again, the workloads were created as per the parameter variations shown in
Table 4.2. The performance data presented in this section correspond to the workload

types B, C and D that include a non-empty critical section (:6 74 0) in the unit grain.

 

I I I I I I

   
    
  
 

—‘
A d
v v

0-8 balance (a: = Ops) -e— ..

bbn-l (a: = Ops) ..—

 

 

 

 

bbn-2 (:1: = 0118) 4‘— -
{(N) 0.6 A balance (x = 50,13) *-
bbn-l (x = 5011.9) '0- 1
0.4 .. bbn—2 (a: = 501”) "‘—
o.2 - d
0 1 l l g l FT
0 10 20 30 4° 5° 60

No. of competitors (N)

Figure 5.2. Efﬁciency vs. N (M = N + 1,w = 100,p = 0)

Workload B

Figure 5.2 illustrates the effect of introducing synchronization points into the program
workload, where the synchronization occurs through a globally shared lock. Notice
that even in the absence of any other shared-data reference (p = 0), the efﬁciency

drops by more than 50% in the BBN-1 and the BBN-2. This, once again, vindicates

122

the existence of the hot—spot problem on the BBNs — the shared lock being the
hot-spot site in this case. The dominant contributon to total overhead was found to
be from 0,,. From Eq. 5.4, it can be seen that 0,, increases linearly with N, but the
bulk of the delay in the expression emanates from tw for a hot-spot lock reference.

The Balance, once again, does not suffer a signiﬁcant loss in eﬂiciency from the the

globally shared lock.

Workload C

The size of critical sections in parallel programs is usually kept small to alleviate the
queueing delays at the critical section entry points. Since critical sections introduce
serialization bottlenecks into an otherwise parallel program, the granularity of the
computation performed in parallel between these synchronization points must be
appropriately selected to compensate for the synchronization overhead. Otherwise,

the effective speedup gained from parallelization is sacriﬁced.

 

I l I
balance (a: = 100) -e—

1 _ bbn—l (w = 100) -O— _
\ bbn-2 (w = 100) +—
0 8 ' balance (w = 500) -O—
' o bbn—l (w = 500) -o— -

bbn—2 (w = 500) +-

 

 

 

£(N) 0.6 - s
0.4 r _
0.2 - 1

0 1 1 1 1 1 1

0 10 20 30 4o 50 60

No. of competitors (N)

Figure 5.3. Efficiency vs. N (M = N +1,p = 0.1, :1: = 30ps)

Figure 5.3 shows how efﬁciency is affected when the program granularity is changed

123

from w = 100 to w = 500. As can be seen from the graph, the efﬁciency improves for
all the three systems when granularity is increased, keeping other parameters ﬁxed.
At N = 20, the increase in efﬁciency is approximately 24% for the Balance, 48%
for the BBN-1 and 36% for the BBN-2. A key reason for this improvement can be
ascribed to the fact that process executions get staggered in region-I (Figure 5.1),
thus reducing the probability that the arrival of two processes at the critical section
coincide. Examining Eq. 5.3 for this probability q, it can be seen that an increase in w

increases the denominator thus yielding a smaller value of q. That, in turn, produces

a smaller 9,, in Eq. 5.4.

 

  
 
 

 

 

 

I I M I
balance (p= 0. 0) -e—
1 ‘ bbn-l (p= 0.0) -o— -
Q , bbn-2 (p= 0. 0) +—
08 balance(p=0.2)-O—
' bbn-l (p = 0. 2) -o— ‘
bbn- 2 (p = 0. 2) +—
£(N) 0-6 balance (p = 0. 4) -O- ‘
61111-1 (p = 0. 4) ..—
0.4 bbn-2 (p = 0. 4) + _
002 \ q
0 1 1 1 1 1
0 10 20 30 40 50 60

No. of competitors (N)

Figure 5.4. Efﬁciency vs. N (M = N + 1,1» = 100,:6 = 100ps)

Figure 5.4 illustrates the loss in efﬁciency due to increased contention as N in-
creases for three different values of shared-access fraction p. In the case of the Balance,
increasing p leads to greater contention for the bus bandwidth, thus yielding a higher
value of the contention factor 911- Hence, a steady decrease in efﬁciency is observed
as p is increased. In the BBNs, however, the additional deterioration in efﬁciency by

increasing the value of p from zero to a positive quantity is not so striking. This,

124

once again, points to the fact that the performance degradation due to the shared

lock hot-spot when p = 0.0 still remains the dominant cause for overhead at p = 0.2

and p = 0.4.

 

I I r
balance (:1: = Ops) -e—

bbn—l (a: = Ops) -o— -

bbn-2

. = i i :1: = 0ps) +—
' balance (a: = 30ps) -o— _

bbn—l (:1: = 30ps) +—
alance :1: = 100psj -o— ‘

  
   
   
  
 

0.8

 

 

 

 

 

{(N) 0-6 b (
bbn-l (:1: = 100ps) -o—
0.4 - bbn-2 (a: = 100ps) -11-— -
0.2 - \ -
"\ \
0 I I L I I I
0 10 20 30 40 50 60

No. of competitors (N)

Figure 5.5. Efﬁciency vs. N (M = N + 1,w = 100,p = 0.3)

The inﬂuence of critical section length on the overall efﬁciency of a program work-
load is plotted in Figure 5.5. The efﬁciency suffers on all the three systems as the
critical section length .r is increased. Increasing the value of :1: results in a process
having to wait for the shared lock for a longer time on the average, as indicated by
Eq. 5.4. However, in the BBNs, the extent of loss in efﬁciency in going from :1: = 0
to a: = 30 is far more signiﬁcant than that from a: = 30 to :1: = 100. An increase in 0,,
proportional to :r, as predicted by Eq. 5.4, does not explain this non-uniformity. The
additional overhead, that causes this non-uniform behavior, is due to the introduction

of a memory hot-spot at the site of the shared lock for the critical section.

125

Workload D

This workload was designed to study the effect of the input parameters on the indi-
vidual overhead components. Figure 5.6 plots the individual overhead factors for the
three systems as the degree of concurrency (N) is varied under a ﬁxed shared-access
fraction (p) and critical section length (2:). As explained earlier, the software over-
head is a ﬁxed and constant quantity. The latency factor 01 also remains ﬁxed here
as it depends only on the proportion of shared accesses p. The 0c and 0,, components
increase steadily with n for all the three systems, as predicted by Eqs. 4.2 and 5.4.

For small critical section lengths, a process spends a greater proportion of its time
in region-I of Figure 5.1 and, hence, the execution proﬁle of the concurrent processes
gets evenly distributed in region-I. However, as :1: increases, the lock factor 0,, begins
to dominate as shown in Figure 5.7. This is an outcome of the two-fold effect that the
critical section duration has on 0,, in Eq. 5.4. An increase in the length of the critical
section not only increases the a: term, but also leads to an increase in the probability
q. In fact, as the hardware technology gets faster (i.e., tc, ta and t1 become smaller),
the value of q increases even more for a given computational granularity w (Eq. 5.3),
further accentuating the 0., component. This fact is apparent from the 0,, curve for
BBN-2 which uses a faster technology. To compensate for the decrease in tc, to and
t1, the computation granularity w must be increased to prevent an increase in the
value of q. On the Balance, the unit times tC and to are very large causing the term
w(l — p)tc + 2(wp + 1)(ta + tw) in the denominator of Eq. 5.3 to overwhelm :1: in the
range under consideration. This results in an extremely small value of q thus making
9., negligible. The inﬂuence of a: on 0c is only in as much as the creation of a hot-spot
effect at the global lock on the BBNs.

Figure 5.8 shows the individual overhead components on the three systems as a
function of the shared-access fraction p. Observe that the lock factor, 9,, is the largest
overhead component on the BBNs, whereas the contention factor, 96, is the largest on
the Balance. The presence of a separate dedicated bus for shared lock access in the
Balance segregates the contention for lock access from those for other shared-memory

access. Increased number of shared-memory accesses, as dictated by increasing p,

126

(a) Balance (b) BBN-2

I I I I I I I I I I

 

 

 

 

 

 

 

   

 

0.04 '- -‘ -‘
-]
0.03 [- '-
0.02 _.
0.01
0 0. L
0 5 10 15 20 25 0 5 10 15 20 25
No. of competitors (N) No. of competitors (N)
(c) BBN-l
I I I I r n

 

 

 

 

 

0 10 20 30 40 50
No. of competitors (N)

Figure 5.6. Overhead components vs. N (M = N + 1,6) = 500, p = 0.1, :c = 30ps)

127

leads to greater contention for the system bus bandwidth and a consequent increase
in 0c. Changing the value of p also changes the fraction of time spent by a process in
region-I (Figure 5.1). The exact nature of this change on any system is governed by
the relative measures of tc and ta+t1 on that system. Also, note that the normalization
factor TG(0), too, depends on the value of p. If tc > ta-l-th then increasing p results in
a smaller proportion of time in region—I and a smaller value of TG(0). The results are
just the opposite if tc < ta + t1. Hence, the interpretation of the plots in Figure 5.8

is closely related to the ratio of the computational to memory-access speeds of the

individual systems.

 

    
   
 

 

 

 

 

 

 

  

 

 

 

 

I I F I I
4.0 - .1
3'0 ' 611-2 (N = 20)
0
q 2.0 r ‘l
bn-l (N = 60)
1.0 .1
0.0 ‘71: 3]: I ibalancedN = 20) :1:
0 30 60 90 120 150
Critical section length (:L'ps)
I I I I I
0.5 '- .1
0.4 -
0c -
i 1 * balalnce (N =40)
0 30 60 90 120 150

Critical section length (mp3)

Figure 5.7. Overhead components vs. a: (M = N + 1,w = 500,p = 0.1)

All the performance ﬁgures presented so far have been normalized quantities.
However, in order to provide a feel for the absolute speed of each system, we also
enumerate some real execution times. Table 5.1 shows the unnormalized values for
execution times as p varies with a ﬁxed parameter setting of M = N + 1 = 20,
w = 500 and a: = 50ps. It immediately reveals that the BBN-2 has the fastest and
the Balance has the slowest execution times of the three systems. Table 5.2 documents
the unnormalized overhead times corresponding to the same workload as represented

in Table 5.1. The software overhead time, 0. = tug + tuz, is not included in this table.

128

It is a ﬁxed quantity for a given system and can be found from Table 4.1.

Table 5.1. Actual execution times (M = N + 1,1» = 500,1: = 50ps)

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

BBN-1 (N = 60) BBN-2 (N = 20) Balance (N = 20)
P TG(0) TGUV) 6 TG(0) TGUV) E Tam) TG(N) E
(#8) (#8) (#8) (#8) (#8) (#8)
0.1 4627.6 11289.6 0.41 789.7 2115.2 0.37 17748.0 18303.5 0.97
0.2 4208.9 11756.3 0.36 770.8 2199.8 0.35 16086.2 16834.0 0.95
0.3 3815.3 11843.3 0.32 766.1 2256.3 0.34 14494.5 15287.5 0.95
0.4 3455.3 11906.2 0.29 744.9 2273.7 0.33 12984.4 13849.3 0.94
Table 5.2. Actual overhead times (M = N + 1,6) = 500, :1: = 50ps)
BBN-1 (N = 60) BBN-2 (N = 20) Balance (N = 20)
p 01 0c Oq 01 0c 0,, 0: 06 09
(#8) (#8) (#8) (#8) (#8) (#8) (#8) (#8) (#8)
0.1 371.2 1040.4 5178.3 138.2 114.5 1043.2 0.0 367.6 104.8
0.2 742.4 1357.2 5375.8 276.4 221.2 901.7 0.0 537.4 127.3
0.3 1113.6 1605.7 5236.6 414.6 170.5 875.5 0.0 567.9 142.3
0.4 1484.8 1888.9 5005.2 552.8 238.7 707.7 0.0 661.9 119.8

 

 

 

129

(a) Balance (N = 20)

 

 

 

 

 

 

 

 

 

 

 

 

 

I I I I
0.10 -' ..
0.08 - ..
O hd 9 .1
fac‘tordlo6 _ 0C
0.04 "’ // -
0.02 '- 0 -
ﬁ— 9
0.00 #54:. - z Lia—_—
0 0.1 0.2 0.3 0.4
Shared-access fraction (p)
(b) BBN-1 (N = 60)
I I I I
3.0 - .1
2.5 - 9 -
Ovhd 2-0 ’ ‘
1.0 - -*
0'5 7 W31 "
0.0 - * 4 9.
0 0.1 0.2 0.3 0.4
Shared-access fraction (p)
(c) BBN-2 (N = 20)
I I I I
2.5 " ..
2.0 ' U )1 AG "
Ovhd 1 5 - M -‘
factors '

1.0 - MN), _

0.5 - .1
44% 0c
0.0 i A 4 ar—

0 0.1 0.2 0.3 0.4

Shared-access fraction (p)

 

 

 

Figure 5.8. Overhead components vs. p (M = N + 1,1» = 500, :1: = 50ps)

130

5.2 SAD Workload Parameters

Although the performance model presented in Chapter 3 can be adapted to evalu-
ate any synchronization mechanism, the form of inter-processor synchronization in
granule g, chosen for the class of algorithms under consideration in this thesis is the
critical section (CS). The critical section is guarded by a pair of LOCK / UNLOCK op-
erations (Figure 5.9), implemented as spin locks, to ensure mutual exclusion. Besides
the performance of a spin lock implementation itself (i.e., latency and throughput),
an important criterion for any lock-based synchronization mechanism in the presence
of many competing processors is the impact it has on other components of grain
execution and vice versa. This mutual interference can be acutely detrimental to
application performance when execution of the code within a critical section is pro-
longed as a result of interference from other concurrent operations, which in turn
causes the serial bottleneck to become more pronounced leading to a greater number
of spinning processors waiting for the lock to be released. The family of SAD kernels
are designed to measure this mutual interference as well as the performance of the

spin lock implementation itself.

5.2.1 Unit Grain Characterization

As was the case for the MAD kernels, due to the scarcity of data on real workloads,
a ﬂexible parametric model of unit grain characterization is again chosen. The at-
tributes selected for the unit grain should help not only in evaluating the selected
spin lock implementation, but also in measuring the waiting time on account of lock
contention and the interference between code executed within and outside of the crit-

ical section. The unit grain characterization selected for this purpose is summarized
in Table 5.3.
Characterization of gm:

The same four-attribute characterization of the shared—data access granule gm as

used for the MAD kernels is chosen for the study of the synchronization behavior

131

Table 5.3. Unit grain attributes for studying synchronization behavior

 

[ Granule ] Attribute Meaning ]

 

 

 

 

 

common N number of competitor processors
M number of shared data elements
p probability of write access to shared memory
gm d initial distance of concurrent address streams
s stride of memory access
m number of shared memory accesses per granule
gC c number of basic computation units (BCUs)
c, number of computation steps in CS
g, m, number of memory accesses in CS
p, probability of a write access in CS

 

 

 

 

with the SAD kernels. Therefore, gm = (p, d,s,m) where the attributes have the

same semantics as discussed for the MAD kernels.

Characterization of gc:

The single-attribute characterization of the computation granule gc as used for the
MAD kernels is also chosen here. Therefore, 9C 2 c where the attribute c has the

same meaning as in the case of the MAD kernels.

Characterization of g3:

Two factors related to the synchronization operation that have a signiﬁcant inﬂuence
on the speed of execution of a unit grain are the frequency and length of the criti-
cal section. Since the durations of the granules gm and g6 indirectly determine the
frequency of occurrence of the synchronization granule, we characterize gm with a 3-
tuple of additional attributes necessary to control the duration of the critical section

and the shared-data access pattern within it.

93 = (cs, maps)

132

 

 

Shared Duh

READ. "-le

 

 

 

 

Shared Dan
WRITE:

P1":

 

 

 

 

Figure 5.9. Critical section structure

The value of the attribute c, indicates the number of computational steps per—
formed within the critical section, using processor private data, expressed in exactly
the same delay unit as in gc. This time interval is marked by the fact that there is
no access to the shared-memory, and thus no contribution to the global interconnec-
tion network trafﬁc, by the processor executing the granule. The attributes m, and
p, together deﬁne the nature of memory accesses performed from within the critical
section. The total number of shared memory references within the critical section is
given by m,, while p, indicates the fraction of these references to shared data that
are write operations. All the shared data accesses within this granule are assumed to
go out over the global interconnect thus contributing to network trafﬁc.

Using the individual granule characterizations, the deﬁnition for the unit grain G

can be written as the 3—tuple of tuples.

G = ((1), (113,771), (C): (Ca1ms1Psll

5.2.2 Output Metrics

The metric used to observe the trends in lock contention and the serialization loss due

to synchronization for an input workload, as a function of the degree of interference

133

N, is the unit grain eﬁ‘iciency {,(N) as deﬁned by Eq. 3.22. A value of {,(N) = 1
would seem to indicate that the concurrent processes do not mutually interfere at
all. Furthermore, the relative disposition of the computation performed within and
outside the CS is such that mutually-exclusive accesses to the CS do not result in
any waits. A value of {,(N) < 1 reﬂects considerable lock contention and execution
serialization to access the CS.

The cumulative lock interference \P,(N) can be computed from {,(N) using
Eq. 3.23. Also, from Eq. 3.25, it is known that the incremental lock interference
1,1),(N) is equal to the difference of \Ilm(N) and \P,(N) for a given workload. Therefore,

we have the following relationship between the efﬁciency and interference measures.

1—€,(N)

MN) = €.(N)

In the case of exclusive-access workloads, since concurrent shared memory accesses are
non-existent (gm = 63), there is no incremental memory interference, i.e., wm( N) = 0,
thus making 1b,,(N) = \P,(N). For dual-mode access workloads, both the incremental
overhead components would be present.

It should be emphasized that the efﬁciency metric is a measure of the relative
performance of a combination of workload and spin lock implementation with N
competitors as compared to its performance with no competitors. Therefore, al-
though suitable for characterizing the behavior of a given spin lock implementation
with respect to the different workload parameters, it does not facilitate an effective
comparison between two different implementations. The absolute unit grain execution

times, Tg(N), should be used instead for this purpose.

5.2.3 Lock Implementations Studied

We have chosen three spin lock implementations on each of the target systems studied.
The ﬁrst one is the native LOCK/UNLOCK operations provided on each system
(referred to as the NAT lock) to support parallel programming. This support is in

the form of function calls in a parallel programming library as shown in Table 5.4.

134

Table 5.4. Native lock support on each machine

 

Procedure Sequent Symmetry BBN TC2000

InitLock s-lock.init (lock) lock 1.; CLEAR
GetLock s.lock (lock) UsLock (lock,de1ay)
ReleaseLock s.unlock (lock) UsUnlock (lock)

 

 

 

The other two implementations selected represent somewhat two extremes of busy-
waiting efﬁciency. The test-and-test-and-set lock (referred to as the TAS lock) spins by
reading the shared lock variable until it becomes free, and then attempts a test-and-set
operation to acquire the lock. The simple pseudo—code for it is listed in Table 5.5. On
machines with coherent caches, the spin on read eliminates interconnection network
trafﬁc. But upon release of the lock, several spinning processors rush to grab the
lock simultaneously thus inundating the interconnect with test-and-set requests. This
problem is especially acute on systems with invalidation—based cache coherence where
the ﬂood of invalidations as a result of the test-and—set operations cause the shared-
lock location to bounce from one processor cache to another before quiescence sets
in. This effect has also been called the ping-pong eﬂect. On architectures without
coherent caches, even the spin on read generates heavy network trafﬁc in addition
to creating a memory hot-spot. The TAS and NAT implementations are almost
identical on the Symmetry. But on the TC2000, NAT incorporates a ﬁxed delay

between consecutive polls of the shared lock variable by a processor unlike TAS.

Table 5.5. Pseudo-code for the TAS lock

 

Procedure Implementation

InitLock lock :- CLEAR
GetLock while (lock 1. BUSY or test-and_set (lock) I BUSY)
Releaselock lock :8 CLEAR

 

 

 

135

The last spin lock implementation chosen is a list-based queueing lock devised
by Mellor-Crummey and Scott [89] (referred to as the MCS lock) with the following

characteristics:

— guarantees FIFO ordering of lock acquisitions;
— spins on locally-accessible ﬂag variable only; and

— works equally well (requiring only 0(1) network transactions per lock acquisi-

tion) on machines with and without coherent caches.

Figure 5.10 shows the algorithm for this lock. Each processor using the lock
allocates a Qnode record containing a queue link and a boolean ﬂag. Each processor
employs one additional temporary variable during the GetLock operation. Processors
holding or waiting for the lock are chained together by the links. Each processor spins
on its own locally-accessible ﬂag. The lock itself contains a pointer to the Qnode record
for the processor at the tail of the queue (or the value nil if the lock is not held). Each
processor in the queue holds the address of the record for the processor behind it ——
the processor it should resume after releasing the lock. Compare-And-Swap enables
a processor to determine if it is the only processor in the queue, and if so remove
itself correctly, as a single atomic action. The spin in GetLock waits for the lock to
become free. The spin in ReleaseLock compensates for the timing window between
the Fetch-And-Store and the assignment to predecessorT .next in GetLock. Both
spins are local to the processor.

Figure 5.11, parts (a) through (e), illustrates a series of GetLock and ReleaseLock
operations. The lock itself is represented by a box containing an ‘L’ in it. The other
rectangles are Qnode records. A box with a slash through it represents a nil pointer,
and non—nil pointers are shown as directed arcs. The state of each processor in the
queue (R: running, B: blocked, E: exiting from critical section) is indicated along
with its identiﬁcation within each Qnode record. In (a), the lock is free. In (b),
processor 1 has acquired the lock and is running. In (c), two more processors have
entered the queue and are blocked spinning on their locked ﬂags. In (d), processor

1 has completed and has changed the locked ﬂag of processor 2 so that it is now

136

 

 

type Qnode = record
next : TQnode;
locked : Boolean;
end;
type Lock = TQnode;

{ Parameter ”Q” below points to a Qnode record allocated in shared memory

locally accessible to the invoking processor}

procedure GetLock (L : TLock; Q : TQnode)
QT.next := nil;
predecessor : TQnode :2 Fetch-And-Store (L, Q);
if predecessor # nil then {queue was non-empty}
QT.locked := TRUE;
predecessorT.next := Q;

while QT.locked = TRUE do; {spin}

procedure ReleaseLock (L : TLock; Q : TQnode)
if QT.next = nil then {no known successor}
if Compare-And-Svap (L, Q, nil) then
return; {returns if and only if it swapped}
while QT.next = nil do; {spin}
QT.nextT.locked := FALSE;

 

Figure 5.10. Pseudo-code for the MCS list-based queuing lock

 

137

running. In (e), processor 2 has completed and has unblocked processor 3. If no
more processors enter the queue in the immediate future, the lock will return to the

situation in (a) when processor 3 completes its critical section.

pointer l

(a)

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(d)

 

 

 

 

 

(C) (6)

Figure 5.11. Working of the MCS list-based queuing lock

 

 

 

We emphasize that we have deliberately chosen only a few spin lock implementa-
tions for the purpose of demonstrating the effectiveness of the evaluation methodol-
ogy. There are a number of other spin-lock implementations available in the literature
[6, 89]. Our selection should not be construed as a deﬁnitive indication of their relative

merits.

5.3 Exclusive-Access Workloads

Exclusive-access workloads, with synchronization in the form of lock-based mutually-
exclusive access to a critical section within the synchronization granule g,, were de-

signed and used [94] to characterize the impact of serialization of execution, and

138

lock latency and contention on a shared memory multiprocessor performance. The
increased unit grain execution times observed in this case are purely due to the soft-
ware overhead of executing the locking primitives, serialization of access to the CS,
and lock contention. The workloads have been employed to measure and compare the

performance of three lock implementations on the Sequent Symmetry and the BBN

TC2000 systems.

Table 5.6. Latency of locks used in the SAD experiments

 

 

 

 

 

Spin Lock Sequent Symmetry BBN TC2000
Lock Local I Lock Remote
NAT 7.4 ps 4.3 ps 12.1'ps
TAS 6.1 ps 1.8 ps 8.0 ps
MCS 10.1 ps 8.6 ps 15.8 ps

 

 

 

 

 

 

An important fundamental criterion for any lock implementation is its latency——
the time it takes to acquire and release it in the absence of competition. Table 5.6
shows this measure for the locks used in our study. On the TC2000, since a dichotomy
in the memory hierarchy exists, the latency of the lock depends on its location with
respect to the processor invoking it. Thus, the latency when the lock is situated in a
processor’s local memory and a remote memory are shown under the columns “Local”
and “Remote”, respectively. The results presented in this section pertain to the case
of the shared-lock being remote to all processors. The half-performance lock factor
61/2 for the various lock implementations is given in Table 5.7. Once again, the large
disparity between processor speed and lock access latency on the TC2000 is reﬂected
by its high values of CI p.

A critical section synchronization enforces a serialization of execution on the par-
ticipating processors, thus causing a loss of parallelism. Since only one processor can
execute in the CS at any time, all other processors waiting for mutually exclusive

access to the CS spend time idling, wasting potentially productive computational cy-

139

Table 5.7. Half-performance lock factor 61/2 for different lock implementations

 

 

 

 

Spin Sequent Symmetry BBN TC2000

Lock (R00 = 0.6 x 106/second) (Rm = 4.9 x 106/second)
Type c1); Lock Local 61/2 [ Lock Remote c1/2
NAT 4.45 21.07 59.31

TAS 3.67 8.82 39.22

MCS 6.08 42.16 77.45

 

 

 

 

 

 

 

cles. Further, the implementation technique used for the spin lock guarding the CS
can also adversely impact performance beyond what is dictated by serialization due
to excessive lock contention and interconnection network traﬂic generated [6]. The
net execution eﬂiciency {,(N) observed for a combination of input workload and spin
lock implementation is a result of both of the above factors. Let us suppose that the

net observed efﬁciency can be decomposed into two factors as follows:

{1(N)=O(N)°ﬂ(N)

where a(N) represents the loss in parallel work due to serialization of CS access, and
B(N) represents the loss in performance due to lock implementation considerations.
The factor a(N), called structural efficiency, signiﬁes the inﬂuence of the unit grain
structure on the overall synchronization performance. The factor ,6(N), called spin
lock efficiency, on the other hand, signiﬁes the impact of the spin lock implementation
methodology on the overall synchronization performance.

The efﬁciency component B(N) is difﬁcult to quantify analytically since it is a
complex function of the runtime interactions occuring between concurrent processes.
However, we can derive an approximate relation for a(N) for the case of determinis-
tic homogenous workloads. For now, let us assume that spin lock implementation is
100% efﬁcient, i.e., [3(N) = 1. This implies that {,(N) = a(N). For a homogenous
workload, since all the N + 1 processors are executing identical unit grains, they will

soon become “skewed” so that they attempt their CS access at different times. Thus,

140

there will be no CS contention if N processors have time to complete their g, granule
while the (N + 1)“ is processing granules gm and ye, that is, if rm + rc Z N r,. Oth-
erwise, contention occurs and the waiting time for each unit grain is N r, — (rm + Tc).

Hence, the unit grain execution time is given by

TG(N) = Tm + Tc '1' To '1' tqueuc

where

Nr.— “rm-l-rc ifrm+rc<Nr,
tqueuc = ( ) (5.5)
1 otherwise

It should be noted that if concurrent memory accesses are present in the granule gm,
then the overhead due to memory access contention is not included in the total unit
grain time. In other words, memory accesses in gm are assumed to be conﬂict-free
for this derivation. The MAD kernels measure the extent of performance degradation

due to memory access contention. If we deﬁne the serialization ratio /\ as

 

 

)1 = T’" + 7° (5.6)
Ts
then the structural efﬁciency can be expressed using Eqs. 5.5 and 5.6 as
l-ﬂ if )1 < N
a(N)= Tm+Tc+Ts = NH (57)
TG(N) 1 if )1 2 N

In reality, the spin lock implementation will not be 100% efﬁcient. Therefore, if we

remove this restriction, then the net unit grain execution efﬁciency can be expressed

as

(ﬁﬁpmquA<N

MN) HAZN

53(N) = (5'8)

Expressing {,(N) in this form highlights the synchronization losses as being a con-
sequence of two distinct effects: the serialization loss given by a(N) and determined

by the characteristics of the algorithm and hardware speed; and the lock contention

141

loss B(N) determined by the spin lock implementation characteristics.

Experiments were conducted using a number of homogenous parameter families,
each family designed to measure the effect of a particular grain characteristic on
the resultant contention and the consequent unit grain efﬁciency. In particular, the
impact of the frequency of synchronization and the serialization ratio were evaluated.
All processors contend for a common lock and the CS guarded by it. In the case of
the TC2000, the lock is remote to all processors.

Frequency of synchronization

The probability that a processor arriving at the CS ﬁnds it busy, thus incurring
a queueing delay, is proportional to the frequency, 1/(1'c + 7,), with which the CS
is accessed [96]. The computation granularity re between synchronization points
required to restore the loss in efﬁciency due to synchronization is measured for varying

degrees of parallelism N. The result is plotted in Figure 5.12.

Sequent Symmetry BBN TC2000
I I I I I MCS . j I I I ﬂ I I

c=160 NAT *

 

 

      

 

 

 

 

 

 

 

l TAS o 1
0.8 - 0.8
£.(N) 0-6 {.(N) 0-6
0.4 - 0.4
0.2 - 0.2
0 1 ‘ ‘ l ' 0 " ‘
0 4 8 12 16 20 0 4 8 12 16 20 24 28 32

No. of competitors (N) No. of competitors (N)

FLM = 128K,G. = 0.19.. = 11,9. = (6,9. = (0.60))

Figure 5.12. Effect of frequency of CS on performance

142

Examining the workload used in Figure 5.12, we see that the serialization ratio is
given by A = ctc / t, = c/c1/2. The computation granularity c required to obtain 100%
structural efﬁciency, using Eq. 5.7, is given by

c 2 NCl/g is required for a(N) = 1.

The value of 61/2 for the three locks used is given in Table 5.7. A further increase in
granularity is required to compensate for the spin lock efﬁciency B(N). As is evident
on the Symmetry, the grain efﬁciency improves by increasing the granularity from
c = 40 to c = 160. The performance of NAT and TAS are almost identical due
to their identical implementation. For values of N S 4, a granularity of c = 40 is
sufﬁcient to maintain close to perfect efﬁciency. However, a much higher granularity
of c = 160 is needed for greater values of N. The additional cache-invalidation trafﬁc
on the bus in the case of NAT and TAS account for their lower efﬁciency compared
to MCS.

The c1); values for all the three locks on the TC2000 are very high necessitating
a very large computation granularity to make up for the structural efﬁciency factor
a(N). For the range of c considered in Figure 5.12, the a(N) factor dominates the
synchronization performance on the TC2000 thus yielding efﬁciency curves propor—
tional to 1 / (N + 1) for all three lock types. A granularity of c = 100 is not sufﬁcient
compensation for any of the locks even with N = 2. However, c = 300 restores the

efﬁciency to l for MCS and NAT with N g 4, and for TAS with N S 6.

Serialization ratio

Another important factor governing the performance of an application on a multipro-
cessor is its serialization ratio A as given by Eq. 5.6. The length of the CS, 7,, denotes
the amount of time for which the shared lock is held thus affecting the number of
spinning processors waiting to access the CS, and A determines the amount of wait
before the CS can be accessed. The serialization ratio A can be varied by changing

the relative amount of computation and shared data accesses performed within and

143

outside the CS. In Figure 5.13, the variation in grain efﬁciency is shown as a function

of the length of communication in CS (c,) for a ﬁxed length of computation (c).

 

 

 

 

 

 

 

 

 

Sequent Symmetry (C = 120) BBN TC2000 (C = 300)
7 MCS 0 I I l I I
NAT *
1 ! TAS o 1 [- -
‘ = 10
0.8 A 0.8 - 4
l
{,(N) 0.6 ~ {.(N) 0.6,
0.4 d 0.4
0.2 -[ 0.2
0 0
0 2 4 6 8 10 12 14 0 5 10 15 20 25 30
Length of computation in CS (0,) Length of computation in CS (c,)

1?, M = 1281120. = G.(g.. = 8.11.,9. = (6.61»)

Figure 5.13. Effect of non-CS to CS computation ratio on performance

For the workload used in Figure 5.13, the serialization ratio is given by
A = etc/(t, + c,t,) = c/(c, + Cl/g). The computation granularity c required to obtain
100% structural efﬁciency, using Eq. 5.7, is given by

c 2 N(c, + Cl/g) is required for a(N) = 1.

A further increase in granularity is required to compensate for the spin lock efﬁciency
MN)

On the Symmetry, the grain efﬁciency decreases with an increase in the CS length.
For N = 10, the MCS lock is a little less efﬁcient for low CS lengths. This can be
attributed to its higher latency compared to the other two spin-locks. However, for
higher values of N, lock contention becomes the dominant factor and MCS outper-

forms the others. On the TC2000, MCS consistently performs better than the other

144

two locks for a given N due to its constant number of network accesses per lock
acquisition. For the value of c and the range of c, considered for the TC2000 in
Figure 5.13, the a(N) factor dominates the synchronization performance. The NAT
and TAS locks exhibit almost identical performance.

5.4 Dual-Mode Access Workloads

Dual-mode access workloads, with both gm and g, granules present, are used for
characterizing the relative contributions of memory contention and synchronization
to the total performance degradation. Measurements were performed on the Sequent
Symmetry and the BBN TC2000 systems. The shared data, with M elements, were
allocated using the shmallocO call on each machine. On the Symmetry, the data
elements were interleaved across the memory modules with a interleaving granularity

of 32-bytes; on the TC2000, they were scattered across the allocated cluster.

5.4.1 Homogenous Workloads

In these experiments, identical attributes were used for the test and competitor grains.
All processes contend for a common lock and the CS guarded by it. On the BBN
TC2000 system, the lock is located remote to all processors.

Serialization ratio

In Figure 5.13 the serialization ratio A was varied by changing the relative number
of computation units processed within and outside the CS. A similar effect is also
accomplished by varying the number of shared data accesses within the CS (m,) with
respect to the number of accesses outside (m). However, in the latter case, the shared
data accesses within the CS may also encounter additional memory access conﬂicts
resulting in longer CS duration. This case is shown in Figure 5.14.

On any machine, all the lock types should experience the same structural ef—
ﬁciency a(N) for a given number of competitors N, since it is proportional to

A = mtm/(t, + m,t,,,) = m/(m, + t,/tm), and depends only on the unit grain char-

145

 

 

 

 

 

 

 

 

 

 

Sequent Symmetry BBN TC2000
I I I I I I MCS . r I I I I 1
NAT *
1 ~ TAS o 1 - 4
0.8 - 0.8 -
{,(N) 0.6 - {.(N) 0.6 -
0.4 -* 0.4 -(
0.2 d 0.2 n
0 1 1 1 1 1 1 0 7
0 2 4 6 8 10 12 14 0 2 4 6 8 10 12 14
No. of memory accesses in CS (m,) No. of memory accesses in CS (m,)

17,114 = 128K, G. = G,(gm = (0,64K[1.0],4,48),gc = ¢,g, = (0,16,, 1))

Figure 5.14. Effect of non-CS to CS shared data access ratio on performance

acteristics. Hence, the observed differences in the unit grain efﬁciency {,(N) of the
different spin locks can be attributed to the lock efﬁciency factor [3(N) and interfer-
ence between the shared data accesses performed within and outside the CS.

On both systems, the grain efﬁciency decreases with an increase in the number of
shared data accesses within the CS. At m, = 0, the performance difference is entirely
due to the efﬁciency of the spin lock implementations. As m, increases, the memory
access contention encountered by each access within the CS effectively increases the
CS length. On the Symmetry, the MCS lock performs better than the other two for
high degree of contention (N = 20) because of its higher lock efﬁciency factor ﬂ(N).
On the TC2000, the effect of the higher eﬂiciency ﬂ(N) for the MCS lock is more
pronounced at m, = 0 for high degree of contention (N = 30). The increase in the
CS length with an increase in m, causes the structural efﬁciency a(N) to reduce thus

causing the MCS lock performance to approach that of the other implementations.

146

Incremental Overheads

The gross unit grain execution time in the presence of competitor grains includes in
it an overhead component due purely to memory contention during shared data ac-
cesses, and another due to critical section synchronization. As discussed in Chapter 3,
the incremental magnitude of the shared data access and synchronization overhead
components for a given workload can be measured and characterized in a hierarchi-
cal fashion with the help of the MAD and SAD kernels, respectively. The isolation
of the memory contention overhead was described in Section 4.4. Figures 5.15 and
5.16 show the cumulative interferences \Ilm(N) and \II,(N) measured for two different
workloads differing only in the stride of shared data access (i.e., the s attribute).
Each interference is shown for two different spin lock implementations, namely, the
TAS and MCS locks described earlier. The static characterization parameters for the

two workloads are tabulated in Table 5.8.

Table 5.8. Static characterization parameters for workloads used in incremental over—
head measurements

 

 

 

Multiprocessor Roo f1/2 f1); C1/2 c1/2
System (s = l) (s = 23) (TAS lock) (MCS lock)

Symmetry 0.6 X 106/ second 0.288 1.030 3.67 6.08

TC2000 4.9 x 106/second 10.75 10.83 39.22 77.45

 

 

 

 

 

 

 

 

 

 

 

For the workload shown in Figure 5.15, a stride of 1 is used for data accesses. On
the Symmetry, this results in 3 out of 4 accesses being satisﬁed by the cache thus
placing a very low demand on the system bus. This is reﬂected by an extremely low
value of ‘1’," even for large N. On the TC2000, the butterﬂy switch is able to sustain
the bandwidth demand for N S 18. For higher N, switch contention results in a
larger memory interference \Ilm. On both machines, the MCS lock exhibits a much
slower growth rate in the synchronization overhead owing to its lower interconnection

network demand.

147

Sequent Symmetry BBN TC2000

T I I I I MCS. IIMIFII

TAS o 9.0
8.0
7.0

6.0
Cum 5.0

Interf.
4.0

3.0
2.0
1.0

. 0.0
0 4 8 12 16 20 0 4 8 12 16 20 24 28 32
No. of competitors (N) No. of competitors (N)

 

 

9.0 1-

L L

Cum.
Interf.

 

 

 

 

 

N, M = 128K, G. = (g... = (0,64K[1.0],1,32),gc = (16),g, = (1,2,o.5))

Figure 5.15. Incremental interference measured with stride of access 3 = 1

The workload in Figure 5.16 uses a stride of 23 to access shared data. This causes
a cache miss for every access on the Symmetry leading to heavier bus contention
and consequently a higher value of \Ilm. On the TC2000, stride 23 accesses do not
behave any differently than stride 1 accesses as all memory references go out over the
butterﬂy switch for either strides. The MCS lock again exhibits a slower growth rate
of overhead compared to the TAS lock.

The incremental interferences 1b,,(N) and z/J,(N) can be computed from the mea-

sured values of ‘1'," and \Il, using the relations 3.24 and 3.25 respectively.

5.4.2 Heterogenous Workloads

Using a heterogenous workload, we have explored the interactions that occur between
concurrent execution of code within and outside the critical section. In particular,

we investigated the following two situations:

(a) Impact of memory accesses done outside the CS in prolonging the length of the

CS by interfering with shared data accesses within the CS, and

148

 

 

   

 

 

 

 

 

Sequent Symmetry BBN TC2000
I I I I I MCS . I m I I I I I
9 0 - ~ TAS o 9.0 -
8 0 P - 8 0 l"
7.0 - -' 7.0 r
6.0 '- 6.0 '-
ggg. 5.0 - 12:3. 5.0 r
4.0 - 4.0 '-
3.0 '- 3.0 r
2.0 - 2.0 r
1.0 r 1.0 -
0.0 l l 1 1 0.0 ,
0 4 8 12 16 20 0 4 8121620242832
No. of competitors (N) No. of competitors (N)

N,M = 128K, 0. = (gm = (0,64K[1.0],23,32),g, = (16),g, = (1,2,0.5))

Figure 5.16. Incremental interference measured with stride of access 3 = 23

(b) Impact of the spin lock accesses to enter the CS on memory references external

to the CS.

Figure 5.17 depicts the results of the test described in case (a) above. In this exper-
iment, the test processor P0 is the only one executing the CS (granule 9,) whereas all
competitor processors perform only memory accesses (granule gm). Since the shared
lock location itself encounters no contention, the observed degradation in performance
can be ascribed purely to the memory access conﬂicts that occur between shared-data
access within and outside the CS. Hence, all the spin-lock implementations exhibit
comparable performance. The efﬁciency remains close to 1 on the TC2000 due to the
much higher bandwidth of its switching network.

An important measure of synchronization performance is the additional amount
of interconnection network trafﬁc caused by multiple processors attempting to syn-
chronize, and the impact of this trafﬁc on the execution of the other components
of a unit grain. This measure was obtained by recording the performance of a test
grain composed of only shared memory accesses (granule gm) when competing with

grains comprised of only critical section accesses (granule g,). The results are shown

149

 

 

   

 

 

 

 

 

 

Sequent Symmetry BBN TC2000
I I a I I MCS "— I I I I I I I
NAT *-
1 .. TAS '9—
0.8 .1
£.(N) 0.6 - ~ {.(N) 0.6 - -
0.4 *- -( 0.4 P -
0.2 '- '- 0.2 - -
0 1 1 1 1 1 0 1 1 1 1 1 1 1
0 4 8 12 16 20 0 4 8 12 16 20 24 28 32

No. of competitors (N)

No. of competitors (N)

NaM =128K10t = (gm = ¢agc = ¢ags = (018105)),
Gc = (g... = (0, 64K[1.0],4,48),gc = 8,9. = ¢)

Figure 5.17. Impact of non-CS memory accesses on OS execution performance

in Figure 5.18.
As expected, the MCS spin-lock outperforms the other two by a signiﬁcant mar-
gin on the Symmetry for values of N Z 2 due to its constant number of network
accesses per lock acquisition, thus contributing minimal additional bus trafﬁc. Each
competitor grain competes with the test grain for the use of the bus three times (refer
to Figure 5.10) during each CS access: to fetch the lock queue header, to attach itself
at the end of the queue, and to release the lock to the next processor in line. The
critical point at which the bus usage by Gc interferes with G1 occurs at N = 12 as
evident from the sudden drop in test grain efficiency at that point. On the TC2000,
the NAT and TAS spin lock trafﬁc interferes with other shared-data access causing
a decrease in the grain efﬁciency. The extent of degradation is not as marked as for

the Symmetry due to the higher bandwidth of its interconnect.

150

 

 

   

 

 

 

 

 

 

Sequent Symmetry BBN TC2000
1-2 I I I I I MCS"-1-2’II I—I I I I
NAT +-
1 TAS -O—
0.8
£3(N) 0'6 £,(N) 0'6 '- T
0.4 0.4 - ~
0.2 0.2 - -‘
0 1 1 1 L L 0 1 1 1 1 1 1 1
0 4 8 12 16 20 0 4 8 12 16 20 24 28 32
No. of competitors (N) No. of competitors (N)

N, M = 128K, G1 = (gm = (0,64K[1.0],4,20),gc = 45.9. = <25),
Ge = (gm = ¢agc = ¢aga = (0’ 0,0»

Figure 5.18. Impact of CS spin-lock on non-CS memory accesses

5.5 Summary

Synchronization among concurrent processes to access critical sections of code in a
mutually-exclusive manner leads to loss in parallel performance as a result of serial-
ization of execution. The synchronization overhead incurred is not only a function
of the ratio of the amount computation performed outside and inside the CS (serial-
ization ratio), but also of the implementation characteristics of the spin lock used to
guard the critical section. The SAD kernels described in this chapter provide an effec-
tive means of characterizing the synchronization performance under varying workload
conditions. These kernels were employed to measure and compare the performance
of three spin lock implementations (TAS, NAT, MCS) on the Sequent Symmetry and
the BBN TC2000 multiprocessors.

The MCS lock has the highest uncontested latency. However, it induces the least
amount of interconnect contention. For cases in which competition is expected, the
MCS lock is the best implementation of choice. The TAS lock has the least latency,

but its performance deteriorates rapidly with contention. The performance of the

151

TAS lock can be improved by exponential backoff between successive poll of the lock
variable [2], although this case has not been evaluated here. The NAT lock on the
TC2000 does delay between polls for a ﬁxed amount of time (i.e., it is not adaptive).
This could result in the release of a busy lock going unnoticed for some time because
of the waiting processors being in the middle of the polling delay.

The SAD kernels can be used either independently to evaluate the efﬁciency and
scalability of the implementation of synchronization primitives; or they can be used
in conjunction with the MAD and BAD kernels to isolate the incremental overheads
resulting from inter-process synchronization and lock access contention from the total

performance loss experienced by an input workload.

CHAPTER 6

BAD KERNELS AND BARRIER
PERFORMANCE

A barrier deﬁnes a logical point in the control ﬂow of an algorithm at which all
processes must arrive before any is allowed to proceed further. They are commonly
employed when an algorithm consists of several distinct stages, each of which has in-
ternal parallelism but which must be performed in strict sequence without overlap. A
barrier is clearly one of the most deleterious forms of synchronization, since it requires
in effect that every process communicate with every other process. Additionally, since
all processes must wait at the barrier until the last arrives, the eﬂects of ﬂuctuations
in process execution time or imperfect load balancing are maximized.

The key factors in the performance of a barrier implementation were discussed
in Chapter 2. Quantiﬁcation of the overheads arising from barrier synchronization
for a variety of workloads helps assess not only the scalability of a particular barrier
implementation to large multiprocessors, but also the loss in execution parallelism.
The BAD kernels presented in this chapter can be used in isolation to evaluate and
compare the performance of different barrier implementations; or they can be used in
conjunction with the MAD and SAD kernels, as per the hierarchical model presented
in Chapter 3, to characterize the incremental loss in performance for a given workload

resulting from barrier synchronization.

152

153

6.1 BAD Workload Parameters

The synchronization barrier separates adjacent phases of the multi—phase computa-
tion structure selected as the basis of our performance studies. The presence of the
barrier increases the completion time of a phase by adding the time to execute the
barrier. Moreover, it also forces all processors to wait for the slowest among them
thus accentuating the worst-case performance. To incorporate the effects of barriers
into our characterization of the aggregate performance of a workload, measurements

are performed at the level of a phase of computation.

6.1.1 Phase Characterization

A single phase consists of a number of concurrent processes executing a string of
unit grains and terminating at a global synchronization barrier. Hence, the param-
eters necessary to characterize a phase of computation must include: ( 1) a set of
attributes to describe the behavior of the concurrent unit grains within the phase, (2)
the number of unit grains executed by each processor in the phase, and (3) the choice
of a particular type of barrier implementation. The parameters used in the barrier

performance studies are summarized in Table 6.1.

Table 6.1. Workload parameters for studying barrier performance

 

[Granule I Attribute Meaning ]

 

 

 

 

 

 

N number of competitor processors in a phase
common M number of shared data elements

2 number of unit grains per processor per phase

p probability of write access to shared memory
gm d initial distance of concurrent address streams

s stride of memory access

m number of shared memory accesses per granule
96 c number of basic computation units (BCUs)

c, number of computation steps in CS
9, m, number of memory accesses in CS

p, probability of a write access in CS

 

 

 

154

As is evident from Table 6.1, the attributes describing the unit grain behavior are
the same as used for the SAD kernels with the additional parameter 2 representing the
the length of the task executed by each processor. If the total work to be performed
within a phase, say W unit grains, is perfectly parallelizable among P processors
in a homogenous setting, then the amount of work performed by each individual
processor is given by l = [W/P]. The consolidated set of input parameters I to the

experimental framework (described in Section 3.3), therefore, now becomes
I = {NiMiliataac},

where the test and competitor unit grains are characterized by the 3—tuple of tuples

G = ((p, d, s, m), (c), (c,, m,,p,)).

6.1.2 Output Metrics

The metric used to quantify the overhead of barrier synchronization and the conse-
quent increase in the phase execution time for an input workload, as a function of
the degree of interference N, is the cumulative barrier interference \Ilb(N) as deﬁned
by Eq. 3.20. In other words, if T(N) denotes the total time to complete executing a

phase with N competitor processes contending for resources, then

WN) = T(N%(:))T(0) : T(Ne); [1"

 

When N = 0, there is only one processor operating thus making the barrier syn-
chronization redundant. Therefore, the measured execution time for a workload with
N = 0, i.e., T (0), does not include the barrier overhead thus yielding T(0) = [7' used
in the expression above. The incremental barrier interference 1/J1,(N) can be computed
from the measured values of \Ilb(N) and \Il,(N) for a given workload as dictated by
Eq. 3.26.

A value of \II1,(N ) = 0 would indicate the barrier as an idealized entity which

consumes no resources and induces no interference with the processes executing within

155

a phase. In reality, however, a barrier does consume resources, and this will have a
major effect on performance. Although \Ilb(N) > 0 for a non-ideal barrier, judicious
design choices can help minimize this interference. It should be noted that since \Ilb(N )
expresses the barrier overhead encountered in terms of an abstract normalization
unit, namely T(O), it is a suitable metric for comparing performance only when the
same reference workload is used as the basis. For performance comparisons across

workloads, the absolute time measure T(N) should be used instead.

 

shared count : integer := P; { number of processors synchronizing }
shared sense : Boolean := True;

processor private local-sense : Boolean :2 True;

procedure CentralBarrier()
locaLsense :2 not local_sense; { Each processor toggles its own sense }
if Fetch-And-Decrement (&count) = 1 then
count := P;
sense := local.sense; { Last processor toggles global sense }
else

repeat until sense = local.sense;

 

 

 

Figure 6.1. Pseudo-code for a sense reversing centralized barrier

6.1.3 Barrier Implementations Studied

We have chosen two barrier implementations on each of the target systems studied to
demonstrate the utility of the BAD kernels. The ﬁrst is a centralized implementation
of the barrier (referred to as the CNT barrier), where each processor updates a small

amount of shared state to indicate its arrival and then polls that state to determine

156

when all of the processors have arrived. Most barriers are designed to be used repeat-
edly (to separate the phases of an algorithm). In the most obvious formulation, each
instance of a centralized barrier begins and ends with identical values for the shared
state variables. Each processor must spin twice per instance; once to ensure that all
processors have left the previous barrier and again to ensure that all processors have
arrived at the current barrier.

The number of references to the shared state variables can be reduced and one of
the two spinning episodes can be eliminated by “reversing the sense” of the variables
(and leaving them with different values) between consecutive barriers [58]. The re-
sulting code is shown in Figure 6.1. Arriving processors decrement count and wait
until sense has a different value than it did in the previous barrier. The last arriving
processor resets count and reverses sense. Consecutive barriers cannot interfere with
each other because all operations on count occur before sense is toggled to release
the waiting processors.

The potential drawback of centralized barriers is the spinning that occurs on a
single, shared location. Because processors do not in practice arrive at a barrier
simultaneously, the number of busy-wait accesses will in general be far above the
minimum. On broadcast-based cache-coherent multiprocessors, these accesses may
not be a problem. The shared ﬂag (or sense variable) is replicated into the cache
of every waiting processor thus causing local spinning without any network trafﬁc.
This shared variable is written only when the barrier is achieved, causing a single
broadcast invalidation of all cached copies.

On machines without coherent caches, however, or on machines with directory—
based caches without broadcast, busy-wait references to a shared location may gener-
ate unacceptable levels of memory and interconnection contention. For such classes of
machines, Hengsen, Finkel, and Manber [58] have proposed a “dissemination barrier”
(referred to here as the DSM barrier) that yields a much more efﬁcient pattern of
synchronization. In round k (counting from 0) with P processors participating, pro-
cessor i signals processor (i + 2") mod P. Synchronization is not necessarily pairwise

and requires only [log2 P] synchronization operations on its critical path regardless

157

of P. The ﬂags on which each processor spins are statically determined, and no two
processors spin on the same ﬂag. Each ﬂag can therefore be located near the processor

that reads it leading to local-only spinning.

 

type Flags = record
myﬂags : array [0..1] of array [0..LogP] of Boolean;
partnerﬂags : array [0..1] of array [0..LogP] of TBoolean;

end;

processor private parity : integer := 0;

processor private sense : Boolean := True;

processor private localﬂags : Tﬂags;

shared allnodes : array [0..P-1] of ﬂags;

{ allnodesﬁ] is allocated in shared memory locally accessible to processor i. }

{ 0n processor i, localflags points to allnodes[i]. Initially allnodes[i].myﬂags[r][k]
is False for all i, r, k. Ifj = (i + 2") mod P, then for r = 0,1:
allnodes[i].partnerﬂags[r][k] points to allnodesfj].myﬂags[r][k]. }

procedure DisseminationBarrier()
for instance : integer := 0 to LogP-l do
localﬂagsT .partnerﬂags [parity] [instance] T : = sense;
repeat until localﬂagsT.myﬂags[parity] [instance] 2 sense;
if parity = 1 then
sense := not sense;

parity := 1 - parity;

 

 

 

Figure 6.2. Pseudo-code for a distributed dissemination barrier

Figure 6.2 presents the dissemination barrier. Alternating sets of variables are used

in consecutive barrier episodes for each signaling operation, thus avoiding interference

158

without needing two separate spins in each operation. Sense reversal is also used to
avoid resetting variables after each barrier. The parity variable controls the use of
alternating sets of ﬂags in successive barrier episodes. The shared allnodes array
would be scattered statically across the memory banks on a machine with distributed

shared memory and no coherent caches.

6.2 Embarrassing Workloads

The class of embarrassing workloads (refer to Section 3.2) are used to measure the
performance effects attributable purely to barrier synchronization. Since no shared
data accesses nor inter-processor synchronizations are present within the unit grain
(i.e., gm = 65, g, = (b), the concurrent processes within a phase execute independently
of each other. Any observed losses in performance can be ascribed entirely as the
result of global barrier synchronization.

Synchronization barriers impose two kinds of performance penalties on the runtime
behavior of an algorithm. The ﬁrst, which is in some sense irreducible, is due to
ﬂuctuations in the time taken by the processors to complete their share of the work
within a phase. The second kind of penalty results from the use of resources by
the barrier, and in particular the contention for shared resources. The consequences
of ﬂuctuations in the execution time or the unbalanced workload distribution are
maximized as a result of the wait for the last processor to complete its work.

If the barrier itself is considered as an idealized entity which consumes no re-
sources, the execution time of the phase can be determined analytically, as Kruskal
and Weiss [68] have shown. If there are P processors (note that in terms of our work-
load parameters, P = N +1) that begin their work simultaneously, and the time
each takes has the mean p and standard deviation 0, then the time at which the last

processor completes its work, T p, is given by

Tp =p+o(/210gP. (6.1)

159

The approximation is especially good for a Gaussian distribution function but is valid
more generally as shown in [68].

In reality, the barrier execution does consume resources and computational cy-
cles. The time to achieve the barrier, T5,", consists of two distinct components [16]:
the entry phase time, Tam,” during which processors report their arrival; and the
exit phase time, Ta“, during which processors exit after determining that all other

processors have arrived. There are two cases to consider:

1. A balanced load is one for which the variance in arrival times is less than the
overhead incurred at the entry phase. An extreme case is the perfectly balanced
load for which a = 0 in which case all processors arrive at the barrier simulta-
neously. The barrier overhead in this case is the time for all P simultaneously

arriving processors to traverse the entry and exit phases.

Tbarr(P) = Tentry(P) + Texit(P) (62)

2. An unbalanced load is one for which the variance in arrival times is greater than
the time required for the entry phase. An extreme case occurs when the last
processor to arrive at the barrier ﬁnds that all other processors have completed
the entry phase. In this case, the barrier overhead is given by the last processor

to complete its entry and all P processors to exit.

Tbarr(P) = Tcntry(1) + Tea-“(19) (63)

The total time to complete a phase of execution, T(P), can be expressed as the

sum of the effects of unbalanced load (Eq. 6.1) and barrier overhead.

T(P) = p + m/Zlog P + T1,,"(P)

If the total performance penalty resulting from barrier synchronization is denoted as

160

05(P), then
05(P) = o‘/2log P + Tb,,,.(P).

It is clear that the cumulative barrier interference \Ilb(N) is proportional to 01(P).
Using Eqs. 6.1, 6.2 and 6.3, and the fact that a = 0 for a balanced load, the overhead

function can be written, for a balanced load, as
01(P) = T,,.1,,,(P) + Tm-1(P), (6.4)

and for an unbalanced load as

01(P) = a(/2log P + T,,,,,.,(1) + T,,.-,(P). (6.5)

The time to complete the entry and exit phases for the two barrier implementa—
tions selected (CNT and DSM) can be expressed in terms of the timing of the basic
operations involved. For the CN T barrier, the entry phase entails that each of the P
processors atomically decrement the count variable, each decrement operation requir-
ing a time of tatomgc. The entry phase of the DSM barrier, on the other hand, requires
each arriving processor to signal its arrival only to its ﬁrst round synchronization
partner, the pairwise synchronization round needing a time of t,,-g,,,1. Therefore, the

time for the entry phase can be expressed as follows.

Ptatomgc for the CNT barrier
Tentry(P) = (6.6)

t,,-g,,,1 for the DSM barrier
Similarly, the exit phase of the ON T barrier consists of the last arriving processor
writing to the sense ﬂag to toggle its status (requiring time Lon-1,), and the changed
status of sense being read by the P — 1 waiting processors (each read requiring time

tn“). The exit phase of the DSM barrier goes through the remaining (logP — 1)

rounds, the ﬁrst round having been performed in the entry phase, of pairwise signaling

161

to complete the barrier. Thus, the exit phase time can be expressed as follows.

for the CNT barrier
for the DSM barrier

twrite + (P — 1)tread
(log P — 1)t,,'gna1

Tezit(P) = (6.7)

The equality in the Eqs. 6.6 and 6.7, in reality, should read “proportional to” for
accuracy. However, the constant of proportionality is not important for the discussion
at hand and, hence, has been treated as unity. It should also be noted that the values
of tum-1, and in“ used in the expressions for the CNT barrier are not constants for
machines without coherent caches, and are determined by the hot spot access latency
for that system with the variable sense being the hot spot site. Similarly, t,,-g,m( used
for the DSM barrier may involve 0(1) network transactions if parallel accesses over
the interconnection are possible (such as on a MIN), or 0(P) network transactions

on serial interconnections (such as on a bus).

 

 

   

  
    
       

 

 

 

Sequent Symmetry BBN TC2000
120 I r I I 7000 I T I I I
100 _ CNT *- 6000 1 CNT +—
80 5000
T(N) ,0 ._ T(N)4°°°
(#8) (#3) 3000
40
2000
20 1000
0 l l l J l 0

0 4 8 12

16

20

 

 

0 4 8121620242832

 

 

No. of competitors (N) No. of competitors (N)

NCM =o,e = 0,01: 6.19... = 8,9. = 8,11. = 8)

Figure 6.3. Time to achieve barrier vs. N

162

6.2.1 Scalability of Barrier Implementations

In large-scale multiprocessors, the number of interconnection network accesses per
processor to achieve the barrier increases sharply as collisions in the network cause
processors to repeat accesses. This observation is especially true for centralized bar-
rier algorithms, like CNT, implying that they will not scale well to large numbers of
processors. Algorithms that restrict spinning to locally-accessible memory, like the
DSM, are much more amenable to scaling for large numbers of processors. Our mea-
surements conﬁrm this conclusion. Figure 6.3 shows the time T(N) to achieve barrier,
with no computation at all in the unit grains, for the two barrier implementations
chosen.

On the TC2000, the time to achieve a CNT barrier increases more than linearly
in the number of participants. Since the Butterﬂy switch does not provide hardware
combining, at least 2P — 1 accesses to the barrier state are required (P to signal
arrivals, and P — 1 to discover that all have arrived). The DSM barrier, on the other
hand, proceeds through only [logz P] rounds of synchronization that leads to a stair-
step curve (shown in Figure 6.4 for clarity). The time to achieve a barrier with this
algorithm scales logarithmically with the number of processors participating.

The performance on the Symmetry differs sharply from that on the TC2000 for
two principal reasons. First, it is acceptable on the Symmetry for more than one
processor to spin on the same location; each obtains a copy in its cache. Second, no
signiﬁcant advantage arises from distributing writes across the memory modules; the
shared bus enforces an overall serialization. The DSM barrier requires 0(P log P)
bus transactions to achieve a P-processor barrier, whereas the CNT barrier requires
only 0(P) transactions. Consequently, the CNT barrier scales better than the DSM

barrier on the Symmetry.

6.2.2 Balanced Load and Simultaneous Arrivals

A balanced workload exhibits a variance in processor arrival times at the barrier

that is much less than the overhead incurred at the barrier. A perfectly balanced

163

DSM Barrier

 

T(N)
(#8)

 

 

 

0 I I I I I I I

0 4 8 12 16 20 24 28 32
No. of competitors (N)

NaM = Die =01Gt = 66(g‘m = ¢igc = ¢iga = (J5)

Figure 6.4. Time to achieve DSM barrier on the TC2000

load with a constant execution time on each processor (i.e., o = 0) induces the
maximum overhead at the entry phase of a linear barrier, since simultaneously arriving
processors contend for access to the shared barrier state and must be serialized. A
slightly increased ﬂuctuation level can, indeed, beneﬁt performance [9]. This occurs
because the presence of ﬂuctuations can reduce the queue length at the barrier entry
critical section by spreading out arrival times and causing some processors to start
synchronizing early.

Figure 6.5 shows the barrier performance of a perfectly balanced workload with
a constant number of computation steps (c = 1000) executed per processor between
barriers. The overhead curves observed for this workload are a result of the dominant
effect of the barrier overhead on performance as given by Eq. 6.4. The overhead

05(P) incurred is obtained by combining Eqs. 6.4, 6.6 and 6.7.

Ptatomic + twrite + (P - 1)trcad for CNT

01(P) =
(log P)t,,-g,,al for DSM

The higher overhead for DSM on the Symmetry is a direct consequence of the

 

  
   

164

 

   
 
   

   

 

 

 

 

 

       

 

 

Sequent Symmetry BBN TC2000
0.09 I r r I r 35 I I I I I I I
0.08 CNT '0—
30 " CNT ‘0— ""
0.07 1- DSM -e— DSM .9.
0.06 25 7
.05 20 .1
111,,(Ng 1111(N)

.04 15 a
0.03 ~ 10 .
0.02 -‘

0.01 . 5 “
0 1 1 1 1 1 0
0 4 8 12 16 20 0 4 8 12 16 20 24 28 32
No. of competitors (N) No. of competitors (N)
N,M = 0,e =1,G1= G,(gm = 8,9. = (1000),g, = 45)

Figure 6.5. Barrier performance of a perfectly balanced load

P bus transactions required in each t,,g,,,1, due to the serial nature of the bus, thus
needing a total of Plogz P bus accesses. In comparison, CNT requires only 2P bus
accesses. Thus, the performance of DSM and CNT remain comparable for up to
log2 P S 2 (i.e., P S 4) beyond which the higher logP factor for DSM causes the
performance curves to diverge. In contrast, DSM on the TC2000 requires a total
of only log P network transactions as the Butterﬂy switch permits parallel accesses,
whereas CNT must perform P atomic operations serially followed by P accesses to a
hot spot location. The hot spot accesses result in extremely high latencies for twm,

and tr,“ for N Z 18. This is evident from the signiﬁcant difference in overheads for

DSM and CNT for N Z 18.

6.2.3 Unbalanced Load and Staggered Arrivals

In an unbalanced workload, processors arrive at the barrier in a staggered fashion. The
variance in processor arrival times is greater than the time required to synchronize
at the barrier. This results in the variance in arrival times to dominate observed

performance. Figure 6.6 shows the barrier effects on an unbalanced workload in which

each processor performs c computation steps randomly selected from an Uniform

distribution over the interval (0, 2000].

0.9 I

Sequent Symmetry

 

0.8
0.7

I I

 

 

BBN TC2000
35 I I I I I I I
30 '- CNT *—
DSM '9—

   

 

 

 

 

 

 

 

25
0.6 ~
0.5 - - 20
‘11:.(N) ‘I’b(N)
0.4 - ‘ 15
0.3 .. 10
0.2 CNT -0- -
0.1 DSM -G- J 5 -
0 L I I I I 0
0 4 8 12 16 20 0 4 8 12 16 20 24 28 32

No. of competitors (N) No. of competitors (N)

N,M = 0,! =1,G1= Gc(gm = ¢,g, = (1000[1]),g, = 45)

Figure 6.6. Barrier performance of an unbalanced load

The total performance penalty 01(P) for an unbalanced load is given using Eqs.
6.5, 6.6 and 6.7.

0 (P) 0V 21081 + tatomic + twrite + (P _ 1)tread for CNT
b =
O'\/210g ] + (10g P)t,{gnal for DSM

The standard deviation 0 of a random variable uniformly distributed over the interval
[a,b] is given by (b — a)2/ 12. For the workload used in Figure 6.6, the standard

deviation of the computation times is thus given by

a _ 2000 t _ 2000 R_,
—\/T§ °_\/1—2 °°

where t, is the time per computation step. Since t, = 1/R,-,,fty is much larger on the

166

Symmetry (refer to Table 4.4), the effect of the variance in arrival times predom-
inates thus rendering the difference in the barrier overheads as insigniﬁcant. The
DSM and CNT, therefore, exhibit almost identical behavior of 01, (and hence ‘11,)
on the Symmetry. The dominance of the arrival ﬂuctuations is evident by observing
that \I'b(20) = 0.049 for the CNT barrier for a balanced workload with c = 1000 (Fig-
ure 6.5); but it makes a quantum jump to \Ilb(20) = 0.86 for an unbalanced workload
with the same mean (Figure 6.6). A similar increase can also be noted for the DSM
barrier. On the TC2000, however, the value of a is not large enough to overshadow
the difference in the DSM and CNT barrier overheads. The effect of 0, therefore, is
observed as a slight increase in the value of ‘Ilb(N).

To isolate and highlight solely the eﬂect of the barrier overhead (the terms
T,,,¢W(1) + Ten-,(P) in Eq. 6.5) in an unbalanced load, we measured the barrier perfor-
mance of a heterogenous workload in which the test processor executes a number of
computation steps far in excess of those executed by the competitor processors, i.e.,
G¢(c) >> G,(c). Also, G,(c) is made to vary randomly over an uniformly distributed
interval to emulate the staggered arrivals. By the time the test processor Po arrives
at the barrier, all other processors have already completed their entry phase and are
waiting. The performance for this case is shown in Figure 6.7.

The broadcast-based cache-invalidate operation in Tent”, of the CNT barrier on
the Symmetry results in a constant overhead (tatomgc + tum-1,) incurred by the test
processor Po. For CNT on the TC2000, Po has to compete with the N processors
already spinning on the sense variable to toggle it thus incurring a high tum-1, latency.
The DSM barrier on both machines, however, requires Po still has go through the
log P rounds of synchronizations thus exhibiting essentially the same overhead as for

a balanced load.

6.3 Dual-Mode Access Workloads

In accounting for the effect of variance in the arrival times of processors at a barrier

in Eq. 6.1, it was assumed that the the ﬂuctuations in the execution time of each

167

Sequent Symmetry BBN TC2000
0-035 IIII 40IIIIIII

35 CNT *—
DSM '9—
30
25
WAN) 20

 

 

    

 

 

 

 

 

0 I I I L 0

0 4 8 12 16 20 0 4 8 12 16 20 24 28 32
No. of competitors (N) No. of competitors (N)

 

NaM : Oil =11Gt =(gm = ¢igc : (1000):.93 = ¢)1
Ge = (9m = 43191: = (300lll)aga = 43)

Figure 6.7. Performance of staggered arrivals at the barrier

processor were incidental in the computation itself. The effects of memory conﬂicts,
contention for other hardware resources or other interprocessor interactions were ig-
nored due to the nature of the embarrassing workloads. Therefore, the computation
times of all processors could be treated as independent identically distributed random
variables (i.i.d’s) with mean p and variance 0.

However, if ﬂuctuations in the barrier arrival times are present as a result of
planned interactions between processors during the phase, such as contention in reg-
ular reference patterns to shared data and mutual exclusion at critical sections, then
the assumption of independence between processors no longer holds. The situation
thus becomes more complex and the effect of ﬂuctuations is best characterized exper-
imentally. The dual-mode access workloads are used for this purpose.

The same workloads as used in Section 5.4 to measure the incremental overhead
components associated with memory access contention and CS synchronization are
used here again to observe the incremental overhead resulting from barriers. The

cumulative interference values \Ilm, \II, and \111, as measured by the MAD, SAD and

168

 

 

 

 

 

 

 

CNT Barrier DSM Barrier
12 I I I I I 12 ﬂ I I I f
MCS e
TAS o 10
8
Cum. Cum. 6
Interf. Interf.
4
2
0
0 4 8 12 16 20 0 4 8 12 16 20

No. of competitors (N) No. of competitors (N)

N,M =128K,e =1,G.=(gm = (0,64K[1.0],1,32),g, = (16),g, = (1,2,0.5))

Figure 6.8. Cumulative interferences unit stride workload on the Symmetry

BAD kernels respectively for the same workload are plotted in Figure 6.8 (for the
Symmetry) and Figure 6.9 for the TC2000. The workload with unit stride (s = 1) is
used. For each barrier implementation, the cumulative barrier interference \Ilb(N) is
measured with the TAS and MCS locks used, in turn, for the critical section.

Since I = 1 for the workload used, the difference between the III), and \II, curves

directly measure the incremental barrier interference 1b,. In other words,
tMN) = \111,(N) - \I'.(N)

in Figures 6.8 and 6.9. On the Symmetry, both CNT and DSM barriers display com-
parable values for the cumulative and hence incremental barrier interference. This is
as a result of the predominance of the effect of barrier arrival ﬂuctuations discussed
in the previous section. Both TAS and MCS lock workloads experience similar in-
creases in the total overhead on account of the barrier. It is also noteworthy that for
low values of N < 12 the incremental barrier interference 1/21,(N ) is the single largest

source of runtime overheads.

 

 

169

 

 

CNT Barrier DSM Barrier

200 I I I I I I I 25 I I I I I I I

180 MCS c

160 TAS o 20 F -

140 ‘1’5

120 15 - -1
8‘12"}. 100 E1331.

80 10 r -*

60 , —. '

40 5 - , = " if ‘P‘ u

20 M " _111,,

0 0 ’ ’- 1"“

0 4 81216202428
No. of competitors (N)

 

 

32

 

0 4 8121620242832

No. of competitors (N)

N,M =128K,€ =1,G. = (gm = (0,64K[1.0],1,32),g, = (16),g, -_— (1,2,0.5))

Figure 6.9. Cumulative interferences unit stride workload on the TC2000

On the TC2000, the incremental barrier interference 1b.,(N) is far worse for the
CNT barrier than for DSM. The primary reason behind this dismal performance is
two-fold: ﬁrst, the last arriving processor at the ON T barrier must contend with the
N processors already present for access to the “sense” ﬂag to toggle its state; second,
the continuous spins on the barrier sense ﬂag ﬂood the interconnection network with
busy-wait trafﬁc thus interfering with the memory accesses performed within the unit
grain. The effect of the busy-waits is further accentuated in the performance of the
TAS lock workload for large N due to the combination of two spinning instances,
namely, within the TAS lock and within the CNT barrier. With the DSM barrier,
however, the incremental barrier penalties experienced by both TAS and MCS lock

workloads are comparable.

6.4 Summary

Synchronization barriers impose two kinds of performance penalties on parallel al-

gorithm performance: overhead of barrier execution, and maximization of load im-

170

balance losses. The overhead of barrier execution includes the contention for shared
resources by the barrier code. Two barrier implementations were studied on the Se-
quent Symmetry and TC2000 multiprocessors — a centralized sense-reversing barrier
(CNT) and a tree-like dissemination barrier (DSM). If independent network transac-
tions can proceed in parallel on a machine, then the critical path length is 0(log P)
for the DSM, but 0(P) for the CNT. On an interconnection that serializes network
transactions, the logarithmic factor will be dominated asymptotically by the linear
(or more) total number of network transactions.

The DSM barrier was observed to be more suitable on the distributed8memory
TC2000 system, whereas CNT performed better on the cache-coherent Symmetry
system. In the DSM barrier, no network transactions are due to spinning, so inter-
connect contention is not a problem. The ON T, on the contrary, maximizes memory
contention and hot spots caused by synchronization on machines without coherent
caches. The performance of CNT on distributed-memory machines without coherent
caches can be improved by adaptive backoff strategies between polls of the sense ﬂag.
However, their scalability is limited on large-scale systems, as the number of network
accesses per processor increases sharply as collisions in the network cause processors
to repeat accesses [2].

The CNT barrier enjoys one additional advantage over DSM: it adapts easily to
differing numbers of processors. If the number of processors participating a barrier
changes from one barrier episode to another, the log-depth DSM barrier would require
internal reorganization, possibly swamping any performance advantage obtained in
the barrier itself. Changing the number of processors in the CNT entails no more
than changing a single constant.

The BAD kernels can be used either independently to evaluate the efﬁciency and
scalability of the implementation of a barrier mechanism; or they can be used in con-
junction with the MAD and SAD kernels to isolate the incremental overheads incurred
as a result of synchronization barriers from the total performance loss experienced by

an input workload.

CHAPTER 7

CONCLUSIONS

The increasing complexity of multiprocessor systems necessitates the development of
accurate techniques to characterize their behavior under a variety of workload condi-
tions so that efficient algorithms can be designed to effectively utilize the machine and
reasonable performance expectations established. This thesis proposes a hierarchical
model to characterize multiprocessor system performance and develops techniques to
measure and calibrate the parameters of the model. In this chapter, we summarize
the salient contributions made by this research and present interesting avenues for

possible future research.

7 .1 Research Contributions

The run-time overhead of communication on multiprocessors can signiﬁcantly limit
the amount of program parallelism that can be exploited. In programs using the
shared-variable paradigm, communication manifests itself along three principal di-
mensions, namely, shared data accesses (including memory contention, cache misses in
cache-coherent machines and non-local memory accesses in hierarchical or distributed
memory machines), inter-process synchronization operations, and global barrier syn-
chronizations. As more processors are added, the communication costs of algorithms
increase. It is the rate at which these costs increase that determines an algorithm’s
efﬁciency and scalability. Measurements must be made to quantify the impact of

such run-time overheads on the overall performance of a system for speciﬁc algo-

171

172

rithms / applications.

We have developed a system characterization methodology based on a hierarchi-
cal approach using a multi-phase computation structure to describe the static and
dynamic behavior of program execution on a multiprocessor. We maintain that the
characterization of performance is tied inextricably to the input workload used and,
therefore, should depend signiﬁcantly on the user’s needs and preference for selec-
tive workload characteristics. We have presented a ﬂexible technique for benchmark
workload generation that can be tailored to highlight speciﬁc aspects of performance.
The workload generator is based on the deﬁnition of a unit grain that allows a user
to identify the most signiﬁcant factors inﬂuencing performance and use them as the
characterization attributes for the unit grain.

Two sets of system characterization parameters have been proposed to completely
describe the behavior of a given input workload on a target multiprocessor system.
The ﬁrst set, involving the three static parameters (R00, f1/2, c1”), describes the max-
imum asymptotic performance possible and the expected performance degradation as
a result of static overheads in the input workload. The second set, consisting of the
three dynamic parameters (1b,,(N), 1b,(N), 1/25(N)), describes the run-time overheads
resulting from dynamic interactions between concurrent processes along the three
performance dimensions mentioned earlier as a function of execution parallelism. We
have also presented a series of parameterized formulae that relate the quantitative
characteristics of a workload to a realistic estimation of its performance via the system
characterization parameters.

We have developed a family of workload emulation kernels that allow one to
study the interaction of the different factors identiﬁed in an input workload and
measure the incremental inﬂuence of each factor on performance. The measured
data is used to calibrate the system characterization parameters described above.
The MAD kernels, designed to calibrate the memory contention parameter tbm(N),
provide a testbed for the investigation of multiprocessor memory system performance
under a variety of memory reference patterns. The SAD kernels, used to calibrate

the synchronization parameter 1/2,(N ), facilitate the evaluation of the implementation

173

efﬁciency of synchronization operations based on spin locks and their sensitivity to
algorithm characteristics. The BAD kernels, used to calibrate the barrier parameter
¢5(N), allow us to explore the efﬁciency of a barrier implementation and the losses
accruing from barrier synchronization. We demonstrated the applicability of the
system characterization methodology and the effectiveness of the workload emulation
kernels on the Sequent Symmetry and BBN TC2000 commercial multiprocessors in
studying the performance of several synthetic workloads.

We believe that our approach to performance characterization will serve to model
performance with greater ﬁdelity than exists in the current state of art, since it in-
corporates the effect of both static and dynamic inﬂuences in a workload execution.
Further, the proposed methodology is independent of any particular multiprocessor
architecture or application. Only a shared-variable programming paradigm is as-
sumed, but no assumptions are made about the organization of the shared address
space. Hence, our framework can not only be used to evaluate multiprocessors that
provide physical shared memory, but also possible highly-parallel designs in the future

supporting shared virtual memory over scalable interconnection networks.

Limitations of the Approach

Although the approach presented in this thesis can be successfully applied to charac-

terize the performance of a wide variety of multiprocessor workloads, it has several

limitations.

0 The parallel processes in the workload are assumed to be statically assigned to
processors with no run-time migration. Hence, the overhead of dynamic load

balancing strategies, adopted on many multiprocessors, is not modeled.

o It is assumed that processes are assigned only one per processor with the total
number of processes being less than the number of physical processors available.
Although this is an accurate reﬂection of the structure of parallel programs on
systems on which process creation and destruction are too expensive to be done

frequently, many parallel machines have begun to support the implementation

174

of “light-weight processes” (or threads) that may time-share a single processor.
If such parallel threads are used, then our model does nor account for the

context-switch overheads associated with managing the threads.

0 The model has limited applicability to heavily data-dependent parallel applica-
tions. For algorithms with data-dependent branches, probabilistic models are
more appropriate. Although our workload generator allows the use of stochastic
parameter values, the reliability of the measured performance will depend on
the accuracy with which the the probability distributions chosen for workload

parameters represent the real algorithm characteristics.

7 .2 Directions for Future Research

The performance models and experimental results presented in this thesis establish a
foundation for future study, but need to be extended in several ways.

Algorithms that exhibit essentially asynchronous execution of concurrent processes
within a phase (only implicit synchronization in the form of mutually-exclusive access
to a critical section are present) are considered in our performance studies. The
unit grain based workload models should be expanded to include a larger variety of
workload characterizations. For example, other forms of shared memory inter-process
synchronizations such as those with explicit event post/wait or message send/receive
semantics should be investigated. Also, the performance of alternate abstractions of
the basic computation unit (BCU), such as a complex ﬂoating-point expression or a
fundamental matrix operation, should provide interesting insight into the computing
performance of a machine. In the same light, program models other than the multi—
phase structured iterative algorithms studied here can be selected as the basis of
system characterization.

Only a single memory access stream emanating from each processor was con—
sidered, since most available general-purpose multiprocessor systems provide only a
single physical path from processor to memory. However, to include vector processors

with multiple processor—memory paths in the scope of the proposed methodology, the

175

workload generation techniques can be adapted to provide multiple memory access
streams and the performance model augmented to reﬂect the corresponding change.

The other most popular programming model, besides the shared-variable
paradigm, uses message passing for inter—process communication and is normally used
on distributed-memory multicomputers. The extension of our proposed framework to
address the performance issues in the message-passing programming paradigm and
characterize the behavior of message-passing workloads would, in some sense, impart
a degree of completeness to the performance characterization methodology.

Finally, a particularly challenging proposition, in this respect, is the building of
an integrated system characterization and application performance estimation en-
vironment. It would allow common performance experiments to be performed on
different multiprocessor systems to characterize them and use the repository of data
gathered, in conjunction with an application analyzer, to enable accurate estimation

of application execution performance on a target architecture.

BIBLIOGRAPHY

[1]

[2]

[3]

[4]

[5]
[6]

[7]

[8]

[9]

BIBLIOGRAPHY

A. Agarwal and A. Gupta. Memory reference characteristics of multiproces-
sor applications under Mach. In Proceedings of the 1988 ACM SIGMETRICS
Conference, pages 422 - 433, 1988.

Anant Agarwal and Mathews Cherian. Adaptive backoff synchronization tech-
niques. In Proceedings of the International Symposium on Computer Architec-
ture, pages 396 -— 406, May 1989.

A. Agarwal et al. An evaluation of directory schemes for cache coherence. In
Proceedings of the 15th Annual International Symposium on Computer Archi-
tecture, pages 280 —- 289, 1988.

G.A. Amdahl. Validity of the single processor approach to achieving large-scale
computing capabilities. In AFIPS Conference Proceedings, volume 30, pages
483 - 485, 1967.

Ames Research Laboratory. The SLALOM Benchmark, 1992.

Thomas E. Anderson. The performance of spin lock alternatives for shared
memory multiprocessors. IEEE Transactions on Parallel and Distributed Sys-
tems, 1(1):6 — 16, 1990.

Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. The per-
formance implications of thread management alternatives for shared-memory
multiprocessors. IEEE Transactions on Computers, 38(12):1631 — 1644, De-
cember 1989.

NS. Arenstorf and H.F. Jordan. Comparing barrier algorithms. Technical Re-
port 87-65, ICASE, NASA Langley Research Center, Hampton, VA, September
1987.

T.S. Axelrod. Effects of synchronization barriers on multiprocessor perfor-
mance. In Parallel Computing 3, pages 129 - 140. North-Holland, 1986.

176

177

[10] D.H. Bailey and J.T. Barton. The NAS kernel benchmark program. Technical
report, NASA Technical Memorandum 86711, August 1985.

[11] F. Baskett and A. J. Smith. Interference in multiprocessor computer systems
and interleaved memory. Communications of the ACM, 19:327 - 334, June 1976.

[12] SJ. Baylor and RD. Rathi. A study of the memory reference behavior of
engineering/ scientiﬁc applications in parallel processors. In Proceedings of the

1989 International Conference on Parallel Processing, volume 1, pages 78 - 82,
1989.

[13] BBN Advanced Computers Inc., Cambridge, Massachusetts. Overview of the
Butterﬂy GP1000, November 1988.

[14] BBN Advanced Computers Inc., Cambridge, Massachusetts. TC2000 Technical
Product Summary, November 1989.

[15] BBN Advanced Computers Inc., Cambridge, Massachusetts. Inside the TC2000
Computer, 1990.

[16] C.J. Beckmann and C. Polychronopolous. The effect of barrier synchronization
and scheduling overhead on parallel loops. In Proceedings of the 1989 Interna-
tional Conference on Parallel Processing, volume 2, pages 200 — 204, 1989.

[17] M. Berry. The Perfect Club benchmarks: Effective performance evaluation of
supercomputers. The International Journal of Supercomputer Applications, 3:5

- 40, 1989.

[18] DP. Bhandarkar. Analysis of memory interference in multiprocessors. IEEE
Transactions on Computers, C-24:897 — 908, September 1975.

[19] Laxmi N. Bhuyan. An analysis of processor-memory interconnection networks.

IEEE Transactions on Computers, C-34:279 — 283, March 1985.

[20] R. Bisiani and M. Ravishankar. PLUS: A distributed shared-memory system.
In Proc. 17th Intl. Symp. on Computer Architecture, pages 115-124, 1990.

[21] ED. Brooks. The butterﬂy barrier. Int. Jour. of Parallel Programming,
15(4):295 — 307, 1986.

[22] R. Bryant, P. Carini, H. Chang, and B. Rosenburg. Supporting structured
shared virtual memory under Mach. In Proc. USENIX Mach Symposium,
November 1991.

178

[23] W.H. Burkhardt. Aspects of multiprocessor systems. In Proceedings of the
COMPEURO ’87 Conference, pages 99 - 105, 1987.

[24] Ingrid Y. Butcher and Margaret L. Simmons. Measurement of memory access
contentions in multiple vector processor systems. In Proceedings of the Super-
computing ’91 Conference, pages 806 - 817, November 1991.

[25] B.L. Buzbee. The efﬁciency of parallel processing. Computer Design, June 1984.

[26] A. Cox and R. Fowler. The implementation of a coherent memory abstraction
on a NUMA multiprocessor: Experiences with PLATINUM. In Proc. 12th ACM
Symp. on Operating System Principles, pages 32—44, Dec. 1989.

[27] H.J. Curnow and RA. Wichmann. A synthetic benchmark. The Computer
Journal, 19(1):43 — 49, 1976.

[28] Zarka Cvetanovic. Performance Analysis of Multiple-Processor Systems. PhD

thesis, University of Massachusetts, Amherst, Department of Computer Science,
May 1986.

[29] George Cybenko, Lyle Kipp, Lynn Pointer, and David Kuck. Supercomputer
performance evaluation and the Perfect benchmarks. Technical Report 965,
University of Illinois at Urbana-Champaign, Center for Supercomputing Re-
search and Development, Urbana, IL, March 1990.

[30] F. Darema-Rogers, G.F. Pﬁster, and K. So. Memory access patterns of parallel
scientiﬁc programs. In Proceedings of the 1987 ACM SI GME TRI CS Conference,
pages 46 — 58, 1987.

[31] Chita R. Das and Laxmi N. Bhuyan. Bandwidth availability of multiple-bus

multiprocessors. IEEE Transactions on Computers, C-34z918 - 926, October
1985.

[32] U. Detert and G. Hofemann. CRAY X-MP and Y-MP memory performance.
In Parallel Computing 17, pages 579 - 590. North-Holland, 1991.

[33] J.J. Dongarra. The Linpack benchmark: An explanation. In Supercomputing
First International Conference Proceedings, Athens, Lecture Notes in Computer
Science 297, pages 456 - 473, 1987.

[34] J .J . Dongarra and A. Hinds. Comparison of the Cray X-MP / 4, Fujitsu VP-200
and Hitachi S-810/ 20: An Argonne perspective. Technical Report ANL-85-19,
MCS Division, Argonne National Laboratory, Argonne, IL, October 1985.

179

[35] J .J . Dongarra, J. Martin, and J. Worlton. Computer benchmarking: Paths and
pitfalls. IEEE Spectrum, 24(7):38 - 43, July 1987.

[36] Thomas H. Dunigan. Kendall Square multiprocessor: Early experiences and
performance. Technical Report ORNL/TM-12065, Oak Ridge National Labo-
ratory, Oak Ridge, March 1992.

[37] J. Eggers and RH. Katz. A characterization of sharing in parallel programs and
its applicability to coherency protocol evaluation. In Proceedings of the 15th
International Symposium on Computer Architecture, pages 373 - 382, 1988.

[38] Encore Computer Corporation. Multimaa: Technical Summary, 1986.

[39] W. Feller. An Introduction to Probability Theory and Its Applications, volume 1.
New York: Wiley, 1957.

[40] B. Fleisch and G. Popek. Mirage: A coherent distributed shared memory design.
In Proc. 12th ACM Symp. on Operating System Principles, pages 211—223,
December 1989.

[41] Philip J. Fleming and John J. Wallace. How not to lie with statistics: The cor-

rect way to summarize benchmark results. ACM Computing Practices, 292218

- 221, March 1986.

[42] Ian Foster, William Gropp, and Rick Stevens. The parallel scalability of the
spectral transform method. Technical report, MCS Division, Argonne National
Laboratory, Argonne, IL, January 1991.

[43] KT. Fung and RC. Torng. On the analysis of memory conﬂicts and bus con-
tentions in a multiple microprocessor system. IEEE Transactions on Computers,
C-27:28 — 37, January 1979.

[44] D. Gajski et al. Cedar construction of a large scale multiprocessor. Technical
Report UIUCDCS-R-83—1123, University of Illinois, Department of Computer
Science, February 1983.

[45] K. Gallivan, D. Gannon, W. Jalby, A. Malony, and H. Wijshoff. Experimentally
characterizing the behavior of multiprocessor memory systems: A case study.

IEEE Transactions on Software Engineering, 16(2):216 — 223, February 1990.

[46] ER Gehringer, D.P. Siewiorek, and Z. Segall. Parallel Processing: The Cm‘“
Experience. MA: Digital, Bedford, 1987.

180

[47] E. Gelenbe. Asymptotic processing time of a model of parallel computation. In
Proc. of Nat. Comp. Conf., Las Vegas, NV, November 1986.

[48] E. Gelenbe. Multiprocessor Performance. New York: Wiley, 1989.

[49] E. Gelenbe. Multiprocessor performance and the activity set model of program
behavior. In J .L. Delhaye and E. Gelenbe, editors, High Performance Comput-
ing, pages 121 — 132. Amsterdam, The Netherlands: North-Holland, 1989.

[50] J .R. Goodman, M.K. Vernon, and P.J. Woest. Efﬁcient synchronization primi-
tives for large-scale cache coherent multiprocessors. In Proceedings of the Third
International Conference on Architectural Support for Programming Language
and Operating Systems, pages 64 - 75, April 1989.

[51] A. Gottlieb, R. Grishman, C.P. Kruskal, K.M. McAuliffe, L. Rudolph, and
M. Snir. The NYU Ultracomputer — designing an MIMD shared memory parallel
computer. IEEE Transactions on Computers, C-32(21):175 — 189, February
1983.

[52] Gary Graunke and Shreekant Thakkar. Synchronization algorithms for shared
memory multiprocessors. IEEE Computer, pages 62 - 69, June 1990.

[53] R. Gupta. The fuzzy barrier: A mechanism for the high speed synchronization
of processors. In Third Int. Conf. on Architectural Support for Programming
Languages and Operating Systems, pages 54 - 63, April 1989.

[54] J.L. Gustafson. Amdahl’s law re-evaluated. Communications of the ACM,
31:532 - 533, 1988.

[55] D.T. Harper III and J .R. Jump. Vector access performance in parallel memories
using a skewed access scheme. IEEE Transactions on Computers, C-36(12):1440
— 1449, December 1987.

[56] P. Heidelberger and S. Lavenberg. Computer performance evaluation method-
ology. IEEE Transactions on Computers, C-33:1195 — 1220, December 1984.

[57] J. Helin and K. Kaski. Performance analysis of high-speed computers. In
Proceedings of the 1989 Supercomputing Conference, pages 797 -— 808, 1989.

[58] D. Hensgen, R. Finkel, and U. Manber. Two algorithms for barrier synchro-
nization. Int. J. Parallel Program, 17(1):1 — 17, 1988.

181

[59] M. Herlihy. Impossibility and universality results for wait-free synchronization.
In Proceedings of the Seventh Annual ACM Symposium on Principles of Dis-
tributed Computing, pages 276 - 291, 1988.

[60] RH. Hill. The art of benchmarking. The Spang Robinson Report on Supercom-
puting and Parallel Processing, (3), 1989.

[61] R.W. Hockney. Performance characterization of the HEP. In J .S. Kowalik,
editor, MIMD Computation: HEP Supercomputer and its Applications. Cam-
bridge: MIT Press, 1985.

[62] R.W. Hockney. (romnl [2,31/2) measurements on the 2-CPU CRAY X-MP. In
Parallel Computing 2, pages 1 - 14. North-Holland, 1985.

[63] R.W. Hockney. Parameterization of computer performance. In Parallel Com-
puting 5, pages 97 —- 103. North—Holland, 1987.

[64] Intel Corporation. A Touchstone DELTA System Description, 1991.

[65] D.N. Jayasimha. Distributed synchronizers. In Proceedings of the 1988 Inter-
national Conference on Parallel Processing, pages 23 - 27, 1988.

[66] Kendall Square Research. KSRI, 1992.

[67] C.P. Kruskal, L. Rudolph, and M. Snir. Efﬁcient synchronization on multi-
processors with shared memory. In Proceedings of the Seventh Annual ACM
Symposium on Principles of Distributed Computing, pages 218 — 228, 1986.

[68] C.P. Kruskal and A. Weiss. Allocating independent subtasks on parallel proces-

sors. In Proceedings of the 1984 International Conference on Parallel Processing,
pages 236 — 240, 1984.

[69] David J. Kuck and Ahmed H. Sameh. A supercomputing performance evalua-
tion plan. Technical Report 692, University of Illinois at Urbana-Champaign,

Center for Supercomputing Research and Development, Urbana, IL, June 1987.

[70] DJ. Kuck et al. Dependence graphs and compiler optimizations. In Proceedings
of the 8th ACM Symposium on Principles of Programming Languages, pages 207
— 218, January 1981.

[71] H.T. Kung. Synchronized and synchronous parallel algorithms for multiproces-
sors. In J.F. Traub, editor, Algorithms and Complexity: New Directions and
Recent Results. New York: Academic, 1976.

[72]

[73]

[74]

[75]

[76]

[77]

[78]

[79]

[80]

[81]

[82]

[83]

182

L. Lamport. A fast mutual exclusion algorithm. ACM Transactions on Com-
puter Systems, 5(1), 1987.

D.H. Lawrie and CR. Vora. The prime memory system for array access. IEEE
Transactions on Computers, 31(5):435 — 442, May 1982.

D. Lee. Scrambled storage for parallel memory systems. In Proceedings of the
International Symposium on Computer Architecture, pages 232 — 239, 1988.

G. Lee, C.P. Kruskal, and DJ. Kuck. The eﬂectiveness of combining in shared-
memory parallel computers in the presence of “hot-spots”. In Proceedings of the

1986 International Conference on Parallel Processing, pages 11 — 12, August
1986.

J. Lee and U. Ramachandran. Synchronization with multiprocessor cache. In

Proceedings of the International Symposium on Computer Architecture, pages

27 — 37, May 1990.

D. Lenowski et al. The directory—based cache coherence protocol for the DASH
multiprocessor. In Proceedings of the 17th Annual International Symposium on
Computer Architecture, pages 148 - 159, May 1990.

K. Li and P. Hudak. Memory coherence in shared virtual memory systems. In
ACM Transactions on Computer Systems, pages 321—359, November 1989.

K. Li and R. Schaefer. A hypercube shared virtual memory system. In Proc.
Intl. Conf. on Parallel Processing, pages 125—131, 1989.

T. Lovett and S. Thakkar. The Symmetry multiprocessor system. In Proceedings
of the 1988 International Conference on Parallel Processing, pages 303 - 310,
August 1988.

RD. Lubachevsky. Synchronization barrier and related tools for shared memory
parallel programming. In Proceedings of the 1989 International Conference on
Parallel Processing, volume 2, pages 175 — 179, August 1989.

O. Lubeck, J. Moore, and R. Mendez. A benchmark comparison of three su-

percomputers: Fujitsu VP-200, Hitachi S-810/20 and Cray X-MP/2. IEEE
Computer, 18, December 1985.

S.F. Lundstrom. Applications cosiderations in the system design of highly con-
current multiprocessors. IEEE Transactions on Computers, C-36(11):1292 —
1309, November 1987.

183

[84] S. Madala and J .B. Sinclair. Performance of synchronous parallel algorithms

[85]

[86]

[87]

[88]

[89]

[90]

[91]

[92]

[93]

[94]

with regular structures. IEEE Transactions on Parallel and Distributed Sys-
tems, 2(1):105 — 116, January 1991.

Allen D. Malony. Performance Observability. PhD thesis, University of Illinois
at Urbana—Champaign, Department of Computer Science, October 1990.

M.A. Marsan and M. Gerla. Markov models for multiple-bus multiprocessors.
IEEE Transactions on Computers, C-31:239 - 248, March 1982.

J .L. Martin. Performance evaluation: Applications and architectures. In Second
International Conference on Supercomputing, pages 369 — 373, May 1987.

PH. McMahon. The Livermore Fortran kernels: A computer test of the ﬂoating-
point performance range. Technical Report UCRL-53745, Lawrence Livermore
National Laboratory, December 1986.

John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable syn-
chronization on shared-memory multiprocessors. ACM Transactions on Com—
puter Systems, 9(1):21 — 65, February 1991.

HE. Mizrahi, J .L. Baer, D. Lazowska, and J. Zahorjan. Extending the memory
hierarchy into multiprocessor interconnection networks: A performance analy-
sis. In Proceedings of the 1989 International Conference on Parallel Processing,
volume 1, pages 41 — 50, August 1989.

J. Mohan. Performance of Parallel Programs: model and analyses. PhD thesis,

Carnegie-Mellon University, Pittsburg, Department of Computer Science, July
1984.

Arun N anda and Lionel M. Ni. Benchmark workload generation and perfor-

mance characterization of multiprocessors. In Proceedings of the Supercomput-
ing ’92 Conference, November 1992.

Arun N anda and Lionel M. Ni. MAD kernels: An experimental testbed to study
multiprocessor memory system behavior. In Proceedings of the 1992 Interna-
tional Conference on Parallel Processing, August 1992.

Arun Nanda and Lionel M. Ni. SAD kernels: A software tool to evaluate syn-
chronization behavior of multiprocessors. In Proceedings of the 1992 Computer
Science and Applications Conference, September 1992.

184

[95] Arun Nanda, Honda Shing, Ten-Hwan Tzen, and Lionel M. Ni. A replicate
workload framework to study performance degradation in shared-memory mul-
tiprocessors. Technical Report MSU-CPS-ACS-l8, Michigan State University,
Department of Computer Science, January 1990.

[96] Arun Nanda, Honda Shing, Ten-Hwan Tzen, and Lionel M. Ni. Resource
contention in shared-memory multiprocessors: A parameterized performance
degradation model. Journal of Parallel and Distributed Computing, 12:313 —
328, July 1991.

[97] K.W. Neves and H.D. Simon. Supercomputer performance evaluation: Bench-
marking applications on supercomputers. In Second International Conference
on Supercomputing, pages 374 — 379, May 1987.

[98] A. Norton and E. Melton. A class of boolean linear transformations for conﬂict-
free power-of-two stride access. In Proceedings of the 1987 International Con-

ference on Parallel Processing, pages 247 — 254, August 1987.

[99] W. Oed and O. Lange. On the effective bandwidth of interleaved memories in

vector processing systems. IEEE Transactions on Computers, C-34(10):949 -
957, October 1985.

[100] M.T. O’Keefe and HG. Dietz. Hardware barrier synchronization: Static Barrier
MIMD (SBM) and Dynamic Barrier MIMD (DBM). In Proceedings of the 1990

International Conference on Parallel Processing, volume 1, pages 35 - 46, 1990.

[101] J .H. Patel. Performance of processor-memory interconnections for multiproces-
sors. IEEE Transactions on Computers, C-30z771 - 780, October 1981.

[102] R. Perron and C. Mundie. The architecture of the Alliant FX/ 8 computer. In
Spring COMPCON ’86, pages 390 — 393, March 1986.

[103] B.L. Peuto and L.J. Shustek. An instruction timing model of CPU performance.

In Proc. Fourth Annual Symp. Comput. Architecture, volume 5, pages 165 -— 178,
March 1977.

[104] G. Pﬁster, W.C. Brantley, D.A. George, S.L. Harvey, W.J. Kleinfelder, K.P.
McAvliffe, T.A. Melton, V.A. Norton, and J. Weiss. The IBM research parallel
processor prototype (RP3): Introduction and architecture. In Proceedings of the

1985 International Conference on Parallel Processing, pages 764 — 771, August
1985.

[105]

[106]

[107]

[108]

[109]

[110]

[111]

[112]

[113]

[114]

[115]

[116]

185

G.F. Pﬁster and V.A. Norton. Hot-spot contention and combining in multistage
interconnection networks. IEEE Transactions on Computers, C-34z943 — 948,
October 1985.

C. Polychronopolous. Compiler optimizations for enhancing parallelism and

their impact on architecture design. IEEE Transactions on Computers, C-
37(8):991 — 1004, August 1989.

Ram Raghavan and John P. Hayes. On randomly interleaved memories. In

Proceedings of the Supercomputing ’90 Conference, pages 1 - 10, November
1990.

U. Ramachandran, M. Ahamad, and M.Y.A. Khalil. Coherence of distributed

shared memory: Unifying synchronization and transfer of data. In Proc. Intl.
Conf. on Parallel Processing, volume 11, pages 160—169, August 1989.

RD. Rettberg, W.R. Crowther, P.P. Garvey, and RS. Tomlinson. The Monarch
parallel processor hardware design. IEEE Computer, pages 18 — 30, April 1990.

Rafael H. Saavedra-Barrera, Alan J. Smith, and Eugene Miya. Machine charac-
terization based on an abstract high-level language machine. IEEE Transactions
on Computers, 38:1659 — 1679, December 1989.

RH. Saavedra-Barrera. Machine characterization and benchmark performance
prediction. Technical Report UCB/CSD 88/ 437, University of California,
Berkeley, June 1989.

RC. Scarbourough and HG. Kolsky. A vectorizing FORTRAN compiler. IBM
Journal of Research and Development, 30(2), March 1986.

Sequent Computer Systems Inc. Balance 8000 System Technical Summary,
1984.

Sequent Computer Systems Inc. Symmetry Technical Summary, 1987.

Leah J. Siegel, Howard J. Siegel, and Philip H. Swain. Performance measures
for evaluating algorithms for SIMD machines. IEEE Transactions on Software
Engineering, SE-8(4):319 - 330, July 1982.

J.P. Singh, W. Weber, and A. Gupta. SPLASH: Stanford parallel applications
for shared-memory. Technical report, Computer Systems Laboratory, Stanford
University, CA, 1991.

186

[117] James E. Smith. Characterizing computer performance with a single number.
ACM Computing Practices, 3121202 — 1206, October 1988.

[118] Per Stenstrom. A survey of cache coherence schemes for multiprocessors. IEEE
Computer, pages 12 — 24, June 1990.

[119] R. Thomas. Behavior of the Butterﬂy parallel processor in the presence of
memory hot spots. In Proceedings of the 1986 International Conference on
Parallel Processing, pages 46 - 50, 1986.

[120] J. Uniejewski. SPEC benchmark suite: Designed for today’s advanced systems.
SPEC Newsletter, 1, 1989.

[121] Dalibor F. Vrsalovic, Daniel P. Siewiorek, Zary Z. Segall, and Edward F.
Gehringer. Performance prediction and calibration for a class of multiproces-
sors. IEEE Transactions on Computers, 37:1353 — 1364, November 1988.

[122] W.H. Ware. The ultimate computer. IEEE Spectrum, pages 84 — 91, March
1982.

[123] RP. Weicker. Dhrystone: A synthetic systems programming benchmark. Com-
munications of the ACM, 27(10):1013 - 1030, October 1984.

[124] S. Weiss. An aperiodic storage scheme to reduce memory conﬂicts in vector
processors. In Proceedings of the International Symposium on Computer Archi-
tecture, pages 380 — 385, 1989.

[125] J. Worlton. Understanding supercomputer benchmarks. Datamation, pages 121
- 129, 1984.

[126] RC. Yew, S.N. Tzeng, and D.H. Lawrie. Distributing hot-spot addressing in
large-scale multiprocessors. In Proceedings of the 1986 International Conference
on Parallel Processing, pages 51 - 58, August 1987.

[127] Xiaodong Zhang. Performance measurement and modeling to evaluate various
effects on a shared-memory multiprocessor. IEEE Transactions on Software
Engineering, 17(1):87 - 93, January 1991.

ICHIGRN STRTE UNI V.

lllll[ll[Ill][WWI][I][I][lllllllllllllll