: .., E... . . r.rl.,5r:.;tcL....c:5.5.: 257,!
i . it .2 2-5%.: 1. .- 2. . 2..

 

 
 

 

 

'7HE8tS

                                                                                                                             

31293 00904 5810

This is to certify that the

dissertation entitled

A SCALABLE SNOOPY CACHE-COHERENCE
SCHEME ON A MULTIPLE BUS
MULTIPROCESSOR

presented by

Tyan—Shu JOU

has been accepted towards fulﬁllment
of the requirements for

PhD degree in Computer Science

 

QM Cam
MajoU rofcssor /

 

Date x7317 <93, (7,93

MSU is an Afﬁrmative Action/Equal Opportunity Institution 0- 12771

 

 

LIBRARY
Michigan State
University

 

 

 

PLACE IN RETURN BOX to remove this checkout from your record.
TO AVOID FINES return on or before date due.

DATE DUE DATE DUE DATE DUE
‘W

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MSU Is An Affirmative Action/Equal Opportunity Institution
c:\circ\dateduetpm3-p.1

A SCALABLE SNOOPY CACHE-COHERENCE
SCHEME ON A MULTIPLE BUS

MULTIPROCESSOR

By

Tyan-Shu Jon,

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements

for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1993

ABSTRACT

A SCALABLE SNOOPY CACHE-COHERENCE
SCHEME ON A MULTIPLE BUS

MULTIPROCESSOR

By

Tyan-Shu Jou

Maintaining data coherence among multiple caches is one of the most important
problems in building shared-memory multiprocessors. Snoopy and directory coher-
ence schemes are the two main approaches for solving the cache-coherence problems
on shared-memory multiprocessors. Traditional snoopy schemes work efﬁciently on
small-scale, single-bus-connected systems, but the bottleneck caused by the shared
bus signiﬁcantly limits the allowable number of processors. A system which adopts
a directory scheme can scale relatively better due to the possibility of using high-
bandwidth interconnection networks, but at the expense of longer memory access
latency.

This dissertation proposes a scalable, snoopy cache-coherent multiprocessor based
on a multiple bus network, and introduces a novel concept of partial-snooping, which
can simplify the snooping mechanism when the number of processors and busses is

large. To reduce the hardware complexity, each processor will snoop on a dynami-

cally changing subset of the busses. The new scheme extends the number of processors
that can be supported using a simple snooping mechanism before more complex con-
sistency mechanisms become necessary. The physically-distributed-logically-shared
memory model is chosen to let processors take advantage of local memory accesses
for private data. The proposed coherence solution is particularly attractive consider-
ing the high bandwidth and channel tunable capability of ﬁber optical networks.
The performance of the proposed scheme versus a directory scheme is evaluated
in three ways: qualitative discussion, simulation experiments, and queueing analy-
sis. The discussion shows that the proposed system can support low access latency
and high communication bandwidth at a moderate hardware cost. The comparative
evaluation by simulation shows that even though the experiments were designed to
give advantages to the directory scheme, the proposed scheme can usually perform
better. The simple and effective analytical method can evaluate systems with up to

hundreds of processors in seconds.

To my parents, for the love they have given in thirty years

iv

ACKNOWLED GMENTS

I’ve been fortunate to have the help and support from many people while pursuing
my Ph.D., and I now like to give them my sincere thanks.

First and foremost, my thank goes to my thesis advisor Dr. Richard Enbody. His
intelligent ideas and our long discussions have laid the foundation for my work and
guided me through the years of my Ph.D. program. Special thank also goes to Dr.
Lionel M. Ni for all his support and advice. I am grateful for the valueless suggestions
provided by Dr. Philip McKinley.

I would not have been able to complete the Ph.D. program if it were not for the
ﬁnancial support from the teaching assistantship assigned by Department Chairper-
son, Dr. Anthony Wojcik. In addition, I highly appreciate the guiding information
and trusts from the professors that I have worked for. They are Professors Chester R.
Forsythe, Donald Weinshank, Matt Mutka, and M. V. Ramakrishna. I learned a lot
and had a good time in being a teaching assistant for each of them. I am especially
precious the friendship between Professor Forsythe.

Thanks with my heart also go to my parents and parents-in-law, for the love
and care given by them, and to the family of my elder brother, Tien—Yen, for their
continuously encouragement. My wife Tswei-Ping and my son Howen gave my life
outside the computer science fullness and meaning; they put up with my absences,
and they cheered me on when I needed it. I would have no chance to complete my
study without their love and support.

Over the years, my ofﬁcemates in Engineering Building Room #324 and many
other friends of mine not listed here have all encouraged me and gave me technical
assistance on many occasions. Thanks go to all of them.

Finally, I thank the authors of the Tango simulation system, Stephen Goldschmidt

and Helen Davis at Stanford University, for offering us the code.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

1 Introduction

1.1

1.2 Dissertation Outline

Motivation .................................

2 Cache Coherence Problems

2.1

Introduction ................................

2.2 Directory—Based Coherence Protocols ..................

2.3 Snoopy Cache Coherence Protocols ...................

2.4 Coherence Schemes on Network Architectures .............

2.5 Consistency Ordering ...........................

3 Bus-Based Networks

3.1

Introduction ................................
3.2 A Classiﬁcation of Bus~Based Networks .................
3.3 Applying Snoopy Schemes to Multiple Bus Networks .........
3.4 Optical Network Technology .......................
3.5 Embedding Topologies in Optical Networks ...............

3.5.1
3.5.2
3.5.3
3.5.4
3.5.5
3.5.6
3.5.7

Single-Bus Networks . .

Point-to—Point Connection ....................

The k—ary n—cube Topology ...................

Multistage Interconnection Networks ..............

The OMP Network . . .

Optical Networks in Distributed-Memory Systems .......

Using Optical Networks

4 The Tango Tracing and Simulation System

4.1

Introduction ................................

vi

 

4.2 Trace-Driven vs. Execution-Driven Simulation ............. 37
4.3 Tango Simulation Approach ....................... 40
4.4 Experience in Using Tango ........................ 43
Address-Separated Multiple Bus Network 45
5.1 Introduction ................................ 45
5.2 Multiple Bus Networks .......................... 47
5.3 Performance Evaluation by Simulation ................. 53
5.4 Performance Evaluation by Queueing Models .............. 57
5.4.1 An Analytical Model for Model 6 ................ 58
5.4.2 An Analytical Model for Model 1 ................ 62
5.4.3 Analytical Results ........................ 65
5.5 Summary ................................. 68
The Cache Coherent Architecture 70
6.1 The MSU system ............................. 70
6.2 Coherence Protocols ........................... 74
6.2.1 Under a Sequential Consistency Model ............. 74
6.2.2 Under a Weak Consistency Model ................ 75
6.3 Qualitative Performance Expectation .................. 79
6.3.1 Pros and Cons of Coherence Schemes .............. 80
6.3.2 Impact from Memory References ................ 82
6.3.3 Performance Expectation ..................... 86
6.4 Implementation by Optical Networks .................. 86
Performance Evaluation 89
7.1 Simulated Directory—Based System ................... 89
7.2 Architectural and Timing Assumptions ................. 92
7.3 Benchmark Programs ........................... 95
7.4 Simulation Results ............................ 98
7.4.1 Comparison of Full-Snooping MSU and DirooNB ........ 98
7.4.2 Various Degrees of Snooping 0n MSU .............. 102
7.4.3 Impact of Faster Communication ................ 106
7.4.4 Impact of Shorter Directory Access Latency .......... 109
7.4.5 Impact of Memory Placement .................. 109
7.4.6 Impact of Larger Cache Size ................... 112
7.4.7 Performance on Various System Sizes (Scalability) ....... 115
7.5 Mean-Value Performance Analysis .................... 121

 

7.5.1 Workload Parameters ....................... 122

7.5.2 Mean Times for the MSU System ................ 123

7.5.3 Mean Times for the Directory-Based System .......... 125

7.5.4 Parameter Values Used in the Experiments ........... 126

7.5.5 Analytical Results ........................ 127

7.6 Summary ................................. 131

8 Conclusions 133

A An MVA Algorithm for a Network of Queues with a State—
Dependent Server 136

B Workload Parameters Used in Analysis 139

BIBLIOGRAPHY 141

viii

6.1

6.2
6.3

7.1
7.2

7.3

7.4

B.1

LIST OF TABLES

Actions for weak consistency when an access comes in (from local pro-
cessor or from remote processors) while an access for the same block is
pending locally. (R: read access. W: write access, which may be either
a write miss or a Write hit.) .......................
Pros and cons of the three cache coherence schemes ...........
Impact of reference patterns and system parameters on the perfor-

mance of cache-coherent systems. ....................

Default architectural assumptions ....................
Default memory access latency for primitive operations in processor
cycles. ...................................
Latency of various operations in absence of bus contention and cache

line replacement. .............................

Descriptions of the benchmark programs and the default problem sizes.

Probabilities used in the analytical experiments .............

78
83

85

93

95

96
97

139

1.1

2.1

2.2

2.3

2.4

3.1

3.2

3.3

3.4

3.5

3.6

3.7

3.8

3.9

LIST OF FIGURES

(a) A shared memory multiprocessor with caches. (b) A NUMA archi-

tecture with caches .............................

An example of operations for a cache write hit under a directory based
coherent scheme. .............................
An example of write hit invalidation on a snoopy bus ..........
In a sequentially ordered system, the Y will be printed as l by Proces—
sor 2. In a weakly ordered system, the value of Y may be printed as
0 .......................................
An example to distinguish the sequential consistency and others. In
a sequentially ordered system, the X will be printed as 1. In other
systems, the X may be printed as 0. ..................

Four topologies of bus-based multiprocessors ...............
A 4 X 4 star coupler with input signals under different frequencies. . .
The FOX architecture: tunable transmitters are used to determine
destinations .................................
A star topology with one wavelength (a), is logically equivalent to a
single—bus system (b) ............................
One transmitter and one receiver are used in one node. Logically this
network is a point-to-point directed ring topology ............
Two transmitters and two receivers are used in one node for this exam—
ple. Logically this network is a multiple—bus-connected k-ary n—cube
topology ...................................
Two transmitters and two receivers each attach to different channels
are used in one node. Logically this network is a MIN. ........
(a) Conﬁguration of the Orthogonal Multiprocessor (OMP). (b)
MHMT representation of OMP. .....................

Conﬁguration of a distributed~memory system connected by star-

coupled optical network with tunable transmitters and ﬁxed receivers.

21
25

28

29

30

31

31

33

34

4.1
4.2
4.3
4.4

5.1

5.2

5.3

5.4

5.5

5.6

5.7

5.8

5.9

5.10
5.11

5.12

6.1

6.2

6.3
6.4
6.5

Structure of a trace-driven simulator. ..................
Structure of Tango, an execution-driven multiprocessor simulator. . .
Tango application process state diagram .................

Production of a simulation system using Tango. ............

Three models of Address-Separated multiple bus network. ......
Multibus Models 4, 5 and 6. .......................
Ratio of execution time of Model 1 (single M —B connected, N x B x B)
to Model 5 (fully connected, N x 00 x 00) ................
Ratio of the execution time of Model 6 (fully connected, N x 00 x B)
to Model 5. ................................
Difference in execution time ratio between Model 1 and Model 6. . . .

Execution time ratio of different values of r on Model 4 (fully-
connected, N x rB x B) to Model 5 when N = 16. The r = 1 curve

also represents Model 1, and the r : oo curve also represents Model 6.

A queueing model for Model 6. .....................
State transition diagram for the network servers in the queueing model
of Model 6. ................................
A queueing model for Network Model 1 (Address-Separated multiple
bus network). ...............................
An MVA algorithm for the analytical solution of Model 1. ......
Analytical performance results for Models 1 and 6 when N = 16 and

I/ﬂl = 10 ..................................
Analytical performance results for Models 1 and 6 when N = 128 and

l/ﬂl =10 ..................................

An illustration of the dynamic snooping concept of MSU. Each node
has to snoop on the bus(ses) that it has blocks cached from. .....
The MSU system with Nb“, =2 8 and N37,p = 4. The ﬁrst bus is for
private and read-only memory accesses ( Nread = 1), every node has
a ﬁxed receiver connected to it. Each node also has a ﬁxed receiver
connected to a bus for receiving requests for the local main memory,
and two tunable receivers which can connect to any two of the busses
at a time. .................................
Transition diagram of the Berkeley Ownership Protocol. .......
Node architecture for implementing weak consistency ..........
The expected comparative performance of two scheme under different

system sizes .................................

xi

38
39
41
42

51

54

55

56

57

59

60

63
65

67

67

71

72
75
76

87

 

6.6

7.1

7.2

7.3

7.4

7.5

7.6

7.7

7.8

7.9

7.10

7.1

,_i

Optically-connected MSU system. One tunable transmitter, two ﬁxed

receivers and several tunable receivers are used in each node ......

The simulated network for a directory—based system with Nnod,3 = 4
and Nb“, 2 2. ...............................
Average shared data cache hit rate for the three benchmarks for various
number of busses when Nnode = 32. ...................
Normalized execution time of full-snooping MSU and DirooNBNm un—
der various number of busses for Mat benchmark. ...........
Normalized execution time of full-snooping MSU and DirooNBNb... 11n-
der various number of busses for Gauss (a) and Heat (b) benchmarks.
Effect of various degrees of partial—snooping of MSU when Nnode =
32 on normalized execution time (a) and cache hit rate (b) for Mat
benchmark under default parameters. The dashed line in (a) shows
the execution time of DirooNB with N,,,,, busses for comparison.

Effect of various degrees of partial—snooping of MSU when Nnode = 32
on normalized execution time (a) and cache hit rate (b) for Gauss
benchmark under default parameters. The dashed line in (a) shows
the execution time of DirooNB with N3np busses for comparison.

Effect of various degrees of partial-snooping of MSU when NW]!e =
32 on normalized execution time (a) and cache hit rate (b) for Heat
benchmark under default parameters. The dashed line in (a) shows
the execution time of DirooNB with Nsnp busses for comparison.

Comparison of full-snooping MSU and DirooNBNb... for different num-

ber of busses when bus communication latency is reduced to half. Nor-

88

91

99

100

103

104

105

malized execution time for Mat (a) and Heat (b) benchmarks are shown. 107

Effect of various degrees of partial-snooping of MSU when the commu-
nication latency is reduced to half. Normalized execution time when
Nnode = 32 for Mat (a) and Heat (b) are shown. The dashed lines show
the execution time of DirooNB with NS,1p busses .............
Execution time of DirooNBNbM when the directory look up and update
latency is reduced to half for Mat (a) and Heat (b). Dash lines show
the execution time under default parameters for comparison. .....
Impact of memory placement on memory-mapped Heat benchmark
when Nnode = 32. (a) shows the comparison of full—snooping MSU and
DirooNB on different number of busses. (b) shows the effect of various

degrees of partial—snooping of MSU ....................

108

110

113

 

7.12

7.13

7.14

7.15

7.16

7.17

7.18

7.19

7.20

7.2

>-|

A.1

B.1

Performance of full-snooping MSU and DirooNBNm under various
number of busses when cache size is increased to 16K bytes per node
for Mat (a) and Heat (b) benchmarks. .................
Effect of various degrees of partial-snooping of MSU when cache size is
increased to 16K bytes per node. Normalized execution time (a) and
cache hit rate (b) for Mat when NW1a = 32 are shown. The dashed
line in (a) shows the execution time of DirooNB with N,,,,, busses.
Effect of various degrees of partial—snooping of MSU when cache size is
increased to 16K bytes per node. Normalized execution time (a) and
cache hit rate (b) for Heat when Nnode = 32 are shown. The dashed
line in (a) shows the execution time of DirooNB with N,"p busses.
Impact of larger problem size with larger cache size. Normalized exe~
cution time (a) and cache hit rate (b) for Mat with 200x200 matrices
size and 16K bytes cache when dee = 32 are shown. ........
Impact of larger cache size and better memory mapping. Normalized
execution time (a) and cache hit rate (b) for Heat with optimum mem-
ory mapping and 16K bytes cache when Nnode = 32 are shown.
Performance of MSU and Dir under various number of processors
(Nnode). Nbus is set to be %Nnode; and, for MSU, N3up = iNnode. (a) is
from Mat benchmark, and (b) is from memory~mapped Heat ......
Analytical mean time between two memory requests of full-snooping
MSU and DirooNBwa with Nnode = 32 under various number of busses.
Analytical mean time between two memory requests for various degrees
of partial-snooping on MSU when Nnode : 32. The dashed line is for
DirooNB with NMp busses for comparison. ...............
Analytical mean time between two memory requests for various degrees
of partial-snooping on MSU when Nnode = 128. The dashed line is for
DirooNB with NM,p busses for comparison. ...............
Performance of MSU and Dir under various number of processors
(dee). Nb,” is set to be %Nnode; and, for MSU, Nmp = i node. The

p is the ratio that a program can be parallelized .............
An MVA algorithm for the analytical solution of Model 6. ......

Function f(:r) = 1 — €_0'5I, used in deciding pseum ...........

xiii

116

117

118

119

120

128

131

140

CHAPTER 1

Introduction

1 . 1 Motivation

Parallel systems with a large number of processors offer the possibility of enormous
computing power that can never be achieved by a uniprocessor system. With today’s
advanced microprocessor technology, the power can come at a low cost-performance
ratio. However, simply extending the existing system designs to large-scale systems
may not be feasible or may lead to signiﬁcant per—processor performance degradation.

Parallel systems can be divided into shared— and distributed—memory machines.
The primary distinction is the programming abstraction presented to the programmer.
The shared-memory programming environment presents the abstraction of a single
system image which is used for communicating data values and synchronization. In
a distributed-memory programming abstraction, the programmer must specify inter—
processor communication explicitly, through message passing or memory transfers. It
is widely believed that a shared—memory is a more natural or convenient environment
for the programmer. However, efficiently scaling shared-memory multiprocessors to
large sizes remains a challenging task for computer architects.

Reducing the long shared-memory access latency on large-scale multiprocessors

is one of the most difficult problems [1, 2], and presently it has become more sig-

niﬁcant because of the quickly-improving microprocessor technology. Distributing
main memory to processors to let each node contain a portion of the global memory
is one technique to shorten local memory access time. This kind of multiprocessor
has been classiﬁed as a non-uniform—memory-access (NUMA) machine. Examples of
distributed shared-memory multiprocessors include the BBN Butterfly, the Stanford
DASH system [3] and the MIT Alewife [4] machine. To further decrease memory ac—
cess time, attaching a private cache to each processor is one solution which not only
can shorten the access latency but also can reduce network contention [5] (Figure 1.1).
Cache memory can significantly increase system performance because of the temporal
and spatial locality of memory references. The possible existence of multiple copies of
the same data, however, gives rise to the cache coherence problem. Various coherence
protocols have been addressed in the literature, and the general solution is that when
a processor needs to modify a shared data in its cache, it has to invalidate or update

the other processors which are holding the same data.

main memory
interconnection network

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

interconnection network
-[mem[ ~Imem| -[mem|
Icache I Icache I [cache I Icache I [cache I Icache I
processor processor processor processor processor processor

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(a) (b)

Figure 1.1. (a) A shared memory multiprocessor with caches. (b) A NUMA archi-
tecture with caches.

Snoopy schemes and directory schemes are two common hardware approaches for
maintaining cache coherence. Snoopy schemes efﬁciently achieve cache coherence on
single—bus—connected multiprocessors by letting each tapped processor snoop on the
operations of other processors. However, because all communication takes place on
a bus, the bandwidth limitation of the bus prevents building a multiprocessor with
many processors, eg. 30 processors [6, 7]. With today’s high-performance processors,
a few processors can easily saturate a bus. Directory schemes, on the other hand,
can be applied to any interconnection network and the systems are relatively more
scalable. Most large—scale cache—coherent multiprocessors adopt directory schemes [4,
8, 9]. However, the overhead of directory maintenance and look—up time plus the high
latency of the communication networks are signiﬁcant drawbacks.

Snoopy schemes require the broadcast be globally performed in one network cycle
so that all processors can have current knowledge of the state of cached data and
the broadcasting processor can ensure that all other processors have received the
message. This kind of broadcast will be referred as an atomic broadcast. To apply
a snoopy scheme to a large-scale multiprocessor, a high—bandwidth and low—latency
network which can also support atomic broadcast or atomic multicast is needed. A
natural extension of the single-bus connection is to employ multiple busses to increase
the bandwidth, reliability and scalability at moderate cost. There are two common
multiple-bus topologies: the ﬁrst one is the multiple—hop-connected system, such as
the Wisconsin Multicube [10] and the Aquarius system [9]. Snoopy schemes cannot
directly apply to this topology because the multiple—hop connections cannot efficiently
support atomic broadcast. The other topology has a pool of busses and lets all nodes
connect to all busses [11, 12]. Applying snoopy schemes to this latter topology is
complex and costly, which limits the scalability. Details of the limitation will be
presented later in this dissertation.

We believe that a bus-based architecture is still a viable architecture. While many

existing parallel systems adopt interconnecting networks other than busses, exciting
new bus-based multiprocessors are being produced by such workstation companies
as SUN and SGI. For example, the new SGI Power Challenge system promises 5.4
GFLOPS peak performance by using 18 processors connected by a backplane with
the bandwidth of 1.2 Gigabytes per second.

To solve the dilemma of requiring a large-scale, cache-coherent, shared-memory
multiprocessor but without the long memory access latency spent in maintaining di-
rectories, this dissertation proposes a scalable, snoopy coherent multiprocessor based
on a multiple bus network. This multiprocessor extends the performance of bus-based
systems such as the SGI Power Challenge. In addition, the architecture can be ap—
plied to both electronic and optical networks so it can take advantage of the high

bandwidth of optics as they become more affordable.

1.2 Dissertation Outline

The dissertation starts by introducing the cache coherence problem in Chapter 2.
It introduces and compares the snoopy coherence scheme and the directory-based
coherence scheme. This chapter also surveys various consistency ordering models as
a background for the coherence protocol of the proposed multiprocessor.

Many existing multiprocessors are bus—based systems because of the advantages
of the simplicity and short communication latency of the busses. As suggested in the
title of this dissertation, the proposed system is based on a multiple—bus network.
Chapter 3 presents the main features of bus—based networks and the way to apply
cache coherence protocols on bus-based systems. A novel classiﬁcation of bus—based
systems is presented to put our system in context with other systems. In addition,
because of the progress in lightwave technology, presently it is possible to implement

a multiple-bus network by using the multiple channels on an optical network. This

 

chapter also surveys the current technology in optical networks and presents various
ways to implement multiple busses optically.

Chapter 4 introduces the tracing and simulation system used to collect the per-
formance results of the proposed network and architecture. This system is named
Tango‘an execution-driven simulator which was developed at Stanford University,
and has been used for analyzing the DASH system [8], the SPLASH benchmarks [13],
and for many other hardware and software studies.

Chapter 5 presents a multiple-bus network, named the Address—Separated net-
work, on which the choice of the transmitting bus depends on the memory address
accessed. A variety of models for this network are studied, with other popular multi—
ple bus models shown for comparison. The performance of this network is evaluated
by both simulation and analysis. The results indicate the feasibility of applying it to
a multiprocessor.

Chapter 6 introduces a novel snoopy cache-coherent multiprocessor—the MSU
system, based on the Address-Separated multiple-bus network. Each node of this
system contains a processor, cache memory, a portion of the globally addressable
shared-memory, I/O devices, and a network controller. The main feature of the system
is partial snooping, that is, each node snoops on a subset of all busses. A qualitative
discussion of this system is presented to show the advantages and disadvantages prior
to quantitative evaluations.

The quantitative performance of the proposed multiprocessor versus a directory—
based system is evaluated by using both simulation experiments and queueing anal-
ysis. The methodology and results are shown in Chapter 7. The sensitivity of var-
ious system parameters is also studied to show the independence of the conclusions
about the system and the choice of the parameters. The results show that the pro-
posed system is suitable to be scaled into large systems, and that it outperforms the

directory-based system.

This dissertation is concluded in Chapter 8. This chapter also reiterates the

contributions of the research and indicates future research directions.

CHAPTER 2

Cache Coherence Problems

2. 1 Introduction

Recently, the progress of memory technology has allowed the speed of dynamic ran-
dom access memory (DRAM) to grow at a rate of about 7 % per year; however,
the speed of processors has grown over 50 ‘70 annually since 1985 [14]. The uneven
growth of speed between processors and memory results in the necessary usage of
the small, high-speed private memory which operates like a cache of a processor, and
serves as a buffer to the larger, slower main memory. Cache memories are common in
uniprocessor systems, but they are even more effective in shared-memory multipro-
cessors because they can alleviate not only the access latency, but also the network
contention [5]. For such a multiple-cache system, more than one copy of the same
data may exist in different caches at the same time. This situation leads to the so
called cache coherence problem—the problem of maintaining consistency (coherence)
among the multiple copies of data stored in different caches. Shared data may have
consistency problems if they are shared by more than one process. Furthermore, if
processes are allowed to migrate among different processors, private data may also
have consistency problems when the copies of the same data exist in two or more

caches due to the migration.

In general, shared-memory multiprocessors can be divided into those providing
cache coherence in hardware, and those requiring software support. Cache-coherent
systems usually refer to the systems using hardware solutions. On the other hand,
software supported systems utilize the assistance from compilers and/or programs
to either disallow shared-and-writable data to be cached [15] or to add coherence
operations prior to accessing any shared data [16, 17]. In some systems, such as the
BBN Butterﬂy, there exist only instruction caches but no data caches to avoid cache
coherence problems.

Coherence problems also exist in other applications, and similar solutions to cache
coherence protocols can be applied. A local-memory coherence system can allow a
NUMA-type multiprocessor (such as the BBN Butterﬂy) to replicate or migrate re-
mote data into local memories to take advantage of faster memory accessing time [18].
Some systems, known as cache-only memory architectures, are designed to always mi-
grate memory to local processors by hardware [19, 20, 21]. Recently some researchers
are implementing shared-memory or shared-objects abstractions on top of message-
passing systems to ease the programming on multicomputers [22, 23, 24, 25].

This dissertation focuses on hardware cache-coherent systems. The background

of this issue is presented in following sections.

2.2 Directory-Based Coherence Protocols

A number of different coherence solutions have been proposed, and a survey has been
done by Stenstriim [26]. From a hardware perspective, these schemes can be classiﬁed
into three main categories: directory schemes, snoopy schemes, and the extensions of
above two schemes on hierarchical topologies.

The directory schemes all maintain some type of table or directory in which the

sharing information can be found [27, 28]. When a shared block is going to be modiﬁed

 

 

in one cache memory of a processor, the update information or an invalidation signal
will be sent to other caches which have the same block. These schemes differ in
where the directories are located (centralized or distributed), how they are maintained
(full-map, limited, single-link chained, double—link chained), and what information is
stored inside. Various directory schemes have been suggested and studied for large—
scale systems [28]; among them, two schemes are popular. The ﬁrst uses a bit-vector
in the home directory for each cache line to keep information on the locations of
the cache line. One example of such protocols has been implemented on the DASH
multiprocessor [8]. The other popular directory scheme employs a singly- or doubly-
linked list for each cached line, and keeps a home directory to point to the head
of the list. This scheme is used in the IEEE Scalable Coherence Interface standard
(SCI) [29].

Figure 2.1 shows an example of the operations when a write hit occurs in a bit-
Vector directory scheme. When Processor 2 has a write hit on a shared datum, it
must look in the directory to see whether other processors also have cached the same
cache line. After receiving the request, the memory module that keeps the directory
of the line informs other processors having the same line. When all acknowledgments
from other processors have been received, the memory module modiﬁes the directory
and acknowledges the requesting processor.

The greatest advantage of directory schemes is their insensitivity to the type of in-
terconnection network used. Since the interconnection networks of many large-scale
multiprocessors have multiple stages or hierarchical structures, directory schemes,
rather than snoopy schemes, can be applied easily [29, 30, 31, 32]. The most crit—
ical drawback of a directory scheme is the long access latency. The overhead of
directory look-up and maintenance such as regenerating the sharing chain on SCI is
unavoidable. In addition, for a directory scheme, the advantage of the independence

of the broadcast capability on the network becomes a disadvantage when multiple

 

 

10

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Proqessor Processor Processor
2 3
I [write hit I
\§ ”'/
Interconnection \\\\ 4 1 ”I 2/ ////)
network 5 \\ \ I / // / 3
| \\ \‘III/ .2
0 Ir dir
Memory Memory

 

 

 

 

 

 

Figure 2.1. An example of operations for a cache write hit under a directory based
coherent scheme.

copies of the same message need to be sent. Many researchers have tried to hide
the long latency by relaxing the consistency ordering requirement to overlap memory
accesses [33, 34, 35, 36, 37, 38]. Theoretically, relaxing execution ordering can gain
a lot in performance by overlapping communication and computation [39]; but, to
guarantee correctness, only write requests bypassing outstanding read requests be-
tween any two synchronization points is practical [8], which limits the improvement

in performance [40].

2.3 Snoopy Cache Coherence Protocols

The snoopy schemes rely on a broadcast bus on which all system communication takes
place. Various snoopy protocols have been studied and surveyed [6, 41, 42, 43], in gen—
eral every cache controller will monitor other caches’ requests by listening (snooping)
on the bus, and then responds accordingly. Figure 2.2 shows an example of a write
invalidation. When Processor 2 has a cache write hit on a shared datum, it broad-

casts the invalidation signal on the bus. Processor 1 and 3 receive the invalidation by

11

snooping on the bus all the time.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Processor Processor Processor
1 2 3
cache ‘ [cache ‘
i
I. . . i .
| Bus n _ __L'£“£‘:‘_"92t'_"_“_ _v ______ L. _>
] Memory Memory

 

 

 

 

 

Figure 2.2. An example of write hit invalidation on a snoopy bus.

The main feature of snoopy cache coherence schemes is that the broadcast op-
eration is atomic—either all processors receive the message at the same time or the
message cannot be sent due to the bus being busy. This atomic broadcast operation
can avoid an inconsistency situation when more than one processor tries to invalidate
or update the same block simultaneously. Nevertheless, because all communication
takes place on a single shared-bus, the bus soon becomes the system bottleneck. The
simulation results from other researchers [6, 44] show that the shared-bus can become
the bottle neck when as few as ten processors are sharing data.

Attempts to scale snoopy-coherent systems by replacing the bus with a higher
bandwidth communication network such as the multistageinterconnection network
may not work because snooping scheme relies on low~latency and atomic broadcasts
to maintain coherence. Unless a high-bandwidth and low—latency network which can

supply atomic broadcast can be achieved, scalability is a limitation to the traditional

12

snooping scheme.

2.4 Coherence Schemes on Network Architec-
tures

In order to allow a larger number of processors in snoopy cache-coherent multiproces-
sors, several multiple-bus systems [10, 9] and hierarchical systems [32, 45, 47, 46, 48]
have been proposed. A few levels of memory clusters are used in these architectures
to release the constraint on the number of processors per bus. In addition, the snoopy
scheme and the directory scheme may be combined together to provide consistency
on different levels [9, 32]. These architectures are more scalable, but they usually
suffer from the hardware complexity and the overhead introduced by the multiple
levels of coherence protocols. Some examples of cache-coherent network architectures
are addressed below in more details.

The Wisconsin Multicube [10] uses a grid of busses connecting the processors to
memory. A processor with a large cache resides at each crosspoint on the grid and a
shared-memory module connected to each vertical bus. Each cache controller snoops
on both the vertical and horizontal busses. Invalidation messages are broadcast on
every vertical bus, while global read requests are routed to the closest cache with a
copy of the requested block. Each memory block has a home column corresponding to
the memory module that stores the block. Each cache controller stores in its column
the identiﬁcation of all modiﬁed blocks, which serves as routing information for read
requests. A read request is broadcast on the horizontal bus and routed to the vertical
bus where the modiﬁed block is stored.

A similar topology is used in the Aquarius architecture [9], but this system dif-
fers in the cache coherence mechanism and in the distribution of memory modules.

The system combines features of snooping cache schemes, to provide consistency on

13

individual busses, and features of directory schemes, to provide consistency between
busses. The memory is distributed per node, unlike the Wisconsin Multicube.

In hierarchical systems, clusters of processors are connected by a bus or an in—
terconnection network. Wilson [45] studied a hierarchically connected multiple—bus
system. The design explores a uniform memory architecture with global memory
connects to the topmost bus. It uses hierarchical caches to reduce bus use at various
levels. A cache contains a copy of all blocks cached underneath it. Consistency among
copies stored at the same level is maintained in the same way as for traditional snoopy
cache protocols. However, an invalidation or a request must propagates vertically to
a higher level if it cannot be served locally.

Erickson [47] studied a system consists of a two-level hierarchical bus interconnect
with physically distributed memory and private processor caches. A cluster consists
of multiple processors sharing a single cluster bus, and multiple clusters are joined
by a single global bus. A snooping coherence scheme provides a single, ﬂat virtual
address space for his design.

The VMP-MC multiprocessor [46] is another hierarchically connected multiple bus
system. The cluster bus has an interbus cache module that interfaces to the next bus
in the hierarchy. This second-level bus has the main memory. In this system, larger
clusters are connected Via a ring-based system to provide a large-scale, distributed—

shared-memory system.

2.5 Consistency Ordering

A serial uniprocessor generally executes instructions one at a time in the order speci-
fied by the program. A load operation always returns the last value written to a given
memory location. Likewise, a store operation ﬁxes the value that will be returned

by subsequent loads until the next store to the same location. For multiprocessors,

14

however, neither the memory system model nor the implementation is as straightfor-
ward. The executing order of operations in different processors is difﬁcult to deﬁne.
Furthermore, the operations from the same processor may be performed in a different
order because of different communication latencies involved. Consistency models are
necessary to specify what accessing orderings are legal while several processes are
accessing a shared location.

Several memory consistency models have been proposed, and they deﬁne different
levels of the ordering requirements. From the most strict requirement to the least,
the common ones are: sequential consistency [49], processor consistency [50], weak
consistency [51], and release consistency [39]. A survey of these ordering models can
be found in [52]. The sequential consistency and weak consistency will be briefly
introduced for further reference.

Naturally, most programmers take for granted that the result of a sequence of
write operations performed by a processor will be immediately observable by any
other processor exactly in the same order as it was performed. Lamport [49] observed
the above situation and deﬁned sequential consistency as: A system is sequentially
consistent if the result of any execution is the same as if the operations of all the
processors were executed in some sequential order, and the operations of each indi-
vidual processor appear in this sequence in the order speciﬁed by its program. Thus,
in a sequentially consistent system the end result of the program must be as if each
instruction had ﬁnished executing before the next instruction is started and the in-
structions are executed in program order. In this case, the synchronization among
processors can be implicitly implemented by using shared variables. For example, the
programmer of the program in Figure 2.3 will expect the Processor 2 to print 1 as the
value of variable Y. In this program, variable X is used as a synchronization variable
to deﬁne the order of the two processes.

In a single-bus connected multiprocessor, sequential consistency can be easily

15

 

Shared int X=Y=O;

Processor 1 Processor 2
O 0
O O
O O
Y = 1; while (X != l);
x = 1; print Y;

Figure 2.3. In a sequentially ordered system, the Y will be printed as 1 by Pro-
cessor 2. In a weakly ordered system, the value of Y may be printed as 0.

 

 

 

achieved if all memory accesses are issued orderly from any node and all copies of
the same data can be updated or invalidated simultaneously [36]. For other network
architectures, because different memory accesses may have different latency, it may
be costly to maintain the sequential consistency.

On the other hand, in a system other than a sequentially ordered one, explicit
synchronization operations must be used whenever a speciﬁc order of accesses must
be enforced. For example, in Figure 2.3 the value of Y may be printed as 0 by
Processor 2 in a weakly ordered system, which is not the result expected by our
programmer. A more complicated example can be found in Figure 2.4, where only a
sequentially ordered system can guarantee the value of X be printed as 1, all other
consistent models may have 0 as the printed result.

The implementation of sequential consistency in parallel systems precludes the
incorporation of many common performance enhancing features from uniprocessor
architectures. Weak consistency is essentially a trade-off between software ﬂexibility
and hardware performance. Weakly ordered systems depend on explicit, hardware

recognizable synchronization operations, such as the lock and unlock operations, to

16

 

Shared int x=Y=O;

Processor 1 Processor 2 Processor 3
O O O
t : :
x = 1; while (X I: 1); while (Y != 1);

Y = 1; print x;
O 0 O
O O O
O I 0

Figure 2.4. An example to distinguish the sequential consistency and others. In a
sequentially ordered system, the X will be printed as 1. In other systems, the X
may be printed as 0.

 

 

 

order the effects of events initiated by different processors in a system. The memory
has to be consistent at those synchronization points, but the ordinary operations
between two synchronization points can be executed in any order as long as the data
dependence is still guaranteed.

The weak consistency cannot guarantee the order of the completion time of se—
quential statements except for synchronization operations. Dubois et al. [51, 53]
deﬁned weak ordering as: In a multiprocessor system, storage accesses are weakly or-
dered if. (I) accesses to global synchronizing variables are strongly ordered, and if (2}
no access to a synchronizing variable is issued in a processor before all previous global
data accesses have been performed, and if (3) no access to global data is issued by a
processor before a previous access to a synchronizing variable has been performed. For
weak consistency, the non—synchronizing memory references only need to follow the
program’s data dependencies; in all other cases, they can be executed in any order.

In a weakly ordered system, an order that maximizes system performance can
be used, which may include: issuing or completing instructions out of order, letting

reads take precedence over writes, prefetching data, and, if the cache block contains

 

 

17

only one word, delaying cache coherence invalidation signals after the write to the
cache. As a result, lockup-free caches can be designed to avoid processor blocking on
a cache miss.

Despite all of the above, Baer and Zucker [40, 54] showed that on a shared—bus
architecture, the weak consistency model cannot provide any signiﬁcant performance
beneﬁt over a sequential consistency model. They also claimed that the overhead of

a weakly ordered system would outweigh its beneﬁt on a bus-based system.

CHAPTER 3

Bus-Based Networks

3.1 Introduction

One major factor in deciding the performance of a multiprocessor is the time spent in
interprocessor communication. The communication latency of the network is expected
to be short so the multiprocessor can have high performance, and the bandwidth must
be large to allow a large number of processors. A single bus network has the advantage
of simplicity and short communication latency, but the bandwidth limitation prevents
the system from possessing a large number of processors.

A natural extension of the single-bus network is to employ multiple busses to
increase the bandwidth, reliability and scalability. The performance of multiple-
bus interconnections for multiprocessors has been analyzed in the literature and the
results agree that multiple busses are very appealing [11, 55, 56, 57, 58, 59, 60, 61,
62, 63, 64, 65]. Simulation results have shown that multiple narrow busses can offer
a high bandwidth with more ﬂexibility than a wide bus [63]. It has also been claimed
that a four-bus system can achieve almost the same performance as that of a 16 x 16
crossbar system while reducing the hardware cost signiﬁcantly [11, 62]. The commonly
used model is an N X M X B multiple-bus processor system, in which all the N

processors and M memory modules are connected to all B busses. Many researchers

18

 

 

19

have shown various schemes to reduce the number of connections on a multiple bus
network [66, 15], these schemes are especially signiﬁcant when the number of busses
is relatively large (i.e. more than half of the number of processors). Lang [55] and
Chen [65] proposed a partial bus network to alleviate the loads and the complexity
of the arbitration hardware caused by the large number of busses. In the partial
bus network, each processor connects to all busses and each shared memory module
statically connects to a subset of the busses.

An alternative way to implement a bus-based network is to use optical devices
instead of the traditional electronic busses [15, 67, 68, 69, 70]. The potentially huge
bandwidth in an optical interconnect can dramatically reduce the communication
contention in a multiprocessor.

This chapter starts by presenting a classiﬁcation of bus-based networks, followed
by a discussion of applying snoopy schemes on multiple bus networks. Furthermore,
this chapter introduces the state-of-the-art technology of optical networks that can be
applied in multiprocessors, and the utilization of optical networks to emulate several

popular multiprocessor topologies.

3.2 A Classiﬁcation of Bus-Based Networks

Classifying bus-based networks is necessary to put our proposed system in the proper
context. A variety of multiple bus topologies have been proposed, which include one-
dimensional, multiple—dimensional, and hierarchical topologies [71]. However, this
classiﬁcation scheme is incomplete because the multiple-dimensional topologies can
also be represented and implemented in one dimension. Additionally, many multiple
dimensional systems are hierarchically partitioned into several clusters.

Our classiﬁcation for bus-based multiprocessors is characterized by the number of

hops between any two nodes and the number of transceivers (a pair of transmitters

 

20

and receivers attached to the same bus) per node. A single-hop system is one in
which messages can be transmitted between any two nodes without any intermediate
switches (as in multistage-interconnected systems) and no need for forwarding by
other nodes (as in multiple-hop systems) [72]. If we let Ntmn, and N,“ represent
the number of transmitters and receivers per node, respectively, a single transceiver
system is one that Nmm, = Nrec = 1 and the transmitter and receiver are on the same
bus for all nodes. For simplicity, both Ntmn, and N,“ are assumed to be constants for
all nodes. The number of transceivers decides the number of busses that one node can
access at a time. Listed below are the four topologies of bus-based multiprocessors

(Figure 3.1):

0 Single Hop Single Transceiver (SHST).

Single Hop Multiple Transceiver (SHMT).

Multiple Hop Single Transceiver (MHST).

Multiple Hop Multiple Transceiver (MHMT).

SHST systems use a broadcast bus for connection. Several commercial systems
such as Sequent Symmetry, Encore Multimax, SGI Onyx, and Stardent Titan belong
to this category. Shared-memory environments can be easily applied on them, and
the cache coherence problem can be solved by using the snoopy scheme. Because of
the bus bandwidth limitation, scalability is a weakness of these systems.

SHMT systems are multiple bus systems. Busses need not be physically separated,
they can be obtained by dividing the bandwidth of a single physical connection. Sys-
tems such as Synapse N+1, SGI Challenge and Tandem NonStop TXP have more
than one shared bus. Most of the proposed SHMT systems fully connect each pro-
cessor and memory module to all of the busses. One interesting application of SHMT

is the cyclic shifter interconnection [66, 73, 74], in which each node connects to a

 

21

 

WWW igfﬁpf

(a) single hop single transceiver (b) single hop multiple transceivers

W W W

a

(0) multiple hops single transceiver (d) multiple hops multiple transceivers

 

 

 

 

 

 

 

 

 

 

 

 

Figure 3.1. Four topologies of bus-based multiprocessors.

22

constant-number subset of all busses. The other extreme case of SHMT systems is an
N—node (IO-bus system with Ntmns : N.“ = N—1: it results in a point—to-point fully
connected topology. The performance of the SHMT topology for multiprocessors or
local area networks has been analyzed in the literature [64, 72, 75, 76, 77]. Results
show that using multiple low-speed busses rather than one high-speed bus can yield
signiﬁcant performance improvement and a number of implementation advantages.
One feature of these systems is that multiple receivers tapped to the same bus can re—
ceive the same message at the same time. This feature makes the atomic multicast—a
requirement for a snoopy coherence scheme—possible on the SHMT systems.
MHST systems usually have good scalability. A subgroup of nodes connect to a
bus, and some connecting switches are used to connect different busses. The hierar-
chical multiprocessors such as the machines proposed by Wilson [45], Erickson [47],
the University of Illinois’ Cedar project, the University of Toronto’s Hector [17], and
the VMP-MC system [46] are examples of MHST systems. These systems are used to
remedy the hardware complexity of crossbars or multistage networks on a large—scale
multiprocessor. If cache coherence is required, different coherent schemes can be used
in different levels—snoopy schemes may be adopted to maintain the coherence among
the nodes in the same bus, and directory schemes may be chosen for higher levels.
MHMT topologies have been used for connecting systems into a k-ary n—cube
such as the Lattice-Mesh [78], the Spanning Multiaccess Hypercube [79], the Wiscon-
sin Multicube [10] and the Aquarius system [9]. The advantage of this topology is
scalability, but the disadvantages are the complexity of the hardware and the over-
head from coherence protocols. Like MHST systems, multiple—level coherent schemes

can be adopted if cache coherence is needed.

 

 

23
3.3 Applying Snoopy Schemes to Multiple Bus

Networks

Applying a snoopy scheme on a single-bus system is straightforward—all communi-
cation appears on the shared bus and every processor’s cache monitors the bus and
responds accordingly. The response to a data modiﬁcation will be performed in the
same bus cycle as the request issued on the bus. Sequential consistency [49] can be
easily achieved in this system. Furthermore, since only one request can be issued on
the bus at a time, the arbitrator of the bus forces all memory accesses to be in some
sequential order.

Both the hardware and the coherence protocol become complicated when the
traditional snoopy schemes are applied to fully node-bus-connected multiple bus net-
works. Because each node must now snoop on multiple busses, the nodes will have
difﬁculty responding as fast as in single bus systems. In addition, the arbitration for
a free bus takes more time if there are more busses. A study has shown that in a
multiple-bus system utilizing lookahead arbitration techniques, memory-bus arbitra-
tion becomes the performance limiting factor for large amounts of memory banks [61].
Multiplexors and buffers may be used to temporarily store requests, but the system
consistency ordering must be relaxed [51], which will further complicate the design of
cache controllers and place a burden on programmers. Some acknowledgment mech—
anism is needed because operations such as synchronizations will need to know when
the operation is globally performed. Such operations will generate more communica-
tion trafﬁc and increase communication latency. In addition, since the requests for the
same memory block may exist on different busses at the same time, some arbitration
mechanism is needed. All of the above complexity plus the requirement of multiple
snooping devices (transmitters and receivers to the busses) for each processor cause

snoopy schemes to no longer be simple and efﬁcient. The same reason also limits the

 

24

scalability of the system because the snooping mechanism for each processor becomes
more complicated and costly as the number of nodes and busses increases.

To avoid the complications mentioned above, a multiple bus network that assigns
a bus to a request according to the requesting address will be presented and discussed
in Chapter 5 of this dissertation. A new snooping scheme on this network will be

presented in Chapter 6 to ease the complexity of multiple bus multiprocessor.

3.4 Optical Network Technology

Traditionally, all busses were implemented electronically. However, in a large-scale
multiprocessor, the heavy trafﬁc from the processors can easily exceed the bandwidth
of the electronic bus which interconnects the system. The high bandwidth of opti—
cal communications inevitably can be an alternative [15, 67, 68, 70]. Theoretically,
the bandwidth of optical ﬁbers is about 30 Terahertz in the transmission low-loss
wavelength windows near 1.3 and 1.5 am.

A variety of conﬁgurations of optical networks including bus, ring, tree and star
have been proposed [69, 80, 81, 82]. The bus or ring architectures suffer from the linear
power loss (in dB) with the number of tapped nodes, while the power loss of the star
architecture only grows logarithmically. A star topology constructed by integrated
2 X 2 modular couplers makes hundreds of nodes in one optical network possible.
With current technology, a 64 x 64 passive star is available, and the extension of the
numbers to hundreds such as a 256 x 256 star is foreseeable in the near future [80].
Star couplers also beneﬁt from being passive devices. One example of a passive
star Ethernet LAN is used in the 44-story Southwestern Bell Facility in St. Louis,
Michigan [83]. This system uses 3-level star hierarchy with 3 stars on the distribution
center, 8 port stars serving groups of 5 ﬂoors, and each ﬂoor served by 2 16—port stars.

It contains 540 nodes, 92 miles of ﬁber, and supports over 4000 computer terminals

 

25

and workstations.

The optical network can be partitioned into multiple channels. Two possible
methods can be used: time division and wavelength division. The former suffers from
the required ﬁne-tuned clock synchronization for the entire system, which is especially
difficult for a large-sized system. The following discussion concentrates on the latter
method. As shown in Figure 3.2, every input signal of a star coupler can be broadcast
to all output branches, and if they are under different light wavelengths (frequencies),

no interference will occur.

xxxx

x xx
x xx

. xxxx

 

 

it; t

Figure 3.2. A 4 X 4 star coupler with input signals under different frequencies.

If packets under the same wavelength are transmitted in the network simulta—
neously, communication contention will occur. A number of contention control tech—
niques have been used in different systems, which includes carrier sense multiple access
with collision detection (CSMA/CD), demand assignment multiple access (DAMA),
and token-passing schemes [70, 84].

Wavelength Division Multiple Access (WDMA) technique, also called Wavelength
Division Multiplexing (WDM) by others, can partition the available optical band-

width into multiple narrow channels. A connection is established via a dynamic

 

26

allocation of a frequency (wavelength) for the duration of the connection. Sev-
eral modulation formats have been used, including amplitude-shift keying (ASK),
frequency-shift keying (FSK), phase-shift keying (PSK), and polarization-shift key—
ing (PolSK) [85]. In order to let one optical device handle data in different wave-
lengths, a number of wavelength tunable transmitting lasers and tunable receivers
were reported [67, 85, 86, 87]. There are two important parameters regarding the
proper choices of these devices: (1) the number of available channels, and (2) the
tuning speed. In different techniques, the number of tuning channels is varied from
10 (local-oscillator ﬁlters), and 100 (acousto-optic ﬁlters) up to 1000 (2 stage Fabry-
Perot ﬁlters). The tuning speed is from the scale of nanoseconds (semiconductor
lasers and local-oscillator ﬁlters), to microseconds (acousto—optic ﬁlters), and up to
milliseconds (Fabry—Perot ﬁlters). To build a fast interconnection network for a mul-
tiprocessor, devices which can be tuned to tens of wavelengths with at most several
nanoseconds random-tuning times are needed. Essentially, the fast-tuning devices
described by Kobrinski et al. [88] can ﬁt the requirement. Experimentally, switching
times of less than 15 nanoseconds have been demonstrated in three—section distributed
Bragg reﬂector (DBR) laser transmitter with maximum continuous tuning range of
2.2 nm (FSK modulation), which can support more than 50 wavelength—addressable
channels. Tunable double-section distributed feedback (DFB) optical ampliﬁers (ﬁl-
ters) were shown to be capable of selecting Gigabits per second data packets with
nanosecond switching times with up to 20 wavelength channels. The capacity of opti-
cal communication systems has continued to double every 18 to 24 months, both for
research experiments and commercial systems [89]. Therefore, considerable improve-
ments are expected in the near future. It is predicted that lasers with the ranges of
up to 256 wavelengths, and switching times less than one nanosecond will soon be
possible [90].

A few prototypes of optically-connected multiprocessors exist already [85, 68],

 

E "

27

and the FOX (Fast Optical Cross Connect) system [91] is illustrated below as an
example. The FOX system is an experimental architecture for studying the poten—
tial of optically-connected parallel processing computers. It uses fast-tunable laser
transmitters and ﬁxed wavelength receivers to communicate between processors and
memory modules. Two. separate passive star couplers were used to broadcast the
transmitting packets to all branches and only the packets under correct wavelength
can be ﬁltered through the receivers. Figure 3.3 shows the architecture of the FOX
system. If enough channels (lightwave frequencies) are available, each receiver can
have its own channel, so the communication is self-routing——no destination address is
included in data packets. Processors can keep requesting memory access, and receive
the responses from memory modules later. Data collisions may occur if more than one
packet travels toward the same memory module at a time. Collisions were resolved
by letting the sending processor wait for a successful response from the corresponding
memory module. If no response is received by the processor after a period of time, a
backoff algorithm will be used and then the packet will be retransmitted. It is inter-
esting to note that in this architecture, if the response time of each memory access
is the same and if each processor has a unique receiving wavelength, no collision will
ever happen in the second star coupler.

The FOX architecture can be modiﬁed by putting the entire burden of tuning on
the processors’ side. Each processor connects to a tunable laser transmitter and a
tunable receiver, and the memory modules only connect to ﬁxed ones. In this case,
each memory module uses the same channel for both sending and receiving signals.
After a processor sends a request to a memory module through a certain channel,
it tunes its receiver to listen to the same channel for the response. A similar idea
was presented in a proposal of HYPASS (optoelectronic Hybrid PAcket Switching
System) [90], a packet switch which is capable of several hundred Gigabits per second

throughput.

 

 

28

processors x memories
" ’ a
C ’4
*1

:2
"
p 4 ,a

 

 

 

*2
x 3
I» ’
p 3 4 ,4 m 3
: 3.3 : ¢—) tunable transmitter
X C
pn I; - V” (9+— ﬁxed wavelength receiver
xn

Figure 3.3. The FOX architecture: tunable transmitters are used to determine desti-
nations.

A design of an optically-connected multiprocessor has been proposed by Wailes
and Meyer [15]. In their Multiple Channel Architecture (MCA), hundreds or even
thousands of processor nodes, memory nodes, and I/ O nodes are connected by a star
coupler. Each processor node uses three pairs of tunable transmitters and tunable
receivers: one pair is for transmitting general data, another pair is for cache line
write back, the ﬁnal pair is reserved for executing barrier synchronization. Each
memory node or I/ O node has a tunable transmitter and a tunable receiver pair.
Thousands of channels are implemented via wavelength division multiplexing. The
channel arbitration is through the CSMA / CD technique. Transceivers can ﬂexibly
tune to any channel, and the channel assignment is left to either the compilers, the

programmers, or the operating systems.

29
3.5 Embedding Topologies in Optical Networks

A variety of network topologies can be implemented by star-coupled optical networks,
including all classiﬁcations of multibus networks discussed in Section 3.2. This section

shows the implementation of many popular topologies for multiprocessors.

3.5.1 Single-Bus Networks

If only one wavelength is used, an optical star topology is logically equivalent to a

single bus system (Figure 3.4). This topology is classiﬁed into the SHST category.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(a) (b)

Figure 3.4. A star topology with one wavelength (a), is logically equivalent to a
single-bus system (b).

3.5.2 Point-to-Point Connection

Another use of the optical network is to treat the network as a multichannel system,
with each available wavelength to work as a single channel. Figure 3.5 shows an
example of implementing a SHMT network which works as a point-to—point directed

ring. Although the topology is represented as a multibus topology, each bus in the

 

 

30

ﬁgure can be thought as a channel in a WDMA optical network connected by one
star-coupler. Each node uses only one transmitter and one receiver. In order to
communicate among different channels, the transmitter and the receiver for the same

node attach to different channels to avoid conﬂicts.

.n4

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 3.5. One transmitter and one receiver are used in one node. Logically this
network is a point-to-point directed ring topology.

3.5.3 The k-ary n-cube Topology

Figure 3.6 shows a network in which each node can transmit and receive data from
two channels by using two transmitters and two receivers. Logically this topology
is a multiple-bus mesh topology [92], and the same idea can be used to construct
3D mesh or other k-ary n-cube topologies. If tunable devices are used the number
of transceivers per node can be reduced. For this topology, the number of hops in

transmission is the number of the dimensions in the logical representation.

3.5.4 Multistage Interconnection Networks

Optical networks can also emulate non bus-based systems. Figure 3.7 shows a perfect

shuffle connected network, in which each node has two transmitters and two receivers,

 

 

 

 

 

c1

.2 . u . 1+— .4
CPU—i I I'-

 

 

 

n1 “3 C3 r cznzr

c4 , t—A—

 

 

Figure 3.6. Two transmitters and two receivers are used in one node for this example.
Logically this network is a multiple-bus-connected k-ary n-cube topology.

 

 

 

Channel Channel
9 1 ..... .
>'nS] >[nli
10 2 ' """

11
n2

 

[.4

n3

 

i—l

 

7

I14 16 8 >:D4i

 

 

Figure 3.7. Two transmitters and two receivers each attach to different channels are
used in one node. Logically this network is a MIN.

32

all attached to different channels. In this ﬁgure, an 8-node network uses 16 channels to
achieve the communication, which can be implemented by one star-coupler. However,
it is possible to reuse channels by using a star-coupler for each stage, thus the needed
channel number can be reduced to be twice the number of nodes in one stage. Details

of this topology have been studied by Acampora [93].

3.5.5 The OMP Network

Now let us see the ability of optical networks to emulate more complicated conﬁgu-
rations. The OMP (orthogonal multiprocessor) architecture proposed by Hwang [94]
is used as an example. Figure 3.8 (a) shows the original design of OMP with 3 pro-
cessors, where processor i connects to the memory modules in row i and column i.
The memory module (i, i) is the private memory of processor i, and memory module
(i, j) where i 75 j is shared by processor i and processor j. Figure 3.8 (b) represents
the connection of OMP by using an MHMT topology with Ntmm : Nrcc = 2. If an
optical network is chosen to connect the system, one node can send and receive data
from two channels by using one tunable transceiver or two ﬁxed transceivers. For an
n-processor OMP, either two n x n2 star couplers or one (n + n2) x (n + n2) star
coupler can interconnect the system. For this kind of partially-connected network,

there may be several alternative paths between any two nodes.

3.5.6 Optical Networks in Distributed-Memory Systems

If a local memory accompanies each processor, compare to the FOX system described
above, only half of the network devices are needed to construct a multiprocessor sys-
tem. This architecture can be represented by a distributed-memory message-passing
multicomputer or a shared-memory NUMA multiprocessor. A node may connect to

a tunable transmitter and a ﬁxed wavelength receiver, as shown in Figure 3.9. The

33

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

- 'rn - -
I I I
l
”'l Moa Fl M01 0"] M02
r1
. J -
l l
’1 M10 "— M11 “'- M12
r2
I l I
C C
Oi-M2O C14»— “21 20— M22
(a)

 

(b)

Figure 3.8. (a) Conﬁguration of the Orthogonal Multiprocessor (OMP). (b) MHMT
representation of OMP.

34

system can be self—routing if the number of tuning channels is sufﬁcient to let each
receiver have a unique channel. This system cannot use CSMA / CD to control com-
munication collisions because the nodes cannot detect the transmission by themselves.
An alternative is to use ﬁxed-wavelength transmitters and tunable receivers, as op-
posed to what is shown in Figure 3.9. This conﬁguration can offer atomic multicasting
by letting multiple nodes receive signals from the same channel simultaneously. How-
ever, because a receiver will listen to channel A). only after agreeing to accept data
from node nk, demand polling or some complicated assignment techniques must be

used.

 

 

 

 

1 J. ‘ 25). .1
l 2 3 I l

T

 

 

 

 

 

 

 

 

 

Figure 3.9. Conﬁguration of a distributed-memory system connected by star-coupled
optical network with tunable transmitters and ﬁxed receivers.

3.5.7 Using Optical Networks

Although optical networks can be used to implement a variety of network topologies,
some of the topologies are not good choices. To consider a best topology, the major
factor is the optic and electronic conversion overhead. Before the all-optics routing

chips become available, when an intermediate node receives lightwave signals, it must

 

35

convert the optical signals into electronic signals, so the destination address can be
read to decide the outgoing channel for the signals. And then, the signals will be
converted into lightwave again under the proper wavelength channel for sending out
to the next node. This process is relatively slow compared to the routing decision
in traditional electronic networks. Because of it, the number of hops in an optical
network cannot be large if a high-performance network is expected.

Therefore, using optical network to emulate a directed ring as shown in Figure 3.5
suffers from the possible large number of hops, unless the messages are really long
and the pipelining effect makes the distance invisible. The same limitation should be
considered in any MHMT topology. Because of the above reasons, the SHMT topology
is the most suitable topology for a large-scale multiprocessor connected by an optical
interconnection network. However, the potential large number of transceivers needs

to be carefully limited. An example will be shown later in this dissertation.

CHAPTER 4

The Tango Tracing and

Simulation System

4. 1 Introduction

The network and the novel multiprocessor architecture proposed in this dissertation
both were evaluated by using a simulation system named Tango [95, 96]. This chapter
describes the Tango simulation environment and our experience in using it. The
simulated architectures and the performance results will be addressed in successive
chapters.

Tango is a software simulation and tracing system used to gather execution statis-
tics for parallel programs and multiprocessor systems, or to generate traces of syn-
chronization events and data references. Developed at Stanford University, Tango is
being used in a wide range of investigations from algorithm studies to detailed hard-
ware evaluations, such as the evaluation of the SPLASH benchmarks [13] and the
DASH multiprocessor [8]. The system simulates a target multiprocessor by interleav-
ing the execution of processes on a uniprocessor host. The application is compiled
and augmented to perform monitoring and simulation tasks. During the simulation,

the augmented application processes are time-multiplexed to run on a uniprocessor

36

37

in a way that ensures memory and synchronization operations are executed in the
correct order.

In using Tango to simulate a target architecture, the user has to supply the speciﬁc
memory and network timing model of the target multiprocessor. The user has the op-
tion to choose that these routines are invoked at all memory references, at any shared
memory references, or at synchronization points only. In the other words, the tradeoff
between the accuracy and the cost of the simulation experiments is determined by
the information interested by the user.

The applications can be written in either C or FORTRAN using the m4 macros
developed at the Argonne National Laboratory. The macros present a variety of
abstractions that the programmer can use, such as locks, barriers, distributed loops,
and messages. The macros are implemented using UNIX processes and semaphores, so
the simulator is designed to run under an operating system that provides support for
processes, shared memory, and semaphores. The simulation experiments we executed

were on a DECstation 3100 under Ultrix.

4.2 Trace-Driven vs. Execution-Driven Simula-
tion

One common way to structure a simulation system is to use a trace-driven simulation,
which decomposes the simulation system into two components: an address generator
and a timing simulator for the memory and the network (Figure 4.1). The generator
produces the trace data of a program’s execution; the timing simulator emulates
the operations of the target machine. Trace-driven simulators have been used in
several multiprocessor studies [6, 41, 97]. However, the most criticizable issue in
trace-driven multiprocessor simulation is the accuracy of processes interaction. Most

parallel programs are nondeterministic: the execution path through the program, the

38

 

Application
Program

Address Generator

 

 

 

 

 

 

 

 

Memory/Network
Timing
Simulator

 

 

 

 

ll

Execuﬂon
Statistics

Figure 4.1. Structure of a trace-driven simulator.

39

order of events to maintain data consistency, and the latency of operations should
all depend on the relative execution timing of processes. Although some timing
adjustments may be possible, the interactions represented by multiprocessor trace-
driven simulation generally do not correspond to correct execution of the program in
the target architecture, especially in using the traces generated on one machine to
simulate another machine [98, 99],

One major feature of Tango is that it allows execution-driven simulation. In an
execution-driven simulator, sequences of application instructions under one processor
are executed until a globally visible operation is encountered. At that time, simula-
tion code updates simulated time by either a simple increment or by performing a
more detailed emulation of the target architecture. As the simulation progresses, the
correct ordering of program events can be ensured since simulation times have been
determined. So, unlike trace—driven simulation, execution-driven simulation ensures
accurate ordering of program events and permits accurate simulation of contention
and process interactions. The structure of Tango which addresses its execution-driven

feature is shown in Figure 4.2.

Application
Program
/\ Memory/Network

Address Generator Timing

V Simulator

 

 

 

 

 

 

 

 

 

 

 

 

V 1!
Trace Execution

Figure 4.2. Structure of Tango, an execution-driven multiprocessor simulator.

40

4.3 Tango Simulation Approach

A simulation process is created for each process in the application to associate with a
unique processor in the target system. Functional behavior is simulated by compiling
the code and executing it. The costly instruction interpretation and emulation is
avoided by directly executing user application code whenever possible. Novel target
machine instructions that do not exist on the available machine can be implemented
in libraries.

Each simulation process keeps a software clock, and the clocks are loosely synchro-
nized while maintaining the order of operations of interest. A clock is updated after
executing a set of instructions, at simulated memory references and synchronizations.
Each assembly language instruction of the host machine is assigned a default delay
corresponding to the latency of the target machine, and the delays are added in the
simulated application during the compilation process. When a memory reference of
interest to the user is encountered, either a subroutine or a process which acts as the
timing simulator is responsible for calculating the reference latency at run-time.

The executing order of events of interest is preserved by distributed schedulers:
each simulation process is augmented to reschedule itself before each event of interest.
The process farthest behind in simulation time is always scheduled to execute next.
Figure 4.3 shows the process state diagram that illustrates the scheduling method
described above.

In Tango, the user application is compiled into a simulation system that is cus-
tomized to meet the accuracy and efficiency requirements. Figure 4.4 shows the
production of a simulation system from a C program (ex.C is the application pro—
gram name used in the ﬁgure) using Tango. The ﬁrst step is to expand the macros
which provide machine and synchronization abstractions. The output of the macro

expansion is a standard C language source ﬁle, and a standard compiler is used to

41

 

 

 

   
  

ready to
be issued
process farthest ll reschedule
back in time operation
functional
simulation
memory
, , reference
synchronization
operation
watiting to
be issued
by timing
simulator
event farthest
back in time
i
synchronization memory/network
simulation timing simulation
i

 

 

 

 

Figure 4.3. Tango application process state diagram.

42

 

 

 

 

 

 

 

 

ex.C
C C
expand macros
9X.C

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

     
 

ex.S
Tango processor
augment parameters
ex.S
6X.O
timing library
simulator routines
memory
parameters
trace
execution program summary
statistics output for each
processor

 

 

Figure 4.4. Production of a simulation system using Tango.

43

produce an assembly language ﬁle from this source. An added augmentation step
inserts code to increment the virtual time for each process, call the timing simulator
at data references, and write trace records. The timing values for each operation
provided by the host machine are stored in a processor parameter ﬁle. A standard as-
sembler then is used to convert the augmented code into an object module. The ﬁnal
step is to link the object modules of the modiﬁed program with the timing simulator
and library routines. When the simulation is being executed, a timing parameter ﬁle

containing synchronization delay and memory system parameters is read.

4.4 Experience in Using Tango

To customize Tango system for our experiments, we implemented routines to handle
the detailed operations for the architectures, such as the contention on the intercon-
nection networks, the private cache for each processor, the cache coherence protocols,
the memory placement policy, and the location of shared pages.

One major advantage of using Tango is its ﬂexibility in choosing efﬁciency versus
accuracy. For example, in simulating a uniform-memory-access machine, the tim-
ing simulator can be used for each memory access. But, when Tango is used in
simulating a NUMA machine, the details of private memory accesses is not that im-
portant compared to the shared memory accesses, thus a simple calculation can be
implemented for the private memory accesses and the simulation can focus on shared
accesses. Detailed hardware simulation can be achieved by the routines supported by
the user. The other major advantage of using Tango is its execution driven character-
istic, which makes accurate simulation of coherence protocols possible. Applications
can be easily implemented by using the m4 macros, which enables our study of the
reference patterns of different programs. Furthermore, real program execution results

are generated, which can be used to verify the correctness of the program.

44

The disadvantages in using Tango includes the difﬁculty in debugging and some
hidden bugs inside the code. Because of the lack of detailed documents, users must
read the source code to understand the system. The simulation is not very efﬁcient
because the context switching among processes produces a lot overhead. For example,
running a 100 X 100 matrix multiplication simulation for a 32-processor target system
took about two hours on a DECstation 3100, and a 200 x 200 matrix multiplication
simulation took more than ﬁfteen hours on the same host machine. The efﬁciency of
Tango should be able to be signiﬁcantly improved by using threads (light weight pro-
cesses) instead of UNIX processes. The size of the simulated target machine is limited
by several system parameters of the host machine, such as the maximum number of
processes a user can generated at a time, the maximum number of semaphores a user

can use, and the size of the shared memory a program can deﬁne.

CHAPTER 5

Address-Separated Multiple Bus

Network

In this chapter we demonstrate the Viability of an address-separated multiple bus
network [100]. We will show three different representations of this network that with
identical bandwidths. The network will be evaluated using both the Tango simulator
and queueing analysis. The queueing models introduced in this chapter will be the

basis for analyzing the multiprocessor systems proposed later in this dissertation.

5.1 Introduction

The performance of a multiprocessor system signiﬁcantly depends on the efﬁciency
of the interconnection network. Connecting all processors by a single bus was the
initial architecture for commercial multiprocessors, and this architecture is still being
used on many multiprocessors. One important reason for using a single bus on a
multiprocessor is because the snoopy coherence can be easily applied. Due to the
bandwidth limitation of a single bus, employing multiple busses on the network be-
came interesting.

One common method to utilize a multiple bus network is to build a multiple—

45

46

hop-connected system (MHST or MHMT) in which the communication between two
nodes (either processors or memory modules) may need to travel through interme-
diate nodes. This topology trades the advantages of a single hop to allow a large
number of processors in one system. The other common topology is a single-hop
N X M x B multiple-bus multiprocessor system (SHST or SHMT), in which usually
all the N processors and M memory modules are connected to all B busses. For a
message packet, the multiple busses work like a pool of resources—the packet will
be transmitted when any one of the busses is available. This multiple bus network
suffers from the fact that all processors and memory modules have to connect and
listen to all busses all the time because input data packets can come through any bus.
This problem becomes more signiﬁcant as the number of busses increases.

This chapter presents a new multiple bus interconnection network, the Address-
Separated multiple bus network, on which one memory block can only be transmitted
through a certain bus determined by the address of the block. One of the advantages
of this model is that fewer connections between the memory modules and the busses
are needed. Furthermore, the network can solve the interlock problem easily when two
processors are accessing one memory block at the same time. Because one memory
block will be transmitted through a ﬁxed bus, busses can be used as semaphores to
guarantee mutual exclusion in accessing memory blocks. This feature is especially
important in maintaining cache coherence to avoid the problems of applying snoopy
scheme on a traditional multiple bus network, as addressed in Section 3.3. It will be
shown that the Address-Separated multiple bus network can reduce the complexity
of a system while maintaining good network bandwidth. The performance of the
network was evaluated by both simulation and queueing analysis. The results of the
study will indicate the feasibility of a new bus-based, cache-coherent multiprocessor
architecture described in the following chapter, where each processor can statically

or dynamically connect to a subset of the busses.

47

5.2 Multiple Bus Networks

The basic assumptions for the bus-based networks discussed in this chapter are:

1. A bus will be held only if the destination memory module is also available at

that moment.

2. An arbitration scheme is used that allows one processor to succeed if multiple

processors are competing for a bus.

3. A bus is held during the entire bus transaction, including completion of the

memory cycles if memory was accessed.

The circuit switching technique is assumed because of the following three rea-
sons: 1) it can reduce the communication latency due to the elimination of the initial
delay in arbitration, decoding, and setting of appropriate switches after a path is
established, 2) the ease of performing atomic operations, which simpliﬁes the imple-
mentation of synchronization and mutual exclusion primitives, and 3) simplicity for
simulation and analysis.

Network bandwidth (the number of successful memory accesses per time unit)
will be used in this section in discussing various multibus networks. The goal of
the discussion in this section is to ﬁnd a network model which reduces memory-bus
connections while maintaining good bandwidth. We denote multiple-bus networks by
N X M X B to indicate the number of processors, the number of memory modules,
and the number of busses. For brevity, without losing any generality, it is assumed

that M = kB where k is a positive integer. The case of k : 1 is discussed ﬁrst.

Claim 5.1: For an N x B X B network, the following two models have
exactly the same bandwidth (Figure 5.1).

0 Model 1 (Single M —B connection): Memory Module i only connects to

(UNI-l

le-i

WNl-i

 

 

48

Processors

Busses

Memory
Modules

(Model 1)

Processors

Busses

Memory
Modules

Processors

Busses

Memory
Modules

(Model 3)

Figure 5.1. Three models of Address-Separated multiple bus network.

49

Bus i.
0 Model 2 (Full M —B connection): Each memory module connects to all

busses.

Reasoning: Two types of contention (conﬂicts) may exist in a multiple
bus system: contention on busses and contention on memory modules. In
Model 1, the processor which holds Bus i is the only one that can use
Memory Module i, so contention will only happen in accessing the busses
and no memory contention will happen. Therefore, the N processors
compete for B busses to reach the memory modules. In Model 2, the
processor which can successfully access Memory Module i should have
had no trouble in ﬁnding a bus because when a memory module is free,
there must be a bus available due to M = B. On the other hand, when
a processor is holding a bus, the destination memory module should also
be available, otherwise the ﬁrst assumption in this section is violated. So,
again, N processors compete for B memory modules, which is equivalent

to the contention occurring in Model 1.

Claim 5.1 shows that in N x B X B systems, one memory module need only
connect to one bus. All other connections between memory modules and busses are
redundant. Also, ﬁxing the transmitting bus for each memory block in this model

does not cost anything. Now we generalize the value of k to any positive integer.

Claim 5.2: Single M—B connected N x kB x B networks (Model 3) all
have the same bandwidth for any k if 1: memory modules connect to one

bus.

Reasoning: Model 1 is a special case of Model 3 with k = 1. We can show

that Model 3 is equivalent to Model 1 by combining the k memory modules

50

connected to Bus i into Memory Cluster i. The processor that is holding
Bus i will be the one which can access Memory Cluster i. Additionally,
only one module in a cluster can be accessed at a time. Therefore, these k
memory modules bonded together simply form a bigger module, nothing

else is different.

In fact, the number of memory modules connected to each bus need not be the
same for all busses in Model 3; the bandwidth of the network will always be identical.
This model is similar to the N X M X B network with a single bus-memory connection
discussed by Chen et al. [65] except that the two assumptions of arbitration schemes
are not exactly the same. We can conclude from "the above two claims that single
M—B connected N X B X B network (Model 1), fully M—B connected N X B X B
network (Model 2) and single M-B connected N X kB X B networks (Model 3) all
have exactly the same network bandwidth. The above two claims let us evaluate a
complex network by using a much simpler model. From the deﬁnition of the Address-
Separated network, it should be represented by Model 1 or 3. On the basis of the
preceding discussion our network can be represented by any one of above three models.
Our evaluation of the Address-Separated network will based on Model 1 because of
its simplicity.

To simplify the description in the following sections, the fully M —B connected
N X M X B systems is named as Model 4 (Figure 5.2), where M = rB, and r is a
real number greater than or equal to 1.0 such that rB is an integer. For example, if
B = 8 and r = 1.5, then M = rB = 12. We always assume the number of memory
modules (M) is a function of the number of busses (B) because when the number of
busses increases, the number of memory modules should also be increased—there is
no reason to have more busses than memory modules. Model 4 represents general
multibus systems, and it has been studied thoroughly in the literature [11, 56, 60, 65].

Because in this model memory accesses can freely travel on any bus, and processors

51

 

 

Processors
1
2
3 Busses
B
Memory
Modules
(Model 4)
Processors
1
2
3 Busses
(1)
Memory
Modules
(Model 5)
Processors
1
2
3 Busses
B
Memory
Modules

 

(Model 6)

Figure 5.2. Multibus Models 4, 5 and 6.

52

have to compete on both busses and memory modules, different assumptions are used
by other researchers such as two—level arbitration, split transactions, different busses
used for memory requesting and response. However, discussion of those network
protocols is not addressed in this dissertation.

The number of connections needed in Models 1 and 3 is (BN + M), which equals
B(N+ 1) in Model 1 since M = B. In Model 4, B(N+ M) 2 EN + rB2 connections
are needed, which is B(rB — 1) more connections than in Model 1. As mentioned
in Section 5.1, in Models 1 and 3, the number of connections between processors
and busses can be further reduced if one processor need not access certain memory
modules due to either the locality of application programs or if multiple users are
sharing the system and no interference is expected.

The number of connections will also affect the hardware complexity in a system.
The memory modules in a single M-B connected system are simpler than those in
fully-connected multibus systems because each module only needs to connect to one
bus. Also, in Models 1 and 3, contention will happen only on the busses, so no
arbitration mechanism is needed for memory modules. Furthermore, if private caches
will be used in each processor, the set—associative cache management policy can be
easily implemented because addresses from different memory modules will come from
different busses.

Given the variety of multibus models, it will be interesting to investigate the
optimum multibus networks and their performance. Before ending this section, the
theoretical multibus networks which can achieve the maximum network bandwidth

are presented. They will be named as Model 5 and 6 (Figure 5.2).

Claim 5.3: For an N X M X B multibus system:
0 Model 5: if N is ﬁxed, the maximum network bandwidth will be derived
from the fully-connected system with M = B = total number of words

(smallest memory accessing unit) in the memory.

53

0 Model 6: if both N and B are ﬁxed, the maximum network bandwidth
will be derived from the fully-connected system where M equals the total

number of words in the memory.

Reasoning: The key to achieving the maximum bandwidth is to reduce
the possibility of contention to the minimum. If the number of busses
equals the total number of words in the memory, any memory request can
always ﬁnd an available bus. Similarly, if the number of memory modules
equals the number of words in the system, the memory module contention
can be reduced to the minimum. The effect of accessing the same shared
word from different places at the same time is ignored here because the

operation will be different in different systems.

Model 5 and 6 are just theoretical networks. No such systems will ever exist be-
cause of the cost of the hardware. However, these systems are useful for understanding
the capabilities and limitations of the systems described earlier. In addition, Model 6
will be used in later chapters as the network of a directory-based cache-coherent

system for comparison.

5.3 Performance Evaluation by Simulation

The results in this section are generated by using the Tango simulator described in
previous chapter to run a parallel minimum-cut program using simulated annealing.
The minimum-cut problem is used extensively in Electronic Computer Aided Design.
The test problem is a 16-node graph with an average edge degree of 4.5 per node.
All memory accesses are assumed to be 5 CPU cycles. No speciﬁc memory mapping
method is used; that is, memory is uniformly distributed across the different memory

modules.

54

At ﬁrst we compare the performance of our architecture with the ideal models.
Figure 5.3 presents the comparison between Model 1 (single M —B connected, N X
B X B) and Model 5 (fully connected, N X 00 X 00). The latter model represents
an ideal architecture. The curves show the ratio of execution time of the two models
when N = 4, 8 and 16. For Model 1, when there is only one bus, the execution time
is about (3/4) X N times longer than in a non-contention system (Model 5). As the
number of busses increases, the ratio decreases. The ratio decreases quickly as the
number of busses increases until the bus number is N / 2; after that point, the ratio

decreases slowly. The curves will converge to 1 as B approaches inﬁnity.

 

10 N = 16 *- _
N = 8 *—
N = 4 ‘9"
8 - _
Execution
Time
Ratio

 

 

 

 

 

Number of B and M in Model 1.

Figure 5.3. Ratio of execution time of Model 1 (single M—B connected, N X B X B)
to Model 5 (fully connected, N X 00 X 00).

Figure 5.4 shows the comparison of the two ideal models. The curves are the ratio

of execution time for Model 6 (fully connected, N X 00 X B) and Model 5. This ﬁgure

55

presents the optimum performance gain from increasing the number of busses. The
ratio halves as the number of busses doubles, and when B : N, Model 5 achieves the
same performance of Model 6. The equivalence is because no bus contention occurs
when the number of busses equals the number of processors. This observation also

veriﬁes the correctness of our simulation.

 

 

 

 

a] l l l
10 - N = 16 +— s
\ N : 8 ". "’
\ N Z 4 '0—
8 r\ ..
Execution \
Time 61 "
Ratio {- k
\ \
4 ’ \ \ 1
a K X
2 \ \ \k\ E
k: X \ \ +— —— ~
1 \\.é \— é _— ~ ~ _
2 4 8 16

Number of Busses in Model 6

Figure 5.4. Ratio of the execution time of Model 6 (fully connected, N X 00 X B) to
Model 5.

The comparison between Model 1 and Model 6 is shown in Figure 5.5. This ﬁgure
indicates the places where the two models are closer, which can be used to determine
the choice of number of busses in our architecture. The curves show the difference in
execution ratio of the above two models compared to Model 5. When the number of
busses is one, there is no difference between Models 1 and 6 because all processors

have to compete for the single bus no matter how many memory modules or what

56

kind of connection is adopted. When the number of busses increases, the difference
increases at ﬁrst and then decreases, and eventually the two curves will merge together
when the number of busses reaches inﬁnity. The shape of the theoretical difference
can be better observed on the curve of N = 8, on which the ratio difference starts to

decrease when B = 4; fast at ﬁrst, and then slowly.

 

 

 

 

I l
2.5 - N = 16 *— -
N = 8 "—
N = 4 '9-
2 -- ..
Difference
in . .. r
Execution [
Time '
Ratio /
1 - ..
0.5 - ° -,
0 l 1 l
2 4 8 16

Number of Busses (and M in Model 1)

Figure 5.5. Difference in execution time ratio between Model 1 and Model 6.

Figure 5.6 shows the execution ratios for N = 16 under different value of r for
Model 4, a fully connected multibus network with M = r8. When r = 1 the
network is the same as Model 2, which has equivalent bandwidth as Model 1 according
to Claim 5.1. The simulation results verify this claim again. The r = 00 curve
represents Model 6. The curves show that for a ﬁnite value of r, the execution ratio
will converge when B approaches N. It indicates that by increasing the number

busses the number of transceivers in a multiple bus multiprocessor can be reduced.

57

The difference between the Model 1 curve and the r = 2 curve is due to more memory
modules and the free choice of the busses. Considering the saving of connections
on Model 1, I believe the performance is very reasonable. With the assistance of
private caches in a multiprocessor system, the bandwidth difference between Address-
Separated multibus network and fully connected multibus network should be small,

but the former can realize cache coherence protocols with much lower cost.

 

 

 

 

 

l l r
10 Modellori‘:l'*—..
T‘ = 2 *—
7‘ = 3 '9—
8 Model6orrzoo-X-
Execution
Time 6 - _
Ratio
\
4 - \ _
X \ _
2 - \ \ _
l :15 ~ "_ ~ -— -— _
2 4 16

8
B (N =16,M = TB).

Figure 5.6. Execution time ratio of different values of r on Model 4 (fully—connected,
N X rB X B) to Model 5 when N = 16. The r = 1 curve also represents Model 1,

and the r = 00 curve also represents Model 6.

5.4 Performance Evaluation by Queueing Models

Queueing models for Models 1 and 6 can derive the approximate performance results

very efﬁciently and can be used to verify the simulation results. These analytical

 

58

models will be used again later in this dissertation to analyze two cache-coherent
multiprocessor architectures. The discussion begins from Model 6 due to the simplic-

ity of its queueing model.

5.4.1 An Analytical Model for Model 6

Figure 5.7 shows the queueing model for Model 6. This queueing system involves
N processors which access a pool of B busses in a closed form. The processors are
represented by N processor servers; the buses are represented as a queueing system
with B servers. As shown in the ﬁgure, there is no waiting queue for the processor
servers since the number of requests equals to the number of processors in this closed-
form queueing system. There is only one shared waiting queue for the B bus servers
because busses are operated as a pool of resources. Because M z oo in Model 6,
no memory contention will ever happen. Thus no queueing systems for the memory
modules in the queueing model of Model 6 are necessary, the memory delay can
be added to the service time in the network servers. A single request (or customer
in the language of queueing theory) will move from its unique processor server to
a network server and back to the same processor server. The time spent by the
request in the processor server represents the computation time between two adjacent
memory accesses. The time spent by the request in the network server represents the
communication latency plus the time spent in memory modules. Totally N requests
exist in the system.

It is assumed that the computation time in each processor is exponentially dis-
tributed with a mean of 1/ H1, and the service time of a network server is exponentially
distributed with a mean of l/ug. No time is spent by a request in the waiting queue of
the processors because the number of processor servers equals the number of requests.
Consequently, the processor servers work like a queueing system with an inﬁnite num-

ber of servers. Because the number of busses is less than or equal to the number of

59

   
   

Service
rate: M 1

 

 

 

    
   
 

N servers

     
   

 

(Processors)
N
Service
rate: )U. 2
<———.

 

    
 

B servers

(Network & Memory)

Figure 5.7. A queueing model for Model 6.

60

processors (B S N), the service rate of the network servers depends on the number of
requests inside. The service rate is nu; when there are 12 requests in the server with
n < B, and Bﬂz otherwise. The arrival rate for the network servers when there are
71. requests inside is (N — 71);“, because all the requests that are not in the network
server must being delayed in the processor server with the service rate of #1.

Even though there are really two queueing systems involved, the processor servers
and the network servers, the state transition diagram of this queueing model can
be determined by only the network servers. This is because the model is a closed
queueing network and the function of the processor servers is as a delay center only.
Figure 5.8 shows the state transition diagram of the network servers. The states are
the number of requests inside, and the arrows represent the entering and leaving rates

for each state.

2M 3M BMZ BM By. B
2 2 2 [\2
N0 2

@f:0®'"0 :61)

Figure 5.8. State transition diagram for the network servers in the queueing model

of Model 6.

We can denote B, to be the probability of n requests in the network server. To
analyze the queueing model, we have to calculate the state probabilities ﬁrst. The

steady—state equations can be derived by letting the entering rate of a state equal its

61

exiting rate. Hence the equations are

State exiting rate = entering rate
0 Nﬂlpo = #2131
n, 1 S n < B [mm + (N — n)p1]Pn = (n +1)p2Pn+1 + (N — n +1)p1P,,_1
n, B s n < N [Bﬂz + (N — n)p1]Pn = BH2Pn+1 + (N — n +1)p1Pn_1

N BMPN = HIPN-l
an

Equation 5.1 can be recursively solved to represent Pn in terms of P0 and yields

 

—£%7&RP. 0<n<B,
Pn :: (N—n);; (#2) #0 n — (52)
(N—n)!e!Bn—B (1%) P0, B 3 n S N-

In order to determine P0 we use the fact that the B, must sum up to 1, thus

N B—l N
ZH=ZH+ZR=L (M)
i=0 '=O '=B

Combining Equations 5.2 and 5.3, P0 is given by
—1

B4 1 1‘ MN 1 1‘
A”: (i!(N—i)!) (i—l +152}. (N_,,!B._B ("Z—ll " (5'4)

i=0

 

P0:

 

The mean number of requests in the network servers, in, can be derived from

N
.20

And, because the total number of requests that exist in the model is N, the mean

number of requests in the processor server, in, can be easily calculated by

62

To generate other performance results, we will use a simple theorem that holds
for a large variety situations and is true for both the whole queuing system and any
subqueue. This is Little’s Law [101]. It says that the mean number of customers in a
queue equals the product of the mean arrival rate to the queue times the mean time
a customer spent in the queue.

Since a request spends no waiting time in the processor servers, the mean time
spent by a request in the processor server, 1121, can easily be seen to be 1 /n1. According

to Little’s Law, the throughput of the processor servers, A, can be derived from

’\ = — = = (N — ﬁzlﬂl- (5-7)

Because this queueing model is a closed form, the same throughput is also held
for the network servers. By applying Little’s Law again, the mean time spent in the

network server by a request, 152, can be derived from

- _B=___’EZ_

Now we have completed in analyzing Model 6, let us take a look at Model 1.

5.4.2 An Analytical Model for Model 1

The transition diagram of this model is much more Figure 5.9 shows the queueing
model for Model 1, the Address-Separated multiple bus network. The processor
servers are still the same as those in Model 6, but the shared-queue network servers
become B single-server queues. Since a pair made by a bus and a memory module is
always available at the same time, again, no queueing systems for memory modules
are necessary. As in the simulation experiments, memory accesses are assumed to

be uniformly distributed among the busses and the memory modules (note, M = B

63

 
  
  

   
    

Service
rate: M 1

N servers

 

 

 
  

  
   
     
   

Service
fate: M 2
Single
server

 

 

 

  
   
     

Service
rate: M 2
Single
server

 

 

 
  

 
   
     

Service
rate: M 2
Single
server

 

 

 

Figure 5.9. A queueing model for Network Model 1 (Address-Separated multiple bus
network).

64

in Model 1), that is, each request will choose to travel through a certain bus with
the probability of 1 / B. Again, the service rate for a processor server is #1, and the
service rate for a network queue is #2.

complicated than in Figure 5.8 because it is multi-dimensional. Because of the
complexity, solving the queueing model directly from the steady—state equations is
difﬁcult. Fortunately, this kind of queueing models can be numerically solved by
using the Mean Value Analysis method (MVA) [102].

MVA uses a number of fundamental queueing relationships to determine the mean
values of throughput, delay and population size for closed queueing networks. This
method is based on the Arrival Theorem which claims that in a closed exponential
queueing networks the number of customers present upon arrival has the same distri-
bution as the steady-state distribution for the network with one customer less [103].
Therefore, the queueing model can be repeatedly solved for increasing population size
that starts from a simple initial situation.

Let us denote the processor servers to be queue 0, and the B network queues to
be queue 1 to queue B. Let 75,-(2') be the mean number of requests in the jth queue
when there is a total of 2' requests. The mean time spent in the jth queue is denoted
as 7.723(2), and the arrival rate (throughput) for queuej is Aj(z'). In addition, let 2),-
to be the visiting probability of the jth queue. In this queueing model, 120 = 1 and
v,=1/Bforj=1,...,B.

The MVA method starts by calculating the mean delay that a request experiences
at the jth queue, {DJ-(2'). The 1710 is ﬁxed to be l/nl. For 1.7),- such that j = 1,. . . , B, it
is the sum of the service time of this request and the service time for all other requests

before it in this queue. According to the Arrival Theorem, the equation becomes

. l 1 . .
IDj(z)=—+——><nj(z—1), j=1,...,B. (5.9)
#2 H2

65

Both the throughput, /\,(z'), and the mean number of requests, nJ-(z) can be easily
derived by applying Little’s Law. With some manipulation, the MVA solution for the

queueing model in Figure 5.9 can now be listed in an algorithmic form in Figure 5.10.

 

1. lnitially,ﬁ,~(0)=0, j=0,...,B. Also, v0=l;v,=%, j=1,...,B.
2. ForizltoNdo steps 3,4and 5.
3. wo(i)=i;wj(i)=J—2+§i{:—ll, j=1,...,B.

#1

Figure 5.10. An MVA algorithm for the analytical solution of Model 1.

 

 

 

MVA can also be applied to solve networks of queues with state-dependent servers,
such as Model 6. But the accumulated computational error will be signiﬁcant when
the probability of no customer in a queue is extremely small due to a large N. The
MVA solution for Model 6 has been implemented for verifying the analytical solution
in Section 5.4.1 for small Ns, and the algorithm can be found in Appendix A of this

dissertation.

5.4.3 Analytical Results

By using the analytical models of the multiple bus networks, we can get similar per-
formance results as from the simulator but with much higher efﬁciency. The criteria
used to measure the analytical performance is the mean time between two memory

requests. In running a program on a system with a certain number of processors, the

66

total number of computation and memory request pairs should be a constant regard-
less the number of busses. Thus the mean time between two requests can successfully
represent the execution time of an application under that number of processors. As
with the simulation results, the analytical results will be shown by the ratios of the
mean time between requests on the evaluated model and the Model 5. The mean
time between requests on Model 5 is ﬁxed as (l/pl + l/ﬂz) due to the fact that no
contention will ever happen.

Figure 5.11 shows the analytical results for Network Models 1 and 6 when N = 16.
The computation time is assumed to be exponentially distributed with a mean of 10
time units (151 = l/pl = 10). The Y axis on Figure 5.11 is the ratio of mean time

{[1117

between two memory requests compared to that of Model 5, generated from U—uiT-T/ZE’

The ﬁgure shows the curves for #2 = 5 and p2 = 2. It can be observed that the results
have exactly the same shape as in the Figure 5.6. The difference between Model 1
and Model 6 is the same as shown in Figure 5.5.

Analytical models are very useful in the situations that running simulation is
difficult. For example, using the Tango simulator for a target system with more than
a hundred processors takes an intolerably long period of time, if it is even possible.
Sometimes the simulation cannot even be executed due to the limitation of memory
resources or the number of active processes allowed on a host machine.

Figure 5.12 shows the analytical results with N = 128 for #2 = 2 and M = 1. It
can be seen that when the network is a serious bottleneck (in the case of B < 8 for
either #2 = 1 or 2), the execution time is doubled when the number of busses becomes
half. The difference between Models 1 and 6 is not signiﬁcant at these points because
of the long waiting time in the queues. When the number of B is greater then 8, the

execution time patterns become the same as N is 16 in Figure 5.11.

3
2.8
2.6
2.4
Ratio of 2'2
Mean Time 2
Between

Requests 1.8

1.6
1.4
1.2

1

67

 

 

Model 1, l/ﬂz = 5
Model 6, l/pg = 5
Model 1, 1/[12 = 2
Model 6, 1/[12 = 2

   

-)(—_
_x_
-e——.
.9-

—1

.

 

 

 

 

Number of Busses

Figure 5.11. Analytical performance results for Models 1 and 6 when N = 16 and

1m1 = 10.
10
8
Ratio of
Mean Time 6
Between
Requests
4

Figure 5.12. Analytical performance results for Models 1 and 6

up, = 10.

 

 

Model 1, l/pg
Model 6, l/pg
Model 1, l/pz
Model 6, l/ltg

II II II II
mew
tti‘l‘

 

m u— I. 'l

16

Number of Busses

 

32

when N = 128 and

68
5.5 Summary

A multiple bus network is an interconnection network architecture with many advan-
tages. In the common fully-connected, N X M x B multibus systems, messages have
the freedom to travel through any available bus, so the systems can fully utilize the
communication bandwidth. But this network has a drawback that when the number
of busses increases, the number of connections increases signiﬁcantly, which compli-
cates the system design. In addition, this network loses an important character of a
single-bus network—the snoopy cache coherent protocols cannot be applied directly.

This chapter presented the Address-Separated multibus architecture on which the
choice of busses depends on the accessing addresses. Three network models with
equivalent bandwidth have been shown to be able to represent the address-separated
multiple bus architecture. Model 1, the easiest model of multiple bus networks, has
the same number of busses and memory modules, and there is only one connection
between each memory module and the busses. Model 2 is the fully-connected version
of Model 1, and it has been shown that the extra connections and the freedom of
choosing busses are redundant, unless fault tolerance is considered. Generalizing
Model 1 with multiple memory modules connecting to one bus leads to Model 3. The
above three models have been claimed to possess exactly the same network bandwidth.
The equivalence can simplify the evaluation process for complex networks.

For comparison purposes, we also deﬁned Models 4, 5 and 6. Model 4 is the
common multibus system, and many advantages that Model 1 has over this model
were described. Models 5 and 6 are theoretical maximum-performance multiple bus
networks with an inﬁnite number of memory modules, and the number of busses is also
inﬁnite in Model 5. Model 6 will be used later in this dissertation in a directory-based
system for performance comparison.

Because of the features of scalability and atomic broadcasting, the Address-

69

Separated multibus architecture is useful in the design of a cache-coherent multi—
processor which allows the use of snoopy protocols. Both simulation and analysis
methods are used to evaluate the performance, and the identical execution time pat-
terns indicate that they verify each other. The study shows that this multibus network
has many merits and can perform well in a multiprocessor. The purpose of studying
Address-Separated multiple bus network is to demonstrate the feasibility of the net-
work design of a cache-coherent multiprocessor, and to build a base of its performance

study. The multiprocessor architecture is presented in the following chapter.

CHAPTER 6

The Cache Coherent Architecture

A novel cache coherent multiprocessor named the Multiple Snoopy bUs system
(MSU) [104] is presented in this chapter. The MSU system applies the snoopy scheme
to an address-separated multiple bus-based NUMA—type multiprocessor without re-
quiring each processor to monitor all of the busses. After showing the architecture,
the protocols are presented. A qualitative discussion will compare the system with
others. Finally, the way to implement the network by optical networks will be de-

scribed.

6.1 The MSU system

In this multiprocessor, each processor has its own private cache and a portion of the
globally—shared main memory. A total of NM busses are used in the multiple bus
network, and Nread of them are reserved for the private and read-only cache lines. All
nodes are separated into (NW, — Nmad) clusters, and the shared-and-writable memory
lines in one cluster will be transferred on the same bus. Each node is assumed to
be able to transmit on any one of the Nb“, busses, and simultaneously listen to NW,
busses, where 3 3 NW, S Nbus. When NSW 2 Nbus, the system is a full snooping

system; otherwise, we refer to it as partial snooping. A node needs one or several

70

71

switchable (tunable) transmitters which can switch in real time to transmit messages
to any one of the busses. A node has N,"p receivers, one of which must listen to
a ﬁxed bus to receive the requests for the local memory of the node; another Nread
receivers, 1 S Nread S (NMp — 2), listen to the busses for private and read-only lines,
and the other (Nmp — Nread — 1) switchable receivers snoop on the other busses.
Figure 6.1 illustrates the dynamic snooping concept described above. Figure 6.1(a)
shows that Nodes 1, 2 and 4 are caching the blocks in a cluster using Bus 1, and Nodes
3, 4 and 6 are caching the blocks transmitted on Bus 2. Therefore, each node has to

snoop on the busses according to its caching status, as shown in Figure 6.1(b).

 

82

m

 

Figure 6.1. An illustration of the dynamic snooping concept of MSU. Each node has
to snoop on the bus(ses) that it has blocks cached from.

Figure 6.2 shows an example of an MSU system with eight busses. In this example,
each node has four receivers, two of which can dynamically snoop on different busses.

All nodes will have a receiver tap to the ﬁrst bus because that bus is reserved for

72

the private or read-only lines. Node 1 belongs to a cluster in which the local main
memory lines will be transmitted on the second bus, so there is a receiver that always
snoops on Bus 2. Similarly, the lines in Node 2’s main memory are transmitted on the
third bus. The cache memory is separated into four sets, and the placement policy is

set-associative with degree of (cache size in lines / 4).

Node 1 NOde 2

CD
C
U)

 

CD\IO\U'14>LUI\)|>—‘

Figure 6.2. The MSU system with Nb“, 2 8 and N51,p = 4. The ﬁrst bus is for

private and read-only memory accesses ( Nread = 1), every node has a ﬁxed receiver
connected to it. Each node also has a ﬁxed receiver connected to a bus for receiving
requests for the local main memory, and two tunable receivers which can connect to
any two of the busses at a time.

On the network of the MSU system, each request and response has to be trans-
mitted on a certain bus according to the address of the referred block, thus it is a
form of the Address-Separated multiple bus network discussed in Chapter 5. To see
the link between the two, the network of the MSU system is identical to the Network

Model 3 in Figure 5.1 with N = Nnode, B = Nbu, and M 2 NW, x Nnode. The M

memory modules are placed beside the Nnode processors now with NW, modules per

73

node. Since M was represented as kB previously, the value of k can now be simply

° N Nno e
derived from ilvaA.

bus

The classiﬁcation of the shared-and-writable and private—or-read-only blocks can
be performed by the operating system, compilers, or programmers. Only the shared-
and-writable blocks need to be snooped and maintained coherent. A processor can
load local private and read-only blocks into its cache immediately without sending
the request through the busses. If a required private-or-read-only block resides in
another node, one of the Nread busses can be used to send the request and receive
the data. No invalidate or update signals will be broadcast for private and read-only
blocks, so both the communication trafﬁc on these busses and the sn00ping processes
of each processor are kept to a minimum.

A processor can cache the shared-and-writable lines from at most (Nmp — Nread)
different clusters simultaneously including the cluster it belongs to; because whenever
these lines are cached, the processor must snoop on the corresponding busses. One of
the dynamically switchable receivers will be devoted to that cluster. When a remote
shared-and-writable line is needed by a processor, a request will be sent through the
corresponding bus determined by the address of the line, and the response will also
be received from the same bus. When there is no available receiver but a shared-
and-writable line from a new cluster is needed, a replacement policy is used to decide
which cluster should be replaced and then a cache—set replacement will occur. After
replacement the receiver will snoop on the bus associated with the new cluster.

A restriction of the system is that the shared-and-writable lines from only (N37,,D —
Nmad) out of the (Nbus — Nrcad) clusters can be cached into a node at a time. We
expect this restriction to be sufﬁcient for most cases; that is, a processor only uses
shared data from a few different clusters at a time. However, private and read-only

data can be accessed from anywhere by using the Nread busses.

 

74
6.2 Coherence Protocols

Because both a broadcasting ability and a snooping mechanism are supplied in the
proposed architecture, many coherence protocols based on a shared—bus system can
be directly applied in the MSU system. The protocol which can ﬁt the architecture

best must be a one that can:

1. minimize the intra-node actions to avoid local lockout between the processor,

the cache and the main memory.
2. minimize the bus actions to let the system have better scalability.

The needed coherence protocols are discussed separately for the sequential and

the weak consistency ordering models mentioned in Section 2.5.

6.2.1 Under a Sequential Consistency Model

According to the ﬁrst suggested condition described above, the protocol should avoid
writing cache blocks back to the main memory if it is not indeed necessary, to simplify
the actions inside a node. With respect to the second condition, write-invalidate
instead of write-update protocols should be adopted to make the messages short in a
write operation. The Berkeley Ownership protocol [42] is one of the protocols that
satisﬁes the requirements.

The Berkeley protocol is a write-invalidate protocol that is based on the concept
of cache block ownership. The owner of a block is either the main memory or one of
the caches. A cache can obtain exclusive ownership of a block via two invalidation
bus transactions. One is used on cache write misses and obtains the block for the
requesting processor, at the same time it invalidates copies in other caches. The
second is an invalidation signal that is used on cache write hits. Once the exclusive

ownership has been obtained, a cache can update a block locally without initiating

75

 
 
 
 

  

read‘:::§//

bus invalidate

Invalid

  

bus
write
miss

 
  

‘ bus
‘\ invalidate
\\ or bus
\.write
«,3? i s s ....... ‘
Q

 

  

write hit

   

Shared-Dirty
(owner)

   

Dirty

(exclusive owner)

  
 

ooooooooooooooooooooooooooooooo *
bus read miss

   
 
 

 
 

I
.....

. . write miss bus read miss
write hit

Figure 6.3. Transition diagram of the Berkeley Ownership Protocol.

additional bus transfers. Owning a block obligates the owner to provide the data to
other requesting caches and to update main memory when the block is selected for
replacement. Cache-to—cache transfers are done in one bus transfer, with no memory
update. A read miss will let the processor get a valid exclusive or shared copy of the
block from the owner of the block. To summarize the above description, Figure 6.3

shows the state transition diagram for the protocol.

6.2.2 Under a Weak Consistency Model
Gaining Performance

The potential performance gain of weak consistency on the proposed architecture is
especially high if the number of transmitters per node is less than the number of
receivers. In this case, one transmitter must respond to the requests from several
receivers and also must send requests for the local processor. If the transmitter can
keep sending messages without waiting for the response of previously sent requests,

a potential bottleneck on transmission can be relieved. Furthermore, for the weakly

76

ordered model, other processors need not stall while waiting for the responses delayed

in this transmitter.

A Lockup-Free Cache Architecture

Because memory coherence has to be enforced at the synchronization points, buffers
have to be used to keep track of outstanding requests. In addition, a few queues are
needed for memory accesses. A lockup-free cache architecture for single-bus-connected
system has been proposed by Scheurich and Dubois [37]. A revised one for the MSU

system is shown in Figure 6.4 and the operations are briefly described below.

 
   
 
 
 
    
  

Processor

initializing

queue

pending
access

cache/bus
controller

   
 

   

  
 

cache
memory

 
 
  
 

    
  
 
 
 

    

  

re—
ceiver

re—
ceiver

re-
ceiver

     

   

issuin
queue

  

Busses

Figure 6.4. Node architecture for implementing weak consistency.

The processor can continue placing memory access requests on the initializing
queue whenever the queue is not full. The cache/ bus controller can use the recycle

line to reinitialize a request if the controller cannot handle the request because of some

77

pending situations. The pending buffer is used to keep a record of the outstanding
requests which have been sent to the issuing queue but have not been performed. On
a cache access miss or a local write hit when the block is not in the dirty state, the
request has to be put on the issuing queue to be sent to the bus, and a record of the
request has to be kept in the pending buffer. When a request from a write hit has
been successfully issued on the bus, the request is assumed to be globally performed
by all other processors, and the pending buffer can erase the record for the block. For
a read or write miss, the record will be kept pending until the data of the block is
received from one of the receivers. If a request out of the issuing queue is rejected by
the bus arbitrator or by other processors, the recycle line of the issuing queue will be

used to reissue the request.

Interference Problems

In addition to using the same protocol as in the sequential consistency model to
serialize the write operations, the protocol for weak consistency has to address the
problem that a memory access might occur while a request for the same block is
pending. The second request may come from either the local processor or a remote
processor. It will be the programmer’s responsibility to guarantee that two different
processors will access the same word at the same time only if both accesses are read
operations. However, accesses from different processors still can target the same
block for different words. If the two accesses are both from the same processor, data
dependence has to be maintained, thus the cache controller (not the processor) may
have to stop accepting any new request from the initializing queue until the pending
request is performed. Table 6.1 summarizes the proper actions for these problems. If
when a local pending request has not being successfully issued and a remote request
for the same block arrives from the bus, the remote request can be ignored. That

is because after the local request is later issued, the newest owner will offer the new

78

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Aosmmm 5&3 goo?!
Aosmmm 882: 0.8an l A3255 >> >>
Aosmmm Rodi uooﬂou
Aosmmm 8803 .88qu l Aoaoaob m >>
Aosmmm 5&3 poo?
Aosmmm 8803 20st l AoaoEocv >> m
Aosmmm 8&3 98qu Aosmmm 8&3 80an
Aosmmm 389$ «53mm m 182 pom Acammm 883V cosmmm m 182 aom AoaoEouv m m
oumemﬁﬁob mzeam Hozobsoo 283V >> >>
oazemﬁamob CS wumwaom 8on m 3mm: mEEoSbﬂ 980: m >>
ouzemﬁcmg mzﬁm .5on38 280: >> m
owmemﬁﬁou .8 wﬁwuoE mammuoa 280$ m m
€203 advocate Hob $.83 macaw one not mmoooe mmoooo
nompoe Eooq somuoe ﬁeooq mat/Eda. womaom

 

 

ASE out? a yo mama 3E3 a uoﬁmo 03 BE 32:? nmmoooe «£55 ”3 .mmoooe wave ”EV .3182 wamvaom mm 483 083 one Sm mmoooe
so 233 AmHOmmoooa 3082 59¢ Ho .888on #82 Soot E 858 mmoooe no sea? hoooammmGOo Mao? How maompo< Ab pimp.

79

data anyway. However, if the request has been issued, the arriving request has to
be rejected. To prevent the other processors from issuing a request the controller
can raise a special bus busy line just after receiving the request. Thus whenever a
request is sent to a bus, the cache/ bus controller has to detect the line of that bus. If

a request is rejected, the controller can just put it back into the issuing queue again.

Synchronization Points

In order to keep the coherence at the synchronization points, before a synchronization
access is allowed to be performed, all previous accesses must be performed globally.
Therefore, before the processor puts the synchronization access into the initializing
queue, the recycle line of the queue must be disabled. When the cache / bus controller
ﬁnds that the next access is a synchronization access, the controller has to wait until
the pending buffer is cleared and then the controller can receive the synchronization
access. After this point, the controller cannot accept any other request until this
access is ﬁnished. Finally after the synchronization access is done, the recycle line for
the initializing queue can be enabled. During the stall period of the controller, the
processor can still proceed and put requests on the queue as long as the queue is not

full.

6.3 Qualitative Performance Expectation

This section presents the advantages and disadvantages of the cache coherence scheme
of the MSU system. In particular, the effects of memory references from various
applications and the effects of system parameters on the performance of these schemes
are addressed in here. In the following chapter the arguments will be supported by

simulation experiments and analysis.

80

6.3.1 Pros and Cons of Coherence Schemes

One main advantage of directory schemes is that they are not restricted by the band-
width limitation of a particular network. However, unlike snoopy schemes, the di-
rectory look-up and maintenance make the access latency longer. For example, on a
cache read miss to a shared line, the requesting node has to send the request to the
directory of the line, and then from there the request must either be forwarded to the
real owner of the line or the requesting node must receive the directory of the line.
Furthermore, either the bit-vector or the linked list has to be modiﬁed to reflect the
new status.

The atomic broadcasting ability of the busses of a snooping scheme can let a
message be sent to all nodes at the same time, which can signiﬁcantly shorten the
time for keeping data coherent. On the other hand, sending a signal to all nodes can
only be executed serially on directory schemes on a general interconnection network.
For example, on a write hit to a shared block, the invalidation signal can be atomically
sent to all other caches on the bus. For the directory schemes, the invalidation process
has to be done serially; moreover, on the linked-list scheme the sharing chain has
to be broken and the location of the head has to be changed. Although part of the
latency can be hidden under a weakly ordered model, any subsequent synchronization
operation has to wait until all previous accesses are ﬁnished.

In the mean time, a directory scheme also generates a lot of internode commu-
nication trafﬁc. For example, whenever a write hit occurs, individual invalidation
messages must be sent to each processor which cached the same data, and all of them
have to send an acknowledgment back to the directory or the requesting node before
the write can complete. The writing processor can be idle for much of that time.

Generally speaking, although directory schemes can be applied to high-bandwidth

interconnection networks, snoopy schemes on bus-based networks can achieve low-

81

latency memory accesses; and the latency, not bandwidth, is often the limiting factor
in the performance of a multiprocessor. The bandwidth of bus-based systems can be
increased by using multiple busses or by using multiple-channel optical networks.

Traditional single-bus-based snoopy schemes cannot scale with the growth of the
number of nodes. If multiple busses are used without modifying the snoopy coherence
scheme, as discussed in Section 3.3, the complexity of the coherence mechanism will
grow fast when the number of busses or processors increases, which can seriously
restrict the scalability of the system.

An address-separated multiple bus network has lower network bandwidth than. a
pool of multiple busses because the busses may not be evenly utilized. That uneven
utilization is offset by a reduction in the complexity of a multiple bus snoopy scheme.
The MSU system introduces the concept of partial snooping to simplify the snooping
mechanism for each processor when the number of busses is large. By using this
concept most of the advantages of snoopy schemes can be kept while improving the
scalability of the system. An overhead introduced by the MSU system that does not
exist in the traditional single-bus-based, snoopy-coherent system, is the switching time
among the busses. But with today’s technology, this overhead is not signiﬁcant [105].

The cache-set replacement overhead is a potential drawback for the proposed
scheme. When a processor needs a new shared-and-writable block from a new block
cluster and if no cache set is available, one presently cached set must be ﬂushed out
and the receiver has to switch to a new bus. All dirty lines in the replaced set need to
be written back to the main memory. Also, the cache hit rate might be low after the
replacement if the lines in the replaced set will be needed again later. Actually, the
replacement of blocks is needed for any cache system unless the size of the cache is
unlimited; the difference in our the proposed scheme is that in addition to the normal
block replacement, replacement also occurs with a set of blocks. If memory access

patterns are uniformly distributed to any cluster, the probability that the next access

82

will introduce a set—replacement can be calculated as:

Nan _Nread
(1_ h) X (1_ Nb P _ N d) X pshared X pwritable

 

where h is the hit ratio, the pshared and paw-table are the probabilities that the requesting
block is shared and writable, respectively. The N23,:N:::: is the probability that
a block is in one of the clusters whose memory is being snooped by the node, so
(1 —— N22,:NZZS) is the probability that a block belongs to a cluster that the node is
not snooping on. If we choose it = 0.9, pshared = pwrgmbze = 0.5, N5”, = 8, NW, 2 4,
and Nread = 1, the above equation will result in a value of 0.014. However, we believe
that the locality of memory access patterns will cause a processor to use shared blocks
from only a few clusters at a time. If increased locality reduces the probability of a
memory access being in an uncached cluster to 0.1 instead of (1— W), the above
equation will become 0.0025. That is, the probability of the set-replacement should
be very low when a reasonable number of Nmp, so the overhead is not signiﬁcant. The
simulation results which will be shown later will also indicate that in most cases the
extra replacement time is not critical to the performance. Increasing the reference
locality by a different algorithm or a better memory placement can avoid this problem.

The pros and cons for each of the three coherence schemes are summarized in

Table 6.2.

6.3.2 Impact from Memory References

In addition to the general advantages and disadvantages of cache coherence schemes,
various reference patterns and system parameters can have signiﬁcant impact on
system performance. In MSU, a node can cache as many cache lines from another
node as it wants to by snooping the corresponding bus. In addition, all nodes can be

caching a single cache line if they each are willing to use one receiver for snooping that

83

 

.mOmmH—£ do mudO~ VOUGQHMQES KAQ ﬁomuﬁvﬁuoo

$880826 25 8 80080.20

800308 0388083m 8o“ 8008 08%
8808008808

80808000 o» 088 8008808 80880883

.800 08080 w8o80 m88m8£ 03mmmom

8888888 00808880800808088 0888mm

88:38.08 80.308 80% w88oo8a Ewing
8.25.800 80.08888 8838 30m
.8808: 000008 80808 2.08m

A8038 szv

w88oo8m 8888.08
883 88:88
28098808 8o

08088 Nao08m

 

.3808 888 80888388 8
8088 25-08888 8 8 8025

.®a®£om kHOuUQHmU d Eda:
UEGHH OOGQGOHGﬂNEIOOGOHOAOU mmog

080800 8808a

 

 

 

 

 

 

 

:8 8o moo8m 8800088 8080 w8>0m 888880.? 00888388800808088 038% 8083.85
0803.8 88-0%86 8 8238000 80885 .8838 800080 8o808 208m #0888888
.0088 80808 083 m0€ouo08Q .8088 25-088888
0080808888 830088 OH 080 oESE 8 8 8888 008 88808880 8.00
68830088 w883888 8 8008830 .m0wO8 088 028 880 8088 08H. 08088
$08088 $000.0 8o808 w8oq 812308 ~80 08 80:89» 03 880 83008Q
m8o0 _ wormﬁ 08080m _

 

.mvgvﬁom OUQ®H®£00 @5060 00.2.: OH: wO wQOU mug—m mOHAm :ND OHLdL—u

84

bus. However, what the MSU system cannot do effectively is caching a little bit of
every node’s memory, even if any particular cache line is only being used by some small
number of nodes. On the other hand, a directory scheme can easily handle the case
where every node is caching some small amount of memory from every other node,
as long as any particular cache line is only being cached by a small number of nodes.
However, it cannot easily handle the case where many nodes want to cache a particular
piece of data. Generally, in real applications most data are either being cached by
a small number of nodes or being globally shared [106, 107]. The MSU system can
handle the latter case very well. The same case causes problems for directory schemes.
For the former case, MSU’s behavior will depend on the distribution of the data, so
the system may or may not effectively handle the case. However, because of the
locality property in most applications and the help from the read-only bus in our
system, we expect the MSU system to perform well in both sharing patterns. Also,
distribution of data can be solved in software relatively easily.

Table 6.3 depicts the effects of reference patterns and system parameters on per-
formance for different systems. In general, a snoopy scheme will perform better than
a directory scheme when shared data are frequently accesses. For example, consider
the length of a linked list in SCI for hundreds of nodes: if a modiﬁcation on one cache
has to inform all the others, hundreds or thousands of operations and messages will
occur in the system. Many other conditions are also considered in Table 6.3. One
interesting phenomenon of the partial snooping in the MSU system is that when the
NW, decreases, the number. of cache sets decreases also, so the number of lines in
each cache set increases and that can increase the hit rate. Therefore, sometimes the

system performance will be better when the number of snooped busses is reduced.

85

 

 

00808808808 w8>0888 8080 00808 88 w8000808 80 808008 0 08 8883

£00 08000 0 8 0088 80 808888 088 00000808 0080 8 .80>030E
8008080— 000000 088 w88088w808 8080 a088088 w8000808

“0808 88 w80008008 8 808008 0 08 8883

h0.800 08808000 .80 808888 088 08888m

DmE 80 3:2. w8000800Q

 

.000088 w808 8 808080880
08..— 08 0880 8008 088 880808 800 0080800 80800858

Evumxmm mﬂﬁmﬁde .m am
mvmmdﬁ MO HOﬁESG wﬁmmeHUCH

 

8808000808 80008000 88 8000800 800088086 088 8080

000088 8808 w880080 80 80008 088 0803020 800 8 .DmE 88
.088 088 80 888 08 08 0>08 :50 00000000 08080 808080 :0 888
.8008 0883 80808 80008 808008 8 880808 0080800 .3008m
.000000 8080088

8080 08080 808008 80.8 880808 800 0080800 80800.8Q

m88808 80808 888%

 

.080800 8080088 0 80 80008 808008 0 0888 8883 880880.»
8008080 880808w 0 80 w88§> 088 800 :85 0805088080880
5080080888880 80308080 0800 088 8808808 080000008 088888 88

80308080 8050880880808

 

00808808808 808080 088 00008008 8:? 8080008808

800 08000 80 0808 88 308 088 8080 80088086 8008 0883 080.
h808080880 808 08 000088 8080080 80 808888 088 8
800803088 00088 80 80008 088 0080808 :85 8 ﬂDmE 88

0088008 80808
80808088 8 80808
w8000000 3880888

 

.808080 sz 088 80 88000 808 000088 80 80008 80008080883

08080 808080 308 0
w8000000 88808800808

 

 

088088 8080 800880080 08880 00800 :85 0803080808888
0080 8008 088 0800.080 80800.88 83080 3080 8088

 

0808 8008080 880808
80 8880088008 808808.08

 

 

00808808808 808080 80 800888 _

80080 _

 

0808080 8808080008000 *0 00808808808 088 80 0808080808 808020 8080 08808808 008080808 80 800888 .mé 030mL

 

86

6.3.3 Performance Expectation

As a summary of this qualitative discussion, Figure 6.5 shows the expected compara-
tive performance of a directory coherence system and a partial—snooped MSU system
for various system sizes. When the system size grows, the number of processors, the
network bandwidth, and the number of snooped busses in MSU increase together.
When the system size is small, the directory coherence system is expected to perform
better because at that moment the number of snooped busses in MSU is not large
enough for caching the working set, and the frequent cache-set replacement may re-
duce the hit rate and generate overhead. But when the system size is large enough to
support enough snooped busses, the performance of MSU will soon outperform the
directory coherence system due to shorter memory access latency. Furthermore, when
the system size keeps growing, the long memory access latency in a directory scheme
prevents the system from gaining performance, and even causes performance to de-
crease because of the longer communication overhead. However, the MSU system can
utilize the increased processors and the network bandwidth to gain performance. This
expectation will be veriﬁed later by both simulation and analysis, and the identical

relative performance shapes can be found in both Figures 7.17 and 7.21.

6.4 Implementation by Optical Networks

In addition to traditional electronic busses, the multiple busses of MSU can also
be implemented by the multiple channels of a state-of-the-art optical network. The
proposed coherence solution is particularly attractive considering the high bandwidth
and channel tunable capability of fiber optic networks. As mentioned in Section 3.4,
current advances in lightwave technology make the potentially usable bandwidth of
an optical ﬁber to be on the scale of Terahertz. The large bandwidth can be efﬁciently

exploited by wavelength division techniques to partition the available bandwidth into

87

 

MSU

/

Execution Directory scheme

Time

 

 

 

 

 

System size

Figure 6.5. The expected comparative performance of two scheme under different
system sizes.

multiple channels. With current technology, optical wavelength tunable transmitters
and receivers can tune to different optical frequencies (wavelengths) within tens of
nanoseconds. Multiwavelength optical networks allow multiple packets under different
wavelengths to travel in one optical ﬁber without interfering with each other [92].

A node on MSU can snoop multiple channels by using multiple tunable receivers.
The star topology constructed by a star-coupler is chosen because of its scalability
and the fact that the it is a passive device. Figure 6.6 shows the architecture of an

optical connected MSU system with tunable transmitters and tunable receivers.

88

 

 

 

 

 

 

.I O ‘

. OIOCGSSOI‘

    

local
mem

 

 
 

star
coupler

 

 

 

—
b...

 

 

 

 

 

 

 

 

 

 

 

Figure 6.6. Optically-connected MSU system. One tunable transmitter, two ﬁxed
receivers and several tunable receivers are used in each node.

CHAPTER 7

Performance Evaluation

This chapter compares the performance of the MSU system and a directory-based
system by using simulation and queueing analysis. The simulation results were derived
by implementing two architectural simulators on top of the Tango simulation system
introduced in Chapter 4 [108]. The analytical results were derived by calculating
the mean service times in the processor and network servers of the queueing models
described in Section 5.4. The methodology and results from simulation are presented

ﬁrst.

7 .1 Simulated Directory-Based System

Two architectural simulators were developed. The ﬁrst model is for the MSU system
described in Chapter 6. The other adopts a directory-based coherence scheme for
comparison. A full—map, bit-vector directory scheme was chosen in our simulation
because, in general, it achieves the best performance among directory schemes for
large-scale systems [28].

The network of the simulated directory-based system has been chosen carefully for
a fair comparison with a bus~based system. Adopting the address-separated multiple

bus network would be unfair for the directory-based system because the constraint on

89

90

the choice of the busses according to the address of an access should not be applied
to directory schemes. The reasons for using the address-separated busses on the
MSU system are to ease the implementation of the snoopy scheme and to make the
partial snooping possible. Nevertheless, as shown in Chapter 5, the address-separated
network has smaller bandwidth than a full-connected multiple bus network. Thus, for
a fair comparison with the MSU system, the network of the directory-based system
should possess the same bandwidth as a fully-connected multiple bus network so that
the transmission can be carried when any one of the busses is available.

Since a multistage interconnection network is a natural choice for a directory-
based system, we can present the network of the directory-based system in the same
way. Figure 7.1 shows the network of the directory-based system with four nodes
and two channels. When a node needs to transmit data, it ﬁnds an available channel
through either one of the two 4-to-1 switches. If it succeeds, a path is built and the
path will be held for reply, if any, until the end of the session. If it fails, the node will
wait for one of the channels to become available. Since the channels are analogous
to the busses on the MSU system, for simplicity, in the following only bus or busses
will be used to refer to either the busses on the MSU system or the channels in the
directory-based system; and, Nbus will refer to the number of either the busses or the
channels. For example, the system in Figure 7.1 will be described as Nnode : 4 and
Nbus = 2-

Multistage interconnection networks normally have neither broadcast nor combin-
ing capability built into the switches. To prevent the MSU system from gaining an
advantage in the comparison from its use of broadcast, we chose communication pa—
rameters for the directory to keep communication costs low. For example, the second
and subsequent invalidation messages during an invalidation operation require half
the time of the ﬁrst message. This choice represents an implementation where most

of the overhead of the multicast invalidation communication is incurred by the ﬁrst

91

 

hannel 1

'-
1 tc>l‘l ,“||||||l|i 4 to 1
switch switch
I‘
. V V ‘.
"‘ '\ Channel 2

’-

l to 4 4 to 1

switch switch
Node 4

 

Figure 7.1. The simulated network for a directory-based system with Nnode = 4 and
Nbus : 2

92

message. Also, the switching time for the MIN-based directory system is assumed to

be the same as the bus arbitration time in the MSU system.

7 .2 Architectural and Timing Assumptions

The architectural assumptions for both simulators are listed in Table 7.1. Since
the MSU system uses a set-associative cache, the same was used for the directory
scheme. The replacement policy for cache lines and the cache sets on the MSU is a
clock algorithm, which sets a space as unavailable for replacement when the space is
accessed. The algorithm sequentially checks all places to look for an available space
for the incoming item. If the currently pointed space is unavailable, the algorithm
sets it available before checking the next one, so next time when this space is pointed
to, it will be chosen for replacement if it is not accessed before beng pointed to
again. As mentioned in previous chapters, the Berkeley Ownership protocol and the
circuit switching technique were adopted. Although relaxing the consistency ordering
requirement can improve performance in both systems, the optimal performance will
be gained by different relaxation strategies in the two systems. To make the simulation
experiments fair and simple, sequential consistency is used in the simulation for both

systems.

93

Table 7.1. Default architectural assumptions

 

One word: 4 bytes

One cache line: 16 bytes

Main memory page size: 1K bytes

Data cache capacity (per processor): 256 lines

Cache placement policy: set-associative with degree of # 256

of snooped busses

 

Replacement policy: Clock

Coherence protocol: Berkeley Ownership (a write-invalidate protocol)
Switching technique: circuit switching

Consistency ordering: sequential

 

 

 

Because the network bandwidth is important in deciding the system performance,
our simulation considered the contention in the networks for both systems. The MSU
system has greater opportunity for bus contention because each access has to travel
through a certain bus. Memory contention will never occur in the Address-Separated
network for the same reason as in a single-bus network. However, memory contention
is still possible for the directory-based system when two requests from different busses
(channels) target the same memory module in a node, but this contention was ignored
in our simulation for simplicity reasons. Because the sequential consistency is chosen
in our simulation, each node in both the MSU and the directory-based system can
only set up one path in the network at a time. But, as just mentioned, each node
of the directory-based system in Figure 7.1 can handle multiple requests coming
from the left side of the node at the same time, which gives a distinct advantage to
the directory scheme. Note that the network of the directory-based system has the
same bandwidth as the theoretical Network Model 6 in Figure 5.2. Even though the
simulation gives an advantage to the directory scheme, the results will demonstrate
that the MSU system can still perform better.

Following the notation used by Chaiken et a1. [28], we can denote the simulated

94

directory scheme as DirooNBNm, which means that in the directory scheme (Dir),
there is no limitation (00) on the number of processors that can cache a line. And,
when a node invalidates the same cache line in other nodes, it only informs the nodes
which cached the line rather than broadcasting to all nodes (NB), otherwise a lot
of trafﬁc will be generated on the Nbu,-channel network. The MSU system will be
denoted by MSUﬁfZZf later in order to specify the values of Nmp and Nb”. Because
of the strong application-dependence, no read-only busses were simulated for MSU
system; that is, we always assumed Nread = 0 in the experiments. Elimination of
those busses gives yet another advantage to the directory scheme.

The default numbers in processor clock cycles for memory access latency are listed
in Table 7.2. These numbers were derived by assuming that in a system with N MHz
processors, the bandwidth of a single bus is approximately N Megabytes per second.
The numbers in the upper part of the table are used for both systems, and the lower
part shows the overhead which only happens in directory schemes. The table has
only the default numbers; the sensitivity on performance of a variety of the latency
values has also been studies and the results will be discussed in the following section.
Table 7.3 lists the time needed for read and write operations when there is no network
contention, replacement, or write back.

In the simulation experiments for evaluating the network bandwidth, described in
Chapter 5, a constant communication time was applied to every memory reference.
Now in evaluating the multiprocessor systems, an option of Tango is turned on to in—
voke the corresponding architectural simulator only when a shared memory reference
is detected. That is, without losing any generality in a N UMA system, we assumed
that all instruction fetches and private data references always hit in the processor
cache so the simulator can focus on shared data accesses. This choice allows us to
simulate the most critical characteristics while allowing a simulation run to complete

in hours rather than days.

95

Table 7.2. Default memory access latency for primitive operations in processor cycles.

 

 

 

 

 

 

 

 

 

 

 

 

I ParameterTOperation I Size I pcycles ]
tarb Bus arbitration 2
teache Read and write cache hit one word 1
tin” Invalidation one line 4
treq Access request on bus one line 4
trpy Access reply on memory and bus one line 32
trpm Cache line replacement overhead one line 2
twbl Write back to local main memory one line 5
twb, Write back to remote main memory one line 20
Extra overhead for full-map bit-vector directory schemes:
tab-m, Invalidation after the ﬁrst one one line 2
tdloc Local directory look up and update 8
tdmt Remote directory look up and update 40

 

7 .3 Benchmark Programs

Three benchmark programs—matrix multiplication, Gaussian elimination, and heat
transfer—were implemented in our simulation experiments, and the descriptions of
them can be found in Table 7.4. The Mat and the Gauss programs have been widely
used as benchmarks in many memory architecture studies [54, 106, 109, 110]. The
Heat program was selected because its memory access locality allows us to study
the impact of data placement. All of the benchmarks are common applications and
each one represents one category of applications. In addition, the programs and
the reference patterns are quite straightforward which allow better reasoning on the
simulation results. As mentioned in Section 4.4, the time—consuming, instruction-
level simulation limited the size of our benchmarks. The default problem sizes are
shown in Table 7.4, but the sensitivity of larger problem size was also studied in some
experiments. The characteristics for each of the benchmarks are presented below.
For the Mat program, almost all accesses are for shared data. All the items in the

two multiplying matrices are read-only, and every item of the result matrix is only

 

96

 

 

 

 

 

 

08.038020» + 88.3 + 00..“ 0880\038w + 88¢ + 80......w + few + uzueow AQHOE®8\8.MUOC mmma 00:83 80 ﬁdwm 00.:t..~H
82...: x 8 1.0+ 2.3 is: x 8 n e + .53 88080808: 25
853033 ”503033 + few + 3088 80830080808 0 80 88 088$. 3-31:8
o 2088 0888 80830.80 80 80 88 0885 00:85
D 08000» 828 ﬁdvm 88:000er
A08030880>8 8008008 80 no? 08... 08 3 ”080800 8880800888
88.6 + on...“ 88.8w + new IT 880w + 08000» mmma 9383 HQ ﬁdvm 00.085
2...» 3:8 + 0.88 + 2088 08: 8083080808 0 80 88 08.85 00.3.08“
o 3080 088 800830-080 80 80 88 08.85 inﬁrm
O 08000w a—ﬁ Udvm 88800ng

 

”080800 w88008m

 

 

08$ 803088000 088 _

8080808 00000< _

80508080 _ 808080.808 _

 

8808000808 088 08000 8080 808808800 088 80 0080080 8 080308080 0808808, 80 8080808 .ms 08808.

 

97

Table 7.4. Descriptions of the benchmark programs and the default problem sizes.

 

I Benchmark I Description

 

 

 

 

Mat Multiplication of two 100 x 100 dense matrices. One processor
takes the task of generating results of one row at a time.

Gauss Using Gaussian elimination method to transfer a 100 x 101 matrix
into an upper triangular matrix.

Heat Heat transferring calculation by Jacobi method with double

buffering on a 64 x 64 grid. Deriving the new value of a grid
point takes data from its four nearest neighbors.

Processors are synchronized after each round of calculation, and
global difference is accumulated every 20 local calculations.

The convergence tolerance was set to 10‘3.

 

 

 

 

cached in one processor. In the other words, almost no invalidation operations occur
in executing the Mat program. The Gauss benchmark is an example of programs
that are difﬁcult to parallelize well. The load balancing among processors may be
biased after calculating for a while. That is, some processors will be idle or only
executing synchronization operations while others are busy in computation. For the
Heat program, only the information of grid points on the boundary of two adjacent
processors is shared, no grid point information has a high degree of sharing. The
global difference is the only variable shared by all processors; but this shared variable
is modiﬁed by each processor only once every 20 rounds of calculation because each
processor accumulates the local difference among its points first.

As stated in Table 6.3, the characteristics of the selected programs of “no invali-
dation”, “biased load balancing”, and “low—sharing” should all benefit the directory
scheme. The data were stored in row-major order from the matrices into memory
pages, and the pages were interleaved across nodes so each node could store a portion
of the global data even for these small—size problems. A node may need data from sev-
eral nodes at a time to complete some computation. This default memory placement

can balance the network load for both architectural simulators, but causes problems

98

for partial-snooping MSU when the number of N”? is not large enough. In the other
words, in addition to the simulator, we favored the directory scheme in the simulation

again in the choice of the benchmark programs and the memory placement.

7 .4 Simulation Results

This section provides the simulation results that show the performance of the MSU
and the DirooNB systems. The purpose of the simulation experiments is twofold,
one is to compare the two schemes, the other is to study the behavior of the MSU
architecture. This section begins by showing the comparison of the directory scheme
with full-snooping MSU, and then the performance of MSU under various degrees
of partial-snooping. The impact of various system parameters will then be studied
in the following subsections. Determining the sensitivity of the system to different
parameter values is essential to demonstrate the validity of the simulations. One
can take exception with individual simulation parameter choices, but the parameter
sensitivity studies indicate that the conclusion is relatively independent of parameter
choices. In the last subsection the performance under various numbers of processors

will indicate the scalability of the two coherence schemes.

7 .4.1 Comparison of Full-Snooping MSU and DirooNB

The full-snooping MSU means that the system snoops on all busses, but the network
is still the address-separated multiple bus. On the other hand, the DirooN B uses the
network shown in Figure 7.1.

Figure 7.2 shows the cache hit rates for the three benchmarks used in the simula-
tion experiments for both schemes. The data is collected from the simulation results

of MSU, but the difference in hit rate between MSU and DirooNB is very small and

can be ignored. The small difference is caused by the different memory access se-

99

quence in simulation. In Figure 7.2, when the number of busses increases, the hit
rates drop. That decrease occurs because the degree of set-associativity in the cache

placement policy decreases with a larger number of busses.

Shared Uata Cache HI! Hate In "/0

 

Mat Gauss Heat

2 4 8 Cl 16

 

Number of Busses:

Figure 7.2. Average shared data cache hit rate for the three benchmarks for various
number of busses when Nnodc = 32.

The normalized execution time used in Figure 7.3 and following ﬁgures was derived
by dividing an execution time under a certain number of nodes (Nnode) using the
speciﬁc scheme by the execution time of MSUxZZT with Nsnp = Nbus = %Nnode for the
same benchmark program. For example, when Nnode = 16 the following calculation

is used:

exe. time of DirooNBle
exe. time of MSU:

 

normalized execution time of DirooNBm =

The reason for using normalized execution time is to scale the performance results so

that results using different numbers of nodes can ﬁt in the same ﬁgure. The relative

100

shapes under the same number of nodes are preserved.

Figure 7.3 shows the performance for the Mat benchmark. The ﬁgure shows that
MSU is better in all cases, but the difference is reduced as the number of busses
increases. That reduction in difference occurs because the directory scheme can use
the added busses more effectively, as stated in Table 6.3. When Nb”, 2 16 and
Nnodc = 32, the two schemes achieve almost the same performance. However, the
fraction of Nbus/Nnode = 1/ 2 is unusually high, probably 1 / 4 or 1/8 is a more practical
number. With a more reasonable number of busses, the performance of the snoopy

scheme is far better than the performance of the directory scheme.

 

  
   
   

Dir, Nn = 32 *—
5.0 _ ode =

MSU, Nnode 32 €— "
Dir, Nnode = 16 -><-
MSU, Nnode = 16 {>—

 

 

 

4.0 - -
Normalized
Execution 3.0 - 0
Time
2.0 r ..
1.0 - , . $ 5
0.5 l L l
2 4 8 16

Number of Busses

Figure 7.3. Normalized execution time of full-snooping MSU and DirooNBNbus under
various number of busses for Mat benchmark.

Figure 7.4 shows the same comparison as Figure 7.3 but for the Gauss and Heat

benchmarks. In these cases, when Nnodc = 32, the directory scheme outperforms

3 .0
2.5
Normalized
Execution
Time
1 .5
1 .0

0.5

5.0

4.0

Normalized3.0

Execution
Time
2.0
1.0
0.5

Figure 7.4. Normalized execution time of full-snooping MSU and DirooNBNm under

101

 

 

Dir, Nnode
MSU: Nnode
Dir: Nnode

MSU: Nnode

h—H—‘OOOD
OEOENM

fl

<l>>k

 

 

Number of Busses

(a)

16

 

\\

 

Dir: Nnode
MSU: Nnode
Dll‘, node

MSU: Nnode

   

\

.0
‘5

II II II ll
#6wa
mamw

©>k<l>f

 
   
 

 

 

8

Number of Busses

(b)

various number of busses for Gauss (a) and Heat (b) benchmarks.

102

MSU when the number of busses equals 16 for Heat, and both 8 and 16 for Gauss. A
similar pattern happens when Nnode = 16. The better performance of the directory
scheme in these instances should be mainly due to the low data sharing rate in these
programs. However, in all other situations, the snoopy scheme on the MSU performs
better. The above ﬁgures also indicate that the relative patterns on curves of the

MSU and the Dir are generally similar under different number of nodes.

7.4.2 Various Degrees of Snooping on MSU

Figures 7.5, 7.6 and 7.7 show the execution time and hit rates on MSU for various
degrees of partial-snooping (various Nsnp) when Nnode = 32 for Mat, Gauss and Heat,
respectively. For comparison purposes, the execution time of a directory scheme with
N3”? busses is shown using a dashed line.

On the (a) ﬁgures of Figures 7.5, 7.6 and 7.7, the curves show that at some value
of N,,,,,, partial-snooping may derive better performance than full-snooping on MSU.
In these situations, the MSU with partial-snooping can beneﬁt from both scalability
. and performance. This is a very encouraging experimental result because it shows
the usefulness of the concept of partial-snooping. For Mat (Figure 7.5) and Heat
(Figure 7.7) with Nnode : 32, MS a?” can outperform Dir‘ooNB16 at some values of
NW9, but not in Gauss (Figure 7.6). This observation means that if the computation
and communication load of an application cannot be balanced, such as in Gauss, the
DirooNB can beneﬁt from the freedom of using all the channels (busses), while MSU
system suffers from the heavy communication trafﬁc due to the low hit rates (shown
in Figure 7.2).

The reason for the better performance of partial-snooping can be observed in
the hit-rate ﬁgures (Figures 7.5(b), 7.6(b) and 7.7(b)). They show that the best
hit rate usually occurs at a certain degree of partial-snooping, but not at the point

of full-snooping. One of the reasons for this phenomenon is that the degree of set-

103

 

2.0 - \ DirooNBNmp »<— _

\ MSU NW «9—
8
Msuxzw <—>~

\

, 1.5 -
Normalized \
Execution \
ime

 

 

 

0.5

 

 

 

 

 

 

  
  

 

 

 

 

80-

$ 7e~

.E

.9 76—

(U

a:

r: 74—

I

.°c’ 72~

O

8

.9 ml

<6

0 68*

'O

9

0 66—

.C

a)
64__.

/
8 busses

 

/

8 16 / 16 busses
Number of Snooped Busses ( N snp )

(b)

Figure 7.5. Effect of various degrees of partial-snooping of MSU when Nnode = 32
on normalized execution time (a) and cache hit rate (b) for Mat benchmark under
default parameters. The dashed line in (a) shows the execution time of DirooNB with
NWT, busses for comparison.

104

 

\ DirooNBNmp —x—
Msvgﬁf *9— -
MSU16 P 9—

\/

 

Normalized 1 -0 ‘
Execution
Time

 

 

 

l l m

 

8
Number of Snooped Busses (Nsnp)

(a)

 

 

 

 

 

 

 

  
    

 

 

68-

o\°

.E 67“

2

(U

E 66~

'3':

.°c’ 65*

U

(U

o

a 64*

o

E 63-

(U

.C

m
62-

8 busses

  

8 16 I) 16 busses
Number of Snooped Busses ( N snp)

(b)

Figure 7.6. Effect of various degrees of partial—snooping of MSU when Nnode = 32
on normalized execution time (a) and cache hit rate (b) for Gauss benchmark under
default parameters. The dashed line in (a) shows the execution time of DirooNB with
N,,,,, busses for comparison.

105

 

 

\ DirooNBNmp —x—
\ MSU NW 6-
.— 8 ..
2.0 \ MSUlglsnp 9_
Normalized
Execution 1.5 -
ime
1.0 -
05 l l l

 

 

 

4 8
Number of Snooped Busses (N mp)

(60

l

82

76—

70—

 

 

 

 

 

i=3

 

 

 

 

 

Shared Data Cache Hit Rate in %
0

   

 

 

 

 

8 busses

 
    

4

8 16 i 16 busses
Number of Snooped Busses ( N snp)

(b)

Figure 7.7. Effect of various degrees of partial—snooping of MSU when Nnode = 32
on normalized execution time (a) and cache hit rate (b) for Heat benchmark under
default parameters. The dashed line in (a) shows the execution time of DirooNB with
N mp busses for comparison.

106

associativity increases as N,mp decreases for a ﬁxed cache size. The other reason is
that while one node can only cache data from a few clusters at a time, the cache size
for each cachable cluster is larger, which results in a better hit rate if the number
of cache sets is sufﬁcient. With the ability to dynamically cache from all clusters,
the size of the cache is virtually larger with partial-snooping, which is similar to the
effect of virtually larger main memory in a virtual-memory system.

The thrashing on cache sets can be observed in Figure 7.7(b) when Nb,” 2 16 and
Nmp = 2. The number of cache sets is seriously insufﬁcient in this case, so the hit
rate drops to about 30%. It also makes the performance at that point bad, which
can be observed in Figure 7.7(a). This observation indicates that when the value of
Nmp becomes very low, the extra overhead of the cache-set replacement can become
signiﬁcant. It can be observed in Figure 7.7(a) that the MSU: outperforms MSUf6,
which indicates that when the number of snooped busses is insufﬁcient on the MSU
system, increasing the total number of busses may even decrease the hit rate and

performance. Section 7.4.4 will show how memory mapping can solve this problem.

7 .4.3 Impact of Faster Communication

The MSU system uses multiple busses instead of one high—speed, large-bandwidth bus
to reduce communication contention. However, we still studied the impact when the
busses have larger bandwidth by using the Mat and Heat benchmarks. The choice of
only two benchmarks is for brevity; the Gauss benchmark was discarded because of
its uncommonly low hit rate. Figures 7.8 and 7.9 show the performance results for
Mat and Heat benchmarks when the default bus speed is doubled with Nnode = 32.
That is, the latency values of tiny, treq, tpr, twa, tdmv, and tdrmt are reduced to half of
the default values in Table 7.1. Though the execution time changed, the performance
patterns of normalized execution time curves in the ﬁgures are almost identical to the

patterns in the ﬁgures under default bus bandwidth shown in previous subsections,

107

 

Dir, Nnode = 32 6&-
MSU, Nnode = 32 G-

4.0 -

Normalized
Execution 3.0 -
Time

2.0 l-

 

 

 

 

0.5 I l l
2 4 8 16
Number of Busses

(a)

 

Dir, Nnode = 32 —>(—
MSU, Nnodc == 32 9—

4.0 -

Normalized3-0 "

 

 

 

Execution
Time
2.0 _
1.0 h
0.5 1 ' J

 

2 4 8 16
Number of Busses

(b)

Figure 7.8. Comparison of full-snooping MSU and DirooNBNm for different number
of busses when bus communication latency is reduced to half. Normalized execution
time for Mat (a) and Heat (b) benchmarks are shown.

108

 

 

 

 

 

 

2.5
i I \ \ ' DirooNBNanp .x_
\ \ MSng’” <9-
2.0 l" X MSUIG P e_ "
\
\
Normalized \
Execution 1.5 r- \ '
Time \
\
\
10 ~ 5
0.5 ' ' '
2 4 8 16
Number of Snooped Busses (Nsnp)
(a)
2.5 5 l \ l
\ DirooNBNmp —)<-
2 0 _ \ MSUBx’” + _
- \ MSU16‘"” 9-
Normalized
Execution l.5 -
Time
1.0 -
0.5 1 1 '

 

 

 

2 4 8 16
Number of Snooped Busses (Nsnp)

(b)

Figure 7.9. Effect of various degrees of partial-snooping of MSU when the communi-
cation latency is reduced to half. Normalized execution time when Nnode = 32 for Mat
(a) and Heat (b) are shown. The dashed lines show the execution time of DirooN B
with N”? busses.

109

which indicates that the bandwidth of the network has little impact on the relative

performance results, in either a full-snooping or a partial-snooping system.

7 .4.4 Impact of Shorter Directory Access Latency

Two major overheads in a directory scheme are the directory look—up and update
latency, neither of which is required in a snooping scheme. Although we believe the
default values of the directory access latency reﬂect practical numbers, the perfor-
mance of directory schemes with shorter directory access latency is studied. Reducing
the directory access latency further increases the advantage of directory schemes over
the snoopy schemes. Figure 7.10 shows the performance of DirooNB when directory
access time, tdloc and tdrmt, are reduced to half of the default values. The results are
again from Mat and Heat benchmarks when Nnode equals 32. The dashed lines show
the performance of Dir and MSU under default parameters for comparison purposes.
It can be seen that the directory access latency does affect the performance of di-
rectory coherence schemes; however, reducing it by half is not sufﬁcient to allow the
directory scheme to outperform MSU in most cases. In general, the relatively perfor-
mance between Dir and MSU remains the same when the directory access latency is

reduced by half.

7 .4.5 Impact of Memory Placement

The placement of memory may signiﬁcantly affect the performance of a N UMA ma-
chine, including the architecture discussed in this dissertation, due to the different
access latency of local and remote memory [111]. The impact of the memory mapping
for MSU and DirooNB is studied in this subsection. For some applications, such as
the matrix multiplication and the Gaussian elimination, it is difﬁcult to predict what

would be the best memory mapping pattern because there is no obvious method to

110

 

 

r l
’K Fast Dir, NW. = 32 9(—
5-0 ' \ Dir, Nnode = 32 e<- r
MSU, Nnode = 32 {>-
\
4.0 - \ -
Normalized \
Execution 3.0 -
Time
2.0 r-
1.0 -"
0.5

 

 

 

Number of Busses

(a)

 

I
Fast Dir, Nnode = 32 *-
MSU, Nnodc : 32 O—

4.0 "' \ "

Normalized
Execution 3.0 -
Time

2.0 -

1.0 - '
0.5

 

 

 

 

Number of Busses

(b)

Figure 7.10. Execution time of DirooNBme when the directory look up and update
latency is reduced to half for Mat (a) and Heat (b). Dash lines show the execution
time under default parameters for comparison.

111

decide which processor will access a certain datum most frequently. The decision
of memory placement is even more difﬁcult if the application is well load—balanced
among processors—the Mat and Gauss benchmarks for example. In a well load-
balanced application under a shared memory environment, each processor takes a
portion of the whole task at the beginning, and then the remaining subtasks will be
taken one-by-one by processors as they ﬁnish their computation. Since the decision
of task-division is dynamically decided at run time, it would be very difficult to find
a good memory mapping pattern beforehand. Dynamically migrating memory pages
among nodes might gain performance for these applications, but the migration pro-
cess becomes overhead itself. The issue of page migration on cache—coherent systems
has been studied by Gupta et al. [112], also via the Tango simulation system.

A different way has been chosen to study the memory placement problem by
specifying the static location of a certain memory space in the programs, which is the
trend supported by many high-level parallel languages. The heat transfer problem is
a good example for studying memory mapping because each processor can compute
the temperature of one block in the working domain and the temperature data of
the block can be stored locally. Only the data of boundary elements of the block are
shared by neighboring processors, and one processor only shares temperature data
with at most four neighboring processors. The Heat benchmark used previously does
let each processor compute one block of the domain and share data with neighbors,
but no particular mapping was done. The matrix elements were placed by row—
major in consecutive pages, so one processor may need data stored in several different
nodes. This row-major memory mapping was chosen so we could study the effect of
insufﬁcient snooped busses on MSU as shown in Figure 7.7. It is, in fact, an example
of a program which should perform poorly on the MSU system. To study memory
mapping, the Heat benchmark was modiﬁed into two new versions—mapped-Heat

and optimally-mapped-Heat. Both programs store the memory of one matrix block

112

consecutively in one node, but the mapped-Heat doesn’t guarantee the memory will
be stored locally in the node that computes the block, while optimally-mapped-Heat
does.

Figure 7.11 shows the execution time of the new versions of the Heat benchmarks.
To better indicate the relationship between different algorithms, the actual execution
time instead of normalized time is used. In this case scaling was unnecessary since
the number of nodes is ﬁxed. Figure 7.11(a) shows the execution time of full snooping
MSU and the DirooNB. Compare the results with the performance of previous row-
major mapped Heat shown in Figure 7.4(b). Observe that after memory mapping the
MSU always outperforms the DirooNB, even though the new versions generate higher
hit rates so the directory access latency becomes less signiﬁcant in DirooN B. The di-
rectory scheme gains more in changing from mapped-Heat to optimally-mapped-Heat
due to high locality leading to both faster data and directory accesses. Figure 7.11(b)
shows the performance of partial-snooping MSU under various Nsnp. Since in the new
versions each node will only need temperature data from a few other nodes, reduc-
ing the value of Nam, to four doesn’t increase the execution time, which is especially
obvious in the optimally-mapped-Heat. However, when Nsnp = 2 the insufﬁciency of

the snooped busses increases the execution time on the MSU system.

7 .4.6 Impact of Larger Cache Size

We chose a small cache size of 4K bytes per node in the simulator to match our choice
of small benchmarks so simulations could complete in reasonable time. The impact of
larger cache memory is studied in this subsection. Figure 7.12 shows the comparison
of full—snooping MSU and DirooNB when the cache size is increased to 16K bytes per
node for Mat and Heat benchmarks. The performance pattern is the same as that
for 4K bytes cache size (Figures 7.3 and 7.4(b)), but the directory scheme performs

worse when the number of busses is small.

 

 

113

 

 

 

 

 

 

 

 

 

 

 

l r
6-0 " >3 Dir, map. -x-— _
\ Dir, opt. map. -)<—
\ MSU, map. (>-
5.0 _ MSU, opt. map. 9— -
Execution
Time 4.0 -
(in million
pcycles)
2.0 -
10 l l 1 ~
2 4 8 16
Number of Busses
(a)
4? r r N
5.0 P MSUSNWD, map. ‘9— q
U "W (>—
\ MS 16 ,map.
MSUgNmp, opt. map. '9—
4.0 - (3 \ MSUIGMP, opt. map. 9— ‘l
Execution \ \
Time 3.0 - \ -
(in million \
c cles \
p y ) 2.0 _ \ __
1.0 - V — _ _ — — ﬁg
1
2 4 8 16

Number of Snooped Busses (Nsnp)

(b)

Figure 7.11. Impact of memory placement on memory-mapped Heat benchmark when
NW1.3 = 32. (a) shows the comparison of full-snooping MSU and DirooNB on different
number of busses. (b) shows the effect of various degrees of partial-snooping of MSU.

114

 

 

 

 

 

 

 

 

I r
10.0 -' Dir, Nnode = 32 ->(—— ‘
MSU, Nnode = 32 9—
8.0 -
Normalized
Execution 6.0 -
Time
4.0 -
2.0 -
0.5 .1. 1 ' I
2 4 8 16
Number of Busses
(a)
6.0 [ l I
5 0 _ Dir, Nnode = 32 —><—— _
' MSU, Nnode 1‘ 32 9—
4.0 - -
Normalized
Execution
Time 3-0 h “
2.0 — _
1.0 -‘ . /?
l 1 I

 

 

 

0.5
2 4 8 16
Number of Busses

(b)

Figure 7.12. Performance of full-snooping MSU and DirooNBNbM under various num-
ber of busses when cache size is increased to 16K bytes per node for Mat (a) and

Heat (b) benchmarks.

115

When the cache size is larger, the insufﬁciency of snooped busses on MSU affects
the performance more signiﬁcantly because the cache-set write back is more expensive
now—more dirty cache lines may need to be written back to memory in a set replace-
ment. In addition, frequent changing of the snooped clusters dominates the larger
cache size. Figures 7.13 and 7.14 show the performance of the MSU under various
degrees of partial-snooping when the cache size is 16K bytes for Mat and Heat, re-
spectively. The hit rate ﬁgures illustrate that the best hit rates exist at full-snooping,
not at a certain degree of partial-snooping any more due to the above reasons.

However, partial-sn00ping still can beneﬁt from larger caches in many situations.
As we have mentioned, the problem size of our benchmarks is small in comparison
to real world problems. Figure 7.15 shows the performance of MSU with 16K bytes
caches under various Nsnp when the size of the matrices in Mat becomes 200 x 200.
When Nbu, is sixteen, the best performance can be found at Nsnp = 8 again, not at
the full-snooping. The hit rates (Figure 7.15) also reflect the performance trend.

Furthermore, partial-snooping can beneﬁt from larger caches when a better mem-
ory placement is adopted. Figure 7.16 shows the performance of MSU with 16K bytes
caches under various Nsnp when optimally-mapped-Heat is executed. The best hit

rates are found when the number of snooped busses is four for both 8 and 16 busses.

7 .4.7 Performance on Various System Sizes (Scalability)

The performance of MSU and DirooNB under various system sizes, from 8 to 64
processors, is shown in this subsection to examine scalability. In our experiments, the

number of busses is half of the number of processors (N51,, 2 %Nnode). For MSU, the

number of snooped busses is half of the number of busses (NW, 2 %Nbu, = anode).

4
Figure 7.17(a) shows the results from default Mat and Mat with larger matrix size
and 16K bytes caches, and Figure 7.17(b) shows the results from Heat with better

memory mapping. For default Mat and Heat, the difference between Dir and MSU is

116

 

 

 

 

 

 

I I 9K .
2.5 - \ DlrooNBNmP -)<— _
MSW: X
2.0 - 316 -
\
Normalized \
Execution \
Time 1-5 ' \ _
\x
1.0 - \>
05 l l l
2 16

8
Number of Snooped Busses (N Mp)

82—

80‘

78‘

76—

740

   

 

 

 

 

 

 

 

Shared Data Cache Hit Rate in %
0

 

 

 

 

78 busses

6 / 16 busses

 

4
8
Number of Snooped Busses (Nsnp) 1

(b)

Figure 7.13. Effect of various degrees of partial—snooping of MSU when cache size is
increased to 16K bytes per node. Normalized execution time (a) and cache hit rate
(b) for Mat when Nnode = 32 are shown. The dashed line in (a) shows the execution
time of DirooNB with N,,,,, busses.

117

 

 

 

 

 

 

5.0 " DirooNBNsnp ‘X— _'
MSUsz‘W’ 49—
4_0 _ MSUIG‘” 9“ —
Normalized
Execution 3 0 _ _
ime '
2.0 - _
1.0 r m 3
0.5 l l I
2 16

 

 

 

   

 

 

 

 

Shared Data Cache Hit Rate in %
8

 

 

 

E
ﬁ/ 8 busses

16 / 16 busses

 

8
Number of Snooped Busses (

(b)

N snp )

Figure 7.14. Effect of various degrees of partial-snooping of MSU when cache size is
increased to 16K bytes per node. Normalized execution time (a) and cache hit rate
(b) for Heat when dee = 32 are shown. The dashed line in (a) shows the execution
time of DirooNB with Nmp busses.

Normalized 1'5 -

 

118

 

 

 

DirooNBNmp .x_ _

N
MS 8 ’"P e—
\ Maw <+

\

\

 

 

 

 

 

 

 

  

 

 

 

 

Execution
Time
1.0 '
05 l l l
2
Number of Snooped Busses (N sup)
(3)
82 —/
,,\° 80 —
c
.8 78 M/
‘5
93 76 —
'3‘:
g 74 ‘
g 72 —/
£3 70 —/
8
t, 68 —
9
g 66 - 1
64 ﬂ '8 busses

4

a
Number of Snooped Busses (Nsnp) 1

(b)

 

 

6 16 busses

Figure 7.15. Impact of larger problem size with larger cache size. Normalized exe-
cution time (a) and cache hit rate (b) for Mat with 200x200 matrices size and 16K
bytes cache when Nnode = 32 are shown.

119

 

 

 

 

 

I
>\ DirooNBNmp, opt. map. —x—
3.0 P Nu, -
\ MSUW ,opt. map. '9—
\ MSUIG‘”, opt. map. 9—
Normalized
Execution '0 —
Time
1.0 -
0.5 ' ' l

8
Number of Snooped Busses (Nmp)

92~
90~
88~
aaf
a4—
82~
so—
78—
76—
74~
72—
7o—

 

 

 

 

 

Shared Data Cache Hit Rate in %

 

 

 

 

 

 

    

8 busses

    

4
6 16 busses

a
Number of Snooped Busses (Nsnp) 1

(b)

Figure 7.16. Impact of larger cache size and better memory mapping. Normalized
execution time (a) and cache hit rate (b) for Heat with optimum memory mapping
and 16K bytes cache when Nnode = 32 are shown.

120

 

l I

160 P 16K MSU (200x200) @— ‘
140 - 16K Dir (200x200) —x— _
MSU (100x100) @—

Dir (100x 100) -X—

 

 

 

 

 

 

 

 

 

 

 

 

120 -
Execution 100 _
Time
(in million 80 -
pcycles)
40 *-
20 P
0
Number of Processors (Nnode)
(a)
_ r I _
6'0 O MSU, map. (>—
Dir, map. -x—
5.0 _ \ MSU, opt. map. 9— ..
\ Dir, Opt. map. —><—
, 4.0 - E
Execution
Time
(in million 3.0 '- ’ ____ .8
pcycles) "" ,._. I,
2.0 - d
1.0 —- T T T * " *;¢>
0.0 ' ‘ ‘
8 16 32 64

Number of Processors (Nnode)

(b)

Figure 7.17. Performance of MSU and Dir under various number of processors (Nnode).
N5“, is set to be %Nnodc; and, for MSU, Nmp = iNnode. (a) is from Mat benchmark,
and (b) is from memory-mapped Heat.

121

not signiﬁcant when the number of processors is 32 or larger (Heat is not shown on
the ﬁgure to avoid clutter). When there are eight processors for all applications, the
execution time of MSU is always greater than the directory scheme because using only
the two snooped busses is not enough to snoop on the working sets of the applications.
For larger size Mat and Heat with memory mapping, on more than 8 processors, the
execution time of MSU drops fast, while the execution time of the directory scheme

decreases slowly or even increases for more processors and busses.

7 .5 Mean-Value Performance Analysis

Approximate mean-value performance analysis has successfully evaluated the perfor-
mance of some cache coherent multiprocessors [7, 44, 113, 114]. In general, they
developed an equation for the mean processor response time based on the mean value
equations of bus waiting time, memory interference and cache interference. In most
cases direct calculation is difﬁcult because these equations recursively depend on each
other. The numerical quantities of the mean response times can be derived by putting
some initial values in and then solving the equations iteratively until the results con—
verged.

The mean-value method used in this section to analyze the MSU and the DirooNB
systems is different to the method just mentioned. Since the queueing models of the
networks of the two multiprocessor systems have been solved in Section 5.4, the per-
formance of the multiprocessor can be easily evaluated by applying accurate mean
service times to the processor and the network servers in the two queueing models.
This section shows the calculation of the service times of the queueing servers and
discusses the analytical results. Although the subtle details of the multiprocessors
are missed in the analytical models, this approximate analysis leads to a very efﬁcient

evaluation for large-scale systems. Compared to tens of hours in running one sim-

122

ulation experiment, the time spent in computing one analytical result dramatically

reduced to be within one second.

7.5.1 Workload Parameters

The workload of a system is decided by probabilities and latency constants. The

following basic constants are speciﬁed for the workload model:
a team,D is the mean processor computation time between memory requests.

0 The memory access latencies, including tarb, teache, tiny, treq, twy, trpm, twbl, twbr,
tdinv, tdzoc and tdrmt, have been deﬁned in Table 7.2 for simulation. The same

values are used again in the analysis.
0 Nana/set is the number of cache lines per cache set on the MSU system.
The following probabilities are used in the analysis:

0 pprwwe and pshamd are the probabilities that a memory access is for private and

shared memory, respectively.

0 plow, and premote are the probabilities that a shared memory access locates locally

and remotely, respectively.
0 phit and pm,” are the probabilities of cache hit and miss.

0 pwme and pread are the probabilities of a shared access to be a write access and

a read access, respectively.

0 pea, and pm” are the probabilities that a node owns a cache line exclusively and
non-exclusively, respectively, given that the node issues a write-access on the

line.

123

o prpm is the probability of no vacant space in the cache for a cache-miss access.

A cache line replacement has to be invoked.

o psetJu-t and P30471533 are the probabilities that the cache set of a new cache line
exists and does not exist on the MSU system, respectively, given that the access

needs to invoke cache line replacement.

0 pdmy is the probability that a replaced cache line is dirty.

7.5.2 Mean Times for the MSU System

The mean time between two memory requests issued by a processor, denoted as T,
is the sum of the computation time and the memory access time. The mean memory
access time is divided into private memory access time, TPn-vate, and shared memory

access time, Tsharcd. The equation for T is

T : tcomp + pprivate X Tprivate + pshared X Tshared- (7'1)

As in the simulation experiments, private accesses are always assumed to be cache
hits, thus

Tprivate : teache- (72)

The shared accesses can be divided into the time for cache-hit accesses and the
cache-miss accesses. The cache-miss accesses may need to invoke replacement. So

Tshared can be represented as:

Tshared : phitThit + pmiss(Tmiss + prmerpm)- (7.3)

124

The time for a cache-hit access can be derived from

Thit : preadTreadJIit + pwrite(pe.rTe:r:_hit + pn-e:r:Tn_e.r_hit)' (74)

The latencies of T read _;,.-t, Tam“, T max _;,,-t, and Tm,” have been deﬁned in Table 7.3.

They are listed below again for easy reference:

Tread_hit : TexJzit = tcachea (75)
Tn_ea:-hit = tcache + tarb + tinvi and (76)
Tmiss : tcache + tarb + treq + trpy- (77)

Finally, the mean replacement time, Trpm, is the time of tTpm plus the write back
time. The write back time depends on whether the set is hit or not. If hit, there is a
probability that the replaced line is dirty. If not, all lines in a cache set need to be
replaced, and again there is a probability for each line to be dirty. The mean time to
write back a cache line to its main memory can be calculated as (oncaztwbI+Premotetwin).

These discussions of the mean replacement time yield

Trpm = trpm + psetJIitpdirty(placaltwbl + premotetwbr)
+pset-mi33Nlines/setpdirty(placaltwbl 'l' Premotetwbr)

: trpm + pdirty (psetJu't + pset-misleines/set)(plocaltwbl + premotetwa)' (7'8)

Applying all the components of T into Equation 7.1 yields

T : tcomp "l' pprivatetcache + pshared{
Phit [preadtcache + pwrite (pextcache + pn_ea:(tcache + tarb + tinv))] +

pmiss [teache + tarb + treq 'i' trpy + prpmtrpm +

125

Prpmpdirty(pset_hit + Pset_misleines/set)(plocaltwbl + Premotetwbr)]}- (7-9)

To derive the mean service times for the queueing systems in Figure 5.9, the terms
of T are separated into two sets, sproceswr and snetwork. They are the time spent in the
processor and the network, respectively. The smocesm should contain all terms with
teammtcachmtarhtmp, and twbz in T; the snetwork contains all terms with t,,,,,, treq, t,,,,,,

and twbr in T. Based on Equation 7.9, sproccswr and snctwork can be represented as

Sprocessor = tcomp + pprivatetcache + psharcd{
phitLDw-eadtcache + pwrite(pextcache + pn_ex(tcache + tarb))] +
pmiss[tcache + tarb + prpmtrpm +
p'rpmpdirtyplocal(psetJLit + Psez_missszes/sez)twbzl}, and (7-10)
Snetwork = pshared{phitpwritepn-extinv + pmiss[treq + trpy +

prpmpdirtypremote (pset_hit + pset_misleines/set )twbr )] } - (711)

Letting sprocessor be l/pl and snetwork be l/pg allows the queueing model in Fig-

ure 5.9 to generate the performance quantities for the MSU system.

7 .5.3 Mean Times for the Directory-Based System

The equations for the DirooNB system can be derived similarly. Because of the dif-
ferences between the MSU and the DirooNB, some equations have to be changed.

As shown in Table 7.3, due to the overhead of directory update, Tm,” and TRJIJ,
should be redeﬁned as:

Tmiss : tcache + tarb + treq + trpy + placaltdloc + premotetdrmta and (712)

Tn_e:r_hit : tcache + tarb + premotetdrmt + Plocaltdloc +

126

tinv + (”sharing _ 1)tdinv) (7'13)

where rimming is the mean number of processors in which a shared line is cached at
the same time.
Because no set miss will happen in DirooNB, Trpm should be simpliﬁed from Equa-

tion 7.8 into

Trpm : trpm + pdirty(plocaltwbl + premotetwbr)- (7.14)

Apply the new Tmm, TmeLM-t, and T rpm into T shared in Equation 7.3, and then the
T for DirooN B can be derived from Equation 7.1. Finally, the equations of sproccssor

and snetwork for DirooNB can be calculated as:

Sprocessor : tcomp + pprivatetcache + pshared{
Phit [preadtcache + pwrite(tcache + pn_extarb + pn_explocaltdloc)] +

pmiss [teache + tarb + placaltdloc +

prpmtrpm + prpmpdirtyplocaltwbl]}a and (7-15)
Snetwork : pshared{Phitpw1-itepn_ex[premotetdrmt + tinv + ("sharing — 1)tdinv] +
pmiss(treq + trpy + premotetdrmt + prpmpdirtypremotetwbr)}- (716)

As mentioned in Section 7.2, the network of the DirooN B system has the same
bandwidth as the Model 6 in Figure 5.2. Letting 3707068330, and snetwork be l/pl and
1 /p2 of the queueing model in Figure 5.7 can generate the performance quantities for

the DirooNB system.

7.5.4 Parameter Values Used in the Experiments

In the experiments reported in this section, We assumed pprimte = 0.5, phi, = 0.75,

pmad = 0.75, and prpm = 0.8. The pea: was assumed to be the value of (pwrite/nsharmg),

127

where n,haT,-ng was set to be 2.0. According to the Berkeley protocol, Pdirty should
be the same as pm. To represent the spatial locality in memory accesses, plow; and
pseun-t were chosen carefully. The plant was set to be I / (NW1.3 + l), where l is a locality
factor and was assumed to be 10.0 in the experiments. The pmJu-t was assumed to be
the cumulative function of an exponential distribution, and the value was generated
by the function of max(N,np/Nbu,, 1 — e‘ANb“), where /\ was assumed to be 0.5. All
other probabilities can be easily decided by using the above assumptions. All values
of the probability parameters together with the values of 1 — 6‘05wa are listed in
Appendix B for clarity.

The same architectural assumptions and the communication latencies used in
simulation experiments were used again in this analysis. One additional latency,

tcomp, was assumed to be 10.

7 .5.5 Analytical Results
Comparison of Full-Snooping MSU and DirooNB

The analytical result of full-snooping MSU and DirooN B with Nnode = 32 is shown
in Figure 7.18. As in Section 5.4.3, the mean time between two memory requests is
used to represent the ratio of execution time. Like the normalization procedure used
for the simulation results, all performance quantities are divided by the mean time
between requests of MSU12. Observe that the results are very similar to the ones
from simulation—the MSU system can outperform the Dir system, but the difference

is reduced when the number of busses increases.

Various Degrees of Snooping on MSU

Figure 7.19 shows the analytical results for partial-snooping MSU systems when

Nnode = 32. Because a ﬁxed hit rate was used in all analytical experiments, the phe-

128

 

 

 

 

10 | l I

9 ' Dir, NM... 2 32 —><—--l

8 .. MSU, Nnode = 32 G— a

7 _- -
Normalized 6 - -
Mean Time
Between 5 — _
Requests

4 _.

3 .

2 ..

1 ,_ ......

 

 

 

 

Number of Busses

Figure 7.18. Analytical mean time between two memory requests of full-snooping
MSU and DirooNBNm with Nnode = 32 under various number of busses.

 

DirooNBNmp -><—

_

MSUSxZ” -9—
MSU16 P ~9—

-

I

Normalizedg. 5
Mean Time
Between

 

 

 

 

 

Requests 2 - \ \ '-
\
\
1.5 — \ \ _
\

W<

1 ’- v >
0.5 L 1 l

2 4 8 16

Number of Snooped Busses (Nsnp)

Figure 7.19. Analytical mean time between two memory requests for various degrees
of partial-snOOping on MSU when Nnode = 32. The dashed line is for DirooNB with

N mp busses for comparison.

129

nomenon of having the highest hit rate at a certain degree of partial snooping does
not exist. Consequently, the curves are similar to simulation results in Figures 7.13(a)
and 7.14(a) when the cache size is increased in the simulation experiments. However,
we still can observe that when the number of snooped busses is large enough to handle
the working set of the local processor, using more snooping busses cannot improve
system performance.

One major advantage of using analysis instead of simulations is that the analysis
can significantly save the computational time in performance evaluation. Figure 7.20
shows an example of this by measuring a system with Nnode = 128. To generate the
same ﬁgure from simulation, even if the host machine can afford that many active
processes and other resources (such as semaphores), it might takes weeks to run all
experiments on a workstation. In Figure 7.20, as before, the normalization was done
based on the result of the MSUgﬁ. One can easily observe that the difference between

the MSU and Dir is larger than when the system size is smaller.

Performance on Various System Sizes (Scalability)

Figure 7.21 shows the result of using the analytical models to verify the expectation
on the MSU system made in Figure 6.5. The same assumptions used in simulation
experiments (Section 7.4.7) are adopted again. Because the results are from various
numbers of processors, the mean time between requests cannot represent the execution
time relations among systems with different processors. We need to ﬁnd a better way
to represent the execution times using the analytical results.

The method used to transfer the mean time between requests, T, into execution
time is described as follows. Assume on a uniprocessor system the execution time of
a program is mT, where m is the number of memory requests made in the execution.
If the program can be fully parallelized on an N -node multiprocessor, the execution

mT

time on the multiprocessor would be T. However, most programs have serial parts

130

 

 

 

 

 

20 l K I

18 __ DirooNBNmp -x— _
\
16 - \ M30432” +— -
14 ' X MSU32"up @- ‘
Normalized 12 ' \ T
Mean Time
Between 10 ' X ‘
Requests \ \
8 '- \ \ .1
6 ~ \ \ \ -
\x
4 - \\ ..
2 - x i
"L '4 Y '

 

[\D

4 8 16
Number of Snooped Busses (N mp)

Figure 7.20. Analytical mean time between two memory requests for various degrees
of partial-snooping on MSU when Nnode = 128. The dashed line is for DirooNB with
N snp busses for comparison.

that cannot be parallelized. For simplicity, we assume a program can be separated
into a part that can be fully parallelized to any number of processors, and a serial
part that can only be executed sequentially on one processor (the same assumption
was used in developing the famous Amdahl’s Law [115]). Assume p to be the ratio
that a program can be parallelized, the execution time on an N -node multiprocessor
now can be represented as L331 + (1 — p)mT. Since the value of m is a constant for
one program, it can be taken out from the above formula. Therefore, the value of
%T + (1 —- p)T can be used as the execution time for various numbers of processors,
which is the Y axis used in Figure 7.21.

Similar to our expectation and the results by simulation, Figure 7.21 indicates
that the execution time of the MSU system drops faster than the directory-based

system when the number of processors increases.

131

 

 

 

 

 

 

 

 

 

 

 

Dir, p = 0.80 -)<— ‘
MSU, p = 0.80 €—
Dir, p = 0.90 -)(—-
MSU, p = 0.90 G— ‘
Dir, p = 0.95 9(-
MSU, p = 0.95 @-
Execution _
Time
Ratio /
j
/>
0 P l L
8 16 32 64

Number of Processors (Nnode)

Figure 7.21. Performance of MSU and Dir under various number of processors (Nnode).
Nb,” is set to be %Nnode; and, for MSU, N37,,D = iNnode. The p is the ratio that a
program can be parallelized.

7 .6 Summary

In this chapter, at ﬁrst we demonstrated the feasibility of the MSU system by sim-
ulation. Even though the directory scheme were advantageous from both simulator
parameters and benchmark programs, the simulation results are promising. Not only
can the proposed snoopy scheme with full-snOOping perform better than the direc—
tory scheme, but adding the partial-snooping concept can improve the performance in
some instances. Several default parameters were examined to further study the sys-
tem, and we derived consistent results that when a high degree of parallelism can be
achieved from either the software or the hardware, the MSU system can outperform
the directory-based system.

In addition, an effective analysis was developed to calculate the mean delay times

of both the proposed and the compared systems. The method we adopted can be

132

used to evaluate large-scale multiprocessors with dramatically shorter computation
time than simulation. The analytical results have further veriﬁed the expectation and

simulation.

CHAPTER 8

Conclusions

The choices of memory architectures and coherence schemes have drawn attention
in building new shared-memory multiprocessors. The research in this dissertation
presents a new concept of building a distributed shared-memory multiprocessor.
A novel, scalable, snoopy cache-coherent multiprocessor architecture based on the
address-separated multiple bus network has been proposed in this dissertation. To
allow the proposed system to handle more busses without increasing the complexity
of the snooping mechanism, the concept of partial-snooping was introduced—each
processor can snoop on a dynamically changing subset of the busses. The proposed
scheme can enjoy fast memory access and also beneﬁt from greater scalability.

We conclude that the proposed scheme successfully uses the snoopy mechanism

to extend the number of processors in one multiprocessor by the following facts:

0 Memory modules are distributed into nodes to allow fast local memory access

and to avoid network contention and bottleneck.

o The network bandwidth can grow with the number of processors by adding

more busses.

0 Partial snooping design can alleviate the cost and hardware complexity when

the number of busses grows.

133

134

o The multiple busses on the proposed system can be constructed by an optical
network with wavelength-divided channels and tunable devices. The high net-
work bandwidth supported by the state-of-the-art optical technology improves

scalability.

o The proposed system performs better than directory—based systems, especially
when a higher degree of parallelism is supported, such as balanced load, larger
problem size, high reference locality, better memory mapping and a larger num-

ber of processors.

0 Current fast progress in distributed operating systems and parallel compilers
should lead to better load balancing and data placement techniques, which

would make the MSU system even more attractive.
The contributions of this research can be described as follows:

a The claims that prove the equivalence among different multiple bus network
models tremendously simplify the performance evaluation, because complicated

networks can be evaluated using simple ones with identical performance.

0 The concept of partial snooping is a novel idea. The study in this dissertation
shows its feasibility to be used in multiprocessors to simplify the hardware on

large—scale multiprocessors.

o The MSU system indicates a possible way to scale shared-memory multiproces-

SOI‘S.

o The proposed system includes a design exploiting the state-of—the-art, star-

coupled, optical networks with tunable devices.

0 A lowﬁlevel and accurate simulator with several benchmark programs for mul-

tiprocessors was constructed to evaluate the proposed system. The simulator

135

has the potential to be utilized for other research.

0 A simple and effective analytical method was introduced which can evaluate

parallel systems with up to hundreds of processors in seconds.

Several issues have been addressed in this dissertation but were not studied thor-
oughly, and they may be interesting for future study. These include the impact of
the read-only bus on the MSU system, the performance improvement from a weakly
ordered MSU system, and if the busses were implemented optically, the possible
hardware complexity and delays of the conversion between lightwave and electronic
signals. The study of a hierarchical structure with the proposed scheme or a hybrid
with other schemes should be of interest to computer architects. Lastly, the concept
of partial-snooping may even be applied to a directory-based system built on an in-
terconnection networks where each processor can only access a subset of the outside

world at a time.

APPENDICES

APPENDIX A

An MVA Algorithm for a

Network of Queues With a

State-Dependent Server

The MVA solution for the queueing system in Figure 5.7 (Model 6) is shown in this
appendix.

According to the assumptions used in Section 5.4.1, the computation time in each
processor is exponentially distributed with a mean of 1 /p1 , and the service time of a
network server is exponentially distributed with a mean of 1 /,u,2. To derive numerical
results, we have to determine the mean delay spent by a request in the queueng
system when the total number of requests is N, The mean time spent in the processor
servers, 1510'), has been ﬁxed to be 1 / #1 for any i, where i is the number of requests
in the queueing model. Thus this analysis only needs to derive the mean time that a
request spent in the network servers, which is denoted as 1222(N).

The mean service time of the network servers with i requests, denoted as 32(2), is

1 i=1,...,B—1.

52(2) = 5’ (A.1)
_1_ 2': B,...,N.

Buz’

137

Let Pj(i) be the probability of j requests in the network servers when there are
a total of i requests in the queueing model. According to the Arrival Theorem, the
mean waiting time in the network servers that an entering request will experience can

be represented by

i=1

Also, by using Little’s Law, the throughput of the model, Mi), can be derived as
M') i (A 3)

z = —. .
U71 + 152

If a newly arriving request has to wait in the queue of the network servers because
all network servers are busy, the probabilities of j requests in the network servers
could be determined by Pj(z') = j_1(z' — 1). However, this probability has to be

adjusted by a factor of the mean busy ratio of the network servers when j requests

are in, which can be represented by Mr) X 52(3). Hence,
[7,-(i) 2 Mi) x 32(3) >< P,_1(z' — 1), j = 1,. . . ,i. (A.4)

Using Equations A2, A3, and A.4, the MVA solution can be shown in an algo-

rithmic form in Figure A.1.

138

 

 

. Initially, 190(0) 2 1.0.
.Fori=1toNdosteps3to6.
. ID1(Z) = i, 1172(2): 23:1j X 82(j) X Pj_1(2 — 1).

.w): 1'

11714-1172 '

 

. P,(z') = W) x 52(1) >< P,_1(2' — 1), forj = 1,. . . ,j.
. P0(Z)=1— i Pj(2).

i=1

Figure A.1. An MVA algorithm for the analytical solution of Model 6.

 

 

APPENDIX B

Workload Parameters Used in

Analysis

Table B.1 lists the values of all probability parameters that were used in the analytical

experiments in Section 7.5:

Table B.l. Probabilities used in the analytical experiments.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

[ Parameter l Value j
pprivatc 0.5
pshared 0.5
placal ﬂ;
premote 1.0 — m
Phit 0.75
pmiss 0.25
pread 0.75
pwrite 0.25
pea: 0.125
Pu-” 0.875
Prpm 0.8
pset_hit max(’,$—:f, 1 _ e-O-SNb...)
pset-miss I — psetJu't
pdi'rty 0.125

 

 

 

 

139

140

—0.5:z:

The value of function f(:c) = 1 — 6 used in deciding pseuhit is plotted in

Figure B.1.

 

 

0.9

0.8

0.7

0.6
f (1') =

1 _ e—O.51: 0-5
0.4
0.3
0.2
0.1

 

 

 

 

Figure B.1. Function f(:r:) : 1 — 6‘05“”, used in deciding psehm.

 

BIBLIOGRAPHY

[1]

l2]

l3]

[4]

[5]

l6]

[7]

[8]

BIBLIOGRAPHY

K. Gharachorloo, A. Gupta, and J. Hennessy, “Hiding memory latency using

3

dynamic scheduling in shared-memory multiprocessors,’ in Proceedings of the
19th IEEE Annual International Symposium on Computer Architecture, pp. 22

— 33, 1992.

S. Picano, E. D. B. III, and J. E. Hoag, “Programming costs of explicit mem-
ory localization on a large scale shared memory multiprocessor,” in Proc. of

Supercomputing, pp. 36 — 45, 1991.

D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hen-
nessy, “The DASH prototype: Logic overhead and performance,” IEEE Trans-
actions on Parallel and Distributed Systems, vol. 4, Jan. 1993.

A. Agarwal, B.—H. Lim, D. Kranz, and J. Kubiatowicz, “APRIL: A proces-
sor architecture for multiprocessing,” in Proceedings of the 17th IEEE Annual
International Symposium on Computer Architecture, pp. 104 — 114, 1990.

J. R. Goodman, “Using cache memory to reduce processor-memory trafﬁc,” in
Proceedings of the 10th IEEE Annual International Symposium on Computer
Architecture, pp. 124 — 131, 1983.

J. Archibald and J .-L. Baer, “Cache coherence protocols: Evaluation using a
multiprocessor simulation model,” ACM Transactions on Computers, pp. 273—

298, Nov. 1986.

M.-C. Chiang and G. S. Sohi, “Evaluating design choices for shared bus multi-
processors in a throughput-oriented environment,” IEEE Transactions on Com—
puters, vol. 41, pp. 297 — 317, Mar. 1992.

D. Lenoshi, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, “The
directory—based cache coherence protocol for the DASH multiprocessor,” in

Proceedings of the 17th IEEE Annual International Symposium on Computer
Architecture, pp. 148 — 159, 1990.

141

 

 

142

[9] M. Carlton and A. Despain, “Multiple-bus shared-memory system: Aquarius

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

project,” IEEE Computer, pp. 80 — 83, June 1990.

J. R. Goodman and P. J. Woest, “The Wisconsin Multicube: A new large-
scale cache-coherent multiprocessor,” in Proceedings of the 15th IEEE Annual

International Symposium on Computer Architecture, pp. 422 — 431, 1988.

Q. Yang, L. M. Bhuyan, and R. Pavaskar, “Performance analysis of packet-
switched multiple-bus multiprocessor systems,” in Real— Time Systems Sympo-

sium, pp. 170 — 178, IEEE, 1987.

M. Dubois, “Throughput analysis of cache-based multiprocessors with multiple
busses,” IEEE Transactions on Computers, vol. 37, pp. 58 — 70, Jan. 1988.

J. P. Singh et al., “Splash: Stanford parallel applications for shared-memory,”
Tech. Rep. CSL-TR—92-526, Computer Systems Laboratory, Stanford Univer-
sity, June 1992.

J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative
Approach. San Mateo, California: Morgan Kaufman Publishers, Inc., 1990.

T. S. Wailes and D. G. Meyer, “Multiple channel architecture: A new optical in-
terconnection strategy for massively parallel computers,” Journal of Lightwave
Technology, vol. 9, pp. 1702 — 1716, Dec. 1991.

H. Cheong and A. V. Veidenbaum, “Compiler-directed cache management in
multiprocessors,” IEEE Computer, pp. 39 — 47, June 1990.

Z. G. Vranesic, M. Stumm, D. M. Lewis, and R. White, “Hector: A hierarchi-
cally structured shared-memory multiprocessor,” IEEE Computer, pp. 71 — 79,
Jan. 1991.

A. L. Cox and R. J. Fowler, “The implementation of a coherent memory ab-
straction on a NUMA multiprocessor: Experiences with PLATINUM,” in Pro-
ceedings of the 12th ACM Symposium on Operating Systems Principles, pp. 32
— 44, 1989.

R. Bisiani and M. Ravishankar, “PLUS: A distributed shared-memory system,”
in Proceedings of the 17th IEEE Annual International Symposium on Computer
Architecture, pp. 115—124, 1990.

E. Hagersten, A. Landin, and S. Haridi, “DDM—a cache-only memory archi-
tecture,” IEEE Computer, pp. 44 —54, Sept. 1992.

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

143

H. Burkhardt et al., “Overview of the KSRl computer system,” Tech. Rep.
KSR—TR-9202001, Kendall Square Research, Boston, Feb. 1992.

K. Li and R. Schaefer, “A hypercube shared virtual memory system,” in Pro-
ceedings of International Conference on Parallel Processing, pp. I—125 - I—131,
1989.

N. Nitzberg and V. Lo, “Distributed shared memory: A survey of issues and
algorithms,” IEEE Computer, pp. 52 — 60, Aug. 1991.

A. Mohindra and U. Ramachandran, “A survey of distributed shared memory
in loosely-coupled systems,” Tech. Rep. GIT-CC-91 / 01, College of Computing,
Georgia Institute of Technology, Jan. 1991.

A. S. Tanenbaum, M. F. Kaashoek, and H. E. Bal, “Parallel programming using
shared objects and broadcasting,” IEEE Computer, pp. 10 — 19, Aug. 1992.

P. Stenstréim, “A survey of cache coherence schemes for multiprocessors,” IEEE
Computer, pp. 12 — 24, June 1990.

A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz, “An evaluation of di-
rectory schemes for cache coherence,” in Proceedings of the 15th IEEE Annual
International Symposium on Computer Architecture, pp. 280 — 289, IEEE, 1988.

D. Chaiken, C. Fields, K. Kurihara, and A. Agarwal, “Directory-based cache
coherence in large-scale multiprocessors,” IEEE Computer, pp. 49 — 58, June

1990.

D. Gustavson, “The scalable coherent interface and related standards projects,”

IEEE Micro, pp. 10 — 22, Feb. 1992.

V. P. Srini, “Crossbar—multi-processor architecture,” in Cache and Interconnect
Architectures in Multiprocessors (M. Dubois and S. S. Thakkar, eds.), pp. 221
— 243, Kluwer Academic, 1990.

M. Thapar, “Cache coherence for scalable shared memory multiprocessors,”
Tech. Rep. CSL-TR-92-522, Computer Systems Laboratory, Stanford Univer-
sity, May 1992. Ph. D. dissertation.

A. K. Nanda and L. N. Bhuyan, “Design and analysis of cache coherent mul-

tistage interconnection networks,” IEEE Transactions on Computers, vol. 42,

Apr. 1993.

 

[33]

[34]

[35]

[36]

[37]

[38]

[39]

[40]

[41]

[42]

[43]

144

P. Bitar and A. Despain, “Multiprocessor cache synchronization issues, innova-
tions, evolution,” in the 13th International Symposium on Computer Architec-
ture, pp. 424 — 433, Computer Society Press, 1986.

J. Torrelas and J. Hennessy, “Estimating the performance advantages of relax—
ing consistency in a shared-memory multiprocessor,” in Proceedings of Interna-

tional Conference on Parallel Processing, pp. I-26 -— I-34, 1990.

P. Stenstrém, F. Dahlgren, and L. Lundberg, “A lockup-free multiprocessor

cache design,” in Proceedings of International Conference on Parallel Process-

ing, pp. I—246 — I—250, 1991.

C. Scheurish and M. Dubois, “Correct momory operation of cache-based multi-
processors,” in Proceedings of the 14th IEEE Annual International Symposium
on Computer Architecture, pp. 234 - 243, 1987.

C. Scheurich and M. Dubois, “Lockup-free caches in high-performance multi—

processors,” Journal of Parallel and Distributed Computing, pp. 25 — 31, 1991.

A. W. Wilson Jr. and R. P. LaRowe Jr. , “Hiding shared memory reference
latency on the galactica net distributed shared memory architecture,” Journal
of Parallel and Distributed Computing, vol. 15, pp. 351 — 367, 1992.

K. Gharachorloo, D. Lenoshi, J. Landon, P. Gibbons, A. Gupta, and J. Hen-
nessy, “Memory consistency and event ordering in scalable shared-memory mul-
tiprocessors,” in Proceedings of the 17th IEEE Annual International Symposium
on Computer Architecture, pp. 15 — 26, 1990.

J.—L. Baer and R. N. Zucker, “On synchronization patterns in parallel pro—

grams,” in Proceedings of International Conference on Parallel Processing,

pp. II—60 — II—67, 1991.

S. Eggers and R. Katz, “Evaluating the performance of four snooping cache
coherency protocols,” in Proceedings of the 16th IEEE Annual International
Symposium on Computer Architecture, pp. 2 — 15, 1989.

R. Katz, S. Eggers, D. A. Wood, C. Perkins, and R. G. Sheldon, “Implementing
a cache consistency protocol,” in Proceedings of the 12th IEEE Annual Inter-
national Symposium on Computer Architecture, pp. 275 — 283, IEEE, 1985.

J. K. Archibald, “A cache coherence approach for large multiprocessor systems,”

in Proceedings of unkonwn, pp. 336 - 345, 1988.

 

[44]

[45]

[46]

[47]

[48]

[49]

[50]

[51]

[52]

[53]

[54]

145

M. K. Vernon, E. D. Lazowska, and J. Zahorjan, “An accurate and efﬁcient
performance analysis technique for multiprocessor snooping cache-consistency
protocols,” in Proceedings of the 15th IEEE Annual International Symposium
on Computer Architecture, pp. 308 — 315, 1988.

A. W. Wilson Jr. , “Hierarchical cache/ bus architecture for shared memory mul-
tiprocessors,” in Proceedings of the 14th IEEE Annual International Symposium
on Computer Architecture, pp. 244 — 252, 1987.

D. R. Cheriton, H. A. Goosen, and P. D. Boyle, “Multi-level shared caching
techniques for scalability in VMP-MC,” in Proceedings of the 16th IEEE Annual
International Symposium on Computer Architecture, pp. 16 —- 24, 1989.

C. Erickson, Design and Evaluation of a Hierarchical Bus Multiprocessor. PhD
thesis, Dept. of Electrical Engineering, Michigan State Univ., 1991.

Q. Yang, G. Thangadurai, and L. N. Bhuyan, “Design of an adaptive cache
coherence protocol for large scale multiprocessors,” IEEE Transactions on Par-
allel and Distributed Systems, May 1992.

L. Lamport, “How to make a multiprocessor computer that correctly executes
multiprocess programs,” IEEE Transactions on Computers, pp. 690 — 691,
Sept.1979.

J. R. Goodman, “Cache consistency and sequential consistency,” Tech. Rep. 61,

SCI Committee, Mar. 1989.

M. Dubois, C. Scheurich, and F. Briggs, “Memory access buffering in multipro-
7

cessors,’ in Proceedings of the 13th IEEE Annual International Symposium on
Computer Architecture, pp. 434—442, 1986.

T.-S. Jon, “A comparison among different memory consistency ordering mod-
els,” tech. rep., Computer Science Department, Michigan State'University, Mar.

1992. Answer for one comprehensive exam question assigned by Lionel M. Ni.

M. Dubois and C. Scheurich, “Memory access dependencies in shared-memory
multiprocessors,” IEEE Transactions on Software Engineering, pp. 660-673,
June 1990.

R. N. Zucker and J .-L. Baer, “A performance study of memory consistency
models,” in Proceedings of the 19th IEEE Annual International Symposium on
Computer Architecture, pp. 2 -— 12, 1992.

[55]

[56]

[57]

[58]

[59]

[60]

[61]

[62]

[63]

[64]

[65]

[66]

146

T. Lang, M. Valero, and I. Alegre, “Bandwidth of crossbar and multiple-bus
connections for multiprocessors,” IEEE Transactions on Computers, pp. 1227
— 1234, Dec. 1982.

M. A. Marson and M. Gerla, “Markov models for multiple bus multiprocessor
systems,” IEEE Transactions on Computers, pp. 239 - 248, Mar. 1982.

K. B. Irani and I. H. Onyiiksel, “A closed-form solution for the performance

analysis of multiple-bus multiprocessor systems,” IEEE Transactions on Com—
puters, pp. 1004 — 1012, Nov. 1984.

O. Ganz and I. Chlamtac, “Analysis of multiple—bus synchronous and asyn-

chronous communication systems,” Performance Evaluation, vol. 5, pp. 1 — 13,

1985.

T. N. Mudge and H. B. Al-Sadoun, “A semi-markov model for the performance
of multiple-bus systems,” IEEE Transactions on Computers, pp. 934 — 942, Oct.
1985.

D. Towsley, “Approximate models of multiple bus multiprocessor systems,”

IEEE Transactions on Computers, pp. 220 — 228, Mar. 1986.

T. N. Mudge, J. P. Hayes, and D. Winsor, “Multiple bus architecture,” IEEE
Computer, vol. 20, pp. 42 — 48, June 1987.

L. N. Bhuyan, Q. Yang, and D. P. Agrawal, “Performance of multiprocessor
interconnection networks,” IEEE Computer, pp. 25 — 37, Feb. 1989.

A. Hopper, A. Jones, and D. Lioupis, “Multiple vs wide shared bus multipro-

7

in Proceedings of the I6thIEEE Annual International Symposium on
Computer Architecture, pp. 300 — 306, 1989.

cessors,’

Q. Yang and L. N. Bhuyan, “Performance of multiple-bus interconnections for
multiprocessors,” Journal of Parallel and Distributed Computing, vol. 8, pp. 267
— 273, 1990.

W.-T. Chen and J .-P. Sheu, “Performance analysis of multiple bus intercon-
nection networks with hierarchical requesting model,” IEEE Transactions on
Computers, vol. 40, pp. 834 — 842, July 1991.

T. Lang, M. Valero, and M. A. Fiol, “Reduction of connections for multibus
organization,” IEEE Transactions on Computers, vol. C-32, pp. 707 — 716, Aug.
1983.

 

[67]

[68]

[69]

[70]

[71]

[72]

[73]

[74]

[75]

[76]

[77]

147

T.-P. Lee and C.-E. Zah, “Wavelength-tunable and single-frequency semicon-
ductor lasers for photonic communications networks,” IEEE Communication
Magazine, pp. 42 — 52, Oct. 1989.

M. S. Goodman, “Multiwavelength networks and new approaches to packet
switching,” IEEE Communications Magazine, pp. 27 — 35, Oct. 1989.

P. W. Dowd, “Wavelength division multiple access channel hypercube processor
interconnection,” IEEE Transactions on Computers, vol. 41, pp. 1223 — 1241,
1992.

D. M. Chiarulli, S. P. Levitan, and R. G. Melhem, “Optical bus control for
distributed multiprocessors,” Journal of Parallel and Distributed Computing,
vol. 10, pp. 45 — 54, 1990.

S. Thakkar, M. Dubois, A. T. Laundrie, and G. S. Sohi, “Scalable shared-
memory multiprocessor architectures,” IEEE Computer, pp. 71 — 74, June 1990.

Y. Birk, F. A. Tobagi, and M. E. Marhic, “Bus-oriented interconnection topolo-

,

gies for single-hop communication among multi—transceiver stations,’ in Proc.

INFOCOM, pp. 558—567, 1988. New Orleans, Louisiana, Mar. 29-31.

J. Kilian, S. Kipnis, and C. E. Leiserson, “The organization of permutation
architectures with bused interconnections,” IEEE Transactions on Computers,

vol. 39, pp. 1346 — 1358, Nov. 1990.

P. K. McKinley and J. W. S. Liu, “Group communication in multichannel net-

works with staircase interconnection topologies,” in Proceedings of ACM SIG-

COMM, (Austin, TX), 1989.

M. A. Marsan and D. Roffinella, “Multichannel local area network protocols,”
IEEE J. on Selected Areas in Comm., vol. SAC-1, no. 5, pp. 885—897, Nov.
1983.

I. Chlamtac and A. Ganz, “Design and analysis of very high speed network
architectures,” in Proc. GLOBECOM, pp. 927—931, 1986. Houston, Texas,
Dec. 1-4.

B. P. Mohanty and T. D. Todd, “Connection-based media access for multichan-
nel local and metropolitan area networks,” in Proc. INF OCOM, pp. 226—233,
1988. New Orleans, Louisiana, Mar. 29-31.

[78]

[79]

[80]

[81]

[82]

[83]

[84]

[85]

[86]

[87]

[88]

148

L. V. Kalé, “Lattice-mesh: A multi-bus topology,” in Proc. International Con-
ference on Parallel Processing, pp. 700—702, 1985. St. Charles, Illinois, Aug.
20-23.

P. W. Dowd and K. J abbour, “Spanning multiaccess channel hypercube inter-
connection,” IEEE Transactions on Computers, vol. 37, no. 9, pp. 1137-1142,
1988.

M. M. N assehi, F. A. Tobagi, and M. E. Marhic, “Fiber optic conﬁgurations for
local area networks,” IEEE Transactions on Computers, pp. 941 — 949, Nov.
1985.

Y.-K. M. Lin, D. R. Spears, and M. Yin, “Fiber-based local access network
architectures,” IEEE Communication Magazine, pp. 64 — 73, Oct. 1989.

I. P. Kaminow, “Photonic multiple-access networks: Topologies,” ATE T Tech-
nical Journal, pp. 61 —71, March/April 1989.

F. W. Scholl and M. H. Coden, “Passive optical star systems for ﬁber optic local
area networks,” IEEE Journal on Selected Areas in Communications, vol. 6,

July 1988.

J. W. Reedy and J. R. Jones, “Methods of collision detection in ﬁber optic
CSMA/CD networks,” IEEE Journal on Selected Areas in Communication,
vol. SAC-3, pp. 890 — 896, Nov. 1985.

C. A. Brackett, “Dense wavelength division multipexing networks: Principles

and applications,” IEEE Journal on Selected Areas in Communications, vol. 8,

pp. 948 — 964, Aug. 1990.

H. Kobrinski and K.-W. Cheung, “Wavelength—tunable optical ﬁlters: Appli-
cations and technologies,” IEEE-Communications Magazine, pp. 53 — 63, Oct.
1989.

P. E. Green and R. Ramaswami, “Direct dection lightwave systems: Why pay
more?,” IEEE LCS, pp. 36 — 49, Nov. 1990.

H. Kobrinski, M. P. Vecchi, M. S. Goodman, E. L. Goldstein, T. E. Chapuran,
J. M. Cooper, C. E. Zah, and S. G. Menocal, “Fast wavelength-switching of

laser transmitters and ampliﬁers,” IEEE Journal of Selected Areas in Commu-
nications, vol. 8, pp. 1190 — 1202, Aug. 1990.

149

[89] K. L. Walker, “Directions in optical ﬁbers,” AT€§ T Technical Journal, vol. 69,

Novenber/ December 1990.

[90] E. Arthurs, M. S. Goodman, H. Kobrinski, and M. P. Vecchi, “HYPASS: An

[91]

[92]

[93]

[94]

[95]

[96]

[97]

[98]

[99]

[100]

optoelectronic hybrid packet-switching system,” IEEE Journal of Selected Areas
in Communications, vol. 6, pp. 1500 — 1510, 1988.

E. Arthurs et al., “Multiwavelength optical crossconnect for parallel processing
computers,” Electronic Letters, vol. 24, pp. 119 ~120, 1988.

P. K. McKinley, “Lightwave multichannel networks with grid-based topologies,”
in Proc. International Phoenix Conference on Computers and Communications,
(Scottsdale, Arizona, Mar. 27-30), pp. 506—512, 1991.

A. S. Acampora, “A multichannel multihop local lightwave network,” in Proc.

GLOBECOM, pp. 1459—1467, 1987. Tokyo, Japan, Nov. 15-18.

K. Hwang, P.-S. Tseng, and D. Kim, “An orthogonal multiprocessor for parallel
scientiﬁc computations,” IEEE Transactions on Computers, pp. 47 — 61, Jan.

1989.

H. Davis, S. R. Goldschmidt, and J. Hennessy, “Multiprocessor simulation and
tracing using Tango,” in Proceedings of International Conference on Parallel

Processing, pp. II—99 — II—107, 1991.

S. Goldschmidt and H. Davis, “Tango introduction and tutorial,” Tech. Rep.
CLS-410, Computer Systems Laboratory, Stanford University, Feb. 1991.

M. Dubois and F. Briggs, “Effects of cache coherency in multiprocessors,” IEEE
Transactions on Computers, pp. 1083—1099, Nov. 1982.

P. Bitar, “A critique of trace-driven simulation for shared-memory multiproces-
sors,” in Cache and Interconnect Architectures in Multiprocessors (M. Dubois

and S. S. Thakkar, eds.), pp. 37 — 52, Kluwer Academic, 1990.

E. J. Koldinger, susan J. Eggers, and H. M. Levy, “On the validity of trace-
driven simulation for multiprocessors,” in Proceedings of the 18th IEEE Annual

International Symposium on Computer Architecture, pp. 244 — 253, 1991.

T.-S. Jon and R. Enbody, “Address-separated multibus architecture,” in Pro-
ceedings of Int ’l Conference on Parallel and Distributed Systems, (Hsinchu, Tai-
wan, R. O. C.), pp. 82 — 89, Dec. 1992.

[101]

[102]

[103]

[104]

[105]

[106]

[107]

[108]

[109]

[110]

[111]

150

J. D. C. Little, “A proof for the queueing formulal = Aw,” Operations Research,
vol. 16, pp. 651—665, 1961.

M. Reiser and S. S. Lavenberg, “Mean-value analysis of closed multichain queue-

ing networks,” Jounal of the ACM, vol. 27, pp. 313—322, Apr. 1980.

S. S. Lavenberg and M. Reiser, “Stationary state probabilities of arrival instants

for closed queueing networks with multiple types of customers,” Journal of

Applied Probability, vol. 17, pp. 1048—1061, 1980.

T.-S. Jon and R. Enbody, “A scalable snoopy coherence scheme on distributed
shared-memory multiprocessors,” in Proceedings of Supercomputing ’92, pp. 652

— 660, IEEE Computer Society Press.

T.-S. Jon and R. Enbody, “Using optical networks on multiprocessors,” Tech.
Rep. CPS-93-9, Dept. of Computer Science, Michigan State University, Apr.
1993.

J. K. Bennett, J. B. Carter, and willy Zwaenepoel, “Adaptive software cache
management for distributed shared memeory architecture,” in Proceedings of the
17th IEEE Annual International Symposium on Computer Architecture, pp. 125
— 134, 1990.

S. J. Eggers and R. H. Katz, “Characterization of sharing in parallel programs
and its applicability to coherency protocol evaluation,” in Proceedings of the
15th IEEE Annual International Symposium on Computer Architecture, pp. 373
— 382, 1988.

T.-S. Jon and R. Enbody, “Scalable snoopy vs. directory cache coherence
schemes,” submitted, 1993.

C. Dubnicki and T. J. LeBlanc, “Adjustable block size coherent caches,” in
Proceedings of the 19th IEEE Annual International Symposium on Computer
Architecture, pp. 170 — 180, 1992.

M. Annaratone and R. Ruhl, “Performance measurements on a commercial
multiprocessor running parallel code,” in Proceedings of the 16th IEEE Annual
International Symposium on Computer Architecture, pp. 307 - 314, 1989.

J. Ramanathan and L. M. Ni, “Critical factors in NUMA memory manage-
ment,” in Proc. of 11th Intl. Conf. on Distributed Computeing Systems, pp. 500
— 507, May 1991.

151

[112] A. Gupta, T. Joe, and P. Stenstr6m, “Comparative performance evaluation of
cache-coherent NUMA and COMA architectures,” Tech. Rep. CSL-TR-92-524,
Computer Systems Laboratory, Stanford University, May 1992.

[113] S. T. Leutenegger and M. K. Vernon, “A mean-value performance analysis of

a new multiprocessor architecture,” in Proc. Conference on Measurement and
Modeling of Computer Systems, 1988.

[114] R. P. LaRowe Jr. , C. S. Ellis, and M. A. Holliday, “Evaluation of N UMA mem-
ory management through modeling and measurements,” IEEE Transactions on

Parallel and Distributed Systems, vol. 3, pp. 686 —701, 1992.

[115] G. M. Amdahl, “Validity of the single-processor approach to achieving large
scale computing capabilities,” in AFIPS Conference Proceedings, vol. 30,
pp. 483 — 485, AFIPS Press, 1967.

MICHIGAN STATE UNIV. LIBRQRIES
lllllllllllllllllllllllllllllllllllllllllllllll
31293009045810