DESIGN AND EVALUATION OF A HIERARCHICAL BUS MULTIPROCESSOR

By

Carl Burton Erickson

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering

1991

Abstract

Design and Evaluation of a Hierarchical Bus Multiprocessor

By Carl Burton Erickson

This thesis describes the design and evaluation of the protocol and controllers which maintain memory coherence in a hierarchical bus multiprocessor. The architecture studied represents a natural step in the evolution of bus-based shared memory multiprocessors. Extending the scalability of these machines beyond the current single bus limit of 20 processors preserves the inherent advantages and familiarity of the bus interconnect and the shared memory programming paradigm.

Cache memory is vitally important in bus-based multiprocessors for reducing memory latency and for conserving bus bandwidth. A new cache coherence protocol, designed for hierarchical buses, is used to solve the attendant problem of cache and memory coherence. A design and debugging tool based on Petri net simulation was developed to evaluate detailed models of the controllers implementing the cache coherence protocol. The Petri net simulator is completely general, but is particularly adept at representing concurrency and synchronization, a necessity for modeling parallel computer architectures.

Performance evaluation of the architecture under study was accomplished with object-oriented discrete event simulation. The overall performance of a bus-based multiprocessor is heavily dependent on the performance of cache memory, since the system buses are potential bottlenecks. Two simulation models were developed. The first, a purely probabilistic model, can be used to quickly explore a wide range of system configurations and parameters. A second, more detailed, trace-driven model accurately represents the activities and state of a single cluster of processors. Trace-driven simulation is necessary in evaluating the hierarchical bus multiprocessor since the performance of cache memory is highly dependent on the pattern of memory accesses. A method of gathering architecture independent multiprocessor address traces on a conventional uniprocessor was developed to facilitate trace-driven simulation.

Simulation indicates the hierarchical bus architecture increases the ultimate size of bus-based multiprocessors by nearly an order of magnitude, to approximately 200 processors. Detailed Petri net modeling and simulation suggests the feasibility of the controllers and the correctness of the cache coherence protocol.

To my parents, and Mary Caroline O'Neill

Acknowledgments

This research and the degree it represents would not have been possible were it not for my dual thesis advisors, Dr. Mani Azimi and Professor Lionel Ni. The time spent with Mani in front of workstations, at the chalk board, and over numerous dinners has been distilled into this thesis. For all the knowledge I have absorbed, both technical and non-technical, and for our friendship, I am most grateful.
Dr. Ni has been most generous with his time in the final year and a half of my degree; I have benefited in many ways from his considerable experience and knowledge. I also appreciate the assistance and interest of my other committee members, Dr. Richard Enbody, Dr. Erik Goodman, Dr. John Kreer, Dr. Habib Salehi, and Dr. Anthony Wojcik. As chairman of the Electrical Engineering department when I started at Michigan State, Dr. Kreer helped me in many ways, the last being to serve on my thesis committee. Erik gave generously of his time in reviewing my research, and provided financial support through my long and pleasant association with the Case Center. Working with the numerous people who have been a part of the Case Center for the last five years has been an enjoyable and educational experience. In particular Jackie Carlson, Leslie Hoppensteadt, and Homayoun Torab have made Michigan State a happier place for me.

This research has been supported by the National Science Foundation under grant MIPS 8811815 and by the College of Engineering.

Finally, as the most important people in my life, I acknowledge with love the support of my extended family and in particular my wife, Mary Caroline.

Table of Contents

List of Tables ..... x
List of Figures ..... xi

1 Introduction ..... 1
1.1 Need for Parallel Processing ..... 1
1.2 Elements of Parallel Computers ..... 2
1.2.1 Processors ..... 3
1.2.2 Interconnection network ..... 3
1.2.3 Memory ..... 4
1.3 Cache Coherence ..... 5
1.4 Single Bus Multiprocessor ..... 5
1.5 Hierarchical Bus Multiprocessor ..... 6
1.6 Performance Evaluation ..... 8
1.6.1 Queuing networks ..... 8
1.6.2 Discrete event simulation ..... 10
1.7 Thesis Outline ..... 11
1.7.1 Guide to chapters ..... 12
2 Cache Coherence ..... 16
2.1 Definition of Cache Coherence Problem ..... 16
2.2 Example of Cache Coherence Problem ..... 18
2.3 Solutions to the Cache Coherence Problem ..... 20
2.3.1 Software solutions ..... 20
2.3.2 Hardware solutions ..... 21
2.3.3 Single bus snoopy protocols ..... 22
3 Architecture of the HBSM Multiprocessor ..... 25
3.1 Introduction ..... 25
3.2 System Controllers ..... 28
3.2.1 Cache Controller (CC) ..... 30
3.2.2 Memory Controller (MC) ..... 30
3.2.3 Cluster Cache Controller (CCC) ..... 31
3.2.4 Cluster Memory Controller (CMC) ..... 33
3.2.5 CCC/CMC state mapping ..... 35
3.3 Interference Events ..... 36
3.3.1 Collision events ..... 37
3.3.2 Benign events ..... 38
3.3.3 Coherence events ..... 38
3.3.4 CMC and interference events ..... 39
3.3.5 CCC and interference events ..... 39
3.4 Bus Arbitration ..... 41
4 COGI Cache Coherence Protocol ..... 42
4.1 Design of COGI Protocol ..... 42
4.2 COGI Protocol Tables ..... 43
4.3 Example of COGI Protocol ..... 53
4.4 Other Protocols ..... 57
5 Multiprocessor Address Traces ..... 58
5.1 Introduction ..... 58
5.1.1 Tracing techniques ..... 60
5.2 The Traced Application ..... 61
5.3 Trace Generation ..... 64
5.3.1 Expanding assembly language ..... 64
5.3.2 Compile and execute ..... 67
5.3.3 Filter raw traces ..... 67
5.4 Extension of Tracer ..... 70
6 Petri Net Model ..... 73
6.1 Petri Net Modeling ..... 73
6.1.1 Extensions ..... 75
6.1.2 Software packages ..... 76
6.2 High Level Timed Petri Nets ..... 77
6.2.1 Places ..... 77
6.2.2 Directory places ..... 77
6.2.3 Transitions ..... 78
6.2.4 Connectivity ..... 79
6.3 HLTPN Simulator ..... 79
6.4 HBSM Multiprocessor Model ..... 81
6.4.1 Processors ..... 81
6.4.2 Synchronization ..... 83
6.4.3 Bus arbitration ..... 85
6.4.4 Directories ..... 87
6.5 Design and Debugging ..... 90
6.5.1 Correctness ..... 93
7 Discrete Event Simulation ..... 94
7.1 Introduction ..... 94
7.2 Simulation Environment ..... 97
7.2.1 Multiprocessor models ..... 99
7.2.2 System parameters ..... 101
7.2.3 Measures of performance ..... 102
7.2.4 Correctness ..... 104
7.3 Probabilistic Model ..... 107
7.3.1 Single cluster ..... 109
7.3.2 Multiple cluster ..... 115
7.4 Trace-driven Model ..... 121
7.4.1 Intercluster activity ..... 123
7.4.2 Benchmark parallel merge sort ..... 126
7.4.3 Results for 8x8 system ..... 127
7.4.4 Cache geometry ..... 132
7.4.5 Coherence traffic ..... 134
7.5 Summary ..... 140
8 Conclusions and Future Research ..... 143
8.1 HBSM Multiprocessor Architecture ..... 143
8.1.1 Performance improvements ..... 145
8.1.2 System requirements ..... 146
8.2 High Level Timed Petri Net Simulator ..... 147
8.3 Simulation Methodology ..... 148
Bibliography ..... 150

List of Tables

2.1 Important stages in the evolution of snoopy cache coherence protocols. ..... 23
3.1 The state of blocks in the CC. ..... 31
3.2 The state of blocks in the CCC. ..... 32
3.3 The state of blocks in the CMC. ..... 34
3.4 Mapping of CMC states to CCC and CC. ..... 36
3.5 Mapping of CCC states to CMC and CC. ..... 36
3.6 Coherence events detected by the CCC and their resolution. ..... 41
4.1 Transaction types on the cluster bus and processor events. ..... 44
4.2 Transaction types on the global bus. ..... 44
4.3 Actions of CC in response to processor requests. ..... 46
4.4 Actions of CC in response to cluster bus transactions. ..... 47
4.5 Actions of CMC in response to cluster bus transactions. ..... 48
4.6 Actions of CMC in response to global bus transactions. ..... 49
4.7 Actions of CCC in response to cluster bus transactions. ..... 50
4.8 Actions of CCC in response to global bus transactions. ..... 51
4.9 Actions of MC in response to bus transactions. ..... 52
5.1 Tracer library functions for barrier synchronization and tracing. ..... 63
7.1 Technology dependent system parameters for simulation model. ..... 101
7.2 Probabilistic representation of memory accesses. ..... 102
7.3 Techniques of increasing confidence in correctness of models. ..... 106
7.4 Probabilistic parameters for simulation of first model. ..... 108
7.5 Effect of hit ratio on multiprocessor cycles per instruction (MCPI) for single cluster multiprocessor, Experiment 1. ..... 112
7.6 Multiprocessor cycles per instruction (MCPI) for ideal application, Experiment 5. ..... 118
7.7 Multiprocessor cycles per instruction (MCPI) for non-ideal application, Experiment 6. ..... 120
7.8 Cases of intercluster transactions in COGI protocol. ..... 125
7.9 Experiment 7 multiprocessor cycles per instruction (MCPI) data for set size 2. ..... 133
7.10 Experiment 7 distribution of cluster bus busy time. ..... 139

List of Figures

1.1 Generalized two-level hierarchical bus multiprocessor. ..... 7
1.2 Steps in the evolution of bus-based shared memory multiprocessors. ..... 13
2.1 Simple multiprocessor scenario to demonstrate cache coherence problem. ..... 19
3.1 Components of the Hierarchical Bus Shared Memory (HBSM) multiprocessor architecture.
3.2 Detail of the CCC dual-bus snoop device showing the two active portions of the controller, cluster snoop and global snoop, sharing access to the single cache directory.
3.3 Meaning of the block states of the CCC.
3.4 Meaning of the block states of the CMC.
4.1 Basic scenario of intracluster sharing of locally resident block. Processor P1 reads block, P10 reads then writes block, passing updated version of data word to P1.
4.2 Continuation of the COGI protocol example transaction. Processor Pm reads then writes the same block that is shared with Cluster 1.
5.1 Traced multiprocessor address space. Raw virtual references are shown with resulting filtered references below them and indented.
5.2 C template for process creation and synchronization.
5.3 The steps of the Tracer technique.
5.4 Expanded assembly code (in two columns) shown below original assembly code fragment.
5.5 Problem of extending Tracer to general synchronization and task allocation paradigm.
6.1 Main loop of HLTPN simulator.
6.2 HLTPN model representing the actions of processors in the hierarchical bus multiprocessor. ..... 84
6.3 HLTPN subnet of barrier synchronization in the hierarchical bus multiprocessor model. ..... 86
6.4 HLTPN subnet of bus arbitration for HBSM multiprocessor. ..... 88
6.5 Set associative and direct mapped directories for a 2 cluster, 2 processor/cluster hierarchical cache system model. ..... 89
6.6 Bus handshaking signals DI and DK in Petri net model. ..... 92
7.1 Simulation loop for event scheduled DES using C-like pseudo code. ..... 96
7.2 Effect of hit ratio on cluster bus utilization for single cluster multiprocessor. ..... 109
7.3 Effect of global accesses on cluster bus utilization for single cluster multiprocessor. ..... 111
7.4 Effect of dirty write-back probability on cluster bus utilization in single cluster multiprocessor. ..... 113
7.5 Effect of write notice probability on cluster bus utilization in single cluster multiprocessor. ..... 114
7.6 Global bus utilization in ideal application. ..... 116
7.7 Potential efficiency for ideal application. ..... 117
7.8 Global bus utilization for realistic application. ..... 119
7.9 Potential efficiency for realistic application. ..... 121
7.10 The benchmark parallel merge sort used for traces in the hybrid model. ..... 128
7.11 Bus utilizations in typical configuration of block size 4, set size 2. ..... 129
7.12 Hit ratio and potential efficiency in typical configuration of block size 4, set size 2. ..... 130
7.13 Effect of block size on potential efficiency for set size 2. ..... 133
7.14 Effect of set size on potential efficiency for optimal block size. ..... 134
7.15 Effect of set size on global bus utilization for block size 4. ..... 135
7.16 Fraction of cache writes which generate a cluster bus write notice. ..... 136
7.17 Fraction of write notices to blocks which are not shared by other caches. ..... 137
7.18 Fraction of read misses which are serviced by another cache with a dirty copy of the block. ..... 138

Chapter 1

Introduction

This thesis describes the design and performance of the hardware controllers and cache coherence protocol which together comprise the memory subsystem of a hierarchical bus shared memory multiprocessor. The architecture studied represents a logical evolutionary step in the inevitable ascendancy of parallel processing in computationally intensive application domains such as scientific computing. The need for parallel computers and the basic elements of these machines are outlined in this introductory chapter. The first generation of single bus multiprocessors is considered, and the natural extension of a hierarchical bus architecture is described. The contributions of this thesis are a solution to the problem of cache coherence in a hierarchical bus multiprocessor and performance evaluation techniques for trace-driven simulation. The final section summarizes the strategy underlying this work and the contents of the remaining chapters.

1.1 Need for Parallel Processing

The eventual use of parallel processing technology is assured by the limitless need for computational power and the inherent physical limits of sequential computers. The areas of research that could benefit from faster, more efficient computers are as numerous as the fields being studied with computers. Classic examples of applications which are limited by available computing power include numerical weather forecasting, computational fluid dynamics, finite element analysis, and image processing. The future will undoubtedly bring uses for the computer which are as yet unimagined, and which will require vast computational power.

The computational power of traditional von Neumann architecture uniprocessor computers is limited by the speed of light. Signal transmission speed in silicon is actually ten times slower than the speed of light in a vacuum (3 × 10⁸ m/sec) [18]. At this rate, a signal can only propagate 3 cm in 10⁻⁹ sec. As the clock cycle time of the fastest supercomputers [15] approaches 10⁻⁹ sec, practical constraints on packaging and cooling will begin to limit their computational power.
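The propagation figure quoted above follows directly from the numbers already given in the text; the worked arithmetic is shown below.

```latex
% Signal speed in silicon is one tenth the speed of light in vacuum:
%   v = (3 x 10^8 m/sec) / 10 = 3 x 10^7 m/sec
% Distance covered in one nanosecond:
\[
d = v\,t = \left(3 \times 10^{7}\ \mathrm{m/sec}\right)\left(10^{-9}\ \mathrm{sec}\right)
         = 3 \times 10^{-2}\ \mathrm{m} = 3\ \mathrm{cm}.
\]
```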
If such supercomputers were restricted to purely sequential architectures, then executing one floating-point operation per clock cycle would imply a limit of 1000 MFLOPS (millions of floating-point operations per second). In fact, extant supercomputers, and those announced for the near future, rely on low-level parallel processing techniques known as pipelining and vector processing to increase their maximum potential speed beyond 1000 MFLOPS. Materials more exotic than silicon (such as gallium arsenide) may have a smaller propagation time, but are nevertheless eventually limited by the speed of light.

Where will computer engineers of the future find the performance improvements necessary to satisfy the voracious appetite for computational power of scientific applications? Whatever the technology, if the speed at which information can be processed is limited, then the only method of processing a constant amount of information in less time is to perform multiple operations simultaneously. Coordinated, simultaneous work by multiple processors is the essence of parallel processing. Implementations of parallel computers range from large numbers of slow processors to small numbers of fast processors. Hennessy and Patterson [30] refer to the El Dorado of parallel processing as large numbers of powerful processors in the same computer. (El Dorado refers to a city of fabulous riches in South America for which Spanish explorers of the 16th century searched in vain.) The two distinct roads to this El Dorado are exemplified by the 8-processor Cray Y-MP from Cray Research and the 65,536-processor bit-serial CM-2 from Thinking Machines.

The key to parallel architectures which follow the Cray Y-MP path is scalability. In practical terms, scalability is usually judged over a range. For example, binary hypercube multicomputers may scale well over the range of 10 to 10,000 processors, but packaging, cooling, and other mundane constraints may make hypercubes impractical outside of this range. Parallel computers which rely on a shared bus as an interconnection network have shown very limited scalability. Such computers are limited by bus bandwidth to around 30 processors; in other words, single bus parallel computers are scalable over less than an order of magnitude.

1.2 Elements of Parallel Computers

The fundamental elements of parallel computers are processors, memory, and an interconnection network. The interconnection network serves to connect processors to other processors as well as to memory modules.

1.2.1 Processors

Processing elements in parallel computers range from the most powerful pipelined vector processors (Cray Y-MP, IBM 3090) made with expensive, fast bipolar semiconductor technology, to the bit-serial, slow, CMOS processing elements of the CM-2. In between these extremes are the high performance RISC microprocessors, such as the Intel i860, Motorola 88000, Sun Microsystems SPARC, and MIPS R3000. Supercomputer processors are designed to be as fast as possible with almost no regard for expense. They are heavily pipelined and support vector instructions. Clock cycle times in such machines are now around 4 × 10⁻⁹ sec, with sustained performance around 20-30 MFLOPS. Such processors often require elaborate packaging and cooling support, and are never made in large quantities. Most of the supercomputers offered today, and all that are being planned, are available in models with multiple processors [15].
The single-chip VLSI microprocessors with around 10⁶ transistors offer the most economical method of building powerful parallel computers with hundreds to thousands of processors. They often include one or more of on-chip data and instruction caches, floating point and integer processors, and memory management units. Mass production of these off-the-shelf components reduces their price, and they generally do not require any special cooling or packaging. Their clock cycle times range from 12.5 × 10⁻⁹ to 40 × 10⁻⁹ sec, and their sustained performance ranges between 2-8 MFLOPS.

1.2.2 Interconnection network

Of the three elements of any parallel computer, the interconnection network is likely the hardest to make scalable. The methods proposed for connecting processors and memories are numerous. Some common interconnection networks include bus, ring, star, tree, mesh, torus, crossbar, and hypercube. Each of these (and the many other possibilities) can be classified by their connectivity, degree, and cost. The bus and crossbar are fully connected with respect to memory; each processor is directly connected to every memory module. The bus has lower cost and lower performance than the crossbar, since each processor node has only one connection (degree one) and all processors share the same path to all memory modules. With a crossbar, on the other hand, each processor has a direct connection to each memory module, thereby increasing the degree of each node, the cost, and hopefully the performance. In general, increasing the connectivity or the degree of the interconnection network increases the potential performance of the system at the expense of greater complexity and cost.

1.2.3 Memory

Parallel processors can be described by two views of memory: hardware and software. Physically, memory in a parallel processor can be distributed or centralized. Centralized memory may become the bottleneck of the entire computer, limiting the scalability of the architecture. From a software perspective, memory can either be shared or non-shared. Shared memory is seen as the more desirable programming paradigm, since interprocessor communication can take place within the single, shared address space. Non-shared memory programming involves explicit forms of communication (message passing) and can be efficiently emulated on a shared memory machine. Gordon Bell coined the terms multiprocessor to refer to shared memory parallel processors and multicomputer to refer to non-shared memory parallel processors [7]. Bell termed bus-based multiprocessors built from powerful microprocessors multis and predicted that these machines would represent the fifth generation of computing.

The disparate progression of memory and processor technology has resulted in a memory latency problem [30]. The speed of dynamic random access memory (RAM) has grown steadily at a 7% annual rate since 1980. Over the same time period, processor speed increased at a rate of 19-26% per year before 1985 and 50-100% per year after 1985. The solution to this mismatch in speed has been a memory hierarchy. Small, fast static random access memories known as caches are placed close to the processor, and serve as a buffer to the larger, slower dynamic RAMs used in main memory. Computer programs generally exhibit both spatial and temporal locality; that is, the memory locations that will be accessed in the near future have a high probability of being physically close in the address space (or identical) to those accessed in the recent past.
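For intuition, the small routine below exhibits both kinds of locality; it is an illustration added here, not code from the thesis.

```cpp
#include <cstddef>

// Minimal illustration of locality of reference: summing an array.
double sum(const double* a, std::size_t n) {
    double total = 0.0;      // 'total' is reused on every iteration: temporal locality
    for (std::size_t i = 0; i < n; ++i) {
        total += a[i];       // a[i] is adjacent to a[i-1] in memory: spatial locality,
                             // so one cache block fetched on a miss serves the next
                             // several accesses as well
    }
    return total;
}
```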
The principles of locality assure that a large proportion of memory accesses will be satisfied by the faster but smaller cache, thus decreasing the overall memory latency.

1.3 Cache Coherence

Computers with cache memories may have multiple copies of the same block of memory existing at the same time. These multiple copies of a memory block lead to the problem of cache coherence. A scheme for preserving the correctness of the system in the face of multiple copies of memory blocks is known as a cache coherence protocol. On a uniprocessor, a cache coherence protocol may be as simple as requiring that blocks in the cache which have been modified be written back to main memory before they are replaced in the cache. Input/output operations may also affect cache coherence in a uniprocessor. Cache coherence in a multiprocessor is more complicated, because copies of blocks may exist in multiple caches as well as main memory. The actions of all processors must be coordinated to maintain memory and cache coherence.

Centralized solutions to the multiprocessor cache coherence problem (e.g., a single table and controller) are undesirable because the centralized resource is likely to become a system bottleneck. A distributed solution to the cache coherence problem requires that each processor know what actions every other processor is taking, and respond accordingly. A major advantage of the bus-based multiprocessor class identified by Bell is that it admits a distributed, hardware solution to the cache coherence problem. The set of snoopy cache coherence protocols hinges on the ability of each processor to react to the transactions of every other processor by monitoring the bus. By snooping on the bus, each processor receives the same information at the same time, and can take actions to preserve coherence based on this information and the cache coherence protocol. The bus serves to serialize access to memory, assuring that no more than one transaction can take place at any given time, and that all processors have the same knowledge of the system.

1.4 Single Bus Multiprocessor

The single shared bus was a natural choice of interconnection network for the first multiprocessors. The shared bus was inexpensive, fully connected, and a known technology. Computer engineers were familiar with existing bus standards and interfacing techniques. Multiprocessors built on shared buses could be sold in small configurations and later upgraded by adding processors. The full connectivity of the bus allowed for simpler, centralized memories. Centralized memories, in turn, made the shared memory software paradigm easier to implement. Early commercial multiprocessors built on a single bus include the Symmetry and Balance from Sequent Computer Systems, the Multimax from Encore Computer, the N+1 from Synapse Computer, the FX/80 from Alliant Computer Systems, the Sequoia from Sequoia Systems, and the System 64000 from ELXSI. More recently, shared memory bus-based multiprocessors have been used in powerful graphics workstations from Stardent and Silicon Graphics, and as compute and file servers by MIPS, Solbourne, Compaq Computer, and Digital Equipment Corporation.

In addition to reducing memory latency, cache memory plays an equally important role in bus-based multiprocessors by reducing accesses to memory. Single bus multiprocessors without caches would be limited in size to only a few processors.
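A simple back-of-envelope model makes the bandwidth argument concrete; the formula and figures below are illustrative assumptions, not an analysis taken from this thesis.

```latex
% Illustrative bus-saturation model (assumptions, not the thesis's analysis):
%   B = bus transactions the bus can carry per second
%   r = memory references issued per processor per second
%   h = cache hit ratio (fraction of references satisfied without using the bus)
% Only misses reach the bus, so the number of processors one bus can support is roughly
\[
N_{\max} \approx \frac{B}{(1 - h)\, r}.
\]
% With no caches (h = 0), N_max = B/r, only a handful of processors; a hit ratio of
% h = 0.95 multiplies that limit by 1/(1 - h) = 20, the order-of-magnitude gain
% described in the surrounding text.
```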
If every memory access by every processor had to go across the single shared bus, the bus would soon become the system bottleneck. Instead, by buffering the bus from every memory access, caches allow the available bus bandwidth to be shared among more processors. Even with cache, the ultimate size of single bus multiprocessors is between 20 and 30 processors. Scalability after this point is limited by the bandwidth of the bus. 1.5 Hierarchical Bus Multiprocessor The use of caches to reduce traffic on the bus increased the ultimate size of single bus multiprocessors an order of magnitude, from 2—3 processors to 20—30 processors. A logical means of further increasing the scalability of bus-based multiprocessors is to form a hierarchy of single bus multis. A two level hierarchical bus multiproces— sor would have several clusters of processors joined by a single global bus. Each cluster would be equivalent to a single bus multiprocessor. Memory could be either distributed among clusters or centralized on the global bus. Figure 1.1 shows the general arrangement of processors and memory in a two-level hierarchical bus shared memory (HBSM) multiprocessor. Processors are gathered into clusters organized around a single bus; memory is distributed throughout the entire system. Such a machine should be able to scale into the hundreds of processors, thus increasing the ultimate size of bus-based multiprocessors an additional order of magnitude. The Cm* multiprocessor developed at Carnegie Mellon in the 19703 is the most Figure 1.1: Generalized two-level hierarchical bus multiprocessor. historically significant hierarchical bus architecture. The Cm* was made up of clusters of up to 14 processor/ memory modules sharing a single bus and a custom controller. Clusters were joined to one another with two global buses. The custom controller maintained the illusion of a single shared memory address space. Any of the DEC LSI-ll processors could access any other processor’s memory module, but the mem- ory latency ratio of 1:3:9 (localzsame clusterzdifferent cluster) placed a premium on intercluster accesses. A 50 processor, 5 cluster Cm* machine was built and operated. Cm* showed the feasibility of a bus architecture which was designed to take ad- vantage of locality of reference by attaining linear speed-up on such clusterable ap- plications as partial differential equations. The lack of cache memory simplified the role of the custom controller and eliminated the need for a cache coherence protocol. Amortizing the cost of the custom controller across the 10-14 processors in a cluster allowed it to be more powerful and expensive than if each processor were required to have its own controller. A buffered, packet switching bus protocol was used to avoid deadlocks over bus allocation. Not having to worry about a cache coherence problem made implementation of such a split-transaction protocol feasible. Two ongoing research projects featuring hierarchical bus multiprocessor architec- tures have been reported in the literature. In a project known as the Ultramax [8, 56], Encore Computer, with DARPA funding, has made the logical extension to their Mul- timax architecture by joining single bus clusters of 16 processors with a global bus. The Ultramax uses large caches between the clusters and the global bus, and memory that is physically distributed among clusters. The Data Diffusion Machine [29], from the Swedish Institute of Computer Science, is built around a multi-level hierarchical bus architecture. 
Processor caches at the lowest level of the hierarchy form the only system memory in the DDM. Caches at higher levels hold only status information, not data. The most novel aspect of the DDM is that shared memory data blocks are not considered to have a particular home location. Instead, data blocks migrate to where they are needed, living in caches at all times, without a home memory module for permanent residence. 1.6 Performance Evaluation To be credible, a proposal for a new computer architecture must be accompanied by evidence of realistic requirements for and correct operation of system components. To be of interest, new computer architectures must show promise of increased perfor- mance. Two strategies were pursued to evaluate the HBSM multiprocessor architec- ture. Detailed Petri net models were used to design and debug the cache coherence protocol and controllers. Discrete event simulation of purely probabilistic and trace- driven models provided estimates of broad measures of performance of the protocol and the architecture. 1.6.1 Queuing networks The main advantage of queueing networks is the balance between accuracy and ef- ficiency which they provide. The close correspondence between certain aspects of computer systems and networks of queues eases the task of developing a queuing network model. Algorithms such as Mean Value Analysis are available for which the analysis of queuing network models is computationally proportional to the product of the number of queues and the number of customer classes. Use of such economical algorithms depends on being able to accurately represent the system under study with a class of models known as product form queuing networks. A further advantage is the fact that the number of parameters which must be obtained to characterize a sys- tem being modeled with product form queueing networks is relatively small. Queuing network models have been successfully applied to the analysis of high level computer system models where questions of capacity planning and overall system performance must be addressed. If the system under study can not be adequately represented by product form net- works, then a general network of queues can be employed. Unfortunately, analysis of such general networks is computationally very expensive; the amount of work required for analytical solutions rises exponentially with the size of the model. Simulation of queueing network models is an alternative in such cases, but is also computation- ally intensive and requires very careful characterization of the system under study to determine model parameters. For approximately the same investment in model development and simulation time, a more general discrete event simulation of the system under study can be made. The main difficulty in using queuing networks for performance evaluation is that exact solutions to models exist only for product form, or separable, networks. Prod- uct form networks must be composed solely of first-come—first-served queues with exponential, class-independent service times. FCFS queues with exponential service times are not sufficient for accurately modeling the memory subsystem of a multipro- cessor. As a result, approximate solutions to non-product form networks (those with a wider range of service times and disciplines) must be found. Sauer and Chandy [45] outline two approaches to this problem. The first is to simply apply the product form solutions to the non-product form model. 
In some cases, Reiser has shown that this approach can yield results close to the exact solutions [43]. The conditions under which this strategy yields results close to the exact solution are for queuing networks with large populations of customers. This requirement is not met for a model of the HBSM multiprocessor, since the number of customers in the entire model would simply be the number of processors being modeled. On average, most queues would never see more than one customer enqueued at a time. The second approach to solution of non-product form networks is known as ag- gregation. While no known bounds exist for the error incurred by using aggregation, Lazowska, et al claim that a large body of empirical evidence exists to show that results from aggregation are usually close to the exact results [35]. The empirical evidence cited comes from the field of computer systems modeling and capacity plan- ning. Unfortunately, the number of customers in a queuing network model of the HBSM multiprocessor would be so small, compared to the computer system models mentioned, that the error involved with aggregation is much more uncertain. Queuing networks are unable to represent some common aspects of computer system behavior such as simultaneous holding of resources. The representation of 10 concurrency is an unfortunate limitation when using queuing networks to model par- allel computer systems. Since queuing network models are concerned with average flows in a network of queues, it is difficult to represent resources which must be used exclusively. For example, consider a snooping cache controller competing with a pro- cessor for exclusive access to the cache directory. This competition (and the likely locking that accompanies it) has ramifications for the performance of the cache mem- ory subsystem. In a queuing network model of such a cache controller, the use of the cache directory can only be represented as an average flow through the directory queue; competition and locking can not be directly represented. Modeling with queuing networks is an entirely stochastic approach; the actual memory references recorded in address traces can not be used directly as input to a queuing network model. Increased availability of computer power at decreased cost has boosted the popularity of simulation as the performance evaluation tool of choice. More popular in the past, technological trends and the availability of sophisticated modeling and simulation software have largely displaced queuing networks as a per- formance evaluation tool for detailed computer architecture studies. 1.6.2 Discrete event simulation The overall performance of a shared memory multiprocessor depends heavily on the performance of the memory hierarchy. In turn, the performance of the memory hier- archy depends on the memory reference patterns of an application. Interconnection network traffic attributable to the maintenance of cache coherence in particular is very sensitive to the distribution and timing of memory references both within a single processor and between processors. To be accurate, performance evaluation of a cache coherence system (protocol and controllers) must be done with trace-driven simulation. The difficulties of obtaining traces for multiprocessors provided the im- petus for the development of a multiprocessor trace gathering technique that can be run on a uniprocessor. A major advantage of the traces generated in this manner are their architectural independence. 
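As an illustration of what such an architecture independent trace might contain, the sketch below defines a hypothetical record layout; the field names and widths are assumptions made here for exposition, not the actual Tracer format, which is described in Chapter 5.

```cpp
// Hypothetical per-process trace record layout, for illustration only; the
// actual Tracer format is defined in Chapter 5.  Records carry no timing:
// the simulator consuming the trace assigns timing to the read, write, and
// barrier events it finds.
#include <cstdint>

enum class TraceEvent : std::uint8_t {
    Read,      // data load by the traced process
    Write,     // data store by the traced process
    Barrier    // barrier synchronization point recorded in the trace itself
};

struct TraceRecord {
    std::uint16_t process;   // logical process (not physical processor) identifier
    TraceEvent    event;     // read, write, or barrier marker
    std::uint32_t address;   // virtual address (unused for Barrier records)
};
```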
Address traces are taken at the process level, so the traces are independent of the architecture of the machine generating them. Syn— chronization artifacts in the traces are avoided by recording the presence of barrier synchronization points in the traces themselves. The simulators that are fed with the traces determine the actual timing of the read, write, and synchronization events 11 recorded in the traces. To demonstrate reasonable hardware requirements and to provide some degree of confidence in the correctness of the cache coherence protocol, detailed modeling was performed with high-level, timed Petri nets. The Petri net model accurately represents the complete state of the system controllers, buses, caches and memory modules. Developing this model and running it with multiprocessor address traces was crucial to the design and debugging of the proposed cache coherence protocol. However, the level of detail in the Petri net model precludes its use for trace-driven simulations of adequate length to accurately estimate the potential performance of the HBSM multiprocessor architecture. Estimates of bus and controller utilization, the chief performance metrics of in- terest in bus-based multiprocessors, are made using an ob ject-oriented, event sched- uled, discrete event simulation developed in the C++ programming language. The C++ simulation is a hybrid approach combining trace-driven simulation and proba- bilistic models. For applications which are symmetric with respect to clusters, the cache activity on one cluster can be taken to be representative of the activity of all clusters. Simulation time is reduced by simulating the activities of only one clus- ter. The effect of the interaction between clusters is represented probabilistically in the simulation by pre—processing the address traces and estimating the amount of intercluster activity for each phase of the application. 1 .7 Thesis Outline The evolution of bus-based shared memory multiprocessors is depicted in Figure 1.2. The first step taken was to join several processors and a central memory to a single bus via private caches. This first step was motivated by the ever present need for increased computing power, the desirability of the shared memory programming paradigm, and the familiarity of computer engineers with bus technology. The scalability limit of the single bus multiprocessors is the bandwidth of the bus; this unique resource becomes the system bottleneck when the number of processors exceeds about 20. The graph in the middle of Figure 1.2 shows the behavior of speed-up or total compute power of single bus multiprocessors when the bus becomes a system bottleneck. Actual performance saturates at a limit determined by the bus bandwidth, the memory reference patterns, and the size, geometry and speed of caches and memory. Further 12 additions of processors at this saturation point does not result in further speedup. The natural next step to extend the scalability of bus-based multiprocessors is to form a hierarchy of buses, as shown in the bottom of Figure 1.2. This thesis describes the design and performance of the hardware controllers and cache coherence protocol of a hierarchical bus shared memory multiprocessor. The first four chapters constitute the background of the problem and the paper design work. The final four chapters represent the work done to evaluate the HBSM multi- processor architecture. 
The cache coherence protocol and controllers developed for the hierarchical bus shared memory architecture are complex entities. Verification of the correctness of such complex, interrelated hardware/software systems remains an open research topic. Trace-driven simulation with an accurate and detailed Petri net model was performed to gain confidence in the correctness of the memory subsystem. The Petri net simulator used to develop and debug the protocol and hardware controllers is a powerful, general tool for the study and performance evaluation of concurrent systems. However, the level of detail of the Petri net model required to be useful for studying the correctness of the system precludes its use in performance evaluation of the system for trace files of realistic lengths. An object-oriented discrete event simulator was developed to study the higher level performance of the memory subsystem.

Both strategies of this two-pronged approach, detailed Petri net models and higher level discrete event simulation, are trace-driven approaches. The goal of the first strategy is to gain confidence in the correctness of the protocol and the realizability of the hardware controllers. The goal of the second strategy is to demonstrate the feasibility of increasing the scalability of bus-based multiprocessors by an order of magnitude, from 10-20 processors to 100-200 processors.

1.7.1 Guide to chapters

This chapter outlines the need for parallel processing technology to bring to bear increased computational power on a wide range of computer applications. A brief review of the elements of parallel computer architectures and the method of applying caches to bridge the processor/memory speed gap was provided as background. The first cache-based single bus multiprocessors were described. The natural extension of the hierarchical bus as a means of increasing the ultimate size of bus-based multiprocessors was outlined. Performance evaluation strategies were described which produce confidence in the correctness of the proposed cache coherence protocol and broad measures of performance for the hierarchical architecture.

[Figure 1.2: Steps in the evolution of bus-based shared memory multiprocessors. The figure shows, at the top, the single-bus multiprocessor arising from the need for parallel processing technology, the shared memory programming paradigm, and familiar bus technology, with caches added to decrease memory latency and bus traffic; in the middle, a graph of speedup versus number of processors (2 to 20) saturating below the ideal because of the bus limitation; and, at the bottom, the hierarchical bus-based shared memory multiprocessor as the next step.]

The problem of cache coherence is considered in Chapter 2. Possible solutions to the cache coherence problem are described, with specific attention being paid to a class of hardware solutions for single bus multiprocessors.

The components of the hierarchical bus shared memory multiprocessor studied in this thesis are described in Chapter 3. The responsibilities and the design criteria of each controller are explained. The description in this chapter includes a section analyzing the hazards inherent in a controller which monitors two asynchronous buses and how each of these hazards is resolved.

The cache coherence protocol developed for the hierarchical bus multiprocessor is described in Chapter 4. A complete description of the states of the protocol, the required bus transactions, and the tables defining the protocol are found here.

Chapter 5 describes a method developed to generate multiprocessor address traces for detailed performance evaluation of multiprocessors.
The chief advantages of this method are the lack of reliance on a real multiprocessor and the architectural inde- pendence of the address traces. The detailed Petri net models used to design and debug the hardware controllers and the cache coherence protocol are discussed in Chapter 6. The models were used in a Petri net simulator developed for the purpose of studying concurrent systems. The simulator supports many extensions to Petri nets which increases the modeling power or expressiveness of the tool. Discrete event simulation is used to estimate the performance of the hierarchical bus multiprocessor. The object-oriented simulation tools and the models used for performance evaluation are described in Chapter 7. A purely probabilistic model of the complete HBSM multiprocessor is used to explore the scalability of the hierarchi- cal architecture. A trace-driven extension of this model is used to verify the purely probabilistic simulation model, and to study in more detail the effects of cache ge- ometry and coherence traffic in the HBSM multiprocessor. This simulation model provides increased detail with acceptable speed by accurately capturing the state of a single cluster and representing intercluster activity probabilistically with parameters measured from address traces. 15 The final chapter draws conclusions about the scalability of the hierarchical bus architecture and considers directions for future research in several areas. Chapter 2 Cache Coherence A hierarchical bus multiprocessor has the potential of increasing the ultimate size of bus-based multiprocessors into the range of a few hundred processors. The critical role of caches in bus-based architectures and the obvious need for correct and de- terministic operation of the multiprocessor means that a satisfactory solution to the cache coherence problem is necessary. As shall be seen, the performance of the sys- tem for maintaining cache coherence will be crucial to the performance of the overall system. 2.1 Definition of Cache Coherence Problem The cache coherence problem arises from the simultaneous existence of multiple copies of the same block of memory. Physical memory of a computer can be thought of as a collection of memory blocks. Each block of memory is associated with a home memory module. Copies of each memory block may exist in zero or more caches in a multiprocessor. The correct and deterministic operation of the multiprocessor de- pends on managing the multiple copies of data blocks, or solving the cache coherence problem. An intuitive requirement for a solution to the cache coherence problem is that any read to a memory location should return the value last written to that memory location. In a multiprocessor the meaning of last written value may very well depend on the point of view of each processor. As a result, this intuitive definition of coherence is not rigorous enough to use for evaluating solutions to the cache coherence problem, but does indicate a desired behavior for a solution. A more rigorous requirement for correct operation is known as sequential consis- tency. Sequential consistency requires that a system produce the same result as long as the memory accesses of each processor are kept in order, while the memory accesses of different processors may be interleaved in arbitrary order. This is a common expec- tation among programmers, no doubt because uniprocessors execute in this fashion, 16 17 and the semantics of sequential and parallel programming languages encourage this model. 
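The classic two-processor illustration below shows what sequential consistency rules out; it is a modern C++ sketch added here for intuition, not an example from the thesis.

```cpp
// Two-processor illustration of sequential consistency.  Under sequential
// consistency some interleaving of the four accesses must explain the result,
// so (r1, r2) == (0, 0) is impossible: whichever store is ordered first is
// visible to the other processor's load.  The sequentially consistent atomics
// below make a real machine honor that guarantee; plain variables might not.
#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> x{0}, y{0};   // default operations are memory_order_seq_cst
int r1 = 0, r2 = 0;

void p1() { x.store(1); r1 = y.load(); }   // processor 1: write x, then read y
void p2() { y.store(1); r2 = x.load(); }   // processor 2: write y, then read x

int main() {
    std::thread t1(p1), t2(p2);
    t1.join();
    t2.join();
    std::printf("r1=%d r2=%d\n", r1, r2);  // (1,0), (0,1), or (1,1); never (0,0)
    return 0;
}
```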
Scheurich and Dubois define a sufficient condition for a multiprocessor to be sequentially consistent in terms of global performance of memory accesses. A store is globally performed when it is performed with respect to all processors; that is, no load issued by any processor can return an old value which was valid before the store. A load is globally performed when it is performed with respect to all processors, and the store which is the source of the value returned has been globally performed. Sequential consistency is satisfied in any system if an access may not be performed with respect to any processor until the previous access by the same processor has been globally performed, and if accesses of each individual processor are globally performed in program order [46].

Enforcing sequential consistency via the cache coherence system may be too costly. For example, if multiple caches in the hierarchical bus architecture held copies of the same block, sequential consistency would require that a processor writing to this block be held until the effect of the write operation had been recorded in every cache holding the block. Even if the effect of the write were to simply invalidate other copies of the shared block, stalling the processor until all invalidations were performed (and until acknowledgments had been received) would severely degrade performance.

Alternatives to sequential consistency are known as weak ordering models. Systems which do not enforce sequential consistency (or strong ordering) may still execute correctly if proper synchronization is followed. Coherence in a weakly ordered system is only enforced at necessary synchronization points. The advantage of weak ordering models is that by allowing inconsistency when it does not matter, the performance of the system may be improved. To continue the example used above, a processor would only need to be delayed on a write to a shared block until it broadcast its write intention on its cluster bus, knowing that invalidations would eventually be received by all caches holding a copy of the block.

The cache coherence maintenance system (protocol and controllers) described in this thesis implements a form of weak ordering known as processor consistency. Processors are released after a write to a shared block once the transaction has been issued on the bus; a processor is free to issue further memory accesses before the write is seen by all processors in the system. Accesses by any given processor occur in sequential order with respect to that processor, but nothing is guaranteed about memory accesses relative to other processors. Not guaranteeing sequential consistency eliminates the only published method of proving a cache coherence protocol correct [46]. The issue of correctness is addressed in later chapters. Processor consistency requires that access to shared variables be protected with synchronization constructs such as semaphores and monitors.

The expectations of a good solution to the cache coherence problem are that it not allow violations of the cache coherence protocol, that it maintain coherence with minimal impact on the performance of the system, that it be transparent to programmers, and that the cost and complexity of the solution be reasonable. A solution to the cache coherence problem does not eliminate the need for proper use of synchronization in parallel programming.

2.2 Example of Cache Coherence Problem

To understand the cache coherence problem, imagine a small multiprocessor with two processors, each with a private cache, and a single system memory, joined by an arbitrary interconnection network, as shown in Figure 2.1a. The first event (step 1) in the scenario depicted is when processor P1 does a read to a block which misses in the cache. The cache miss is satisfied by the cache controller CC1 requesting that the block be supplied to it. The memory module responds to this request (step 2), and the block is loaded into the cache of P1. Processor P2 proceeds to also read from this block (step 3), finds it missing in the cache, and loads the block from memory (step 4), until the situation is as shown in Figure 2.1b.

The next event in the system is a write by processor P1 to the block. At this point, the question of updating the other copies of the block that exist in memory and cache arises. If the write to the block is passed over the interconnection network, thus keeping the copy of the block in memory up-to-date with the latest copy in P1's cache, then the cache coherence protocol is classified as write-through. If memory is left inconsistent with the latest cache copy, then the protocol is said to be write-back. A write-back protocol requires that caches take responsibility for blocks for which they hold the latest value. The actions taken for the copy of the block in P2's cache distinguish the cache coherence protocol as either a write-invalidate or write-update protocol. Write-invalidate protocols simply invalidate copies of the block held in other caches. Write-update protocols transfer the new data value of the block to the other caches holding copies of the block. The memory update policy and the cache update policy are orthogonal.

[Figure 2.1: Simple multiprocessor scenario to demonstrate cache coherence problem. (a) Reads by P1 and P2 load copies of the block into their respective caches; the block of interest is shown in black. (b) A write by P1 forces a choice of policy for updating the memory and cache copies of the block.]

2.3 Solutions to the Cache Coherence Problem

The strategies for solving the cache coherence problem can be broadly classified into hardware solutions or software solutions. The policies which can be implemented efficiently depend on the architecture of the multiprocessor, particularly of the interconnection network. For example, if all cache controllers can monitor broadcasts from all other cache controllers, then a write-update policy can be more efficiently implemented, since the writing processor need not know in advance the list of other processors holding copies of the block in their cache. Access to the same information at the same time gives the shared bus an advantage over more general forms of interconnection networks, such as multistage switching networks, since it allows for a distributed hardware solution to the cache coherence problem.

2.3.1 Software solutions

Software solutions to the cache coherence problem rely on intelligent compilers to generate instructions to the cache controllers which will ensure that coherence is maintained [12]. In the most extreme case, the compiler may simply mark all variables shared between two or more processors as uncacheable. Accesses to shared variables would bypass the cache and go directly to memory every time.
2.2 Example of Cache Coherence Problem To understand the cache coherence problem, imagine a small multiprocessor with two processors each with private caches and a single system memory, joined by an arbitrary interconnection network, as shown in Figure 2.1a. The first event (step 1) in the scenario depicted is when processor P1 does a read to a block which misses in the cache. The cache miss is satisfied by the cache controller CCI requesting that the block be supplied to it. The memory module responds to this request (step 2), and the block is loaded into the cache of P1. Processor P2 proceeds to also read from this block (step 3), find it missing in the cache, load the block from memory (step 4), until the situation is as shown in Figure 2.1b. The next event in the system is a write by processor P1 to the block. At this point, the question of updating the other copies of the block that exist in memory and cache arises. If the write to the block is passed over the interconnection network, thus keeping the copy of the block in memory up—to—date with the latest copy in P1 ’8 cache, then the cache coherence protocol is classified as write-through. If memory is left inconsistent with the latest cache copy, then the protocol is said to be write-back. A write-back protocol requires that caches take responsibility for blocks for which they hold the latest value. The actions taken for the copy of the block in Pg’s cache distinguishes the cache coherence protocol as either a write-invalidate or write-update protocol. Write-invalidate protocols simply invalidate copies of the block held in other caches. Write-update protocols transfer the new data value of the block to the other caches holding copies of the block. The memory update policy and the cache update policy are orthogonal. 19 General Interconnection Network 6?) supply GD supply P1 (1) read @ read a) Reads by P1 and P2 load copies of block into their respective caches. Block of interest shown in black. General Interconnection Network (59 write b) Write by P1 forces choice of policy for updating memory and cache copies of block. Figure 2.1: Simple multiprocessor scenario to demonstrate cache coherence problem. 20 2.3 Solutions to the Cache Coherence Problem The strategies for solving the cache coherence problem can be broadly classified into hardware solutions or software solutions. The policies which can be implemented efficiently depend on the architecture of the multiprocessor, particularly of the in- terconnection network. For example, if all cache controllers can monitor broadcasts from all other cache controllers, then a write-update policy can be more efficiently implemented, since the writing processor need not know in advance the list of other processors holding copies of the block in their cache. Access to the same information at the same time gives the shared bus an advantage over more general forms of in- terconnection networks, such as multistage switching networks, since it allows for a distributed hardware solution to the cache coherence problem. 2.3.1 Software solutions Software solutions to the cache coherence problem rely on intelligent compilers to generate instructions to the cache controllers which will ensure that coherence is maintained [12]. In the most extreme case, the compiler may simply mark all vari- ables shared between two or more processors to be uncacheable. Accesses to share variables would bypass the cache and go directly to memory every time. 
A more so- phisticated approach is to classify shared variable memory accesses by their coherence requirements, and issue instructions to the cache controller to maintain coherence for the shared variables but still allow caching. For example, shared variables that are used read-only by multiple processors can always be cached, and no special action need be taken to maintain coherence. Shared variables which are read by many pro- cessors and written by just one processor must be made uncacheable on the reading processors and can be cached for the writing processor if the cache protocol specified a write-through memory update policy. Software solutions to the cache coherence problem do not require complex cache controllers, but do require that a processor be able to issue commands to the cache controller to turn the cache on and off, and to selectively invalidate cache blocks. Performance of software solutions can not be as high as a properly implemented hardware solution to the coherence problem, since compilers cannot know the relative timing between accesses to the same block by multiple processors. In other words, if 21 a block is accessed read / write by two processors, but those accesses do not overlap in time, then hardware support for cache coherence will allow greater cache accesses to occur with only small overhead, while a software solution enforces no caching of the block. 2.3.2 Hardware solutions Placing the burden of cache coherence on hardware makes the actions necessary to maintain coherence transparent to both the programmer and the compiler writer. The solutions that have been proposed can be classified into directory schemes, snoopy controllers, and coherent networks. Directory schemes use state information held in either a centralized directory [54] or distributed among the memory modules [10]. In both cases, the directory holds state information which enables the cache controllers to take actions necessary for preserving coherence. Centralized directories may become a system bottleneck, as every memory reference must be checked against the directory. Stenstriim has proposed a directory scheme where the state information is distributed among the caches [50]. Full-map directories contain state information for every instance of a block in the system, allowing invalidates or updates to be sent only to those caches holding the block. The size of full-map directories has prompted methods of limiting how many caches can hold a block at the same time. These limited directories [11] either force invalidations when the requested number of blocks exceeds their capacity or resort to broadcasting invalidates and update information. The main advantage of directory schemes is they do not require a single source of information which all cache controllers can monitor; they can be used with general interconnection networks. A second class of hardware solutions are the snooping schemes designed for single bus multiprocessors. When a single shared bus is used to interconnect all processors and memories, broadcast from one cache to all other caches is readily and inexpen- sively achieved. Snooping on (monitoring) the bus gives all cache controllers infor- mation about what is happening in the system simultaneously. A single invalidate bus transaction can invalidate all cached copies of the block at the same time. 
In addition, the bus acts to serialize accesses to memory, since the bus is the only path from all processors to memory, whereas in a generalized interconnection network there may be multiple paths to memory which can be used concurrently. The characteristics which make the bus desirable from the standpoint of cache coherence also act as a fundamental limitation on the ultimate size of bus-based multiprocessors. Because the bus is a unique resource shared by all processors to communicate with each other and with memory, it will inevitably become the system bottleneck as the size (number of processors) of the multiprocessor is increased.

Limitations in the scalability of bus-based multiprocessors have led to proposals for cache coherent networks. These machines are multiprocessors built around interconnection networks designed to maintain cache coherence while at the same time not limiting the scalability of the overall system. Hierarchical bus architectures have been proposed by several researchers. Wilson [56] describes a two-level hierarchical bus machine which uses large second level caches to hold copies of the blocks held in caches in the cluster below them. These second level caches hold both status and data. The system memories are concentrated on the highest level (global) bus. The cache coherence protocol used is an extension of a simple invalidation protocol with write-back. The Data Diffusion Machine [29] being built at the Swedish Institute of Computer Science uses large low-level caches attached to each processor and a hierarchy of buses connected by status-only caches at each hierarchical level. A novel feature of the DDM is the absence of traditional memory. The large caches at the processors act as a distributed memory for the entire system.

The Wisconsin Multicube [28] is based on a two dimensional grid of buses with memory modules for each column, and large caches at the intersection of rows and columns. Each cache controller snoops on both its column and row bus. A simple write-invalidate/write-back protocol is used, together with information about all modified blocks in all column caches held in each cache controller. A novel feature of the Multicube cache coherence scheme is that read requests can be handled by the closest cache containing a copy of the block, even if the block has not been modified. Normally system memory handles requests for blocks which have not been modified.

2.3.3 Single bus snoopy protocols

The protocol and cache coherence system developed in this thesis have their roots in single bus multiprocessor solutions to the cache coherence problem. Table 2.1 summarizes the characteristics of several of the important evolutionary steps made in snoopy cache coherence protocols.

  Protocol   | Memory policy | Cache policy     | Comment
  Write Once | dynamic       | write-invalidate | first write to memory, subsequent writes to cache
  Synapse    | write-back    | write-invalidate | first implemented snoop protocol, introduced concept of ownership
  Dragon     | write-back    | write-update     | used bus line to support cache-to-cache sharing

Table 2.1: Important stages in the evolution of snoopy cache coherence protocols.

Goodman's Write Once protocol [27] was the first write-invalidate protocol and also used a dynamic memory update policy. In Write Once, the first write to a cache block was passed through to memory (write-through memory update policy) while subsequent writes to the same block were kept in the cache (write-back).
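To make the dynamic memory update policy concrete, the fragment below sketches only the write-hit behaviour described above; it is not taken from the thesis. The state names follow Goodman's paper, and the single bus helper is a placeholder.

/* A minimal sketch of the write-hit behaviour attributed to Write Once:
 * the first write to a cached block is passed through to memory, subsequent
 * writes stay in the cache.  The bus helper is hypothetical. */

enum wo_state { WO_INVALID, WO_VALID, WO_RESERVED, WO_DIRTY };

void bus_write_word(unsigned long addr);  /* write-through; other caches snoop it */

enum wo_state write_once_write_hit(enum wo_state s, unsigned long addr)
{
    switch (s) {
    case WO_VALID:                /* first write: write-through to memory     */
        bus_write_word(addr);     /* the broadcast also invalidates other copies */
        return WO_RESERVED;       /* memory is still up to date               */
    case WO_RESERVED:             /* second write: keep it in the cache       */
    case WO_DIRTY:                /* later writes stay local (write-back)     */
        return WO_DIRTY;
    default:                      /* a write to an Invalid block is a miss,   */
        return s;                 /* handled elsewhere                        */
    }
}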
The Synapse protocol [25] was used in the Synapse N+1 fault tolerant multiprocessor and represents the first implementation of a snooping protocol. The ownership concept, in which every block has a distinct and unique owner at all times, either memory or a cache controller, was introduced by Synapse. The ownership concept was taken the next logical step by the Dragon protocol [39], in which write-updates are performed to keep caches up-to-date with each other when a block is being written to by multiple caches. A cache is an owner of a modified block, and hence is responsible for supplying that block to read requests, until it either replaces the block and writes it back to memory, or until another cache writes to the block and assumes ownership. Dragon uses a bus line, called the Shared Line, for cache controllers to indicate whether a block is being shared among multiple caches. The shared line obviates the need to broadcast every write-update on the bus when it is known that no other caches have copies of the block. This effectively splits the modified state into two states, a modified-shared state and a modified-private state.

Archibald and Baer [2] studied the performance of published snoopy protocols (including those in Table 2.1) for single bus multiprocessors with probabilistic workload models. Their results indicate that cache coherence protocols which allow dirty sharing of blocks (true of Dragon) have significantly better performance, particularly when the degree of sharing is high, than do simpler invalidation protocols.

Chapter 3
Architecture of the HBSM Multiprocessor

Several strategies for solving the cache coherence problem were discussed in the previous chapter. The solution proposed for the hierarchical bus shared memory multiprocessor studied in this thesis is comprised of hardware controllers and a cache coherence protocol. This chapter describes the HBSM architecture and specifies the responsibilities of the hardware components of the memory system. The possible states of a memory block in each controller are described in anticipation of the definition of the cache coherence protocol in Chapter 4.

3.1 Introduction

The Hierarchical Bus Shared Memory (HBSM) multiprocessor consists of a two-level hierarchical bus interconnect with physically distributed memory and private processor caches. A cluster consists of multiple processors sharing a single cluster bus. Multiple clusters are joined by a single global bus. The interface between the global bus and the cluster buses consists of two devices: a memory controller and a status-only cache controller. Providing the illusion of a single, flat virtual address space in the presence of physically distributed memory is an important characteristic of the HBSM multiprocessor, both for performance and for ease of programming.

The main components of the HBSM multiprocessor are shown in Figure 3.1. Caches in the system are shown as boxes attached to controllers. Each row of the boxes corresponds to a memory block. Data in caches are gray, status information is white. Groups of blocks separated by thicker lines indicate set associative caches. Each cluster consists of processors P with private caches CC and a dumb Memory Controller MC on a cluster bus joined to the global bus by the dual-bus snoop controllers, the Cluster Memory Controller CMC and the Cluster Cache Controller CCC.
In order to support the illusion of a single shared address space, the CMC maintains a directory of status information for every block in the cluster memory module. The CCC keeps a set-associative status-only cache containing an entry for every block held in a CC cache on the CCC's cluster. The CCC acts to maintain coherence between caches on different clusters.

The architecture of the HBSM multiprocessor is similar to other hierarchical bus multiprocessors. The HBSM architecture differs from the Ultramax by physically distributing the memory modules throughout the system, rather than concentrating them on the global bus [8]. Additionally, the role of the second level cache is different. The Ultramax second level caches are traditional caches that hold both the data and status of memory blocks found in the processor caches below them. The HBSM multiprocessor CCC caches only the status of the memory blocks found in the lower level caches, and not their data values.

The paper design of new multiprocessors is a time honored tradition for doctoral degrees in the field of parallel processing. Novel methods of connecting processors with memories are not difficult to propose; whether the new architecture has any merit is more difficult to ascertain. As with any complex system, the performance of a large multiprocessor depends on the interaction of its many subsystems. To study all of these subsystems is both necessary for complete understanding of the whole, and impossible for anything but a large team of researchers. This chapter, along with Chapters 4 and 6, attempts to compromise between a purely academic exercise and a useful investigation of a practical multiprocessor architecture. Some aspects of the HBSM multiprocessor are not modeled, or are represented only implicitly. For example, the role of the operating system in a real multiprocessor cannot be overstated, but is completely ignored here. The model used for processors is very simple. The main components of the memory subsystem are described in this chapter; Chapter 6 is devoted to a detailed Petri net model of these controllers. The Petri net model captures the state of each controller and each memory block in the system. Trace driven simulation and the level of detail found in the models uncovered problems that would be faced in actually building a multiprocessor similar to the one studied here.

The goal of the design portion of this effort was to confirm the practicality of the protocol and the memory subsystem controllers. For example, a protocol transaction which required instantaneous information available only on other clusters is not practical. Modeling the HBSM multiprocessor in detail has verified that each controller can act to maintain system coherence with information which would be readily available to it in a real system.

[Figure 3.1: Components of the Hierarchical Bus Shared Memory (HBSM) multiprocessor architecture. CCC: Cluster Cache Controller; CC: Cache Controller; P: Processor; CMC: Cluster Memory Controller; MC: Memory Controller; M: Memory.]

An additional achievement of the detailed study of the cache coherence protocol and memory subsystem controllers is knowledge of the phenomenon of state-space explosion which occurs in a complex, asynchronous system. Cache coherence protocols are often judged practical or impractical based on the number of states they utilize.
What is not often discussed in research papers is the large increase in states and possible situations when the interactions of real controllers and asynchronous buses are considered. For example, when two processors on different clusters write to a common block at nearly the same time, the corresponding write notice transactions will collide at one of the dual-bus snoop controllers (CCC and CMC). What is represented by a single box in a protocol table has now become a complex event of aborting one of the transactions, resetting the states of various controllers, and restarting the aborted cache write. If each of the many possible interactions on one cluster and between clusters is considered a state of the system, then the formal states of the cache coherence protocol mushroom into system states which must be properly handled.

A fundamental assumption underlying the design of the system controllers was the clustered nature of the architecture, and the expectation that computations performed on such a multiprocessor would exhibit clustered memory access patterns. The protocol and controllers were designed to be completely general, i.e., they will support any sort of memory access pattern. High performance is predicated on a bias of each processor to access memory locations found in its respective cluster memory module. The design objectives of the system were three-fold. First, the delay associated with intracluster memory accesses should be minimized, since the system is aimed at applications with clusterable memory access patterns. Second, protocol traffic on the global bus should be minimized, since the global bus represents a possible system bottleneck. And third, the communication costs for intercluster accesses should be within an acceptable range to allow sharing of data amongst clusters.

3.2 System Controllers

The two dual-bus snoop devices, the Cluster Memory Controller CMC and the Cluster Cache Controller CCC, each have two snooping parts; the cluster snoop (Csnoop) monitors a cluster bus, and the global snoop (Gsnoop) monitors the global bus. Dual-ported memory could be used to implement a single directory to which both the Csnoop and the Gsnoop of these devices refer. Most of the complexity of the design of the HBSM memory controllers is found in the dual-bus snoop devices.

[Figure 3.2: Detail of the CCC dual-bus snoop device showing the two active portions of the controller, cluster snoop and global snoop, sharing access to the single cache directory.]

Explanation of a few key terms and concepts will aid in understanding the HBSM multiprocessor and the cache coherence protocol. A block of memory corresponds to a contiguous group of memory locations, generally between 2 and 32 words. All blocks in the shared memory address space reside either in one of the cluster memory modules or in the global memory module. A block is remote with respect to a given cluster if it resides in the memory module of another cluster, or in global memory. A block is local with respect to a given cluster if it resides in the memory module of that cluster. Similarly, clusters are said to be local or remote with respect to a given block. Devices will also be said to be local or remote with respect to another device based on whether the devices are on the same or different clusters. Caches hold copies of memory blocks. A cache block is the same thing as a cache line.
Blocks held by caches on remote clusters are said to be held remotely, and blocks held by caches on the local cluster are held locally. Copies of blocks can be either unmodified, if the data in the cache and the data in memory are identical, or modified, if the data differ. The MOESI [53] conventions for labeling the states are followed for the CCs. For the dual-bus snoop devices, the CCC and the CMC, the MOESI state names have been extended in a consistent fashion. Some states are said to be intervenient, which means that a cache with a block in such a state carries the responsibility for supplying the latest version of this block to other caches in the system. Cache controllers with blocks in intervenient states must perform a write-back of the block to memory if they wish to remove the block from their cache.

3.2.1 Cache Controller (CC)

Each CC has two independent components which utilize the same cache directory. The first component services requests from the processor and implements the cache replacement algorithm. The second, snooping component monitors the cluster bus and takes coherence actions based on the contents of the cache directory and the bus transactions. Directory locking between components of the CC is done on a per-block basis. This scheme could be implemented in hardware with a dual-ported memory. Table 3.1 defines the states of blocks in the CC. The intervenient states Modified and Owned engender two responsibilities for the CC. First, read requests for such a block are satisfied by the snooping component of the CC. And second, such blocks must first be written back to memory if they are to be replaced in the cache. The CCC shares the responsibility for these intervenient states with the CC by monitoring requests on the global bus for these blocks.

  CC State  | Meaning
  Invalid   | block is invalid or not present
  Shareable | block may be shared by other local or remote caches and be unmodified, or block may be modified and another local cache holds the block Owned
  Modified  | block is modified and is held exclusively by this cache
  Owned     | block is modified but may be shared with other local caches

Table 3.1: The state of blocks in the CC.

3.2.2 Memory Controller (MC)

MCs are dumb devices which monitor their respective buses and respond to read and write requests for blocks within their memory modules. An MC can be inhibited from responding to a bus transaction by a signal generated by a CC, CCC, or CMC.

3.2.3 Cluster Cache Controller (CCC)

The CCC monitors transactions on both the cluster and global bus and takes action according to the state of the referenced block as found in the CCC directory. The possible states of a block in the CCC are described in Table 3.2. Figure 3.3 illustrates the meaning of each state in terms of the fundamental attributes of a block from the perspective of the CCC. A block which is present on the cluster is held by at least one cache. Note that blocks in the Shareable Unmod and Cluster Exclusive states may have their state cached in the CCC while they in fact are not held by any caches on the cluster. This can occur when blocks that are held Shareable by CCs are replaced. The CCC has no way of determining when unmodified blocks have been removed from caches.

The main function of the CCC is to isolate the coherence traffic on the local cluster bus from the traffic on the global bus, while ensuring that coherence is maintained among clusters. The CCC maintains the status of every block held in the Cache Controllers of its local cluster bus.
It is important to note that the CCC does not hold the actual cached data, but caches only the status of those blocks. CCCs also control the response of the global memory controller by preempting it from responding whenever one of their local caches has the most recent version of a global block. When a CCC must respond to a global bus transaction by taking some action on its local cluster bus, it may preempt an existing cluster bus transaction which requires the global bus, thus avoiding deadlock.

  CCC State         | Meaning
  Invalid           | no local cache holds a copy
  Shareable Unmod   | block is unmodified and either remote or local to this cluster, may be held Shareable by zero or more local caches; remote caches may also hold the block Shareable
  Cluster Exclusive | local block is unmodified and exclusive to this cluster, may be held Shareable by zero or more local caches
  Cluster Modified  | block is modified and exclusive to this cluster; one of the local caches holds the block Modified or Owned

Table 3.2: The state of blocks in the CCC.

[Figure 3.3: Meaning of the block states of the CCC.]

Global bus read requests for blocks held in state Cluster Modified by the CCC are serviced by the CCC, which issues a cluster bus flush for the modified block. CCC Gsnoops are also responsible for translating global bus invalidates into cluster bus invalidates for blocks held Shareable Unmodified. On the cluster bus side, the CCC Csnoop translates write notices into global bus invalidates for blocks that may be held in other clusters.

Since the CCC is organized as a set associative cache, and must hold the status of all blocks on the cluster, the CCC replacement policy will generate invalidate requests on the cluster bus when it must replace a block in its directory. As viewed by the CCs, these replacements are in addition to the replacements generated by the CCs' own policy. When the CCC replaces a modified block, the CC which holds this block Modified or Owned is responsible for writing the block back to memory and then invalidating it. CCs holding the block Shareable simply invalidate the block. Read requests from either the global bus or the cluster bus for blocks with write backs pending in CCs must be intercepted by the CCC and aborted. Checking the queue of pending transactions against incoming transactions is one of the complexities that arises in the dual-bus snoop devices. Selecting blocks to replace in the CCC would best be done in the following order of increasing cost in performance: Invalid, Cluster Exclusive, Shareable Unmodified, and Cluster Modified. In the worst possible case the CCC may select a Cluster Modified remote block to replace. The complexity of implementing an intelligent replacement policy must be compared to the cost in performance of a random replacement policy, which is simple to implement.

3.2.4 Cluster Memory Controller (CMC)

The role of the CMC is to present the illusion of a single, virtual address space to the processors on its local cluster so that requests for a memory location can be made and serviced without knowledge of the resident cluster of the block. To do this, the CMC maintains state information for the blocks in the cluster memory of its local cluster; thus, the CMC always contains state information for the same blocks. This state information enables the CMC to prevent its local Memory Controller from responding to requests for invalid blocks.
The CMC is also responsible for responding to local cluster read requests to remote memory modules. Table 3.3 gives the possible states of a block in the CMC. Figure 3.4 illustrates the meaning of each state in terms of the fundamental attributes of a block from the perspective of the CMC. Keep in mind that for the CMC, all states (except Remote, of course) are for local blocks. Thus, for example, Invalid Locally could be more fully labeled Local-Invalid-Held Locally. The names have been shortened for ease of use.

  CMC State         | Meaning
  Remote            | block resides in a remote memory module; no state is kept in the CMC for remote blocks
  Valid             | no modified copies of the block exist; zero or more local or remote caches may hold copies
  Cluster Exclusive | no modified copies of the block exist; zero or more local caches may hold copies
  Invalid Locally   | local cache has a modified copy of the block
  Invalid Remotely  | remote cache has a modified copy of the block

Table 3.3: The state of blocks in the CMC.

[Figure 3.4: Meaning of the block states of the CMC. Remote blocks: no state info kept. Local blocks: memory is valid (held locally or held remotely) or memory is invalid (held locally or held remotely).]

To support the state of Cluster Exclusive in the CCC, the CMC must at certain times inform the CCC of the exclusiveness of the block to that cluster. To perform this duty the CMC has to distinguish between those blocks checked out by remote caches and those held exclusively in local caches. This suggests the distinct states of Valid and Cluster Exclusive in the CMC. Since blocks may be invalid in the cluster memory by being held modified either by a local cache or a remote cache, the invalid state of the CMC must be divided into two distinct states, called Invalid Locally and Invalid Remotely. For example, if a modified block is held by a remote cache (Invalid Remotely in the CMC), then upon receipt of a read request from a local cache, the CMC must relay the read request onto the global bus, while in the case of a local cache holding the block modified (Invalid Locally in the CMC), no actions by the CMC are required.

The Gsnoop side of the CMC responds only to global bus transactions for resident blocks; no actions are taken by the Gsnoop for remote blocks. Global bus read requests for clean resident blocks are translated by the Gsnoop into cluster bus read requests. Read requests for modified blocks are ignored by the CMC and are handled by the CCC. Global bus write backs to resident blocks are translated into cluster bus write backs by the CMC. The Csnoop side of the CMC translates cluster bus requests for remote blocks into corresponding global bus requests. In addition, read requests for modified local blocks which are held remotely are serviced by the CMC. So that the other dual-bus snoop device, the CCC, does not need data paths to the cluster and global buses, the CMC acts in cooperation with the CCC on some transactions, relaying data from the cluster bus to the global bus. For example, a block held modified in a cluster is relayed to the global bus by the CMC Csnoop as a response to the CCC-generated cluster bus flush.

3.2.5 CCC/CMC state mapping

To help explain the relationship between the dual-bus snoop components of the HBSM architecture, Table 3.4 shows the mapping of states in the CMC to the CCC and CC for a given cluster. Table 3.5 shows the inverse mapping of states in the CCC to the CMC and CC for a given cluster. In these tables, AND, OR, NOT mean respectively the logical and, or, and not operators.
The + symbol means zero or more caches may hold the block in this state. The state of Invalid (absence of a block) is deleted for clarity in the CC column. For each cluster the CC column represents the possible states of multiple CCs. For example, in the second row of Table 3.5, a block that is Cluster Mod in the CCC implies that the block is either held Modified by one CC, or that it is held Owned by one CC and Shareable by zero or more other CCs.

  CMC state    | CCC state                      | CC states
  Remote       | Share Unmod OR Inv OR Clus Mod | Share+ OR Mod OR (Owned AND Share+)
  Valid        | Share Unmod OR Invalid         | Share+
  Cluster Excl | Clus Excl OR Invalid           | Share+
  Inv Locally  | Clus Mod                       | Mod OR (Owned AND Share+)
  Inv Remotely | Invalid                        |

Table 3.4: Mapping of CMC states to CCC and CC.

  CCC state    | CMC state             | CC states
  Invalid      | NOT Inv Locally       |
  Shar Unmod   | Valid OR Remote       | Share+
  Cluster Excl | Clus Excl             | Share+
  Cluster Mod  | Inv Locally OR Remote | Mod OR (Owned AND Share+)

Table 3.5: Mapping of CCC states to CMC and CC.

3.3 Interference Events

The asynchronous relationship between the global bus and the cluster buses produces a number of troublesome events which can occur within the dual-bus snoop devices. These events are referred to as interference events. Interference events are the nearly simultaneous presence, or collision, of transactions on the global bus and a cluster bus. These transactions interfere with each other at the interface of those buses, that is to say, within a CMC or CCC.

Transactions which cause interference events can involve the same block or different blocks. Since both the Csnoops and Gsnoops of the dual-bus snoop devices access a single directory, a locking policy for these directories is required to reduce the possible combinations of same-block interference events that can occur. The locking of these directories is done by address on a per block basis; that is, simultaneous directory accesses are disallowed only for the same block. For different blocks, two accesses can proceed concurrently. The locking policy of the CCC and CMC directories reduces the number of possible same-block interference events that must be considered in the design of these devices to a manageable number, by forcing the collision of each possible cluster bus transaction with each possible global bus transaction to occur in only two forms. In the first instance, the Csnoop gains a lock on the directory and the Gsnoop is locked out; in the second, the Gsnoop gains a lock on the directory and the Csnoop is locked out.

Since the directories of the dual-bus snoop devices are locked by block, interfering transactions which involve the same block will result in directory deadlock. Interference events resulting in directory deadlock can be classified as either coherence events or benign events. When the interfering transactions involve different blocks, we call the interference event a collision event.

3.3.1 Collision events

Collision events result in circular wait deadlock in the CCC and/or the CMC. When a dual-bus snoop device is required to translate a global bus transaction into a cluster bus transaction, the Gsnoop of the device issues an arbitration request to the current cluster bus master. The Gsnoop then waits to be granted the bus, allowing any ongoing cluster transaction to finish. The higher priority of the dual-bus snoop devices over the CCs ensures that the CMC/CCC Gsnoop will receive the bus on the next transaction.
However, if the ongoing cluster bus transaction requires use of the global bus, then the Csnoop of either the CCC or the CMC will issue an arbitration request for the global bus. This situation is a circular wait deadlock, since the current global bus transaction cannot finish without using the cluster bus, and the current cluster bus transaction cannot finish without using the global bus. The Gsnoops of the CCC and CMC detect collision events by monitoring the global bus request queue of both the CCC and CMC Csnoops. Since collision events are different-block interference events, they cannot violate the coherence of the system, so the situation is resolved by giving the global bus transaction priority; the cluster bus transaction is aborted. This policy is consistent with the principle of aborting the transaction which originated latest in time, and it also decreases the traffic on the global bus.

3.3.2 Benign events

Benign interference events are those for which directory deadlock exists (because the competing transactions involve the same block) but the effects of the two transactions are compatible with each other and hence do not destroy the coherence of the system. These cases are handled simply by temporarily unlocking the locked directory, allowing the snoop which is in directory deadlock to continue its operations. The snoop which originally obtained the directory lock waits for the transaction on the other bus to finish, then relocks the directory while it continues its own transaction.

As an example of a benign event, consider the following case. A read request occurs on the global bus for a block which is in state Valid in its CMC. The Gsnoop of the CMC locks the directory of the CMC. At nearly the same time, a cache on the same cluster as the CMC makes a cluster bus read request for this block. Since the CMC directory is locked, the cluster bus transaction cannot be responded to by the Csnoop of the CMC. This benign interference event would result in permanent deadlock if the CMC were not capable of detecting the situation. The Gsnoop temporarily unlocks the CMC directory, allowing the cluster bus transaction to finish properly. The next transaction on the cluster bus is the CMC issuing a cluster bus read request to satisfy the pending global bus read request. Unlocking the directory temporarily is benign because there is no state change in the CMC for a read request to a Valid block.

3.3.3 Coherence events

Coherence events involve a cluster bus transaction and a global bus transaction which are mutually incompatible from a coherence standpoint. One of the two transactions must be aborted. As an example of a coherence event, consider the case when two caches on different clusters are sharing a block in an unmodified state. If both caches write to this block at nearly the same time, one of the CCCs of the active clusters will win arbitration and broadcast the global bus invalidate transaction before the other. In the second CCC (the one with the pending global bus invalidate transaction), the directory will have been previously locked by the Csnoop for the cluster bus write notice transaction. Since these transactions are mutually incompatible with maintaining coherence between clusters, the write notice transaction on the second CCC's cluster bus will be aborted. Aborting transactions on the cluster or global buses poses difficulties in the design of the CCC and CMC.
These devices must monitor the bus abort signal and be able to recover when an abort occurs. When a device is the perpetrator of an abort, it must either assume bus mastership for the next transaction, or allow the current bus master to pass mastership to one of the outstanding arbitration requesters. Devices which have initiated an aborted transaction should be persistent in re-trying the transaction.

3.3.4 CMC and interference events

The CMC is not responsible for generating transaction aborts during coherence and benign interference events; it does need to be able to react to them, however. CMCs which are the initiator of global bus read requests must re-try those transactions if the initial attempt ends in a global bus abort. In addition, CMCs must be able to reset themselves at certain stages of reacting to an aborted transaction. The CMC detects collision interference events by monitoring the global bus request queues of its own Csnoop and the CCC Csnoop. When a collision event is detected the cluster bus transaction is aborted.

3.3.5 CCC and interference events

The CCC is responsible for generating aborts for every coherence-type interference event. An example of such an event is a global bus read request for a modified block and a cluster bus write back to that same block. In this case the CCC aborts the global bus read request, allowing the write back of the modified block to proceed. The two forms of this interference event, either the Csnoop locks the directory first or the Gsnoop locks the directory first, will be described in more detail to indicate the actions taken by the CCC and the problems that may arise in coherence events.

If the CCC Gsnoop decodes the global bus transaction and locks the directory first, the status of the block read by the Gsnoop when parsing the transaction will be Modified. Decoding the transaction puts a lock on the CCC directory. When the Csnoop attempts to read the status of the block from the directory to respond to the cluster bus transaction, deadlock will occur, since the Gsnoop holds the CCC directory lock. The Gsnoop detects this deadlock and resolves it in the following way. If the block is remote to this cluster, then a write back requires the global bus, and an abort of the global bus read request is mandatory. However, if the block is local to this cluster then the write back does not require the global bus, and it may seem that the global bus read request can simply wait for the write back to finish, as in a benign interference event. The reason that this is not acceptable is that the Gsnoop has already locked the directory and issued a flush for the modified block. If the cluster bus write back were allowed to finish, the Gsnoop would then issue a flush for a clean block, which is a violation of the COGI protocol. Thus in both cases, for both local and remote blocks being written back, the Gsnoop must abort the global bus read request.

The other form this coherence event can take is the situation where the CCC Csnoop decodes its transaction and locks the directory first and the Gsnoop is locked out of the directory. For local blocks it would be possible for the write back to finish and then let the Gsnoop proceed in processing the global bus read request. However, write backs to remote blocks require use of the global bus, so the global bus read request must be aborted.
Since the CCC performs its functions without distinguishing between local and remote blocks, both forms of this coherence event must be treated in the same way, and thus in both cases the global bus read request is aborted. Table 3.6 shows the coherence interference events which are detected and managed by the CCC. To simplify the controllers, both aspects of the coherence events (Csnoop locks directory first or Gsnoop locks directory first) are handled in the same manner.

Blocks replaced by the CCC may be the cause of coherence events. When the CCC forces a CC to purge a modified block (due to the CCC's replacement policy) the CC will hold this block in a pending write back state. Coherence events for blocks held in limbo awaiting replacement are marked with purge in Table 3.6. Read requests from either the global bus or the cluster bus for these blocks must be intercepted by the CCC and aborted. Other transactions may intervene on the cluster bus between the invalidate transaction and the CC's write back of the modified block.

  Cluster trans | Global trans | CCC state                | Action
  write back    | read request | Cluster Modified         | abort global trans
  write notice  | read request | Shareable Unmod          | abort global trans
  write notice  | invalidate   | Shareable Unmod          | abort cluster trans
  read request  | invalidate   | Shareable Unmod          | abort cluster trans
  read request  | -            | Cluster Modified (purge) | abort cluster trans
  -             | read request | Cluster Modified (purge) | abort global trans

Table 3.6: Coherence events detected by the CCC and their resolution.

3.4 Bus Arbitration

Bus arbitration is distributed amongst potential bus masters. The current bus master advertises for arbitration requests and chooses the bus master elect for the next transaction. A bus master retains mastership, and hence responsibility for arbitration, until the bus is handed off to another bus master. Once selected, a bus master elect can be replaced only by a bus abort with bus master preempt. On a cluster bus, CCs have equal arbitration priority, and fairness is achieved among competing CCs by preventing a CC which has used the bus from participating in arbitration again until all outstanding requests have been satisfied. The priority of CCCs and CMCs is higher than that of CCs to minimize the amount of time the global bus is held by these devices. If no requests are received by the arbiter during arbitration for the next bus mastership, the CCs which are locked for fairness are released and again allowed to participate in arbitration. CCCs and CMCs do not have fairness restrictions for bus arbitration. Since the global bus is the most valuable resource of the entire system, the CCCs and CMCs may request the cluster bus at any time. On the global bus, CCC Gsnoops and CMC Gsnoops compete equally with a fair arbitration locking scheme identical to that for CCs on the cluster bus.

Chapter 4
COGI Cache Coherence Protocol

The definition of the Cluster Ownership Global Invalidation (COGI) cache coherence protocol is contained in the tables of this chapter. Using tables to define the protocol is a more structured method of displaying information which is often described using state transition graphs. The meaning of the protocol states for each of the active controllers is found in Chapter 3. For a more intuitive understanding of the COGI protocol, this chapter includes a lengthy example of intercluster sharing transactions.
4.1 Design of COGI Protocol

The Cluster Ownership Global Invalidation (COGI) cache coherence protocol has been designed to match the architecture of the HBSM multiprocessor. The COGI protocol is compatible with the MOESI class of cache coherence protocols [53]. On the cluster buses, COGI is a write-back/write-update protocol that allows dirty sharing in a manner similar to the Dragon protocol [30]. A bus line is used so that caches can know when other caches share a block. This allows two modified states to exist, one in which a cache has an exclusive copy of a block, and hence may write into the block without notifying other caches, and another where a cache is sharing a modified copy of a block with other caches and hence must broadcast updates to this block when it writes into it. These broadcasts are known as write notices and do not update memory, allowing for efficient sharing of data within a cluster. Blocks may be "checked out" from memory for writing, and shared among processors without being written back to memory until replaced. This efficient intracluster sharing is important, since in the applications suitable to the HBSM multiprocessor, the majority of processor memory accesses are destined for local memory modules. A block is never present in a modified state in more than a single cluster at a time.

Unlike Dragon, in the COGI protocol caches must broadcast their first write into a block. This write informs the second level cache controller (CCC) of the modified nature of the block. As a result of this requirement, the COGI protocol cannot use a "valid exclusive" state for the processor caches. In Dragon, the valid exclusive state allows caches to write to blocks without any bus traffic; a valid exclusive block which is written moves to the dirty state, indicating the cached copy is modified and exclusive. After the first write to a block the COGI protocol performs as well as Dragon.

The global bus protocol of COGI is a simple write-back/write-invalidate type. To reduce traffic on the global bus, intercluster write sharing of data is limited to a series of read/invalidate/write actions. A cache secures an unmodified copy of a block with a read request. When this cache first writes to the block, the CCC invalidates copies of the block on remote clusters, thus avoiding the need for the broadcast of multiple write notices on the global bus. Although this decision discourages sharing of modified copies among clusters, it has no effect on sharing of unmodified blocks. In addition, CCCs are able to recognize local blocks held exclusively on their own cluster, to avoid the necessity of broadcasting global bus invalidations for blocks not shared with other clusters. The abbreviations used for bus transactions and the general meaning of these transactions are given in Tables 4.1 and 4.2. The various types of bus transactions are listed, and common (though not exhaustive) examples of their use are given.

4.2 COGI Protocol Tables

Tables 4.3–4.9 define the COGI protocol. Each row of the tables represents a state that a block can be in in that particular device. The columns represent various bus transactions, processor requests, and replacement actions. The tables are used by finding the intersection of the column corresponding to the transaction of interest with the current state of the block. Each cell of the table shows the actions taken by the device for this transaction in the top part, while the bottom part gives the change in state of the block.
The notation "sigline ? action1 : action2" is used to indicate that action1 is taken if sigline is raised and action2 is taken otherwise, where sigline can be either of the two signal lines CSHL or REML. The notation ∅ signifies that the related transactions cannot occur for a block in that state. The phrases "Read block cbus" and "Read block gbus" mean that the controller copies a block off the respective bus. The phrases "Write block cbus" and "Write block gbus" mean that the controller writes a block to the respective bus as the response to a bus transaction request. The phrase "Read word" means that the controller copies a single word from the bus to update a cache block.

The signal lines required on cluster buses are the CSHL (Cluster SHared Line) and REML (REMote Line). Protocols based on write broadcast require an extra signal line, called the Shared Line, for the exchange of information regarding the shared status of blocks. Since the CCCs cannot distinguish the locality or remoteness of addresses, the CMCs use a second bus line, called REML, to inform them of the local or remote nature of blocks in a read request.

  Abbreviation | Transaction              | Example
  CBRR         | Cluster Bus Read Request | generated by cache misses
  CBWN         | Cluster Bus Write Notice | issued to update caches which are sharing a modified block
  CBWB         | Cluster Bus Write Back   | generated by processors replacing a modified block
  CBIN         | Cluster Bus INvalidate   | issued by the CCC as a result of either a CCC replacement or a GBIN transaction
  CBFL         | Cluster Bus FLush        | used by the CCC to force the cache holding a modified block to provide that block for a GBRR
  PR           | Processor Read           | read one word of a block
  PW           | Processor Write          | write one word of a block
  Purge        | block replacement        | block is selected for replacement by CC

Table 4.1: Transaction types on the cluster bus and processor events.

  Abbreviation | Transaction             | Example
  GBRR         | Global Bus Read Request | occurs when a cache controller issues a read for a block which is owned by a device not on the same cluster
  GBWB         | Global Bus Write Back   | occurs when a cache controller replaces a block remote to its cluster that has been held modified
  GBIN         | Global Bus INvalidate   | occurs when a cache controller issues a write to a block that had previously been shared between clusters

Table 4.2: Transaction types on the global bus.

  State     | Purge          | PR                                | PW
  Invalid   |                | CBRR; Read block cbus; →Shareable |
  Shareable | →Invalid       |                                   | CBWN; CSHL? →Owned : →Modified
  Modified  | CBWB; →Invalid |                                   |
  Owned     | CBWB; →Invalid |                                   | CBWN; CSHL? — : →Modified

Table 4.3: Actions of CC in response to processor requests. Processors make memory access requests (loads/stores, or reads/writes) to the cache controllers. The cache controller services the memory access request from the cache if possible (cache hit) and by making a bus transaction if not possible (cache miss). All blocks are loaded into the cache on a cache miss in the Shareable state. Writes to a Shareable block result in the block changing to a dirty state, either Owned or Modified, depending on whether other caches in this cluster have copies of the block. The Purge column corresponds to actions taken by the cache controller, CC, when a block is selected for replacement.
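A minimal sketch of the processor-write side of Table 4.3 is given below. It is not part of the thesis: the bus helpers and the cshl argument (the sampled state of the Cluster SHared Line after the broadcast) are assumptions, and the write-miss path from Invalid is also an assumption, since that cell of the table is not reproduced above.

/* Sketch of the CC's response to processor writes, following Table 4.3.
 * Bus helpers are hypothetical stand-ins for real cluster bus transactions. */

enum cc_state { CC_INVALID, CC_SHAREABLE, CC_MODIFIED, CC_OWNED };

void cbrr(unsigned long blk);   /* cluster bus read request             */
void cbwn(unsigned long blk);   /* cluster bus write notice (one word)  */

enum cc_state cc_on_processor_write(enum cc_state s, unsigned long blk, int cshl)
{
    switch (s) {
    case CC_INVALID:                 /* assumed write-miss path: load the  */
        cbrr(blk);                   /* block first, then treat it as a    */
        /* fall through */           /* write to a Shareable block         */
    case CC_SHAREABLE:
        cbwn(blk);                   /* first write must be broadcast      */
        return cshl ? CC_OWNED : CC_MODIFIED;
    case CC_OWNED:
        cbwn(blk);                   /* dirty sharing: update other copies */
        return cshl ? CC_OWNED : CC_MODIFIED;
    case CC_MODIFIED:
        return CC_MODIFIED;          /* exclusive dirty copy: no bus traffic */
    }
    return s;
}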
  State     | CBRR                                                   | CBWN                              | CBWB | CBIN           | CBFL
  Shareable | Raise CSHL                                             | Read word; Raise CSHL             | ∅    | →Invalid       |
  Modified  | Inhibit memory; Write block cbus; Raise CSHL (1); →Owned | ∅                               | ∅    | CBWB; →Invalid | Write block cbus; →Shareable
  Owned     | Inhibit memory; Write block cbus; Raise CSHL           | Read word; Raise CSHL; →Shareable | ∅    | CBWB; →Invalid | Write block cbus; →Shareable

Table 4.4: Actions of CC in response to cluster bus transactions. The other responsibility of the CC cache controller is to monitor transactions on the cluster bus and take actions to preserve cache coherence. The CC is responsible for supplying dirty blocks to cluster bus read requests, and also for updating copies of blocks that are being dirty-shared with other caches. The cluster bus invalidate and cluster bus flush transactions are generated by the cluster cache controller CCC. A CC must respond to these transactions by performing a write back of the block, and by invalidating the block, respectively.

(1) The raised CSHL notifies the CMC that the remote block is being supplied by a local CC and there is no need for a global bus read request.

  State        | CBRR                                                            | CBWN     | CBWB                                   | CBIN  | CBFL
  Remote       | CSHL? — : (GBRR; Read block gbus; Write block cbus; Raise CSHL) | ∅        | GBWB; Read block cbus; Write block gbus | ∅     | ∅
  Inv Locally  | Inhibit memory                                                  | ∅        | →Cluster Excl                          | ∅     | Read block cbus; Write block gbus; →Valid
  Inv Remotely | Inhibit memory; GBRR; Read block gbus; Write block cbus; Raise CSHL; →Valid | ∅ | ∅                                | ∅ (1) | ∅
  Valid        | Raise CSHL                                                      | →Inv Loc | ∅                                      | ∅     | ∅
  Cluster Excl |                                                                 | →Inv Loc | ∅                                      | ∅     | ∅

Table 4.5: Actions of CMC in response to cluster bus transactions. The CMC maintains the transparency of a physically distributed memory. On the cluster bus, it is responsible for servicing read requests for remote blocks and for local blocks that are held dirty on some remote cluster. The CMC acts to inhibit the MC from responding for a block by marking blocks that have been modified as invalid. Read requests for Invalid blocks are serviced by either a cache with a dirty copy (Invalid Locally) or by the CMC (Invalid Remotely). In either case the CMC inhibits the MC from responding to the read request.

(1) This scenario happens when a remote CCC is the originating source of a GBIN transaction. The local CCC has the block in state Valid, observes the GBIN and generates a CBIN. In the meantime, the local CMC Gsnoop observes the GBIN and changes state to Invalid Remotely. The CMC cluster snoop observes the CBIN transaction when it appears on the cluster bus, but the block is held in state Invalid Remotely this time.

  State        | GBRR                                            | GBWB                          | GBIN
  Inv Locally  | ∅                                               | ∅                             | ∅ (1)
  Inv Remotely | Read block gbus; CBWB; →Valid                   | Read block gbus; CBWB; →Valid | ∅
  Valid        | CBRR; Raise REML; Read block cbus; Write block gbus | ∅                         | →Inv Rem
  Cluster Excl | CBRR; Raise REML; Read block cbus; Write block gbus; →Valid | ∅                 | ∅

Table 4.6: Actions of CMC in response to global bus transactions. The global snoop of the CMC monitors the global bus and responds to requests for blocks belonging to the local cluster memory. The CMC translates global bus read and write back requests into cluster bus read and write back requests.

(1) This scenario can happen because the local CCC might be the originating source of the global bus transaction. The local CCC observes a CBWN for a local block in state Valid and generates a GBIN. In the meantime, the CMC cluster snoop observes the CBWN as well and changes the block state to Invalid Locally. The CMC Gsnoop observes the GBIN transaction when it appears on the global bus, but the block is held Invalid Locally this time.
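A minimal sketch of the CMC global snoop reacting to a global bus read request, following the reconstruction of Table 4.6 above, is shown below. The bus helpers are assumptions; a real controller would drive the actual bus transactions, and the capture of a flushed block returning over the global bus is omitted for brevity.

/* Sketch of the CMC Gsnoop's response to a GBRR for a resident block.
 * Helper functions are hypothetical illustrations only. */

enum cmc_state { CMC_REMOTE, CMC_VALID, CMC_CLUSTER_EXCL,
                 CMC_INV_LOCALLY, CMC_INV_REMOTELY };

void raise_reml(void);                          /* mark the relayed request as remote */
void cbrr(unsigned long blk);                   /* cluster bus read request           */
void relay_block_to_global_bus(unsigned long blk);

enum cmc_state cmc_gsnoop_on_gbrr(enum cmc_state s, unsigned long blk)
{
    switch (s) {
    case CMC_VALID:                 /* clean resident block: fetch it on the   */
    case CMC_CLUSTER_EXCL:          /* cluster bus and relay it globally       */
        raise_reml();
        cbrr(blk);
        relay_block_to_global_bus(blk);
        return CMC_VALID;           /* a remote copy now exists                */
    case CMC_INV_LOCALLY:           /* dirty in a local cache: the CCC handles */
        return s;                   /* the request, the CMC stays idle         */
    case CMC_INV_REMOTELY:          /* dirty in a remote cache: the flushed    */
        return s;                   /* data would be captured later (omitted)  */
    case CMC_REMOTE:                /* not a resident block: ignore            */
        return s;
    }
    return s;
}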
  State        | CBRR                                         | CBWN               | CBWB                         | Purge
  Invalid      | REML? — : (CSHL? →Shar Unmod : →Cluster Excl) | ∅                  | ∅                            |
  Shar Unmod   | (1)                                          | GBIN; →Cluster Mod | ∅                            | CBIN
  Cluster Excl |                                              | →Cluster Mod       | ∅                            | CBIN
  Cluster Mod  |                                              |                    | CSHL? →Shar Unmod : →Invalid | CBIN

Table 4.7: Actions of CCC in response to cluster bus transactions. The cluster snoop of the CCC translates cluster bus write notices into global bus invalidates for blocks that may be shared by other clusters. After a first write to a Shared Unmodified block, the CCC does not send global bus invalidates, since it knows the block no longer exists in other clusters. During a cluster bus write back, the CCC will drop a block (set its state to Invalid) if no caches respond with the shared line. In general, however, the CCC has no way of knowing when caches drop blocks (via replacement), thus the CCC may cache copies of blocks which no longer exist on the cluster. These blocks will eventually be replaced by new, active blocks.

(1) This scenario starts with a read request on the global bus for a local block which is held in the Cluster Exclusive state of the CCC. The CCC Gsnoop observes this transaction on the global bus and changes the block state to Shareable Unmodified. The CMC Gsnoop relays the global bus request onto the cluster bus. When the local CCC cluster snoop observes the corresponding CBRR, the block is held Shareable Unmodified. Another scenario that is possible is a CBRR for the local block. In this case too, no state change is required.

  State        | GBRR                                       | GBWB | GBIN
  Invalid      |                                            |      |
  Shar Unmod   |                                            | (1)  | CBIN; →Invalid
  Cluster Excl | →Shar Unmod                                | ∅    | ∅
  Cluster Mod  | CBFL; Inhibit global memory; →Shar Unmod   | ∅    | ∅

Table 4.8: Actions of CCC in response to global bus transactions. On the global bus, the CCC translates invalidate transactions into cluster bus invalidates for blocks held on its cluster. Additionally, the CCC Gsnoop must force CCs to supply dirty blocks by flushing them back to memory. When a global bus read request to a dirty global block is made, the CCC must also inhibit the global memory module from responding.

(1) This scenario starts with a write back of a remote block held modified by a CC. Assuming that there are still caches who have the block in Shareable, the CCC cluster snoop changes the state from Cluster Modified to Shareable Unmodified. The CMC relays the write back over to the global bus by generating a GBWB. The CCC Gsnoop then observes a GBWB to a block in the Shareable Unmod state.

  Device | xxRR (1)                                                   | CBWN | xxIN (1) | xxWB (1)                                    | CBFL
  MC     | Inhibit? — : (Read block from memory, Write block onto bus) |      |          | Read block from bus, Write block into memory | Read block from bus, Write block into memory

Table 4.9: Actions of MC in response to bus transactions. Memory controllers are very simple devices. They monitor their respective buses for transactions that correspond to blocks they own, and unless inhibited by another device on the bus, act to service the transaction. They carry no state information for the blocks they own.

(1) xx can be either of the two notations: CB or GB.

4.3 Example of COGI Protocol

The following example of the operation of the protocol involves intercluster cache references and is shown in Figures 4.1 and 4.2. The numbered steps in the figures follow the description of the events in the text below. A cache miss on a P1,1 read request causes CC1,1 to broadcast a read request (CBRR) over Cluster Bus 1 (see Figure 4.1). Assuming that the block is resident locally and no other cache has a copy, MC1 responds to the request and provides the block.
CC1,1 reads the block off the cluster bus and sets its state to Shareable. CCC1 observes the read request on the cluster bus and checks the REML to make sure that the request is not generated remotely and relayed by CMC1 from the global bus. Next, it watches for CMC1 to raise the CSHL line to see whether the block is already being shared by other clusters. In the absence of both of these signals CCC1 creates space for the block state and sets it to Cluster Exclusive, meaning that it is an unmodified local block and no remote cache has a copy. A subsequent read request by P1,10 and a CBRR broadcast over the cluster bus by CC1,10 is answered again by MC1. CC1,10 loads the block off the bus and sets its state to Shareable. As the next event of interest, P1,10 writes into the block, causing CC1,10 to broadcast the modified portion of the block over the cluster bus as a CBWN and to change the state to Owned. CC1,1, upon observing the write notice transaction, updates that particular word in its copy of the block, but keeps the block in the state Shareable and raises the CSHL, telling CC1,10 to retain the block Owned. CCC1, upon observing the modification to the block, changes the state of the block to Cluster Modified while CMC1 changes it to Invalid Locally, indicating that some local cache has a modified copy of the block. At this point, the two caches CC1,1 and CC1,10 hold the block Shareable and Owned, respectively. The CCC on Cluster 1 holds the block status as Cluster Modified, indicating a modified version of the block is exclusive to this cluster.

[Figure 4.1: Basic scenario of intracluster sharing of a locally resident block. Processor P1,1 reads the block; P1,10 reads then writes the block, passing the updated version of the data word to P1,1. CCC: Cluster Cache Controller; CC: Cache Controller; P: Processor; CMC: Cluster Memory Controller; MC: Memory Controller; M: Memory.]

To introduce some intercluster sharing, consider what happens when P2,1 performs a read causing a read miss in its cache (see Figure 4.2; due to space limitations, the global bus is shown in two sections in Figure 4.2). To get the block, CC2,1 generates a read request over Cluster Bus 2. Since the block is remote to this cluster, CMC2 monitors the CSHL to make sure that no local cache is responding to the read request. Since the block is not held modified by any of the caches on Cluster 2, CMC2 broadcasts
It also raises the CSHL to signal to CCC2 that the block is resident remotely and might be shared with caches on other clusters. CCC2, upon observing the CSHL signal from CMC; indicating that the block is shared with other clusters and/ or owned remotely sets the status of the block to Shareable Unmodified. At the same time CC2,1 reads the block off the cluster bus and sets its state to Shareable. At this point the block exists in three caches, CC”, CCHO and CC“ and is in the Shareable state in all three. Now consider a write to the block by Pm. CCQJ broadcasts the written word over the cluster bus (C BWN), and monitors the CSHL line. Due to a lack of response on the CSHL line, CC“ changes the state of the block to Modified, indicating that no other local caches are sharing the block. \Vhen CCC2 observes the write notice on the cluster bus it broadcasts an invalidation transaction (GBIN) on the global bus and changes its state to Cluster Modified. CCC; observes the invalidate transaction, relays it to Cluster Bus 1 as a CBIN transaction, and sets the state of the block to Invalid. CC” and CCMO invalidate their copies of the block upon observing the invalidate transaction on the cluster bus. CMC1 changes the block state to Invalid Remotely upon observing the GBIN on the global bus. 56 Global Bus ® CBIN Cluster Bus Supply a make) 0.. ................................. Global Bus 6) GBlN (Deena Cluster Bus F’2,1o M2 Figure 4.2: Continuation of the COGI protocol example transaction. Processor P11 reads then writes the same block that is shared with Cluster 1. 4.4 Other Protocols Archibald describes a cache coherence protocol for a two-level hierarchical bus multi- processor in [3]. The architecture assumed by Archibald is different from the HBSM architecture by not distributing memory amongst the individual cluster buses. Cen- tralizing memory on the single global bus seems destined to make the global bus a bottleneck. The cache coherence protocol for the cluster buses requires six states in the processor cache controllers. Nearly doubling the number of meaningful states (vis—‘a—vis the COGI protocol) allows Archibald’s protocol to distinguish blocks held read-only, and blocks held only on one cluster. While this information can be put to good use for minimizing the effect on performance of coherence traffic, the complexity of the controllers which must implement the protocol would most likely be daunting. Archibald’s global bus protocol is very similar to COGI. The number of states is the same, and the information conveyed is equivalent. The controller for the second level caches in Archibald’s system is less complex than the CCC/CMC of the HBSM since memory is concentrated on the global bus. As part of Encore’s Ultramax project [8], Wilson has developed a cache coherence protocol for a two-level hierarchical bus multiprocessor [56]. This protocol is an extension of Goodman’s Write Once protocol, suitably modified for a hierarchical bus architecture. A limitation of this approach is that. Write Once does not support dirty sharing of data. Thus heavy intracluster read / write sharing of data blocks requires that caches alternately acquire and invalidate a. single copy of a shared block. As shown in [2], Write Once is inferior to protocols which support efficient dirty sharing, as does COGI, particularly when the level of sharing is high. Chapter 5 Multiprocessor Address Traces Trace-driven simulation is an important technique for performance evaluation of mul- tiprocessor computer systems. 
Traces are difficult to obtain for existing multiprocessors because of their relative scarcity and the complexity and cost of the address tracing process. In order to evaluate the performance of the HBSM multiprocessor, and as an aid to debugging the COGI cache coherence protocol, a technique for generating architecture-independent multiprocessor data address traces on a widely available RISC uniprocessor was developed [5].

5.1 Introduction

Computer architects use modeling and simulation techniques to evaluate the performance of competing design solutions before physically constructing new machines. Adequate performance evaluation of the HBSM multiprocessor memory hierarchy required a trace-driven simulation approach, since the performance of the cache and memory controllers, the cache coherence protocol, and the processor-memory interconnection buses are highly dependent on the memory access patterns of parallel applications. In addition to acting as the input to a multiprocessor simulator, address traces are also useful for characterizing the memory access patterns of parallel applications for use in analytical models, such as queuing networks. Other issues which can be studied with multiprocessor traces include interconnection networks, memory paging strategies, processor instruction mixes and branching characteristics.

Multiprocessor address traces are obviously not available in the design phase of new architectures; instead, address traces from existing multiprocessors must be used. Unfortunately, since access to multiprocessors is still relatively limited and existing tracing techniques consume a large amount of the multiprocessor resource, address traces from real applications on existing multiprocessors are difficult to obtain. Another potential problem is that address traces from existing applications may not be independent of the architectural features of the machine used to generate the traces. For example, accesses to synchronization variables will depend on the relative timing between processes, and on the speed and design of memory subsystem components. If such accesses are recorded in address traces, then using those traces to study new architectures is problematic.
Architecture-independent address traces are the natural result of separating the tasks of trace generation and multiprocessor simulation. This dichotomy is necessary when the performance evaluation of a multiprocessor memory system is being under- taken. With such traces, the information captured for each process can not include the fine-grained temporal relationship between the traces of each process since this re- lationship depends on the architecture and environment present when the traces were generated. Since the programming model assumed is one of static task allocation and barrier synchronization, the multiprocessor simulator into which the traces are fed creates the actual ordering of events. Bach process follows its own independent series of reads and writes until reaching a barrier synchronization event. At that time the multiprocessor simulator forces the process to wait for the other processes to reach the barrier before allowing it to continue. It is the resymnsibility of the multiprocessor simulator to generate accesses for synchronization variables. \Vithin the trace of a single process, the timing of the events is constructed from the count of the number 60 of instructions executed since the last data access. This information is recorded in the traces. Placing high level synchronization events in the trace and letting the multiprocessor simulator construct the accesses to syncln'onization variables and the temporal relationship between processes allows a single set of traces to be used to evaluate competing memory hierarchy designs. 5.1.1 Tracing techniques Several methods of gathering multiprocessor traces have been described in the lit- erature [52]. Dubois, et a1 outline the general methodology of trace generation as separate and independent of multiprocessor simulation [19]. The reported techniques fall into the broad categories of hardware, software, and simulation methods. Hardware monitors may be used to directly observe and record addresses of mem- ory references sent to off chip caches or to main memory. Disadvantages of hardware methods include the expense of the hardware monitors, and the lack of ability to mon- itor memory references satisfied by internal caches. As on-chip caches grow larger, this becomes more and more of a disadvantage. In a related approach, modification of microcode may be used to record all memory references generated. By modifying the microcode on a VAX 8350, the ATUM [47] method can trace parallel applications in- volving up to 4 processors with only a slow-down by a factor of 20. Operating system and multi-tasking references are captured in the traces. The small number of proces- sors and the fact that ATUM works only for the VAX architecture are limitations to this approach. For microprocessors which support the ability to interrupt the system after every instruction, a software module interrupt handler can be used to record the memory addresses generated in each instruction. This technique is used in [21] on a bus-based shared memory multiprocessor. The main disadvantage of tracing through interrupts is the slow-down of the parallel processor being traced; each CPU may be slowed by as much as a factor of 1000. PSIMUL is an instruction level MIM D multiprocessor emulator that has been used by IBM researchers to study memory reference behavior of parallel programs [48, 34, 6]. Simulating a multiprocessor to generate address traces is a very time consuming process; So reports a slow-down of a factor of 450 when generating traces. 
PSIMUL cannot trace privileged instructions, thus it may only partially capture operating 61 system references. Stunkel and ichs describe a tracing technique for multicomputers called TRAPEDS [51]. Implemented on an Intel iPSC/ 2, the TRAPEDS technique is similar to the Tracer in that it modifies the application code at the assembly language level and then collects traces during execution of the program. \Veinberger originally pro- posed the method of modifying application code at the assembly language level to record instruction counts at execution time [55]. The fundamental difference between TRAPEDS and Tracer is the focus of architectures (multicomputer versus multipro- cessor) and the fact that TRAPEDS must be run on the multicomputer itself. A slow-down by a factor of 50 is reported for TRAPEDS when traces are not collected on disk; recording traces further increases the slow-down because of the I/ O require- ments and the hypercube architecture of the iPSC / 2. Tango is a multiprocessor simulation environment that can be used to generate data address traces for a wide range of applications and architectures [17]. The philosophy of Tango is one of providing only as much accuracy as a user is willing to pay for in simulation time. To reduce the time required to simulate a multiprocessor, Tango executes compiled sections of application code whenever possible, rather than using the more common approach of single-stepping the simulated multiprocessor through an application. This method is claimed to be 100 to 1000 times faster than more traditional multiprocessor simulators. 5.2 The Traced Application The model assumed for the multiprocessor architecture is a very general one; multiple processors and memory modules are connected through a general interconnection network. The memory modules comprise a single shared memory address space as shown in Figure 5.1. Each process uses a portion of this shared address space as its private memory. Shared data can be placed in either the global shared portion of the address space or in the regular shared portion of the address space. The distinction between the private, regular shared, and global shared portions of the address space is made to facilitate the use of the generated traces in simulations of non-uniform memory access multiprocessor architectures. The meaning of the shared memory address space can be easily altered by modifying the memory mapping filter program. 62 raw virtual Memory i ) mapped simulator addresses mapping ] addresses Global shared write 3 153962 ? write 3 1280 Regu'a’ shared read N 4278189244 \) read N 16 Private proc N O o a read ALL 4278189170 Private proci read 1 70 : read 2 170 0 read 3 270 read N Private proc 1 (N-1)*100+70 -'(D CD—(Q: mom-sac.) mom'om Figure 5.1: Traced multiprocessor address space. Raw virtual references are Shown with resulting filtered references below them and indented. 63 The first step of the tracing process is to code the application in an extension of a high-level language. While the Tracer will work with any language for which process creation facilities and an adequate compiler are provided, the following discussion assumes the use of the C language. The application programs are written using the fork and wait primitives found in the UNIX operating system. The application to be traced is written in C following templates that aid the job of process creation and synchronization. The extensions to C allow each variable to be classified as either private, regular shared, or global shared. 
The declarations of these variables in the extended syntax are replaced by legal declarations and C function calls to record the type and virtual address of all variables at execution time. Tracing is controlled by setting a global trace attribution variable to the identity of the process to which traced events should be attributed. The trace attribution variable can also be set to a virtual process. Virtual process IGNORE turns off tracing, virtual process ALL attributes a traced event to all processes, and virtual process CONTROL attributes the traces to a control process. The tasks of setting the trace attribution variable and defining barrier synchronizations are done with the Tracer library functions shown in Table 5.1.

    attr_trace(proc)     Set the trace attribution variable to proc
    def_barrier(x,y)     Define a barrier synchronization for processes numbered x through y, inclusive
    barrier(proc)        Process proc reaches the barrier

Table 5.1: Tracer library functions for barrier synchronization and tracing.

The tasks of process creation and barrier synchronization are aided by using the code template shown in Figure 5.2. The first step in the code template is to turn tracing off with attr_trace(IGNORE). A barrier is then defined for all regular processes and for a control process. The control process is used to initialize global data structures and coordinate the work of the regular processes. The for-loop with the function barrier(proc) in the body causes each of the regular processes to reach the barrier and wait. Tracing is then attributed to the CONTROL process as data structures are initialized for the regular processes. After the data structures are ready, the CONTROL process reaches the barrier, which releases the regular processes. The next for-loop in the template forks a child process for every regular process, 1, ..., MAXPROC. The process identity of each child process is given by the loop variable proc. Each child process will generate traces which will be attributed to the process that it represents, hence the first thing a child process does is to set the tracing processor to proc. A parallel application to be traced can be written by following the sequential child process creation code structure of Figure 5.2. An alternative structure is to have the control process create all children and let them run in arbitrary order. Since the traced program runs on a uniprocessor, both of these methods will generate equivalent trace files. The sequential method simplifies multiple-process file I/O.

5.3 Trace Generation

Throughout this section, the traced program will be referred to by the name app.*, using the suffixes par, c, and s to refer to the traced application program at different stages. The first step of the tracing process is to translate the extended-syntax C language version of the application, app.par, into legal C source code, app.c. This translator is written in the AWK programming language. The Tracer next compiles the C source code of app.c into assembly language, app.s. Each of the steps in the process of tracing an application is shown in Figure 5.3. The output of each step is identified by a filename to the left of the transitions between boxes.

5.3.1 Expanding assembly language

The assembly language code in app.s is automatically modified to create exp.s. The UNIX development tools lex and yacc were used to create a program which reads SPARC assembly language and expands it to gather address traces at execution time.
The added code must not interfere with the original function of the program, so extensive saving/restoring of registers is done for every point at which the original program is modified. The RISC nature of the SPARC microprocessor limits the complexity of SPARC assembly language and simplifies the creation of the expander program. The main job of the expander is to add code to app.s at every occurrence of load and store instructions which records the following information at execution time:

    trace attribution process number
    virtual address
    value read from or written to memory at the virtual address
    delay since the last memory access

    int proc, pid;

    set_tracing_proc(IGNORE);
    def_barrier(CONTROL, MAXPROC);
    for(proc=1; proc<=MAXPROC; proc++)
        wait_at_barrier(proc);

    set_tracing_proc(CONTROL);
    init_data_strucs();
    wait_at_barrier(CONTROL);

    set_tracing_proc(ALL);
    for(proc=1; proc<=MAXPROC; proc++){
        if( fork() == 0 ){
            set_tracing_proc(proc);
            /* do the child's work */
            exit(0);
        }
        set_tracing_proc(IGNORE);
        if( (pid = wait(0)) == -1 )
            perror("wait");
        else
            printf("child has finished\n");
        set_tracing_proc(ALL);
    }
    set_tracing_proc(IGNORE);

Figure 5.2: C template for process creation and synchronization.

Figure 5.3: The steps of the Tracer technique: app.par (translate variable declarations) -> app.c (cc -S app.c) -> app.s (expand loads and stores to record memory accesses) -> exp.s (cc exp.s -l tracer) -> a.out -> raw trace file (perform memory mapping and virtual processor expansion) -> events file.

The code expansion at this step is roughly a factor of 10. Figure 5.4 shows five original lines of assembly code (at the top of the figure) which include a load instruction (ld) and several other instructions which do not require memory accesses. The assembly language code below the dashed line is the result of expanding the original five lines of code. The original lines are marked in the expanded version with the symbols !!!!!!. The value for the delay since the last memory access is kept in a global array indexed by the trace attribution processor number. Since instructions which do not access memory should execute in a predictable amount of time (ignoring interrupts and the effects of multitasking), it is reasonable to assign them a fixed "time delay" that represents the time spent by a processor in executing each instruction. For each processor, the corresponding element of the time delay array represents the sum of the delays for each of the instructions which have executed since the previous memory access. For each assembly language instruction, the expander program adds assembly language code to app.s for incrementing the delay value.

5.3.2 Compile and execute

After the trace gathering code is added to app.s, the expanded assembly code is compiled with the Tracer library routines to produce an executable program. This program is then executed to produce the raw trace file. The Tracer library routines themselves are not traced during execution. In addition, external system libraries (such as mathematical routines) are not traced. The only code that produces traces during execution is code written by the programmer and passed through the assembly language expander.

5.3.3 Filter raw traces

The final step of the Tracer takes the raw trace file as input and produces the final multiprocessor events file as output.
This events file is so named because it consists of read, write, and synchronization events for each processor in the multiprocessor to be simulated. The filter program maps virtual addresses to physical addresses, replicates process private variables, and maps virtual process numbers to physical process numbers. The virtual process number is the process to which a trace has been attributed at run time. The physical process number is the identity of the process in the multiprocessor simulation in which the traces will be used. Virtual processes are discussed in more detail below.

Figure 5.4: Expanded assembly code (in two columns) shown below the original assembly code fragment.

Addresses that are traced at run time come from the virtual address space of the uniprocessor used to generate the traces. On SPARC machines these can occur anywhere in the range 0 to 2^32 - 1. The filter maps these virtual addresses to physical addresses by removing the holes in the virtual address space and beginning the physical address space at 0. The type of a variable (process private, regular shared, or global shared) and its virtual address are known to the filter by reading the top of the trace file. This information is placed in the trace file at run time as a result of the declarations in app.par. The virtual addresses which correspond to variables declared as process private are replicated by the filter so that each physical processor has a private copy of this virtual address.

The arrangement of physical memory and the mapping of virtual addresses is shown in Figure 5.1. The first section of the address space represents N processor-private memory spaces of PR words each. A regular shared memory of SR words and a global shared memory of SG words complete the physical address space. The distinction between regular shared and global shared memory is made to facilitate the simulation of non-uniform access shared memory multiprocessors. The filter program is written to allow the mapping of virtual to physical memory to be easily changed. The specific mapping that is used depends on the architecture of the multiprocessor that will be simulated with the traces. The arrows in Figure 5.1 show how each type of address in the raw trace file will be mapped by the filter. The first write by process 3 shows how a shared location maps into either the global shared or the regular shared portion of memory.
The next read by process N represents an access to a process private variable. Finally, the read by virtual process ALL is mapped into N read events, one for each regular process. Since the variable is denoted private in app.par, the reads for each process are to variables in their own private address space. The accesses shown assume that the virtual address is mapped into the 70th physical address of a private address space of 100 words per processor.

5.4 Extension of Tracer

The main limitations of the Tracer are the lack of instruction traces, the lack of library and operating system traces, and the restricted class of programs which can be traced. Extending the Tracer beyond its current limits is the subject of this section.

For simulation of multiprocessor computer architectures, address traces are easier to generate than they are to put to good use. The computational requirements of detailed, trace-driven simulation limit the number of memory references that can be simulated. Since data accesses typically have poorer cache behavior than do instruction accesses, the time dedicated to simulation should be spent on data address traces at the expense of instruction traces. Other applications of address traces require that instructions be traced. For example, studies of the relative frequency of instructions can be used to design the instruction set architecture of new microprocessors. Knowledge of instruction memory reference patterns may also be needed when optimizing the design of separate instruction and data caches.

Much of what is necessary to capture instruction traces already exists in the Tracer. Since assembly code is added to a program for every instruction executed (not just the data-accessing loads and stores), instruction traces could be gained by simply recording the value of the program counter at each instruction. The increase in trace storage requirements and the slow-down of trace generation would be considerable, however, since typically 75% of total memory accesses are instruction fetches. For applications which require statistics characterizing instruction or address traces, but which do not need the actual traces themselves, a process which would reduce the raw traces to statistics could be run in conjunction with tracing, thereby avoiding the time and space required to record the traces on disk.

The lack of library and operating system traces is not a fundamental limitation of the technique of the Tracer, but more of a practical problem. The Tracer can be used on any program for which the source code is available. Unfortunately, access to operating system and library source code is usually quite limited. If a particular library routine is judged important to the use of the generated traces, the library can be replaced by code written by the user, thereby allowing traces to be gathered for it. Operating system traces are much more problematic. While application traces from a uniprocessor emulating a multiprocessor will be very similar to those of an actual multiprocessor composed of the same CPUs, the operating systems of a uniprocessor and a multiprocessor are likely to be quite different. The utility of uniprocessor operating system traces for multiprocessor system evaluation is questionable.

Generalizing the synchronization and task allocation paradigms supported by the Tracer would increase its utility considerably.
Static task allocation and barrier synchronization allow the memory reference streams of individual processes to be completely independent of each other except at the synchronization points. Barrier synchronization points are easily embedded in the traces. The difficulty with more general synchronization is that the addresses traced for each process will depend on the relative timing of processes as they run. This timing is in turn dependent on the environment of the computer used to generate the traces. If the intended use of the traces is to study changes to the multiprocessor environment (such as the effect of memory subsystem performance), then a circular problem has been uncovered, as shown in Figure 5.5. The information in the bottom oval is only available during simulation of the multiprocessor, but the simulation of the multiprocessor clearly depends on the address traces.

Figure 5.5: Problem of extending the Tracer to a general synchronization and task allocation paradigm. (The relative timing of processes determines the memory accesses of each process; the memory accesses in the traces determine the timing between processes.)

One possible strategy to overcome the difficulty pictured in Figure 5.5 is to execute the Tracer and the multiprocessor simulator concurrently, with communication between the simulator and the Tracer. The feedback provided to the Tracer by the simulator could be used to determine task allocation and execution paths dependent on process synchronization. As implemented, the Tracer requires two complete passes through the raw trace file to accomplish address space mapping and virtual processor expansion. To execute concurrently with a simulator, and to receive feedback from it, the Tracer would have to be extensively modified.

Chapter 6

Petri Net Model

Detailed Petri net models provide some assurance of the feasibility of the HBSM multiprocessor and the COGI cache coherence protocol. Trace-driven simulations of a Petri net model served as a means of designing and debugging the COGI protocol. The Petri net models were intentionally crafted so that errors in the protocol or controllers would be readily apparent during simulation. This chapter describes the models and the High Level Timed Petri Net simulator used for trace-driven simulation of the HBSM multiprocessor.

6.1 Petri Net Modeling

A model is an abstract representation of a system which captures the important features and behavior of that system. In the design process, modeling represents a compromise between prototype experimentation and design by intuition. Constructing a working prototype of a complex system is both expensive and time-consuming. Performance results from working prototypes can be highly accurate, but the inherent inflexibility of such physical systems limits the breadth of study of possible design choices. Intuition, on the other hand, while an invaluable part of the design process, is insufficient for providing accurate estimates of the performance of complex systems. A good Petri net model provides a designer insight into the behavior of the system being modeled, supplying feedback in the design process and highlighting errors in the design. The simple graphical structure of Petri nets allows the generation of models which are related in a clear and logical manner to their physical system counterpart.
In computer modeling, relevant low-level hardware details can be easily modeled with Petri nets, while at the same time less interesting functions can be ignored with more abstract net models. Petri nets are particularly well suited for modeling systems exhibiting concurrency, synchronization and competition for resources. With the increasing significance of parallel processing, the use of Petri nets for modeling and design of computers is most natural.

Fundamental design issues common in parallel and distributed processing dictate a modeling tool which is ideally both powerful and expressive. To be powerful, a modeling tool should be based on a specification which can be related to a solid mathematical framework. The inherent power in such models is the ability to apply mathematical analysis to the model to yield estimates of system performance. Unfortunately, as Aggarwal and Gopinath [1, page 277] note,

"...as the expressiveness of the notation grows, the meaning of a specification is increasingly hard to infer in a formally precise way."

so modeling tools which are powerful in this way tend not to be expressive enough to use with convenience on complex systems.

Simulation is an important method of extracting system performance estimates from models which are highly expressive, but either do not have a mathematical basis or are analytically intractable. In addition, simulation is often used to understand the behavior of a system when formally determining the correctness of the system being modeled is impossible or impractical. The chief drawbacks to simulation are the complexity of developing a model and the computational requirements of running the simulation. The Petri net simulator developed for detailed modeling of the HBSM multiprocessor addresses the first of these issues by providing a graphical and intuitive modeling paradigm. As Billington [9, page 307] notes in the context of communication protocols,

"Interactive simulation is extremely useful in the early stages of protocol specification as it allows the designer to test each part of the specification as it is completed, and can provide considerable insight into the system's behavior. It may also be used for debugging and learning about protocols."

Simulation of the Petri net model described in this chapter was used for designing and debugging the hardware controllers of the HBSM multiprocessor and the COGI cache coherence protocol.

The tradeoff between expressiveness (or modeling convenience) and the power of the analysis techniques is aptly exhibited in the many extensions made to Petri nets. Pure Petri nets [42] can be utilized to model simple systems and study their behavior by means of techniques such as reachability analysis, the method of invariants, and transformations. For more complex systems such as communication protocols [9] or cache coherence protocols, the complexity of the system being modeled mandates the use of a more expressive modeling tool. Various proposed extensions to Petri nets facilitate the modeling of large, complex systems by increasing the expressive power of Petri nets.

6.1.1 Extensions

Pure Petri nets provide no way of distinguishing between individual tokens. In order to increase the expressive power of Petri nets, researchers [26, 32, 44] realized that a method of distinguishing between individual tokens was necessary. The two efforts in this direction, Predicate/Transition nets and Coloured Petri nets, have been unified in what are known as High Level Petri nets [33].
High Level Petri nets (HLPN) attach units of information to tokens in what are called attributes. Associating attributes with tokens allows the modeling of arbitrary data as it flows through a system. Extending the transition firing rules to include conditions defined on token attribute values offers a more compact representation of powerful control structures than is feasible with pure Petri nets.

A valuable extension to Petri nets for performance modeling is the inclusion of the concept of time. Pure Petri nets, possessing no concept of time, are incapable of providing time-related performance estimates (e.g. throughput, utilization, availability). Work on the inclusion of the concept of time into Petri nets has proceeded in two directions: Stochastic Petri Nets and Timed Petri Nets.

In Stochastic Petri nets (SPN) [41, 57] an exponentially distributed firing delay is associated with each transition in the net. The reachability tree of the resulting net is isomorphically equivalent to a continuous time Markov chain. Well-established Markov analysis techniques can be applied to yield system performance estimates. Further extensions of SPN, called Generalized SPN [38], allow the existence of both exponential and immediate transitions in Petri net models. A fundamental limitation of SPN techniques is the size of the state space of the resulting Markov chain. Because the state space grows exponentially with the size of the Petri net, SPN are limited to modeling either small systems or large systems at a higher level of abstraction. The restriction that every transition must have an exponential or immediate firing delay is also a limiting factor; many systems could be more accurately modeled if constant, non-zero delays were permissible.

In Timed Petri nets [31], a constant firing delay is associated with each transition. Geometric distributions for firing delays can be achieved by the addition of very simple subnets. Non-determinism in the state transition behavior is achieved by associating firing probabilities with each transition. With these firing probabilities, an embedded Markov chain can be defined from the Petri net and Markov analysis used to obtain performance estimates. The choice of meaningful firing probabilities and the Markov state space explosion problem are the chief difficulties in Timed Petri nets.

6.1.2 Software packages

In [13, 14] Chiola describes a software package for the analysis of Stochastic Petri nets. This package, GreatSPN, includes a graphical interface for model definition and editing. Analysis is done by constructing the reachability graph and solving for the steady state solution of the corresponding Markov chain. In addition, Monte Carlo simulation can be used to acquire estimates of the average number of tokens in each place. Cumani [16] describes a software package, ESP, for the analysis of Stochastic Petri nets with the extension of phase-type firing distributions. The analysis strategy is similar to the one used by Chiola. To avoid the problems that are caused by introducing general firing distributions in Petri nets, the phase-type distributions allowed in ESP are introduced at the reachability tree level. A third package for the solution of SPN is described in [20]. This software package (DEEP) provides both analytical solution techniques through Markov analysis and simulation tools. In the simulation tool, nets are executed for long periods of time to gain estimates of such measures as the average number of tokens in a place.
Dugan, et al. mention that the speed of the simulator is important, particularly for capturing rare events in a PN model. Meta Software Corporation sells a Petri net simulation tool with features similar to the Petri net simulator described in this chapter [40]. Their Design/CPN tool includes a graphical interface, a hierarchical Petri net editor, simulation of colored Petri nets, and a $10,000 price tag.

All of the software packages described above perform Markov analysis on the reachability tree of a Petri net. Two of them can use simulation for large or otherwise intractable nets. However, all are limited to the analysis of Stochastic Petri nets and cannot handle the extensions of High Level Petri Nets, and thus are limited in their expressive power. The High Level Timed Petri Net (HLTPN) simulator used to model the HBSM multiprocessor represents a coherent and unified PN modeling tool incorporating the features of the HLPN, SPN, and TPN extensions. Petri nets of this sort are not amenable to analytical Markov solution. Instead, simulation by execution is used as an alternative for estimating time-oriented performance measures and as a design/debugging tool.

6.2 High Level Timed Petri Nets

A Petri net is a directed bipartite graph with two types of nodes: places and transitions. Connections between nodes are made by input arcs and output arcs. A High Level Timed Petri Net (HLTPN) is a Petri net with two major extensions: the concept of time is incorporated into transitions, and attributes are associated with tokens.

6.2.1 Places

Places are those nodes in a Petri net graph which hold tokens. A place can be one of two types: normal or directory. Directory places have a special structure to facilitate the efficient handling of large numbers of tokens. Normal places hold tokens in an unstructured fashion. Each place has a (possibly empty) list of attributes associated with it. Tokens residing in a place carry values for each of the place's attributes. Attributes are either single integers or arrays of floating point or integer numbers. The marking of a Petri net at a specific time refers to the distribution of tokens among places. In a HLTPN, the state of each place can be represented with a list of the attribute values for each token present on the place. Since attributes are associated with places, each token on a place will have the same attributes, but may have differing values for them.

6.2.2 Directory places

Modeling large caches and memories in the hierarchical bus architecture requires storing a large number of individual elements to represent the state of memory blocks. Special directory places which efficiently store large numbers of tokens were developed
The number of sets and the size of each set. determines the size of the directory and the amount of sequential searching required to locate a token. The set associative directory can model a fully associative mapping by making the set size equal to the directory size. In this case the whole directory must be searched sequentially to locate a token. The other extreme, direct mapping, requires the set size to be equal to one; any token can be accessed directly with no sequential search. 6.2.3 Transitions Transitions are those nodes in a Petri net graph corresponding to the occurrence of events in the model. The firing of transitions changes the marking of the net by moving tokens from input places to output places. A transition is enabled when suflicient tokens of the correct type can be assigned to each of its normal input arcs, and no tokens can be assigned to any of its inhibitor arcs. In this case “sufficient” means enough tokens to satisfy the multiplicity and conditions defined on each input arc. Three transition types are supported in the simulator: immediate, stochastic, and delayed. Immediate transitions fire as soon as they are enabled, no simulation time passes between the enabling and firing events. Delayed transitions experience a constant firing delay. Stochastic transitions experience a random firing delay selected from exponential, uniform, or Gaussian distrilmtions. An enabling token layer is a set of tokens on the input places of a transition which satisfy the conditions, multiplicity, and inhibition arcs of that transition. An enabling token layer allows a transition to fire. Multiple enabling token layers on stochastic or delayed transitions have unique non-zero firing delay, and hence compete with layers 79 of all other timed transitions in the net for firing precedence. 6.2.4 Connectivity An input arc is a directed are connecting a place to a transition. Input arcs of a transition are of either normal or inhibitor type. Inhibitor arcs reverse the normal sense of enabling such that the presence of proper tokens on the input are prevents the transition from being enabled. Inhibitor arcs which act to prevent the firing of a transition are said to be activated. A transition with more than one inhibitor arc is disabled if any of its input inhibitor arcs are activated. Enabling conditions for a transition are defined on input arcs. A condition is a predicate defined by the modeler on token attributes of input places. The predicates can be a simple comparison for equality between attribute values or a complex func- tion defined on attribute values of input tokens. Relative conditions (between two input arcs) specify that an attribute value of a token for the first input are together with an attribute value of a token for the second input are must satisfy the predicate. Absolute conditions (involving only one input are.) specify that a single attribute of tokens on an input arc and a constant value satisfy the predicate. An output arc is a directed are connecting a transition to a place. Output arcs direct tokens to output places, and may have arbitrary, integer multiplicities. Prob— abilistic output arcs may be defined that send only one token to a set of output arcs. When a transition fires, tokens from the input. places are moved to the output places. For simple token mappings, the attribute values of the output place tokens are filled by copying the corresponding attribute values from a. token in the enabling token layer. 
Complex token mappings provide a means of placing tokens on output places with attribute values which are an arbitrary function of the input token attribute values.

6.3 HLTPN Simulator

The goal of the design of the High Level Timed Petri Net (HLTPN) simulator [4] was the unification of the various features and capabilities of the High Level, Stochastic, and Timed Petri net extensions. The result is a highly expressive Petri net simulation and modeling tool. The simulator provides a terminal-based interface to a model editor and to a run-time environment. The simulator is written in the C programming language and consists of 6200 lines of code in 15 files. The editor is written in the C programming language and consists of 5200 lines of code in 15 files.

Because of the numerous features included in the simulator, the syntax of a definition file is not simple. To reduce the number of syntax errors in defining a PN model, an editor was developed which guides the user through the definition process and checks the logical format of the definition. The objects of a Petri net model which can be created and edited are places, transitions, clean-up subnets, and directory places. Each object has a number of parameters and related information associated with it, such as name, type, conditions, arcs, etc., which can be modified. In cases in which a default choice is available, the editor allows the user to accept this default by hitting the carriage return. The default action at any point is indicated inside brackets just before the prompt. Whenever possible, values from the current definition of an object are provided as defaults in the editing session. This makes it easier to change small parts of a definition without re-entering all the information again. Place and transition definitions follow a line-by-line series of questions to which the user responds. These responses determine the course of the editor's future questions. In this way the definition effort is minimized while still allowing the user to define an arbitrary Petri net model.

The HLTPN simulator has several options which can be set at run time. These include means of controlling the display of the resulting output and the conditions under which the simulator stops. In addition, the simulator can be interrupted while running and control of the simulation regained (execution of the net stops). When the user returns to running the PN model, the simulator continues execution exactly where it was interrupted. While the simulator is stopped the marking may be examined and altered; tokens can be added, edited or deleted. The marking of the entire PN model, or of the places in a subnet, or of a single place, may be displayed.

Execution of the Petri net model consists of repeatedly firing enabled transitions and altering the net marking accordingly until execution is interrupted or stopped or until deadlock is reached. This flow of control is shown in Figure 6.1. During the first step the simulation time is not advanced. The simulator repeats the first step until no enabled immediate transitions remain unfired. The second step involves checking all timed transitions to determine those which are enabled, and firing the one with the smallest delay value. Simulation time is advanced in this step by the delay value of the fired transition. To increase speed, the actual implementation of the simulator maintains a list of transitions which require examination in the two main loops. Initially, the list consists of all transitions.
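The two-step firing procedure just described can be summarized with a small sketch. The C fragment below illustrates the control flow only; the function and variable names are assumptions, it is not the simulator's actual implementation, and the transition worklist optimization discussed here is omitted.

    #include <stddef.h>

    struct token_layer { double delay; /* transition and token bindings omitted */ };

    /* Helpers assumed to exist elsewhere in the simulator (hypothetical names). */
    extern int  interrupted;
    extern int  stop_condition(void);
    extern int  fire_one_enabled_immediate_transition(void);      /* 0 when none fired */
    extern struct token_layer *find_min_delay_enabled_layer(void); /* NULL if none     */
    extern void fire_token_layer(struct token_layer *layer);

    double sim_time = 0.0;

    void run_net(void)
    {
        while (!interrupted && !stop_condition()) {
            /* Step 1: fire immediate transitions; simulation time does not advance. */
            while (fire_one_enabled_immediate_transition())
                ;
            /* Step 2: fire the enabled timed token layer with the smallest delay,
               advancing simulation time by that delay. */
            struct token_layer *layer = find_min_delay_enabled_layer();
            if (layer == NULL)
                break;          /* no enabled transitions remain: the net is deadlocked */
            sim_time += layer->delay;
            fire_token_layer(layer);
        }
    }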
Each time a transition is examined and all enabling layers are found, it is removed from the list of transitions to check. Transitions are appended to this list as changes are made to the marking of their input places.

Figure 6.1: Main loop of the HLTPN simulator (repeatedly: find and fire all enabled immediate transitions; find all enabled timed transition token layers; fire the minimum-delay timed token layer; continue until deadlock).

6.4 HBSM Multiprocessor Model

The High Level Timed Petri Net (HLTPN) model of the hierarchical bus multiprocessor is organized into 19 largely independent subnets totalling 460 places and 540 transitions. This organization eases the process of defining and maintaining the 1000 nodes of the full model with the HLTPN editor. Each subnet represents a major function of a hardware controller. For example, there are four subnets comprising the processor cache (CC): one for the processor side of the cache, one for the bus snoop portion of the cache, and one each for the cache directory and the bus interface. The full model represents the complete state of each controller and of each block in cache and memory. Trace-driven simulation of the HLTPN model is slow, due mostly to the level of detail of the model. The simulator can perform between 15 and 60 memory accesses per second, depending on the cache hit ratio.

6.4.1 Processors

The size of the complete 1000-node model is too large to include in a publication. Instead, smaller subnets representative of the complete model are discussed. Figure 6.2 illustrates the high level actions of individual processors in the hierarchical bus multiprocessor model. In the illustrations, places which are gray are places which are depicted more than once in a figure. Each processor in the model is characterized by a cluster number and a processor number (c,p). The transition generate uses a complex output arc and a complex delay to generate processor memory operation events (RR-read, WN-write, SY-sync) and to model the delay between processor activity. When generate is enabled by a processor, the simulator uses the identity of the processor (c,p) to look up the next trace event for that processor. Each trace event includes the time that the processor spent between issuance of memory operations. This time is used as the delay value for the timed transition generate. The information on a token in place requests includes the identity of the processor (c,p), the address of the memory request (a), the type of request (r) and a value associated with this request (w). For read requests, the trace file contains the value of the memory location that was returned by the memory system at the time the traces were generated. This value can be used to test the correctness of the PN model. Three transitions compete for tokens in place requests. These transitions use absolute conditions to route tokens according to the type of request. Read and write requests (RR and WN) place tokens in places mem access in progress and check return value. These transitions also send a token to the cache controller (not shown) which initiates the request in the remainder of the model (gray box). When the memory access request is complete, a token is deposited in two places in the dashed box. For read requests, transition check return fires on one of these tokens and compares the value returned from the model with the value expected.
Transition request done fires for both RR and WN requests and places the processor token back in the processors place, thus completing a memory access event. During simulation, the simulator is instructed to stop if the transition return bad fires, thus preserving the state of the model near the time that an error occurred.

Figure 6.2: HLTPN model representing the actions of processors in the hierarchical bus multiprocessor.

6.4.2 Synchronization

The subnet which handles synchronization requests is shown in Figure 6.3. This subnet performs barrier synchronization for an arbitrary grouping of processors. The attribute (y) on place synch represents a barrier synchronization number, used to identify the synchronization points. The attribute (b) is a bit field which defines the set of processors involved in this synchronization. The current status of each barrier is kept in place barrier. For the first processor which reaches the barrier, the place barrier will not have a token in it with an attribute (y) which matches the token in synch, so the transition middle cannot fire, and transition first fires. When subsequent processors reach the barrier, the inhibitor arc on transition first from place barrier prevents that transition from firing, and transition middle fires instead. When either of these transitions fires, the current bit field in (b) has the bit corresponding to processor (c,p) cleared by a functional output arc. A token representing each processor is then deposited in place idle to represent a processor waiting on a barrier. When the last processor reaches the barrier, all the bits of attribute (b) have been cleared and transition last fires, depositing a token into place unlock. Transition unlock uses a relative condition to match the unlocking token with each idle processor and to place them back in action again by depositing a (c,p) token in place processors. When all processors have been removed from place idle, the transition clear unlock can remove the unlocking token, since there will be no processors on place idle to inhibit the clear unlock transition. Place consume unlock token is only used so that the clear unlock transition has a place to deposit tokens. The black dot in this place represents a token without any attributes.

In the synchronization subnet the attribute (y) serves to allow multiple synchronization events to occur simultaneously in the same subnet. Being able to "fold" multiple identical subnets into a single subnet is an important attribute of High Level Petri nets for modeling complex systems. The complete HLTPN model of the memory/cache subsystem of the hierarchical bus multiprocessor folds the subnets representing each processor and cache around the attributes (c) and (p). The result is one subnet which represents a cache controller instead of M x N subnets, where M is the number of clusters and N is the number of processors on each cluster.

6.4.3 Bus arbitration

Bus arbitration is distributed amongst all potential bus masters.
On a cluster bus, CCs have equal arbitration priority, and fairness is achieved among competing CCs by preventing a CC which has used the bus from participating in arbitration again until all outstanding requests have been satisfied. For this reason, bus arbitration priority can be based on a simple CC identity. The priority of CCCs and CMCs is higher than that of CCs to minimize the amount of time the global bus is held by these devices.

In the Petri net model bus arbitration is represented by a single subnet, as shown in Figure 6.4. The current bus master selects the next bus master from those having made an arbitration request by placing a token in select bus master elect. If no bus arbitration requests have been made, then no tokens will be found in bus arb requests and the transition at least one arb request will not be enabled. At the same time, transition no arb requests is enabled, so a token is put in the place idle bus to indicate that the bus is idle and available. If multiple bus arbitration requests have been made, then transition at least one arb request will fire, randomly picking one of the bus arbitration requests and using the identity of that requester to initialize the attribute m in place current max. For the remaining bus arbitration requests, the transition select max fires and compares the attribute m, representing the highest priority bus request, to the attribute p, representing the priority of the bus request being checked. If p is greater than m, m gets the value of p from the output arc on select max. Eventually, all of the bus arbitration requests in bus arb requests will have been checked and the maximum priority (identity) will have been found. At that time transition election done will no longer be disabled by its inhibitor arc, so it will fire and place the identity of the selected maximum priority request into the bus master elect place.

Figure 6.3: HLTPN subnet of barrier synchronization in the hierarchical bus multiprocessor model.

Figure 6.4: HLTPN subnet of bus arbitration for the HBSM multiprocessor.

6.4.4 Directories

The model of the hierarchical bus multiprocessor required two types of directory place. The directories for caches are accessed in the set associative manner described above. The upper portion of Figure 6.5 shows a set associative directory that can be used to model a CC cache in a 2 cluster, 2 processor/cluster model. Each set in each cache holds 4 blocks (4-way set associative). Each cache block represents 4 addresses and hence 4 data words. Tokens representing blocks are stored in the directory based on the address of the block in memory. A full model of the HBSM multiprocessor will have N processors per cluster and M clusters. The data structure for this directory is organized as a three-dimensional matrix which is addressed by cluster number, processor number, and block offset (frame number). Each rectangular column in the top part of Figure 6.5 represents the cache of one processor. Each column is addressed by attributes c and p, and blocks within a column are addressed by their offsets, or frames, f. Cache directories store information in the following attributes: cluster id (c), processor id (p), address (a), frame (f), status (s), and data (d). Attributes c, p, and f are used to organize the cache into a three-dimensional data structure in the simulator. The frame attribute represents the offset of a block within a directory, and is analogous to the concept of a cache frame. Matching of tokens for conditions can be done either by block address or by frame number. Data is stored as an array of floating point numbers, one element for each memory location.

Figure 6.5: Set associative and direct mapped directories for a 2 cluster, 2 processor/cluster hierarchical cache system model.

Cluster Cache Controller (CCC) directories are also modeled with set associative
lvlatching of tokens for conditions can be done either by block address or frame number. Data is stored as an array of floating point numbers, one element for each memory location. Cluster Cache Controller (CCC) directories are also modeled with set associative 88 select bus master elect bus arb requests c,p C .. at least one no arb arb request requests SG'ect C, m current max . max Idle “ bus m c “r ' election done l. bus master elect Figure 6.4: HLTPN subnet of bus arbitration for HBSM multiprocessor. 89 Set Associative directory set 0 set 1 , block 0 1 block 1 set 2 . . o o 3 : 3 . set S ''''' Direct Mapped directory ‘ i blOCk 0 ' "i I i . c, p, f, 3,3 d(0.5,1.0,0.64,2.3) block 1 * block 2 o o g o o 0 block M Figure 6.5: Set associative and direct mapped directories for a 2 cluster, 2 proces- sor/ cluster hierarchical cache system model. 90 directory places. Because there is only one CCC per cluster, there is no processor identity attribute for this directory and the indexing into the data structure is done by ordering attributes c, and a or f. Also, no data is held in the CCC directory. The second directory structure is used in the Cluster Memory Controller (CMC) and the Memory (MC) to track the status of every block in memory, and to hold data for those blocks, respectively. Tokens for this type of directory represent each block of memory in the system. Each block is always represented by a token, and each directory place always has the same set of tokens, so the directory structure is organized in the direct mapped fashion (set size equal to one). The lower portion of Figure 6.5 shows this direct mapped type of directory. Access to these directories is done by block address. The Cluster Memory Controller tokens hold status information for each block while the Memory tokens hold actual data. 6.5 Design and Debugging The initial design of the COGI protocol Was checked on paper by trying to anticipate all possible scenarios and confirming that the protocol definition tables specified the correct behavior in every case. Several iterations of this method, and the consequent modifications, resulted in a protocol definition which was seemingly correct. The next step in the process was the development of the Petri net model of the HBSM memory subsystem. Modeling the active controllers revealed problems that would normally only be faced by someone planning to build a prototype machine. Some of these problems (and their solutions) are described in this section. An asynchronous bus avoids problems of clock skew among bus masters, and is capable of operating with controllers with different response time characteristics. The handshaking that is used on asynchronous buses assures that faster devices do not outpace slower devices. This provides some measure of device and / or technology inde- pendence for the bus. In a cache coherence system which relies on snoopy controllers, each controller must be able to respond quickly enough to take actions for each bus transaction to maintain coherence. Handshaking in the form of acknowledgments for each transaction ensures that each cache controller has observed and responded to every bus transaction. If a controller is unable to respond immediately, handshaking forces the transaction on the bus to be delayed until all controllers have responded. The handshaking used in the Petri net model is similar to that of Futurebus [49]. 
Two signal lines implement the necessary acknowledgments between bus masters for all transactions. DI is a many-to-one, general acknowledgment used by bus masters to signify that a transaction has been observed and responded to. DK is a one-to-one signal used when one controller is sending information to another controller (e.g., in a cache-to-cache block transfer). Figure 6.6 shows how such bus signals can be modeled with Petri nets. The shaded portion of the figure represents the current bus master. The other transitions each represent an active snoop device on the bus. When a snoopy controller sees a bus transaction, it takes actions according to its responsibilities; when it is done, it removes a token identifying itself from the place DI. When all tokens matching cluster attribute c have been removed, transition all controllers acked, data ready is no longer disabled by the inhibitor arc from DI. At this point, all snoopy controllers on a cluster bus have responded to a transaction. When the controller which is providing data in this example has the data ready, it responds by placing a token in DK, and transition all controllers acked, data ready fires, finishing the transaction.

Figure 6.6: Bus handshaking signals DI and DK in Petri net model.

The idea of making a Petri net model which deadlocks upon error was used to help ensure correctness of the model. While running the simulator, if the model deadlocked, the state of the model was preserved close in simulation time to the point at which an error had occurred. Keeping separate tokens in DI helps in debugging the protocol and model, since activity on a bus stops if one of the controllers does not acknowledge a transaction, and the identity of this controller will be left in place DI. Examining the state of the model for the controller which did not respond to the bus transaction can determine the cause of the deadlock, and hence of the error.

The action of the CMC during a write to a local block is an example of a subtle bug in the COGI protocol found via simulation of the Petri net model. (A local block is one whose home cluster is the same as the cluster on which the write notice occurs.) For a write to a local block held Valid by the CMC, the CMC changes the state of the block to Invalid Locally, as shown in Table 4.5, indicating that a local CC has a dirty copy of the block. Since dirty sharing is not allowed between clusters, neither a write back nor an invalidate transaction should appear on the global bus for a block held Invalid Locally by the CMC. If the block is held Valid by the CCC, then the CCC must broadcast a global bus invalidate when it observes the cluster bus write notice transaction. However, if the cluster snoop of the CMC changes the state of the block to Invalid Locally before the CCC broadcasts the global bus invalidate (quite probable at higher global bus utilizations), then the CMC global snoop will observe a GBIN transaction to an Invalid Locally block, an event which seems to violate the COGI protocol. The solution, as noted in Table 4.6, is to accept such transactions but not to take any actions upon their occurrence. Another possible solution is to have the global snoop devices (the CMC and CCC) ignore global bus transactions which originate from each other, i.e., the CCC and CMC of the same cluster do not snoop on each other's global bus transactions.
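A minimal sketch of the fix just described, using hypothetical state and function names; the authoritative behavior for every state is whatever Table 4.6 specifies, and only the Invalid Locally case is spelled out here.

    // Hedged illustration, not the controller implementation.
    enum class CmcState { Valid, InvalidLocally /*, ... other COGI states ... */ };

    void cmcGlobalSnoopSeesGBIN(CmcState& blockState) {
        if (blockState == CmcState::InvalidLocally) {
            // Accept the transaction but take no action: a local CC already holds the
            // block dirty, and the GBIN simply lost the race with the CMC cluster snoop.
            return;
        }
        // ... the remaining states are handled exactly as Table 4.6 specifies ...
    }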
6.5.1 Correctness

Executing the Petri net model of the HBSM multiprocessor with data address traces revealed many problems in the design of the system controllers and a few bugs in the COGI protocol. Careful design and the inherent complexity of the Petri net model combined to make the Petri net trace-driven simulations quite fragile. Errors in the design of the model and bugs in the COGI protocol resulted in a quickly deadlocked Petri net. This fragility increased confidence in the correctness of the protocol, since successful and correct execution of a large trace file seemed very unlikely in the presence of either model or protocol bugs.

As discussed in Chapter 2, formally proving a cache coherence protocol correct is only possible in certain very restricted cases. Proving the correctness of a complex hardware/software system such as the COGI controllers and protocol is an open research problem. In the meantime, detailed modeling and simulation is a practical method for increasing a designer's confidence in the correctness of such systems. The specific techniques and strategies for discovering errors in the simulators and the models used in this thesis are discussed more fully in Chapter 7.

Chapter 7

Discrete Event Simulation

The second of the two part assessment strategy for the HBSM multiprocessor involves discrete event simulation for performance evaluation. While it is theoretically possible to do performance evaluation with the Petri net model, the level of detail of the Petri net model requires unacceptably large amounts of computer time. Two event scheduled simulation models were developed to estimate the performance of the HBSM multiprocessor: a probabilistic model and a trace-driven model. The faster probabilistic model is used to study the scalability of the HBSM multiprocessor. The more detailed trace-driven model is used to validate the probabilistic model and to measure the impact of coherence traffic and cache geometries on system performance.

7.1 Introduction

Scientists develop models to explain and support theories of systems. Their goal is the understanding of systems as they exist. Engineers model to evaluate changes to a system, usually with the idea of improving some aspect of a system's performance. Discrete event simulation (DES) uses a model of a dynamic system which is subject to a series of instantaneous changes of state, or events. Rapid advances in computing technology and the ability to model complex behavior have made discrete event simulation the strategy of choice for performance evaluation of computer systems.

The two prominent DES strategies are event scheduling and process interaction [36, 23]. Both strategies involve entities which flow through the system (jobs, customers, bus requests, etc.), activities which require significant time for entities, events at which the system changes state, and resources for which the entities compete. In event scheduling, the modeler first identifies the events in the system at which the model changes state. These instantaneous events bracket system activities during which simulation time passes. For example, in an event scheduled simulation of a bank, a simple model may have only three events: entering a queue for a teller, beginning service with the teller, and leaving the bank after service. If the time taken by a customer to select a teller were significant, then a new event could be added to the simulation to represent the customer entering the bank.
These four events would then bracket the three activities during which simulation time would pass for a customer in the bank: selecting a teller, waiting for service, and service with the teller. The entities (customers) would use the services of the resource (teller) in the manner described in the simulation program.

The simple structure of an event scheduled simulation program is a double-edged sword. For high level models, or for models of simple systems, an event scheduled simulation program consists of merely maintaining a future event list (in order of simulation time) and processing the events in list order. Processing events will, among other actions, add events to the future event list by scheduling them for some time in the future. The main loop of an event scheduled simulation, in pseudo-code based on the C programming language, would look like Figure 7.1. If the model were to be made more complex, new cases would simply be added to the single switch() statement corresponding to the new events. This flat structure becomes unwieldy for complicated models with many events. An additional problem with event scheduled simulations of complex systems is the temptation to lump together events which occur at the same time. For example, in Figure 7.1, no simulation time passes between the events Select_Teller and Start_Service once a customer gains a free teller (this can be seen by the zero argument in the schedule(Start_Service, 0) function call). Without changing the behavior of the model, the Start_Service event could be eliminated, and the Service_Done event could be scheduled directly from the Select_Teller event. While this reduces the number of events, it makes the model more difficult to enhance and the structure more difficult to understand. Service time would no longer be clearly bracketed by start and end events, and extending the model may multiply the "entry points" of a job starting service. Without care, large event scheduled simulations can resemble unstructured computer programs in which the only control structure is the dreaded goto.

One of the reasons that large event scheduled simulations can become unwieldy is that the modeler is essentially ignoring what is known about the sequential relationship between the events that a customer experiences. While the modeler probably thinks in terms of the life-cycle of a customer as it flows through the system in order to decide which events should be modeled, the flat nature of the event scheduled strategy does not allow this information to be used to structure the program.

    #define Enter_Bank     0
    #define Select_Teller  1
    #define Start_Service  2
    #define Service_Done   3

    Entity   customer;
    Resource teller;
    int      event;

    while( (event = cause()) != NULL ){
        switch( event ){
        case Enter_Bank:
            /* decide which teller to use */
            customer.schedule(Select_Teller, select_time);
            break;
        case Select_Teller:
            if( teller.reserve(customer) )
                customer.schedule(Start_Service, 0);
            /* else: the customer implicitly joins the teller's queue */
            break;
        case Start_Service:
            /* use the teller */
            customer.schedule(Service_Done, service_time);
            break;
        case Service_Done:
            teller.release(customer);
            customer.schedule(Enter_Bank, arrival_time);
            break;
        }
    }

Figure 7.1: Simulation loop for event scheduled DES using C-like pseudo code.

The second popular strategy for discrete event simulation is known as process interaction. The process interaction strategy uses the life-cycle of an entity flowing through the system to describe the model in the simulation program.
The events experienced by an entity are embodied in a process. To continue the bank teller example, a process would be used to represent each customer in the system, and these processes would interact with each other by competing for the system resources. If the bank model needed to be extended to evaluate the results of different policies used by tellers, then the tellers could be represented by a process as well. The process interaction strategy has a more complex control structure, since multiple independent processes must communicate and coordinate their actions. Operating system support for process creation and synchronization is generally required. The advantage of the process interaction strategy is seen for complicated models which must be extended, improved or maintained. In a properly designed process interaction simulation, changes to one aspect of the model are confined to the code describing the process which must be changed. The increased structure of the simulation code makes process interaction simulation easier to understand, and hence to extend and maintain. The Petri net model described in Chapter 6 is a process interaction simulation model.

In a book describing a set of tools for doing event scheduled simulation, MacDougall advises that the process interaction strategy is "strongly recommended for implementing large-scale simulation models" [36, page 4], whereas event scheduled simulation is best suited for small and medium-scale models. He cites as advantages for process interaction simulations the similitude of model and system and the hierarchical nature of the languages supporting the process interaction strategy.

7.2 Simulation Environment

In order to study large configurations of the HBSM multiprocessor, and to study those systems with large trace files, two simpler event scheduled models were developed. These models are less computationally intensive than the Petri net model by virtue of not representing the same level of detail for the controllers, buses, caches and memory modules. The event scheduled simulations are used to estimate the scalability of the HBSM multiprocessor. Both models are built on the SILO [37] libraries for event scheduled discrete event simulation using the object-oriented C++ programming language.

The SILO libraries are C++ implementations of MacDougall's C language SMPL simulation primitives [36]. SMPL/SILO provide for entities which schedule events and resources which are reserved, used, and released by entities. An attractive and unusual feature of this simulation environment is the implicit queueing done for busy resources. When an entity attempts to secure a resource for its use (by "reserving" it) and finds the resource is already being used by another entity, the blocked entity automatically joins a queue of entities waiting for this resource; no explicit queuing action is needed in the modeler's simulation program. When the resource is released, this blocked queue is checked, and if an entity is present, the event in which the blocked entity originally tried to reserve the resource is immediately re-scheduled. In other event scheduled languages the modeler must handle the joining of a blocked queue explicitly each time a resource is found to be unavailable. Because a reserve of a busy resource will cause an event to be automatically re-scheduled for some time in the future, implicit queuing requires that the blocks of code representing events be reentrant. This requirement is readily met by forcing all event code blocks to begin with a reserve function call made inside an if/else clause, and by creating separate events for reserving multiple resources.
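The self-contained toy below illustrates this reserve-first, reentrant pattern and the implicit queuing behind it. It is emphatically not the SILO library: the Sim and Resource classes, the cluster-bus scenario, and all names are assumptions written only for this discussion, with the implicit queue made explicit so the fragment compiles and runs on its own.

    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <map>

    struct Sim {                                           // minimal event scheduled kernel
        double now = 0.0;
        std::multimap<double, std::function<void()>> fel;  // future event list, ordered by time
        void schedule(double dt, std::function<void()> ev) { fel.insert({now + dt, std::move(ev)}); }
        void run() {
            while (!fel.empty()) {
                auto it = fel.begin();
                now = it->first;
                auto ev = it->second;
                fel.erase(it);
                ev();
            }
        }
    };

    struct Resource {                                      // a bus or memory with an implicit queue
        bool busy = false;
        std::deque<std::function<void()>> blocked;
        bool reserve(std::function<void()> retryEvent) {
            if (busy) { blocked.push_back(std::move(retryEvent)); return false; }  // caller is queued
            return busy = true;
        }
        void release(Sim& sim) {
            busy = false;
            if (!blocked.empty()) { sim.schedule(0.0, blocked.front()); blocked.pop_front(); }
        }
    };

    int main() {
        Sim sim;
        Resource clusterBus;
        std::function<void(int)> busRequest = [&](int cc) {
            // Reentrant event: the reserve call comes first, inside an if clause.
            if (!clusterBus.reserve([&, cc] { busRequest(cc); }))
                return;                                     // blocked: implicitly queued, re-run later
            std::printf("t=%.0f  CC%d holds the cluster bus\n", sim.now, cc);
            sim.schedule(6.0, [&] { clusterBus.release(sim); });  // release when the transaction ends
        };
        for (int cc = 0; cc < 3; ++cc)
            sim.schedule(0.0, [&, cc] { busRequest(cc); });
        sim.run();
    }

Running it prints one line per cache controller as each in turn acquires the bus, with the waiting controllers re-scheduled automatically when the bus is released.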
The method of hierarchical decomposition was followed to develop the SILO models. This technique starts with very simple models which can be easily written and debugged, and gradually adds detail until the model adequately represents the system being modeled. At each stage the model is tested and verified by simulating cases for which performance results are predictable. The development of the simulation model in stages of increasing detail harmonizes well with the actual development of the simulation program. By starting with a simple and abstract model, initial programming effort can be devoted to developing a sound and logical foundation for the simulation program. The trace-driven model was developed in this way by extending the purely probabilistic model.

A major advantage of the hierarchical decomposition method is accommodation of the almost inevitable errors and dead-ends pursued in large software projects. If in the initial course of model development a fundamental flaw is recognized in the programming approach, then abandoning the flawed approach is much less wasteful (and hence more likely) when using hierarchical decomposition, since the flaw has been recognized before large amounts of time have been spent on adding details to the model. The development of the final SILO models of the HBSM multiprocessor was preceded by several of these back-to-the-drawing-board reformulations of the fundamental modeling approach.

7.2.1 Multiprocessor models

The models of the HBSM multiprocessor used for discrete event simulation are consistent with the paper design of the controllers and protocol discussed earlier in Chapters 3 and 4. Developing a detailed process interaction Petri net model of the HBSM multiprocessor required solutions to problems that would normally not arise in a simulation study. Some of these problems required solutions which had negative implications for overall performance. For example, bus utilization could be reduced by allowing a processor and cache controller to be released as soon as a cluster bus write notice transaction had been made, even if that write notice transaction required that a global bus invalidation transaction be generated. However, releasing the processor and cache controller in such a situation greatly complicated the problem of maintaining coherence in the system as a whole, so the Petri net models of the controllers do not take this course of action. The result is lower overall performance (from higher bus utilization) but a more feasible system. The event scheduled simulation models implement the same policies as those used in the Petri net model, even though the level of detail in these models would not necessarily reveal errors that may occur as a result of glossing over these types of problems. The results from the performance evaluation models will therefore be consistent with the realistic design of the Petri net model.

The models were developed by considering all possible transactions starting from the point of view of a processor and following the COGI protocol tables. Work is done in the models by active entities representing the processors and cache controllers of the system. These entities reserve the resources of the model (global bus and memory, cluster buses and memories, caches) and perform the actions dictated by the COGI cache coherence protocol. The models are driven by memory access transactions generated probabilistically or taken from a trace file. A processor entity generates a memory access transaction, then requests that its cache controller entity satisfy the memory access request. The cache controller first checks whether a memory request is a hit or a miss in the cache. For a cache hit, the cache controller schedules an event at a time in the future which represents both the cache data retrieval time and the time that a processor would actually spend performing operations with the word of memory.
These entities reserve the resources of the model (global bus and memory, cluster buses and memories, caches) and perform the actions dictated by the COGI cache coherence protocol. The models are driven by memory access transactions generated probabilistically or taken from a trace file. A processor entity generates a memory access transaction then requests that its cache controller entity satisfy the memory access request. The cache controller first checks whether a memory request is a hit or miss in the cache. For a cache hit, the cache controller schedules an event at a time in the future which represents both the cache data retrieval time as well as 100 the time that a processor would actually spend performing operations with the word of memory. When this time has passed a new memory access event is scheduled for the processor entity. Cache misses require that a cache controller participate in bus arbitration to gain access to the cluster bus. When the cluster bus has been successfully reserved, the cache controller checks for the case of a dirty copy of the desired block being held in another cache on the cluster. If the block is not found in another cache, then either the cluster memory is reserved and the block is retrieved, or if the block is a global memory block then the cache controller attempts to reserve the global bus. Once the required resources are reserved, future events are scheduled to release the resources held (memories and buses) after the appropriate amount of time. Instructions and data have very different cache performance characteristics. Since instruction fetches are the majority of memory accesses of modern microprocessors, simulation time can be reduced considerably for trace-driven simulations by sim- plifying the model for instruction fetches. In both the probabilistic model and the trace-driven model, processors with separate instruction and data caches are assumed. Instruction caches are characterized by a hit ratio of 99%. Assuming that no writes occur to instruction caches, no write backs of dirty instruction cache blocks are needed on replacement. Misses in the instruction cache are. handled just like misses in the data cache: the cache controller requests the cluster bus and reads the missing block from cluster memory. Instruction fetches are always for local blocks, i.e. it is assumed that an intelligent loader places the code segments for processes in the memory of the cluster on which the process executes. The second level Cluster Cache Controllers CCC, which cache the status of all blocks found in the first level CC caches, are modeled only to the extent to which they participate in global bus activity. CCCs are assumed to have a large enough cache size and associativity that invalidations in the CC caches due to CCC replacement are negligible. Not modeling the CCC caches accurately does not have a direct impact on the performance of the system because the CCC caches are status-only, thus no benefit is gained from them during read misses. 101 7 .2.2 System parameters The speeds and capacities of caches, memories, buses, and microprocessors are key technology-dependent factors to the overall performance of the HBSM multiprocessor. All simulations in this chapter use the same values to characterize the performance of the hardware components. It is assumed that microprocessors issue memory ref- erences for words of 32 bits, or 4 bytes. When not otherwise specified, caches and memories operate on blocks of 4 words. 
Memories are assumed to be interleaved in such a way that all words of a block are available in parallel in the time it takes for memory to read/write one word. No buffering is assumed for memory, thus a read and write require the same amount of time. Processors are assumed to operate with a 50MHz (20ns) clock rate, and caches are able to supply data quickly enough to not stall the processor on a cache hit. The cluster and global buses are assumed to be 4 words wide (128 bits) and to operate at the same 2011s clock rate used by the processors. The time to transfer a 4-word block across the bus would thus be 20ns. These buses do not use a split-transaction protocol so buses are held throughout the duration of a transaction. These parameters are summarized in Table 7.1. Parameter Value word size 4 bytes cpu cycle time 2011s bus width 4 words bus cycle time 1 cpu cycle cache read hit 1 cpu cycle cache write hit 1 cpu cycles cache directory lookup 1 cpu cycle. memory read 4 cpu cycles memory write 4 cpu cycles Table 7.1: Technology dependent system parameters for simulation model. For a cache miss, the time required for a memory reference depends on the delay experienced contending for the cluster (and possibly global) bus, but for a lightly 102 loaded system is equal to the cache directory access time, plus the bus transaction time for a read request, plus the memory read time, plus the block transfer time across the bus. Cache misses that are writes are treated as read misses followed by write hits. For a cache hit write access the processor experiences the delay for the cache write (one cpu cycle) and the time taken to perform the necessary coherence transactions (cluster bus write notice and possibly global bus invalidate), if any. For the purely probabilistic model, memory accesses of all types are generated ran- domly following distributions common to modern RISC architecture microprocessors and shown in Table 7.2. The probability of a memory access being either an instruc- tion fetch, data read or data write is computed by dividing the frequency of each access type by the sum of the frequencies. For the hybrid trace-driven / probabilistic model the data accesses are for specific addresses and are taken from the trace file. Instruction accesses are based on the number of instructions executed since the last data reference. Addresses for instruction fetches are not used; a separate instruction cache is characterized probabilistically by a hit ratio. Parameter Frequency Probability instruction fetches per instruction 1.2 0.75 read data accesses per instruction 0.3 0.1875 write data accesses per instruction 0.1 0.0625 Table 7.2: Probabilistic representation of memory accesses. 7.2.3 Measures of performance The number of clock cycles per instruction (CPI) is a commonly used metric which reflects both the architecture of a processor and the delays experienced due to other system components such as the memory hierarchy. When studying the impact on performance of the memory subsystem, it. is useful to calculate the theoretical best possible value of CPI. Assuming that instruction and data cache accesses can not be overlapped, that multiple instructions are not initiated simultaneously, and that a cache access can be satisfied in one cpu cycle without stalling the processor, the best that can be expected for CPI is the sum of the memory access frequencies shown in Table 7.2 multiplied by the average time in cpu cycles required to execute instructions. 
The cycles per instruction metric can be used in a multiprocessor by dividing the time required for all processors to execute a certain number of instructions by the average number of instructions executed per processor. This metric will be referred to as multiprocessor CPI, or MCPI; it reflects directly the length of time required to execute a fixed length program, but does not include the effects of algorithm efficiency in the way that a speed-up measurement does. As with any average, the value of MCPI is more meaningful if all processors execute approximately the same number of instructions. Multiprocessor CPI can be used to study the degradation of system performance due to the memory hierarchy. If the MCPI for a memory configuration is twice its best theoretical value, then the speed-up attained with an ideal parallel algorithm will be at most 50%. If MCPI is close to its lowest possible value, then the memory hierarchy will at least allow high performance and efficiency, and performance will depend on the parallel algorithm used.

When MCPI is normalized by the best possible value it can attain (the theoretical MCPI with no memory hierarchy degradation) and the reciprocal is taken, the resulting number is a measure of the potential efficiency of the multiprocessor. High potential efficiency (close to 1.0) means that the memory hierarchy does not limit the speed-up of a parallel application. Low potential efficiency (close to 0.0) means that even the best parallel application will attain no speed-up. The potential efficiency multiplied by the number of processors in the multiprocessor being simulated gives the maximum attainable speed-up.
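In code form, these two definitions reduce to one division and one multiplication; the function names below are inventions for this illustration, and the numbers simply echo the 208-processor ideal-application result reported later in this chapter.

    #include <cstdio>

    // Potential efficiency as defined above; theoreticalMCPI is the best possible
    // value (1.6 for the probabilistic model of this chapter).
    double potentialEfficiency(double measuredMCPI, double theoreticalMCPI) {
        return theoreticalMCPI / measuredMCPI;      // 1.0 means no memory-hierarchy loss
    }

    int main() {
        const int processors = 208;                             // 26 clusters of 8 processors
        const double eff = potentialEfficiency(1.90, 1.6);      // about 0.84 for an MCPI of 1.90
        std::printf("potential efficiency = %.2f, max speed-up = %.0f\n",
                    eff, eff * processors);                     // roughly 0.84 and 175
    }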
The single most important resource in bus-based multiprocessors is the bus bandwidth. As a result, bus utilization is an important performance metric. When reported for cluster buses, utilization is an average of all cluster bus utilizations in the system being simulated. The saturation level of a bus is the utilization level beyond which it is not feasible to operate the multiprocessor. The reference saturation level shown in all plots is 75%. The MCPI and potential efficiency are used to confirm that this level of bus saturation provides reasonable system performance.

Cache hit ratio is of prime importance in reducing bus utilization and avoiding lengthy processor memory access latency. Cache hit ratios as reported for the trace-driven model are averages across all processors in the system. All cache hit ratios refer to data caches, since instruction accesses are modeled probabilistically in both models. Write accesses which miss in the cache are treated as a read miss followed by a write hit, and thus contribute to both the hit and miss ratios.

7.2.4 Correctness

Two major questions arise when using simulation for performance evaluation. First, is the model correct, i.e., does it accurately represent the system being modeled? And second, are the simulator tools and the simulation program itself error free? The strategies employed for increasing the likelihood of the correctness of the simulators and models are described in this section. Most of these techniques were also applied to the Petri net simulation.

The likelihood of the simulation library being error free is enhanced by the fact that it is a widely used simulation tool. Building on the work of other researchers by using existing simulation support libraries benefits the modeler by leveraging the experience and debugging effort of many other users of the library. Confidence in the correctness of the SILO libraries and the simulation programs developed for the study of the HBSM multiprocessor is also increased by the use of modern software engineering practices exemplified by object-oriented programming.

Choosing a widely available simulation library and following good software engineering practices raises confidence in the correctness of the simulator tools to a high level. Model correctness, on the other hand, requires the application of more heuristic methods and depends solely on the skill and experience of the modeler. Confidence in the correctness of the model can be gained by using the method of hierarchical decomposition for model development. Building simple models that can be readily examined, understood, and tested as a foundation for later, more detailed models increases the likelihood that the models accurately represent the system being modeled. At each stage of development, several techniques have been applied to increase confidence in the correctness of the HBSM multiprocessor model.

Testing a model by simulating scenarios with predictable results is one method of confirming the correctness of a model. For the probabilistic model this involved replacing the random generation of memory access events and system interactions with a deterministic sequence of events.
Because of its brevity and systematic layout, the outline served as a means of viewing the entire model without becoming lost in the details of the C++ implementation. The correctness of the outline of events could be more easily checked than the simulation program itself. Assuming the outline of events was an accurate representation of the HBSM multiprocessor, the correctness of the actual C++ program was reduced to the much simpler task of checking that it matched the outline of events. The trace-driven simulation model provides more possibility for errors, since it is more detailed than the probabilistic model, but at. the same time allows some important new methods for checking the correctness of the model. When the address traces are created by the multiprocessor address trace generator. the value returned by the memory system is recorded for every read access. The trace-driven simulation 106 model can use this information to check that the value returned to the processor during simulation is the same as the one recorded in the traces. If the returned values differ then an error has occurred in the model. This teclmique of uncovering errors was used quite successfully in both the Petri net simulation model and the trace-driven SILO model. A second advantage of the trace-driven model over the probabilistic model for considerations of correctness is the ability to write the simulation program in such a way that errors which occur in the model of the HBSM multiprocessor memory system are caught by the simulation program. For example, during a cluster bus write notice a cache controller must check for the presence of the block in the other caches on the cluster. If a copy is found in another cache in the private, dirty state, then a violation of the COGI protocol has occurred. Defensive programming can detect such situations so that the model error can be tracked down. Formally proving a simulation tool or model correct is an open research problem. Appropriate strategies of design and programming, and heuristics for testing and debugging can provide a reasonable level of confidence in simulation results. Table 7.3 summarizes the techniques employed to increase confidence in the simulation results for the HBSM multiprocessor. The application of two entirely different modeling tools and paradigms (Petri nets and SILO) to the study of the HBSM multiprocessor and COGI protocol further increase the likelihood of the protocol being correct. Technique Applied To Model hierarchical decomposition both short graphical events outline both extreme value testing probabilistic small cache torture tests trace-driven artificial scenario testing both measurement of transaction timing both checking returned value against trace trace-driven internal consistency checks trace-driven Table 7.3: Techniques of increasing confidence in correctness of models. 107 7 .3 Probabilistic Model The first model is purely probabilistic (no traces are used) and consists of multiple clusters of processors issuing three types of accesses: instruction fetches, data reads, and data writes. The stream of references for each processor is generated with a ran- dom number generator and a set of probabilities which characterize memory reference frequency in modern RISC microprocessors. Actual addresses for memory references are not generated by this model; the only distinction made is whether a memory ref- erence is for a local cluster memory location or for a global memory location. 
Since actual addresses are not generated for memory references, the behavior of the cache is controlled by the hit ratio and write-back parameters supplied to the simulator. This model is a compromise between speed and accuracy. It is used to explore the range of scalability of the architecture and to form a baseline for comparisons against the more accurate trace-driven model. An advantage of the probabilistic model is that it always operates in a steady state mode, since it is a purely probabilistic representation of the system being modeled, while results from a trace-driven simulation may be affected by the phenomenon of cache cold-starts. On an otherwise unloaded 25MHz 68040 NeXT workstation, the rate of simulation for the probabilistic model varies from 3,850 memory references per second to 850 memory references per second over the range of interest of the various simulation parameters. For the experiments done with this model each processor generated 10,000 memory references, independent of the size of the system being simulated. The probabilistic nature of the model requires that the actual operation of the system being modeled be simplified. For example, it is impossible to model the inter- action between caches for dirty sharing of a block (coherence traffic) when addresses are not actually used in memory references. Theoretically, a. parameter could be sup- plied to the simulator to represent the probability of every sort of event occurring. Practically, it is very difficult to specify all of these probabilities since they are so dependent on the actual memory reference patterns of the processors. The strategy adopted for this first model was to specify probabilities for the most important in- teractions and to ignore the interactions which have only second-order effects on the overall system performance. The results of this strategy are difficult to classify as ei- ther optimistic or pessimistic; some of the simplifications made will hurt performance 108 and others will help performance. However, the first-order effects on performance are accurately represented, and the model can be used to quickly explore wide ranges of important characteristics and system capacities. The parameters which are supplied to the simulator are shown in Table 7.4. The cache hit ratio parameter applies to data caches only; instruction caches have a non-varying 99% hit ratio. Data accesses are classified as being either for global memory (globaLprob) or being for local cluster memory (1 - globaLprob). The dirty.replacement.prob determines what percentage of replacements are for dirty blocks and hence generate write backs to memory. The probability that a CC cache controller must perform a cluster bus write notice for a write hit in its cache is write_notice_prob. For each write notice that is done, the probability that a block is also shared between clusters, and hence the CCC must perform a global bus invalidate for the block is globalinvalidateprob. Parameter Meaning hit_ratio Data cache hit ratio. globaLprob Accesses for global versus local cluster memory. Also determines probability of global bus write back on replacement. dirty.replacement.prob Chance that a block chosen for replace- ment on a write miss will be dirty and will require writing back to memory. write_notice.prob Chance that a block is being shared by other processors in same cluster so that a write notice is needed. globaljnvalidateprob Chance that a write notice must be trans- lated into a global bus invalidate by the CCC. 
Table 7.4: Probabilistic parameters for simulation of first model. The first model was used to estimate the scalalnlity of the HBSM multiprocessor. With five probabilistic parameters to study and two configuration parameters to control (number of clusters and number of processors per cluster), a strategy is needed to explore this large parameter state space. Tire strategy employed was to guess at the most important parameters, study them in isolation, pick reasonable values for them, 109 then experiment with full systems to verify the choice of first—order effect parameters. 7 .3.1 Single cluster It is a well known fact that in bus-based multiprocessors hit ratio is of first-order importance in determining bus utilization (and hence to overall system scalability and performance). The first experiment done with the probabilistic model was to determine the effect of hit ratio on bus utilization for a single cluster machine. No accesses to shared global memory are made in this experiment, thus no read requests, write backs or invalidates are made on the global bus. Cluster bus utilization is plotted as a function of number of processors for various hit ratio trials in Figure 7.2. The probability of dirty replacements and for dirty sharing are set at 10%. O. c: o -r-I l.) M N °r-l H O . "-1 l.) 5 U) D Q 0 H O w l.) U) :3 H U 0. Figure 7.2: SOT. Experiment 1 l l l r I g ’_,,--/‘1 , ’_ , .rm- — -- ' ‘1 r ...... 0% shared //’ /«”’ y”, ------- 10% write notices //" ~'“ A/,. ’4 10% dirty rep acement ' g,’ 1,.1' s - ' / ’ / ' - 6 / -1 4 “hit ratio 90.0%" « f/' "hit ratio 94.0%“ ---’ "hit ratio 98.0%“ ------- “hit ratio 99.0%" -- "hit ratio 99.5%“ -~-- 2 saturation(x) ‘ O l 1 l 1 1 L l M 2 4 6 8 10 12 14 16 18 number of processors Effect of hit ratio on cluster bus utilization for single cluster multiproces- The strong influence of hit ratio on bus utilization is seen by the rapid climb of the curves toward bus saturation for lower data cache hit ratios. Figure 7.2 shows 110 that in order to support 10 processors on a single cluster the data cache hit ratio must be 97.5%. This is a reasonable expectation for the hit ratio of a data cache for large scientific problems and will be used in further experiments to study the consequence of varying the other simulation parameters. For large, 2-way associative, LRU caches in a uniprocessor, Hennessy and Patterson [30, page 424] cite studies that show data cache hit ratios of up to 98.1%. Thus ten processors per cluster does not represent the maximum size of a cluster, since the hit ratio of 97.5% is not overly optimistic. For ideal data cache hit ratios (say 99.5%) the number of processors per cluster could be as high as 14. As seen in Table 7.5, the average number of cycles per instruction remains in a reasonable range for cache hit ratios greater than 97.0% and number of processors per cluster less than 12. For the target system of 10 processors per cluster and data cache hit ratios of 97.5%, the MCPI is 1.902. The measure of MCPI is less sensitive to data cache hit ratios than is bus utilization because the majority of memory references (75%) are for instructions, and instructions have a separate cache with a high bit ratio (99%) and no write-backs of dirty blocks. The first experiment established a maximum number of processors per cluster for a given cache hit ratio. To isolate the effects of global access and hit ratio, no accesses to global memory were made, and write notices and dirty replacements were kept fairly small. 
Three more experiments were run to determine the impact of global accesses, write notice traffic, and dirty replacements on the size of a single cluster. The second experiment determined the effect of global accesses on the size of a single cluster. The trials shown in Figure 7.3 are for a fixed cache hit ratio of 97.5% and the same dirty sharing and replacement probabilities as in the first experiment. The close grouping of curves for a wide range of global access probabilities (1 to 25%) indicates that global accesses have only a small effect on the number of processors that a single cluster can support. The surprisingly little effect which global bus accesses have on a single cluster is due to the fact that no queuing delay is ever experienced for the global bus. Since the simulation is for one cluster, access to the global bus and memory are guaranteed once the cluster bus is acquired. Global memory accesses add only a constant time to a transaction since there is no global bus contention. The third experiment was performed to gauge the impact of the probability of 2A perfect memory system which always supplied access to memory blocks in one cycle would have an MCPI of 1.6. 111 Experiment 2 97.5% hit ratio 10% write notices s?” 10% dirty replacement ‘H 7 0.8 ~ 0% global bus invalidates ,gc’ 0.4 r “shared "shared “shared "shared "shared cluster bus utilization 01.0%” 05.0%" ———- 10.0%" w~w 15.os~ —~m 25.0%" --—- 0.2 - saturationlx) 1 O 1 j l l l 1 2 4 6 8 10 12 14 16 number of processors 18 Figure 7.3: Effect of global accesses on cluster bus utilization for single cluster mul- tiprocessor. 112 Number of Cache Hit Ratio Processors 90% 92% 94% 96% 97% 98% 98.5% 99% 99.5% 1 2.00 1.94 1.89 1.83 1.81 1.77 1.76 1.74 1.73 2 2.01 1.95 1.90 1.83 1.81 1.78 1.76 1.74 1.73 4 2.15 2.05 1.96 1.87 1.84 1.80 1.79 1.76 1.75 6 2.40 2.23 2.05 1.91 1.89 1.83 1.80 1.78 1.76 8 2.85 2.49 2.27 2.02 1.94 1.86 1.84 1.80 1.78 10 3.48 2.96 2.52 2.16 2.04 1.91 1.87 1.82 1.79 12 4.10 3.52 2.88 2.37 2.17 1.99 1.93 1.88 1.81 14 4.89 4.08 3.32 2.68 2.35 2.08 2.00 1.90 1.84 16 5.54 4.65 3.83 2.90 2.61 2.26 2.10 1.98 1.91 18 6.19 5.30 4.24 3.32 2.85 2.41 2.30 2.09 1.96 Table 7.5: Effect of hit ratio on multiprocessor cycles per instruction (MCPI) for single cluster multiprocessor Experiment 1. selecting a dirty block for replacement in the CC caches. The hit ratio is placed at the value selected from the first experiment, no access to global memory is made, and the level of dirty sharing on the cluster is 10%. The close grouping of curves in Figure 7.4 for a wide range of dirty write back probabilities (1 to 25%) indicates that the time spent writing back dirty replaced blocks has only a small effect on the number of processors that a single cluster can support. This effect is greater than that of the global access experiment, since the difference between the extreme curves (1 and 25%) represents almost one whole processor. The fourth experiment was performed to gauge the impact of the probability of broadcasting a write notice on the cluster bus for each write hit to a CC cache. The hit ratio was set to the value selected from the first experiment, no access to global memory is made, and the level of dirty replacements is 10%. The fairly close grouping of curves in Figure 7.5 for a wide range of dirty sharing probabilities (1 to 25%) indicates that the time spent broadcasting write notices for intracluster shared blocks decreases the number of processors that can be supported on a single cluster by approximately one. 
Figure 7.4: Effect of dirty write-back probability on cluster bus utilization in single cluster multiprocessor.

Figure 7.5: Effect of write notice probability on cluster bus utilization in single cluster multiprocessor.

Together, experiments 2, 3, and 4 indicate that some degradation in the number of processors per cluster is to be expected as a result of write notices, dirty block replacements, and global memory accesses. Since these effects will be even more pronounced in multiple cluster machines, the number of processors per cluster should be reduced to 8 in order to keep cluster bus utilization in the acceptable range.

7.3.2 Multiple cluster

The first four experiments for a single cluster identified a reasonable number of processors per cluster (8) which can be supported at a 97.5% cache hit ratio over a wide range of dirty replacement, write notice, global invalidate and global access probabilities. With the number of processors per cluster established, and a target cache hit ratio and MCPI, further experiments investigate the ultimate scalability of the HBSM multiprocessor by simulating multiple clusters.

Ideal application

The first multiple cluster experiment uses parameters typical of the ideal type of application for a hierarchical bus architecture. This application would be typified by intracluster read and write sharing, but by intercluster read sharing only. In other words, blocks of global memory would be shared between clusters for reading, but dirty or write-sharing of blocks would be limited to processors on the same cluster. It is also assumed that the cache geometry (size and associativity) and replacement policies are such that replacement of dirty blocks during read misses is never done.

The plot in Figure 7.6 shows the global bus utilization as a function of the total number of processors (number of clusters, C, times the number of processors per cluster, P) in the system. Since the number of processors per cluster P is fixed at 8, this is equivalent to plotting against the number of clusters, but it directly shows the total size of the HBSM multiprocessor being simulated. Each line represents a trial of the experiment for a different level of shared global memory accesses. For global access levels below 10%, the global bus remains unsaturated for a 200 processor system. This is not surprising, considering the type of application that is represented by the simulation parameters used in this experiment.
Highly clusterable applications should exhibit much greater intracluster sharing than intercluster sharing, and as a result the global bus can support the needs of 26 clusters of 8 processors each.

Figure 7.6: Global bus utilization in ideal application.

Though not shown in a figure, the average cluster bus utilization in this experiment remains well below saturation (around 65%) for all but the highest levels of global accesses on the largest machine. In addition, the average number of cycles per instruction (MCPI) remains almost independent of the total multiprocessor size for global access levels of 10% or less, as shown in Table 7.6. For these access levels the MCPI remains very close to the target of 1.9 established in the single cluster experiments. Figure 7.7 shows the same data when plotted as potential efficiency. For global access levels of 10% or less, the potential efficiency remains very close to its maximum value for multiprocessors of up to 200 processors. For a configuration of 208 processors (26 clusters of 8 processors per cluster), the potential efficiency of 84% means that the memory hierarchy would allow a speed-up of 175 for a perfect parallel application.

Figure 7.7: Potential efficiency for ideal application.

    Total Processors                   Global Memory Accesses
    (Clusters)      1%     2%     5%     7%     10%    15%    20%    25%
      8  (1)        1.91   1.92   1.93   1.92   1.93   1.91   1.93   1.91
     16  (2)        1.91   1.91   1.91   1.91   1.91   1.91   1.91   1.93
     32  (4)        1.91   1.91   1.90   1.91   1.92   1.92   1.93   1.94
     48  (6)        1.91   1.91   1.91   1.92   1.91   1.93   1.92   1.94
     64  (8)        1.91   1.90   1.92   1.92   1.92   1.92   1.93   1.95
     80 (10)        1.91   1.91   1.91   1.92   1.92   1.93   1.94   1.96
     96 (12)        1.91   1.91   1.92   1.91   1.92   1.93   1.96   1.99
    112 (14)        1.90   1.90   1.91   1.91   1.91   1.94   1.96   2.02
    128 (16)        1.91   1.91   1.91   1.92   1.92   1.94   1.98   2.09
    144 (18)        1.91   1.91   1.91   1.91   1.92   1.96   2.03   2.25
    160 (20)        1.91   1.91   1.91   1.92   1.92   1.96   2.09   2.38
    176 (22)        1.91   1.91   1.91   1.92   1.92   2.00   2.16   2.67
    192 (24)        1.91   1.90   1.91   1.92   1.93   1.99   2.30   2.85
    208 (26)        1.91   1.91   1.92   1.92   1.94   2.07   2.53   3.08

Table 7.6: Multiprocessor cycles per instruction (MCPI) for ideal application, Experiment 5.

Non-ideal application

To gauge the effect of running applications on the multiple cluster machine which are less suited for the hierarchical bus architecture, the second multiple cluster experiment was done with a global bus invalidate probability of 25%, and less optimistic assumptions about dirty replacements of cache blocks.
This application would be characterized by a fairly high level of intercluster write sharing, since one out of four write notices would be broadcast on the global bus, and cache controllers would spend more time performing write-backs of dirty local and global blocks. As expected, global bus utilization rises more quickly and exceeds saturation at lower levels of global access than in the previous experiment. Figure 7.8 shows that with write-sharing between clusters and more block replacement traffic, the HBSM multiprocessor can only support 2% global accesses at the full size of 200 processors. Put another way, at the same 10% global access level of the previous experiment, the global bus can only sustain 16 clusters, or 128 processors, a decrease in size of 38%.

Figure 7.8: Global bus utilization for realistic application.

The effect of increased global bus use can be seen in the average MCPI tabulated in Table 7.7. The rise in MCPI is not as dramatic as that of global bus utilization because the individual cluster buses are not running on the edge of saturation; this buffer acts to protect MCPI until the global bus congestion is so severe that both MCPI and cluster bus utilization begin rising. Cache controllers performing transactions which require the global bus (a read request to global memory, or a global bus invalidate, for example) first arbitrate for their cluster bus, and only then request the global bus. Any queuing delay experienced at the global bus means a cache controller is also holding its cluster bus, effectively preventing any other cache from using it.

    Total Processors                   Global Memory Accesses
    (Clusters)      1%     2%     5%     7%     10%    15%    20%    25%
      8  (1)        1.96   1.94   1.94   1.96   1.94   1.96   1.95   1.94
     16  (2)        1.94   1.94   1.95   1.95   1.96   1.95   1.97   1.97
     32  (4)        1.94   1.94   1.94   1.96   1.96   1.97   1.97   1.97
     48  (6)        1.95   1.94   1.95   1.96   1.96   1.98   1.99   2.01
     64  (8)        1.95   1.94   1.96   1.96   1.97   1.98   2.01   2.03
     80 (10)        1.95   1.94   1.96   1.96   1.97   1.98   2.01   2.03
     96 (12)        1.94   1.95   1.95   1.97   1.99   2.04   2.12   2.25
    112 (14)        1.94   1.95   1.96   1.97   2.00   2.10   2.25   2.50
    128 (16)        1.94   1.94   1.97   1.98   2.02   2.16   2.38   2.75
    144 (18)        1.95   1.96   1.98   2.00   2.07   2.29   2.59   3.05
    160 (20)        1.95   1.95   1.99   2.03   2.10   2.42   2.88   3.53
    176 (22)        1.95   1.96   2.00   2.05   2.21   2.71   3.24   3.77
    192 (24)        1.95   1.97   2.03   2.09   2.36   2.92   3.48   4.11
    208 (26)        1.96   1.97   2.05   2.18   2.48   3.11   3.79   4.39

Table 7.7: Multiprocessor cycles per instruction (MCPI) for non-ideal application, Experiment 6.

As expected for this less-than-ideal workload, Figure 7.9 shows the plot of potential efficiency decreasing significantly for even low levels of global access. Not only does efficiency fall off more rapidly for this simulated application (when compared to the last experiment), but the maximum efficiency possible is also lower. For 10% global access levels the last experiment had a potential efficiency of 82.5% and a maximum theoretical speed-up of 172. By comparison, at 10% global access for this experiment, a maximum size configuration of 208 processors has a potential efficiency of only 64%, for a maximum speed-up of 133.
“ " ~— _..__ “s. ~—__. ‘~ > \ g 20% write notices .2 10% dirty replacement cg , .3 0.6 ~25% global bus invalidates '5, '«- . m H \\ (U _\\\ “13‘ 0.4- \\ 1 c m .1.) o 0. "global 01.0%“ ——— 0.2 . "global 05.0%" ---- . "global 10.0%“ "-- “global 15.0%“ -------- "global 25.0%" -- O 1 1 1 1 50 100 150 200 total number of processors (PxC) Figure 7.9: Potential efficiency for realistic a1')plication. 7 .4 Trace-driven Model With a purely trace-driven simulation it is possible to accurately model the opera- tions of the HBSM multiprocessor; probabilistic characterization of memory reference dependent events can be avoided. A purely trace-driven model would represent the state of every cache, memory and controller in the HBSM multiprocessor. Simula- tion time for such models is mostly determined by the size of the trace, and to a 122 lesser extent on the parameters of the model (cache geometry, size of system, etc). In addition, since the number of events in a complete trace-driven model is larger than a probabilistic model, simulation time as measured in CPU seconds per mem- ory access would be larger. Simulation of models of large systems with realistically sized traces would be both computationally and storage space intensive. A hybrid trace-driven/probabilistic model can be used as a compromise between estimating all parameters and unreasonable simulation running times and trace storage space requirements. The strategy behind the hybrid model is to accurately model a single cluster and the global bus, but to represent intercluster interactions probabilistically. This strategy can be justified by the structure of the HBSM itself; the clustered na: ture of the architecture means that applications which are well suited for the HBSM multiprocessor will tend to have a much greater level of intracluster than intercluster activity. Compared to a complete trace-driven model, sinmlation times will be pro- portional only to the number of processors per cluster, and will be independent of the number of clusters studied. Intercluster activity can be characterized by studying the trace of the application being simulated, thus reducing the uncertainty of the probabilistic representation. An additional simplification of the hybrid model is to model separate instruction caches probabilistically. Data caches are fully represented, so the hit rate of a data cache is determined by the cache’s geometry, replacement policy, and the memory ref- erence patterns in the traces. Instruction caches, on the other hand, are represented in the model by a single number: hit ratio. The performance results (MCPI, poten- tial efficiency, bus utilization) from the hybrid model contain the effect of instruction fetches, since instruction misses are handled by the cache controller in the same man- ner data misses are, but actual address traces are not used for instructions. Taking a more abstract approach to modeling instruction fetches is a reasonable method of reducing simulation complexity, since instruction caches have higher hit ratios and better overall performance than data caches, and they do not. generally have dirty blocks to replace. In addition, since the number of instructions fetched is typically two to three times greater than the number of data accesses, the decrease in simulation time is significant. The application used in the trace-driven simulations is a parallel merge sort. 
The algorithm for the parallel sort attempts to balance the amount of work done in each processor over all phases of execution and has been proposed as a shared memory multiprocessor benchmark. The parallel sort is an iterative, barrier synchronized algorithm built on an efficient, asynchronous parallel merge algorithm capable of effectively using P processors to merge two ordered arrays of size N/2 into a single ordered array of size N [24]. The performance of the HBSM multiprocessor on the parallel merge sort trace of 8192 random, integer data elements, executed on an 8-cluster multiprocessor with 8 processors per cluster, is used to gauge the accuracy of the probabilistic model. The parallel merge sort appears to be a reasonably good match to the HBSM multiprocessor architecture, since the list of data to be sorted can be stored in global memory, and each processor accesses mostly its own cluster memory. Since processors work in groups of progressively larger powers of two, when the size of the groups exceeds the number of processors per cluster, intercluster sharing of global data is done for both reading and writing. Sorting is an important application of computers, and the parallel merge sort lies in the middle of the spectrum of algorithms suited for the hierarchical bus architecture. It is thus reasonable to hope that high performance of the HBSM multiprocessor memory subsystem on the parallel merge sort is indicative of potentially high performance on other parallel applications.

The performance of the simulation model, as measured by the number of memory references (both data and instruction) per second of CPU time on an otherwise unloaded 25MHz 68040 NeXT workstation, is only 850 references/second. This compares to 3800 references/second for the probabilistic model with comparable cache hit ratio, and 60 references/second for the highly detailed Petri net model. The difference in speed is due to the difference in detail of the two models; the probabilistic model consists of 32 events in 760 lines of C++, versus 51 events in 1970 lines of C++ for the hybrid model. The performance of the hybrid model is only weakly dependent on the characteristics of the model being simulated compared to the probabilistic model. This is a result of both the detailed level of modeling and the time spent in disk I/O for the trace-driven model.

7.4.1 Intercluster activity

The hybrid model accurately represents the state of a single cluster, the global bus and memory. Assuming that the behavior of the traced application in the full HBSM multiprocessor is symmetrical with respect to clusters (true for the parallel merge sort), the actions and performance of one cluster will be representative of all clusters. For the hybrid model the traces are used to estimate the level of intercluster activity. The measurement made to gauge intercluster activity is the average number of clusters from which a global block is accessed. While a single measurement could be used to represent this average, the synchronized loops of the parallel merge sort offer an opportunity to increase the resolution of this measurement. The number of clusters accessing each global block is averaged across blocks for each synchronization period to determine Ac. As expected, the level of sharing as measured by Ac increases steadily with each synchronization period of the parallel sort. Only blocks which are accessed by at least one cluster (that is, at least one processor on one cluster) are counted in the measurement of Ac.
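A short C++ sketch shows how Ac could be computed from an address trace, one value per barrier-synchronized period. The TraceRecord layout, the barrier marker, and the isGlobal test are illustrative assumptions about the Tracer output format, not its actual definition.

// Average, over all global blocks touched in one synchronization period, of the
// number of distinct clusters that accessed each block (>= 1.0 by construction).
#include <cstdint>
#include <map>
#include <set>
#include <vector>

struct TraceRecord {
    int      processor;   // issuing processor id (0 .. P*C - 1)
    uint32_t address;     // word address of the reference
    bool     barrier;     // true if this record marks a synchronization point
};

std::vector<double> computeAc(const std::vector<TraceRecord>& trace,
                              int processorsPerCluster,
                              uint32_t blockWords,
                              bool (*isGlobal)(uint32_t address)) {
    std::vector<double> acPerPeriod;
    std::map<uint32_t, std::set<int>> clustersPerBlock;   // block -> clusters seen
    auto flush = [&]() {
        if (clustersPerBlock.empty()) return;
        double sum = 0.0;
        for (const auto& entry : clustersPerBlock) sum += entry.second.size();
        acPerPeriod.push_back(sum / clustersPerBlock.size());
        clustersPerBlock.clear();
    };
    for (const TraceRecord& r : trace) {
        if (r.barrier) { flush(); continue; }             // end of one period
        if (!isGlobal(r.address)) continue;               // only global blocks count
        uint32_t block = r.address / blockWords;
        clustersPerBlock[block].insert(r.processor / processorsPerCluster);
    }
    flush();                                              // final period
    return acPerPeriod;
}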
This means that Ac ranges from 1.0 to C, where C is the number of clusters. If blocks were only accessed by one cluster, Ac would be 1.0. If blocks were accessed by all clusters, Ac would be C. From Ac the following probability can be derived:

\[
\Pr(\text{block } b \text{ is held in cluster } c) = \frac{A_c}{C}
\]

This is actually a conditional probability, since only blocks which are accessed by at least one cluster are used to measure Ac.

The COGI cache coherence protocol requires that 8 types of intercluster activity take place. These events are shown in Table 7.8. Some simplifying assumptions reduce the number of these intercluster events which must be modeled probabilistically. First, if each processor is limited to accessing local cluster memory and global memory, then the CMC will not be required to act in cases 1 through 4. This restriction is reasonable, since widely shared data can be placed in global memory, and cluster memory can be reserved for processor private memory. The traced parallel merge sort application uses memory in this way. Second, if the CCC has a very large, highly associative cache, then case 8 can be ignored. Since the CCC only caches status, and not data, its impact on the performance of the system will not be as important as a traditional second level cache. In addition, the CCC can be made very large, for a fixed number of transistors, as compared to a traditional data/status cache, since status information requires only a fraction of the storage that status and data require. The remaining cases (5, 6, and 7) are the results of read requests or write notices to global blocks that may be held in multiple clusters.

Case   Controller    Transaction           Action taken
 1     CMC gsnoop    global bus RR         cluster bus RR
 2     CMC gsnoop    global bus WB         cluster bus WB
 3     CMC csnoop    cluster bus RR        global bus RR
 4     CMC csnoop    cluster bus WB        global bus WB
 5     CCC gsnoop    global bus RR         cluster bus flush
 6     CCC gsnoop    global bus IN         cluster bus IN
 7     CCC csnoop    cluster bus WN        global bus IN
 8     CCC csnoop    replacement policy    cluster bus IN

Table 7.8: Cases of intercluster transactions in the COGI protocol.

The CCC sends a global bus invalidate when it observes a cluster bus write notice to a block which may be shared by other clusters. In the typical case of no intercluster write sharing, a global bus invalidate would be broadcast for the first write notice to a block, and further write notices for this block would remain in the cluster. Case 7 of Table 7.8 can thus be modeled quite accurately for low levels of write sharing just by following the COGI protocol. Traffic on the global bus due to invalidates from other clusters (those not being modeled) is assumed to occur at the same rate as the cluster being modeled generates GBINs. Every GBIN broadcast from the modeled cluster creates C - 1 other GBIN transactions on the global bus.

For every invalidate on the global bus, the CCC gsnoop of each cluster decides whether a cluster bus invalidate is required on its cluster. The measured trace characteristic Ac can be used to determine the likelihood of these case 6 transactions. Assuming symmetry between clusters, the number of global bus invalidates broadcast by each cluster should be approximately the same. Thus, the cluster being simulated will see the global bus invalidate traffic from C - 1 other clusters. The probability that a global bus invalidate from cluster Y is translated into a cluster bus invalidate by the CCC on cluster X is

\[
\Pr(\text{CBIN in cluster } X)
 = \Pr(\text{block in } X \text{ and block in } Y)
 = \Pr(\text{block in } X \mid \text{block in } Y)\times\Pr(\text{block in } Y)
\]

by the definition of conditional probability. Since Ac accounts for only those blocks held by at least one cluster (i.e. it is never less than 1.0), the probability of a block residing in one of the other clusters (X) not responsible for generating the global bus invalidate (Y) is

\[
\Pr(\text{block in } X \mid \text{block in } Y) = \frac{A_c - 1}{C - 1}
\]

so that

\[
\Pr(\text{CBIN in cluster } X) = \frac{A_c - 1}{C - 1}\times\frac{A_c}{C} .
\]

When blocks are held by only one cluster at a time (Ac = 1, no sharing), the probability of a CBIN is 0, as expected. When blocks are held by all clusters at all times (Ac = C, maximal sharing), the probability of a CBIN is 1.0, as expected. Using this probability, and the assumption of symmetry between clusters, the model translates a GBIN from one of the C - 1 other clusters with the above probability. This corresponds to a rate of

\[
(C - 1)\times\frac{A_c - 1}{C - 1}\times\frac{A_c}{C} = \frac{(A_c - 1)\,A_c}{C}
\]

CBIN events for the cluster being simulated for every GBIN generated on the simulated cluster.
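A minimal sketch of how the hybrid model might apply this translation probability when it injects GBIN traffic from the C - 1 unmodeled clusters; the class and function names and the random-number interface are illustrative, not the actual SILO model code.

// For every GBIN broadcast by the simulated cluster, symmetry is assumed to
// imply one GBIN from each of the other C-1 clusters.  Each such foreign GBIN
// becomes a cluster bus invalidate (CBIN) on the simulated cluster with
// probability (Ac-1)/(C-1) * Ac/C, so the expected CBINs per local GBIN is
// (Ac-1)*Ac/C, matching the rate derived above.
#include <random>

struct ClusterBusModel {
    virtual void scheduleCBIN() = 0;   // stand-in for the simulator's event call
    virtual ~ClusterBusModel() = default;
};

void injectForeignInvalidates(ClusterBusModel& cluster,
                              int C,            // number of clusters
                              double Ac,        // measured sharing level, 1.0 <= Ac <= C
                              std::mt19937& rng) {
    if (C < 2) return;                           // nothing to translate
    const double pCBIN = ((Ac - 1.0) / (C - 1.0)) * (Ac / C);
    std::bernoulli_distribution hits(pCBIN);
    for (int other = 0; other < C - 1; ++other) {
        if (hits(rng)) {
            cluster.scheduleCBIN();              // one possible CBIN per foreign GBIN
        }
    }
}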
The accuracy of determining CBIN activity from GBIN activity is only as good as the accuracy of the GBIN activity model. If a block is being heavily shared between two clusters, and one of the clusters writes to it, then it is likely that a global bus read request from the invalidated cluster will soon come along and reset the state of the block in the first cluster to shared, so that further writes in this cluster will result in global bus invalidates. This corresponds to case 5 in Table 7.8. Because Ac does not distinguish between the type of access performed (read or write), and because the relative timing of events in different clusters is impossible to measure accurately from the traces, the rate of GBINs generated (and hence of CBINs) will be smaller than the rate which would take place in the actual multiprocessor. The flush transactions which determine this inaccuracy (case 5) are not modeled at all.

7.4.2 Benchmark parallel merge sort

The parallel sort algorithm divides a data set into P segments, uses a single processor and a sequential sorting algorithm (Quicksort, say) to sort each segment, then applies the effective parallel merge to iteratively merge the resulting P sorted segments into a single sorted data set. The parallel merging requires ceil(log2 P) iterations to fully merge the original P data segments. Figure 7.10 shows the parallel merge sort algorithm. The original data set S is first divided into P segments, each denoted by K, where 1 <= K <= P. The functions ub(X) and lb(X) return the upper and lower bound of their array arguments, respectively. The values lo(X, K) and hi(X, K) represent the beginning and ending indices of array X demarking the region of X in which the K-th processor operates. In each iteration of the loop in Figure 7.10 teams of processors work together to merge two segments of S (labeled A and B) into a segment of the destination array D (labeled C). In the first iteration, teams of two processors merge two segments of length |S|/P from S into D. In the second iteration, teams of four processors merge two segments of length 2 x (|S|/P) from S into D. The final iteration uses all processors to merge two segments of length |S|/2 into the destination D. The details of the balanced parallel merge can be found in [24].

Let S be the array of data to be sorted
    A be a segment of S worked on by a group
    B be a segment of S worked on by a group
    D be the destination array
    C be a segment of D to store the merge of A and B
    p be the size of a group of processors
    k be the number of the K-th processor within a group

Split S into P segments labeled with K
    lo(S, K) = lb(S) + (K - 1) x [ub(S) - lb(S) + 1]/P
    hi(S, K) = lo(S, K + 1) - 1
Quicksort the P segments
For i = 1 to ceil(log2 P) in barrier synchronized loops do
    p = min(2^i, P - 2^i x (ceil(K/2^i) - 1))
    k = ((K - 1) mod 2^i) + 1
    lb(A_K, i) = lo(S, 2^i x floor((K - 1)/2^i) + 1)
    ub(A_K, i) = hi(S, min(P, 2^i x floor((K - 1)/2^i) + 2^(i-1)))
    lb(B_K, i) = lo(S, min(P, 2^i x ceil(K/2^i) - 2^(i-1) + 1))
    ub(B_K, i) = hi(S, min(P, 2^i x ceil(K/2^i)))
    lb(C_K, i) = lo(D, 2^i x floor((K - 1)/2^i) + 1)
    ub(C_K, i) = hi(D, min(P, 2^i x ceil(K/2^i)))
    Perform balanced parallel merge using p, k, A, B, C

Figure 7.10: The benchmark parallel merge sort used for traces in the hybrid model.
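The phase structure of Figure 7.10 can also be summarized as a short C++ sketch. The barrier() and balancedParallelMerge() primitives, the per-processor driver, and the ping-pong between S and D (one way to realize the repeated S-to-D merges of the figure) are illustrative assumptions standing in for the benchmark code of [24]; the index arithmetic is simplified by assuming P divides |S|.

// Per-processor driver for the benchmark parallel merge sort (processor K,
// 1 <= K <= P).  Each processor quicksorts its own segment of S, then joins
// progressively larger teams that merge pairs of sorted runs from the current
// source array into the destination array, one barrier-synchronized pass per
// doubling of the team size.
#include <algorithm>
#include <cmath>
#include <vector>

void barrier();                                        // assumed synchronization primitive
void balancedParallelMerge(const int* a, int lenA,     // assumed merge routine of [24]
                           const int* b, int lenB,
                           int* c, int teamSize, int rankInTeam);

void parallelMergeSort(std::vector<int>& S, std::vector<int>& D, int P, int K) {
    const int n = static_cast<int>(S.size());
    const int segLen = n / P;                          // assume P divides n for brevity
    const int lo = (K - 1) * segLen;                   // this processor's segment of S
    std::sort(S.begin() + lo, S.begin() + lo + segLen);// sequential sort of one segment
    barrier();

    std::vector<int>* src = &S;
    std::vector<int>* dst = &D;
    const int passes = static_cast<int>(std::ceil(std::log2(P)));
    for (int i = 1; i <= passes; ++i) {
        const int team   = 1 << i;                     // processors cooperating this pass
        const int group  = (K - 1) / team;             // which pair of runs this team merges
        const int runLen = segLen * (team / 2);        // length of each of the two runs
        const int aLo    = group * team * segLen;
        const int bLo    = aLo + runLen;
        balancedParallelMerge(src->data() + aLo, runLen,
                              src->data() + bLo, std::min(runLen, n - bLo),
                              dst->data() + aLo,
                              team, ((K - 1) % team) + 1);
        barrier();                                     // all teams finish before next pass
        std::swap(src, dst);                           // merged runs feed the next pass
    }
}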
7.4.3 Results for 8x8 system

The trace-driven model requires the specification of cache size, block size and cache associativity. Cache hit ratio will depend on all of these factors. The simulations done for this chapter were based on 454,193 data references (reads and writes) and 1,495,464 instruction fetches for a single cluster of 8 processors (243,707 memory references per processor). The level of sharing (accesses to global blocks, as defined for the probabilistic model) measured in the traces is 13%. Both cache and block sizes are reported in units of 32-bit words. The parallel sort was done on random integer data ranging from 0 to 8,192,000. Each element of the data to be sorted required a single 32-bit word for storage, thus a 256 word cache can store 256 data elements. For experiments in which the geometry of the cache (set or block size) is varied, the size of the cache is held constant. In other words, a cache of 1024 words with a block size of 1 can hold 1024 blocks, while the same cache with a block size of 16 can hold only 64 blocks.

The first experiment with the trace-driven simulation model fixes the associativity and block size at typical values in order to study the performance of the system for various cache sizes. The data plotted in Figure 7.11 are cluster and global bus utilizations as a function of cache size for a 2-way set associative cache and a block size of 4 words. Figure 7.12 shows the potential efficiency and the cache hit ratio for this same system. In order to achieve a 97.5% cache hit ratio, the cache must be 256 words in size. For this cache size, the cluster and global bus utilization are 58% and 80%, respectively. The purely probabilistic model predicted a cluster bus utilization of 70% for an 8 x 8 multiprocessor at 97.5% data cache hit ratio, and a global bus utilization of 48%, as shown in Figure 7.8. Thus, for the same data cache hit ratio, the trace-driven model gives performance estimates which are slightly better for cluster bus utilization and significantly worse for global bus utilization. As expected, the high bus utilization results in a lower value of potential efficiency; at a cache size of 256, each instruction takes 2.1 times longer to complete than the theoretically possible MCPI, and the maximum potential efficiency of the system is only 47%.
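The 47% figure is consistent with reading potential efficiency as the ratio of the ideal MCPI to the measured MCPI, a definition assumed here from the way the numbers are used in this chapter:

\[
E_{\text{potential}} = \frac{\text{MCPI}_{\text{ideal}}}{\text{MCPI}_{\text{measured}}}
 \approx \frac{1}{2.1} \approx 0.47 .
\]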
[Figure 7.11: Bus utilizations in typical configuration of block size 4, set size 2. Plot of cluster and global bus utilization versus log(cache size), with the saturation level marked (Experiment 7).]

[Figure 7.12: Hit ratio and potential efficiency in typical configuration of block size 4, set size 2. Plot of data cache hit ratio and potential efficiency versus log(cache size) (Experiment 7).]

The explanation of the discrepancy between the two models can be found by examining the nature of the memory reference patterns for the parallel merge sort. Each processor in the parallel merge makes memory accesses to global memory (13%) and to processor private cluster memory (87%). The cache behavior of these two groups is very different. For processor private memory the cache hit ratio is 98.6%. For global memory the cache hit ratio is only 89.2%. (Weighting these by the 87%/13% access split gives 0.87 x 98.6% + 0.13 x 89.2%, or about 97.4%, consistent with the overall data cache hit ratio reported above.) While some (3.5%) of the cache misses for global blocks are serviced by other caches on the cluster which have dirty copies of the block, the remainder of the misses to global blocks are serviced by the global memory. Access to global memory is slower than cluster memory because of the time required to transfer the memory block across two buses, and because of the contention delay for two buses.

The two other factors contributing to the higher-than-expected global bus utilization are the percentage of replacement of dirty blocks and the likelihood of global bus invalidations. The experiments for the purely probabilistic model examined these two parameters over a range of 1 to 25%. For a cache size of 256 (97.5% hit ratio), the replacement of dirty blocks on cache misses occurs 43% of the time. Of these dirty replacements, nearly half are global blocks which require lengthier write back transactions. The second factor is the global bus invalidation traffic. The percentage of cluster bus write notices that result in global bus invalidate transactions is 41%. This high level of invalidates is closely related to the large number of dirty global blocks that are selected for replacement. Each dirty block that is replaced has a high likelihood of being accessed again by the same cache. Every time a block is written back to the global memory and read into the cache, a subsequent write will require a global bus invalidate, since global blocks cannot be held in a clean, cluster exclusive state.

In order to avoid memory system performance degradation (as judged by potential efficiency) in the typical 8 x 8 system (block size of 4, set associativity of 2), cache size must be increased to 1024 words. Figure 7.12 shows that at this cache size the high data cache hit ratio attained (99.2%) increases the potential efficiency to 87.5%. Large caches and the resulting high cache hit ratios also produce more reasonable values for dirty block replacement (20%) and for global bus invalidates (17%). In turn, a side effect of less dirty block replacement is more intracluster dirty sharing between caches. The probability that a cache supplies a dirty block upon a cluster bus read request is 30% for large cache sizes compared to only 3.5% for a cache of 256 words. Since efficient cache-to-cache transfer of dirty blocks was one of the design goals of the COGI protocol, increased levels of dirty sharing help the performance of the whole system. As a result, the cluster and global bus utilizations in Figure 7.11 are well below the saturation point at 48% and 36%, respectively.
7.4.4 Cache geometry

Two factors determine the geometry of a cache: block size and set size. Block size refers to the number of words which comprise the smallest unit of coherence in the system. Larger blocks require more transfer time on the bus, but may benefit the cache hit ratio by pre-loading words adjacent to an accessed word. On the other hand, smaller blocks allow better use of the cache space, since they decrease the number of words that are loaded in the cache and never accessed. The associativity of a cache refers to the number of cache frames in which a given block may be stored in the cache. Fully associative caches allow any block to be stored in any frame of the cache. Direct mapped caches (associativity of one) specify one particular cache frame for storing each block. A compromise organization (in both cost and performance) is a set associative cache; each block of memory can be stored in one of S cache frames, where S is the set size.

The effect of block size on the performance of the HBSM multiprocessor was studied with the trace-driven simulation model by comparing the performance of block sizes from 1 word to 16 words for 2-way set associative caches. Table 7.9 shows the MCPI data for this configuration. Figure 7.13 shows the plot of potential efficiency as a function of block size. Examination of Table 7.9 and Figure 7.13 reveals that for reasonably sized caches (greater than 256 words) the best overall system performance is achieved for a block size of four. Smaller caches show a distinct peak in the curves of potential efficiency versus block size, while larger caches exhibit the same behavior but less dramatically. The point at which performance begins to degrade with increasing block size is known as the memory pollution point [22]. Large block size has both a positive pre-loading effect and a negative space-wasting effect on hit ratio. For small caches the space wasted with large blocks (wasted because some words of a block may never be accessed) is relatively more important than it is in large caches, so the performance of the system past the memory pollution point degrades rapidly.

                         Block Size
Cache Size        1       2       4       8      16
    32         8.86    8.50   11.04   16.62   24.59
    64         7.57    5.74    6.76   10.32   17.68
   128         6.33    4.34    3.93    5.73    9.61
   256         4.95    3.41    2.82    3.22    4.82
   512         3.27    2.34    1.91    1.94    2.66
  1024         2.00    1.62    1.53    1.54    1.65
  2048         1.83    1.54    1.50    1.51    1.59
  4096         1.73    1.52    1.48    1.49    1.56

Table 7.9: Experiment 7 multiprocessor cycles per instruction (MCPI) data for set size 2.

[Figure 7.13: Effect of block size on potential efficiency for set size 2. Plot of potential efficiency versus log(block size) for caches of 256, 512, and 2048 words (Experiment 7).]

As expected from uniprocessor experience with caches, the performance of the system is better for larger set size. Figure 7.14 shows the positive benefit that larger set size has on performance, particularly for small caches. For larger caches the gain in performance from a set size of 2 to 4 probably does not justify the additional expense of the higher associativity cache.

[Figure 7.14: Effect of set size on potential efficiency for optimal block size. Plot of potential efficiency versus log(cache size) for block size 4 with set sizes 1, 2, and 4 (Experiment 7).]
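The geometry parameters used throughout these experiments relate to one another in the usual way; the small sketch below, using hypothetical names rather than the actual SILO model code, shows how a fixed-capacity cache derives its organization and maps an address to a set.

// Cache geometry as used in these experiments: sizes in 32-bit words, a fixed
// total capacity split into blocks, and blocks grouped into sets of `assoc` frames.
#include <cassert>
#include <cstdint>

struct CacheGeometry {
    int cacheWords;   // total capacity, e.g. 1024
    int blockWords;   // block (coherence unit) size, e.g. 4
    int assoc;        // set size: 1 = direct mapped, 2 or 4 = set associative

    int numBlocks() const { return cacheWords / blockWords; }   // 1024/4  = 256 blocks
    int numSets()   const { return numBlocks() / assoc; }       // 256/2   = 128 sets

    // A given memory block may reside only in the frames of one set.
    int setIndex(uint32_t wordAddress) const {
        uint32_t blockNumber = wordAddress / blockWords;
        return static_cast<int>(blockNumber % numSets());
    }
};

int main() {
    CacheGeometry g{1024, 4, 2};
    assert(g.numBlocks() == 256);   // the same capacity with block size 16 holds 64 blocks
    assert(g.numSets() == 128);
    // Two blocks separated by numSets() block strides map to the same set and
    // therefore compete for its `assoc` frames.
    assert(g.setIndex(0x2040) == g.setIndex(0x2040 + 4 * 128));
    return 0;
}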
The effect of set and cache size on global bus utilization is shown in Figure 7.15. For reasonably sized caches, the large improvement in global bus utilization gained from a 2- or 4-way set associative cache indicates that the global memory blocks accessed by a processor tend to interfere with each other in the cache (i.e. they map to the same cache frame). Increasing the set size decreases misses due to this interference, lowering the global bus utilization.

[Figure 7.15: Effect of set size on global bus utilization for block size 4. Plot of global bus utilization versus log(cache size) for set sizes 1, 2, and 4, with the saturation level marked (Experiment 7).]

7.4.5 Coherence traffic

An advantage of the trace-driven simulation is the ability to accurately characterize the impact of coherence activity on the system. The results in this section are taken from the trace-driven simulation of the 8 x 8 multiprocessor sorting 8192 random integers. The cache size is fixed at 1024 words, with resulting cache hit ratios in the neighborhood of 96-99%.

Write notices

The number of write notices placed on the bus by the CC cache controllers decreases as block size increases, as shown in Figure 7.16. Since a write to a single word in a block will place the block in a private, dirty state, large block size reduces cluster bus write notice traffic for blocks which are not being shared for writing between two or more processors.

[Figure 7.16: Fraction of cache writes which generate a cluster bus write notice. Plot versus log(block size) for a cache size of 1024 words (Experiment 7).]

As expected, the fraction of write notices issued for blocks that are not actually shared by other caches on the cluster drops significantly with increasing block size. For large block sizes in 2- and 4-way associative caches the unnecessary cluster bus write notice traffic is minimal, as shown in Figure 7.17. With a block size of 16, a single write notice potentially replaces the 15 unnecessary write notices that may occur for a block size of one. Write notice traffic for blocks which are being actively shared between processors is not affected very strongly by block size. The advantage of large block sizes is seen for private memory. Reducing the number of unnecessary write notices helps to decrease bus utilizations.

[Figure 7.17: Fraction of write notices to blocks which are not shared by other caches. Plot versus log(block size) for set sizes 1, 2, and 4, cache size 1024 words (Experiment 7).]

Dirty sharing

Dirty sharing takes place between CC caches for global memory blocks. Figure 7.18 shows the fraction of read misses which are serviced by another cache on the cluster with a dirty copy of the block. Cache-to-cache transfers of dirty blocks are faster than reading blocks from memory, so the decrease of sharing for large block sizes increases bus utilizations. For set associative caches this effect is very small, and is probably due to large block size decreasing the time that dirty blocks remain in caches without being replaced.
The readily apparent differences between direct-mapped caches (associativity 1) and set associative caches in Figures 7.16, 7.17 and 7.18 show the advantage of increasing the flexibility of the cache block replacement algorithm. The adverse impact of large block size on direct mapped caches results in a lower optimum block size (2 compared to 4) for such caches.

[Figure 7.18: Fraction of read misses which are serviced by another cache with a dirty copy of the block. Plot versus log(block size) for set sizes 1, 2, and 4, cache size 1024 words (Experiment 7).]

Bus traffic

One of the advantages to be gained from detailed trace-driven simulations is the ability to account completely for the use of each system resource. The cluster bus utilization in an 8 x 8 multiprocessor with cache size of 1024, block size of 4, and 2-way associative caches is 48%. Table 7.10 shows how the busy time of the bus is divided among the possible bus transactions.

Transaction                 Bus Busy Time (%)
instruction fetch                 73.4
global memory read                11.2
write notice                       4.1
cluster memory read                3.9
cache-to-cache transfer            2.7
global write back                  2.2
cluster write back                 1.9
global invalidate                  0.7

Table 7.10: Experiment 7 distribution of cluster bus busy time.

The large portion of time required for instruction misses indicates the importance of maintaining a high instruction cache hit ratio. The relatively small portion of time spent in maintaining coherence (write notices, global bus invalidates, and cache-to-cache transfers together account for 4.1% + 0.7% + 2.7% = 7.5%) indicates that the COGI cache coherence protocol does not impose a great cost on the system to maintain coherence.

7.5 Summary

The purely probabilistic model was used to identify a target configuration for the HBSM multiprocessor. Simulations for a single cluster indicated that 8 processors per cluster could be supported with a potential efficiency of 84% at a 97.5% data cache hit ratio. Multiple cluster experiments showed that an ideal application (one with no global bus invalidations and no dirty block cache replacement) could achieve the target level of efficiency for configurations of 200 processors and global access probabilities of up to 10%. For a less ideal application (25% global bus invalidations and 10% dirty block replacement), the potential efficiency of the system with 10% global accesses degraded below the 84% target level at only 100 processors.

The experiments done with the trace-driven model show that the HBSM multiprocessor architecture can readily support 8 clusters of 8 processors each for the parallel merge sort application. The relatively high level of global memory access (13%) and the amount of intercluster sharing, indicated by the 30% or greater global bus invalidation rate and write sharing of 3 to 10%, show that the parallel merge sort is not an overly good match for the hierarchical bus architecture. Considering the imperfect match of the parallel merge sort to the HBSM architecture, and extrapolating from the readily attainable cluster bus utilization of 45% and global bus utilization of 20%, a maximum scalability of 120 to 200 processors can be expected for the parallel merge sort.
Judging by the limit on potential efficiency, such a system could be expected to show a speed-up on good parallel applications of 90% of the size of the multiprocessor. Applications which exhibit more clustering of memory accesses and fewer global memory references should easily reach the goal of efficiently utilizing an HBSM multiprocessor of 200 processors. Applications which would be more suited to the HBSM multiprocessor include the large group of problems which operate in a small neighborhood of a large data set. Image processing problems are an obvious example of this sort of application. Since a very large amount of data could be distributed amongst the clusters, the only processors that would use the global bus heavily would be those that operate at the cluster-defined boundaries of the data. Users who need to make many runs of a simulation program (such as graduate students) could use the HBSM multiprocessor very efficiently, since each processor could be assigned a different set of parameters to generate one data point in a graph. Such applications would require no intercluster 141 sharing at all. Simulations of complex systems such as VLSI circuits may be able to be partitioned to run efficiently on a clustered architecture. The most important variable to the overall performance of the HBSM multipro- cessor is cache size. Inadequately sized caches increase bus utilization past saturation, thereby greatly increasing bus contention and queueing delay. For the 8 x 8 system studied in this chapter, the data cache hit ratio must be close to 99% in order to keep the potential efficiency around 90%. At 90% potential efficiency, the 64 processor multiprocessor could exhibit speed-up of 58 on a good parallel algorithm. Given a fixed size cache, a computer designer must decide on the geometry of the cache. Set associativity is an unambiguously good thing for performance. The expense (in transistors, or silicon area) limits the associativity of today’s caches to between 1 and 4. As shown in each trace-driven experiment, increasing the associativity of a cache increases the flexibility of the cache placement algorithm, so it can only help cache hit ratios. The boost in performance from using a 2-way associative cache over a direct-mapped cache is significant, while the additional benefit of a 4-way associative cache is less dramatic, particularly for large caches. The case for block size is less clear than for set size. Increasing the block size of a fixed size cache will decrease the number of blocks which can be held in the cache. The expense (in bus busy time) of transferring large blocks across the bus is mitigated by the positive effects of pre—loading words that are likely to be accessed in the near future. In addition, the model of the memory controller used for the simulations assumed that memory was interleaved so that the time to access a block was independent of the block size. Since the time required by the memory controller to satisfy a read request is composed of the memory access and the bus transfer time, larger blocks do not take proportionally longer to retrieve from memory. The phenomenon of memory pollution, where large blocks cause wasted space in the cache, militates against larger blocks. The many interrelated factors (miss access time, cache hit ratio, write notice traffic, dirty replacement traffic, etc) which are affected by block size result in performance which is not monotonically related to block size. 
Trace-driven simulations indicate that the optimum block size is four words. This value does not seem to be peculiar to the parallel merge sort being studied, since the global data (the list being sorted) consists of 8192 words, and there is nothing special about groups of four elements. Similar results have been found in other studies of bus-based multiprocessors [22].

The disparity between the trace-driven model and the probabilistic model shows the importance of trace-driven simulation. The probabilistic model uses an average cache hit ratio to represent the behavior of the cache. When the performance of the cache differs for different classes of memory this is an inadequate representation. Models of non-uniform memory access multiprocessors are vulnerable to this type of error, since memory access costs depend on the physical location of the memory block.

Chapter 8

Conclusions and Future Research

The goal of this work was to design and evaluate the memory subsystem for a hierarchical bus shared memory multiprocessor. The existing infrastructure of computer engineering knowledge of bus interconnects is considerable. This fact, coupled with the ability to easily expand bus-based systems in small increments, and compatibility with the shared memory programming paradigm, makes bus-based multiprocessors an important commercial computer architecture. Extending the scalability of bus-based multiprocessors past the limits of a single bus will be an important step in the eventual ascendancy of parallel processing.

The first task in the study of the hierarchical bus architecture was to broadly partition the responsibilities of the memory subsystem between three controllers, the CC, CMC, and CCC. Once the primary role of the controllers was decided upon, a protocol to maintain cache and memory coherence was developed. Detailed modeling and trace-driven simulation of the protocol and controllers was performed to assess their feasibility and correctness. The final step was to evaluate the performance of the HBSM multiprocessor in order to judge the scalability of the hierarchical bus architecture. This chapter outlines the results and conclusions of this study, and considers future work to extend some of the tools developed and improve the performance of the architecture.

8.1 HBSM Multiprocessor Architecture

The primary bottleneck which limits the ultimate scalability of bus-based multiprocessors is interconnection bandwidth. Cache memory has traditionally been used to decrease memory latency. In addition, caches in bus-based multiprocessors are necessary to decrease the demands placed on the bus by each processor. A two-level hierarchical structure with clusters of single bus multiprocessors connected via a global bus has the potential of increasing the overall size of a bus-based multiprocessor. Private caches for each processor and physically distributed memory are key elements of the HBSM multiprocessor architecture. The use of caches and physically distributed memory becomes complicated in a hierarchical bus system because of the cache coherence problem.

The COGI cache coherence protocol has been designed to efficiently maintain the coherence of the HBSM multiprocessor cache and memory modules. COGI is a write-back/write-update protocol on the cluster buses, and a simpler write-back/write-invalidate protocol on the global bus.
Important characteristics of earlier protocols have been included in COGI, such as the shared line, which allows dirty sharing of blocks on a cluster. In addition, several novel features such as data-less second level caches and support for distributed memory are central to the COGI protocol. The COGI protocol supports a single, coherent, shared address space for all processors in the system. The simulations of Chapters 7 and 7 indicate the ultimate scalability of the HBSM multiprocessor is nearly an order of magnitude greater than currently available single bus multiprocessors. A maximum configuration of 20 clusters of 8 processors per cluster would result in a multiprocessor With a raw power of nearly 4000 MIPS]. With large enough caches this machine could operate at 85% potential efficiency, so that perfect parallel applications could utilize 3400 MIPS of computational power. The limit of 8—10 processors per cluster in the HBSM multiprocessor is only about half that of the currently available machines from Encore and Sequent. In part this is due to the characteristics of the processors which were simulated. A 50MHz RISC microprocessor will demand much higher memory bandwidth than the processors used in existing commercial machines. Further studies on the effects of increasing bus bandwidth, particularly the benefits of unequal distribution of resources between the global bus and the cluster buses, would be interesting. The configuration of the HBSM multiprocessor and the allocation of bandwidth resources should depend strongly on the memory access patterns of the applications of interest. A general method to characterize the memory access patterns of a wide variety of scientific applications would be very useful. The HBSM multiprocessor, built from standard microprocessors, commodity mem- ories and conventional buses, would be an economical means of providing significant computational power. Simulation indicates that the performance of this architecture 1Based on a 50MHz RISC microprocessor being rated at 25 MIPS. 145 is heavily dependent on the level of access to global memory. Large applications run- ning on an 20 x 8 machine should have no more than 10% global memory accesses in order to avoid saturation of the global bus. In addition to single large scientific par- allel applications, transaction processing and time-sharing workloads would be other suitable uses for the HBSM multiprocessor. 8.1.1 Performance improvements To satisfy the requirements of sequential consistency, the COGI protocol requires that processors and controllers be held until a transaction is completely finished. For example, a processor is not released on a write to a shared block until the CC has made a write notice transaction on the cluster bus. If this write notice should require that an invalidation be broadcast on the global bus, then the writing processor, the CC and the cluster bus will all be busy until the global bus invalidate has been accomplished. Important speed improvements may be made by allowing a weaker consistency model. Removing the restriction of sequential consistency requires that more attention he paid to explicit synchronization points, but promises to reduce utilization of system buses. The issue of the semantics of memory operations in shared memory multiprocessors is still being hotly debated. Along with weaker consistency, using a split transaction protocol on the clus- ter buses would increase the number of processors that a single bus could support. 
As modeled, the cluster bus is held for the duration of a bus transaction. If this transaction involves memory, then the bus is actually idle, but unavailable to other processors, while the memory access is performed. A split transaction bus protocol allows for a bus master to initiate a memory access request then relinquish the bus so it may be used by other masters. When the memory access is finished, the mem- ory module initiates a response to pass the data to the requesting processor. Split transaction protocols require that memory modules be smart enough to initiate bus transactions. (rather than simply accepting reads and writes). Cache coherence is also significantly complicated by split transaction protocols, since transactions have two phases (initiation and completion) and may be in progress when other relevant transactions occur. Interleaving transactions in this manner increases the complexity of controllers, and must be weighed against the improvement in performance. For the HBSM multiprocessor and the parallel merge sort, the distribution of busy time of 146 the cluster bus described in Table 7.10 indicates a maximum potential savings of 30% on cluster bus utilization. In this case, a split transaction protocol would reduce the cluster bus utilization from 48% to 18%. Increased arbitration and transaction over- head would erode the benefits from a more sophisticated protocol. A careful study of the benefits gained versus the increased complexity of the controllers and cache coherence protocol would be interesting future work. Many of the complications that arose in the design of COGI and the controllers which implement it (CMC,CCC,CC) would be eliminated or mitigated by partition- ing the single shared address space. If processors were allowed to access only local cluster memory and global memory, then all but one of the deadlock situations which arose and were solved in the COGI controller models would be eliminated. This modification would mean that cluster memory could not be shared between clusters. Since the HBSM architecture is designed for applications which have distinctly clus- terable memory reference patterns, and since a large global memory would allow for intercluster sharing, this does not seem to be a debilitating restriction. An additional performance benefit which would result from the partitioning of memory is a decrease in global bus invalidation traffic. If CCC controllers were able to distinguish blocks which are cluster private from global blocks, then write notices to SharedUnmod cluster blocks would not require a global bus invalidation. 8.1.2 System requirements The memory subsystem of a multiprocessor is only one part of a complex, interre- lated system of hardware and software. Future studies of the hierarchical bus mul- tiprocessor could involve other fundamental components of the system such as I/O, bus interconnection, processor design, hardware synchronization support, operating system, programming tools, and application programs. Bus design and I/O are closely related to memory subsystem. To be economical an implementation of the HBSM multiprocessor would most likely use one of the established bus standards, such as VME, Multibus, or F uturebus. Support for multi- ple processors and cache coherence makes Futurebus a natural choice for the HBSM multiprocessor. The placement of I/ O peripherals such as mass storage and network connections will be determined by the intended application of the multiprocessor. 
Disk storage would most likely be needed on each cluster, since the global bus would 147 be incapable of supporting the disk I/O requirements of multiple clusters. Network connections could either be distributed or centralized on the global bus, depending on whether the machine was intended for use in a multiple user time-sharing mode or in a single-user scientific mode. Hardware support of synchronization primitives seems necessary for efficient op- eration of scientific applications. A common method of providing hardware synchro- nization is to provide atomic instructions such as Test&Set. The atomicity of the instruction is often guaranteed by locking the bus for the duration of the instruction. With a hierarchical bus architecture it may be advisable to limit synchronization operations to local cluster or global memory blocks. The possibility of integrating synchronization primitives with the cache coherence protocol should be investigated. The design of the operating system for a hierarchical bus multiprocessor would be vitally important to the success of the machine. The issue of distributing the operating system kernel versus centralizing it would be the first major operating system design decision. Centralized kernels are easier to program, but can become a bottleneck; distributing the kernel on each processor would most likely be required. Task allocation could be done by maintaining a single task queue on each cluster. Processors would then take tasks from this queue when they were idle. Task migration represents a significant difficulty in clustered machines, due to the (assumed) affinity of a task for local memory. Necessary support for programming would include intelligent compilers and linker/ loaders, parallel symbolic debuggers, and performance evaluation tools. Knowledge of the clustered nature of the architecture would be important to the compiler to make de- cisions about the placement of variables in a processes virtual address space. The virtual memory management of the operating system would have to include the abil- ity to map a tasks address space (particularly the code segments) to the local cluster. Debuggers and performance evaluation tools for parallel processing computers are currently active research problems. Lack of a global clock and a single thread of control make parallel debuggers difficult to create. 8.2 High Level Timed Petri Net Simulator In order to debug the COGI cache coherence protocol and to gain some confidence in its correctness, detailed trace-driven simulation was performed with High Level 148 Timed Petri net models. This general purpose modeling tool can be used in many ar- eas, but is particularly adept at representing two important characteristics of parallel computer models, concurrency and synchronization. The HLTPN simulator trades analytical power for modeling expressiveness so that complex systems can be more readily described. A current limitation to the Petri net simulator is the lack of a graphical user interface. Such an interface, added over the existing model editor, would considerably ease the creation of HLTPN models. If the graphical interface were to support the run-time features of the simulator, the iterative process of de- sign / simulate/debug would be made faster and more direct. 8.3 Simulation Methodology Verification of the probabilistic model with a more accurate trace—driven model in- dicates some deficiencies in the probabilistic model. 
Non-uniform memory access multiprocessors should be carefully modeled due to potential differences in the cache behavior of different groups of memory. Extending the probabilistic model in Chapter 7 by separately characterizing the cache performance of local and global blocks resulted in much better agreement between the two models. Unfortunately, the use of detailed trace-driven models will likely remain limited by storage space and simulation time, so it is important to know the strengths and weaknesses of simpler, faster models. The size and detail of the trace-driven SILO model classifies it as a medium scale event scheduled simulation model. Further enhancement to this model would be difficult and would best be done in a process interaction paradigm. The size and complexity of the probabilistic SILO model leaves some room for extending it. For example, separately characterizing the cache behavior of local and global blocks increases the accuracy of the probabilistic model.

The tracing method developed to create architecture independent multiprocessor address traces proved adequate for trace-driven simulation. If the Tracer were extended to handle more general applications (currently, traced applications can only use barrier synchronization and static task allocation) the Tracer would certainly attract more interest. One possible strategy for this improvement is to introduce feedback from the multiprocessor simulator and to generate address traces in parallel with multiprocessor simulation. Feedback from the simulator could be used to direct the course of generally synchronized processes, and to make dynamic task allocation possible. Tracing is currently limited to data address traces; instruction addresses are not traced. The Tracer could be readily extended to capture instruction traces by recording the program counter at every instruction. Since a delay value is currently incremented for each instruction, the facility for doing this is already in place.

Bibliography

[1] S. Aggarwal and B. Gopinath. Foreword: special issue on tools for computer communication systems. IEEE Transactions on Software Engineering, 14(3), Mar. 1988.

[2] J. Archibald and J.-L. Baer. Cache coherence protocols: Evaluation using a multiprocessor simulation model. ACM Transactions on Computer Systems, 4(4):273-298, Nov. 1986.

[3] J. K. Archibald. A cache coherence approach for large multiprocessor systems. In Proceedings of the 1988 International Conference on Supercomputing, pages 337-345, July 1988.

[4] M. Azimi and C. Erickson. High level timed petri net simulator: A modeling and simulation tool for performance evaluation of concurrent systems. Technical Report MSU-ENGR-89-014, Michigan State University, Department of Electrical Engineering, Sept. 1989.

[5] M. Azimi and C. Erickson. A software approach to multiprocessor address trace generation. In Proceedings of the 14th International Computer Software and Applications Conference, pages 99-105, Oct. 1990.

[6] S. J. Baylor and B. D. Rathi. A study of memory reference behavior of engineering/scientific applications in parallel processing. In Proceedings of the 1989 International Conference on Parallel Processing, pages I:78-82, Aug. 1989.

[7] G. Bell. Multis: A new class of multiprocessor computers. Science, 228, Apr. 1985.

[8] R. R. Billig, S. S. Corbin, and R. L. Moore. A fast backplane cluster heralds a 1000-mips computer. Electronic Design, pages 81-86, July 1987.

[9] J. Billington, G. Wheeler, and M. Wilbur-Ham. Protean: A high-level petri net tool for the specification and verification of communication protocols. IEEE Transactions on Software Engineering, 14(3), Mar. 1988.

[10] L. M. Censier and P. Feautrier. A new solution to coherence problems in multicache systems. IEEE Transactions on Computers, 27(12), Dec. 1978.
Protearr: A high-level petri net tool for the specification and verification of communication protocols. IEEE Transactions on Software Engineering, 14(3), Mar. 1988. [10] L. M. Censier and P. Feautrier. A new solution to coherence problems in multi- cache systems. IEEE Transactions on Computers, 27(12), Dec. 1978. 150 151 [11] D. Chaiken, C. Fields, K. Kurihara, and A. Agarwal. Directory-based cache coherence in large-scale multiprocessors. IEEE Computer, pages 49—57, June 1990. [12] H. Cheong and A. V. Veidenbaum. Compiler-directed cache management in multiprocessors. IEEE Computer, pages 39—47, June 1990. [13] G. Chiola. A software package for the analysis of generalized stochastic petri net models. In International Workshop on Timed Petri Nets, 1985. [14] G. Chiola. GreatSPN users’ manual version 1.3. Technical report, Computer Science Department, Universita di Torino, Sept. 1987. [15] I. Computer. Supercomputer hardware. IEEE Computer, 22(11):63-68, Nov. 1989. [16] A. Cumani. Esp - a package for the evaluation of stochastic petri nets with phase-type distributed transition times. In International Workshop on Timed Petri Nets, Torino, Italy, 1985. [17] H. Davis, S. Goldschmidt, and J. Hennessy. Tango: A software simulation and tracing system. Processor Tracing Methodologies WorkshopInternational Sym- posium on Computer Architecture, 1990. [18] A. L. DeCegama. The Technology of Parallel Processing: Volume 1. Prentice Hall, Englewood Cliffs, New Jersey, 1989. [19] M. Dubois, F. A. Briggs, I. Patil, and M. Balakrishnan. Trace-driven simula- tions of parallel and distributed algorithms in multiprocessors. In Proceedings of the 1986 International Conference on Parallel Processing, pages 909 — 916, St. Charles, IL, 1986. [20] J. Dugan, G. Ciardo, A. Bobbie, and K. Trivedi. The design of a unified package for the solution of stochastic petri net models. In International Workshop on Timed Petri Nets, Torino, Italy, 1985. [21] S. J. Eggers and R. H. Katz. A characterization of sharing in parallel programs and its application to coherency protocol evaluation. 111 Proceedings of the 15th Annual International Symposium on Computer Architecture. Computer Archi- tecture News, May 1988. [22] S. J. Eggers and R. H. Katz. The effect of sharing on the cache and bus perfor- mance of parallel programs. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, pages 257—270, 1989. [23] J. Evans. Structures of Discrete Event Simulation. Ellis Hor‘wood Limited, 1988. 152 [24] R. S. Francis and I. D. Mathieson. A benchmark parallel sort for shared memory multiprocessors. IEEE Computer, 37(12):1619 — 1626, Dec. 1988. [25] S. J. Frank. Tightly coupled multiprocessor system speeds memory-access times. Electronics, January 12 1984. [26] J. Genrich and K. Lautenbach. System modeling with high-level petri nets. In Theoretical Computer Science, volume 13, pages 109—136. North-Holland, 1981. [27] J. R. Goodman. Using cache memories to reduce processor-memory traffic. In Proceedings of the 10th International Symposium on Computer Architecture, Trondheim, Norway, June 1983. [28] J. R. Goodman and P. J. Woest. The Wisconsin multicube: A new large-scale cache-coherent multiprocessor. In Proceedings of the 15th International Sympo- sium on Computer Architecture, pages 422—431, 1988. [29] S. Haridi and E. Hagersten. The cache coherence protocoal of the data diffusion machine. In Proceedings of PARLE. Spring-Ver'lag, 1989. [30] J. L. Hennessy and D. A. 
Patterson. Computer Architecture A Quantitative Approach. Morgan Kaufman Publishers, Inc., San Mateo, California, 1990. [31] M. Holliday and M. Vernon. A generalized timed petri net model for performance analysis. IEEE Transactions on Software Engineering, 13(12), Dec. 1987. [32] K. Jensen. Coloured petri nets and the invariant-method. In Theoretical Com- puter Science, volume 14, pages 317—336. North-Holland, 1981. [33] K. Jensen. High-level petri nets. In Applications and Theory of Petri Nets. Springer-Verlag, Berlin, 1983. [34] M. Kumar and K. So. A study of memory reference behavior of engineer- ing/scientific applications in parallel processing. In Proceedings of the 1989 In- ternational Conference on Parallel Processing, pages I:68—72, Aug. 1989. [35] E. D. Lazowska, J. Zahorjan, G. S. Graham, and K. C. Sevcik. Quantitative Sys- tem Performance: Computer System Analysis Using Queueing Network Models. Prentice-Hall, Englewood Cliffs, New Jersey, 1984. [36] M. MacDougall. Simulating Computer Systems. The MIT Press, 1987. [37] S. Manoharan. SILO: An event-based simulation platform. University of Edin- burgh, 1990. [38] M. Marsan, G. Balbo, A. Bobbie, G. Chiola, G. Conte, and A. Cumani. On petri nets with stochastic timing. In International Workshop on Timed Petri Nets, Torino, Italy, 1985. 153 [39] E. M. McCreight. The dragon computer system: An early overview. In NATO Advanced Study Institute on Microarchitecture of VLSI Computers, pages 83- 101. NATO Scientific Affairs Division, Urbino, Italy, 1985. [40] Meta Software Corporation. Design / cpn: A tool package supporting the use of colored petri nets. Internal publication, 1988. [41] M. K. Molloy. Performance analysis using stochastic petri nets. IEEE Transac- tions on Computers, 31(9), 1982. [42] J. Peterson. Petri Net Theory and the Modeling of Systems. Prentice-Hall, Englewood Cliffs, NJ, 1981. [43] A. M. Reiser and A. S. S. Lavenberg. Mean value analysis of closed multichain queueing networks. Journal of the ACM, pages 313—322, Apr. 1980. [44] W. Reisig. Petri nets with individual tokens. In Applications and Theory of Petri Nets. Springer-Verlag, Berlin, 1983. [45] C. H. Sauer and K. M. Chandy. Computer Systems Performance Modeling. Prentice Hall, Englewood Cliffs, New Jersey, 1981. [46] C. Scheurich and M. Dubois. Correct memory operation of cache-based multi- processors. In Proceedings of the 14th International Symposium on Computer Architecture, pages 234—243, June 1987. [47] R. L. Sites and A. Agarwal. Multiprocessor cache analysis using atum. In Pro- ceedings of the 15"1 Annual International Symposium on Computer Architecture, pages 186 — 195, May 1988. [48] K. So, F. Arema, D. A. George, and V. A. Norton. Psimul - a system for parallel simulation of parallel systems. In J. L. Martin, editor, Performance Evaluation of Supercomputers, pages 187 — 214. Elsevier Science Publishers B.V., 1988. [49] I. C. Society. Futurebus+ Draft 8.2 P8961 R/P8.2. IEEE Computer Society Press, Feb. 1990. [50] P. Stenstrém. A cache consistency protocol for multiprocessors with multistage networks. In Proceedings of the 16th International Symposium on Computer Architecture, pages 407—415, 1989. [51] C. B. Stunkel and W. K. Fuchs. Trapeds: Producing traces for multicomputers via execution driven simulation. Performance Evaluation Review, 17(1):70 — 78, May 1989. [52] C. B. Stunkel, B. Janssens, and W. K. Fuchs. Address tracing for parallel ma- chines. IEEE Computer, 24(1):31-—38, Jan. 1991. 154 [53] P. Sweazey and A. J. Smith. 
A class of compatible cache consistency protocols and their support by the IEEE Futurebus. In Proceedings of the 13th International Symposium on Computer Architecture, pages 414-423, June 1986.

[54] C. K. Tang. Cache system design in the tightly coupled multiprocessor system. Proceedings of the AFIPS National Computer Conference, 45, 1976.

[55] P. J. Weinberger. Cheap dynamic instruction counting. Bell System Technical Journal, 63(8):1815-1826, Oct. 1984.

[56] A. Wilson. Hierarchical cache/bus architecture for shared memory multiprocessors. In Proceedings of the 14th International Symposium on Computer Architecture, pages 244-252, June 1987.

[57] A. Zenie. Colored stochastic petri nets. In International Workshop on Timed Petri Nets, Torino, Italy, 1985.