.177: i
: usq. .

. :11
5.5... 2. «.31.:

3, . a ...,:..... , SS.

 

 

. .
.54 _. a. r.
, .. ‘ , , 3 . i
.5. I?! u. .... an. 3“,... 5...,
ii a .. x . .

L.

 

I

 

. _ :.

i, .3: -03..
3:. . .. x!
,5, i «.1: g ,

a
1:“

 

12.33:: {4:13.13... I.

3275:1353 NEE-3.... II

111.:7‘aznllx 5131x111

)1..ii:.€..5..l31:x: :1
1.1.1...- (I!

G
.1

 

l‘r.x’

'9 .9

- l. -

3.1 I

{3,1311
1: Ii}

. ’1 4 uvll‘Ivl‘".-‘3If3!1
:zsyzu. .. :3 xnnhlalél [aaiislv
:II‘IXK 1 1‘1)?! . . . .agerVc-lg)nrn
>anrJ
..7 . ,
5.33.. A. .:_, V ‘ . .
1:~L7l2112.:. , , ,
. ...::)‘:: :9 .v
v:r.«::
5.:11x.
.:;=::v:
I.....;x..~.v..\:4......x.l:q
In:- ::axr r: I
..1x

 

 

 

  

 

 

lav .3431]... )7: I:

l... a)!l.l.¢...)ti):: . .1 xx

1 33:: 2. E... .3 .Ir\:nnri..ls.r1 r: I I
3.3.1.233) I

I .11..XII:5)!II..‘I)xs.x5
p212K...»:Iaa»-Aax:)):?!‘n}:1!. V
L -. xlailah\| n....l;..r9 11‘s..) )|9\1v$.la
32:11:23.:1 1.1:iva.‘l litilnhllarhlfa
z alix‘lxr3173ol ‘

.an

 

..\ :1!
{.1135}: v3.11;
1...: .la.........{ la:

)3 :1.
I): I...
‘tnfxrsnx. .3 I...

.3 .x §..sl£.)ar! «\ar. lvzyxx.

x1 .31 73;... .113.

 

 

 

, gig;

 

In .17. .

 

3.33.11? .5 5:1
.. . . ‘...:‘.‘=:=>

 

gazll .31
., . )..)|h

 

E

llllllllllllllllllIllIlllllIlllllllllllllllllllllllllll

31293 01563 512

This is to certify that the

dissertation entitled

PARALLEL ASYNCHRcHouS PROTOCOLS
0N PARALLEL N417 DISTRIBUTED AVQTEMS

presented by

Euhm; Che}

has been accepted towards fulﬁllment
of the requirements for

 

 

Pk. P. degree in Cowgufbr &tence.

Major professor

We}? ($3 (731

MSU is an Afﬁrmative Action/Equal Opportunity Institution 0—12771

 

 

 

LIBRARY
University

 

Michigan State

 

 

PLACE iN RETURN BOX to remove this checkout from your record.

TO AVOID FINES return on or before date due.

DATE DUE DATE DUE DATE DUE

_ l

 

 

 

= J

 

J

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

J
.___|
—7

 

 

 

 

 

 

 

 

 

 

[ﬁt

 

 

 

 

 

 

MSU Is An Affirmative Action/Equal Opportunity institution
abimmma-m

 

 

PARALLEL ASYNCHRONOUS PROTOCOLS
ON PARALLEL AND DISTRIBUTED SYSTEMS

By

Eunmz' Choz'

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1997

.ABSTRACT

PARALLELASYNCHRONOUSPROTOCOLS
ON PARALLEL AND DISTRIBUTED SYSTEMS
By
_Eunnn Chat

Parallel discrete event simulation has been Widely used over broad applications,
especially for complex problems which require high performance computing. In this
thesis, we perform experimental, theoretical, and simulation studies on parallel dis—
crete event simulation of VLSI circuits. The primary goals are to study and to improve
the performance of parallel asynchronous protocols on various parallel and distributed
SyStems. Depending on the machine platform, the major focus of implementing par-
allel protocols changes and a different perspective is applied to each architectural
characteristic.

For parallel logic simulation, the signiﬁcant architectural limitations of the MIMD
Platform are a relatively short number of processors and interprocessor communica-
tion overhead. Thus, our research work related to the MIMD platform concentrates
on COmputaticm granularity, a factor that can be used to control the ratio of the

computation amount to communication overhead in parallel processmg. We analyze

 

¥—

 

the effects of computation granularity on the performance of an optimistic proto-
col, and present an analytic formula to predict the total execution time in terms
of computation granularity. In order to understand. various aspects of computation
granularity, we perform experiments on the IBM SP2, and interpret the performance
effects of computation granularity in the aspects of execution activity, rollback sit-
uation, communication overhead, and simulation progress. In addition, we propose
efficient event scheduling schemes for the optimistic protocol. These schemes can bal-
ance the progress of logical processes, so that unnecessary computations and storage
overheads can be reduced.

The SIMD platform is better suited to the parallel simulation of small logic circuits
due to its massive processing elements and the characteristics of faster interprocessor
communication and barrier synchronization. On the SIMD platform, we investigate
the effects of the characteristics of major machine operations on the performance of
synchronous, conservative, and optimistic protocols. The experimental features are
compared on several different machine architectures and the performance results are
studied. We also study the effects of the machine-independent characteristics which
depend on parallel protocols and application problems.

Through this thesis research, parallel asynchronous protocols are implemented
and studied on various architecture platforms of distributed-memory systems: the
SP2 as a parallel multicomputer system, a cluster of DEC Alpha workstations as a
distributed system, and the MasPar MP-l/MP-Q as a SIMD system. According to
the characteristics of protocols and machine architectures, different aspectual studies

are deliberated.

© Copyright 1997 by Eunmi Choi
All Rights Reserved

 

To The Living God, My Parents, and My Husband

 

ACKNOWLEDGMENTS

Praise the Lord who lives and has guided us whoever trusts in the name of our
God! Through my long hardship and sorrow, the living God has provided all things
to me abundantly. There is nothing that I can do but the thanksgivings and praises
to Him!

I thank all of my committee members for their efforts, time, discussions, and chal—
lenging guidances through many meetings; Dr. Moon Jung Chung as the committee

Chair and advisor, Dr. Matt Mutka for his kindness and fair guidances, Dr. Herman
Hughes for providing my studying place in the HSNP lab, Dr. James Harman for his
Wisdom and insight, and the late Dr. Carl Page for his warm care. Special thanks I
Want to give to Dr. Anil Jain, who is the Department Chairperson and an outstand—
ing researcher, for his help, wisdom, and kindness when I had a hard time. Also, I
thank Dr. Lionel Ni for his supporting the facility of the ACS lab and his kindness
and insight, and Dr. Richard Enbody for his nice guidance as a professor and boss
When I was a student and a TA, and Dr. Donald Weinshank for guidances when I was
a novice TA. In particular, I give many thanks to Dr. Rawle Hollingsworth in Bio—

chemistry department for his kindness, understanding, cares, and ﬁnancial supports

vi

‘

 

NI/ 1

‘ , , __.—______—

vii
especially during my difficulty at the end of PhD. program although I didn’t deserve
to receive his many favors.

I give many thanks to my parents, Mr. Yo—Sup Choi and Mrs. Sun—Ok Chai, for

their deep and steadfast love over my life, prayers, and supports. Without their love,
I cannot be the person who I am. I respect them with all my heart and give all this
honor to them. Also I give thanks to my elder brothers Mr. Yong Nam Choi and
Mr. Gyu Nam Choi and sisters—in—law Mrs. Yun Hee Cho and Mrs. Bong Rim Choi
for their encouragements and love. I give many thanks to bothers and sisters in New
Hope Baptist Church; Pastor Young Ho Cho and Mrs. Hyun Sim Cho for their love,
sacrifices, and prayers, Dr. Sukjung Chang and Mrs. Boduk Chang for being the best
living example to me and their prayers, Dr. Hakjin Kim and Mrs. ija Kim for their
cares, my best friend Miss Kyungsook Lee for her being with me, cares, and prayers
Whenever I was hurt or lonely during my husband’s absence, my friend Ms. J isoo Hong
f01" her prayers and our good friendship, Mr. Henry Kim and Mr. Eric Chang and
Mr. Thomas Kim for their prayers and ministries of the 2nd generation of Korean
American, my elder friend Dr. Sea Hwan Choi for his abundant love, cares, and
Prayers, Mr. Young-Bong Kim and Mrs. Young-Hee Kim for their unceased prayers
and cares, my friend Miss Barbara Czerny for her beautiful heart and our sharing in
the Lord, and all His other people who cared, helped, and prayed for me.

There is the Great and Wonderful Gift in my life from God. He is my husband,
Dr. Dugki Min who has the warmest, most beautiful, and purest heart in the world.
EVery single moment my husband has been with me, giving unfailing love, encour—

8igements, and comforts. Without him, I could not have tasted the beauty of life and

 

viii
the love of God. He is my happiness. He is my forever friend, a fellow Christian in
one faith, the father—to—be of our children, the head of our family, and my husband. I

want to give thanks to him always while our living until God calls us to His Heaven.

This work was partially supported by DoD HPCMP Program GEN—3, Contract
No: F33615-96—C—1913. The experiments on the SP2 were performed by using the
machine at the Maui High Performance Computing Center. The MP—l/MP—2 of the
Scalable Computing Laboratory of the Department of Energy’s Ames Laboratory are

used for the experiments of the MasPar.

 

 

‘\ _[

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction

1.1 Parallel Synchronization Protocols of PDES ...............
1.1.1 Synchronous Protocol ..........................
1.1.2 Asynchronous Protocol ..........................
1.2 PDES on the MIMD Platform .......................
1.2.1 Computation Granularity ........................
1.2.2 Event Scheduling .............................
1.3 PDES on the SIMD Platform .......................
1.4 Thesis Outline ...............................

2 Research Issues and Related Work

2.1 Research Issues of Optimistic Protocol ..................
2.1.1 Data Structures for Event Queues ....................
2.1.2 GVT Computation ............................
21.3 Cancellation Mechanism for Rollback Handling ............
2.1.4 Fossil Collection .............................
2-2 Other Synchronization Protocols .....................
22.1 Variations of Time Warp .........................
22.2 Other Conservative Protocols ......................
2.3 Considerations on Parallel Machine Architecture .............
2.3.1 Existing Parallel Computer Platforms .................
2.3.2 Number of Processors ..........................
2-3.3 Interprocessor Communication Overhead ................
2-3.4 Local Memory Space on a Processor ..................
2-4 Parallel Logic Simulation ..........................
2-5 Partitioning Problem ............................

3 Analytic Prediction of the Effects of Computation Granularity

3-1 The Model ..................................
3-2 Analysis ...................................
3-2.1 Execution Time per Simulation Cycle .................
3-2-2 Number of Simulation Cycles ......................
3~3 Selecting the Target Machine Architecture ................
3.3.1 Characteristics of the Target Machines .................

¥—_—_—__

x
7 3.3.2 Comparison of Experimental Results on the Target Machines ..... 63
3.4 The Results .................................. 66
3.4.1 Workload Conservation Property ..................... 66
3.4.2 Comparison of Experimental and Analytical Results .......... 67
3.4.3 Effects of Computation Granularity on Distributed Systems ...... 73
3.5 Summary ................................... 75
4 Experimental Study of the Effects of Computation Granularity on
the Performance of an Optimistic Protocol on the IBM SP2 77
4.1 The Simulation Model ............................ 79
4.1.1 Logical Process Allocation ......................... 79
4.1.2 Simulation Cycle .............................. 80
4.1.3 Event Scheduling .............................. 81
4.1.4 Data Structure of Event Queues ...................... 82
4.1.5 GVT Computation ............................. 84
4.1.6 Rollback Handling ............................. 85
4.2 Experimental Results ............................. 86
4.2.1 The Environment .............................. 87
4.2.2 Total Execution Time ........................... 91
4.2.3 Analysis of Experimental Results ..................... 94
4.2.4 Speedup ................................... 117
4.2.5 The Effect of a Circuit Size ........................ 120
4.3 Performance Comparison Between the Optimistic Protocol and the Syn—
chronous Protocol ............................. 122
4.3.1 Timing Granularity ............................. 125
4.4 Summary ................................... 127
5 Event Scheduling Schemes for an Optimistic Protocol on Distributed
Systems 128
5.1 Event Scheduling ............................... 130
5-1.1 Balancing Progress by Executing Chance (BPEC) ............ 131
5.1.2 Balancing Progress by Virtual Time Window (BPVTW) ........ 133
5-2 Experimental Results ............................. 135
5-2.1 The Implementation Model ........................ 136
5-2.2 Performance Measurement with the BPEC Scheme ........... 136
5.2.3 Performance Measurement with the BPVTW Scheme .......... 142
5-3 Summary ................................... 144
6 Comparison and Analysis of Parallel Asynchronous Protocols on
Massively Parallel SIMD Architectures 146
6-1 Event Queue Manipulation .......................... 148
6-2 Related Instructions and their Performances on SIMD machines ..... 151
6-3 Comparisons of Massively Parallel SIMD Machines ............ 152
6-3.1 Queue Space ................................ 153
6.3.2 Parallel Virtuality ............................. 153

¥

 

xi
6.3.3 Communication Scheme .......................... 154
6.3.4 Processor Capability ............................ 155
6.4 Experimental Results ............................. 156

6.5 Summary ................................... 159

7 A Study of Machine-Independent Characteristics of Parallel Logic

Simulation 161
7.1 Critical Path Length ............................. 163
7.1.1 The CG Value ................................ 165
7.1.2 Effect of Critical Path Length ....................... 167
7.2 Parallelism of Simulation ........................... 172
7.2.1 A Reﬁned Conservative Protocol ..................... 176
7.3 MTW Mechanism for Memory Management ................ 182
7.4 Summary ................................... 184
8 Conclusion 186
BIBLIOGRAPHY 192

 

LIST OF TABLES

Performances of Communication and Array Access Operations on the MP—

2 and Alphas ............................... 61
Experimental Results on the MP-2 and a Cluster of Alpha Workstations 65
Experimental Results on a Cluster of Alpha Workstations ........ 75
Comparisons of Communication Measurements .............. 106

Comparison Between the Sequential Program and the Optimistic Protocol 119
Comparison Between the Optimistic Protocol and the Synchronous Protoc01122

 

Performance of Instructions ......................... 151
Comparison of Congestion Effects ...................... 154
Experimental Results (Speed ratio 2 gift?) ............... 157
Fractions of Execution Time ......................... 159

Critical Path Lengths and the Numbers of Gates on the Benchmark Circuits166

Performance of Parallel Protocols on the MP—l ............... 168

Values of CG ................................. 169

Average Parallelism .............................. 174

Performance Comparisons on the Reﬁned Conservative Protocol ..... 181

MTW Sizes Applied to the Experiments .................. 183
xii

 

' .
.2
i ’_f

LIST or FIGURES

2.1 A Conﬁguration of Processes in a Logic Simulation ............. 42
3.1 A Layout of Logical Processes ........................ 49
3.2 (a) The Workload Varying It When P = 5 (b) The Workload Varying P
When k = 100 .............................. 68
3.3 The E (m) vs. k Varying from 0 to the LP Ratio .............. 70
3.4 Comparisons of (a) Sf and (b) Tk ...................... 72
3.5 Experimental Total Execution Times with 813207 Circuit ......... 73
3.6 Experimental Total Execution Times with 838417 Circuit ......... 74
4.1 Total Execution Time ............................ 92
4.2 Total Execution Time (Close—up Between 1 and 300) ........... 93
4.3 Activity Rate ................................. 95
4.4 Actual Activity Rate ............................. 98
4.5 Rollback Frequency (Close—up Between 1 and 300) ............. 99
4.6 Rollback Frequency .............................. 99
4.7 Rollback Rate ................................. 102
4.8 Communication Time Varying the Number of PBS ............. 105
4.9 Computation Fraction ............................ 107
4.10 Computation Fraction (Close—up Between 1 and 300) ........... 108
4.1 1 Computation Fraction Varying the Number of PEs ............ 110
4.12 Comparison Between Total Execution Time and Computation Time . . . 111
4.13 Number Executed Events per Cycle in a processor ............. 113
4.14 Number of Simulation Cycles ........................ 114
4.15 GVT Jump .................................. 116
4.16 Speedup .................................... 119
4.17 Effect of a Circuit Size ............................ 121
5-1 Total Execution Time with the BPEC Scheme ............... 137
5-2 Number of Rollbacks with the BPEC Scheme ............... 138
5-3 Total Execution Time with the BPEC Scheme for S13207 Circuit . . . . 139
5-4 Effect of Computation Granularity on Total Execution Time by Varying
LP ratio .................................. 141
5-5 Total Execution Time with the BPVTW Scheme ............. 142
5-6 Average Number of Executed Events with the BPVTW Scheme ..... 144
6-1 A CBSQ Structure .............................. 149
xiii

‘____—_——_——

" ‘_‘__‘

xiv

/' 7.1 Total Execution Time vs. the Number of Input Event Vectors ...... 165
7.2 The Relationship between the Length of Critical Path and the Value of CG 171
7.3 Parallelism on the Conservative Protocol Excluding Null Messages (C1908

Circuit) .................................. 172
7.4 Parallelism on the Optimistic Protocol (C1908 Circuit) .......... 173
7.5 Parallelism on the Conservative Protocol Excluding Null Messages (C6288
Circuit) .................................. 175
7.6 Parallelism on the Conservative Protocol Excluding Null Messages (C7552
Circuit) .................................. 176
7.7 Parallelism on Conservative Protocol Including Null Messages (C1908 Cir—
' cuit) .................................... 178
f 7.8 The BTW Algorithm ............................. 180
\ 7.9 Average Parallelism of the Reﬁned Conservative Protocol ......... 181
; 7.10 Relationship between the Number of Simulation Cycles and the Number
of Input Event Vectors with the MTW Mechanism ........... 184
7 .1 1 Relationship between the Total Execution Time and the Number of Input

Event Vectors with the MTW Mechanism ............... 185

 

Chapter 1

Introduction

As multiprocessor systems become popular and simulation methodologies become
more efficient, parallel simulation becomes one of the most widely used research tools
for high performance computing problems, such as the design of very large scale
integrated (VLSI) circuits.

Circuit simulation plays an important role in the design and veriﬁcation of VLSI
Circuits. Circuit simulators are used to verify circuit functionality and to obtain de—
tailed timing information before the expensive and time—consuming fabrication pro—
Cess takes place [9]. As the size of VLSI chips have grown, circuit simulations have
beCOme a critical bottleneck in the design of complex VLSI systems. The simulation
of VLSI systems is performed at different levels of abstraction, ranging from archi-
tectuI‘al, gate- and switch—level to circuit and device simulation [69]. Among several

levels of simulation, gate—level simulation is popular because it can be implemented
efIiQiently by using simulation methods and algorithms, and because the gate model is

i 11 _
dlrect correspondence to the Boolean equation representation of digital designs [9].

¥

 

Since logic simulation is an extremely important part of the design and veriﬁcation of
any digital system, it is necessary to investigate parallel algorithms in order to reduce
large computing times of logic simulation. Parallel algorithms of parallel discrete
event simulation are the appropriate solution for logic simulation.

As a simulation methodology on parallel and distributed systems, discrete event

simulation has received a considerable amount of attention due to its ability to achieve
a high degree of parallelism. The discrete event simulation [10] has been a modeling
approach in which the state of system is described by state variables and is changed
according to the events occurring at discrete points in simulation time. The simulation
is progressed by the changes of the system state due to the events in consideration of
timestamps of the events. The concurrency in event processing provides a possibility
to achieve a high degree of parallelism. Parallel discrete event simulation which
PI‘OVides the parallel processing of event handling on a number of processors and the
USed simulation protocols are the research topics in this thesis.

In Parallel Discrete Event Simulation (PDES) [37, 33], the major problem is syn—
ChrOnization, because the simulation system is composed of a set of logical processes
and the events in the simulation are processed asynchronously by logical processes.
In Order to coordinate the logical processes, a number of synchronization protocols
have been proposed for PDES on parallel and distributed systems. The major two
protOcols to handle the synchronization are synchronous and asynchronous protocols.

Thesh tl'ltl'l b‘ lbllkd
ync ronous pro oco manipu a es ogica processes y usmg a g o a c oc , an
Sirr“111ates the task by executing events in chronological order. The asynchronous pro-

to
Q01 handles logical processes by using local clocks, and allows logical processes to

¥_____

proceed in virtual time. The asynchronous protocol has the beneﬁt of concurrent
execution of events at different points in simulated time. To control synchronization
of event execution among logical processes, we categorize the parallel asynchronous
protocol into two schemes: conservative and optimistic. Various conservative and
optimistic protocols with innovative data structures and mechanisms have been pro-
posed [22, 41, 35, 37]. These protocols are used to guarantee the correctness of the
concurrent execution by advancing the parallel simulation asynchronously. These
protocols resolve the synchronization that is caused by the concurrent execution of
events at different points in simulated virtual time.

This thesis investigates several issues in employing parallel asynchronous protocol
of PDES on several machine platforms of parallel and distributed systems. Depending
on the machine platform, the major focus of implementing parallel protocols changes
and a different perspective has to be applied to each architectural characteristic. Par—
allel and distributed computer systems are classiﬁed into SIMD (Single Instruction
Stream, Multiple Data streams) and MIMD (Multiple Instruction streams, Multiple
Data streams) platforms. The SIMD machines have a single instruction stream over
Inultiple data streams and synchronize all the parallel execution units. The MIMD
rna'Chines are composed of two big conﬁgurations; shared—memory multiprocessors

and distributed-memory multicomputers [11]. Shared-memory multiprocessors have
a. Sirlgle address space, while multicomputers use distributed memories with multi-
Dle a«ddress spaces. Distributed—memory systems consist of multiple computers that
Communicate by passing messages through an interprocessor communication network.

T
he following sections describe several parallel protocol schemes in detail and intro—

¥

duce some important research issues of parallel discrete event Simulation on the MIMD

and SIMD machine platforms.

1.1 Parallel Synchronization Protocols of PDES

Parallel discrete event simulation is the parallel simulation of a discrete event—driven
program on parallel and distributed systems. To achieve the beneﬁt of high concur—
rency in parallel simulation, parallel protocols are developed so that a set of logical
processes computes and generates events of discrete timestamps and data in parallel.

The following presents the details of synchronous and two asynchronous protocols,

i.e., conservative and optimistic protocols.

1 - 1 . 1 Synchronous Protocol
SynChronous protocol is the most conservative scheme in PDES. The processing of
logical processes is determined according to the global clock, which is the simulation
time of the latest logical process. The local clock of a logical process is determined by
the Smallest unprocessed event in the logical process, say Local Virtual Time (LVT).
In Order to ﬁnd the global clock, the minimum global time is computed by collecting
all 1Ogical processes’ LVTs. It is called Global Virtual Time (GVT). Only if the LVT
of a. logical process is the same as the GVT, the logical process can be activated and
a'llc’VVed to process its unprocessed event.

The advantage of synchronous protocol is in the simplest feature in parallel pro—

Qe - . . . . .
SS111g. Since each logical process never receives the earlier timestamped events than

¥

its previously-processed events, the event processing pattern on a logical process is
the FIFO and the later arriving events do not change the former processed state.
In other words, each logical process keeps the local causality constraint [37] of event
processing. However, the drawback of synchronous protocol is the necessity of com-
puting GVT in the every step of processing events. When the synchronous protocol is
used on parallel and distributed systems, it is essential to use architecture platforms
which provide a fast global reduction over the processing elements, or necessary to

use a developed algorithm for computing GVT.

1 . 1 . 2 Asynchronous Protocol

Parallel asynchronous protocol mainly handles logical processes by using local clocks,
and allows logical processes to proceed in virtual time so as to have the beneﬁt
0f Concurrent execution of events at different points in simulated time. To control
SyIIChronization of event execution among logical processes, we categorize the parallel

asynchronous protocol into two schemes: conservative and optimistic.

CoIFIServative Protocol

CorlServative protocol processes events not only conservatively but also more con—
Currently than the synchronous protocol. Each logical process allows events to be
proc(Essed only if it is sure that no event with an earlier timestamp can arrive. In
Other words, when a logical process assures that there is no further incoming events
Whose timestamps are less than t1, it can process an event with timestamp t1. This

CC) . . .
r18Ervative property makes each logical process compute events In nondecreasmg

¥

timestamp order by keeping the local causality constraint.

The communication connection between any two logical processes is established
by a link. A logical process has its own clock which is updated by incoming events.
Each input port has a link clock, which is decided by the timestamp of the recently
arrived event in the port. Local clock is deﬁned as the minimum link clock at the
logical process. The local clock based on link clocks is the low bound of timestamp
that the next incoming event may have. By satisfying local causality constraints,
thus, there is no incoming event which has smaller timestamp than the local clock.
The logical processes, which have unprocessed events whose timestamp is less than

or equal to local clock, are able to active to process the events. Without waiting for
any delayed events from parent logical processes, a logical process is able to work
On safe processing and does not break the non-decreasing timestamp ordered event
processing and sending. Since arrived events in each input port are in non—decreasing
timestamp order, a FIFO list can be used as an event queue of each input port as is
in S.Ynchronous protocol.

Since each logical process sends an output event only if the condition of the local
canSality constraint is satisﬁed, a logical process may inﬁnitely wait for an incoming
event which never occur. It results in the deadlock situations. For example, when
ther e is a cycle in the path of connected logical processes, it produces an unavoidable
dea“dlock situation. Chandy and Misra [22] proposed null message to avoid deadlock

Sitllatﬁons. A null message with timestamp t2 sent from logical process 2' to logical
pr()Cess j represents a contract that logical process i will not send logical process

j
a'l'ly event of timestamp smaller than t2. It provides the low bound of the next

¥n .

incoming events on the link between the two logical processes, and makes logical
processes proceed with the new link clock instead of an inﬁnite waiting. The null
messages are used to facilitate event processing as well as to avoid deadlocks. When
a logical process does not have any sending event even if the local clock is changed,
the logical process sends a null message with the timestamp of local clock, so that
the child logical process could work on the next possible processing as much as it
can. Local clocking and null messages, thus, increase the parallelism of processing in

conservative protocol by accelerating parallel processing.

Optimistic Protocol

The optimistic protocol allows logical processes to proceed without considering the
local causality constraint, and detects and recovers whenever any causality error is
detected. The most well—known optimistic protocol is the Time Warp protocol by
Jefferson [41, 43]. In Time Warp, the virtual time is used to deﬁne synchronization
aInong logical processes by coordinating their simulation time. Logical processes
execute their current queued events in increasing order by timestamp whenever there
is an unprocessed event in one of their event queues without waiting for any other
81334118 or constraint. Since this optimistic characteristic makes each logical process
propagate immature events, it can result in a situation where a logical process receives
an event with an earlier timestamp than the previously—processed events. Thus, the

prot()col has to recover the earlier system state from the erroneously—advanced state
when an event arrives whose timestamp is smaller than the current LVT. This is

(1
a'lled a rollback situation, and the event that causes a rollback situation is called a

¥

straggler.

To handle the rollback situation, the optimistic protocol has to use a cancellation
mechanism; correcting incorrectly generated events that were already sent and up-
dating the related system status. The rollback situation yields another major aspect
to be considered: a storage overhead problem. In order to cope with any unexpected

rollback situations, logical processes must keep all previously processed events. Since
the GVT is the smallest timestamp value that cannot be rolled back by a straggler,
fossil collection is performed to reuse the storage of event queues by removing pro—
cessed events with smaller timestamps than GVT from the queue so that a process can

keep its memory usage to a minimum. More details about Time Warp are described

in the rest of this thesis.

1 -2 PDES on the MIMD Platform

In the MIMD machine platform, the systems employed in this thesis are distributed—
ITlemory multicomputers. Distributed-memory systems consist of multiple computers
interconnected by an interprocessor communication network and communicate by
passing messages through the interconnection network. Not only on the distributed
Systems, but also on the shared—memory systems, many researchers have studied of
P DES [31, 36, 66, 47]. Efficient simulation of large application problems such as VLSI
CirCuits on distributed—memory systems requires a large number of processors and is
1iIllited by the scalability of existing architectures. Due to this limitation on MIMD

InEtchines, a physical processor must be assigned more than one logical process. The

number of logical processes allocated per processor is determined by the LP ratio —
the number of logical processes to the number of processors.

The major consideration on a distributed—memory system is interprocessor com—

munication overhead via message passing. Instead of sharing a memory over all pro—
cessors, a distributed-memory system communicates between processors by means of
sending and receiving messages. In order to minimize the communication overhead
through the interconnection network, a processor has to control how many events
are executed and decide how often the processor communicates with other proces-
sors- The computation granularity, which is the computation amount of processing
events between any two consecutive communication points, is proposed as a solution
to resolve the issues.

Event scheduling is another consideration of parallel discrete event simulation

On the MIMD machine platform. Under the situation that the LP ratio is larger
than 1, many logical processes have to be handled by a processor. Between any two
Communication points, thus, each processor decides which events are ﬁrst executed
a'Inong many unprocessed events from the allocated logical processes. In the following
Subsections, the research issues of computation granularity and event scheduling are

d18cussed.

1 . 2.1 Computation Granularity

Oomziutation granularity is the computation amount on a processor between any two

Consecutive interprocessor communication points [23]. In reference to the quantity of

¥—_—_—¥

10

the computation granularity, we use the terms computation grain size and computa-
tion granularity interchangeably. The computation granularity is mainly used for the
optimistic protocol in this thesis with several aspects of usage. Recall that due to the
limited number of processors in parallel and distributed systems, we have to assign
more than one logical process to a processor. After executing several events, proces-
sors collect the generated events from the allocated logical processes into messages,
each of which is sent to the destination processor, and send the messages through the

communication network.

In order to clarify the deﬁnition of computation granularity, it may be helpful
to compare computation granularity to event granularity. Event granularity is de-
tGBI‘IIIined by the computation amount of a single executed event, which is called the
event grain size. Depending on the event grain size, event granularity can be cate-
gorized into ﬁne, medium, and coarse grain. The size of an event is decided by the
nLlInber of instructions required to compute the event on a processor. For example,

manipulating an event on a VLSI circuit usually requires around a hundred machine
iIIStructions, thus the event granularity is small. If a processor only executes an event
between two consecutive communication points, the computation granularity is the
event granularity. In contrast, if a processor executes k events between any con—
SeCutive communication points, then the computation granularity is approximately
1" times larger than the event grain size. Some researchers have studied the effect
of event granularity on the performance of parallel simulations [73, 39]. In [19], an
experimental study has investigated the effects of communication overhead on the

performance of Time Warp, considering two types of simulation applications; one

9—;

11

with a large event granularity and the other with a small event granularity. In this
study, the computation grain size was the event grain size by executing one event
before communication. According to their performance results, the communication
latency in distributed computing environments can signiﬁcantly degrade the efficiency
of Time Warp in the application problems which contain a large number of logical
processes with a small event granularity. In contrast, for applications having large
grained events, the communication latency appears to have little effect on the per—
formance of Time Warp. Most implementations of Time Warp on shared memory
systems compute an event and then communicate with other processors. Since the
memory access time to event queues is almost the same as the event propagation
time to other processors on a shared memory system, the ‘communication-per—event’
algorithm is common. As an exception, Das et al. [31] discuss the possibility of im—
DI‘OVing performance by processing batches of k events, where k is the batch size.
They claim that the batch processing approach reduces queue management overhead.
In distributed—memory systems, the communication latency is long, and therefore the

Computation granularity is an important issue to be studied.

In addition to the control of interprocessor communication between processors,
Computation granularity contributes to the control of the degree of optimism of event
processing on the optimistic protocol. Adjusting the degree of optimism is the major
SOlution to achieve better performance on Time Warp. High optimism could cause the
progress of some logical processes to be too far advanced compared to other logical
processes so that frequent rollback situations occur. In order to keep the proper degree

of Optimism, each processor should not execute all possible logical processes although

12

the processor has more logical processes that contain unprocessed events. The degree
of optimism can be controlled by setting the maximum bound of the number of
executed logical processes before performing interprocessor communication. As for
the controlling purpose, we use a parameter called the Maximum Computation Grain
Size (MCGS). In view of the degree of optimism, the computation grain size should
be small enough so as not to cause too far advanced logical processes and generate

too many rollback situations.

If the computation grain size is too small, however, the optimistic protocol be—
haves like a conservative protocol or even like a synchronous protocol. In this case,
Only a few events are processed out of many logical processes between consecutive
Communications. It makes the progress of the whole simulation slow, and most ex—
eCution time is spent in communication. Thus, the computation grain size should
be large enough to prevent the communication overhead from dominating simulation
performances. By using a large MCGS value, we can control the amount of computa-

tion enough to avoid frequent communication situations. Maintaining the adequate
Computation amount between communication points also decides the overall progress
to complete the simulation. A small portion of computation requires a long series of
CYCIes of communication and computation, which may yield more frequency of com—
Inllnication until the simulation is complete. Therefore, the computation granularity

is the parameter that should be tuned properly to achieve better performance.

13

1.2.2 Event Scheduling

An efficient event scheduling on the optimistic protocol results in an appropriate
execution order of events and reduces the unnecessary computations and storage
overheads. In this thesis, two factors are considered in order to design an efﬁcient
event scheduling strategy for Time Warp on the MIMD machine platform. The ﬁrst
factor is the amount of computation to be processed in a processor between two con-
secutive communication points. The computation granularity must be small enough
to prevent Time Warp from processing events over-optimistically, but large enough
to minimize the degrading effects of performance due to the communication over—
head. The second factor is the balancing progress of logical processes in a processor.
Within the range of the given computation granularity, logical processes assigned to
a processor progress at different speeds and this may cause more rollback situations.
Considering these two factors, we propose two event scheduling schemes: the Bal-
ancing Progress by Execution Chance (BPEC) scheme and the Balancing Progress by
Virtual Time Window (BPVTW) scheme. The BPEC scheme controls the progress
of a process by limiting the number of chances of the process to schedule events, while
the BPVTW scheme controls the progress by using a virtual time window for each
process, which allows only events whose timestamps are within the interval of the

virtual time window to be executed.

 

 

 

14

1.3 PDES on the SIMD Platform

Parallel processing on the MIMD machines has advantages over the SIMD machines in
several aspects, such as general computational capabilities and asynchronous compu—
tation on each processor. However, one signiﬁcant drawback of the MIMD machines

is that they usually have a small number of processors, compared to the massively
parallel SIMD machine. The SIMD machines can achieve massive parallelism with
a. small-grain parallel processing elements (PES). Since the VLSI circuit simulation
requires a small—grain operation of a gate, and a large number of gates, the simula—
tion model is well suited for the massively parallel SIMD environment. The SIMD
platform also offers a faster barrier synchronization and a communication router that
can be performed by an instruction, which are crucial to several aspects of the major
Performance of parallel simulation. Since the massively parallel PE nodes are able to
SUStain the model that the LP ratio is 1, each processor is assigned a logical process
so that the SIMD machine platform will not have any serious problems for circuit
partitioning and load balancing.

However, the SIMD machine platform also has limitations: the number of pro—
ceSSOrs compared to the number of logical processes on a large application problem,
and the local memory space on a processor. To compute a large application problem
having hundreds of thousand of logical processes such as a VLSI circuit, the SIMD
Ina"Chimes are short of processors. To cope with the problem, several logical processes

may be assigned into a processor. The small local memory space, however, has the

li - .
mltation to carry many loglcal processes. Thus, only small circuits, which have a

‘n

 

 

15
smaller number of logical processes than the number of processors or have the almost
same number of logical processes, are suitable for parallel simulation on the SIMD
platform.

To study the parallel synchronization protocols on the SIMD machine, the ma—
jor features considered are the processor capability, the interprocessor communica—
tion scheme, and the global synchronization. By comparing computation times of
the related instructions, the performance of parallel synchronization protocols on a
SIMD machine can be examined. In addition, another factor that affects simula—
tion performance is the use of an appropriate data structure with consideration to

the architectural characteristics and the simulation model. Also, the improved algo-

rithms according to the machine architecture affect the performance results of parallel

Protocols.

1 -4 Thesis Outline

The remainder of this thesis is organized as follows. Chapter 2 presents the general
I“asearch issues and their related work by focusing on the important research topics of
PD ES. First, the research issues of parallel optimistic protocol and the related work
are dEscribed. Next, major features of existing machine architectures, which affect
the Simulation performance on PDES, are discussed: the number of processors, the
interprocessor communication via the interconnection network, and the local memory
Space per processor are the features considered. Finally, the application problem

0 . . . . . . . . . .
f t-1‘11s thes1s, the VLSI c1rcu1t system, Is 1ntroduced W1th characterlstlcs of logic

-+

 

 

16
simulation.

Chapter 3 presents an analytic model to study the effects of computation gran-
ularity on parallel and distributed systems. Analytical formulas are derived, based
on the model, to predict the performance metrics, such as the total number of sim-
ulation cycles and the total execution time. The formulas can be used to analyze

the performance as a function of computation granularity when each processor is as-
signed many logical processes and a barrier synchronization is used for computing
GVT. The predicted results, veriﬁed by the experimental results on a cluster of DEC
Alpha workstations, show that the computation granularity significantly affects the
simulation performance.

Chapter 4 performs a set of experiments on the IBM SP2 to study the effects of
computation granularity. We measure and interpret the performance effects of com—
putation granularity in perspectives of execution activity, rollback situation, com—
munication overhead, and simulation progress. According to the analysis, a large
Computation granularity increase the degree of optimism, resulting in frequent roll—
back situations, while a small computation granularity causes a high communication
overhead and a slow simulation progress.

Chapter 5 studies the issue of event scheduling strategy for the optimistic protocol
on parallel and distributed systems. Efﬁcient event scheduling can make an appro-

priate execution order of events and reduce unnecessary computations and storage
Overheads. In this chapter, we propose two efficient event scheduling schemes, called
the BPEC scheme and the BPVTW scheme, by considering computation granularity

a
ﬁg} the balancing progress of logical processes. The BPEC scheme restricts the num—

‘

 

 

17
ber of executed events per logical process between interprocessor communications,
while the BPVTW scheme does not allow an event to be executed if its timestamp is
far ahead of those on the other events in different processes.

In Chapter 6, as target machine architectures for logic simulation of small circuits,
two massively parallel SIMD machines are compared by their machine-dependent
characteristics. The employed parallel protocols of PDES are synchronous, conserva—
tive, and optimistic protocols. The objective of the comparison is to understand the
effects of machine characteristics of SIMD machines on the performance of parallel
Simulations. For the efficient implementation on SIMD machines, new data structures

0f eVent queues and the queue manipulation mechanism are proposed.

Chapter 7 presents the effects of machine—independent factors on performances of
PDES and discusses the relationship between the factors. The characteristics consid—
ered are the length of critical paths, the parallelism of simulation, data structures of
event queues and the related schemes, the number of processes, and the number of
inp ut event vectors. This study of the inter—relationship among machine—independent
factors helps to understand the performance characteristics of the given application

problem and the employed parallel protocols.

Finally, Chapter 8 summarizes the major contributions of this research.

 

 

Chapter 2

Research Issues and Related Work

Parallel discrete event simulation (PDES) consists of the simulation protocols, the
application problem, and the parallel or distributed machine system. The perfor-
mance of PDES is affected by the characteristics of the three components. This
Chapt er presents the general research issues and their related work by focusing on the
iInDOI‘tant research topics of PDES.

Among several parallel synchronization protocols, Time Warp is mainly focused
SinCe it gives many challenging research issues. The data structures for event queues,
the GVT computation, the cancellation mechanisms for rollback handling, and the
fossil collection are studied in this chapter. The variations of Time Warp and con-
SerViitive protocols that have been proposed by other researchers are also introduced.
Since this thesis studies the parallel protocols on various architecture platforms, the
maul or features on existing machine architectures, which mainly affect the performance

on 1DDES, are studied: the number of processors, the interprocessor communication

v‘ . .
1a the interconnect1on network, and the local memory space per processor are the

18

¥_——_—

 

19
features considered. In order to understand characteristics of the application problem,
the unique characteristics of logic simulation are described. Finally, the related work

on the partitioning problem in simulating VLSI circuits on parallel and distributed

systems is investigated.

2. 1 Research Issues of Optimistic Protocol

The most well—known optimistic protocol is Time Warp [41]. In Time Warp, when-
ever logical processes have unprocessed events, the logical processes execute events
1" egardless of causality constraints [37] of events. Since each logical process executes
and propagates immature events, the protocol has to recover the earlier system state
from the erroneously—advanced state when genuine events arrive. The recovery in—
VOIVeS undoing execution of immature events. The genuine events are executed and
propagated to other logical processes by correcting the erroneously—sent events.

Each logical process progresses by processing the received events. The current
progl‘ess of a logical process is indicated by the Local Virtual Time (LVT) of the logical
ply"()(3ess, which is determined by the minimum timestamp of unprocessed events in the
logical process. A rollback situation occurs when an event called a straggler arrives,
which has a timestamp smaller than the current LVT of a logical process. When a

Stra'ggler arrives, the logical process has to cancel wrong events, recover the earlier
System state from the erroneously-advanced state, and execute events from the point

of the straggler’s timestamp. To cope with rollback situations, each logical process

must store all events even after the events are executed and sent. Fossil collection is

‘____—__—

 

20
necessary to reuse the storage of event queues; when an event queue is full enough,
the logical process removes from the queue the processed events whose timestamps
are smaller than the Global Virtual Time (GVT), which is the minimum timestamp

of all LVTs.

2.1.1 Data Structures for Event Queues

In P DES, logical processes contain events that have been generated or received but not
yet executed, and are pending in the input event queues. Logical processes manage
the pending events and let the processors schedule and execute the events according
to the event scheduling scheme. The data structure implementation of event queues

has an important influence on the simulation performance.

G”013811 Event Queue

Global event queue is the common event queue which contains all unprocessed events
of 1c)gical processes. The global event queue is usually used for sequential program
and Synchronous protocol. On the sequential program and synchronous protocol, the
event with the smallest timestamp is always executed ﬁrst among many unprocessed
events. According to this algorithm, which is known as smallest—event-ﬁrst processing,
the priority queue is used for the global event queue. There are several queue structure
for eVent queues of PDES, such as heap, splay tree [71], calendar queue [17], and skew
heap [44]. Moreover, the priority queue for the global event queue is used for Time
Vva'I‘p especially on the shared-memory machine platform. Jefferson [42] proposed

(1 .
a'IlCelback protocol for storage management and presented that an effic1ent data

‘—______—__ .

 

21
structure of event queue reduces the number of rollbacks in Time Warp. Ronngren
et al. [68] proposed an improved version of the skew heap for event queues on a

shared-memory implementation.

Individually Distributed Event Queues

In contrast to the global event queue, individually distributed event queues have
logical processes keep their own event queues in which all unprocessed events are
pending. The distributed event queue is suitable for PDES in distributed-memory
machine platform. Each logical process uses the local memory in a processor, stores
eVents in the local area, and processes them by accessing and proceeding concurrently.
USing the individual event queue per logical process also simpliﬁes migration of a
logical process to another processor in a dynamic load management scheme.
When the individually distributed event queue is used, all the events are sorted
in the increasing order by timestamp in each logical process’s event queue. Also,
a SeIDaJate output event queue is not necessary by keeping the previously—processed
events into the event queues as well. Each event queue has three pointers; the bottom
to point out the very ﬁrst event, the front to point out the ﬁrst unprocessed event,
and the rear to point out the last event. After executing an event placed in the front,
the front pointer moves to the next unprocessed event. The processed event stays
in tlie queue until a fossil collection performs. The already processed event may be
Ilolled back by stragglers. Instead of using separate output event queues, using a single
event queue which can be used as both input and output event queues can reduce the

qIlene manipulation time when a straggler arrives, by eliminating the corresponding

‘_—_——

 

22

event moving and queue handling from output event queues.

Output Event Queues
In Time Warp, an output event queue contains negative copies of events that a logical
process has recently sent, which is kept in virtual time order [41]. It is used in case of
rollback situation. The output event queue, however, is a necessary structure when
the global event queue is used in Time Warp. It is because the previously—processed
event is to be dequeued from the event queue in order to let the other unprocessed
events be in order in the global priority queue. The previously-processed events are
en<queued into the output event queue to prepare against the rollback situation.

In contrast, if the individually distributed event queue is used, it is unnecessary to
Use a. separate output event queue because the distributed event queue can keep the
pr eViously—processed events as well as the unprocessed event. Thus, it may reduce the
quelle manipulation time when a straggler arrives, by eliminating the corresponding

heavy event moving and queue handling in rollback situations.

2 ‘ 1 ~ 2 GVT Computation

GVT is the bottom line of virtual time which cannot be rolled back by any straggler.
GVT on distributed-memory systems is obtained by computing the minimum value
of LVTS from all processes and the events in iii-transit messages, which are sent
by a sender processor and not yet received by the receiver processor. Suppose that

P is the set of processors, N is the set of logical processes, e is an event, t3(€) is

the timestamp of the event e, LVT] is the LVT of logical process l, where l E N,

‘—

 

 

23
and miyj is a message sent from processor i to processor j, where i, j E P. Then,
GVT = min(minl€N{LVTl}, mineem,_j{ts(e)}), where i,j 6 P and i 75 j. Since each
processor maintains the assigned logical processes, PVT (Processor Virtual Time) is
computed as the minimum value of LVTs in a processor and used to compute GVT.
The PVT of processor i is, thus, PVT, : minleLi{LVTl}. GVT can be computed as
GVT = min(min,ep{PVT,-},mineem,‘j{ts(e)}), where i,j E P and i 76 j.
There are two ways for computing GVT: synchronous and asynchronous GVT
computations. Synchronous GVT computation performs barrier synchronizations pe—
riodically to collect PVTs from processors. This is good for SIMD machine which
support special routing schemes for global computations. Due to the characteristic
0f barrier synchronization, however, the system performance of the MIMD machine
Which communicates by message passing, is dominated by the performance of the
latest responding processor on barrier synchronization. On a synchronous protocol,
Computing GVT is a crucial step to activate logical processes only if their LVT val—
Hes are equal to the GVT. In general, a synchronous protocol uses sychronous GVT
Computation by performing a reduction operation, paying the cost of the barrier
SyI1Chronization. Asynchronous GVT computation is suitable for asynchronous pro—
tocols of PDES such as Time Warp on distributed—memory systems. Without barrier
Ssrrlchronization, asynchronous GVT computation allows processors to do their own
DI‘()(3essing asynchronously, and collects information from processors individually. On
an Optimistic protocol, the GVT value is used to perform fossil collections and to

Check the termination condition of simulation. Since logical processes can progress

t'lrleir virtual time asynchronously by executing the events, the GVT is not the value

‘—_———

 

 

24
that processors have to compute every moment during communication. Thus, the
GVT computation can be implemented asynchronously in order to keep the beneﬁt
of asynchronous processing of processors.

The inherent consideration when the GVT is computed in a distributed—memory
system is to handle the events in the in—transit messages via network, which are sent
by a sender processor and not yet received by the receiver processor. Most of the ear—
lier GVT computation algorithms [12, 70] require each message to be acknowledged.

The events in the messages that are not acknowledged are considered to compute
GVT by taking the minimum timestamp. However, generating an acknowledgment
for each message yields heavy network traffic on the network, degrading the per-
formance of interprocessor communication. Lin and Lazowska [49] propose a GVT
Computing algorithm which does not require acknowledgments, but it has a worst
CaSe of 0(n2) control messages. Concepcion and Kelly [30] propose a multi—level
token passing algorithm for hypercube topology, in which dedicated processors are
used to compute GVT, but they require acknowledgments. Tomlinson and Garg [86]
pI‘()IDOSe an algorithm for minimally latent GVT, which does not require acknowledg—
ments. They determine a target virtual time (TVT) and an initiator detects when
GVT reach TVT. To perform the fossil collection, however, the algorithm should de—
t“TIP-"nine what value the next TVT has to have. Baldwin et al. [8] propose overlapping

W- . .
111Clow algorlthms of token passmg.

 

 

25

2.1.3 Cancellation Mechanism for Rollback Handling

Under the unavoidable rollback situations of the optimistic protocol, it is necessary
to annihilate the erroneous messages caused by the immature events. There are
several cancellation mechanisms to annihilate the erroneous messages: aggressive
cancellation, lazy cancellation, direct cancellation, and immediate cancellation.

In the aggressive cancellation mechanism [41, 43], a logical process immediately
sends antimessages for all messages that it already sent whenever it receives a strag—
gler- In other words, when a logical process rolls back to timestamp t due to a
straggler, it immediately sends all antimessages for all processed events with the

timestamp larger than t. The aggressive cancellation mechanism can result in a large
numb er of messages being sent out messages on the network and thus in heavy traffic
and communication delay. In the lazy cancellation mechanism [38], when a logical
prOCess has a rollback situation, it delays to send antimessages until recomputing
the event and realizing its previously—sent messages were erroneous. An antimessage
is Sent only after the logical process recognizes that the newly re—computed event
has a different value from the previously sent event. In contrast to those two can—
cellation mechanisms, Fujimoto [36] proposed a direct cancellation mechanism on a
Shared memory system that eliminates antimessage load. When a straggler arrives
into a logical process, all corresponding processed events on the logical process and
rel ated events in other logical processes are removed directly. The direct cancellation
rrleChanism is especially suited for shared memory systems to access the other logical

proCesses’ event queues as well. Instead of storing antimessages in output queues, the

26
direct cancellation mechanism has pointers to the corresponding positive events in
order to remove not only rolled back events but also the events caused by the rolled
back events, which are in the causality tree [36].
In the immediate cancellation mechanism [28], the logical process that receives
a straggler re—starts event execution from the point of the straggler’s timestamp,
by ignoring all messages sent so far and all the previously-processed events. The
cancellation scheme applies to the input event queue where a straggler arrives. In
immediate cancellation, there is no antimessage to correct previously. erroneously
generated events unlike the aggressive cancellation and the lazy cancellation mecha—
nisms. In this point, it is the same as the direct cancellation mechanism on shared
memory systems. On a distributed—memory system, the heavier traffic on the network
and the extra processing caused by the antimessage passing can be prevented using
this cancellation mechanism. Since logic simulation requires a small event granular—
ity and its computation and communication do not take a long time, the immediate
Ca"IICIellation is a suitable cancellation mechanism for parallel logic simulation on dis—
tribllted systems. All other cancellation mechanisms also require a re—start of event
Computation from the straggler’s timestamp. The difference between immediate can—
Callaotion and aggressive or lazy cancellation mechanisms is in the handling of rolled
back events in the input port in which the straggler arrives, and in the management
of a-1'1timessages. In this model, an event queue is located in each input port of the
10 g i Q . . . . .
a1 process. When there is a straggler w1th tlmestamp t1 in a logical process,
all unprocessed events with the timestamp later than t1 are removed from the event

qllehe. The logical process restarts event execution at the point of timestamp t1 as

‘—_—_—_‘

 

27
if it had never processed events with timestamp later than t1. The immediate can-
cellation mechanism reduces the load by eliminating antimessages. In addition, it
reduces the time to reconsider old messages in order to decide whether or not they
are to be antimessages, and requires a simple data structure. In this thesis, we use the

immediate cancellation mechanism to handle the rollback situation of the optimistic

protocol.

2- 1 .4 Fossil Collection

The fossil collection is the necessary step on Time Warp to reuse spaces of previously-
Processed events that will not be used again. Since GVT is the bottom line of times—
tamp that cannot be rolled back by a straggler, the processed events whose timestamp
is Smaller than the current GVT are not used in case of rollback and can be removed
to recycle the memory space.

The important consideration on the fossil collection is when and how often we
perfo rm the fossil collection during the simulation. Frequent fossil collections let the
unnecessary memory be reduced as the minimum, but the processing time dominates
the Simulation time. Rare fossil collections cause the short of memory in a processor.
when we perform the fossil collection to all event queues, searching and processing
all Queues takes too many time although some event queues do not consume many

ﬁleIl’lory.
In our implementation, we apply the fossil collection to each event queue when

th
Q queue contalns more events than the max1mum allowed queue Slze. Because there

 

28
may be a situation where only unprocessed events reach the amount, a threshold
value is used to check if the gap between the ﬁrst event in the event queue and GVT

is large enough to perform the fossil collection.

2-2 Other Synchronization Protocols

2 - 2.1 Variations of Time Warp

The major drawback of optimistic protocol is the high potential of generating er-
roneous events far ahead by processing events too optimistically. When a logical
pro cess receives a correct event, it takes time to annihilate the propagated incorrect
9Vent. This section presents several proposed variations of Time Warp that control

the degree of optimism in various ways.

Moving Time Window (MTW)

I\/I()\fing Time Window (MTW) [75, 74] is an optimistic scheduling mechanism to
red uce the overhead of a high optimism by limiting the maximum difference between
the local virtual times of logical processes. An optimal window size reduces the size
and frequency of rollback by applying to logical processes a window of timestamps
Sta'li‘ted from GVT. The MTW scheduling is as follows. For a given time window
irlteI‘val [t, t’], only events having the timestamps within the interval are considered
potential candidates for execution. For a set of side—effect (SE) events, event A
C01iflpletes execution before event B begins execution if the timestamp of A is smaller

th an the timestamp of B. For a set of NSE (no—side—effect) events, all events can be

‘—_——

29
executed in parallel if they are not intervened. The weakness of the MTW is that it is
hard to distinguish correct event processing from highly immature event processing,
so that the appropriate size of MTW cannot be decided in advance of a number of
experiments for a given problem.

The MTW is also able to be used to utilize the memory size efficiently under
the situation with the limited event queue size. Not only for Optimistic protocol but
also for conservative protocol, the MTW mechanism can be employed to limit the
maximum queue size [27, 24]. In this purpose, Chapter 6 shows the implementation

On the SIMD machine by using the MTW mechanism. By adjusting the window
Size for virtual times of events, the overﬂow on event queues due to the short of
memory can be prohibited. A huge number of input events which yields a large event

(1118116 size, therefore, can be simulated even on an architecture platform with a small

memory size.

Time Warp in SPEEDES

Time Warp in SPEEDES [78] accepts two run—time input parameters, say N1 and N2.
AS Ilodes locally process their events, SPEEDES keeps track of how many events have
been processed locally beyond GVT. When the number of N1 on a node (or there
are no more local events to process), the node calls its nonblocking sync function,
and without blocking, continues to process events. When the last node makes its
Call to the nonblocking sync function, all of the nodes simultaneously stop their

eVth processing and GVT is globally determined. In addition, another boundary

N2 is deﬁned to be an upper limit for the number of events that are allowed to

 

 

 

30
be processed beyond GVT. This effectively stops runaway nodes from consuming
all of their available memory while still remaining optimistic. Unlike Moving Time
Window, the SPEEDES approach is independent of simulation time. It does not
require knowledge of the intricate timing strategies in the event scheduling that is
performed within the simulation. It does, however, effectively solve the problem that

overly optimistic simulations may encounter—namely, overconsumption of the available

memory on a processor.

Breathing Time Warp

Time Warp may exhibit rollback explosions that can cause an avalanche of antimes-
sages. Breathing Time Buckets [79], on the other hand, may not be able to process
enough events per synchronization cycle to remain efﬁcient. Breathing Time Warp is
an approach to solves both of these problems by mixing the two algorithms together,
I'eslllting in the best of both methods.

Development of the Breathing Time Warp algorithm was motivated by the general
C)t)s€=:1:‘vation that events close to GVT (in terms of number of events, not time) tent
to be processed correctly while events far from GVT have a greater chance of being
rolled back. Thus, it makes sense to aggressively send the generated messages from
eve I‘lts close to GVT while not immediately releasing the messages generated from
eve hts far from GVT. Each cycle in the Breathing Time Warp starts out by using

8Lggl‘essive message sending methods (i.e., Time Warp), but then at some point makes

a
transition to risk—free message—sending methods (i.e. Breathing Time Buckets).

 

31

Bounded Time Warp

The Bounded Time Warp algorithm [87] detects when all objects have reached an
inactive state, i.e., there are no messages within the current bound that remain un-

processed. For this purpose, this algorithm uses a two phase token passing method

based on Dijkstra’s termination detection algorithm.

2- 2.2 Other Conservative Protocols

Conservative protocol prevents any logical processes from simulation beyond a point
at which another logical processes might affect it. From [21, 22, 18, 51], the conser-
vative protocol is developed. The conservative protocol focuses on the schemes that
are free from deadlock or detected and corrected deadlock [56]. The following shows

Other conservative schemes which do not pay the cost of rollback situation as in the

Opt i mistic protocol.

Y4&‘7‘7NS (Yet Another Windowing Network Simulator)

YAWNS (Yet Another Windowing Network Simulator) [64, 63] has the barrier syn-
Chr(Drtiization operations for the safe event processing. It has the following three steps.
AS tlie first step, every logical process determines whether it is active with an activity
which began in the previous window. In the second step after a global synchro-
niZa-tdon, each logical process inserts any completion notiﬁcations into its event list
afth determining the completion time of its next activity which has not yet begun.

he logical process can determine the starting time of its next act1v1ty w1th the pre-

 

 

 

 

32
computing stage. When another global synchronization is performed, the minimum
value of next-activity-completion-times among all logical processes is determined as
the upper edhe of the window. In the third step, each logical process executes events
vvith the timestamps less than the upper window edge.

In [13], YAWNS is implemented on the IBM SP2 and the experimental results are

presented.

Lookahead Scheme on Conservative Protocol

Lookahead [37] is the ability to predict what will happen in the simulation future.
If the timestamp increments are non-zero, each logical process can predict future
eVents, which will be generated and have the effects on the other events to process. It
is used in the deadlock avoidance approach and in the deadlock detection and recovery
a1g()rithm. In [62], the lookahead ability is improved by precomputing portions of the
COI1'11:)utation for future events, so that a logical process can precompute the service
time of events that have not yet been received.
The lookahead ability is examined analytically by using queuing network simula-
tions in [48, 89]. In [63], average performance as a function of lookahead is analyzed
for a. synchronous protocol based on precomputing times. The impact of exploiting

1 . . . . . .
Ookahead on computation and commumcation overhead 18 also studied 1n [50].

Breathing Time Buckets

he Breathing Time Buckets algorithm [77] processes events in time cycles, where

t . . .
hese time cycles do not use a constant time 1nterval. They adapt to the optimal

 

33
width, which is determined by the event horizon. The event horizon (the time stamp
of the earliest new event generated in the current cycle) determines the next cy-
cle boundary. Processing events beyond this boundary may cause time accidents.
Breathing Time Buckets only sends messages that are known to be valid. This algo-
rithm may be too conservative in its attempt to eliminate the need for antimessages.
It may turn out for some application that cycles in Breathing Time Buckets do not

process enough events to remain efﬁcient.

2 - 3 Considerations on Parallel Machine Architec-

ture

The parallel processing environments provided by existing parallel and distributed
33’ Stems have several limitations, such as a limited number of processors, interproces-
SOI‘ communication overhead, and a ﬁxed size of local memory space on a processor.
These limitations prevent Time Warp from achieving ideal performance. In View of
the parallel optimistic simulation, this subsection compares the major representa-
tive platforms of existing parallel and distributed systems: SIMD (Single Instruction
St re am, Single Data stream) and MIMD (Multiple Instruction streams, Multiple Data
stre amS) which includes shared-memory multiprocessors, distributed-memory multi-

C .
Omputers, and a cluster of workstations.

 

34

2.3.1 Existing Parallel Computer Platforms

Parallel and distributed systems are classiﬁed into SIMD and MIMD platforms. The

SIMD machines have a single instruction stream over multiple data streams and

synchronize all the parallel execution units. The MIMD machines are composed

of two big conﬁgurations: shared—memory multiprocessors and distributed-memory
multicomputers. Shared-memory multiprocessors have a single address space, while
multicomputers use distributed memories with multiple address spaces. Distributed—
memory multicomputers consist of multiple computers interconnected by an inter-
Processor communication network. The multiple computers communicate by passing
messages through the connection network. According to the taxonomy of Bell [11], a
CluSter of workstations interconnected by a LAN is categorized in distributed-memory
InL11ticomputers. A more detailed taxonomy of parallel and distributed computer sys-

tems is studied in [40]. For convenience’ sake, each computer or processor on parallel

and distributed computer systems is called a processor.

2' 3 - 2 Number of Processors

A r1ice inherent characteristic of the optimistic protocol is its possibility to achieve
high parallelism in parallel processing. In order to maximize the beneﬁt of this ca-
pal'bility, the target parallel computing system for Time Warp should be the one that
can provide a sufﬁciently large number of processors required by a given application

ptIOIDIem.

The possible number of processors on a parallel system is mainly determined by the

35

scalability of the system’s architecture. The purpose of scalability is that the machine
performance can linearly increase as the number of processors increases without a se-
vere degradation of communication overhead. In a practical sense, however, the cost
of adding more processors is another major reason why most existing systems have
the limited number of processors. Among the considered four platforms, SIMD is the
platform which can include the largest number of processors by providing massively
parallel small processors, around tens of thousands. For example, the Connection
Machine CM-2 [85] has up to 64K processors and the MasPar 1200 family [53] has
up to 16K processors. In contrast, MIMD machines have a much smaller number of
processors in the range of around 10’s or 100’s. The least scalable MIMD platform
is a distributed system. A distributed system, which is composed of workstations
interconnected by local area sub-networks, usually contains several (in the 10’s) ma-
chines. To connect several hundreds of workstations on a distributed system, bridges
and routers are used to link several sub—networks and increase time to use machines
on another sub-network. Also, the shared—memory multiprocessors have the limit of
adding the massive number of processors because the common intercommunication
network is a bottleneck of performance increase with a hundred processors. The more
scalable type of MIMD is distributed—memory multicomputers because they provide a
separate address space per computer. Distributed-memory multicomputers can have
up to several hundreds of processors, while shared-memory multiprocessor can have
up to tens or hundreds of processors. For example, the IBM SP2 has been able to
deliver a massive number (several hundreds or up to a thousand) of IBM SP nodes.

As for the large application problem such as VLSI circuits, the number of proces-

 

36
sors on existing parallel and distributed systems still has the limitation when a logical
process is assigned into a processor. If there is enough memory in a processor, several
logical processes should be assigned collectively into a processor instead of assigning
a logical process into a processor. The ratio of the number of logical processes to
the number of processors is called the LP ratio. It represents the number of assigned
logical processes per processor. For the simulation of VLSI circuits in which hundreds
of thousands of gates are integrated in a chip, the LP ratio on distributed-memory

systems of hundreds of processors is in the range of a couple of thousands.

2.3.3 Interprocessor Communication Overhead

Another major architectural consideration of parallel and distributed systems is the
communication cost of passing messages via interconnection network between pro-
cessors. On distributed-memory systems, such as distributed systems and multicom-
puters, processors can communicate with other processors only by passing messages
through the interprocessor network. Because of the communication overhead, proces-
sors have to pay an extra cost. The network contention is a well-known problem in
this platform. On a shared-memory multiprocessor system, the communication cost
should also be considered, especially under high loaded network situations with many
processors. Not only the intercommunication network, but also the hot spots in the
shared memory can increase the communication overhead. Some machine architec-
tures provide the special built-in router for interprocessor communications so that the

communication cost may be reduced. For example, the MP-2 architecture in SIMD

37

platform has a built-in global router connected to all processors. The global router
on the MP-2 is a circuit-switched style network that is organized as a three-stage
hierarchy of crossbar switches. It provides a bidirectional communication instruction
supporting point-to-point communications between any two processors [14]. Also,
the IBM SP2 in MIMD provides the built-in communication subsystem connected
to all nodes. Between any two nodes, the High-Performance Switch [82] provides
a packet-switching, multistage Omega network with buffered wormhole routing for
interprocessor communication.

In an application problem that contains many event exchanges between logical
processes during the simulation, such as VLSI circuits, the communication cost is
one of the important considerations to choose the proper architecture platform. A
possible way to reduce the interprocessor communication cost is to assign several log-
ical processes into a processor and compute an appropriate amount of events before
performing communication. The logical processes in the same processor do not com-
municate through a network. Instead, they move the events into the other logical
process’ event queues, and the used data structure is the major key to control per-
formance. Thus, the LP ratio, i.e., the issue of deciding how many logical processes
are assigned into a processor, is a parameter which is highly related to the simulation

performances.

 

 

38

2.3.4 Local Memory Space on a Processor

Another limitation on parallel and distributed systems is the local memory S}
on each processor. Since Time Warp requires lots of memory to store previou
processed events for rollback situations, especially when several logical processes
assigned to a processor, the most feasible architecture platform is the one that
a sufﬁcient local memory to process several logical processes on a processor. S
each processor has a limited memory size and the user’s space is smaller than
supported memory size, the simulation which requires more memory than the lim
space will use the swapping space from the hard disk space and the total execu
time of simulation highly increases.

Most of SIMD machines have small local memories. For example, the Connec
Machine CM-2 has 8KB memory, the MP-l has 16KB memory, and the MP-2
64KB memory. The SIMD machines, thus, have the limitation of small local m
ory Space to support many logical processes per processor. In contrast, distribu
memory MIMD systems contain a larger memory so that they can carry out In
logical processes on a processor. For example, the SUN Sparc workstations ha‘
32MB, 64MB, or larger memory according to the conﬁguration. Due to the 1:
memory size, distributed systems can have the advantage of simulating huge circ
by accommodating many gates. The SP2 in Maui High Performance Computing C
ter (MHPCC) [54] can provide from 64MB up to 512MB memory for thin nodes
from 64MB up to 2048MB memory for wide nodes. Thus, a large number of log

processes per processor can be accommodated in SP2.

39

To keep using the appropriate amount of memory space even for a huge application
problem, we have to consider two factors: the LP ratio and the number of initial input
events. As the initial input events (which are fed into the application problem in the
beginning of simulation) increase, the required memory size also increases. The LP
ratio should not be so large as to be beyond the system limitation. The number
of initial input events should be also reasonable to prepare against any worst case
of rollback situations. Thus, the memory bounded situation is an unavoidable issue

when simulating a large application.

2.4 Parallel Logic Simulation

The application problem for the parallel protocols in this thesis is the VLSI circuit
system. Since logic simulation is a critical bottleneck in the design of complex VLSI
systems, the importance of a logic simulation in the overall design process has in—
creased. Especially, the gate—level logic simulation is considered the application prob-
lem, in which gate—level logic elements such as AND, OR, NOT, NAND gates, and D
f/f, function as logical processes in the model of parallel discrete event simulation.
There are unique characteristics of logic simulation of a VLSI system that can
be a suitable application problem for parallel asynchronous protocols on parallel and
distributed systems. First of all, the number of logical processes is so large, such
as about several tens of thousands of gates to more than hundreds of thousands of
gates, that a single processor cannot simulate for the veriﬁcation efﬁciently. The

memory requirement may increase beyond the limits of the possible memory size and

40

it causes the swapping cost to simulate. By partitioning the circuits into sub—circuits
and allocating them into each of the processors, parallel simulation can achieve cost—
effective performance on parallel and distributed systems. The second is that the
gate—level computation requires a ﬁne—event grain size and the sub—circuits of grouping
gates are able to be simulated on a processor. Since the event granularity is so small,
we can study the effect of computation granularity by binding several events between
two consecutive communication points.

In addition, there are more considerations of gate—level logic circuit simulation.
The connection between logical processes are determined; the task dependency of
the application problem is static and decided in the beginning of simulation. Each
logical process has a unit delay in virtual time, where the unit delay is determined
by the maximum individual delay timing of all gates. The input events are initially
enqueued into the input gates instead of being generated during simulation. The
following shows that the unique characteristics of the gate—level logic simulation in

terms of parallel discrete event simulation.

0 Huge number of gates : When we design a VLSI chip and verify it, a chip is
composed of a huge number of gates. It provides the characteristics of ﬁne—
grained massively parallel processing, which is suitable to be processed on a

parallel machine system.

0 Highly parallel processing of gates : As a circuit is processed by the given initial

input events, a large number of gates are active to process events concurrently.

Since it produces a highly parallel processing over processors, a VLSI circuit is

 

 

 

41

a good application for parallel simulation.

Deterministic communication connections : To achieve the deterministic results
of a circuit, each gate is connected to parent gates and child gates, and the con—
nection between any two gates is statically assigned. Thus, each gate propagates

its events to the pre-determined child gates.

Fixed delay : Each gate has a ﬁxed delay to compute an event according to its
logic operation. The virtual timestamp after computation is always predictable

with the timestamp of a received event.

Generic conﬁguration : A circuit shows a possible conﬁguration of problems on

general distributed parallel processing.

Simplicity of data : An event is composed of a timestamp and data which is
O or 1. Since a gate processes a logic operation such as AND, OR, NOR,
XOR, NOT, etc., the computation is very simple and the outgoing results are

determined.

Easy computation of output results : Each gate has an assigned logic operation
which gives a deterministic output for the incoming inputs. The property is
able to provide a simple structure to handle data and its computation. Using a

lookup table is possible and makes computation easier.

Non-preemption of events : Once an event starts to be processed in a process,

the event is not preempted by other events.

 

42
o Lookahead capability : Due to the ﬁxed delay per gate, we can easily expect the
next timestamp of the unprocessed events. It gives us the lookahead capability

to know the minimum timestamp of the next processed event.

 

:::::::':"-':"-"-'7-’-"-""" input ports

    

a
/L

 

a process

 

 

 

 

 

Figure 2.1: A Conﬁguration of Processes in a Logic Simulation.

 

 

 

Each gate has several input ports according to the fanin, each of which has its own
event queue for storing the incoming events as illustrated in Figure 2.1. Each event
queue is implemented by a data structure according to the protocol characteristic.
Each gate has output ports according the given fanout, but there is no output event
queue. It is because we use the input event queues to store the previously-processed
events as well as unprocessed events. The detailed queue manipulation and the related

mechanisms are described in the rest of this thesis.

43

2.5 Partitioning Problem

Since we focus on the application problem of VLSI circuit systems for asynchronous
protocols on parallel and distributed systems, the related work on partitioning and
mapping algorithms regarding VLSI simulation is presented. Bailey et al. have an
excellent survey about parallel logic simulation of VLSI systems in [7]. Especially,
they cover the partitioning problem as a major factor of PDES for VLSI circuit system.
Partitioning logical processes on processors can greatly affect the VLSI simulation in
the following two reasons. The ﬁrst is to adjust the balance of computation among
processors, so that each processor can have useful work and all of processors can
have a nearly equal number of work to achieve load balance. The second is to reduce
communication overhead which is the major performance bottleneck on parallel and
distributed systems, by placing frequently communicating logical processes into a
processor.

As the method to partition gates by focusing on reducing communication and syn-
chronization overhead, many researchers have studied and only a few major methods
are brieﬂy presented as follows. Levendel et al. [46] propose the string method, in
which the algorithm follows a primary input to a fanout gate and selects one of the
fanout gates to add to the string until it reaches a primary output. The gates on
the string are placed on the same processor. Smith et al. [72] propose the method of
fanin and fanout cones, in which each gate has a cone consisting of the set of gates
that are affected by the output of the gates. After the cones are built for all gates,

the gates driven by primary inputs are evenly assigned to processors. Mueller-Thuns

 

 

44

et al. [58] more focus on reducing communication overhead and place entire cones
on a processor regardless whether the gates in the cone have already been placed in
another processor, that is, they allow gates to be evaluated redundantly. Bisection
and multiway partitioning [32, 45] are graph-partitioning algorithms, in which the
components are treated as nodes, and signals are treated as arcs in a graph. By
dividing the graph recursively into partitions, they can minimize arcs between par-
titions. Brinder (1990) shows that bisection improves gate-level simulation greatly
over random methods for optimistic protocol where the cost for functional evaluation
is similar to communication costs. Sporrer and Bauer [76] have performed a number
of experiments on partitioning circuits of back-order techniques of Smith et al. [72]
and Fiducia and Mattheyses [32]. Simulated annealing is another method to partition
circuits [34, 20]. However, there are two shortcomings of simulated annealing: 1) the
time required to perform the simulated annealing task is long relative to the serial
execution time of the simulation, and 2) the lack of information about circuit activity
prior to simulation limits the ability to formulate an effective cost function to drive
the simulated-annealing algorithm. By applying the partitioning schemes to paral-
lel logic simulation, many researchers have studied and compared the experimental
results [3, 52, 59, 60]

To obtain good load balance of partitioning, random partitioning is the easiest and
fastest technique, in which logical processes are randomly assigned processors [20, 34].
According to [7], the effective partitioning and mapping depends on a number of
factors: the computation/communication ratio, the circuit activity, and the target

architecture’s communication capabilities. In addition, most of the partitioning tech-

45
niques are linear in the size of the circuit. However, if it takes longer time to partition
a circuit than simulate it, developing, partitioning algorithms loses the purpose and
accelerating the simulation is not so useful. In this sense, instead of using partition-
ing algorithms, random partitioning is used by assigning logical processes evenly to
processors in this dissertation, Rather than the study of performance improvement
by using efficient partitioning algorithms, we focus on computation granularity and

the other factors in the asynchronous protocols on parallel and distributed systems.

Chapter 3

Analytic Prediction of the Effects

of Computation Granularity

In this chapter, we derive analytical formulas to model two major performance met-
rics of a parallel optimistic synchronization protocol as functions of the computation
granularity. The main purpose of this analysis is to understand the effect of compu-
tation granularity in an analytic way. The total execution time of a simulation and
the number of simulation cycles are the performance metrics considered. According
to our analytical results, the computation granularity is an important factor in im-
proving the performance of parallel discrete event simulation for large VLSI circuits
on parallel and distributed systems. Experimental results performed on a cluster of
DEC Alpha workstations interconnected by a DEC GIGAswitch are compared to the
analytic results.

In addition, in this chapter, we compare the architectural characteristics of the

MasPar MP—2 to those of a DEC Alpha workstation cluster. The MP-2, as a SIMD

46

47

machine, has a large number of processors and a system-supported global router for
barrier synchronization. However, it is not a feasible platform for a very large VLSI
circuit simulations due to its small local memory size. In contrast, the distributed
system has the advantage of a large amount of local memory which is suitable for
large circuits that cannot be simulated on the MP-2. This study shows that a cost-
effective distributed system is a promising platform for parallel simulation of large
VLSI circuits and that the computation granularity is an important factor to be
considered in improving the performance of a parallel simulation.

The related work, Agrawal and Chakradhar [2] analyzed the performance of the
synchronized iterative algorithms based on the binomial distribution on their analytic
model. In their analytic model, the computation grain size was ﬁxed to the value
of the LP ratio and interprocessor communication time was ignored. Chung and
Chung [29] presented a model to predict parallelism and the number of simulation
cycles for parallel logic simulation. The computation grain size of their model was
ﬁxed to either one or the LP ratio.

The remainder of this chapter is organized as follows. Section 3.1 describes the
model which is used for our analysis and experiments. In Section 3.2, we analyze and
derive the prediction formula based on the model considered. Section 3.3 compares
the characteristics of the MP-2 and a cluster of Alpha workstations from the perspec-
tive of parallel discrete event simulation. Section 3.4 presents the experimental and
analytical results on the effects of computation granularity. Concluding remarks are

given in the last section.

48

3.1 The Model

This section describes the model of the parallel discrete event logic simulation, which
is used in our experiment and analysis. Based on this model, we study the effect of
computation granularity on a parallel and distributed system.

Figure 3.1 presents an instance of allocating 12 logical processes of a logic circuit
to 3 processors. Note that the LP ratio is 4. In a gate-level logic simulation, a gate
functions as a logical process. The logical processes are randomly partitioned and
evenly allocated to the processors. Each logical process contains a maximum of 3
input ports and 2 output ports after a pre-processing step of circuit reconﬁguration.
An intermediate gate which is created by pre-processing has a zero delay, while a
normal gate has a unit delay. Each output port of a logical process is assigned to an
input port of the destination logical process according to the conﬁguration layout of
the benchmark circuit. While an output port does not have any queue, an input port
has its own input event queue for storing events.

A simulation cycle contains the following steps: processing events received in the
previous simulation cycle, selecting some processes to be executed, computing events
of the selected processes, sending the generated messages to the destination processes
via the interprocessor network or via intraprocessor handling after packing computed
events, and performing a barrier synchronization. At the ﬁrst simulation cycle, pro-
cesses are activated by the given input events. At the following cycles, processes are
activated by the incoming events from other processes and by satisfying their own

activation conditions. A process which has unprocessed events and is ready to be

49

 

 

     

PEI

  

.
.
.
‘- I l I I I I
o
.
.
.
'-
e
o
.
.
o
‘-
‘~
'.
...
n
u
‘-
u
.
e
'o
u.-
.
o
‘. ..o
-._.

------------------------------------------------------------------

 

PE 2 ‘ ‘-j-j-:-:-:-:-: Input Queues

(Gate)

 

Figure 3.1: A Layout of Logical Processes

 

50
active but not yet executed is called as a candidate process. Each processor selects
some candidate processes up to the maximum computation grain size and executes
events of the selected processes. The executed processes propagate output events
to the destination processes, which are located either in the same processor or in
another processor. That is, it is to communicate the output events among logical
processes by propagating messages between physical processors or by moving the
output events within a processor. Message transmission between processors uses the
system’s interprocessor communication network, while message transmission within
a processor is simply a movement of data within the processor. From now on, we
will use the terminology ‘communication’ to refer to message transmission of output
events among logical processes, which is propagated between processors and within a
processor. After communicating the output events, processors are globally synchro—
nized via barrier synchronization and determine the GVT value to be the minimum
of PVTs (Processor Virtual Times). The PVT is determined by the minimum LVT’s
among the assigned logical processes in a processor. Due to the barrier synchroniza—
tion among processors, all processors advance with the same simulation cycle during
the simulation although the assigned logical processes progress asynchronously based
on their own local clocks. In contrast to the synchronous protocol in which only
logical processes whose LVTs are equal to the GVT are executed and all logical pro-
cesses are synchronized by the GVT, the optimistic protocol allows each processor to
optimistically execute all possible logical processes with their unprocessed events in
a simulation cycle although the LVTs are larger than the GVT. Thus, our model has

the beneﬁts of asynchronous and optimistic processing as well as the control over the

 

 

51

amount of computation between two communication points.

3.2 Analysis

To study the effects of computation granularity, in this section, formulas are developed
to predict two major performance metrics, the total execution time and the number
of simulation cycles. The total execution time is the total time spent in simulating
a circuit with given input events. The total execution time can be approximately
computed by multiplying the average execution time per simulation cycle by the
number of simulation cycles. Before proceeding to the performance analysis, a few

terms require deﬁnitions:
0 N is the number of logical processes.
0 P is the number of processors in the system.

0 r is the LP ratio. In our model, the N logical processes are assigned evenly
to the P processors, and therefore the LP ratio is [%] in some processors, and

[%] in others. The LP ratio in the analysis is ﬁxed to be [%] for convenience.

k is the maximum computation grain size of the simulation, that is, the maxi-

mum number of processes (or events) that are allowed to be executed between

any two consecutive interprocessor communication points.

a is the activation probability of a logical process.

 

 

 

52
o n,- is the random variable that represents the number of candidate processes in

the ith processor.

0 m,- is the random variable that represents the number of processes that are
actually executed in the ith processor. It is the minimum value of n,- and k:

Tli /\ k.
0 Sf is the total number of simulation cycles.
0 7" is the execution time per simulation cycle.

T’“ is the total execution time of the simulation. That is, T’6 = Sf * 7"“.

The terms of 71,-, mi, 7*, and T k are all variables that are dependent on the LP

ratio 7", but we do not explicitly state 7" for those terms except for Sf.

3.2.1 Execution Time per Simulation Cycle

The execution time in a simulation cycle is determined by the processor that exe-
cutes the largest number of processes in the cycle. Let mmax be the largest num-
ber of executed processes in a simulation cycle over all processors, i.e., mm“ =
Masc(m1, m2, ..., mp). Suppose the fanout of a process is w, i.e., each executed pro—
cess generates 211 copies from the computed event and sends each of them to the
destination process. Then, the execution time per simulation cycle, 7"“, is derived as

follows.

7'16 : tp'roc + mmaa: [ tcomp + ’(U (1 — Q) tmove ] ‘l' (P — 1) tcomm + tsyn (31)

53
where 751mc is the amount of time spent to process received events and to select some
candidate processes among active processes according to the maximum computation
grain size, temp is the computation time of an executing process, q is the probability
that a generated event is transmitted to another processor via the interconnection
network, tcomm is the interprocessor communication time, tmwe is the data moving
time of an output event within a processor, and tsyn is the barrier synchronization
time. When an optimistic protocol is used, the term amC refers to the handling time
of an event queue adjustment due to stragglers and the term mm” includes immature

events of optimistic processing. Since N processes are randomly and evenly assigned

 

to P processors, the probability q is 113:. The term, (P — 1) 7560mm, is the maximum
interprocessor communication time for all generated events in a simulation cycle. In
our model, the events generated in a simulation cycle on a processor are combined and
sent together in a message if their destination processes are on the same processor.
If the logical processes are randomly distributed and the LP ratio is large, then a
processor may send messages to all other processors. The parameters, tpmc, tcomp,
teamm, tmove, and tam, can be measured from experiments.

As a machine-independent parameter, we derive the expected value of mmam when
the maximum computation grain size is k. Let fn,(:r) and Fn,(:1:) be the probability

mass function and the cumulative distribution function of m, respectively, where

nggr.

54
Theorem 1 The probability mass function of m,- is given by
fairs) if 22 < k
fm.(:v) = (3.2)
1 — Fni(a:) ifs: = k
Proof: Because m,- is the number of executed processes among n,- candidate pro-
cesses and the value of k limits the maximum of m,, the value of m,- is the minimum
of n,- and k, i.e., m,- = n,- /\ k. If m,- < k, the number of actually executed processes
is the same as the number of candidate processes, that is, m, = m. The condition of
m,- = It includes all cases in which the number of candidate processes is larger than
k, i.e., n,- _>_ k. Thus, fmi(a:) 2: Pr[n,- Z :13] if :1: = k. Note that m,- has the truncated

distribution [57] at k. E]

The cumulative distribution function (c.d.f.) of m,- is given by

Fm,(:c) = 1—Pr[m,->a:]=1—Pr[n,~/\k >12]
1—Pr[n,->:r]=-—Fm(:1:) ifx<k

1— =1 ifx_>_k

If m1,m2, . . . , mp have the identical distribution, say Fm(a:), and they are indepen—

dent, then the c.d.f. of mmam is computed as

F

mmaz

(:13) = Pr[mmaxgx]

= PTle’mz‘ S Ivll

55
= Pr[m1 3 cc] Pr[m2 S 1:] - - -Pr[mp g as]

= Pr[m§:t]’D = Fm(:1:)’D (3.3)

The assumption that m1, m2, . . . , mp are identically distributed may be true in our
simulation model, because logical processes are randomly and evenly assigned to
processors. The assumption that m1, m2, . . . , mp are independent may not be correct
because of the task dependency among logical processes when real logic circuits are
simulated. Due to the second assumption, the analytical results in the next section
show a little difference from the experimental results.

Since mmax is a non-negative integer value, E(mmax) : 2:0(1 — meu (1:)) [67].

Based on the c.d.f. of mmax, the expectation of mm“, E (mmam), is obtained as follows.

E(mm...) f: <1 — Frame»
k—l
= an — Fm-..<x>>
k—l
: k '" 230(Fm($)P) (3'4)

Based on Equations (3.1) and (3.4), the expected execution time per simulation cycle

is as follows:

E(Tyﬁax) : tproc + E(mmax) [tcomp + w (1 "' Q) tmovel + (P — 1) tcomm + “93:53)

56

3.2.2 Number of Simulation Cycles

Another component of the total execution time is the number of simulation cycles,
Sf. In this subsection, Sf is derived when the maximum computation grain size is
k and the LP ratio is r. As a preliminary step, we deﬁne the workload of parallel
simulation.

Deﬁnition 1 Let W1." be the workload of the simulation when the maximum compu-
tation grain size is k and the LP ratio is r. Then, er is deﬁned as the total number
of executed events by all processors during the simulation.

By deﬁnition, er is computed by multiplying the total number of simulation cycles

by the total number of executed processes in a simulation cycle, that is,

Wr’“ = (Number of simulation cycles) * (Total number of executed processes in a cycle)

P
i=1

where m is the average of m1, m2, . . . , mp. Let E(m) be the expectation of m. When
P is large, # is near 1. Hence, we can use E(m) for m. Since m1, m2, . . . , mp have

the same distribution, we can use E(m) for E(m). Applying Equation (3.4) with
P = 1, we have

k—l
E(m) = k — §Fm(x) (3.7)

In order to derive an analytical formula for the number of simulation cycles, we

use a property called the workload conservation property.

(3.6)

57
Workload Conservation Property: The workload is preserved for all values of r

and k, i.e.

W11 2 er foranykandr,0§kgr

This property says that the total number of events executed during the simulation
is preserved regardless of the values of k and r. That is, the number of executed events
when each processor has one logical process and at most one event is executed per
logical process, is equal to the number of executed events when a processor has more
than 1 logical process and executes at most one event from each logical process. Note
that the value of W11 contains the total number of executed events including the
immature events due to optimistic processing when as many processors are used as
are required logical processes (the LP ratio 1). In [29], Chung and Chung derived
the number of simulation cycles in the same manner. We prepare two ﬁgures in
Section 3.4.1 which support this property. By using Equation (3.7) and the workload
conservation property, the number of simulation cycles can be derived as follows.
Theorem 2 When the maximum computation grain size is k and the LP ratio is r,

the number of simulation cycles, S”c is computed as follows:

T7

S’ar
SkN 1

. N W (3'8)

where a is the average activation probability of a logical process.

58

Proof: Ifr = 1, k =1, and P = N, then

W,1 .—. 311Na (3.9)

Ifr>1andk21,thenP-——|'¥] and

Wk 2 5’mean

 

N
z Sf [71E(m)l(.,k) (3.10)
Due to the workload conservation property, W11 2 Wf implies SllNa x
S”? l¥l E(m)l(r,k)' Thug,
5’“ x S] N a = $11 a r
T If] E(m)l(r.k) E(m)|(r.k)
Cl

Therefore, the total execution time can be derived from Equations (3.5) and (3.8):

Tk = SkE(rk ) (3.11)

3.3 Selecting the Target Machine Architecture

Before performing a set of experiments in order to evaluate our derived analytical

formula, we compared two typical parallel machine platforms, the SIMD and MIMD

59
platforms, to determine a proper machine architecture for simulating large VLSI cir—
cuits. SIMD machines have a single instruction stream over multiple data streams
and all of their parallel execution units are synchronized. In contrast, the processor
nodes in the MIMD machines asynchronously execute multiple instruction streams
over multiple data streams. This difference in architectural philosophies necessitates
different, respective implementations of the optimistic protocol. Due to such different
machine characteristics, it may be difﬁcult to compare the performance of the opti-
mistic protocol on the two machine platforms. Thus, what we do is to attempt to
prove that a cost-effective cluster of workstations is a promising platform for PDES
of large VLSI circuits in comparison to SIMD architectures. As target machines, a
cluster of DEC Alpha workstations as a MIMD machine is compared to the MasPar
MP-2 SIMD machine. In the rest of this section, we compare the architectural char-
acteristics of the two machines. Also, experimental results on both machines with

several benchmark circuits from the ISCAS ’89 benchmark suite [15] are presented.

3.3.1 Characteristics of the Target Machines

The employed target machines are the MasPar MP-2 with 4K PEs and a cluster of
6 DEC Alpha 3000 workstations interconnected by a DEC GIGAswitch via FDDI.
The GIGAswitch supports a peak data transfer rate of 200Mbits per second. For
the MP-2, MPL (Massively Parallel Language on the MasPar) [53] is used. For par—
allel distributed computing on the distributed system, PVM 3.3 (Parallel Virtual

Machines) [65] is employed. Among many architectural characteristics, only those

 

 

 

60
related to parallel logic simulation are considered. Since performance results of par—
allel simulations are dependent on the execution time of machine instructions, it is
essential to compare the major instructions on those target systems. We compare
the instructions and the characteristics needed for interprocessor communication, the
global synchronization, and the access operations to event queues. In addition, the

number of processors and local memory space per processor are also considered.

Interprocessor Communication and Global Synchronization

The target machines have distributed local memories over processors without any
globally shared memory space. Processors communicate with one another by pass-
ing messages over an interconnection network. One of good characteristics of the
MP-2 architecture is that it has built-in global router connections for interproces-
sor communications. The global router is a bidirectional communication instruction
and a circuit-switched style network organized as a three-stage hierarchy of crossbar
switches supporting point-to-point communications between any two processors [14].
Also, the reduction to compute the minimum value among all processors can be per-
formed by a single instruction within a short amount of time via the global router.
In contrast to the built-in global router on the MP-2, a cluster of workstations
uses a local area network (LAN) such that interprocessor communication is accom-
plished via system software, an operating system, and a message passing package.
The collective communication time on a distributed system is much longer than that
of any point-to—point communication time. Table 3.1 shows comparisons of commu—

nication and barrier synchronization performances between the machines. Processing

 

 

 

 

 

61
I] Communication Synchronization I] Array Read Array Write ]
MasPar MP-2 I] 385 71 ]| 5.47 5.86
DEC Alphas [I 3900 11700 I] 0.08 0.09

 

 

 

Table 3.1: Performances of Communication and Array Access Operations on the
MP-2 and Alphas

time of l—integer data is measured for communication, barrier synchronization, and
array accesses on the MP-2. On the Alpha cluster, the round trip time of 1-integer
messages is measured for the interprocessor communication time. To measure the
barrier synchronization time on the Alpha cluster, the master workstation sends a
1-integer multicasting message by using pvm_mcast() after collecting 2-integer mes-
sages from ﬁve slave workstations. The time measured includes the time spent in the

system call, gettimeo f day(). ,useconds are used as a unit of measurement.

Computation Capability

The parallel discrete event simulation requires many event manipulations, such as
event enqueuing, accessing of events, event comparisons, event dequeuing, and event
queue status checking. Since the event processing time depends on the processor’s
capability, the simulation performance is affected signiﬁcantly by the given computa-
tional capability of the machine architecture.

On the MasPar, most data movement within a processor occurs on the internal
processor 4-bit nibble bus and the 1-bit bus [61]. Compared to other SIMD machines
such as the Connection Machine CM-2, the MasPar has a more computational capa-
bility through such means as the use of an on—chip register set per processor and the

4-bit nibble bus. The distributed system is composed of DEC Alpha 3000 worksta-

 

 

62
tions whose clock rate is 133 MHz. Table 3.1 shows the processing performance of
array read and write Operations on the MP-2 and the Alpha. The Alpha shows much

faster processing times than the MP-2 on such operations.

Local Memory Space and Number of Processors

Most SIMD machines have small local memories. For example, the Connection Ma-
chine CM-2 has 8KB of local memory, the MP-l 16KB, and the MP-2 64KB. However,
the number of processors in SIMD machines is quite large. The MasPar 1200 family
has up to 16K PBS and the CM—2 has up to 64K PEs. In contrast, the distributed sys—
tem has a more local memory, but a small number of processors. Most workstations,
such as the SUN Sparc 10 and the DEC Alpha 3000, have 32MB or more memories.
The interconnection network between processors in a distributed system with LAN
and sub-networks of LAN usually has several (in the 10’s) machines, resulting in the
limited number of processors. In order to connect several hundred workstations to
increase the number of processors, we have to use bridges and routers that link several
sub-networks, requiring extra communication time.

In logic simulations, VLSI circuits have a large number of gates. When the number
of gates is larger than the number of processors, many gates should be assigned to a
single processor. In this case, SIMD machines are limited by their small local memory
space. Distributed systems, however, can take the advantage of having large amount

of local memory for simulation of huge circuits by accommodating many gates.

 

 

 

 

63

3.3.2 Comparison of Experimental Results on the Target
Machines

In this section, we compare the experimental results of the optimistic protocol on
both machines with several benchmark circuits. As benchmark circuits, we use several
circuits from the ISCAS ’89 benchmark suite. Through a pre—processing stage, the
circuits are transformed to a conﬁguration with a maximum of 3 input and 2 output
ports by adding zero-delay gates. Other normal gates have unit delays. Note that the
restricted number of input and output ports on a logical process is good for parallel
processing especially on the SIMD machine, in which each instruction on the massively
parallel processors of a SIMD machine is synchronized by the front—end processor,
so that it can reduce any time delays due to processor execution synchronization.
The clock interval in a D f/f is the same as the interval of input events. On the
distributed system, the optimistic protocol is implemented as a master-slave model,
where the master processor initially distributes the tasks to slave processors, checks
the correctness of the GVT during the simulation, and collects information and results
from the slave processors after completing the simulation. The master processor does
not interrupt processors’ executions during the simulation. The parallel processing
of a given application problem runs on the slave processors by computing events and
communicating with other slave processors in the form of SPMD.

In order to interpret performance results of the target machines, we describe the
implementation and data structure for event queue manipulation. For the MP-2, a

Circular array data structure called Circular Binary Search Queue (CBSQ) is used.

64

The more details are described in Chapter 6. Since a processor of the MP-2 has only
a small amount of local memory, event queues are implemented by a data structure
that can reduce memory usage. The CBSQ not only conserves the memory due to its
circular structure, but also makes event queues be handledimore efﬁciently by using
binary searches for rollback situations and fossil collections. The access time to the
CBSQ is shown by the array read and write operations on the MP-2 in Table 3.1.

On a distributed system having few processors, the LP ratio must be larger than
one. The large amount of memory of a workstation should be shared by many pro-
cesses efficiently. For this purpose, the event queues are implemented by linked struc-
ture on the Alpha cluster. The enqueuing time measured was 8.19 pseconds including
the time spent in malloc() system call. The dequeuing time takes 3.18 useconds. To
improve the performance of event queue manipulation, a free space pool is used. In-
stead of freeing the reserved space when dequeuing, the free space pool temporarily
holds the released space for future use to avoid frequent calling of mall0c() routines.
When the free space pool is used, the average enqueue operation time is reduced to
4.81 pseconds. Although the computation capability of Alpha workstations is much
better than that of processors in the MP-2 as shown in Table 3.1, the queue manipu-
lation times on the two machines do not make a big difference because of the different
implementation and manipulation of event queues.

Table 3.2 shows the experimental results of Time Warp on the MP-2 and a cluster
of Alpha workstations. The number of gates, the measured total execution times, and
the performance ratio of total execution times on the target machines are shown for

six benchmark circuits. Seconds are used as a unit of total execution time. Since the

 

65

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

No. of Total Execution Performance

Circuit Gates Time (T) Ratio
MP-2 Alphas g—A’yﬁ

8953 715 11.57 47.09 4.07

81196 967 16.41 70.16 4.28
81238 988 16.62 71.44 4.30
S1423 1161 11.98 52.79 4.41
31488 1456 9.43 33.12 3.51
81494 1462 9.48 33.21 3.50

 

 

Table 3.2: Experimental Results on the MP—2 and a Cluster of Alpha Workstations

MP-2 has 4K PBS, the benchmark circuits whose numbers of gates are less than 4K
are selected from the ISCAS ’89 benchmark suite. The number of initial input event
vectors for the simulation was 1000. On the distributed system, the computation
grain size is maximized to keep the original Time Warp concept and to compare its
results with those of the MP-2 under the same conditions. That is, all processes are
allowed to execute events whenever they have unprocessed events. According to the
results, the MP-2 produces about four times faster performance than the cluster of
Alpha workstations. Since queue manipulation times on those machines are similar,
as explained above, the major difference of the results is caused by the high amount of
computation due to having many assigned logical processes per processor as well as the
difference in interprocessor communication and barrier synchronization. When the
distributed system has more processors, this performance gap will be reduced. Due to
having many allocated logical processes in a processor of the distributed system, the
amount of computation and the time of managing memory for all logical processes
make processors have a poorer performance. Moreover, the distributed system has

the potential of being used for much larger VLSI circuits than the MP-2 can simulate.

66
Therefore, conviction that a cluster of Alpha workstations can be a feasible target
machine, in the next section, we perform a set of experiments based on this machine

to evaluate the analytical formula derived in Section 3.2.

3.4 The Results

This section presents the analytical and experimental results of logic simulation of
the optimistic protocol on the same cluster of six DEC Alphaworkstations employed
in the previous section. Several large circuits from the ISCAS ’89 benchmark suite
are used as the benchmark circuits. Analytical results are generated by the formula

in Section 3.2 and compared to the experimental results.

3.4.1 Workload Conservation Property

In Section 3.2, we used the workload conservation property to compute the number
of simulation cycles. The workload conservation property states that the number of
executed events when each processor has one logical process and at most one event
is executed per logical process, is equal to the number of executed events when a
processor has many logical processes and executes at most one event from each logical
process. To support the workload conservation property, we show experimental results
with some ﬁgures. During the simulation, the total number of executed events are
counted by varying values of several parameters. The 813207 circuit with 100 input
event vectors was used. Figure 3.2 (a) shows the workload at P = 5, varying k values,

and Figure 3.2 (b) presents the workload with a ﬁxed k, 100, varying P values up to

 

 

67
5. Both ﬁgures show virtually the same values of the total number of executed events

for all r and k and support thus the workload conservation property.

3.4.2 Comparison of Experimental and Analytical Results

This subsection veriﬁes the analysis in Section 3.2 by comparing experimental results
on a cluster of Alpha workstations to the analytical results. The total execution time
and its related components are compared to the 813207 circuit with 100 input event
vectors. The number of processors, P, is 5. The 813207 circuit had 11689 gates: N
is 11689. Since 11689 gates are assigned to ﬁve workstations evenly, the LP ratio, r,
is 2338. Both of the interval of input events and the clock interval in a D f/f are 200.
The computation grain size varies from 1 to r.

In order to derive the performances of E(mmax) and E (m) from Formulas 3.4
and 3.7, it is required that the distribution of n should be known, where n is a
random variable of the number of candidate processes on a processor in a simulation
cycle. The distribution of n may be determined by the characteristics of circuits
and underlying simulation protocols. The binomial distribution has been used in [2]
and [29] with the assumption that the gate activations at each simulation cycle are
independent. Parallel simulations of real circuits will probably make it difﬁcult to
satisfy this assumption. Instead, we used the distribution of n that came from our
experiments; during the simulation, the number of executed gates at each simulation
cycle on all processors was counted to compute the p.d.f and c.d.f. of n. Based on

the measured c.d.f. of n, the E(m) and E(mmax) can be computed by Formulas 3.7

 

 

 

 

 

 

 

 

 

 

 

68
80000 1 1 1 r
60000 — 1 my . 1 1
J /“ 7‘7 W
W: 40000 — —
20000 — p z 5 .1— _
l l l l
0 500 1000 1500 2000
k
(a)
80000 | T I
60000 — 4 4+— +‘\:r
W,’c 40000 — _
20000 e k = 100 +— ~
l l
1 2 3 4 5
P
(b)

Figure 3.2: (a) The Workload Varying k When P = 5 (b) The Workload Varying P
When k = 100

 

69
and 3.4, respectively.

Figure 3.3 shows the experimental and analytical results of E (m) when k varies.
As k increases, E (m) increases until it reaches a bound. Once it reaches the bound,
the value of E (m) is steady, even as k increases. The bound of curves is delimited
by the number of available candidate logical processes in a processor. Although the
value of k is large enough to allow more processes to be executed in a simulation
cycle, since there are not enough candidate processes in the processor, the number of
executed events are delimited by the number of available candidate processes. Since
E (mmaz) shows a similar trend to E(m) in that both curves increase as k increases
and have upper bounds with steady performances, we omit the ﬁgure of E (mmax).
The CDF in the ﬁgure uses the E, (x) from experiments by measuring the number of
executed processes on processors at each simulation cycle and computing the p.d.f.
and c.d.f. of n. Figure 3.3 also presents the curves of E(m) that are computed by
the binomial distribution and the exponential distribution of n. The result of the
exponential distribution is closer to the experimental one than that of the binomial
distribution, but both are still far from the experimental result.

The number of simulation cycles, Sf, can be obtained by the values of E (m) in
Figure 3.3 and by using Theorem 2. Figure 3.4 (a) shows the experimental results of
Sf compared to the analytical results as a function of k. The dotted lines represent
the analytical values, and the solid lines represent the experimental values. To obtain
some parameter values of Theorem 2, we simulated the same circuit using a sequential
simulator of Time Warp with the LP ratio 1, where the sequential simulator is a

simulator performing the original theme of Time Warp on a single processor and is

 

70
80 I I I I ‘ 1
Experimental E m +—
70 — CDF Analytical E m - - - - ‘
Binomial Analytical E m - - -
60 — Exponential Analytical E m - 4 - ~ ‘

'-.".'.1'.'.\".".'.".'4—

 

 

 

0 L l l l l l

O 100 200 300 400 500 600 700
k

Figure 3.3: The E (m) vs. k Varying from 0 to the LP Ratio

not Operated by a synchronous protocol. In the sequential simulator, thus, all available
candidate logical processes are executed. Because there is no machine having as many
processors as logical processes of a huge VLSI circuit, we measured S] by presuming
to assign a logical process to a processor by means of the sequential simulator of Time
Warp. From the sequential simulation, the value 222 is obtained for 5'], 0.021457 for
the activation probability, a, and 1.286383 for the average fanout of a gate, w. Since
S] is obtained under the condition of the original Time Warp when the LP ratio
is 1, the parallelism of the simulation is maximal. That is, all candidate processes
are executed. As shown in Figure 3.4 (a), the steady bottom line of S]? has the same
value of S] as the low bound of number of simulation cycles. This is because a large k
allows all candidate processes to be executed and follows the condition of the original
Time Warp. In contrast, as k becomes small, processors postpone execution of some

candidate processes to the next cycle due to the limitation of computation grain sizes.

 

71
This increases the number of simulation cycles as k decreases.

The experimental and analytical results Of T"c are shown in Figure 3.4 (b) as a
function of k. Seconds have been used as a unit of measurement. When k is near
1, T" becomes very high because the time spent for frequent communications and
barrier synchronizations dominates the total execution time. This implies that very
small computation grain sizes deteriorate the simulation performance signiﬁcantly
when the communication time is relatively large compared to the computation time.
In contrast, in SIMD machines, which provide small communication and barrier syn-
chronization times, the simulation performance may not be affected greatly even by
small computation grain sizes when the LP ratio is larger than 1. In Figure 3.4 (b),
when k is around 100, the simulation shows Optimal performance. It is shown that
good performances can be achieved on distributed systems by using at least a certain
amount of computation grain size. When k reaches approximately 300, the simula—
tion reaches a steady performance. This result implies that when k is larger than the
optimal size, the variation of execution times among processors is large in a simula—
tion cycle. The steady curve after k = 300 is delimited by the number of candidate
processes in a simulation cycle. The analytical curve that is compared to the curve
of experimental result is shifted to the left somewhat. This is caused by differences
of curves on Sf as shown in Figure 3.4 (a). By adjusting the maximum computation
grain size, the unbalanced load among processors can be adjusted so that idle times

of processors can be reduced and simulation performance can be improved.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

72
1000 1 I I I

Experimental -6—

CDF Analytical - - - -
800 a
600 _

Si“
400 2
200- ................. cc ....... _.
l l l l l
0 200 400 600 800 1000
k
(a)
I I l 1 1

140 " Experimemtal —o— —

CDF Analytical - - - -
120 .. _,
100 4* _

k

T 80 1 _

60 ' c .........................
40 - -—l

20 l l l l 1
100 300 500 700 900
k

Figure 3.4: Comparisons of (a) Sf and (b) T’0

 

73

 

200 I I I r I I I

 

 

 

 

 

 

 

I I
813207, P=3 +—
P=4 A—
P=5 +-
150 h .1. 71‘-
Tk 100 .. —A
50 -
0 1 1 l l l 1 l l n
O 100 200 300 400 500 600 700 800 900 1000
Grain Size

Figure 3.5: Experimental Total Execution Times with S13207 Circuit

3.4.3 Effects of Computation Granularity on Distributed
Systems

In the previous subsection, we have studied how performance could be affected
by adjusting computation granularity for a given circuit when the LP ratio is ﬁxed.
Even on the same distributed system, however, the effects of computation granularity
may vary according to machine—independent factors. By varying the number of pro-
cessors, we have performed similar experiments to examine the performance effects
Of computation granularity by using several circuits from the ISCAS ’89 benchmark
suite.

Figures 3.5 and 3.6 show T’“ for circuits S13207 and 838417, respectively, when
the number of processors varies from 3 to 5. The S13207 circuit has 11689 gates and
the 838417 circuit has 30367 gates. For circuit 838417, both the interval of input

events and the clock interval of a D f/f are 500. From the experimental results, some

74

 

 

 

 

 

 

 

 

l I
700 -
600 p = 3 + _
P = 5 -°—
500 _
Tk
400 ’5
300 3 _
200 —
1 1
0 200 400 800 1200 1600 2000

k

Figure 3.6: Experimental Total Execution Times with 838417 Circuit

aspects of parallel discrete event simulation affected by machine-independent factors,
such as circuit characteristics, can be observed. The ﬁrst observation is that a certain
range of computation grain size can produce better performances and that locations
of the optimal grain size within the range are different on those circuits. Circuit
813207 shows better performances when k is between 50 and 300, but S38417 shows
better performances between 30 and 500. The range of optimal computation grain
sizes on 813207 is smooth and the optimal point is located in the middle of the range,
while 838417 has a sharp valley of an optimal range and its Optimal point is at the
beginning. The next observation is that the amount of performance improvement
affected by computation granularity varies from circuit to circuit. Through Figures
3.5 and 3.6, we can observe the amount of improvement in T k by adjusting the
maximum computation grain size. Table 3.3 gives more detailed information in this

aspect for four different circuits. The total execution time when the value of k is

75

 

| P 1| k || 813207] 835932 | 838417 | 838584 ]

2 LP ratio 286.28 8493.81 1501.87 5175.72
Optimal 239.30 7153.13 731.59 4432.14
Improved Rate 1.20 1.19 2.05 1.17
3 LP ratio 147.81 3992.81 698.05 2829.21
Optimal 131.96 3539.53 361.57 2469.10
Improved Rate 1.12 1.13 1.93 1.15

 

 

 

 

 

 

 

 

 

 

 

4 LP ratio 98.15 2281.65 413.34 1921.02

Optimal 71.15 1965.55 244.72 1672.98
Improved Rate 1.38 1.16 1.69 1.15

5 LP ratio 61.77 1459.43 284.52 1244.56

Optimal 45.21 1326.58 198.35 995.01

 

 

 

 

 

 

 

 

 

 

Improved Rate 1.37 1.10 1.43 1.25

 

Table 3.3: Experimental Results on a Cluster of Alpha Workstations

equal to the LP ratio is compared to that when the value Of k is equal to the Optimal

size that yields the minimum total execution time. The Improved Rate is computed

(T'6 when kzr)
(Th when k=optimal) '

 

as The Improved Rate shows the performance improvement when
the computation granularity is appropriately adjusted, compared to the results of the
original Time Warp. Depending on the circuit, the experimental results show various

Improved Rates of up to 2.05. Even on the same circuit, the Improved Rate varies as

P increases.

3.5 Summary

Computation granularity is one of the adjustable variables to improve the perfor-
mance of parallel discrete event simulation when the number of logical processes of
the application problem is larger than the number of physical processors on parallel
and distributed systems. In this chapter, we have studied the effect of computa-

tion granularity on the performance of optimistic protocol on distributed systems for

76
benchmark circuits by using an analytical model. We summarize the contributions
made by this research. First, analytical formulas have been developed to predict
several performance metrics such as the number of simulation cycles and the total
execution time for the given machine—dependent and independent parameters. An—
alytical results derived from the formulas have veriﬁed experimental results on the
distributed system. Second, the architectural characteristics of a cluster of DEC Al-
pha workstations was compared to those of the MasPar MP-2 with respect to parallel
discrete event simulation. Finally, the effects of computation grain size on distributed
systems have been observed with various machine—independent factors, such as the
number of processors and different circuits. The experimental results indicated that
the effects of computation granularity on the performance of distributed logic simu—
lation vary depending on the characteristics of circuits. According to the results, by
adjusting the computation grain size, the improved rate of total execution time was

increased up to 2.05.

Chapter 4

Experimental Study of the Effects
of Computation Granularity on the

Performance of an Optimistic

Protocol on the IBM SP2

In order to focus on the major effects of computation granularity on the performance
of Time Warp, an experimental study of computation granularity is investigated by
performing a set of experiments on the IBM SP2. The performance aspects considered
in this study are execution activity, rollback situations, communication overhead, and
simulation progress. Execution activity shows the amount of actual event computa-
tion between communication points, which is compared to the amount of possible

event computation on logical processes. Rollback situations are represented by two

77

 

78
performance metrics: a rollback frequency and a rollback rate. To study communica-
tion overhead on nonblocking message communication, the computation fraction of
the total execution time of simulation is used as the performance metric. Simulation
progress shows the overall progression of simulation status until the termination con-
dition is satisﬁed. Also the speedup of parallel processing on Time Warp is presented
as a function of the number of processors.

Among the many parallel architectural platforms, we consider multicomputer sys-
tems as prominent ones for optimistic simulation of VLSI circuits which require fre-
quent communication and a large local memory space. The target machine is the
IBM SP2. It provides the High-Performance Switch as its communication subsys-
tem [1, 83, 81] and can scale to a couple hundred or a thousand SP nodes according
to the conﬁguration. Each SP node uses a RISC System/ 6000 processor and can ad-
dress up to 512MB or 2048MB of local memory. With the topology of a bidirectional
MIN, buffered wormhole routing is used for interprocessor communication.

The following section describes the model and implementation by explaining the
employed schemes and design structures of the optimistic protocol. Section 4.2 de—
scribes the experimental results on the SP2, and discusses the effects of computation
granularity. The performance results are compared to those of a synchronous protocol

in Section 4.3. The concluding remarks are presented in Section 4.4.

79

4.1 The Simulation Model

In this section, how the optimistic protocol is implemented is presented by showing
the employed schemes, design strategies, data structures, and characteristics of the
application problem. The Optimistic protocol is implemented in an fashion of event-
driven activation. Incoming events into the event queues activate logical processes
and the logical processes are executed with their events by processors considering

protocol strategies.

4.1.1 Logical Process Allocation

The number of logical processes in large application problems is greater than the num-
ber of processors on parallel and distributed systems. Partitioning of logical processes
into processors can therefore greatly affect VLSI simulations, reduing communication
overhead by placing frequently communicating logical processes onto a processor, and
by adjusting the balance of computation among processors. Bailey et al. wrote an
excellent survey [7] in parallel logic simulation, in which they explained and compared
several partitioning schemes. Most of “the partitioning techniques discussed in their
paper take a linear time in the size of the circuit. They questioned that if it would
take longer time to partition a circuit than to simulate it, reduing only the simu-
lating time could accelerate the simulation performance. In this chapter, in order
to emphasize the effects of computation granularity, we use a random partitioning
scheme rather than any sophisticated partitioning algorithm. Logical processes are

randomly partitioned into processors, and evenly assigned into processors with the

80

same amount of the LP ratio.

4.1 .2 Simulation Cycle

Each processor repeats computation of events and communication with other proces-
sors until the simulation termination condition is satisﬁed. Each cycle from a series
Of repeated cycles of computation and communication is called a simulation cycle. A
simulation cycle is composed of several processing steps: LVT computation of logical
processes, event scheduling, event computation, event propagation, GVT computa-
tion, and fossil collection if necessary. Because each processor is allowed to proceed
with its own work by repeating the simulation cycle, simulation cycles of processors
advance asynchronously.

The number of simulation cycles is one of the key simulation performance met—
rics. A large number of simulation cycles represents a long series of works required to
complete the simulation. However, it is hard to say that a short number of simulation
cycles always has the fastest performance results because the amount of computation
in a simulation cycle also determines performance speed. We may have better perfor-
mance results if the number of simulation cycles is not too long with an appropriate
amount Of computation per cycle. This is because too many simulation cycles with
very small amounts of computation yield the extra cost due to the repetition. Thus,
the number Of simulation cycles is a criteria of simulation performance to show the
simulation progress for completing the required computation amount for the whole

simulation.

81

4.1.3 Event Scheduling

When each processor is assigned a logical process, i.e., the LP ratio is 1, event schedul-
ing is straight-forward; executing the smallest unprocessed event from the logical
process at ﬁrst. Under the situation that the LP ratio is larger than one, by apply—
ing the computation granularity, processors have to decide which logical processes’
events are ﬁrst executed and how many events are executed before communication.
To make the simulation progress rapidly, processors need to execute not only a few
small events whose timestamps are near the GVT, but also more events that can
proceed optimistically.

After communication in a simulation cycle, processors may activate logical pro-
cesses that have newly arriving events or unprocessed events. Since not all logical
processes having unprocessed events are actually activated in our model, we use two
different terms to present a logical process’ status: a candidate logical process and
an active logical process. Candidate logical processes are those that have unpro-
cessed events, whereas active logical processes are those that have events that are
executed. In the original Time Warp, all candidate logical processes should execute
their smallest unprocessed events, and all of the candidate logical processes become
active logical processes. In our model, some candidate logical processes become active
in the control of computation granularity, and the other candidate logical processes
are postponed to be considered in the next simulation cycle.

In our implementation of a simulation cycle, each processor has all the candidate

logical processes in the central scheduling priority queue by building with LVTs of

 

 

 

82
the candidate logical processes and index numbers of logical processes. Up to the
limit of the maximum computation grain size (MCGS), candidate logical processes
are selected from the scheduling heap and become the active logical processes. The
smallest unprocessed event in each active logical process is, then, executed, and the

generated output events are prepared for propagation.

4.1.4 Data Structure of Event Queues

In PDES, logical processes contain events that have been generated or received but
not yet executed, and are pending in the input queues. Each processor schedules the
pending events according to the event scheduling algorithm. The design structure of

event queues has an important inﬂuence on the simulation performance.

Individually Distributed Event Queues

In the model, each logical process has its own event queue in which all unprocessed
events are pending and the previously processed events are stored as well. This type of
implementation with an individual event queue per logical process is better than one
central event queue shared by all logical processes in a processor, especially when the
LP ratio is large. It is because if all logical processes share one event queue in common
within a processor, it takes a long time to search and manipulate the event queue for
rolled back events in accordance with the logical process’s index number. That is, it
can reduce the queue manipulation time when a straggler arrives, by eliminating the
corresponding heavy event moving and queue handling in rollback situations. Using

one event queue per logical process is also easily adaptable to migrate a logical process

 

83

to another processor in a dynamic load management situation.

Input Event Queues

In our implementation of a logical process’ event queue, all events are sorted in
increasing order by timestamp. Each event queue has three pointers; the bottom to
point out the very ﬁrst event, the front to point out the ﬁrst unprocessed event, and
the rear to point out the last event. After executing an event placed in the front,
the front pointer moves to the next unprocessed event. The processed event stays
in the queue until a fossil collection performs. The processed event may be rolled
back by stragglers. Instead of using separate output event queues, using a single
event queue which can be used as both input and output event queues can reduce the
queue manipulation time when a straggler arrives by eliminating the corresponding

event moving and queue handling from output event queues.

Central Priority Queue

The event scheduling algorithm generally follows the ‘smallest-timestamp-ﬁrst’ selec-
tion scheme. Since each logical process has an individually distributed event queue, a
processor has to schedule the assigned logical processes to select the smallest times-
tamped event ﬁrst. We use a central priority queue structure for efﬁcient event
scheduling in a processor. Several priority queue structures have been proposed such
as a heap with array structure, a calendar queue, and a skew heap. We used the heap
structure for the priority queue. Each item in the priority queue has its active logical

process’s index number and its LVT (the timestamp of the ﬁrst unprocessed event as

 

84
the key to be sorted). Only candidate logical processes that receive events and have
unprocessed events are enqueued into the scheduling queue with the LVT value. If the
logical process’s index number is already in the scheduling queue, only the LVT value
is changed and the heap structure is left untouched. This happens when a straggler
arrives and it has a smaller LVT value. From the scheduling heap, each processor de-
queues the smallest one and executes the event on the corresponding logical process.
When the computation granularity is applied with event scheduling, the scheduling
heap is kept after event computation and communication, and is re—organized with

newly received events.

4. 1.5 GVT Computation

In the optimistic protocol, the GVT is used to perform fossil collections and check
the termination condition of the simulation. Since logical processes can progress
their virtual time asynchronously by executing the events, the GVT does not have to
be computed whenever processors communicate with one another. Thus, the GVT
computation in our model is implemented asynchronously in order to keep the beneﬁt
of asynchronous processing of processors. We use a token passing mechanism to collect
logical processes’ virtual times asynchronously from processors and compute the GVT
value.

In this scheme, the master processor sends a token to its destination processor,
receives the circulated token from its source processor, and computes the GVT by

using the collected minimum local clocks of logical processes from processors. The

85
token is passed over processors in the designated logical order. There are two values
in the token: the previous GVT value that had been computed with the information
collected during the previous cycle, and the current GVT value. Each processor con-
tains the minimum LVT among the assigned logical processes, the Processor Virtual
Time or PVT. Whenever a processor receives the GVT token, it takes the stored
GVT value from the token and writes its PVT value to the token. If the on-going
GVT on the token is larger than the PVT, the PVT value becomes the on-going
GVT value and the token carries the information to the next processor. The advan-
tage of this scheme is that it is not necessary to send acknowledgments in order to
make sure in-transit messages are received for computing a correct GVT. Also, this
scheme makes simulation cycles asynchronous by allowing each processor to proceed
on its own without any barrier synchronization. Detailed token passing algorithms

are presented by Baldwin et al. [8].

4.1.6 Rollback Handling

Under the unavoidable rollback situations of the optimistic protocol, it is necessary to
annihilate the erroneous messages caused by immature events. Among many cancella-
tion mechanisms —— aggressive cancellation mechanism, lazy cancellation mechanism,
direct cancellation mechanism, and immediate cancellation mechanism —, the im-
mediate cancellation mechanism is used to handle rollback situations. In immediate
cancellation, there is no antimessage to correct the previously—sent wrong events as

in the direct cancellation mechanism [36] on shared memory systems. In addition,

86
immediate cancellation is a suitable cancellation mechanism for application problems
having small event granularity such as VLSI logic circuits because the event compu-
tation and propagation cost little compared to antimessage management.

Fossil collection is necessary in the optimistic protocol in order to reuse spaces of
previously-processed events. Because the GVT is the smallest timestamp that cannot
be rolled back by a straggler, processed events whose timestamp is smaller than the
current GVT can be removed. The important consideration on fossil collection is
when and how often to perform it during the simulation. Frequent fossil collections
keep the used memory at a minimum, but the processing time can dominate the
simulation time. Rare fossil collections can cause a shortage of memory in a processor.
In our implementation, we apply fossil collection to each event queue when the queue
contains more events than the maximum allowed queue size. Because there may be
a situation in which only unprocessed events are in an event queue and reach the
maximum allowed queue size, a threshold value is used to check that the gap between
the ﬁrst event in the event queue and the GVT value is large enough to perform fossil

collection.

4.2 Experimental Results

With several sets of experiments, we observe the performance of the optimistic pro—
tocol on the SP2. In this section, the simulation environment and results of our ex-
periments are presented. The simulation environment includes the employed features

of the target machine, benchmark circuits, and message passing communication. The

87
general trend of the major performance results, the total execution time, is presented
in Section 4.2.2 as a function of the MCGS. To understand the effects of computation
granularity, the performance metrics of the optimistic protocol have been measured
and analyzed in Section 4.2.3 from several perspectives: execution activity, rollback
situation, communication overhead, and simulation progress. In Section 4.2.4, the
parallel speedup is studied when the number of processors varies. The effect of LP

ratio is observed in Section 4.2.5.

4.2.1 The Environment
The Target Machine

The target machine is the IBM SP2. On the SP2, it is possible to have up to
512 or a thousand processors, up to 512MB of local memory for thin nodes up to
2048MB for wide nodes; each node has a RISC System/ 6000 processor. The SP2’s
High-Performance Switch provides a fast communication environment: the High—
Performance Switch provides a peak bandwidth of 40 MB per second. The MH-
PCC [54] SP2 that has been used for our experiments is composed of about 400 SP
nodes which comprise 30 frames. Each frame contains either 8 wide nodes or 16 thin
nodes, and has a 16-way switch board. Among the nodes, the ones used for our
experiments were the thin nodes which had 64MB of local memory. Between any
two nodes, the High-Performance Switch [82] provides a packet—switching, multistage
Omega network with buffered wormhole routing for interprocessor communication.

The network has the topology of a bidirectional MIN. The SP2 divides messages into

88
packets, each Of which divides into 8-bit ﬂow control digits (ﬂits). That is, nodes
send messages to other nodes by breaking messages into packets and injecting the
packets into the network. Each packet contains routing information that is examined
by switching elements to correctly forward the packet to its destination, where ﬂow
control is performed by a ﬂit. Job scheduling on the SP2 is governed by a job sched-
uler called loadleveler. Using the loadleveler, a job can be simulated on a number
of SP nodes dedicated to the job without sharing processors with other jobs. How-
ever, there is some performance degradation due to delays and contentions on the

communication network, as it is shared with other jobs on other processors.

Benchmark Circuits

The benchmark used was the ISCAS ’89 sequential circuit suite [15]. For our experi-
ments, we use the three biggest circuits from the suite: 838584, 838417, and 835932.
The numbers of gates for the circuits are 20717, 23843, and 17828, respectively. The
types of gates of the sequential circuits are AND, NAND, OR, NOR, inverter, and D
ﬂip-ﬂop. The clocking interval between two input event vectors is 200. The clocking

interval of D f/ f is also 200. Each gate has a unit delay.

Message Communication

We use the Message Passing Interface (MP1) [88] for message passing between pro-
cessors on the SP2. The details of communication performance on the SP2 with MP1
have been studied in [90, 55]. In our model, to reduce communication overhead, we

use MPLIsendO, which is a nonblocking send routine of asynchronous message pass-

89
ing. Thus, the amount of computation can be overlapped with the transmission time
in communication from the point of issuing a nonblocking send to the checkpoint of
the completeness of the issued nonblocking send. As the receive routine, a blocking
receive routine MPLRech is used.

For message communication between processors, two methods have been consid-
ered. The ﬁrst one is the all-to-all communication pattern. That is, during commu-
nication of a simulation cycle, each processor sends messages to all the processors
except itself. This all-tO—all communication pattern produces many messages into the
network, even when a processor does not need to send messages to other processors.
The second method is the some—to-some communication pattern, in which a processor
sends messages to some processors only if they are the destinations of the messages.
According to our experiments with tens of processors on the SP2, the all-to-all com-
munication pattern shows better performance than the some-to-some. This is because
there is a high possibility for a destination processor to receive the sent messages in
the next simulation cycle; not in the same simulation cycle. Since a processor does
not know how many messages are sent to itself from other processors, the processor
may give up waiting for more messages after repeating the procedure of probing mes-
sage arrival, receiving and handling the messages, and waiting. The late messages
may be buffered in the receiver side, but not received by the receiver processor un-
til the next cycle, thus, it may slow the overall simulation progress. Since the SP2
provides fast communication, the overhead of sending extra messages in the all-to-all
communication pattern is less than the overhead of the delay for the completion of

sent messages in the some-to-some communication pattern.

90

Performance Metrics and Parameters

In order to study the effect of computation granularity in the optimistic protocol, we

use the following performance metrics.

c Total execution time: the spent wall-clock time interval from the starting point

of the simulation to the terminating point.

9 Speedup: the ratio of the total execution time to execute a sequential program
on a processor to the total execution time to execute a parallel program on

multiple processors.

0 Number of simulation cycles: the number of repeated cycles of computation and

communication until the simulation completes.

o Computation time: the amount of computation time out of the total execution

time.

o Computation fraction: the ratio of computation time to the total execution

time.

0 Number of executed events per cycle: the number of executed events by a

processor in a simulation cycle.

0 Activity rate: the ratio of the number of active logical processes to the LP ratio

in a processor per simulation cycle.

0 Actual activity rate: the ratio of the number of active logical processes to the

number of candidate logical processes in a processor per simulation cycle.

91
o Rollback frequency: the ratio of the number of stragglers to the number of

received events in the simulation.

0 Rollback rate: the ratio of the number of rolled back events to the total number

of computed events in the simulation.

0 GVT jump: the increment of GVT when the GVT value is re—computed in a

token cycle.

In addition, the parameters used in the experiments are as follows:

0 MCGS: the maximum computation grain size, that is the maximum number of

events between any two consecutive communication points.

0 Number of PBS: the number of processors used to process the given application

problem in parallel.

By varying the MCGS, we observe the effects of computation granularity on the
performance results with the performance metrics. The following subsections show
the experimental results of the optimistic protocol on the SP2 when the number of
processors is six. The major performance result is introduced ﬁrst and compared to
the related effects of the other performance metrics. Also, the speedup and the effects

Of LP ratio are studied as the number of processors varies.

4.2.2 Total Execution Time

The total execution time is the wall-clock time interval from the starting point

of simulation to the point of terminating. This major performance metric of simu-

92

 

 

 

 

 

 

1800 l l l I M
f s38584 +—
1600 _ $38417 -e—- 2
$35932 «—
1400 [ —
1200 r = C 4‘} “a
Total .,
Executiofi000 .r 6 —
Time 800 — 2 —
600]L ., —
400 if 'a -
200 . ...4" —
O ’ ”l l l l l l l l
0 500 1000 1500 2000 2500 3000 3500

MCGS

Figure 4.1: Total Execution Time

lation shows how fast the simulation can complete a given task. Figure 4.1 shows
the total execution times to simulate three sequential circuits, 838584, 838417, and
835932, with 500 input event vectors on the SP2. The right ending point of each
curve in the ﬁgure shows the total execution time when the MCGS is the LP ratio
of each circuit. In the case that the MCGS is equal to the LP ratio, each processor
makes all candidate logical processes be active and their events are executed before
communication. When the MCGS is the LP ratio, the simulation is performed ac-
cording to the theme of the original Time Warp where all candidate logical processes
can execute each of their events and propagate the newly generated events to the
associated logical processes in the simulation cycle. As shown in the ﬁgure, however,
the performance is better at the smaller MCGS than at the value of LP ratio. Rather
than processing all possible logical processes, adjusting the number of active logical

processes yields better performance. Thus, the graph of the total execution time

93

 

300

 

 

 

 

I I I T I
838584 5F—
S38417 4*-
250 835932 +—
200 ‘
'Tbuﬂ ;
Execution150 - / _
Time I . I a
100 r —
50 — ‘ = = ‘ ° —
0 l l l l l
50 100 150 200 250 300
MCGS

Figure 4.2: Total Execution Time (Close-up Between 1 and 300)

shows that we can achieve better performance results by using the MCGS instead of
applying the original Time Warp directly.

Each curve in Figure 4.1 has a steady performance as the MCGS becomes larger
than a certain large value, that is, around 2000. The performance remains steady up
to the MCGS of the LP ratio. By using other experiment metrics in the following
section, the steady performance on a large MCGS is explained in more detail.

To observe the performance trend near the optimal MCGS, Figure 4.2 focuses on
the left side of Figure 4.1. Each curve of a circuit has its own valley that has a different
slope on either side of the valley, a different width, and a different Optimal location
of computation granularity depending on the Circuit’s characteristics. In order to
understand the appearance of the curves more clearly, we interpret the performance

results in depth with other performance metrics in the following section.

94

4.2.3 Analysis of Experimental Results

To explain the performance results of the total execution time, several other aspects
of Time Warp are studied and explained by other performance metrics. Those per-
formance aspects are categorized into four parts. First, to observe the trend of event
execution affected by computation granularity, execution activity is studied with per-
formance metrics of activity rate and actual activity rate. The second is the rollback
situation. Since the inherent optimistic characteristic yields a rollback situation, it is
the main concern in the optimistic protocol. Rollback frequency and rollback rate are
used for the performance metrics. As the third, communication overhead is consid-
ered. Communication overhead is an unavoidable consideration for parallel processing
on parallel and distributed systems. The computation fraction out of the total ex-
ecution time is used to consider the approximate communication fraction. The last
aspect is the progress of the simulation. Because a given problem is terminated when
the required tasks are done, considered is how fast the simulation is able to proceed
to reach the termination. The number of simulation cycles and the GVT jump are
the performance metrics. Therefore, the major aspects of performance analysis for
the optimistic protocol on parallel and distributed systems are execution activity,

rollback situation, communication overhead, and simulation progress.

Execution Activity

By observing the event execution of parallel processing, an overall trend and an

estimated performance can be predicted. The amount of major computation on a

95

 

07 I I I I I T I

  
 
 

0.6 — $38584 -*—
S38417 -9-
335932 *—

 

0.5

Activity 0 4
Rate

 

0.3

0.2

L
4
i!»

1

0.1

 

 

l l l l 1

0 500 1000 1500 2000 2500 3000 3500
MCGS

   

Figure 4.3: Activity Rate

processor is determined by how many events it executes. Although the optimistic
protocol has more other interesting concerns to study, we start with the study of the

effect of execution activity on parallel processing.

0 Activity Rate
The activity rate is deﬁned as the ratio of the number of active logical processes
to the LP ratio in a processor per simulation cycle, where an active logical
process is one that executes an event in a simulation cycle. The activity rate
indicates how many gates are active to compute an unprocessed event in a
simulation cycle on a processor. The rate depends on the number of processors
used, the parameter value of MCGS, and the characteristics of the circuit, such
as the number of gates in the circuit and the connectivity conﬁguration of the

gates.

96
Figure 4.3 shows the average rate of activity on a processor as the MCGS
changes. With the activity rate we can explain several aspects of the graph of
total execution time shown in Figure 4.1. The activity rate at the rightmost
MCGS has the maximum value of each curve. This point represents when a
circuit is simulated by the original Time Warp protocol. A general trend we
can recognize from the ﬁgure is that the activity rate decreases as the MCGS
decreases to the left. This is because the MCGS limits the maximum number
of active logical processes, each of which processes an event per simulation cy—
cle, even if there are more candidate logical processes which have unprocessed
events. An interesting phenomenon is that the curve of the activity rate be-
comes ﬂat when the MCGS is between approximately 2000 and the LP ratio.
This phenomenon implies that the number of active logical processes does not
increase linearly according to computation granularity at a large MCGS up to
the amount of the LP ratio. This is because the number of active logical pro-
cesses is decided by the characteristic of the circuit. Although the MCGS allows
more logical processes to execute their events, there are not as many candidate
events as the amount of the MCGS. This makes the number of executed events
be delimited by the number of available candidate logical processes, not by the
MCGS. Since the number of active logical processes is the same as the num-
ber of executed events in a simulation cycle, the activity rate determines the

amount of execution time per simulation cycle.

97
According to Figure 4.3, the rightmost value of the activity rate at the LP
ratio is an inherent characteristic of a circuit. Circuit 835932 has the highest
activity rate at the LP ratio, 838584 is the next one, and the last one is 338417.
Compared to Figure 4.2, the circuit having a higher activity rate has a wider
valley of optimal computation granularity; the circuit of a low activity rate
has a narrow valley. Because the activity rate shows the required amount Of
tasks on a circuit to complete the simulation, the circuit with a higher activity
rate contains many unprocessed events although those events are delimited by
the MCGS. At the Optimal value of MCGS, even as the MCGS increases a
little more, the performance of the circuit with the higher activity rate is not
degraded by executing more events fulﬁlling the required event amount without
causing a bad effect. Meanwhile, the circuit with a low activity rate has fewer
events to execute. When the MCGS is larger than the optimal point, it has
a greater likelihood of generating rollback situations because good events have
been already executed and some immature events may be executed, degrading

the performance.

Actual Activity Rate

Another performance metric related to event activity is the actual activity rate.
The actual activity rate is the ratio of the number of active logical processes to
the number of candidate logical processes in a processor in a simulation cycle.
This metric presents how many logical processes out Of the candidate logical

processes are actually active and executing events. In contrast to the activity

 

 

 

 

 

 

 

 

1 qr a a 0

338584 *—

S38417 '9—
0.8 $35932 +— “
Actual 0'6 _ T

Activity
Rate 04 _
0.2 “I" d
0 " l l l l l l l
0 500 1000 1500 2000 2500 3000 3500
MCGS

Figure 4.4: Actual Activity Rate

rate, the actual activity rate can reach one depending on the MCGS. When
the actual activity rate is one, all candidate logical processes are allowed to be
active and to execute an event in the simulation cycle. Figure 4.4 shows the
average actual activity rate per simulation cycle. The starting point Of reaching
1 is around 2000 which is same as the starting point of the steady state in the
curve of activity rate in Figure 4.3 and that in the curve of total execution time

in Figure 4.1.

Rollback Situation

Rollback is an unavoidable situation in the Optimistic protocol that is caused by
Optimistically sent immature events. Varying the MCGS, the performance effect of
rollbacks and the performance trend are studied in this subsection. The experimental

results of rollback frequency and rollback rate are used as the performance metrics of

99

 

0-2 I T T

$38584 +—
S38417 -6—
S35932 *—

 

0.15 r

Rollback
Frequency '

 

 

 

0 1 l l
50 100 150
MCGS

200

250

300

Figure 4.5: Rollback Frequency (Close-up Between 1 and 300)

 

0-3 I I I I

 

 

 

 

 

 

 

f I I
0.25 = = -
0.2 _.
or o 6)
Rollback 0.15 _

Frequency

0.1 338584 +— —

S38417 -e—

S35932 4—
0.05 _

0 l l l l l l l

0 500 1000 1500 2000 2500 3000 3500
MCGS

Figure 4.6: Rollback Frequency

100

rollback.

o Rollback Frequency
Unique characteristic of the optimistic protocol is the overhead of rollback situa-
tions. Figures 4.5 and 4.6 show the rollback frequency varying with the MCGS.
The rollback frequency is the ratio of the number of stragglers to the number of
received events in the simulation. It shows the possibility of having a straggler
when a logical process receives an incoming event. On the 838584 and S38417
circuits in Figure 4.5, the rollback frequency is very low, i.e., smaller than 0.03,
when the MCGS is between 1 and 60. When the MCGS goes to around 80 or
100, there is a large jump in the rollback frequency in each of the two circuits;
the rollback frequency greatly increases in the interval between 80 and 100. In
contrast, the 835932 has a higher rollback frequency than other circuits and is
almost steady until the MCGS reaches 100. It begins to jump at around 200.
These large jumps of rollback frequency imply that many stragglers start to be
generated out of the incoming events. This means that around the jump point
the simulation performance is degraded due to increasing rollback situations.
The degrading effect of rollback on the total execution time is shown in Fig-
ure 4.2; the execution times of 838584 and S38417 after 100, and that of the
335932 circuit after 200. From the observation of rollback frequency and the
associated total execution time, we can say that by adjusting the number of

stragglers the MCGS is a controlling factor of the degree of optimism.

101
Also, the jumping point of rollback frequency can be interpreted by the activity
rate in Figure 4.3. Since the S35932 circuit has a higher activity rate, it keeps
almost the same performance although the MCGS increases to 150 and it shows
degradation after approximately 200. The other two circuits have degradations
earlier at around 100. Thus, the performance degradation due to rollback sit—
uations is affected by the activity rate of the given circuit. In addition, on the
large MCGS in Figure 4.6, the rollback frequency of each circuit has a different
value of steady curve until the MCGS is up to the LP ratio. The steady perfor-
mances also depend on the characteristics of circuits: because of their layout,

some circuits have a few rollback situations and others have many.

Figure 4.6 shows that the curves of rollback frequency are around 20% or 25%
of incoming events, which are smaller portions compared to the degradations of
the execution times with a large MCGS. Since the rollback frequency is decided
by the number of stragglers, a straggler usually results in more than one rolled
back event and requires extra processing for the queue adjustment over all input
events queues on the logical process. The queue adjustment on event queues
is to move the front pointer of each event queue according to the straggler’s

timestamp, after the straggler is enqueued into the destined input-event queue.

Rollback Rate
The rollback rate is the ratio of the number of rolled back events to the total
number of computed events in the simulation. Compared to the rollback fre-

quency which is decided by the number of stragglers, the rollback rate relies on

102

 

 

 

 

 

 

 

1 I I I I I I I
838584 “A—
S38417 '9—
0.8 335932 '0— -
0.6 1 _
Rollback
Rate
0.4 _
0.2 r
0 l I l l l l l
0 500 1000 1500 2000 2500 3000 3500

MCGS

Figure 4.7: Rollback Rate

the number of rolled back events by the stragglers. Since a straggler usually
yields more than one rolled back events, about 20% rollback frequency makes
the 50% up to 90% rollback rate in the simulation as shown in Figure 4.7. When
a straggler comes in an input port and is enqueued into the event queue, other
event queues in the logical process have to be adjusted to re—compute from the
straggler’s point. The adjusted and rolled back events in other event queues are
counted as well as the rolled back events in the event queue where the strag-
gler occurs. The rollback rate mainly depends on the characteristics of circuits,

especially the layout of connectivity among gates.

Communication Overhead

Another major issue related to performance of the Optimistic protocol on parallel and

distributed systems is interprocessor communication overhead. In our implementation

103
for interprocessor communication on the SP2, nonblocking message passing routines
from MPI are used in order to improve performance by overlapping communication
and computation. A nonblocking send routine initiates the message sending opera-
tion, but returns before the message is received by the receiver processor. A separate
wait routine veriﬁes that the message has been received. We use the nonblocking
send MPIJsendO, blocking receive MPLRech, and MPLWaitO to complete the
nonblocking send. It is impossible to measure the exact communication time for
nonblocking communication because it operates together with computation and the
transmission time is overlapped with the computation time. In order to measure the

communication time roughly, the following several strategies can be used:

1. Lower Bound Communication Time: To measure the lower bound com-
munication time, only the time spent for send and receive routines is measured.
Each processor reads the value of a wall-clock time just before MPLIsendO
Operations, and then reads the clock again after the send operations ﬁnish. In
the same way, it reads the clock before MPI.Recv() operations and reads it
again after the receive operations ﬁnish. It thus measures the elapsed wall-
clock times for send and receive operations. This lower bound communication
time measures the time spent for message sending and receiving before invoking
the MPI communication routines and after returning from the routines without
considering any transmission between processors. It contains no computation

time.

 

104
2. Upper Bound Communication Time: The upper bound communication
time is the time spent for communication until it is sure that the message send-
ing is done. Each processor reads the clock before a send operation and reads it
again after making sure that all the sent messages are received in the destination
processor, and measures the elapsed wall-clock time. The upper bound commu-
nication time contains the computation time which is between the MPIJsendO
operation and the MP1. Wait() operation. Since the MPI.Wait() operation is
usually called as late as possible before re—using the send buffer, the included
computation time may take a large portion rather than the communication

related time.

3. Approximate Communication Time: The approximate communication
time is a roughly measured communication time. The method used is to com-
pute { Total Execution Time - Total Computation Time); while measuring the
total execution time, each processor measures the computation time for all com-
puting steps in the simulation and subtracts the computation time from the to-
tal execution time. The event handling time in event queues is included in the
computation time. According to our experiments, this communication time is
the same as (Lower Bound Communication Time + Waiting Time), where the
waiting time is measured just before and after the MP1. Wait() operation, gath-
ering only the waiting time for the completion of nonblocking message sending.
However, since the computation time may contain an overlapping area with the

transmission time in communication, this communication time may be smaller

 

 

 

 

105
200 I T I I I I
Computation Time +—
Lower Bound Communication Time -9-
Upper Bound Communication Time -o—
150 — Approximate Communication Time A— d
1
Execution100 _ _
Time
50 - -’
A. A éﬁé'f—AMQ/‘t" I
0 h 4L I l l l l
2 4 6 8 10 12 14 16

Number of PBS

Figure 4.8: Communication Time Varying the Number of PBS

 

than the exact time spent for interprocessor communication.

Figure 4.8 shows the experimental results of several communication patterns ex-
plained above. The difference between lower bound and approximate communication
times are little; the waiting time of making sure the completeness of sending is not
a high portion. Those two measured results of communication time generally in-
crease as the number of processors increases, paying more communication overhead.
In contrast, since upper bound communication time contains some portions of com-
putation, the execution time is high when a small number of processors are used,
becomes low as the number of processors increases, and becomes high again when
more than enough processors are used due to communication overhead. Also, Fig-
ure 4.8 presents computation time as well as communication time. As the number of
processors increases, the amount of computation decreases due to higher distributing

tasks over many processors; computation time decreases. On the point of 12 PEs,

106

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Communication Time Compu-
No of MCGS Low Bound Upper || Approximation tation
PEs Sending [Low B. Bound || Tot-Comp [ Low B+Wait Time
P=2 85 2.68 12.50 120.34 13.94 13.29 207.34
P=4 85 3.62 14.52 76.19 15.77 15.39 118.61
P=6 60 4.39 11.71 53.87 13.05 12.70 79.64
P: 45 5.32 14.95 48.03 16.48 16.12 59.77
P210 35 6.60 16.51 43.51 18.26 17.91 48.06
P=12 35 7.57 18.06 43.47 19.88 19.55 44.88
P214 30 7.77 19.00 41.67 21.01 20.68 38.94
P=16 20 10.15 24.45 43.88 26.92 26.56 30.73

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 4.1: Comparisons of Communication Measurements

the computation and upper bound communication times are crossed over. On more
than 12 PEs, computation time is smaller than upper bound communication time. It
means that although computation time is small due to a small amount of computed
events per processor, communication time becomes a dominant factor.

Table 4.1 shows the detailed measured results Of Figure 4.8, showing the values of
low bound, upper bound, and two approximate communication times as the number
of processors changes. In low bound communication, the time spent for sending
is smaller than the time for receiving because the nonblocking send operation does
not wait for arrival of messages. The two approximate communication times, ( Total
Execution Time - Total Computation Time) and (Lower Bound Communication Time
+ Waiting Time), are almost same. In this measurement, the program includes the
performance measuring statements so that the execution time is large compared to
the total execution time in Table 4.16, which is measured with only the core program

without including measuring statements to compute speedup of parallel processing.

107

 

 

 

 

 

1
0.8
n. I I
Oﬁcr _.' _
Computation ,' *. . 'k- - 'k ------ ~k ...... * ..... .k
. .' _ .G. _ '
Fraction 04 _ 03-0. 2.34? o .0..o ...... G ...... o ............. 5,
_*- " $38584 Computation -A-—
j' 91* .*" $38584 Rollback "A" -
0 2 _ 3X S38417 Computation -e— _
' Q‘? S38417 Rollback 49- -
5.1 S35932 Computation —o—
. o I I l I lS35932 lRollback| -" '
0

0 500 1000 1500 2000 2500 3000 3500
MCGS

Figure 4.9: Computation Fraction

To get rid of the effect of computation amount on the upper bound communication
time and consider the effect of nonblocking message passing, we have used the third
communication measurement method, the approximate communication time, in the

following performance metrics of communication overhead.

0 Computation Fraction
Figure 4.9 shows the computation fraction of the total execution time vary-
ing with the value of the MCGS. The computation is composed of computing
LVTs, checking the D f/f clock, making the central priority queue, executing
events, processing received events, manipulating event queues, computing GVT,
and performing fossil collection. Based on the computation fraction, the com-
munication fraction can be roughly estimated by computing the ratio of the
roughly measured communication time to the total execution time, where the

communication time is derived by ( Total Execution Time — Total Computation

108

 

 

 

 

 

 

 

1 I I I I I
0.8 — _
$38584 Computation -*—
S38584 Rollback -*' -
. 0-6 S38417 Computation -e— ‘
Computation S38417 Rollback -o- -
Fraction S35932 Computation -o—-
0.4 335932 Rollback '0' - . . . , , .. : :0
ﬁ ...... O ........ . . d
o ..................... r
0 2 r- , ..... * """" f. - _
”HQ: . .
.35: ' ....-
0 W'- ..«- -g """" m I l
50 100 150 200 250 300

Figure 4.10: Computation Fraction (Close-up Between 1 and 300)
Time )

When the MCGS is very small, the computation fraction is small; only a small
number of executed events are executed in a cycle per processor while yielding
frequent interprocessor communication. Thus, the communication overhead
dominates most of the total execution time by computing only a small number of
events and communicating frequently. As the MCGS increases a small amount,
the computation fraction increases rapidly since the number of executed events
increases accordingly and the spent time for communication is balanced with
computation. When the MCGS is large enough, the communication overhead is
not the major factor of the total execution time; the computation time due to
rollback situations is the major factor. Therefore, the communication overhead
mainly applies to the small MCGS’s, especially on the left side of the optimal

computation granularity.

109
Figure 4.9 also shows the rollback fraction in dotted lines varying with the
MCGS, compared to the computation fraction. The rollback fraction is the
ratio of the approximate rollback handling time to the total execution time,
that is derived by multiplying a computation fraction by a rollback rate. This
approximation shows the computation amount of rolled back events out of the
overall event computation amount. On a large MCGS, the rollback fraction
takes a high position among the computation fraction. The 90% computation
fraction may suggest that it is possible to improve the performance by increasing
the number of processors. However, as shown in the ﬁgure, among the 90%
computation fraction, the rollback fraction is a large portion. When a larger
number of processors than the appropriate number are used, the performance
does not increase because processors will perform unnecessary event processing,
resulting in more rollback situations. Thus, using more processors is not always

the answer.

Figure 4.10 focuses on the left side of Figure 4.9, and is also able to explain the
optimal valley in Figure 4.2. Figure 4.10 focuses on a higher communication
overhead through the computation fraction between MCSG values 1 and 30 in
solid lines. It also shows a heavier rollback situation through the rollback rate
in dotted lines started from 100 on 838584 and S38417 circuits, and from 200
on 835932 circuit, respectively. The area between those two, that is, the point
reaching the steady performance of computation fraction around the MCGS of

30 and the point of increasing rollback situations, is the same as the optimal

110

 

   
 
   

 

 

 

1 r I I I I I I
838584 -*—
838417 ~9-
0.8 S35932 *—
D
0.6 -
Computation ,
Fraction D
0.4 — -
0.2 — _
O l I l l l l l
2 4 6 8 10 12 14 16 18

Number of PBS

Figure 4.11: Computation Fraction Varying the Number of PBS

valley of total execution time in Figure 4.2. The area in—between provides the

optimal computation granularity because the high communication overhead is

reduced and the overhead due to heavy rollback situations is not yet included.

Thus, the area mainly works on genuine event execution without being domi-

nated by the high communication and rollback overhead.

0 The Limit of Parallel Processing

Figure 4.11 shows the computation fraction according to the number of PBS. As

the number of PBS increases, the computation fraction decreases, that is, the

communication fraction increases. Thus, we can see that more PEs result in a

greater communication overhead and a loss of the beneﬁt of parallel processing

performed by many PEs.

111

 

I I

838584 total time
838584 computation time

S38417 total time
S38417 computation time -

S35932 total time
S35932 computation time -

     
 
 

  
     
     
   

200 "

-l‘?‘l’§”r

 

 

 

 

' tr
Execution150 :_
Time '_
100 -
<9. .

50 -'....,.on . - "

0 l 1 l l l
50 100 150 200 250 300

MCGS

Figure 4.12: Comparison Between Total Execution Time and Computation Time

0 Total Execution Time vs. Total Computation Time
To explain the effect of communication overhead on overall performance in
more detail, Figure 4.12 compares the total execution time in solid lines to the
computation time in dotted lines varying with the MCGS. When the MCGS
decreases from the LP ratio to the optimal point of execution time, both the
total execution time and the computation time proceed in parallel by keeping a
small gap of communication time. This small gap shows that the SP2 provides
fast interprocessor communication and that nonblocking message passing does
not cause great loss by overlapping the communication with the computation
of processors. When the MCGS is between 1 and the optimal point, the trend
is different from the former; the gap of the total execution time and the com-
putation time increases as the MCGS decreases. This bigger gap represents the

overhead of communication, that is, processors spend more time on commu-

- nu.

112
nication at a very small MCGS’s only doing a small amount of computation

between subsequent communications.

0 Number of Executed Events per Cycle
Another interesting observation on the left side of the optimal point in Fig-
ure 4.12 is that the computation time still increases although the real computed
events are very few and the communication cost is not included. In order to
know how many events are executed per simulation cycle, Figure 4.13 shows the
average number of executed events between communications on a processor. As
shown in the very left side having a small MCGS, the number of executed events
is very small due to the limitation of the MCGS. Although each processor works
on a very small amount of events at a small MCGS, Figure 4.12 shows that the
computation time increases greatly even after accounting for the communica-
tion overhead. This is because of the effect of the simulation progress that is
presented in the next item considered. If the simulation proceeds very slowly
with few executed events, it yields a very large number of simulation cycles,

degrading the total execution time.

Simulation Progress

Simulation progress represents the proceeding status until the simulation reaches the
termination condition. Even when many events are executed per simulation cycle, it is
difﬁcult to say that the simulation can be done very quickly. The simulation progress

is different from the concept of execution activity. We introduce two performance

113

 

 

 

 

 

 

1800 r ’ ’ ’ —
$38584 +—
1600 r S38417 -9— _
$35932 -°-
_ A o :3)
Number 1400
of 1200 — —
Executed
Events 1000 _ —
per 800 [- a
Cycle
600 L 1 *
400 - -
200 - —
0 l l I l
0 500 1000 1500 2000 2500

MCGS

Figure 4.13: Number Executed Events per Cycle in a processor

metrics to show the simulation progress: one is the number of simulation cycles and
the other is the GVT jump. The number of simulation cycles represents the number
of repeated cycles to complete the given task, and the GVT jump shows the increment

of GVT whenever it is computed.

0 Number of Simulation Cycles
Figure 4.14 shows the total number of simulation cycles during the entire sim-
ulation until completing the task of a given circuit with a given set of input
events. By repeating a series of simulation cycles, the simulation can reach
the termination condition of the simulation. Thus, the number of simulation
cycles represents the simulation progress, i.e., how many times each processor
needs to repeat the computation and communication routine to ﬁnish its own
work. When the MCGS is large enough to have a steady total execution time

as shown in Figure 4.1, the curve of simulation cycle is also in the steady state.

 

114
30000 I I I I I
838584 dr—
S38417 4*-
25000 - S35932 -°— -
Number 20000 _ _
of
Simulation

FL

Cycles 1 5000

:1

10000

I

 

 

CD 0

'—
n A A
V v

 

 

 

5000 l I I I I I I l l
50 100 150 200 250 300 350 400 450 500
MCGS

Figure 4.14: Number of Simulation Cycles

This is because almost all active logical processes are executed among candidate

 

logical processes and the number of executed events is almost the same. The
benchmark circuits used have different steady state values in the number of
simulation cycles; 838584 starts at around 14000, S35932 at around 10000, and
S38417 at around 8500. This is affected by the characteristic of the circuit, es-

pecially its critical path length. As the critical path length increases, in general,

 

the number of simulation cycles increases. Another possible characteristic of a
circuit is the circuit size, i.e., the number of gates in the circuit. According to
the research in [4], circuit size has much more complex relationship with circuit
parallelism than a simple linear relationship. Thus, it is hard to determine the

performance effect of circuit size.

In Figure 4.14, the starting points reaching the steady state occur at different

places for each circuit and begin earlier than the optimal points Of total execu—

 

115
tion time. The starting point is related to the starting point of heavy rollback
situations. In Figure 4.5, the places having the largest jumps are approxi-
mately the positions of the start of steady performance curves of the number of
simulation cycles. This means that the MCGS allows more than enough logical
processes to execute events, causing an increase of rolled back events. The num-
ber of simulation cycles has the same value up to the LP ratio while keeping
the steady performance, because more than enough event executions per cycle
do not contribute the simulation progress, but degrade the execution time. In
contrast, as the MCGS becomes smaller than the steady point, the number
of simulation cycles increases greatly. This is because the genuine events that
do not generate rollback situations cannot be executed due to the very tight
limitation of the MCGS, and the small number of executed events make the

simulation progress slow down, increasing the number of simulation cycles.

GVT Jump

Another metric of simulation progress is GVT jump. The GVT jump shows
how much the GVT value increases whenever it is computed. Since we use an
asynchronous GVT computation method which circulates a token around all
processors, only the master computes the GVT value after collecting the PVTs
from the processors and measures how much the GVT value jumped from the
previous GVT value. Figure 4.15 shows the average GVT jump during the sim-
ulation on the master processor by varying the MCGS. The interesting thing is

that the peak point in the curve of the GVT jump occurs around at the Opti-

 

 

 

 

GVT
Jump

 

     

838584 -*—
S38417 -9—
S35932 *— 7“

 

 

 

O I I l I I I
50 100 150 200 250 300 350 400 450 500
MCGS

Figure 4.15: GVT Jump

mal point of the total execution time. This means that at the optimal point of
the MCGS the simulation progress is the best by executing good events among
candidate logical processes, so that rollback situations do not degrade the sim-
ulation progress and there is no severe loss due to the communication overhead.
The appropriate MCGS encourages lagging logical processes to execute events
and hinders advanced logical processes not to do so, keeping the overall sim-
ulation progress optimal. The GVT jump also represents simulation progress
as does the number of simulation cycles. The S38417 circuit has the highest
GVT jump value, S35932 has the next value, and $38584 has the smallest. The

higher the GVT jump, the shorter the number of simulation cycles.

 

 

 

 

117

4.2.4 Speedup

Speedup is the ratio of the time to execute an efficient serial program for a calculation
on a uni-processor to the time to execute a parallel program for the same calcula-
tion on N processors that are identical to the uni-processor [80]. The serial program
used in our experiment is a sequential program employing a synchronous protocol.
The sequential program of synchronous protocol always executes the smallest event
ﬁrst among many unprocessed events by following the concept of the synchronous
protocol. When the sequential program simulates a large circuit with a large num-
ber of input events, the simulation may take a long time. To make the sequential
program more efﬁcient, a global common priority queue is used as the event queue
for all logical processes instead of a distributed queue for each logical process and an
array heap is implemented to reduce pointer handling overhead. Compared to the
sequential program, the parallel program uses the optimistic protocol. As the data
structure for the optimistic protocol, a linked list is used because it can use the space
efﬁciently in heavy rollback situations which dynamically occur during the Optimistic
simulation. In Section 4.1, more details of event queue implementation are presented.
The difference in the data structure of event queues between the optimistic protocol
and the sequential program may produce some difference in performance. Thus, the
speedup that we use in this chapter is the ratio of the total execution time on an
efﬁcient sequential program to the total execution time on the parallel program of
the optimistic protocol.

In contrast to the sequential program, the optimistic protocol has many com-

 

 

118

plicated features such as parallel processing, Optimistic event processing, rollback
situation, communication overhead, and computation granularity. With that of the
optimistic protocol, in order to accurately compare the total execution time Of the
sequential program, an Optimal MCGS must be used. Since the LP ratio on a pro-
cessor changes as the number of processors varies, the optimal MCGS also changes
according to the different number of processors. Table 4.2 shows the total execution
times of the sequential program (SEQ) and the optimistic protocol (OPT) for 838584,
838417, and 835932 circuits. The table presents the approximately-optimal MCGS
values and the total execution time at values on a different number of processors,
which is obtained after performing several experiments. Because we use a 5 interval
gap of the MCGS to ﬁnd the Optimal value, the values shown in the table may not be
the exact optimal MCGS. In addition, to compare between the total execution time
of the sequential program and that of the optimistic protocol, this set of optimistic
protocol’s experiments does not contain the measurement part of the program which
measures performance in terms of other performance metrics. Thus, the total exe-
cution time of the Optimistic protocol is shorter, here, than that in Figure 4.1. The
effect of taking off the performance measurement part from the program also changes
the optimal MCGS value. This is because the extra computation amount per event
and simulation cycle for measuring other performance metrics is not included in com-
putation time after removing the performance measurements.

As the number of processors increases, the total execution time decreases by
achieving the beneﬁt of parallel processing as shown in Figure 4.16. The 838584

circuit has the highest speedup when the number of processors is 12, 838417 does

119

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Program No. of PBS Circuit
838584 1] 838417 I] S35932
SEQ Time Time Time
P=1 105.21 __ 50.51 153.1:
[—OPT H u MCGS Time H MCGS [Time ]| MCGS Time]
P=2 85 81.22 95 41.15 100 108.08
P=4 85 44.98 60 24.30 120 63.57
P=6 60 33.10 60 18.93 85 45.48
P=8 45 25.02 50 14.58 65 37.41
P=10 35 23.56 40 15.08 50 29.53
P=12 35 23.20 35 14.97 50 27.01
P=14 30 25.60 30 15.20 60 24.30
P=16 20 25.37 25 15.43 50 27.27

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 4.2: Comparison Between the Sequential Program and the Optimistic Protocol

 

7 I I I I I I F
S38584 +—
6 __ S38417 '9— ..
S35932 *— ,
5 — ...
Speedup 4 — /’ J

 

 

2 4 6 8 10 12 14 16
Number of PBS

Figure 4.16: Speedup

120
when 8, and 835932 does when 14. After reaching high-point, the speedup degrades
even in a large number of processors. The major reason is the communication over-
head as described in Figure 4.11. Due to this, parallel processing does not offer much
performance improvement on a very large number of processors.

Another important consideration is that the optimal MCGS decreases as the num-
ber of processors increases as shown in Table 4.2. This is because the LP ratio de-
creases as the number of processors increases, and the appropriate amount of events
per simulation cycle also needs to decrease. However, there are some exceptions:
when the number of processors is 2 and when the number of processors is too large
for a circuit with a high activity rate. If the number of processors is 2, the internal
messages within a processor are the major portion of the messages and the compu-
tation granularity is not affected only by the LP ratio. When the 835932 circuit uses
14 processors, the optimal MCGS does not decrease because a small amount of event
execution per simulation cycle makes the simulation progress slower and the perfor-
mance is degraded. Also, circuit partitioning may affect the performance results as

the number of processors changes.

4.2.5 The Effect of a Circuit Size

By varying the number of PBS, we have studied performance in terms of the
speedup of parallel processing. With a given circuit, the LP ratio changes as the
number of PBS varies, so that the speedup can be observed on a different number of

PBS. In this section, we study how a circuit size affects the performance when the

 

 

121

 

8 I I I
T38584 E—
7 .— D38584 -°— g _.
S38584 *— 5
a 3
6 _ “fl
1;,» 5 ‘

5 r .. _
Speedup /
4 - _
n

j / I I I I

5 10 15 20
Number of PBS

 

 

 

Figure 4.17: Effect of a Circuit Size

number of PBS varies. If we use a different circuit to compare results, circuit char-
acteristics such as the critical path length are different and may result in a different
effect. Thus, in order to keep the same characteristic of a circuit, we use a double
circuit and a triple circuit of a given circuit by multiplying the circuit without any
connection between the multiplied circuits, that is, by adding two or three of the same
circuit in parallel. Compared to the single circuit of 838584, we call a double circuit
D38584 and a triple circuit T38584. The multiplied circuit is evenly and randomly
partitioned into processors.

Figure 4.17 shows the effect of circuit sizes on performance results with the 838584,
D38584, and T38584 circuits. As the circuit size increases through multiplication,
each processor is assigned more logical processes and therefore has a higher LP ratio.
As more logical processes are assigned to a processor, the parallel processing speedup

increases on a larger number of PBS. Because the multiplied tasks of double and triple

 

 

 

122

 

 

P=2 I P=4 | P=6 T P=8 | P=10 I P=12 | P=14 |

 

Synchronous Protocol

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Total Execution Time 60.83 38.61 34.20 31.35 32.30 33.53 34.07
GVT Comp. Time 7.15 8.55 11.16 11.38 13.41 14.53 15.46
Activity Rate (%) 1.3858

l Optimistic Protocol [I I
MCGS 85 85 60 45 35 35 30

Total Execution Time 81.22 44.98 33.10 25.02 23.56 23.20 25.60

GVT Comp. Time 2.61 1.17 0.79 0.66 0.60 0.52 0.48
No. of Token Cycles 23560 6677 4225 3087 2469 1913 1657

Activity Rate (%) 0.8205 1.6398 1.7358 1.7351 1.6860 2.0169 2.0171
Rollback Rate (%) 0.1076 0.9982 1.6842 1.6861 1.5098 5.3299 5.3830

 

 

 

 

 

 

 

 

 

 

 

Table 4.3: Comparison Between the Optimistic Protocol and the Synchronous Pro-

tocol

circuits yield greater amounts of computation required to complete the simulation,

distributing the task into more PEs can produce a greater parallel processing, result-

ing in a higher speedup. If more than enough processors are used, the communication

overhead degrades the performance results as described in the speedup section.

4.3 Performance Comparison Between the Opti-

mistic Protocol and the Synchronous Protocol

Performance results of the employed optimistic protocol are compared to those of

 

a synchronous protocol in this section. While Optimistic protocols allow logical pro-
cesses to execute events optimistically, synchronous protocols make logical processes
execute events only if the LVTs of logical processes are equal to the GVT value.
Thus, in synchronous protocols, computing GVT is the major task to progress the

Overall simulation by facilitating the next smallest event to be executed. In contrast

 

 

 

123

to the complicated operations of the Optimistic protocol, the synchronous protocol
has simple operations to process events. The smallest timestamped event is always
executed ﬁrst in a processor, and each logical process handles events with the FIFO
queue. As for the model implementing the synchronous protocol in this section, a
global common priority event queue is used per processor. All pending events from
assigned logical processes are in the priority queue. After ﬁnding the GVT value, the
events having the same timestamp are dequeued and executed by each processor. In
order to achieve better performance in the synchronous protocol, the global priority
queue is implemented as an array structure as in the sequential program that is used
for speedup.

Table 4.3 shows the performance results of the synchronous protocol varying with
the number of processors on the 838584 circuit. Because the synchronous protocol
uses a barrier synchronization to compute the GVT per simulation cycle, the GVT
computation takes longer time as the number of processors increases. In contrast
to the synchronous protocol, the Optimistic protocol needs the GVT value to check
the termination condition and to perform fossil collection. Thus, computing the
GVT is not signiﬁcant for simulation progress compared to the synchronous protocol.
Since the GVT is computed asynchronously over processors by passing a token, the
optimistic protocol takes very little GVT computation time except in the case of
P = 2 in Table 4.3. In this case, the GVT token is passed frequently between the two
processors, the number of token cycles increases greatly, and the GVT computation
time is rather large compared with that in a larger number of processors. Although

the Optimistic protocol uses a linked list structure for the event queue handling, it

 

 

124

achieves better performance results than the synchronous protocol when the number
of processors is larger than four. The difference is in the GVT computation time,
and the activity rate as shown in Table 4.3. The Optimistic protocol at optimal
MCGSs has better activity rates than the synchronous protocol when the number of
processors is larger than two, whereas the synchronous protocol has the same activity
rate even when the number of processors is varied because it has the same number
of executed events per simulation cycle according to the GVT. Also the table shows
rollback rates at Optimal MCGSs. If the number of processors is more than a certain
number (approximately 12), the rollback rate increases greatly.

A challenging question in performance comparisons of parallel synchronization
protocols is how much the Optimistic protocol can achieve a better performance than
the synchronous protocol. There are three major considerations to compare the per-
formances: the beneﬁt of asynchronous protocol, the beneﬁt of parallel processing,
and the event processing cost. The ﬁrst consideration about the beneﬁt of the asyn-
chronous protocol is to consider whether the application problem is suitable for the
simulation with the asynchronous protocol. The synchronous protocol is good for ap-
plication problems which have many processing events in the same timestamp value.
For example, if a high portion of logical processes have LVT values that are equal to
the GVT value, the synchronous protocol may be as efﬁcient as the asynchronous pro-
tocol. In contrast, asynchronous protocol is good for application problems which have
events in various timestamp values. When only a few logical processes have LVTs
that are equal to the GVT, asynchronous protocols can achieve better performance

by executing events regarding the local clock of each logical process.

 

 

125

Second consideration about the beneﬁts of parallel processing is to consider the
communication overhead compared to the computation amount. If an application
problem requires too much communication overhead compared to the computation
amount, the parallel processing of many processors is no longer a beneﬁt. A ﬁnal
consideration about event processing cost is how much the other processing steps are
necessary during the simulation. Data structures to handle events and an appropriate
scheme for event queues are one of the considerations. The optimistic protocol in—
cludes the extra cost for rollback handling and storage management. The conservative
protocol contains the extra processing of link clocks, local clocks, and null messages

to keep the correct local causality constraints.

4.3.1 Timing Granularity

When we compare performance results between optimistic and synchronous proto—
cols, the Optimistic protocol does not give a dramatically higher performance than
synchronous protocol in Table 4.3 although it allows processors to execute events
asynchronously and Optimistically. Not only the employed synchronization protocol,
but also the assumption applied to the application problem can affect simulation
performance. Under a different assumption, performance results of the optimistic
protocol can increase greatly compared to those of the synchronous protocol. In
particular, the way to build up timestamps of events according to the time delay in
logical processes affects the simulation performance.

In the domain of logic simulation, the delay of a gate is the time required for

126
an input signal change to be reﬂected at the output. Timing granularity [7] is a
delay-based model for logic circuit simulation, ranging from very ﬁne-grained timing
(such as 0.1 ns delay on a gate) to coarse-grained timing (such as unit-delay and zero-
delay on a gate). Fine-grained timing has small resolutions of simulation time and
a large number of possible delays. In [7], Bailey reported that the timing resolution
becomes coarser, as either the number of possible delay values decreases or the time
resolution increases. Unit-delay timing is extremely coarse-grained timing. Bailey [5,
6] also developed two formal models to compare the effects of timing granularity
on circuit parallelism, and showed that coarser-grained timing could result in higher
levels of circuit parallelism. Considering the effects of timing granularity and the
characteristics of protocols, according to [7], to use a synchronous protocol is sufﬁcient
if coarse timing models are used; if ﬁne timing is used, an optimistic protocol should
strongly be considered to use. As the number of processors increases, a synchronous
protocol may result in bad effects; an Optimistic protocol is likely scale better. With
the deﬁnite relationship between timing resolution and circuit activity, the application
problem with a speciﬁc delay-model can affect the performance of parallel protocols.
Therefore, the performance is not only affected by parallel processing considering the
barrier synchronization and the communication overhead, and the event processing

cost regarding designs of protocols, but also inﬂuenced by timing granularity.

127

4.4 Summary

In this chapter, we presented experimental results of Time Warp to study effects
of computation granularity. As the target machine for parallel simulation of VLSI
circuits, the IBM SP2 has been used since it can provide a large number Of fast
processors with large amount of local memory and has a fast interprocessor com-
munication subsystem. We measured several performance metrics and categorized
them into total execution time, execution activity, rollback situation, communication
overhead, simulation progress, and speedup. According to the experimental results,
computation granularity signiﬁcantly affects the performance of Time Warp by con—
trolling the degree of optimism and communication overhead. A large computation
granularity increases the degree of optimism, resulting in frequent rollback situations.
In contrast, a small computation granularity causes a high communication overhead
and a slow simulation progress. Therefore, we conclude that computation granularity
is a very important factor to be tuned properly to achieve the optimal performance

of Time Warp on parallel and distributed systems.

 

Chapter 5

Event Scheduling Schemes for an
Optimistic Protocol on Distributed

Systems

In simulating large VLSI circuits on distributed-memory systems, one of the impor—
tant considerations for an optimistic protocol is to schedule events especially when
the LP ratio is larger than 1. This chapter considers two major factors to design an ef-
ﬁcient event scheduling strategy for Time Warp on distributed-memory systems. The
ﬁrst factor is computation granularity, which is the amount of computation to be pro—
cessed in a processor between two consecutive interprocessor communication points.
As shown in the previous chapters, by adjusting the computation granularity to be
larger than a certain degree, we can prevent communication overhead from dominat-

ing simulation performance. A large computation granularity, however, could increase

128

 

129

the number of rollbacks, wasting the simulation time to handle rollbacks. That is,
while processors execute a large number of events without communication, processes
could propagate many immature events and therefore the protocol should recover the
earlier system state from the erroneously—advanced state when an event arrives whose
timestamp is smaller than the current LVT. A considerable amount of overhead can
occur in the rollback situations because the past state of each process should be saved
and recovered by re—computations. As a consequence, the appropriate computation
granularity might be small enough to prevent Time Warp from progressing events
over-optimistically, while keeping large enough to minimize communication overhead.

The second factor considered is the balancing progress of logical processes. Some
logical processes may advance too far ahead of others, possibly leading to inefﬁcient
usages of memory and excessive rollbacks, especially for large application problems.
By adjusting the computation granularity, we can control the degree of optimism
apprOpriately as described in the ﬁrst factor. However, within the range of the
given computation granularity, logical processes assigned to the same processor would
progress at different speeds and this may cause more rollback situations. If we con-
trol the speed of each process by giving more chances of executing events to slower
processes, we may reduce the frequency of rollback situations. We study two schemes
of balancing the progress of processes: one of which is to restrict a process to execute
events no more than the given number of times, and the other is not to allow an event
to be executed if its timestamp is far ahead of other events in different processes. The
details of those schemes are studied in Section 5.1, which was published in [25].

In Sections 5.1, the event scheduling schemes are presented considering the compu-

 

130
tation granularity and balancing progress of processes. Section 5.2 shows the experi-
mental results of the effects of computation granularity and event scheduling schemes

on a distributed system. The concluding remarks are presented in Section 5.3.

5.1 Event Scheduling

As one of popular simulation applications, VLSI circuits involve a large number of
Objects performing a small amount of computation for each communicating event. A
VLSI circuit usually has several tens to more than hundreds of thousands of gates (or
logical processes). The manipulation of an event requires around a hundred machine
instructions; its event granularity is very small. In contrast, distributed—memory
systems consist of a much smaller number of big processors and the interprocessor
communication latency is large. In general, a distributed-memory system has up to
hundreds of processors. Thus, it is anticipated that most VLSI circuit simulations
will contain more than thousands of logic gates on a processor. Therefore, each pro-
cessor should schedule the unprocessed events of a large number of assigned processes
efﬁciently in order to achieve good performance. As we mentioned in the beginning
of this chapter, there are two important considerations as we design an event schedul-
ing strategy: the computation granularity and the progress balancing. By adjusting
the computation granularity, the number of events executed in a simulation cycle
can be controlled. Besides controlling the computation granularity, an efﬁcient event
scheduling strategy should be a scheme that can balance the progress of logical pro-

cesses by considering what events are executed ﬁrst and deciding the order of event

 

 

131
executions.

The major characteristic of Time Warp is that each process can asynchronously
execute its unprocessed events without considering causality effects. Due to this char-
acteristic, some logical processes may advance too far ahead of others, possibly leading
to the inefficient use of memory and excessive rollbacks. This situation might happen
particularly for a large-scale simulation application with a small event granularity,
such as VLSI circuits. By using the scheduling scheme of balancing the progress of
logical processes, we can prevent the simulation from propagating incorrect compu-
tations too far ahead and reduce seriously excessive rollback situations.

We consider two balancing schemes: the Balancing Progress by Execution Chance
(BPEC) scheme and the Balancing Progress by Virtual Time Window (BPVTW)
scheme. The BPEC scheme controls the progress of a process by limiting the number
of events executed in a process between interprocessor communications, while the
BPVTW scheme controls the progress by using a virtual time window in a process,
which allows events to be executed only whose timestamps are within the interval of

the virtual time window.

5.1.1 Balancing Progress by Executing Chance (BPEC)

The BPEC is a scheme that balances the progress of logical processes. It limits the
number of chances (opportunities) of executing events in a logical process. In this
scheme, we have two controllable pafameters: one parameter for the computation

granularity and the other parameter for balancing progress.

 

 

132

As the ﬁrst parameter to control the computation granularity, this scheme sets
the maximum number of events that can be executed by a processor between com-
munication points. This quantity is called the Maximum Computation Grain Size
(MCGS). Within the limit of the MCGS, the processor scheduler selects events from
logical processes through the central scheduling priority queue. That is, each proces-
sor selects and executes events up to the limit of the MCGS. The scheduler counts
the number of events executed in the processor since the last communication point
while selecting events from the priority queue. Once the number of executed events
reaches MCGS, the processor stops scheduling events and performs the interprocessor
communication. The second parameter is the maximum number of chances needed
to execute events in a logical process. When the scheduler selects events from the
logical processes through the priority queue, it also counts the number of events exe-
cuted in a logical process. The purpose of counting events per process is to balance
the progress of processes. By giving the appropriate execution chances to processes,
we can achieve the beneﬁt of using parallel optimistic protocol. As well as boosting
only processes far behind, the overall progress Of processes can advance by Optimisti-
cally executing all processes. The maximum number of chances that a logical process
can execute events between interprocessor communications is called the Balancing
Progress Chances {BPC}. If a logical process no longer has unprocessed events or
it has already executed as many events as the BPC, then the logical process is not
allowed to enqueue its events into the priority queue.

By controlling the maximum allowable chances of executing unprocessed events

per logical process, the BPEC scheme not only balances the relative speed of the

133
processes’ progress, but also controls the computation granularity. In the case that
MCGS is set to be larger than the optimal value of grain size, the computation
granularity can be adjusted by using the proper BPC value. When all logical processes
are individually controlled by the BPC, the total number of executed events in a
processor cannot increase too much, and the granularity between two communication
points is to be adjusted properly. The BPC is also able to control the event schedule
by giving an equal number of execution chances to the logical processes. Instead of
only executing events on a speciﬁc logical process, every process has the same chance
to execute its events. Thus, the BPEC scheme contributes to progress balancing
of logical processes and reducing the consequent rollback situations. Due to the
characteristic of conﬁning the strict upper bound of the number of events that can
be executed by a processor between interprocessor communications, this scheme is a

static method of controlling computation granularity.

5.1.2 Balancing Progress by Virtual Time Window
(BPVTW)

Unlike the BPEC scheme, the BPVTW does not have the strict upper bound of exe-
cution chances for each logical process before interprocessor communication. Rather,
the relative progress of logical processes is controlled with regard to the timestamp of
events. That is, the scheduler prevents a process from executing events if the process
is far ahead of the other processes in virtual time. For this purpose, we employ a

virtual time window whose base is the GVT such that logical processes can execute

 

 

 

134
only events having timestamps within the interval of the virtual time window. This
virtual time window is called the Balancing Progress Window and the size of the
window is denoted by BPW. Thus, the Balancing Progress Window has the interval
[GVT, GVT+BTW].

In this scheme, the overall progress is controlled by each logical process not by
each processor. As in the BPEC scheme, the processor scheduler selects the smallest
timestamped event from the central scheduling queue. The scheduler, however, does
not count the number of executed events. When a logical process has an unprocessed
event and its timestamp is in the interval [GVT, GVT+BT W], the event is enqueued
into the central scheduling queue. If there is no unprocessed event whose timestamp is
in the interval [GVT, GVT+BPW], the process does not have any chance to execute
events until the next cycle after communication. In other words, the scheduler handles
events until there are no available unprocessed events in the scheduling queue. Once
the scheduling queue is empty, the processor communicates with other processors.

The computation granularity in the BPVTW scheme is indirectly controlled by the
BPW value. Since there is no static limitation of the maximum number of events to be
executed by a processor between any two communication points, we do not employ the
MCGS. Instead, the BPVTW scheme concerns the real progress of processes with their
timestamps in order to balance the progress of logical processes. The computation
granularity is consequently adjusted when the scheduler restricts scheduling events
according to the BPW value. Because there is not a limited number Of executed
events, the actual computation granularity may vary depending on the progress of

the simulation. In this sense, the BTVTW scheme is considered a dynamic method

 

135
of controlling the computation granularity.

A similar concept has been proposed as Moving Time Window (MTW) [75].
Although the MTW mechanism didn’t provide considerable improvement on some
cases [37], as it will be shown from the experimental results in Section 5.2, the
BPVTW scheme achieves a good performance improvement for large-scale and small-
event-granularity applications on parallel and distributed systems due to its ability to
control the computation granularity. In addition, for those applications with a large
LP ratio, the BPVTW scheme provides the balancing effects as well as control over

the degree of Optimism.

5.2 Experimental Results

This section presents the experimental results on a cluster of six DEC Alpha work-
stations interconnected by a DEC GIGASwitch through FDDI. The DEC Alpha 3000
workstation has a 133 MHz clock rate and a 32 MB memory. The GIGAswitch sup-
ports a peak data transfer rate of 200Mbits per second. As benchmark circuits, we use
several circuits from the ISCAS ’89 benchmark suite [15] and apply a pre-processing
step to re-conﬁgure the circuits into a maximum of 3 input and 2 output ports as in
Chapter 3. The added gates due to the pre—processing procedure have a zero delay
and the normal gates have a unit delay for the logic simulation. In the circuits, we
have the D f / f clocking interval set to be the same as the input clocking interval. The
logical processes are randomly partitioned into processors. For parallel processing and

the interprocessor communication on the distributed system, we use PVM (Parallel

 

 

136
Virtual Machine) 3.3 [65]. Time Warp is implemented on the distributed system as

a master-slave model.

5.2.1 The Implementation Model

The basic model of implementation for Time Warp in this chapter is the same as the
model on the distributed-memory system in Chapter 3 except the scheme of GVT
computation and the model of simulation cycle. We use the same token passing
mechanism as in Chapter 4 to collect PVT values asynchronously from processors
and to compute the GVT value. Without any barrier synchronization per simulation
cycle, the simulation cycle repeats asynchronously on each processor. Processors are
allowed to execute events according to their own progresses.

As the data structure of event queues like the model of previous chapters, each
logical process has its own event queues according to the input ports. In the event
queues, all unprocessed events are pending and the previously-processed events are
stored as well. Instead of using a separate output event queue, the event queues are
used for both input and output events. The central priority queue per processor is
used for scheduling events efﬁciently. The detailed event scheduling depends on the

scheduling schemes that are applied.

5.2.2 Performance Measurement with the BPEC Scheme

This subsection explains the experimental results when the BPEC scheme is applied

as the event scheduling technique to Time Warp on the distributed system.

 

 

137

 

 

 

 

 

 

 

 

900 ’ I I I ' I I
: S38584, BPC = 1 4—
800 :- S38584, BPC = 00 4.. - .A
.I 535932, BPC=1 .9.—
700 L 335932 BPC = 00 A” _.* _
S38417,BPC=14_
600 i S38417, BPC = 00 _
Total 5
Execution 500 ;’
Time L
400 _
(A
300 _
200 _
100 - I I
0 1000 2000 3000 4000 5000

MCGS

Figure 5.1: Total Execution Time with the BPEC Scheme

Effects of the MCGS

To study the effect of varying MCGS, we have performed experiments with three
different circuits. Figure 5.1 shows the total execution time for the 838584 (curves
with stars), 835932 (curves with triangles), and 838417 (curves with dots) circuits as
a function of MCGS. The solid curves represent the case of BPC = 1 and the dotted
curves show the case of BPC z 00. The number of input vectors is 50, and the BPC
is set to 1 and the unlimited value. The three circuits have the following LP ratios;
6733, 5357, and 6074, respectively.

As we mentioned in the previous section of the BPEC scheme, the computation
granularity is controlled by values of MCGS and BPC. To focus on only the effects
of MCGS, in this subsection we consider only the dotted lines where the BPC has
the unlimited value in Figure 5.1. The other solid lines are considered in the next

subsection. As shown by the dotted lines, the total execution time can be minimized

138

 

 

 

 

 

 

I I . - I l . _ . I
6e+06 - S38584, BROE 1 4— —
«35932;ch =1 A—
5e+°6 ’ S35932 BPC = 00 A - *
_.A'S3841’7, BPC = 1 «—
4e+06 — - S38417, BPC = 00 - _
Number
of 3e+06 — *_-' A 13
Rollbacks 3, ' .............. .
2e+06 — ' A" , ........... ﬁ
1e+06—I/-_O..._2...,..--"' : _
,«4 '
0 A“ l l 1 I 1
0 1000 2000 3000 4000 5000
MCGS

Figure 5.2: Number Of Rollbacks with the BPEC Scheme

when the MCGS is tuned properly. When MCGS is near to 1, the total execution
time increases very sharply as MCGS decreases. In this situation, the computation
granularity determined by MCGS is so small that the simulation spends most of the
time in the frequent communications. It implies that the MCGS should be larger than
such a small value and that a substantial amount of computation could be performed
between interprocessor communication points. The apprOpriate MCGS will minimize
the degrading effects of the communication overhead. In the extreme case of MCGS
= 1, the computation granularity is the same as the event granularity.

On the contrary, the total execution time increases linearly as the MCGS increases
after 200, approximately. It is because a large computation granularity may increase
the number of rollbacks. To observe this relationship, Figure 5.2 shows the number
of rollbacks varying the value of MCGS. Again, the dotted lines are for the case when

the BPC is set to the unlimited value and the solid lines are when the BPC is set to

 

139

 

I I l I

 
    

_ D
600 513207, BPC = 1 +—
550 _ 1313C7==22 4*—
BPC = 5 A—

500 ’ BPC z oo 6- —

   

450
Total

Execution400 '
'Tune 350 _
300

250

200

 

 

.L
A

 
 

 

I I l l

500 1000 1500 2000
h4CK3S

Figure 5.3: Total Execution Time with the BPEC Scheme for 813207 Circuit

1. As mentioned before, we focus on the dotted lines in this subsection to focus on
the effect of MCGS. In the dotted lines, the number of rollbacks is very small when
MCGS is near to 1 because the overall simulation advances in a conservative fashion.
As MCGS increases, the number of rollbacks increases almost linearly. This result
implies that the larger the computation granularity is, the higher the probability is
that immature events are propagated to other processors.

As a consequence, the appropriate value of MCGS should be as small as possible,
but larger than a certain value so that the communication overhead can be reduced
as much as possible. As for the cases in Figure 5.1, the optimal performance occurs

when the MCGS is around 200.

Effects of the BPC

In this subsection, we consider all the curves in Figure 5.1 to study the effects of

the BPC. When the MCGS is much larger than the optimal value, the simulation

140
performance depends on the BPC value. As we discussed above, when the BPC has
the unlimited value, the total execution time increases as the MCGS increase. In this
case, the computation granularity is controlled only by the MCGS. When the value of
BPC is small, say 1, the curve of total execution time goes to a steady performance
regardless of the MCGS. Figure 5.3 shows the simulation results with’the 813207
circuit. In the simulation, 100 input event vectors are used. Like the circuits in
Figure 5.1, the curves in Figure 5.3 also reach steady performance when the BPC
has small values, such as 1, 2, and 5. At this steady state, the BPC value controls
the computation granularity because of the limited number of executed events per
process. With a small BPC, processes have limited chances to execute events although
they have more unprocessed events. Thus, a fast process is prevented from executing
too many events so that the process cannot go too far ahead of the others, and we
are then able to obtain the balancing effect with a small BPC value. In the interval
between the optimal and the steady states, the computation granularity is controlled
not only by the BPC, but also by the MCGS. The depth of the valley of each curve
at the optimal performance is deeper as the value of the BPC becomes larger. As
shown in Figure 5.3 for the 813207 circuit, the simulation results with the BPC = I

produce better performances than when the BPC is larger.

Effects of the LP Ratio

In order to study the effects of the LP ratio, we stage a set of simulations with three
variations of the 835932 circuit. The experimental results are given in Figure 5.4.

The D35932 circuit and the T 35932 circuit are the circuits that are constructed by

141

 

600
550
500
450
Total 400

Execution350
Time 300

 

S35932 +- -
D35932 A—
T35932 +— 7

250 r
200 r r
150 r

100 l L J l l
0 1000 2000 3000 4000 5000 6000
MCGS

 

 

 

 

 

Figure 5.4: Effect of Computation Granularity on Total Execution Time by Varying
LP ratio

connecting the original S35932 circuit double and triple times, respectively. Therefore,
the T35932 circuit has around a hundred thousand logic gates. Since the two circuits
have logical gates multiple times of the 835932 circuit, they have multiple times higher
LP ratios than the 835932 circuit. For all three curves, the BPC is ﬁxed at 1. As
before, each curve shows the optimal and the steady performances as the MCGS
varies. However, unlike the 813207 circuit which has the steady performance with
the optimal performance on all the ranges of MCGS when the BPC is 1, the steady
performances of the three circuits in Figure 5.4 are much higher than their optimal
performances. The larger the LP ratio is, the larger the difference is. Interestingly,
even if tens of thousands of logic gates are assigned in a processor, the simulation
shows reasonably good performance on distributed systems if the MCGS is tuned
appropriately. These results imply that the effects of computation granularity on the

performance become more signiﬁcant as the LP ratio becomes large enough.

142

 

l W I I

S35932 «a— ‘
D35932 +— _
T35932 6-

H H H r—A

to 11> O) 00

O O O O

O O O O
I I I I
I l

Tkmal
Executiofi000 _ T

Time 800 t —
6001? ‘.
400 .— = —
200 : _ , —

0 I l l l I
0 2000 4000 6000 8000 10000
BPW

 

 

 

Figure 5.5: Total Execution Time with the BPVTW Scheme

5.2.3 Performance Measurement with the BPVTW Scheme

Additional sets of experiments have been performed with the BPVTW scheme on the
distributed system. Figure 5.5 presents the experimental results in terms of the total
execution time, varying the BPW. The three circuits, S35932, D35932 and T35932,
are used, which are described in Section 5.2.2.

In the ﬁgure, all three circuits show good performances up to the BPW value
of 2000. Unlike the BPEC scheme which shows sharp increases in total execution
time for small values of BPC, the BPVTW scheme shows good performance even
with very small values of BPW. It is because each process adds the unit delay of
virtual time in executing events, and many events exist in the narrow virtual time
interval. The interval of BPW which contains the optimal performances is quite
wide in comparison to the sharp optimal range in the BPEC scheme. Thus, the

progress balancing between logical processes with the virtual time window is effective

143
and works well in the wide interval of BPW. This is because the BPVTW scheme
concerns the actual progress of processes with their timestamps.

On the contrary, when the BPW is larger than 2000, the curves of total execution
time increase with different rates as the BPW increases. In this larger window interval,
the virtual time window does not take an important role of balancing progress of
logical processes because there are many executed events in this interval. Also, the
computation granularity is not restrained by the BPW. Thus, too many unprocessed
events are executed, propagating lots of immature events. The larger the LP ratio is,
the more unprocessed events are generated. In the ﬁgure, the T35932 circuit shows the
worst performance when the BPW is larger than 2000. The D35932 circuit is worse
than the 835932 circuit. Therefore, when the BPVTW scheme is used as the event
scheduling method, the BPW must be tuned properly. Otherwise, the simulation
performance will be as bad as if there were no limit of the MCGS and the BPC in
the BPEC scheme.

As discussed in Section 5.1, the BPVTW scheme does not control the computation
granularity in a direct way as in the BPEC scheme. Rather, by using the virtual time
window per logical process, the computation granularity is indirectly controlled while
the progress of logical processes are balanced. In order to observe the effects of
the BPW on the computation granularity, the average number of executed events
between interprocessor communications are also shOwn in Figure 5.6. In the ﬁgure,
where the simulation shows the almost Optimal performance, the average number of
executed events per communication is around 400 or 700, depending on the circuit.

This amount is similar to the optimal grain sizes for the same circuits as shown in

 

 

144

 

 

 

 

 

 

I
4000 835932 «a— —
D35932 *—
3500 T35932 49— -
3000 3
Average
Number 2500 _
of
Executed 2000 ‘
Events
1500 _
1000 _
500 -
0 l L I I
0 2000 4000 6000 8000 10000

BPW

Figure 5.6: Average Number of Executed Events with the BPVTW Scheme

Figure 5.4.

5.3 Summary

Event scheduling is an important design issue of optimistic protocol on distributed
systems. When a huge application problem has the high LP ratio, the number of
computed events between any two communication points could be controlled by com-
putation granularity and the event execution order. By balancing the amount of
computations with the frequency of communication and controlling the optimistic
tendency, computation granularity signiﬁcantly affects the performance efficiency of
distributed optimistic protocol. In addition to the computation granularity, the event
scheduling scheme which adjusts the progress balance of processes among the assigned
logical processes should be well designed to improve the performance.

By considering the characteristics of the application problem of logic circuits and

 

145
the Optimistic protocol, we propose two efﬁcient event scheduling schemes on the op-
timistic protocol: the BPEC scheme and the BPVTW scheme. The BPEC scheme
limits the number of executed events per logical process between interprocessor com-
munications by using the BPC parameter. The BPVTW scheme controls the progress
of processes in virtual time by using the BPW parameter, and prevents processes from
executing events far ahead of other processes.

We experimented Time Warp considering the design issues of the control of the
computation granularity and the event selection schemes on a cluster of DEC Alpha
workstations. According to our experimental results, both schemes show good per—
formances when the computation granularity is adjusted properly. Otherwise, com-
munication overheads or excessive rollback situations may signiﬁcantly degrade the
simulation performance. The results also showed that the progress balancing among
logical processes have signiﬁcant effects on obtaining the Optimal performance. In
addition, it is shown that the LP ratio and the characteristic Of the simulated circuit

affect the simulation performance.

 

 

Chapter 6

Comparison and Analysis of

Parallel Asynchronous Protocols
on Massively Parallel SIMD

Architectures

In general, MIMD machines have been used as parallel processing environments for
distributed event driven simulation. Parallel processing on MIMD machines has ad-
vantages over SIMD machines in several aspects, such as general computational ca-
pabilities, larger-grain parallelism, and asynchronous computation on each processor.
However, one signiﬁcant drawback of a MIMD machine is that it usually has a much
smaller number of processors than a massively parallel SIMD machine and pays a

higher communication cost for message passing. The SIMD architecture is the only

146

147

alternative under the current available technology, in order to to achieve massive
parallelism with several thousands of processors. It offers signiﬁcant advantages over
the MIMD architecture in VLSI circuit simulation in case that the number of gates
is smaller than the number of processors on the SIMD machine. The VLSI circuit
simulation requires a small-grain operation of a gate, and a large circuit requires a
large number of processors, making it well suited for a massively parallel SIMD envi-
ronment. The SIMD architecture also offers simpler synchronization, which is crucial
to several aspects of the performance of parallel simulation. In contrast to a MIMD
environment, a SIMD environment has no serious problem for the circuit partition
and load balancing because a logical process is assigned to a processor.

In this chapter, architectures of existing massively parallel SIMD machines are
analyzed for parallel gate-level logic simulation. The considered parallel protocols
of parallel discrete event simulation are synchronous, conservative, and optimistic
protocols. The Connection Machine CM-2 having up to 64K processing elements and
the MasPar MP-l having up to 16K processing elements are used as the target SIMD
machines.

The objective of this chapter is to understand the effect of machine characteris-
tics of SIMD machines on the performance of parallel simulation and to help choose
a massively parallel SIMD architecture for efﬁcient parallel logic simulation. In the
comparison of the two machine architectures, broad aspects of the machines are con-
sidered, such as interprocessor communication methods and array access times. Ac-
cording to the experimental results, the MP-l gives approximately two times faster

performance than the CM-2 for up to 16K gate benchmark circuits, while the CM-2

 

 

148
has the advantage over the MP-l for a circuit with a much larger number of gates. The
experimental work was published in [24]. One severe problem for parallel logic simu-
lation on the massively parallel SIMD machines is that the size of the local memory
is not enough for event queues. To cope with the problem, the Moving Time Window
(MTW) technique [75] is applied to conservative protocol and optimistic protocol as
a memory handling mechanism.

The remainder of this chapter is organized as follows. Section 6.1 describes the
data structure of event queues for the optimistic protocol to handle rollback situa-
tions efﬁciently on the SIMD machines. Section 6.2 presents the performance of the
instructions related to logic simulation on both the CM-2 and the MP-l. In addition,
comparisons and descriptions of architectures on massively parallel SIMD machines
are discussed in Section 6.3. Section 6.4 gives experimental results on both the CM-2
and the MP-l by using synchronous, conservative, and optimistic protocols. Based
on the experimental results, the SIMD machine architectures are analyzed in view of

parallel logic simulation.

6.1 Event Queue Manipulation

Eﬂicient queue manipulation is essential for good performance of optimistic protocol
due to rollback manipulation and space overhead problems. As an approach, a single
queue scheme based on immediate cancellation has been proposed in [28], but it has
been shown that it is not suﬂicient due to its inherent fossil collection overhead and

sequential search problems. An event queue handling scheme, called CBS Q ( Circular

 

 

149
Binary Search Queue), is used in this chapter as a data structure for fast event queue
manipulation. The data structure is based on the scheme of immediate cancellation.
A CBSQ is a data structure which allows binary search on a circular list. There are
three pointers for each CBSQ: Bottom, Front, and Rear. Bottom is a pointer to the
remaining oldest event in CBSQ. Front points to the position of the ﬁrst unprocessed
event in CBSQ, while Rear is a pointer to the next available element for a new coming

event. A binary Operation performs on CBSQ to ﬁnd a position of an event.

 

an old event

Case A. Case B.
- anew event

 

Rear Front
Rear Rear
3 j . Front
Front Front Bottom ' Bottom

       

Figure 6.1: A CBSQ Structure

 

 

 

A binary search is needed to insert a new incoming event into a CBSQ at an input
port as shown in case A in Figure 6.1, by just moving Rear pointer. If the timestamp
of the new event is less than the timestamp of the event pointed by Front, that
is rollback situation, all of the events placed from Front to Rear are automatically
eliminated by just moving both Front to the new event and Rear to the next place of
the element that contains the new event. This case is shown in case B in Figure 6.1.

That is, we do not need any additional Operation to delete wrong events by using

 

the CBSQ. When rollback occurs, binary searches on event queues are performed for

rollback adjustment in CBSQs which are not even directly involved in the rollback.

150
In this case, Fronts are moved down to the event with timestamp equal to or just
greater than the timestamp of the new LVT. Fossil collection also requires the binary
search to ﬁnd a boundary to delete unnecessary processed events. Bottoms are moved
to point to the event having the smallest timestamp which is greater than the current
GVT, whenever there is at least one full queue.

On the average, it takes O(log2n) time to do a CBSQ search Operation on each
queue, comparing the case of one linear queue per gate [28] of C(mn), where n is the
number of events on a queue and m is the number of ports per gate. For a circuit
with a maximum fan—in p, we need (2 + q)p binary search operations on the average
at each simulation cycle, where q represents an average frequency of fossil collection
(0 _<_ q s 1). The value of q varies, depending on the situation. For example, on
the system has a unlimited memory size, Time Warp does not need to perform a
fossil collection and q is zero. Since CBSQ is in each port of a gate, it reduces lots
of cost in the queue management of Time Warp for searching the proper position
especially when an event with a smaller timestamp arrives in an event queue and
it makes a rollback situation. Using CBSQ is efﬁcient for parallel logic simulation
on massively parallel SIMD machines, because the number of accesses in a queue is
logarithmically smaller than in the single event queue. Thus, Time Warp with the

immediate cancellation can be efﬁciently handled by the CBSQ of each port.

 

 

 

151

 

Table 6.1: Performance of Instructions

6.2 Related Instructions and their Performances

on SIMD machines

Since the performance of logic simulation considerably depends on the execution
times of machine instructions, it is essential to compare the performance on MP-1
and CM—2. The main instructions are interprocessor communication, global minimum,

and array access operations.

0 Interprocessor communication instruction: Interprocessor communica-
tion is used to propagate events between any two gates by sending a message
from a selected processor to a destination processor. The router instruction is
used for global communication in the MP-l, while the send instruction is used

in the CM—2.

Global minimum instruction: The global minimum is used to compute
GVT by obtaining the minimum local virtual time Of all active gates. The
SIMD machines can achieve the global minimum by using an instruction within

a short amount of time compared to other machine architectures.

Array access operations: To use the given local memory efﬁciently and

improve the performance, event queues are implemented by array structures.

 

152
Array access operations affect the performance of manipulating event queues

by reading and writing events of an array.

Table 6.1 shows the execution times of the instructions. MPL(Massively Parallel
Language on MasPar) [53, 26] is used for the measurement on the MP-l, while C / Paris
is used on the CM-2. As a unit of measurement, ,asecond is used. The destination
processors are randomly determined to get the performance of routing instruction in

Table 6.1. The MP-l is about 2 and up to 6 times faster than the CM-2.

6.3 Comparisons of Massively Parallel SIMD Ma-

chines

The unique characteristic Of the SIMD machine is that every instruction is synchro-
nized over all processors with different data by the front-end processor. The simula-
tion cycle which is repeated to complete the given tasks of simulation is synchronized
over all of the processors. In view of logic simulation, when the number of processors
and the memory space for event queues are enough to simulate a circuit and each
processor has a good capability on a SIMD machine, better performance for a circuit
with a large number of gates can be obtained. The CM-2 is known as a parallel ma—
chine with a large number of small grained processors, while the MP—l is a machine
with a relatively large processor memory size compared to other SIMD machine. In
this section, we discuss machine-dependent characteristics in logic simulations on the

two SIMD machines.

 

 

153

6.3.1 Queue Space

One of the important factors for efﬁcient parallel logic simulation is to have enough
event queue space because, in general, queue space determines the number of events
to be used for the simulation. Each processor on MP-l and CM-2 has 16K and
8K bytes, respectively, as the local memory size. The MP-l, when MPL is used,
provides only about 10KB out of 16KB for user space, while the CM-2, when C / Paris
is used, supports almost 8KB as user space. Under the situation with the limited
local memory space, processing a large number of input vectors may cause overﬂow
in event queues for conservative simulation and optimistic simulation. The Moving

Time Window (MTW) technique [75] is used to prevent the queue overﬂow.

6.3.2 Parallel Virtuality

The number of available processors is another signiﬁcant factor when each processor is
assigned to a gate. The CM-2 provides up to 64K of physical processors and offers the
VP (Virtual Processor) ratio concept for larger circuits than 64K, which indicates how
many times each physical processor must perform a certain task in order to simulate
the appropriate number of virtual processors [85].

The MP—l has a smaller number of processors than the CM-2. The number of
physical processors of the MP-l family is only from 8K up to 16K. Using a larger local
memory on MP—l, the virtual processor concept may be also implemented manually
instead of by using the system support. When implementing the virtual processor

concept, an important consideration is the partitioning of a benchmark circuit into

. .3.

154

 

 

 

 

[ I 32 bits [ 64 bits [[ Congestion effect ]
CM-2 140 190 7.7 (6.9, 8.4) times
MP—l 28 43 14.1 (15.1, 13.1) times

 

 

 

 

 

 

 

 

Table 6.2: Comparison of Congestion Effects

processors of the MP—l. If one processor can be assigned for several gates, the MP-l
may also cover a circuit with a large number of gates. However, the possible memory
size per virtual processor will decrease by half whenever the VP ratio increases, and
the entire processing time will be slow to process the assigned gates one by one.
On the SIMD machine, assigning several gates into a processor may result in a high

overhead due to the small memory space and the synchronized characteristic.

6.3.3 Communication Scheme

The CM-2 [84] is a SIMD machine whose interprocessor communication topology is
hypercube, while MasPar’s topology is mesh and router. Each CM-2 processor chip
contains one router node which serves the 16 data processors on the chip. The router
nodes on all the processor chips are wired together to form the complete router
network. The algorithm used by the router can be broken into stages called petit
cycles. The delivery of all the messages for a send operation might require only one
petit cycle if a processor is active, but if all processors are active then the execution
time is about 8 times slower than the one petit cycle time.

In MasPar, the Global Router is a bidirectional communication instruction and

a circuit switched style network organized as the 3 stage hierarchy of crossbar

switches [14]. Each nonoverlapping square matrix of 16 processors in the processor

 

 

 

 

155

array is called a cluster. The system can communicate with all PE clusters simulta-
neously, but it can communicate with only one processor per cluster at one time [53].
A congestion problem occurs when more than 2 processors in the same cluster send
data through the outer routing port at the same time. Even if the router requires a
very short time there is some delay to send several messages from a cluster. The case
of all 16 processors in a cluster wanting to send data at the same time is up to 16
times slower than the case of only one processor communicates per cluster.

Even when the worst case of activating 16 processors occurs, the router instruction
on the MP-l using the external router network is about two times faster than the
send instruction on the CM—2. Table 6.2 shows the performance of routing instruc-
tion when only one processor in a cluster sends a message to its randomly assigned
destination processor. From the evaluated performance, we can see that the MP-l is

more sensitive to the congestion problem in a cluster.

6.3.4 Processor Capability

A massively parallel SIMD machine contains numerous small-grain processors. Each
processor has a large on-chip register set and all computations operate on the registers
on the MP-l [61]. Most data movement within each processor occurs on the internal
processor 4-bit Nibble bus and the Bit bus. For example, during a 32-bit integer
instruction, a series of operations on successive 4-bit nibbles to generate the full
precision result requires 8 clocks. In contrast, each processor of CM—2 is 1-bit serial,

which causes an extremely long time to perform the read or write operations in the

 

 

 

156
array.

These capabilities of the processor affect array access time on SIMD machines,
and the CM-2 especially takes a long time to access an element of the array. Since
logic simulation requires a lot of event handling, that is, accessing events, comparing
events, and checking the queue boundary, the performance of the two machines are

signiﬁcantly affected by processor capabilities.

6.4 Experimental Results

We simulated combinational circuits from ISCAS ’85 circuit suite [16] as benchmark
circuits, including C1355, C1908, C6288, and C7552. Through a pre-processing step,
the benchmark circuits are re-conﬁgured into the maximum 3 input ports and 2 output
ports in order to take an advantage of the characteristic of the SIMD machine. The
synchronous, conservative, and optimistic protocols are implemented on the MP-l
and the CM-2, and experimented with 1000 input event vectors by using the MTW
mechanism to manage the local memory efﬁciently.

The experimental results on the performance Of synchronous, conservative, and
optimistic protocols for benchmark circuits are shown in the Table 6.3. Speed ratio
is the ratio of total execution time on the CM-2 to that on the MP-l. All MTW
sizes of synchronous protocol are inﬁnite because only a few events are executed and
propagated on the synchronous protocol based on the GVT value and the event queues
do not overﬂow. Among the parallel protocols, the conservative protocol achieve the

better performance than other protocols on the SIMD machines. In addition, the

157

 

 

 

 

Protocol Synchronous Conservative Optimistic
Speed ratio Speed ratio I MTW size Speed ratio I MTW size
C1355 1.89 (214/113 2.12 (55/26) 00 2.52 (189/75) 3000

 

01908 1.82 (265/146 1.95 (41 /21) 28000 2.36 (184/78) 20000

 

 

 

 

 

 

 

 

 

 

 

)

)
C6288 1.89 (735/388) 2.10 (437/208) 2500 2.52 (1667/662) 500
C7552 1.74 (324/186) 1.78 (89/50) 10000 2.32 (402/173) 1000

 

 

TOM—2)
TMP—l

Table 6.3: Experimental Results (Speed ratio =

MP-l runs 2 to 2.5 times faster than CM-2 and the optimistic protocol gives the
better Speed ratio than synchronous and conservative protocols on the MP-l.

From the experimental results, we can consider two analytical factors: array ac-
cess frequency and congestion problem. The ﬁrst factor produces several noticeable
effects and wide differences. Optimistic protocol requires more event searching in a
simulation cycle, causing more array accesses. Since the array access time as shown
in 6.2 is much faster on the MP-l than the CM-2, the performance of optimistic
protocol on the MP-l is almost two and half times better than that on the CM-2. It
provides the highest speed ratio Of optimistic protocol among those three protocols.
Conservative protocol is also inﬂuenced by this array access frequency effect, since
there are more array accesses for link clock and local clock than synchronous protocol.
Thus, conservative protocol has higher speed ratio than synchronous protocol.

For the second factor, the MP-l is more sensitive to the PE cluster concept than
the CM-2, as mentioned in 6.3. For synchronous protocol, there are only a few
processors which are activated in each cycle according to GVT, so it results in little
wasted time to go through a global router because the congestion is very low (about
average 1.75 ,useconds / cycle on the MP-l for communication). Conservative protocol,

however, has a small number of simulation cycles. This fact results in increasing the

158

degree of parallelism and communication time per cycle (about 1.97 pseconds / cycle).
Optimistic protocol allows all gates to be activated whenever gates receive the new
incoming events, resulting in a very high probability that each processor in a cluster
tries to propagate events simultaneously. The interprocessor communication takes the
longest time (about 2.12 pseconds / cycle). In Spite of this high congestion probability,
the performance of optimistic protocol on the MP-l is almost two and half times better
than that on the CM-2. This situation can be explained by the fraction of execution.
time for each major part of the logic simulation processing to the total execution
time. Processing of a logic simulation consists of three main parts: interprocessor
communication, queue accessing and handling, and GVT computation. The fraction
of execution time for each activity varies by protocols and circuits. It is measured
in terms of percentage of its execution time as shown in the Table 6.4. Since the
performance difference in array access instructions between the two machines is wide,
the speed ratio increases as the queue handling percentage increases. If the congestion
problem is severe, the percentage of communication time per cycle will increase.

Thus, on the optimistic protocol, the performance difference of queue manipula-
tions dominates the difference resulting from high congestion. The performance of
conservative protocol on the MP—l is also affected by this domination, producing a
higher speed ratio than synchronous protocol. Thus, we can see the congestion prob-
lem does not cause a signiﬁcant effect on the ISCAS ’85 circuits. The inﬂuence of the
congestion problem, however, depends on how gates are allocated to processors. In
the case of a multiplier circuit, conservative protocol is more sensitive to the conges-

tion problem than array access frequency (speed ratio of syn. = 2.13, speed ratio of

159

 

[ [I Interprocessor Communication ]

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Synchronous 78.22% to 79.73%
Conservative 50.59% to 53.85%
Optimistic 9.20% to 17.04%

| Event Queue Manipulation
Synchronous 20.27% to 21.78%
Conservative 46.15% to 49.41%

Optimistic 82.96% to 90.80%

[ J GVT Computation ]
Synchronous less than 0.0001%
Conservative 0.0003%

Optimistic 0.0001%

 

 

 

 

 

Table 6.4: Fractions of Execution Time

con. 2 1.95, speed ratio of opt. = 2.72). Optimistic protocol still has a high speed
ratio due to the effect of queue manipulations caused by the heavy rollback in the
multiplier circuit.

As we see in the above experiments, the array access frequency and the communi-
cation scheme are major factors to make a simulation protocol better than the others.
One method to lower the congestion effect depends on how to assign logic gates to
processors based on the knowledge of gate activity in advance. Array access frequency

can be reduced by using an efﬁcient queue data structure for logic simulations.

6.5 Summary

The characteristics of machine architectures decide the performance of parallel pro-
tocols. When VLSI benchmark circuits are used as application problems, the perfor-
mance of logic simulation depends on that Of related instructions in the employed ma-

chine architecture. In order to compare the machine architectures in terms Of PDES,

160
the following characteristics and performances of machines are considered: computa-
tion capability of a processor, interprocessor communication, global synchronization,
local memory space, and the number of processors.

The two target massively parallel SIMD machines are the MP—1 and the CM-2.
Performances of the major instructions considered are faster on the MP-l than on
the CM-2. Three logic simulation protocols, that is, synchronous, conservative, and
optimistic protocols, are evaluated and compared on both of the machines. The ex-
perimental results show that the MP—l is about 2 to 2.5 times faster than the CM-2
with some variances for all three simulation techniques. The results reﬂect the fact
that the different performances of major instructions cause the faster execution times
on the MRI. The variances of performance among three protocols are also explained
by investigating how array access frequency and communication scheme have effects
on the logic simulation protocols. In general, even though there is a congestion prob—
lem in communications, the fraction of execution time spent in queue manipulations
exceeds the fraction spent in interprocessor communications in conservative and op—

timistic protocol. It results in that optimistic protocol has the highest speed ratio.

Chapter 7

A Study of Machine-Independent
Characteristics of Parallel Logic

Simulation

The factors that affect performance of parallel protocols can be classiﬁed into machine—
dependent factors and machine-independent factors. Machine-dependent factors are
ones that rely on the performance of processor operations and capabilities supported
by the machine system. The related performance metrics are an instruction process-
ing time, an array manipulation time, an interprocessor communication time, a global
synchronization time, etc. when a parallel protocol is employed in a machine. In con-
trast, machine-independent factors are independent of machine performance. Rather,
they are determined by characteristics of circuits and parallel protocols. The critical

path length of a circuit, the simulation parallelism of parallel protocols on a given

161

162
circuit, the number of simulation cycles, data structures, and other used schemes are
in this category.

In this chapter, we investigate the effects of two machine-independent factors on
performance of parallel protocols by observing how the factors are related to the char-
acteristics of parallel protocols, application problems, and other machine-independent
factors. The ﬁrst one is the critical path length of an application problem. We ﬁnd
the linear relationship of the critical path length with another machine-independent
factor. With the relationship, when the critical path length in an application problem
is known, we discuss how the number of simulation cycles and the total execution time
of a protocol can be estimated without any actual experiments. As the second factor,
we study the parallelism of simulation, which is deﬁned as the ratio of the activated
processes to the total processes during the simulation. Especially, the conservative
protocol with null messages is studied in view of parallelism and a reﬁned conservative
protocol is proposed by reducing useless null messages. Since sending and cascading
unnecessary null messages make all processors busy and increase communication delay
wastefully, the reﬁned protocol employs a mechanism, called Bounded Time Window,
which bounds timestamps of null messages to be sent. The effects of this mechanism
on experimental results of parallelism of simulation and the number of simulation
cycles are studied. Finally, the MT W mechanism is studied as a mechanism to ma-
nipulate event queues under the situation of the limited memory space. When we
simulate a huge circuit with a large number of input event vectors, to utilize the lim-
ited local memory for event queues is an important concern on the SIMD machine.

In order to prevent event queues from overﬂowing without loss of performances, the

163
MTW mechanism is applied to conservative and Optimistic protocols.

As the target machine the MasPar MP—l [53] is used for the experiments. The
MP-1 has 16K processors and 16KB local memory size per processor. Since the SIMD
machine has massively parallel processors and supports faster global synchronization,
it is appr0priate to simulate logic circuits in parallel by assigning a gate into a proces-
sor. The used model for experiments is the same as that in Chapter 6. Each processor
is assigned only one logical process.

In Section 7.1, the relationships of critical path length with other machine indepen-
dent factor are studied. Section 7.2 presents the effect of null messages on parallelism
of conservative protocol and shows a reﬁned conservative protocol using the BTW
mechanism. Section 7.3 shows the MTW mechanism as the tool of managing memory

space. Concluding remarks are in Section 7 .4.

7 .1 Critical Path Length

Producing the proper output results within a short time is the most important cri-
teria to reduce the simulation cost for huge application problems. Total execution
time is the Spent wall clock time for the simulation, which is measured from the
beginning of simulation until the simulation is complete. The following simple for-
mula expresses the total execution time, T, as a function of machine-dependent and

machine-independent factors:

T=r>I<S (7.1)

164

where r is the average execution time per cycle and S is the number of simu-
lation cycles. Total execution time can be computed by the number of simulation
cycles and the average spent time per simulation cycle. A simulation cycle contains
several processing steps: process evaluation and activation, event computation, event
prOpagation, and GVT computation. By repeating the simulation cycle of event com-
putation and communication, each processor proceeds to complete the simulation. If
many processors can process concurrently, the executed events per simulation cycles
increase and the number of simulation cycles may decrease by rapidly reaching the
total executed events required for the simulation. If the number of simulation cycles
increases, the total execution time, in general, increases consequently. The ﬁrst com-
ponent in the equation, r, is a machine—dependent factor that is related to machine
performance of computation and communication. It also varies for protocols and the
number of initial input event vectors. As a machine-independent factor, the second
component S is related to the number of input event vectors, say it. Figure 7.1 shows a
linear relationship of S and n from the results of C7552 circuit that are experimented
on the MP-l with conservative protocol. Experimental results from other circuits and
protocols also have the linear relationship of S and n with different slopes.

Thus, S can be explained as a function of the number of input event vectors as

follows.

SZCG*n+G (7.2)

where CG is the slope and C is the intercept of the function. Both of them are

dependent on the concerned circuits and protocols. Note that the value of G’ can be

165

 

 

 

 

50 _ I I I I
C7552 on Conservative P. @-
40 - —
Total 30 _ J
execution
time
T
( ) 20 ~ —
10 - —
0 l l I l
0 200 400 600 800 1000

Number of Input Event Vectors (n )

Figure 7.1: Total Execution Time vs. the Number of Input Event Vectors

negligible for large n. Thus, the CG value for a given circuit and a protocol can give
an estimation of the number of simulation cycles on a value of n. In the following

subsections, we study the values of CG for several circuits and protocols.

7.1.1 The C0 Value

From experimental results of synchronous, conservative, and Optimistic protocols, we
compute the CG value of each circuit. As the experimental benchmark circuits, some
circuits in the ISCAS ’85 suite [16] and multiplier circuits are used, which are all
combinational circuits. These circuits are preprocessed before starting simulation;
zero-delay gates are added to the original circuits and the format of circuits are
reconﬁgured to let each gate have at most three input ports and two output ports. It
is because all processors on the SIMD machine are synchronized at each instruction

and a gate which has many ports makes all processors wait for the gate without

166

 

[ || Critical Path Length | No. of Gates ]

 

 

 

 

 

 

 

 

 

 

C1355 36 1001
C1908 49 1322
C6288 174 3888
C7 552 55 5848
C3540 60 2609
C880 39 729
C17 4 13
3 Bits 17 78
16 Bits 95 2080

 

 

 

 

 

 

Table 7.1: Critical Path Lengths and the Numbers of Gates on the Benchmark Cir—
cuits

processing data. The limited number of ports makes all processors be in the almost
same processing interval per simulation cycle, considering the characteristic of SIMD
machine. Table 7.1 presents the number of gates and the length of the longest path,
i.e., the critical path length, of the benchmark circuits. In observations of circuit
characteristics in this Chapter, the effect of the circuit size is not studies. In [4],
Bailey studied the relationship between the circuit size and the parallelism and said
that parallelism does not scale linearly with circuit size and that the relationship
between those two factors is more complex than a linear relationship according to the
characteristics of the given circuits.

Tables 7.2 shows experimental results from synchronous, conservative, and opti-
mistic protocols on the MP-l. The number of simulation cycles and the total exe-
cution time of the benchmark circuits are measured on the three parallel protocols.
As a unit of measurement for total execution time, second is used. Because of the
limitation of local memory space per processor, the array size of CBSQ is ﬁxed at

512, where the CBSQ is showned in Chapter 6. Since the queue overﬂow problem

167

is unavoidable for a large number of input event vectors, the Moving Time Window
(MTW) mechanism is employed. In [7 5], the MTW has been used for the Time Warp
protocol as a hybrid approach to control the size and frequency of rollbacks and the
volume of historical information. In our experiment on the MP-l as the Chapter 6
does, the MTW has been employed for memory management on both of conservative
and Optimistic protocols. It also controls the speed of processing events by preventing
some events from processing.

Recall that the CG value is the slope of the function in Equation 7.2. Thus, the CG
value of each benchmark circuit can be computed on a different number of input event
vectors. The value of C0 is computed by using the number of input event vectors and
the number of simulation cycles from several sets of experiments. The computed CG
values are summarized in Table 7.3. Thus, the CG value is a value which is extracted
from experimental results, and is independent of the number of input event vectors

of simulation and the performance target machine.

7 .1.2 Effect of Critical Path Length

In order to ﬁnd the relationship between the critical path length and the CG value,
we conduct a regression analysis by using the experimental results on several bench-
mark circuits from synchronous, conservative, and optimistic protocols. The used
experimental results are in Table 7.3.

Figures 7.2 shows the regression analysis to ﬁnd the relationship between the

critical path length and Ca. A circuit is represented as a point in the ﬁgure, whose X

168

 

[ SYN 1] Input Event | Cycles [Execution Time ]

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

C1355 1500 87498 258.54
C1908 1500 104729 336.52
C6288 1500 265727 699.27
C7552 1500 125776 355.64
C3540 1500 115993 301.14
C880 1500 47551 119.55
C17 1500 1053 2.18
3B Mul 1500 2963 6.13
16B Mul 1500 98741 231.62

 

 

(a) Synchronous Protocol

 

[ CM [[ Input Event [ MTW ] Cycles [ Execution Time I

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

C1355 1500 30000 9648 41.92
C1908 1500 22000 8470 37.10
C6288 1500 1500 101444 388.38
C7552 1500 10000 19873 86.36
C3540 1500 10000 15339 67.32
C880 1500 30000 4149 17.32
C17 1500 00 1505 5.19
3B M111 1500 00 2485 7.57
16B Mul 1500 5000 36209 119.16
(b) Conservative Protocol
[ TW Input Event MTW Cycles Execution Time ]
C1355 1500 15000 12164 118.67
C1908 1500 7000 10319 178.22
C6288 30 10000 2160 21.87
C7552 1500 8000 22649 495.54
C3540 1500 13000 16553 380.39
C880 1500 00 4295 44.00
C17 500 00 505 1.86
33 M111 1500 00 2518 12.22
16B M111 1500 4000 36764 546.40

 

 

 

 

 

 

 

Table 7.2: Performance of Parallel Protocols on the MP-l

(c) Optimistic Protocol

 

169

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Circuits I Synchronous [ Conservative I Optimistic [
C1355 58.632 6.49 8.12
C1908 70.234 5.63 7.076
C6288 176.646 67.5 67.25
C7552 84.35 13.474 15.552
C3540 77.444 10.424 11.206
C880 31.67 2.726 2.824
C17 3.098 0.994 0.995
3 Bits 8.738 1.67 1.702
16 Bits 65.92 23.834 24.29

 

 

 

 

 

 

Table 7.3: Values of CG

and Y coordinates represents the critical path length and the corresponding CG value
of the circuit, respectively. The linear dotted lines are the regression lines driven by
the regression analysis. The regression coefﬁcient 61 is the slope of the regression
line and the parameter 60 is the Y intercept of the regression line. The inferences

concerning 61 have the following test:

0133120

021517'éO

where 61 = 0 implies that there is no relation between Y and X. The term b1 is the
point estimator of B, and b0 is the point estimator of [30. To use F test, F * is denoted
the test statistic for the analysis of variance approach.

The regression line in the part (a) of Figure 7.2 has 0.949812 as the value of b1
and 8.253485 as the value of b0. The value of F * is 45.911759. Since F * is greater
than F(.999; 1,7) = 29.2, we conclude 61 at 0, that is, there is a linear relationship

between the length of critical path and the value of CC on synchronous protocol with

170

a 99.9 % conﬁdence. For conservative protocol, we have 0.403902 for the b1 value,
and -8.991353 for the b0 value. Since the F* is also much greater than F(.999; 1, 7),
there is also a relationship between those two factors with a 99.9 % conﬁdence. In the
same way, we can prove linear relationships between the length of critical path and
the CG value for Optimistic protocol, based on the values in part (c) of Figure 7.2.

Once we know the CG value for a given circuit and protocol we may estimate
the execution time of the same circuit and protocol on an arbitrary system without
actual implementation. For the large number of input event vectors, multiplying the
CG value with the number of input event vectors yields the number of simulation
cycles. Then, based on system performance of basic operations, such as instruction
processing time, array manipulating time, and interprocessor communication time on
the system, we can approximate the execution time per simulation cycle by computing
with the portion of each spent time for the major instructions. By multiplying the
estimated number of simulation cycles with the execution time per simulation cycle,
the total execution time on the particular system can be computed. When CC is
unknown, the CG value can be estimated by using the relationship with the critical
path length. From the problem’s task dependency, the critical path length can be
obtained. Since we know that CG value are linearly related to critical path length
with the coefﬁcient and difference on the regression equation, CG can be computed

and the total execution time can be roughly estimated.

171

 

 

 

 

 

 

 

 

 

 

 

 

250 I I I I I I I I
Benchmark Circuit 0
200 " Linear Relation - - - - ‘
150— a
100 r . . - ' ' ' a
O , . . ' ' ‘ O
50 _. ,<.>. . — ~ -- J
. .- ‘ ' <>
0 Q 01 I I I l I I I
0 20 40 60 80 100 120 140 160 180
Length of Critical Path
(a) Synchronous Protocol
100 I I I I I I I I
80 _ Benchmark Circuit 0 __
Linear Relation ' ~
60 — -—
CG 40— _
20— <><>ﬁw“'”0 _
0 e O ‘5' '0
' I I I I 1 I I I
0 20 40 60 80 100 120 140 160 180
Length of Critical Path
(b) Conservative Protocol
100 I I I r I I I I
80 _ Benchmark Circuit 0
Linear Relation - - '
CG 40— -
. . . . - ’ ‘<‘>’
20 ‘ O . . - ' ‘
. . - ' ' O
0 <> .0 00 O
- ' ' ' L I I I I l I I

 

 

 

0 2O 40 60 80 100 120 140 160 180
Length of Critical Path

(0) Optimistic Protocol

Figure 7.2: The Relationship between the Length of Critical Path and the Value of
CG

 

172

 

 

 

 

1 I I I I I I
CON 01908 _—
0.8 b _
0.6 - _
Parallelism
0.4 - _
0.2 _
O 1 l I 1 I .4
0 200 400 600 800 1000 1200

Number of Simulation Cycles

Figure 7 .3: Parallelism on the Conservative Protocol Excluding Null Messages (C1908
Circuit)

7.2 Parallelism of Simulation

Parallelism of simulation is another important factor on simulation performance.
This factor is the ratio of the active processes to the total processes at a point Of
simulating an. application problem. The parallelism changes dynamically depending
on the system state and event processing while the Simulation runs. In this section,
we study the patterns of parallelism for three simulation protocols and the effect of
null messages on the parallelism of conservative protocol.

Figure 7.3 shows the parallelism Of conservative protocol on C1908 circuit. The
number of input event vectors is 200. The active processes which execute events are
included in the parallelism, and the null messages are excluded. As we see in the
ﬁgure, there are three different parts in the parallelism of simulation. The ﬁrst 200

simulation cycles produce the highest parallelism. The enqueued 200 input event

173

 

1 I I I I I I
OPT C1908 —

0.8 — _
0.6 ..
Parallelism

0.4 - _

0.2 - -

 

 

O l J l_ I I l

0 200 400 600 800 1000 1200
Number of Simulation Cycles

 

Figure 7.4: Parallelism on the Optimistic Protocol (C1908 Circuit)

vectors into the circuit are processed and propagated to other processes in this ﬁrst
part. The middle part of the simulation cycles between 400 and 10000 is the interval
of on-going event processing. According to the rule of protocol and the characteristic
of a given circuit, the state of activation varies. Finally, the last part shows the tail
of simulation in which the remaining events are processed. Although the locations of
those three parts vary depending on the characteristics of circuits and protocols, the
general trend of parallelism contains the three parts while simulating a combinational
circuit. As shown in Figure 7.4, the optimistic protocol also has the three different
parts of parallelism level. Because the optimistic protocol activates processes with-
out considering the causality effect, its parallelism is higher than the conservative
protocol.

Table 7.4 presents the average parallelism of some ISCAS ’85 circuits. Since the

C6288 circuit has a long critical path length compared to its total number Of gates,

 

174

 

 

 

 

 

 

 

 

 

[ Circuits I] Synchronous [I Conservative [[Optimistic ]
I I Without Null With Null
C1355 0.017030 0.123270 0.444010 0.193137
C1908 0.015966 0.127395 0.442529 0.218135
C6288 0.067693 0.157134 0.242911 0.188740
C7552 0.012733 0.069311 0.190080 0.094628

 

 

 

 

 

 

 

 

 

 

it produces the memory overﬂow in event queues for a large number of input event
vectors. To Obtain the proper parallelism excluding the effect of MTW, we use a
relatively small Size of input event vectors, 30. While the C1355 circuit has the same
trend as the C1908 circuit has, the C6288 circuit gives a gentle SIOpe after the feeding
part since it has a long critical path. The C7552 circuit produces a much longer tail
than the other circuits. This property may result from the larger number of total
gates compared to other circuits. To compare the differences of parallelism according
to the characteristics of circuits, Figures 7.5 and 7.6 Show the parallelism on C6288

and C7553 circuits, respectively. The C1355 circuit has the almost same behavior as

Table 7 .4: Average Parallelism

the C1908 circuit, so we omit the ﬁgure.

175

 

1 I I r I I I I I I

CON C6288 —
0.8 r _

 

0.6 — _
Parallelism

0.4 - _

0.2 5 —

 

 

0 l l l l I l l

O 200 400 600 800 1000 1200 1400 1600 1800
Number of Simulation Cycles

 

Figure 7.5: Parallelism on the Conservative Protocol Excluding Null Messages (C6288
Circuit)

176

 

 

 

 

1 I I I I I I
CON C7552 --—
0.8 — _
0.6 - _
Parallelism
0.4 - _
0.2 a
0 l l 1 l _L I
0 200 400 600 800 1000 1200

Number Of Simulation Cycles

Figure 7.6: Parallelism on the Conservative Protocol Excluding Null Messages (C7552
Circuit)

7 .2.1 A Reﬁned Conservative Protocol

Conservative protocol uses null messages to facilitate processing as well as to avoid a
deadlock situation. Figure 7.7 shows the parallelism of the conservative protocol for
the same Simulation as in Figure 7.3, except that Figure 7.7 contains the null messages
into the parallelism. The two ﬁgures shows that null messages of conservative protocol
highly increase the degree of parallelism. In other words, the conservative protocol
generates many null messages between any two processes although the number of ac-
tive processes is small. If the target machine system has a high cost for interprocessor
communication, many null messages yield a longer communication time. On MP—l,
however, the interprocessor communication is performed at a time by an instruction
through a circuit-switched style network organized as a three-stage hierarchy of cross-
bar switches [14]. Thus, the high parallelism due to the null messages does not make

a big difference in the performance result on MP—l. The null message is a signiﬁcant

177
factor for a machine architecture which has a long communication time per message,
such as a MIMD machine and a distributed system.

In conservative protocol, null messages are used to activate the child processes
even when a process does not have events to send. The process sends null messages
with the timestamp of the current Local Clock to the child processes, When the child
processes receive the null messages, they can be sure that there is no incoming event
with a larger timestamp than the null messages’ timestamp. If they have unprocessed
events whose timestamp is smaller than that of the null messages and the events
satisﬁes the activation condition, the events can be executed.

However, when a process generates null messages frequently, sending null messages
may cause bad effects. When the child processes do not have unprocessed events
or they are waiting for events in other input ports, the frequently-incoming null
messages make the processes keep busy without any useful processing, spending the
communication time. Another case is the situation that the null messages waste
the execution time without accelerating event computations. In this situation, null
messages can be cascaded to descendent processes, but there are a few events to be
executed on the descendent processes. Note that the purpose of sending null message
to the child processes is that the child processes can execute their events up to the null
message’s timestamp. However, if there are very a few events to be processed, most of
null messages are involved in an unnecessary work by increasing the communication

time without any beneﬁt for simulation progress.

 

178

 

1 I I I I I I

CON C1908 with null messages ——

0.8

0.6 r

Parallelism

0.4 -

0.2 I-

 

 

 

0 l l I I l l

0 200 400 600 800 1000 1200
Number of Simulation Cycles

 

Figure 7.7: Parallelism on Conservative Protocol Including Null Messages (C1908
Circuit)

The BTW (Bounded Time Window) Algorithm

To reduce the overhead due to null messages in the conventional conservative pro-
tocol, we propose a reﬁned conservative protocol. This protocol employs a window
per process to restrict events to be executed according to the values of timestamps.
That is, events which have the timestamps within the window are not allowed to be
executed. The window is called the Bounded Time Window (BTW). This window
can control to reduce the frequently-sending null messages.

A characteristic of the reﬁned protocol is that the window of BTW mechanism is
not dependent on a global characteristic of simulation, but on a locality characteristic
of each process. Since the BTW of process starts from the timestamp of the last sent
event, each process has a different window interval which has a different low bound.
Thus, each process can avoid sending null messages which reside within the window.

By using the BTW per process, we locally control the number of null messages to be

179
sent so that all processes can proceed in their own progress without waiting for any
global control or others’ late processing. Figure 7.8 describes the BTW algorithm in

detail.

Experimental Results

Table 7.5 compares the performance results of the reﬁned conservative protocol with
that of the original conservative protocol. The average parallelism including null
messages are measured by varying the BTW size of the reﬁned conservative protocol.
The numbers of used input event vectors are 200 and 1400. The result of the original
conservative protocol is the special case of the reﬁned conservative protocol when
the BTW size is 0. As the BTW size increases, the average parallelism decreases
without any large difference in the number of simulation cycles. It implies that BTW
reduces unnecessary null messages by preventing processes from sending unnecessary
null messages repeatedly. Figure 7.9 shows the parallelism as a function of the BTW
size on the C1908 circuit. Although the BTW mechanism decreases parallelism, the
total execution time of the reﬁned protocol on the MP—l is the almost same as that
of the original conservative protocol. This is because interprocessor communication
on the MP-l is performed by an instruction by means of the built-up router, and the
extra communication overhead due to the unnecessary null messages does not cost
much on the MP-l. However, when the BTW is applied to a MIMD machine or a
distributed system, the reﬁned conservative protocol may achieve better performance

improvement by reducing the number of null messages.

 

180

 

 

integer GVT
plural integer LVT, Local_Clock, Link_Clock[N 0.1 nputPort]

Allocate processes into processors.
Read input event vectors and put them into initial processes.

Repeat
Compute LVT per process.
Get Local_C'lock from the minimum LinkL’lock from input ports per process.
if it is a new Local_C'lock
then Set the Local_Cl0ck changed.

Get GVT from the minimum LVT from all processes.

if ( LVT <= Local_Clock || LVT 2: GVT )
then begin
Activate the process
Compute the event.
Generate an event with a new data and timestamp based on current status.
Check the event has the same data which was sent before.
if yes
then Set the event as a null message.
else Set the event as a real message.
end if
else if LocaLC'lock is changed
then make a null message with timestamp of Local_Clock.
end if

 

if the event is null message

then if ( timestamp of the event - timestamp of the last sent event > BTW )
then Send the null message to the successors.
else Discard the null message.

else Send the real message to the successors.

Activate the processes that have received new events.
Change the Link_C'l0ck according to the incoming events.

Insert the new events into event queues.

Until ( No more events in all event queues.)

Figure 7.8: The BTW Algorithm

 

 

 

 

 

 

 

 

Parallelism

181

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

" Conservative Protocol Reﬁned Conservative Protocol
n = 200 II Without Null With Null BTW=100 BTW=200 BTW=300
Parallelism 0.182373 0.479925 0.470730 0.409726 0.407387
Sim. Cycles Jf 1221 1221 1225 1231 1232
j—n—=TOO—_IrWithout Null With Null BTW=100 BTW=200 BTW=3OO |
Parallelism 0.198162 0.664161 0.641622 0.496218 0.493873
Sim. Cycles 7932 7932 7933 7940 7941

 

 

Table 7.5: Performance Comparisons on the Reﬁned Conservative Protocol

0.7

0.65

0.6

0.55

0.5

0.45

0.4

0.35

0.3

 

 

 

 

 

 

200

400

600 800

BTW Size

1000

Figure 7.9: Average Parallelism of the Reﬁned Conservative Protocol

 

 

 

 

 

 

182

7 .3 MTW Mechanism for Memory ”Management

One of important considerations on parallel discrete event simulation is the memory
size to simulate an application problem. Each process requires an amount of memory
space to store incoming events in its input event queues. Depending on the employed
protocol and the number of processes per processor, the required memory size may
vary. If the size of application is large or the parallel system does not have enough
memory, the limitation of memory might be an signiﬁcant problem. On SIMD ma-
chines, especially, a local memory size per processor is relatively smaller than on
MIMD machines or distributed systems. Thus, controlling the required memory size
per process without loss of performance is an important issue to utilize the limited
system capacity efﬁciently.

The M TW (Moving Time Window) mechanism is employed to handle the memory
problem on the SIMD machine. In [75], MTW was used for Time Warp to solve the
over-optimistically processing and the storage overhead problem as a hybrid parallel
protocol scheme. In this chapter, we use MTW for a different concept: It is to
halt the advance of events to prevent overflows in event queues. Since the parallel
protocols are implemented by array structures or CBSQ on MP-l, each event queue
has a limitation of queue size after statically being allocated. If there are more
events than the size of event queue, an overflowing situation occurs in the queue. By
controlling the advanced events not to be sent, parallel protocols can avoid overﬂows
in event queues and are able to simulate huge logic circuits even on machines having

the memory limitation. It is employed for simulation of conservative and Optimistic

 

 

 

 

 

 

183

 

Circuit Number of Input Event Vectors
200 I 400 I 600 l 800 I 1000 [ 1200 l 1400

01355 00 00 20000 20000 19000 19000 19000
C7552 10000 9000 9000 8000 8000 8000 8000

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 7 .6: MTW Sizes Applied to the Experiments

protocols. The MTW mechanism operates with the current GVT value and a MTW
size If the LVT of a process is within the interval of [GVT, GVT + M TW size],
the process is active and executes an event. If the LVT is beyond the interval, the
process does not execute an event.

Experiments with a large number of input vectors need to use the MTW mech—
anism to prevent overﬂow situations on MP—l. Table 7.6 shows the applied MTW
size to the experiments. When the numbers of input event vectors are 200 and 400
on C1355, the simulation can run without controlling the queue overﬂow. As the
number of input event vectors increases, the used MTW size decreases to prevent
the overﬂow situation in event queues. Figures 7.10 and 7.11 show the relationship
between the number of input vectors and either the number of simulation cycles or
the total execution time when the MTW size in the table is applied the optimistic
protocol on C1355 and C7552, respectively. Although the MTW size decreases to
control the queue overﬂow as shown in Table 7.6, the linear relationship can be con-
tinuously kept in the ﬁgures. For example, the input event vectors 600 has a big
difference in the MTW size on the C1355 circuit which changes from 00 to 20000.
However, the corresponding performance differences are not signiﬁcant as shown in
Figures 7.10 and 7.11: the ﬁgures show a small variation of the linear lines between

400 and 600, producing a small increase on the lines of the number of simulation

 

184

 

l l l l l

  
 
 

20000
C7552 +—

C1355 é;—

 

15000
Number
of
Simulation 10000
Cycles

 

 

 

0 l l l l l
200 400 600 800 1000 1200 1400
Number of Input Event Vectors

Figure 7 .10: Relationship between the Number of Simulation Cycles and the Number
of Input Event Vectors with the MTW Mechanism

 

cycles and the total execution time. Thus, applying the MTW mechanism to parallel
asynchronous protocols can achieve the control effect for the event queue overﬂow
without signiﬁcant increases of the number of simulation cycles and the total execu-
tion time. The experimental results also present that the applied MTW size depends

on the characteristics of circuits.

7 .4 Summary

We study two machine-independent factors and their effects on performance of parallel
protocols for logic circuits. The ﬁrst factor is the critical path length of logic circuits.
By showing a linear relationship with the CG value, we discuss how the CG value and
the total execution time are related to the length of critical path. As the second factor,

the parallelism of simulation is studied. With some minor variations depending on

185

 

 

 

 

600 _ T l l l 1 ﬂ
C7552 +—
500 __ 01355 A— _,
400 - —
Total
Execution 300 _ —
Time
200 — -
100 - -
i
0 l l l l
200 400 600 800 1000 1200 1400

Number of Input Event Vectors

Figure 7.11: Relationship between the Total Execution Time and the Number of
Input Event Vectors with the MTW Mechanism

application problems, conservative protocol shows the highest parallelism when it uses
null messages. Optimistic protocol is the next and synchronous protocol shows the
lowest parallelism. On the conservative protocol, if null messages are not considered
for parallelism, the parallelism is lower than that of optimistic protocol. We propose a
reﬁned conservative protocol that can reduce unnecessary null messages without loss
of performances. The reﬁned protocol may achieve a better improvement effect on the
MIMD machines or distributed systems which have a high communication overhead.
The last factor that we considered is the MTW mechanism. When we simulate a
huge circuit with a large number of input event vectors, the MTW mechanism is
used to utilize the limited local memory for event queues. By applying the MTW
mechanism to conservative and Optimistic protocols, we could prevent event queues

from overﬂowing without loss of performances.

 

 

 

 

Chapter 8

Conclusion

In this thesis, we have studied and implemented parallel asynchronous protocols of
PDES on various distributed—memory systems for the application domain of VLSI
logic circuits. The target machine architectures are the IBM SP2 as a parallel multi—
processor system, a cluster of DEC Alpha workstations as a distributed system, and
the MasPar MP-l/MP-2 as a SIMD machine. On the various types of machine archi-
tecture platforms, we have investigated several machine-dependent and -independent
factors in order to improve performances of parallel asynchronous protocols. As for
the machine dependent-factors, studied are the number of processors, performances
of machine instructions or operations, the interprocessor communication overhead,
and the local memory space. As the machine—independent factors, considered are
computation granularity, event scheduling strategies, protocol characteristics, critical
path lengths, parallelism of simulation, etc.. In this chapter, we summarize the con-
tributions of this thesis work to the parallel discrete event simulation on parallel and

distributed systems.

186

 

 

 

 

 

187
o The Major Research Factor: Computation Granularity
As one of the important factors for optimistic protocols on parallel and dis-
tributed systems, computation granularity is proposed in this thesis. By com—
bining some event granularities, the computation granularity can adjust the
computation amount between any two interprocessor communication points.
Although many researchers have studied the event granularity as a signiﬁcant
factor which affects the performance of optimistic protocols of PDES, the com—
putation granularity, proposed in this thesis, has been neglected. The major

functions of the computation granularity are to reduce the communication over-

 

head between processors and to control the degree of optimism of processing
events on parallel optimistic protocols. By controlling the computation granu-
larity appropriately, the performances of parallel protocols can be optimized on

parallel and distributed systems.

0 Analytic Prediction of Computation Granularity
An analytic model is developed to study the effects of the computation granular-
ity on the performance of parallel protocols on parallel and distributed systems.
Based on the analytic model, prediction formulas of the number of simulation
cycles and the total execution time are derived. The formulas can analyze the
performances as a function of computation granularity, especially when the LP
ratio is greater than one. A global synchronization is used in the model to
compute GVT. Our analytic results, veriﬁed by the experimental results, show

that the computation granularity signiﬁcantly affects the performance results

 

 

 

 

 

188

on parallel and distributed systems.

Interpretation of Effects of Computation Granularity

Based on the experimental results performed on the IBM SP2, the effects of
computation granularity are interpreted in various aspects of simulation per-
formances. The performance aspects considered are execution activity, rollback
situation, communication overhead, and simulation progress. Execution activ-
ity shows the actual event computation amount between communication points,
which is comparable to the possible event computation on logical processes.
Rollback situation is interpreted by rollback frequencies and rollback rates. To
study communication overheads on the nonblocking message communication,
the computation fraction to the total execution time is used as the performance
metric. Simulation progress shows the overall progress of simulation status until
the simulation terminates. The speedup of parallel processing is also presented

as a function of the number of processors.

Implementations on Various Types of Machine Platforms
Several machine architecture platforms on parallel and distributed systems are
used to study parallel protocols: the SIMD machine platform, a distributed

system and a multiprocessor computer for the MIMD machine platform.

1. SIMD Machines: Three parallel protocols of synchronous, conservative,
and optimistic protocols are implemented by using the MPL on the mas-
sively parallel SIMD machines of the MasPar MP-l and MP—2. The per-

formance results are compared to those on the Connection Machine CM-2

 

 

 

 

 

 

189
in terms of major performance criteria of PDES. The MP—l achieves better
performance results than the CM-2. To improve performances of PDES on
the SIMD machines, the CBSQ data structure and a reﬁned conservative

protocol are proposed.

. Distributed Systems: On a cluster of DEC Alpha workstations inter-
connected by a DEC GIGASwitch, the optimistic protocol is implemented
with different design perspectives from those on the SIMD machines. Since
the number of processors (or machine nodes) is so small, each processor
has to be assigned more than one logical process to accommodate a large
application problem. The distributed system allows processors to maintain
logical processes asynchronously and uses the PVM as the message-passing
mechanism between processors. Event queues are distributed to each in-
put port of logical processes and each processor has a central scheduling

priority queue to schedule events of the assigned logical processes.

. Multiprocessor Computers: The optimistic protocol is implemented
on the IBM SP2 by using the message-passing library of MP1 to sup-
port the asynchronous processing over processors. The SP2 provides the
High-Performance Switch with the topology of a bidirectional MIN and
the buffered wormhole routing. Moreover, due to its high scalability and
good processor capability, the SP2 is a prominent machine for parallel sim-
ulation with a large application. The event queue manipulation is similar

to that on the DEC Alpha distributed system with a little bit enhanced

 

 

 

 

 

190
data structure. Asynchronous GVT computation based on token pass-
ing algorithm is used. To reduce communication overheads, we employ

nonblocking send routines.

0 Study of Event Scheduling Schemes
To develop a good event-scheduling scheme, several considerations are neces-
sary: data structures of event queues, a control of communication overhead,
and a progress control of individual logical processes. In this thesis, we propose
two event scheduling schemes for the optimistic protocol: the BPEC scheme
and the BPVTW scheme. The BPEC scheme limits the number of executed
events per logical process between interprocessor communications by using the
BPC parameter. The BPVT W scheme controls the progress of processes in vir-
tual time by using the BPW parameter, and prevents processes from executing
events far ahead of other processes. According to the experimental results, both
schemes show good performances when the computation granularity is adjusted
prOperly. Otherwise, communication overheads or excessive rollback situations
may signiﬁcantly degrade the simulation performance. The results also indicate
that the progress balancing among logical processes has signiﬁcant effects on

obtaining the optimal performance.

a Study of Machine-Independent Factors
The performance effects of the characteristics of parallel protocols and applica-
tion problems, which are independent of machine characteristics, are also inves-

tigated. The considered machine-independent factors are the length of critical

 

 

 

191
path, the parallelism of simulation, the MTW mechanism, the total number of
logical processes, the number of input event vectors, and data structures. The

relationships among these factors are studied by using experimental results.

 

 

 

 

 

BIBLIOGRAPHY

 

 

 

Bibliography

[1]

[2]

[3]

[4]

[5]

l6]

l7]

[8]

[9]
[10]

[11]

T. Agerwala, J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias, and M. Snir,
“SP2 system architecture,” IBM System Journal, vol. 34, no. 2, pp. 152—184,
1995.

V. D. Agrawal and S. T. Chakradhar, “Performance analysis of synchronized
iterative algorithms on multiprocessor systems,” IEEE Transactions on Parallel
and Distributed Systems, vol. 3, pp. 739—746, November 1992.

R. Bagrodia, Z. Li, V. Jha, Y. Chen, and J. Cong, “Parallel logic level simulation
of VLSI circuits,” in Winter Simulation Conference (WSC ’94), pp. 1354—1361,
December 1994.

M. L. Bailey, “How circuit size affects parallelism,” IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, vol. 11, pp. 208—
215, Febrary 1992.

M. L. Bailey, “A time-based model for investigating parallel logc-level simula—
tion,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, vol. 11, pp. 816—824, July 1992.

M. L. Bailey, “A delay-based model for circuit parallelism,” IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, vol. 12, pp. 1903—
1912, December 1993.

M. L. Bailey, J. V. Briner, and R. D. Chamberlain, “Parallel logic simulation of
VLSI systesm,” ACM Computing Surveys, vol. 26, pp. 255—294, September 1994.

R. Baldwin, M. J. Chung, and Y. Chung, “Overlapping window algorithm for
computing GVT in Time Warp,” in Proceedings of the 11th International Con-
ference on Distributed Computing Systems, May 1991.

P. Banerjee, Parallel Algorithms for VLSI Computer-Aided Design. PT R Pren-
tice Hall, 1994.

J. Banks and J. S. Carson, Discrete-Event System Simulation. Prentice Hall,
1984.

G. Bell, “Ultracomputers: A teraﬂOp before its time,” Communications of the
ACM, vol. 35, pp. 27—47, August 1992.

192

 

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

193

S. Bellenot, “Global virtual time algorithms,” in Proceedings of the SCS Multi-
conference on Distributed Simulation, vol. 22, pp. 122—127, January 1990.

C. Benveniste and P. Heidelberger, “Parallel simulation of the IBM SP2 inter-
connection network,” in Winter Simulation Conference ( WS C ’95), pp. 584—589,
December 1995.

T. Blank, “The MasPar MP-1 architecture,” in Proceedings of the 35th IEEE
COMPCOM Spring 1990, pp. 20—24, February 1990.

F. Brglez, D. Bryan, and K. Kozminski, “Combinational proﬁles of sequential
benchmark circuits,” in Proceedings of the IEEE International Symposium on
Circuits and Systems, May 1989.

F. Brglez, P. Pownall, and R. Hum, “Accelerated ATPG and fault grading via
testability analysis,” in Proceedings of the IEEE International Symposium on
Circuits and Systems, June 1985.

R. Brown, “Calendar queues: A fast O(1) priority queue implementation for the
simulation event set problem,” Communications of the ACM, pp. 1220—1227,
October 1988.

R. E. Bryant, “Simulation of packet communications architecture computer sys-
tems,” tech. rep., Technical Report MIT-LCS-TR—188, Massachusetts Institute
of Technology, 1977.

C. D. Carothers, R. M. Fujimoto, and P. England, “Effect of communication
overheads on Time Warp performance: An experimental study,” in Proceedings
of the 8th Workshop on Parallel and Distributed Simulation, pp. 118—125, July
1994.

R. D. Chamberlain and M. A. Franklin, “Hierarchical discrete-event simulation
on hypercube architectures,” IEEE Micro, vol. 10, pp. 10—20, August 1990.

K. Chandy and J. Misra, “Distributed simulation: A case study in design and
veriﬁcation Of distributed programs,” IEEE Transactions on Software Engineer-
ing, vol. 5, pp. 440—452, September 1979.

K. Chandy and J. Misra, “Asynchronous distributed simulation via a sequence
of parallel computations,” Communications of the ACM, vol. 24, pp. 198—206,
April 1981.

E. Choi and M. J. Chung, “An important factor for Optimistic protocol on dis-
tributed systems: Granularity,” in Winter Simulation Conference (WSC ’95),
pp. 642—649, December 1995.

E. Choi, M. J. Chung, and Y. Chung, “Comparisons and analysis of massively
parallel simd architectures for parallel logic simulation,” in Sixth International
Parallel Processing Symposium, pp. 671—674, March 1992.

 

194

[25] E. Choi and D. Min, “Event scheduling schemes for time warp on distributed
systems,” in Winter Simulation Conference (WSC ’96), pp. 661—668, December
1996.

[26] P. Christy, “Software to support massively parallel computing on the MasPar
MP-1,” in Proceedings of the 35th IEEE COMPCOM Spring 1990, pp. 29—33,
February 1990.

[27] Y. Chung, Logic Simulation on Massively Parallel SIMD Machines. PhD thesis,
Michigan State University, 1991.

[28] Y. Chung and M. J. Chung, “Time Warp for efficient parallel logic simulation on
a massively parallel SIMD machine,” in Proceedings of the Tenth Annual IEEE
International Phoenix Conference on Computers and Communications, March
1991.

[29] M. J. Chung and Y. Chung, “Performance estimation based on gate-to-processor
ratio in parallel logic simulation,” in Proceedings of the 1992 International Con-
ference on Parallel Processing, pp. 246—253, August 1992.

[30] A. I. Concepcion and S. G. Kelly, “Computing global virtual time using the multi-
level token passing algorithm,” in Proceedings of the 5th Workshop on Parallel
and Distributed Simulation {PADS91), vol. 23, pp. 63—68, January 1991.

[31] S. Das, R. Fujimoto, K. Panesar, D. Allison, and M. Hybinette, “GTW: A Time
Warp system for shared memory multiprocessors,” in Winter Simulation Con-
ference (WSC ’94), pp. 1332—1339, December 1994.

[32] C. Fiduccia and R. Mattheyses, “A linear-time heuristic for improving network
partitions,” in 19th Design Automation Conference, pp. 175—181, 1982.

[33] P. A. F ishwick, Simulation Model Design and Execution: Building Digital
Worlds, ch. 9. Prentice Hall, 1995.

[34] E. H. Frank, A data-driven multiprocessor for switch-level simulation of VLSI
circuits. PhD thesis, Carnegie-Mellon University, 1985.

[35] R. M. Fujimoto, “Lookahead in parallel discrete event simulation,” in Proceedings
of 1988 International Conference on Parallel Processing, vol. 3, pp. 34—41, 1988.

[36] R. M. Fujimoto, “Time Warp on a shared memory multiprocessor,” in Proceed-
ings of 1989 International Conference on Parallel Processing, vol. 3, pp. 242—249,
1989.

[37] R. M. Fujimoto, “Parallel discrete event simulation,” Communications of ACM,
vol. 33, pp. 30—53, October 1990.

[38] A. Gafni, “Rollback mechanisms for optimistic distributed simulation systems,”
in In Proceedings of the SCS Multiconference on Distributed Simulation, vol. 19,
pp. 61-67, July 1988.

195

[39] F. Hao, K. Wilson, R. Fujimoto, and E. Zegura, “Logical process size in par-
allel simulations,” in Winter Simulation Conference (WSC ’96), pp. 645—652,
December 1996.

[40] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Pro-
grammability. McGraw-Hill, 1993.

[41] D. Jefferson, “Virtual time,” ACM Transactions on Programming Languages and
Systems, vol. 7, pp. 404—425, July 1985.

[42] D. Jefferson, “Virtual Time II: The cancelback protocol for storage management
in distributed simulation,” in Proceedings of the 9th Annual ACM Symposium
on Principles of Distributed Computations, pp. 75—90, August 1990.

[43] D. Jefferson and H. Sowizral, “Fast concurrent simulation using the Time Warp
mechanism,” in Proceedings of the 1985 S CS Multiconference on Distributed Sim-
ulation, January 1985.

[44] D. W. Jones, “Concurrent Operations on priority queues,” Communications of
the ACM, pp. 132—137, January 1989.

[45] B. Kerninghan and S. Lin, “An efficient heuristic procedure for partitioning
graphs,” The Bell System Technical Journal, vol. 49, pp. 291—307, 1970.

[46] Y. H. Levendel, P. Menon, and S. H. Patel, “Special-purpose computer for logic
simulation using distributed processing,” Bell System Technical Journal, vol. 61,
no. 10, pp. 2873—2909, 1982.

[47] J. M. Lin and S. G. Abraham, “Discrete event simulation on shared memory
multiprocessors using global simulation information,” in Proceedings of the 1992
International Conference on Parallel Processing, pp. 254—261, 1992.

[48] Y.-B. Lin and E. D. Lazowska, “Exploiting lookahead in parallel simulation,”
tech. rep., Technical Report 89-10—06, Department of Computer Science, Univer-
sity of Washington, 1989.

[49] Y.-B. Lin and E. Lazowska, “Determining global virtual time in a distributed
simulation,” tech. rep., Technical Report 90—01-02, Department Of Computer
Science, University of Washington, December 1989.

[50] W. M. Loucks and B. R. Preiss, “The role Of knowledge in distributed simu-
lation,” in Proceedings of the SCS Multiconference on Distributed Simulation,
vol. 22, pp. 9—16, January 1990.

[51] B. Lubachevsky, “Efﬁcient distributed event-driven simulations Of multiple-loop
networks,” Communications of the ACM, vol. 32, pp. 111—123, January 1989.

[52] N. Manjikian and W. M. Loucks, “High performance parallel logic simulation on
a network of workstations,” in Proceedings of the 7th Workshop on Parallel and

Distributed Simulation (PADS ’93), vol. 23, pp. 76—84, May 1993.

196

[53] MasPar Computer Corporation, MasPar MP-I, July 1990.

[54] MHPCC, “Maui High Performance Computing Center (MHPCC),”
http://www.mhpcc.edu/mhpcc.html, 1997.

[55] J. Miguel, A. Arruabarrena, R. Beivide, and J. A. Gregorio, “Assessing the
performance of the new IBM SP2 communication subsystem,” IEEE Parallel 89'
Distributed Technology, vol. Winter, pp. 12—22, 1996.

[56] J. Misra, “Distributed discrete—event simulation,” Computing Surveys, vol. 18,
pp. 39-—65, March 1986.

[57] A. M. Mood, F. A. Graybill, and D. C. Boes, Introduction to the Theory Of
Statistics. McGraw—Hill, 1974.

[58] R. B. Mueller-Thuns, D. G. Saab, R. F. Damiano, and J. A. Abraham, “VLSI
logic and fault simulation on general-purpose parallel computers,” IEEE Trans-
actions on Computer-Aided Design of Integrated Circuits and Systems, vol. 12,
pp. 446—460, March 1993.

[59] B. Nandy and W. Loucks, “An algorithm for partitioning and mapping conserva-
tive parallel simulation onto multicomputers,” in 6th Workshop on Parallel and
Distributed Simulation, vol. 24, pp. 139—146, January 1992.

[60] B. Nandy and W. M. Loucks, “On a parallel partitioning technique for use with
conservative parallel simulation,” in Proceedings of the 7th Workshop on Parallel
and Distributed Simulation (PADS ’93), vol. 23, pp. 43—51, May 1993.

[61] J. R. Nickolls, “The design Of the MasPar MP-l : A cost effective massively
parallel computer,” in Proceedings of the 35th IEEE COMPCOM Spring 1990,
pp. 25—28, February 1990.

[62] D. M. Nicol, “Parallel discrete—event simulation of fcfs stochastic queueing net-
works,” in SIGPLAN, vol. 23, pp. 124—137, September 1988.

[63] D. M. Nicol, “The cost of conservative synchronization in parallel discrete event
simulations,” Journal of the ACM, vol. 40, pp. 304—333, April 1993.

[64] D. M. Nicol, C. C. Micheal, and P. Inouye, “Efﬁcient aggregation of multiple lps
in distributed memory parallel simulations,” in Winter Simulation Conference
{WSC ’89), pp. 680—685, 1989.

[65] The Oak Ridge National Laboratory, PVM 3 User’s Guide and Reference Man—
ual, May 1994.

[66] D. Reed, A. Malony, and B. McCredie, “Parallel discrete event simulation using
shared memory,” IEEE Transaction on Software Engineering, vol. 14, pp. 541—
553, April 1988.

197

[67] S. I. Resnick, Adventures in Stochastic Processes, ch. 1. Birkhauser Boston, 1992.

[68]

R. Ronngren, R. Ayani, R. Fujimoto, and S. Das, “Efficient implementation Of

event sets in Time Warp,” in Proceedings of the 7th Workshop on Parallel and

Distributed Simulation (PADS ’93), vol. 23, pp. 101—106, May 1993.

[69] A. E. Ruehli and G. S. Ditlow, “Circuit analysis, logic simulation, and design

[70]

[71]

[72]

[73]

[74]

[75]

[76]

[77]

[78]

[79]

[80]
[81]

veriﬁcation for VLSI,” in Proceedings of IEEE, vol. 71, January 1983.

B. Samadi, Distributed Simulation, Algorithms and Performance Analysis. PhD
thesis, University of California, Los Angeles, 1985.

D. D. Sleator and R. E. Tarjan, “Self-adjusting binary search trees,” Journal of
the ACM, vol. 32, pp. 652—686, July 1985.

S. P. Smith, B. Underwood, and M. R. Mercer, “An analysis of several approaches
to circuit partitioning for parallel logic simulation,” in Proceedings of the 1987
International Conference on Computer Design, pp. 664—667, 1987.

L. M. Sokol, P. A. Mutchler, and J. B. Weissman, “The role of event granularity
in parallel simulation design,” in Proceedings of the 6th Workshop on Parallel
and Distributed Simulation (PADS ’92), vol. 24, pp. 178—185, January 1992.

L. M. Sokol and B. K. Stucky, “MTW: Experimental results for a constrained
optimistic scheduling paradigm,” in Proceedings of the SCS Multiconference on
Distributed Simulation, vol. 22, pp. 169—173, January 1990.

L. Sokol, B. Stucky, and V. Hwang, “MTW: A control mechanism for paral-
lel discrete simulation,” in Proceedings of the 1989 International Conference on
Parallel Processing, vol. 3, pp. 250—254, August 1989.

C. Sporrer and H. Bauer, “Corolla partitioning for distributed logic simulation of
VLSI-circuits,” in Proceedings of the 7th Workshop on Parallel and Distributed
Simulation (PADS ’93), vol. 23, pp. 85—92, May 1993.

J. Steinman, “A distributed emulation of space communications for the strate—
gic defense system,” in In Proceedings of the Twenty-First Annual Pittsburgh
Conference on Modeling and Simulation, vol. 21, May 1990.

J. Steinman, “SPEEDES: A multiple-synchronization environment for parallel
discrete-event simulation,” International Journal in Computer Simulation, vol. 2,
pp. 251—286, 1992.

J. S. Steinman, “Breathing Time Warp,” in Proceedings of the 7th Workshop on
Parallel and Distributed Simulation (PADS ’98), vol. 23, pp. 109-118, May 1993.

H. S. Stone, High-Performance Computer Architecture. Addison-Wesley, 1987.

C. B. Stunkel, D. G. Shea, and B. Abali, “The SP2 communication subsystem,”
in http://ibm.tc.c0rnell.edu/ibm/pps/doc/, pp. 1—24, August 1994.

198

[82] C. B. Stunkel, D. G. Shea, B. Abali, M. G. Atkins, C. A. Bender, D. G. Grice,

[83]

[84

l—4

[85]

[86]

[87]

[88]
[89]

[90]

P. Hochschild, D. J. Joseph, B. J. Nathanson, R. A. Swetz, R. F. Stucke, M. Tsao,
and P. R. Varker, “The SP2 high—performance switch,” IBM System Journal,
vol. 34, no. 2, pp. 185—204, 1995.

C. B. Stunkel, D. G. Shea, D. G. Grice, P. H. Hochschild, and M. Tsao, “The
SP1 high—performance switch,” in Proceedings of the Scalable High Performance
Computing Conference, pp. 150—157, May 1994.

Thinking Machines Corporation, The Connection Machine System, May 1988.

Thinking Machines Corporation, Connection Machine Model CM-2 Technical
Summary, 1990.

A. I. Tomlinson and V. J. Garg, “An algorithm for minimally latent global virtual
time,” in Proceedings of the 7th Workshop on Parallel and Distributed Simulation
(PADS ’93), vol. 23, pp. 35—42, May 1993.

S. J. Turner and M. Q. Xu, “Performance evaluation of the bounded Time Warp
algorithm,” in Proceedings of the 6th Workshop on Parallel and Distributed Sim—
ulation (PADS ’92), vol. 24, pp. 117—126, January 1992.

University Of Tennessee, MPI: A Message-Passing Interface Standard, June 1995.

D. B. Wagner and E. D. Lazowska, “Parallel simulation of queueing networks:
Limitations and potentials,” in Proceedings of the 1989 ACS SICMETRICS and
PERFORMANCE, vol. 17, pp. 146—155, May 1989.

Z. Xu and K. Hwang, “Modeling communication overhead: MPI and MPL per-
formance on the IBM SP2,” IEEE Parallel 8 Distributed Technology, vol. Spring,
pp. 9—23, 1996.

 

 

”I[llllllllll[llllllll’