3 C 3..
.3.) .(V

(.1:
.. x..-
.lnva 1...?

inﬂate) ..

s... e
T 2...?
53;? . .. x

13-5.. .

‘ 23.x.
«a. 21...“... "HR
. 4:55

..
. 2...}. ..
a” human... huh“.
. A I. l. . .

in, \
53.3.3; I .6... .
.i. . v V €31
arias. ~41

 

1.. .LLT‘ a:

 

 

 

 

 

 

 

- r
,.,,..

h
p

1.
ix. 1., . .
.i.. .. .3 ﬁx...)
I~A~:! .
d HF; ,vv .:

 

 

 

 

 

Lu. V. MI? 312. :1

 

THESR-

\ '4

(«00/

This is to certify that the
dissertation entitled

Parallel Discrete Event Simulation and Its Application on Logic
Simulation

presented by
Jinsheng Xu

has been accepted towards fulﬁllment
of the requirements for the

Doctoral degree in Computer Science

as

 

 

 

Major‘Pr’ofessor’s Qﬁpbture
ﬁe..- - 7, 100.1
Date

MSU is an Afﬁrmative Action/Equal Opportunity Institution

_.—-—.—.—-—._ —._--.---.-.-.-.-.-.-.-.---

.-o-o—o—-._. .—

N LIBRARY |

Michigan State
University

 

 

 

PLACE IN RETURN Box to remove this checkout from your record.
To AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE DATE DUE DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

6/01 cJCIRC/DateDuopes-sz

 

PARALLEL DISCRETE EVENT SIMULATION AND ITS APPLICATION ON
LOGIC SIMULATION

By

Jinsheng Xu

A DISSERTATION

Submitted to
MICHIGAN STATE UNIVERSITY
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

2002

ABSTRACT
PARALLEL DISCRETE EVENT SIMULATION AND ITS APPLICATION ON LOGIC
SIMULATION
By

Jinsheng Xu

The purpose of this PhD dissertation is to investigate new and efﬁcient paradigms for parallel
discrete event simulation on massively parallel processors and its application on parallel logic
simulation. We ﬁrst develop performance prediction models of parallel simulation. Using the
performance models, we will identify critical factors for improving the performance of
simulation. Among them, we will consider load balancing, communication cost, computational

granularity, activity rate, and rollback rate.

Based on the importance Of these factors, we will investigate new and improved approaches Of
parallel simulation. The first approach we consider is an aggressive synchronous algorithm to
reduce the simulation cycles. This method reduces the communication cost and speedup of the

simulation signiﬁcantly. Next, a look-ahead technique will be investigated to reduce the state

savings and the event queue size in Time Warp. The technique reduces the number of State
savings by skipping state savings that is not necessary. Event queue and state queue sizes can
be reduced by discarding the unnecessary events and states immediately. Finally, we will
develop a unifying simulation framework that exploits the advantages of synchronous and
conservative simulation techniques while keeping the optimism. The performance of this
approach is always better than Time Warp and comparable to the best-case performance of

synchronous and conservative simulation.

DEDICATION

To my grandparents, parents, wife and son.

iv

ACKNOWLEDGMENTS

The author wishes to thank my adviser, Dr. Chung, for his generous guidance during my
Ph.D. study. I learned many things from him. I also would like to thank my committee
members, Dr. Esfahanian, Dr. Wojcik and Dr. LePage. I truly enjoyed the classes from Dr.

Esfahanian and Dr. Wojcik. With the help Dr. LePage, I found a serious ﬂaw in on my papers.

I would like to thank my entire family members for their support and love. Without them, I

could not have come this far.

TABLE OF CONTENTS

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

LIST OF TABLES .................................................................................. viii
LIST OF FIGURES .......................................................................................................... ix
Chapter 1. Introduction ...................................................................... .- -- 1
1.1 Background - .................. 1

1.2 Introducu'on to Parallel Discrete Event Simulation ........................................ 3
1.2.1 Discrete Event Simulation - . - - 3

1.2.2 Synchronous Simulation ............................................. - ........ . 4

1.2.3 Conservative Simulation ........................................ 4

1.2.4 Optimistic Sirnulation . - 5

1.3 Introduction to Parallel Logic Simulation ........................ - -- 6

1.4 Contributions and Outline of Thesis - - - ...... 7

1.4 1 Object Modeling ........................................... 7

1 ..4 2 Estimating communication parameters .................................................. 7

1.4 2 Synchronous Performance Prediction Model ....................................... 8

1.4.3 Cycle Reducing Synchronous Algorithm 9

1.4.4 An Overhead Reducing Technique for Time Warp -- .. 9

1.4.5 Time Warp Performance Analysis 10

1.4.6 Efﬁciently Unifying Simulation Techniques - - - 11
Chapter 2. Preparatory research .............................................................. 13
2. 1 Object Modeling ......... 13
2.1.1 Object Model Of the Simulation Engine ....... - . -- 15

2.1 .2 Simulation Engine: Logical Process and Simulation Object ............. 20

2.2 Measuring the Communication Cost ................. 21
2.2.1 Introduction ........................................................... 21

2.2.2 Measuring the parameters of the LogP Model - -- ..... -- - 23

2.2.3 All-to-All communication ....................................................................... 29

2.2.4 Minimum hardware support for multicasting? .................................... 34

2.2.5 Conclusion .................................................................................................. 35
Chapter 3. Syncrhonous simulation ............................................................................... 36
3.1 Performance Prediction ..................................................................................... 36
3.1.1 Introduction ............................................................................................... 36

3.1.2 The Model .................................................................................................. 40

3.1.3 Prediction Model due to Load Balancing factor ................................. 44

3.1.4 Performance Model .................................................................................. 50

vi

3.1.5 Improved Synchronous Algorithm ........................................................ 61

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

3.1.6 Conclusion .................................................................................................. 64

3.2 Cycle reducing simulation .................................................................................. 65
3.2.1 Introduction ............................................................................................... 65

3.2.2 Performance Evaluation Model ............................................................. 67

3.2.3 Optimistic Synchronous Algorithm ...................................................... 69

3.2.4 Experimental Results ................................................................................ 70

3.2.5 Conclusion .................................................................................................. 75
Chapter 4. Optimistic Simulation ................................................................................... 77
4.1 Reducing the overhead of optimistic simulation ........................................... 77
4.1.1 Introduction- ............................................... 77

4.1.2 Algorithm .................................................................................................... 80

4.1.3 Performance Analysis .................................................... 92

4.1.4 Benchmark Results ................. -_ - - - - - 95

4.1.5 Conclusion - -. .................................................... 99

4.2 Performance Analysis - ......................... 100
4.2.1 Introduction ................................... - - 100

4.2.2 Communication Model ................. - ....... - - . - 103

4.2.3 Performance Analysis ................................................. . 103

4.2.4 Average Dependency Gap .................................................................... 106

4.2.5 Conclusion __ -_ ....... - . 109
Chapter 5. Uniﬁying Simulation techniques ............................................................... 110
5.1 Introduction ................................................................. 110

5.2 Exploiting Cycle Parallelism - - 112
5.2.1 Cycle Parallelism - ............................. 112

5.2.2 Tweaking the deﬁnition of GVT ........... - 113

5.2.3 Performance Considerations .................. - - - . ...... 114

5.3 Exploiting Conservative Parallelism ............................. 115
5.3.1 Conservative Parallelism ............................................ - - 115

5.3.2 Related Work .......................... 116

5.3.3 Performance Considerations ................................................................. 116

5.4 Unifying Framework. . . ............... - 117

5.5 Experimental Results ........................................................... - ..... 120

5.6 Conclusion .......................................................................................................... 123
Chapter 6. Summary ................................................................................ 125
BIBLIOGRAPHY .......................................................................................................... 129

vii

LIST OF TABLES

 

 

 

 

 

 

 

 

Table 2.1 Result of All-to-All communication for message length 1 ...................... 34
Table 3.1 Multiplication factor for different A and P 45
Table 3.2 Speedup table for different A and P 46
Table 3.3 Number of Executions, cycles and Average Activity Number .............. 47
Table 3.4 Comparison of theoretical and empirical results 48
Table 3.5 The comparisons of predicted and experimental maximum speedup .. 54
Table 3.6 The characteristics of circuits- 72
Table 4.1 Benchmark results of the algorithm on circuit simulation ...................... 96
Table 4.2 Best speedup comparisons 99
Table 4.3 TWW+Toverhead of 535932 with various granularities 104
Table 4.4 AAN and ADG of 535932 and multi32 108
Table 5.1 Mapping of simulation techniques. 120
Table 5.2 Maximum speedup with up to 16 processors 123

 

viii

LIST OF FIGURES

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 2.1 Object Models ................................................................................................ 16
Figure 2.2 Example of a hierarchical model- - - - - .. 18
Figure 2.3 Logical Process ................................ - ........................... 20
Figure 2.4 Time diagram for benchmark 1 - -- ........................ 23
Figure 2.5 Time diagram for benchmark 2 with X > 2*L+Or+Os ........................ 24
Figure 2.6 Experimental Result of Benchmark 1 and 2. .................. 26
Figure 2.7 Result of Benchmark 3 — No.1 .- - 28
Figure 2.8 Result of Benchmark 3 ~ No.2 ...................... - - 28
Figure 2.9 Result of Benchmark 3 - No.3 ........................................ . 29
Figure 2.10 Time diagram for Algorithm 1 .................................................................. 31
Figure 2.11 Execution Sequence for processor 0- - _ 32
Figure 2.12 Time diagram for Algorithm 2 - ....... 33
Figure 3.1 Synchronous algorithm our model based on - 40
Figure 3.2 Speedup without communication 49
Figure 3.3 Message overhead ............. . - - 51
Figure 3.4 Least Square ﬁt to a linear formula - - 51
Figure 3.5 PSC TCS vs. SGI Origin 2000 ............................... - 56
Figure 3.6 The effect of circuit size -- - _ ................ 57
Figure 3.7 The effect of granularity. ................................ - .................. 58
Figure 3.8 Improved Synchronous Algorithm - ..... - 62
Figure 3.9 Benchmark results of improved synchronous algorithm ....................... 63
Figure 3.10 Synchronous and optimistic synchronous simulations ........................ 74
Figure 3.11 The speedup comparison of unit and non-unit delay. - 75
Figure 4.1 An illustration of the algorithm (GVT=1) 82
Figure 4.2 Event 6" was received by P' after P' executed Q} and generated e0 ........ 86
Figure 4.3 Event 6" was received by P' before P' executed eo' and generated :0 ..... 88
Figure 4.4 Three performance cases of the algorithm (GVT=2) 92
Figure 4.5 Performance gains from GVT computation (GVT=2) 93
Figure 4.6 Improvement on the speedup of the algorithm. .......... -- 98
Figure 4. 7 The effect of event packing on 535932 and multi32 ....... 106
Figure 4. 8 Performance comparison of 535932 and multi32 107
Figure 5.1 An LP can switch back and forth from three execution modes ......... 118
Figure 5. 2 Module diagram of simulation kernel ................... - - -119
Figure 5.3 The comparison of experimental results ................................................. 123

C/Japterl

1. INTRODUCTION

1.1 Background

With the increasing complexity of microelectronics systems, validating various models (or
virtual prototypes) becomes critically important. There are two approaches to validating the
correctness of VLSI (Very Large Scale Integration) design: veriﬁcation techniques and
simulation. Of these two approaches, Simulation is still the primary tool. Simulation can be
used in a hierarchical fashion. At the top level, performance models are used to explore the
trade off between various hardware components and architectures. Behavioral simulations are
used to prove the correctness of the system speciﬁcation. At a lower level, timing simulation is
used to validate the correct timing information such as set-up hold times, and is the only
practical stool available at this level. For large microelectronics systems, simulation has become
a very time—consuming and critical part of VLSI design. Typically, simulation constitutes about

80% of the design cycle.

VHDL (Very High Speed Integrated Circuit Hardware Description Language) [3]s is an IEEE

standard hardware description language developed by the DOD. It can be used to describe
1

systems at both behavioral and structural levels. It can also be used to describe a performance
model. It is expected that we need to simulate a system up to 100,000 VHDL processes to
accurately model a start-of-the-art system. Simulating such a large system can be extremely

slow in a sequential machine.

Parallel logic Simulation has recently attracted a considerable amount of interest. However,
most research is restricted to a MIMD (Multiple Instruction/ Multiple Data) machine with a
small number of processors. Moreover, there are only a few benchmark results available based
on actual simulation of large circuits. The few existing empirical results are based on MIMD
machines with only a small number of processors, typically tens of processors. Thus, the
speed-ups attained compared to sequential simulation are severely limited. With the hardware
resources available for High Performance Computing, such as the 336 node Intel Paragon, the
400 node IBM SP2 and the 2000 node TCS, parallel simulation on Massively Parallel
Processors (MPP) is now feasible which can achieve a speed-up of up to several hundred times

compared to sequential simulation.

Fast parallel Simulation of large VLSI systems enables the designer to validate the correctness
of the design. The usage of behavioral Simulation in an earlier stage of the VLSI system design
can prove the functional correctness. By speeding up the performance model simulation,
designer can carry out extensive HW / SW trade-off analysis, and allow designers to select the
best hardware architecture. The structural level simulation can be used to validate the
correctness in logic level and timing. Fast simulator will allow designers to validate various
virtual prototypes and to test complete fault coverage analysis. Thus, the design cycle can be

shortened, and the number of real prototypes to be constructed can be reduced.

Although massive parallel simulation is a promising technique, it is not a widely used technique
in validating the VLSI design. One of the main reasons is the performance uncertainty of this
technique. This thesis concentrates on investigating new efﬁcient techniques for parallel

simulation.

1.2 Introduction to Parallel Discrete Event Simulation

A comprehensive introduction to parallel discrete event simulation can be found in [39]. In

this section, we are going to introduce some basic concepts in this ﬁeld.

1.2. 7 Dircrete Event Simulation

In time driven discrete simulation, simulated time is advanced in time Steps. If events are
dispersed over time, the time driven simulation is inefﬁcient. In event driven systems,
evaluation is driven by the events. Discrete event simulation technique is widely used in VLSI

validation, telecommunication, military training and many other areas.

In sequential Simulation, the simulation engine keeps a sorted list of events according to the
time Stamp. At each Step the smallest time stamped event is retrieved and processed. New
events are inserted into the event queue. This process continues until the event queue is empty

or simulation has reached pre-deﬁned end time.

Parallel Simulation consists of a collection of Logical Processes (LPS) that communicate by
exchanging time stamped events. There are three categories of parallel simulation techniques:

synchronous, conservative and optimistic.

7.2.2 S Jno/Jronour Simulation

Synchronous simulation is the simplest parallel simulation technique. At each Simulation cycle,
all processors execute the events with the smallest time stamp. After the execution of these
events, all processors exchange new events and decide the next smallest time stamp. This
technique is referred as global-clock algorithm. This approach has very small overhead
compared with other advanced simulation techniques. Because the synchronization cost is

high, it does not scale well as the number of processors increases.

7.2.3 Conservative Simulation

The conservative simulation is also referred as Chandy-Misra-Bryant [13,17] algorithm.
Conservative and optimistic simulation techniques are asynchronous algorithms. Unlike
synchronous simulation, conservative simulation advances simulation time at individual logical
processes. Conservative simulation preserves the causality of event executions by processing

events in strictly non-decreasing order in a logical process.

The safeness of an event execution is decided by the calculation of local virtual time (LVT).
Each generated event is accompanied with the LVT of the sending LP. The receiver LP knows
that it will not receive any event in the future with smaller time stamp than the LVT from the

sender. Each LP can safely execute the events with time stamp smaller than the LVT value of
4

all input channels. Deadlock may occur when there is cycle of block LPs each waiting for its
predecessor to update its clock. Null-message is used to avoid deadlock but it can add
signiﬁcant overhead on the performance. Chandy and Misra [18] also proposed a two-phase
method to detect and resolve the deadlock. Deadlock can be broken by the Observation that
events with smallest time stamp are always safe to process. Good lookahead is critical for the

performance Of conservative algorithm.

7.2.4 Optimirtio Simulation

The original optimistic Simulation is proposed as Time Warp algorithm by Jefferson [45]. It
can explore more parallelisms than the conservative Simulation by allowing the out of order
event executions. An event can be executed immediately. Some of these event executions may
not be correct and need to be cancelled. Rollback mechanism is used to recover from error
executions. When a straggler event arrives, an LP needs to be recovered to the state before
time stamp of the straggler. This requires frequent state savings. GVT (Global Virtual Time) is
used to discard the states that are no longer necessary to stay in memory. GVT is deﬁned as
the minimum time stamp of all events in transit or in the input queues of all logical processes.
No new event generated can have a smaller time stamp than GVT. Therefore all states except

one that has smaller time stamp than GVT can be discarded safely.

1.3 Introduction to Parallel Logic Simulation

Simulation is a signiﬁcant bottleneck in the development of VLSI systems. There have been
considerable efforts to accelerate the simulation using parallel discrete event simulation
techniques. Asheden[4] at al investigated the feasibility of synchronous simulation on VHDL
models. They concluded that high level models with large event execution granularity is
suitable for small scale multi—computers and low level models with small event execution
granularity is not suitable for parallel execution. They used very small benchmarks with largest

number of objects only around 150. The parallelism is limited by their own sizes.

Soule [70] studied parallelisms of digital logic simulation. For the synchronous parallel
Simulation method, the element evaluation concurrency ranged from 156 to 275. For
conservative method, the concurrency ranged from 42 to 196. Soule and Gupta [72] evaluated
the Chandy-Misra algorithm for digital logic Simulation. They found that overhead of Chandy-
Misra algorithm overwhelm its advantage and the performance is about three times slower
than traditional parallel event—driven algorithm. Konas [47] studied inherent parallelisms of
logic circuits with Event Precedence Graph (EPG) and compared them with synchronous
parallelisms. For the seven largest ISCAS-89 [11] circuits the average EPG parallelisms ranged
from 2600 to 23873 and average synchronous parallelisms ranged from 957 to 13481. Konas
and Yew [48] compared the performance of a synchronous simulation to the conservative and
optimistic algorithms in simulating a synchronous multiprocessor system. Their results Show
that the synchronous method is considerably faster than both of the asynchronous approaches.

Briner [12] implemented optimistic simulation and they achieved speedup of 23 with 32

processors. Konas [47] concluded that the state saving cost has to be reduced signiﬁcantly for

optimistic simulation to perform better than synchronous algorithm.

There is no known implementation that performs consistently well independent of the circuits.
It is not clear that which parallel simulation technique is most appropriate for logic simulation.

A survey on this topic can be found in Chamberlain’s paper [16].

1.4 Contributions and Outline of Thesis

7.4.7 Object Modeling

We developed an object model for parallel simulation. This object model describes the
behavior of systems to be simulated independently from the simulation engines. This model
enables object-oriented modeling of complex systems. The modeler is able to model the
objects in stepwise reﬁnement. Unlike many modeling techniques that use a new modeling
language, with this modeling technique, modeler can efﬁciently model the objects with our

C++ class libraries. Our object model technique is described in chapter two of this thesis.

7.4.2 Eitimating communication parameter:

LogP [31] model captures major parameters of a distributed memory multiprocessor. We
propose a technique that measures the parameters of LogP model. These parameters are
measured by carrying out several benchmarks. All the benchmarks are executed on IBM SP2
using MP1 (Message Passing Interface) [57] library. The result shows high overhead and

relatively small hardware latency. Exploiting these characteristics, an All-to-All [57]
7

communication, which has much smaller number of steps, is proposed. The better result of
this algorithm over the MPI’s MPI_Alltoall() implementation conﬁrms that software overhead
is high in SP2. The results of this research give us idea how signiﬁcant the communication cost
is for parallel simulation and how to reduce the communication overhead. In the second part

of chapter two, we give detailed explanations of this topic.

7.4.2 S jncbronou: Performance Prediction Model

We propose a model to predict the performance of synchronous logic simulation. This model
considers the factors including load balancing, event execution granularity and communication
cost. We derive a single formula that predicts the performance of synchronous simulation. We
have benchmarked several ISCAS[10,11] circuits on SGI Origin 2000. The benchmark results
show that the prediction model explains more than 90% of parallel simulation execution time.
We propose a formula that predicts the upper limit of the achievable speedup. We found that
the maximum speedup achievable is proportional to the square root of average number of
active objects per cycle. This model can be used to predict the speed-up expected for
synchronous simulation, and to decide whether it is worthwhile to use synchronous simulation
before actually implementing it. We also propose an improved synchronous algorithm that is
able to improve the load-balancing factor. The benchmark results Show that the improved
algorithm is able to reduce the load-balancing factor as well as the number of simulation cycles.

We are going to elaborate this research in the ﬁrst part of chapter three.

7.4.3 Cycle Reducing Synchronous Algoritlym

Logic simulation has very small computation granularity. The communication cost is the
signiﬁcant factor that limits the speedup of synchronous logic simulation. The benchmark
results showed performance improvement from load balancing of synchronous simulation is
very limited. To Signiﬁcantly accelerate the performance of synchronous simulation, we have

to reduce the communication cost signiﬁcantly.

Based on the previous prediction model we developed, we propose an improved prediction
model that integrates the number of communication cycles. We found that the number of
simulation cycles can be reduced by pro-executing safe or unsafe events that should be
executed in later cycles. Experimental results show that pro-execution safe events cannot
reduce the simulation cycle signiﬁcantly. We propose an optimistic synchronous algorithm that
executes unsafe events optimistically. Experimental results Show that this approach
signiﬁcantly reduces the number of communication cycles while keeping the rollback rate
within a small range. The performance is signiﬁcantly better than the traditional synchronous

algorithm. Detailed research on this topic can be found in the second part of chapter three.

7.4.4 An Ooerbead Reducing Teclmique for Time Warp

We introduce a technique that reduces the number of state savings and the event queue size of
Time Warp [27]. By reducing the state saving and the sizes of event queues, we decrease the
overhead of Time Warp, and the maximum memory requirement. We exploit the look-ahead
technique to get a lower bound time Stamp of the next event and determine if an event is safe

to execute. If an event execution is safe, no state saving is carried out. The saved States that

have smaller time stamp than this lower bound can be safely discarded without waiting for the
updated value of Global Virtual Time (GVT). We prove that the technique is correct under
both aggressive and lazy cancellation scheme. This technique can be implemented with
minimal additional overhead. Benchmark results on logic simulation Show that the mechanism
can reduce the number of state savings and maximum memory size signiﬁcantly. We
conducted experimental studies on logic simulation. The results show the signiﬁcant
improvement in the number of state savings as well as maximum state queue size. We also
observed improvement in execution time of simulation caused by less number of state savings.
Our benchmark shows that the proposed technique improves memory usage 18-85% and
maximum speed-up up to 5—30%. The detailed research on this topic is discussed in chapter

four of this thesis.

7.4.5 Time Warp Performance Anabsis

We analyze the performance of Time Warp simulation. We break down the performance of
Time Warp into three parts: computation, communication and idle time. We proposed Average
Depending Gap (ADG) that measures the parallelism that can be exploited by Time Warp
algorithm. Compared with critical path analysis, this measure gives the value of a more
practical parallelism. The detailed research on this topic is given at the second part of chapter

four.

10

7.4. 6 E ﬂicientbr U ngﬁing Simulation Tecbniques

We study a unifying technique that is able to exploit the advantages of synchronous,
conservative and optimistic simulations [86]. The unifying framework is based on optimistic
simulation technique. The optimized optimistic simulation exploits synchronous parallelism by
tweaking the deﬁnition of GVT. Conservative parallelism is exploited by embedding
lookahead computation into the optimistic simulation. In the uniﬁed framework, a logical
process may execute in synchronous mode, conservative mode and optimistic mode.
Synchronous mode and conservative mode executions have smaller overhead than optimistic

mode execution because state saving can be skipped and all saved states can be discarded.

An LP can switch back and forth from synchronous, conservative and optimistic mode. An
LP is eligible to execute in synchronous mode when the local virtual time is equal to GVT. An
LP is eligible to execute in conservative mode when the local virtual time is smaller than the
lower bound time stamp of future event the LP may receive. An LP executes in optimistic
mode when it does not satisfy the conditions to be executed in synchronous mode or
conservative mode. In optimistic mode current state needs to be saved because future rollback

is possible.

The simulation kernel consists of scheduler, communication manager and GVT manager.
Scheduler decides which LP to execute and in which mode to execute. GVT Manager is
responsible for computing the value of GVT. The frequency of GVT computation is decided
by the scheduler. Communication manager is responsible for sending and receiving messages
from other processors. Scheduler decides how many events to be buffered before sending

events.

11

In optimized optimistic Simulation, scheduler gives priority to events that can be executed in
either synchronous mode or conservative. When no synchronous mode or conservative mode
execution exists, scheduler picks an LP to execute in optimistic mode. In the case of large cycle
parallelism exists, the optimized optimistic simulation will execute in low overhead mode and it
will perform as good as synchronous sirnuladon. This is also true for the case of conservative
simulation. When neither synchronous nor conservative simulation performs well, the

optimized optimistic Simulation can exploit parallelism of optimistic Simulation.

Selecting a good GVT computation is critical for exploiting the cycle parallelism. The
frequency of GVT update should be adaptive. Frequent GVT computation can help LPs skip
state saving by exploiting more parallelisms, but it will increase the communication cost of the
system. Infrequent GVT computation has smaller communication cost but it exploits less cycle
parallelism. The overhead caused by frequent GVT computation should not be larger than the
event execution cost. On the other hand, GVT computation needs not be frequent if there

exists large cycle parallelism.

We benchmarked our uniﬁed framework on the Terascale Computing System at the
Pittsburgh Supercomputing Center. We found that when the number of processors is small,
the performance of the uniﬁed approach is close to the synchronous simulation and much
better than optimistic simulation. When the number of processors is large, the performance of
the uniﬁed approach is signiﬁcantly better than both synchronous algorithm and the optimistic

Simulation. Detailed explanations on this research are stated in chapter ﬁve of this thesis.

12

C/Japter 2

2. PREPARATORY RESEARCH

2.1 Object Modeling

In practice, Object modeling aims to be suitable for the representation and manipulation of
complex objects of engineen'ng design. In most current Simulation systems, users must know
the parallel platforms and the simulation technique to develop the model. As a new simulation
technique is invented and new hardware introduced, the models they developed are not
reusable, and must be modiﬁed. Moreover, the modeling techniques are different for each

application domain.

There has been much research work in modeling simulation. Several simulation languages have
been proposed including Simula, GASP, GPSS, CSIM, and Sim++. In general, they can be

classiﬁed into three different approaches:

0 an approach which is limited into a speciﬁc application domain,
0 a ﬁxed model for discrete event simulation

0 a new simulation language to model the objects

13

In these approaches, the application domain is limited and the system is not extensible.
Modifying Simulation models frequently requires changing the simulation scheme employed.
Moreover, adding a new simulation mechanism or modifying data structures may affect the
models already developed. For example, the Tyvis Simulator is speciﬁcally targeted for parallel
VHDL simulation using Time Warp. In this approach, all objects are a subclass of Time Warp
objects. Therefore, adding a new simulation scheme requires changes of all models already
developed. DEVS [88] has been proposed by Ziegler as a model of distributed event
simulation. All objects must be modeled under DEVS formalism. Even though this approach
may have the advantage of introducing the formalism, it puts a burden on the modeler in a
certain application domain. Simulation languages such as Yaddes[63], Maisie[7], SIMA[67] and
MOOSE[30,40] have been proposed to aid the user to develop models. These languages are
also associated with a ﬁxed set of pre-selected set of simulation mechanisms. Yaddes is a
distributed event driven speciﬁcation language that resembles Yacc and Lex. A Yaddes
program is translated into a C program which is later linked together with a run time support
library. SPEEDES [75] is a C++ based simulation environment developed at the jet
Propulsion Lab which supports sequential, Time Warp [45], and Time Bucket [75] Algorithms.

In SPEEDES, end users may adjust a predeﬁned set of parameters to improve the speed.

Only a few research results have been reported on the modeling and the performance of
parallel simulators. Maisie is developed for efﬁcient execution of the simulator. Depending on
the hardware platforms, the complexity of the model, and the frequency of messages
generated, the performance of parallel simulation varies greatly. Most of the parallel Simulators

discussed above require the end user to provide their own models as the simulation scheme

14

changes. The only exception is Maisie. In Maisie, objects must be deﬁned using Maisie

language.

Experience has shown that Simulation is evolutionary in nature. While requirements change,
the system being simulated also changes. The modeling Should be less resistive to such
changes, so the maintenance of the evolving system will be much easier [26]. In this section,
we outline our simulation engine that is developed using an object-oriented and layered

approach. Our proposed object modeling technique has the following features.

0 It allows users to model complex objects that consist of different type of objects with

various level of complexity.

0 It is simple so that the modeler should be able to model the objects in stepwise reﬁnement.
The object modeling technique must provide a way of hiding the information so that

modelers and users are not overwhelmed by the details of the objects.

0 It separates models and simulation scheme so that modelers should be able to develop

models without knowing the simulation scheme employed.

0 It provides a mechanism for parallel programmers to develop library modules and data

structures independent from simulation models.

2. 7.7 Object Model of tbe Simulation Engine

Our models are cleanly separated from the execution environment, and parallel platform.
There are three types of objects in our proposed modeling: Application Objects, Simulation Objects,
and Simulation S cbeme Objects. Application Objects model the behavior of objects in the
application domain to be simulated. They can be developed by modelers of a speciﬁc

application domain, and are independent

15

Simulation Scheme Objects

 

 

Application Objects [TW Kemej ] [(‘M Kerncﬂ Sync Kant-l

CPU [TWObj] [CMOBﬂ EyncObj ]

 

 

 

 

 

[MMX] [Pentium] [ Intel 486DX]

 

 

Parallel Programmer

 

[166MHz] [200MHz]

Model developer [mFH-Lple Inlyfritance j
\ I

 

 

Figure 2.1 Object Models

from the simulation schemes and hardware platforms. Simulation Scheme Objects depend on
simulation scheme used, and include all library modules necessary to run the Simulator on a
parallel platform. They form the core of the simulation engine, and are developed by parallel
programmers with expertise in Time Warp and Chandy-Misra schemes [59]. The system to be
simulated consists of many Application Objects. An end user selects the Simulation scheme to
be used, and instantiates Simulation Application Objects (Simulation Objects in Short). As
Shown in Figure 2.1, a simulation object inherits behaviors and properties from Application
Objects, and parallel run time methods from the Simulation Scheme Objects. For example,
166MHz CPU and 200MHz CPU inherit their instruction set from the Pentium class object.
To Simulate a computer system using Time Warp, an end-user instantiates an object that
inherits its behavior from the 200MHz Pentium object, and its run-time environment and data

structures from TW-OB], a Time Warp object.

The modelers populate the libraries of object models independent from the simulation

scheme. The rationale of our approach is to allow the front end user to use the application

16

objects developed by the modeler freely. Our object modeling technique provides freedom to
three parties of the simulation arena: modelers, parallel programmers, and front end users.
Parallel programmers (simulation scheme experts) may concentrate on providing various
parallel Simulation methods. In other words, once the modeler develops Application Objects,
the same Objects can be simulated with other simulation schemes, without remodeling them

for the new simulation scheme.

Application Objects are organized using the inheritance mechanism among them. Therefore
the modeler, with minimum knowledge of modeling language and object oriented modeling, is
able to create a library of objects. The front-end user, who simulates the system, mixes and
matches the models developed by modelers and the simulation scheme developed by parallel
programmers. The simulation scheme class deﬁnes the data and methods that each Simulation
object needs to operate within the system. This class can be viewed as the kernel of the
simulation and any type of the simulation domain can use that scheme via its interfaces.
Depending on a particular simulation scheme, a simulation scheme object is created. The

simulation scheme objects include all the necessary methods for that speciﬁc scheme.

A Simulation Application Object (Simulation Object) is an instance of a class that is obtained
from integration of the Application Object class and Simulation Scheme class. It contains the
implementation of the methods of the simulation scheme inherited from Simulation Scheme
Objects. It inherits its behavior and attributes from Application Objects. For the simplicity of
implementation and portability, we used the template class approach rather than the multiple
inheritance mechanism. A Simulation Object is responsible for simulating the single object and

generating events.

17

The following class deﬁnition shows the skeleton of an Application Object. Other instance
variables, member functions, and class relationships can be added into the deﬁnition. If the
parent class of the object has member functions of instance variables, the object does not need
to include these ﬁelds. Consider the following example. We want to model an 8-bit ALU in the
design of 8-bit microprocessor. All components of an 8-bit microprocessor share common
characteristics, such as design rules, bus width, minimum gate delays etc. Therefore it is a
subclass of 8-bit components. Suppose that the behavior of the 8-bit ALU has the following
speciﬁcations: 8-bit Input port and Output port, 6-bit control line, and 8-bit addition and
subtraction. Later ALU8M can be modeled which is a special case of ALU8 by providing
multiplication and division operations. This specialization can be accomplished by making

ALU8M a subclass of ALU8, and inheriting operations of ALU8M from ALU8:

 

[ 8-bit components I

/ \.

I Shift Register I l ALU8 l

l

I ALU8M I

 

 

 

 

Figure 2.2 Example of a hierarchical model

Here is the sample code for the ALU object:

class ALU8 {
public: / / Methods that simulates the behavior of the object
void addO;
void subtracto;
void exceptiono;
void executeProcessO; };

18

class UserState : public BasicState { // input and output port of the object
InSignal<int> x[8], y[8], opcode[2]; / / opcode denotes the operation
OutSignal<int> z[8]; };

The method executeProcessO simulates the behavior of the object. This is the main part of the
modeling and the modeler just needs to plug the behavior of the object into the class
deﬁnition, by conforming to the names that are declared within the class. In this sample code,
we declare two other functions, add() and subtractO that are used within executeProcessO

method.

To create a model of an ALU with more functions, such as multiplication and division, we can

simply model by inheriting from ALU8:

class ALU8M : public ALU8 {
public:
void multiplyO;
void divideO;
void executeProcesso; };

To simulate the ALU8M, the behaviors of the 8-bit ALU such as add, subtract, and exceptions
are inherited from the class of ALU8. ALU8M has its own functions, multiply and divide,
which are not deﬁned in ALU8. Moreover, ALU8M can implement its own methods for ADD

and SUB.

The sample C++ code is provided as follows:

void ALU8M::executeProcess() {
switch (opcode) {
case 0: add 0; break;
case 1: subtracto; break;
case 2: multiplyo;break;
case 3: divide0;break;
default: exceptiono;break; }

19

2. 7.2 Simulation Engine: Logical Process and Simulation Object

Each processor is an instance of a logical process (LP), which is the simulation engine of the

processor. Logical Process is responsible for the global ﬂow-control of the simulation.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 2.3 Logical Process

It instantiates the simulation objects assigned to that particular processor. During the
execution of the simulator, LP handles the communication of simulation objects via messages,
schedules events by selecting simulation objects to be executed in next simulation cycle, and
computes the Global Virtual Time. The LP object depends on the simulation scheme. It
consists of Simulation Scheme Objects. Figure 2.3 shows the LP object organization. The
Scheduler schedules events based on scheduling policy. Dispatcher reads input messages from
the buffer and places events into the event queue of the corresponding simulation object.
Messages to be sent are handled by the collector, which aggregate messages and send them

according to the communication policy.

20

2.2 Measuring the Communication Cost

2. 2. 7 Introduction

Several models have been proposed to analysis the complexity of parallel applications. One of
the early models is PRAM (Parallel Random Access Machine). It cannot predict the cost of
parallel applications accurately because it does not model the architecture features of
distributed memory systems that use messages to access the remote data. LogP [31] model
captures major parameters of a distributed memory multiprocessor. LogP model separates the
overhead and latency of a message passing. Overhead is time spent by CPU to transmit or
receive message while latency is the time for the ﬁrst bit of message to arrive at the destination.
This separation is important for analyzing the asynchronous applications. For asynchronous
applications, shorter overhead is desired even at the cost of longer latency. The Stanford
FLASH multiprocessor [50] has a custom node controller called MAGIC that implements all
the data transfers. Because communication overhead (parameter 0) can be off—loaded to
MAGIC, occupancy [44] replace overhead where occupancy means the controller is occupied
while processor can continue computation. If the time of overhead and occupancy are same, in
application’s point of view, the network has lower overhead and longer latency. Although
controllers are off-loading more communication tasks from the processor, in most

multiprocessors currently available processor has to do some message passing tasks.

One method of measuring the parameters of the LogP model is to assume the correctness of
the model and construct several benchmark programs and analyze them using the LogP

model. According to the equation and the experimental result all the parameters can be

21

decided [32]. In this paper, the same technique is used with much simpler Ping-Pong test that
measure the round trip time of a message. All the benchmarks were run at IBM SP2 machine
at the Cornell Theory Center. At ﬁrst message of length 0 is tested. The result is quite
surprising. Most of the communication time is overhead for sending and receiving messages
for IBM SP2. Hardware latency is very small. To test the effect of message length over the
overhead, another benchmark is constructed. As expected the overhead is increasing slowly as
the message length increases with a slight jump every 200 to 300 bytes that may be caused by

the packetization cost.

Long software overhead and usually short latency and high bandwidth, makes it possible for
another approach in implementing the All-to-All communication. In All-to-All
communication, all processors exchange messages with one another. Currently most of the
All~to-All communication is implemented in software and takes O(n) communication steps for
n processor All-to-All communication. Since the overhead is not very sensitive to message
length and bandwidth usually high, a new All-to-All communication algorithm is proposed in
part 4 of this paper. It has only O(logn) steps but each step sends m/ 2 times longer message.
The result from SP2 machine, the new algorithm is much better than the MPI’s MPI_Alltoall()
subroutine for short messages. This conﬁrms that overhead is too large for some parallel
computers including SP2. Many methods are proposed to reduce the overhead including

Active Messages [36] by allowing the user level control over message passing.

In the last part of this section the hardware support of multicasting is discussed. Pure software
implementation is too costly because there are many unnecessary memory copying. By a

special hardware which can recognize the multicasting message and send the same to both

22

direCtions: one to the local memory and the other to another node, the unnecessary memory

copying may be avoided.

2. 2.2 Measuring tbe parameters of tbe LogP Model

2.2.2.7 Communication Bencbmarks

Two communication benchmarks are constructed. The ﬁrst one is the pure Ping—Pong test
that is used to measure the round trip time of a message. The second benchmark embeds a
ﬁxed amount of computation between the send and receive. All the messages are of ﬁxed

length of 0 byte. The details of benchmark, time diagram and analysis are listed below.

Benchmark 1 (Ping-Pong test)

Processor 1:
Start Timer
repeat N times
send a message of size 0 to processors 2
receive a message of size Oﬁom processors 2

Stop Timer

Processors 2:
repeat N times
receive a message of size 0 from processor 7
send a message of size 0 to processors 7

 

 

 

 

Time Diagram:
I .L , i l J I l J ,
Os: ; Ide Or Os Idle Or Time
= = (a)
L l l L l l l H ,
Idle 0: Os Idle 0r Os Idle Time
(b)

Figure 2.4 Time diagram for benchmark 1

23

AnalYSis using LogP model:
The cost of each iteration is Os+L+Or+Os+L+Or=2*(Os+Or+L).
Benchmark 2 (Ping-Pong test with computation)

Processor 1:
Start Timer
repeat N times
send a message cfsize 0 to processors 2
Compute for X time

receive a message of size Oﬁom processors 2

Stop Timer

Processors 2:
repeat N times
receive a message of size 0 from processor 7
send a message of size 0 to processors 7.

Compute forX time

Time Diagram:

 

 

 

 

Osf F X Or Os X Or Tune
3 = (a)
TL"?
[‘1]! III J...l .
Idle 0: Os X Or 03 X Time
(b)

Figure 2.5 Time diagram for benchmark 2 with X > 2*L+Or+Os

Analysis using LogP model:

The cost of each iteration is: Os-l-MAX(X,2*L+Or+Os)+Or.

When X<2*L+Or+Os then the cost will be the same as benchmark 1.

24

2.2.2.2 Computing Os, Or and L
Make X large enough to ensure that X>2*L+Or+Os.

The total execution time for benchmark 2 will be: T2 = N*(Os+Or+X).

\Ve have,

Os+Or=T/N-X (1)

Let T1 be the execution time of benchmark 1.

T1=N*2*(Os+Or+L).

We have,

L=T1/(2*N)-(Os+Or) (2)

2.2.2.3 Experimental Result on IBM SP2
Benchmark 1 and 2, is implemented on IBM SP2 using MPI library. The program is written in

C language.

Figure 2.6 shows the result of Benchmark 1 and 2. When computation time X is 0, that is
Ping-Pong test, the time per iteration is around 117 microseconds. X is increased by increasing
the number of procedure calls to a ﬂoating-point computation subroutine. The time spent on
one subroutine call is deﬁned as a unit. At ﬁrst, when X increases the execution time per
iteration is not increasing. After X is increased to a certain amount, the execution time is
increasing almost linearly. The pure computation time is also increasing linearly to unit

computation. When X is larger than certain amount the difference between two lines is almost

25

constant, that is (Os+Or). From the experimental result, we Observed that the difference is
increasing very slowly as X increases. This may be caused by more context switching
operations due to more interrupts for message passing as X increases. To know the exact
reason, more detailed research should be done in future. In this experiment, (Os+Or) is
estimated by computing arithmetic mean of differences between execution time for
Benchmark 2 and pure computation when X is between 35 and 95 unit. The result is
(Os+Or)=47 microseconds. The parameter L is also decided by formula (2). The result is
L=11.5 microseconds. During the experiment, we found that execution time varies

signiﬁcantly even for fairly large number of iterations. The error of this experiment may not be

 

 

 

 

 

small.
Result of Benchmark 2
350
full
300 ,, BenchmarkZ 45/
/.
g 250 4’ //
g 200 4.
g g 340: Pure Computation
150 ”’f
5 i 4,.e/X
n——-¢—-———¢—
E 100 _,r/
/
50 1' /
D D/ r a . .
0 20 40 so 80 100
Computation time X (Brit)

 

 

 

Figure 2.6 Experimental Result of Benchmark 1 and 2.

26

2.2.2.4 Tbe Eﬂect of Message Lengtb

To test the effect of message length on (Os+Or) , benchmark 3 is designed to meet the
purpose. Starting from message length of 0, it increases message length after executing the

same code for benchmark 2. The detail benchmark and result are shown below.

Benchmark3:

Processor 1:

For (Messagelxngtb=0,MessageUngtb <MaxLengtb;Increase(2’9’MessageLengt/J))
Start Timer
repeat N times
send a message of size MessageLengtb to processors 2
Compute forX time
receive a message ofsize M essageLengtb ﬁom Processors 2
Stop Timer

Processor 2:

For (M essageLengtb _=0,M essageLengtb <MaxLengt/J;I ncrease( c’y’MessageIJngt/o ) )
repeatN times

receive a message of size MessageLengtbﬁom processors 2

send a message ofsize M essageLengtb to processors 2

ComputeforX time
Figure 2.7 shows that Os+Or is increasing as the length of message increases. From Figure 2.8
shows how Os+Or increases relative to message length. Os+Or is increasing almost linearly as
the message length increases. The line is roughly (Os+Or)=47+0.042*MessageLength. Figure
2.9 is a closer look. The Os+Or jumps at every 200 to 300 bytes, which conﬁrms the 255 byte
packet size of Vulcan switch chip in IBM SP2 [73,74]. Messages have to be packetized to ﬁt in

the 255 byte packet limit on Vulcan switch chip.

27

 

 

Execution Tune per Iteration
(MicrosecondJ

600

Result of Benchmark 3 - No.1

 

500 1

 

400 l

8
o

' Length=3060 /
l Length=2060 /

' Length=1060

   
  

 

 

 

 

 

 

 

 

 

 

 

Message Length

200 - .
f Pure Computation
Length=0 ./"
100 ‘-
0 t t r t :
0 20 40 60 80 100 120
Comprﬂation Time 3 (unit)
Figure 2.7 Result of Benchmark 3 - No.1
Result of Benchmark 3 . No.2
250

200 +
"6‘
r:
i

MI 150 if
E’.
.2

§ 100 -
O
4.
e
O

50 .

0 ‘r 1‘ r ‘r t t
0 500 1 000 1500 2000 2500 3000 3500

 

Figure 2.8 Result of Benchmark 3 - No.2

28

 

 

 

Result of Benchmark 3 - No. 3

 

83883
5:1-

Os+0r (Mieroeeeond)
8

10 4»

 

 

 

0 100 200 300 400 500
Message Size

 

 

 

Figure 2.9 Result of Benchmark 3 - No.3

2. 2. 2.5 Dﬂerent communication modes
There are three message sending modes speciﬁed in MP1 standard [57], ready, buffered and

synchronous. Synchronous mode is usually used to send long messages due to the limitation of
system buffer size on the receiver side. In synchronous mode message passing, additional
request and reply is needed before the actual communication begins. This will increase the
overhead of communication cost. Further study is needed to model the long message passing

COSt.

2. 2.3 All-to-All communication
From the above results, we can see that overhead (Os+Or) dominates the total
communication cost for a short message in IBM SP2. Active Messages was proposed to

reduce the overhead of communication by user level control of message passing. There is an

29

implementation of Active Messages on IBM SP2 recently by Chang et al[19]. The overhead was

reduced signiﬁcantly, but it is still large.

All-to-All communication pattern is widely used in many parallel applications to exchange data
among the processors. The efﬁciency of this synchronization cost is very important to get

more speedup for certain applications.

2.2.3.7 Implementation of All-to-All communication

MPI standard supports All-to-All communication with MPI_Alltoall(). In most cases, All-to-
All is implemented in software level based on point-tO-point communication. Algorithm 1 is

the pseudo-code of All-tO-All communication adapted from [77].

/* nun_procs . number of processors */
/* my_proc - id number of processor */
/* (ids numbered 0 co nun_procs—l) */

even := (nodtnun_procs,2) - 0)
if (even) then
m := numgprocs-l
else
m := nungprocs
end if

for I :a 0 to m-l
it (even .or. (I != ny_proc)) then

if (nygproc = n) then
other_proc :=

else it (ny;proc = I) then
other_proc := n

else
other_proc := nodtm+I+I-my;proc,m)

end it

call mp_bsendrecv(other_proc)
endit
end for

Algorithm 1: All-to-All Communication
From Algorithm 1, we can see that there are total (m-l) of one-to-one pair wise exchange

communication routine mp_bsendrecv0 procedure calls. Suppose the network has bi-

30

directional link (IBM SP2 is bi-directional multistage network) so that both sides of pair wise
mp_bsendrecv() can send message independently. Figure 2.10 is a time diagram for the
algorithm for very short message. Since the message length is very small, the effect of

bandwidth or g can be ignored. The total cost of Algorithm 1 is (m-1)*(Os+Or+L).

l‘iteratim Pileratim

TIITLli 1

Os L Or Os L 0r Time

 

V

Figure 2.10 Time diagram for Algorithm 1

2. 2. 3 .2 Anotber inqblementation of All-to-All communication for sbort message

Because the cost of All-to-All communication cost is increasing linearly using the pseudo-code
in 3.1 and cost of (Os+Or) is signiﬁcant, many real applications which need All-to-All

communication pattern can not get satisfactory performance.

The Algorithm 2 can reduce the total number of iterations to O(logm) at the cost of sending
m/ 2 times longer message at each iteration. Without special hardware support for sending a
message in non-consecutive memory locations, there will be additional memory copying cost
for this algorithm. Because of above two additional costs, this algorithm will not be suited for

long messages.

31

/* nun_procs = number of processors */
/* my_proc = id number of processor */
/* (ids numbered 0 to num_procs-l) */

dist = m/Z:
for I :a l to logtn)

sendgproc :2 (my;proc+dist) nod nungprocs

SendHessage :- Combined Message whose destinations are
(my;proc+dist) nod nun_procs,
(uy_proc+dist+l) nod nun_procs,

(my;proc+disc+dist-l) nod nungprocs
cell mp_bsend(sendgproc,SendHessage)
recv;proc := (my;proc~disc) nod nungprocs
Rechessage :2 Combined Message whose destinations are
nygproc,
(my;proc+l) nod nun_procs,

(ny;proc+dist-l) nod nungprocs
cell np_brecv(recvdproc,Recvﬁessege)
dist :- dist / 2

end for

Algorithm 2: All-to-All Communication Pseudo-code for short message

Figure 2.11 illustrates the algorithm for All-to-All communication when m=8 for processor 0.
Initially each processor j has 8 messages A10A1,..Aj7,Ajk ’s destination is processor k. The
correctness of the algorithm can be proved easily. At iteration i, a processor j receives m/ 2

messages, 2"‘of them arrives at their ﬁnal destination.

Iteration l

dist - 4
Processor 0 sends message A04.A05.A05.A07 to processor 4
Processor 0 receives nessege A40.A41.A49.Ae3 fromproccssu 4

Iteration Z

dist - 2
Processor 0 sends message A03.J4o3.1442.A43 to processor 2
Processor 0 receives message A30.A21.A50.A51 tromprocessor 5

Iteration 3

disc - 1
Processor 0 sends message AoivAnvAuvAoi to processor 1
Processor 0 receives message AIOvA ,A50,A7o Iromprocessor 7

After iteration 3, AwAmrAzovAiovAnvAvavo arrived at processor 0.

Figure 2.11 Execution Sequence for processor 0

32

The time diagram Figure 2.12 shows that each iteration takes (Os+Or+L+ (m/2)*g) time. The
total cost is (Os+Or+L+ (m/2)*g)*log(m). Above analysis ignores the additional cost of
sending non-consecutive message. Without hardware support , additional memory copying is
needed which make this algorithm good for only very small sized message. Even if the
hardware support exists, when the message size is long, (m/2)*g factor together with increase
in (Os+Or) will degrade the performance. The algorithm will perform best when the message

size is very short.

”iteration .4 2ml iteration

L
71 V

IlfllllTl... l

Os 1.?ng o: Os ' L(m12)g Or Time:

A
“

 

 

Figure 2.12 Time diagram for Algorithm 2

2. 2. 3.3 Experimental Ram/t on IBM 5 P2
Two All-to-All communication programs were written using MPI. One program uses

MPI_Alltoall() of MP1 standard library, the other uses the algorithm described in Algorithm 2.

Table 2.1 shows the result of All-to-All communication for message length 1. The explicit
memory copy is used to move non-consecutive data to consecutive memory location before
sending the message. The execution time for MPI_Alltoall is almost increasing linearly while
execution time Algorithm 2’s increasing rate is between logarithmically and linear. Because
memory copy is used to send non-consecutive data, the performance for Algorithm 2 is worse
when message size increases to hundreds of bytes. To perform better than MPI_Alltoall a

mechanism to send non-consecutive data very efficiently is needed.

33

 

8 16 32
MPI_Nltoall 391 16‘ 784 .16 1280 .15
Algorithmz 256 as 398 ,4: 645 .15

 

 

 

 

 

 

Table 2.1 Result of All-to-All communication for message length 1

2.2.4 Minimum bardu/are rapport for multicarting?

Currently multicasting is mostly implemented in software for various reasons [34]. In software
approach a multicasting tree is created to perform the task. Different multicasting algorithms
were proposed according to different routing schemes and different purposes. Most of them
create a multicasting tree and the cost is usually log(m) where m is the total number of

destinations.

In David Culler et al.’s paper [31], a method to create an optimal multicasting tree based on
LogP model is introduced. When a multicasting message arrives at its one of destination
processor j, it is taken out from the communication hardware into the local memory. If the
message has to be forwarded to other destinations according to optimal multicasting tree, the
message again has to be put into communication hardware from the local memory. This is
unnecessary cost very similar to packet switching that whole packet has to be received before it
can be forwarded to next processor. Wormhole switching improves performance by allowing
ﬂit level parallelism. Only one ﬂit is buffered and can be forwarded to next processor when
output port is available. Wormhole switching makes communication cost distance insensitive.
It needs special hardware to support the Wormhole switching. Is it possible to avoid the
unnecessary memory copying in multicasting by introducing a special hardware? One possible
solution is to add a header specifying the message as multicasting message. At a switching

element if a arriving message is recognized as a multicasting message and it is one of the

34

destination and is not the final destination, the same message is transferred to both directions:
one to the local memory of the processor, the other to next processor according to the routing

information in the header.

2.2.5 Conclusion

The overhead of communication in IBM SP2 is high. Although IBM SP2 has a very powerful
communication adapter, it seems that its power is not fully exploited. Implementation of
Active Messages on IBM SP2 reduced the overhead signiﬁcantly, but it is still large. Stanford
FLASH Multiprocessor’s MAGIC is designed to completely off-load the communication
overhead from the CPU. For many asynchronous applications, this feature is very attractive
because more CPU time can be used to do computation by overlapping more computation

with communication.

35

Cbapter 3

3. SYNCRHONOUS SIMULATION

3.1 Performance Prediction

3. 7.7 I nmductian

Synchronous simulation is one of the simplest and easiest parallel simulation protocols. It,
however, may suffer from poor performance because only events with the smallest timestamp
can be executed. For every cycle, processors are synchronized, making processors wait until all
other processors finish their event execution. The cost of longest executing processor is
considered as the computation cost of each time step. For other processors that have shorter
execution time will have to wait for the longest executing processor before all of them can
proceed to next time stamp. After each time step, all the processors exchange messages. This
parallel programming model can be categorized as a BSP programming model ['79]. Frequent
synchronization makes synchronous simulation more prone to the situation when the load is

unbalanced in certain time stamps.

Asheden[4] et a! investigated the feasibility of synchronous simulation on VHDL models. They
concluded that high level models with large event execution granularity is suitable for small

scale multi-computers and low level models with small event execution granularity is not

36

suitable for parallel execution. They used very small benchmarks with largest number of
objects only around 150. The parallelism is limited by their own sizes. Large number of
parallelisms exists for much larger circuits. Logic simulation has very ﬁne event execution
granularity. The event queue operation is a signiﬁcant cost for both sequential and parallel
logic simulation. Parallel logic simulation can speedup the simulation by distributing the event

queue operations.

Except from additional communication cost, synchronous simulation has little overhead
compared to a sequential simulation. Although conservative and optimistic simulations can
exploit more parallelism, they have much more overhead than synchronous simulation, which
can make the overall perform worse than synchronous simulation. Soule and Gupta [72]
evaluated the Chandy-Misra algorithm for digital logic simulation. They found that overhead
of Chandy-Misra algorithm overwhelm its advantage and the performance is about three times
slower than traditional parallel event-driven algorithm [7 2]. Konas and Yew compared the
performance of a synchronous simulation to the conservative and optimistic algorithms in
simulating a synchronous multiprocessor system [48]. Their results show that the synchronous
method is considerably faster than both of the asynchronous approaches. The main reason is

both of the asynchronous simulation algorithms introduce signiﬁcant overhead.

A sequential method that accelerates the simulation time is worth mentioning. Cycle based [82]
simulation eliminates events by only simulating on clock-cycle boundaries. This technique is
typically 5 to 10 times faster than discrete event simulation. However, this technique is only
applicable to the designs that do not have asynchronous combinational feedbacks. Discrete

event simulation is still the main stay of hardware veriﬁcation.

37

Agrawai and Chakradhar [1] considered only load unbalancing factor for synchronous
simulation. They gave a statistical model for synchronous simulation. Given a system with
uniform granularity of event execution and with random partitioning, an object is active with a
probability of acting, rate. The number of independent random variables is the same as the
number of objects in the system. At each cycle, different processors have different number of
active objects while the cost of the cycle is determined by the maximum order statistics of
binomial random variables. They proposed a performance model based on the number of
objects together with the activity rate. They compared the theoretical prediction with the
benchmark for several circuits. However, they analyzed only the computation-intensive case
and left the inter—processor communication out as an open research topic. This model has

limited application especially for logical simulation where each object has small granularity.

In this part we propose a model to predict the performance of synchronous simulation.
Parameters of the performance model can be obtained from the circuit characteristics and
parallel computer systems. From this model and parameters, we can decide whether it is
worthwhile to use synchronous simulation before actually implementing it. Our model is
based on multinomial distribution, and uses the average number of active oly'eetper gale. We next
propose a formula that predicts the performance of synchronous logic simulation combing
communication model with load unbalancing factor. From the formula, we derive a formula
that predicts the maximum speedup achievable by synchronous simulation, which is

proportional to the square root of average active number per gale.

We have benchmarked several ISCAS89 circuits and a combinational circuit on SGI Origin

2000. The prediction result is close to benchmark results. Circuits with large average active

38

number of object per simulation cycle have better performance because it has good load
balancing as well as high ratio of computation and communication cost. We have measured

the accuracy of the maximum speedup formula by multiplying the size of an ISCASS9 circuits.

As observed in the formula, improving the load balancing and activity rate are the keys to
improve the synchronous simulation. Synchronous algorithms exist that improve the load
balancing factor. AdvanCE SpaDES [47] executes safe events after the smallest time stammd
events ﬁnished execution and before the synchronization point. Steinmen [75] proposed
Breathing Time Buckets algorithm that executes unsafe events before the synchronization
point. States of the objects that execute unsafe events are saved to be able to recover to the
states before the incorrect executions. We propose an improved synchronous algorithm that
tries to improve the load balancing factor by scheduling the same number of objects every
cycle. Our technique differs from previous techniques by its ability to predetermine the
number of extra event executions. No asynchronous barrier synchronization is required by our
technique. Benchmark results show that our algorithm improves the load balancing factor and
the overall performance. The improved performance comes from better load balancing and

fewer simulation cycles.

This research is organized as follows: In section two we proposed multinomial distribution
model. In section three we proposed prediction model due to load unbalancing factor. In
section four we discussed the communication model. In section ﬁve we derived our ﬁnal
performance formula for synchronous simulation based on our multinomial distribution

model and communication model. We also compared the performance model with empirical

39

result. In section six, we propose an improved synchronous algorithm that is targeted to even

the balance of event execution of each cycle.

3. 7.2 7726 Model

The discrete event system we are interested in is all objects in the system have the same
granularity. This is applicable to the logic simulation where each object is a gate and there is
not signiﬁcant difference between execution time of two gates. The other assumption of the
model is random partitioning. The objects in the system are randomly partitioned to different

processors. This allows us to apply statistical methods to estimate the cost of the simulation.

Figure 3.1 shows the synchronous algorithm our simulation is based on. The algorithm runs
with parallel message passing system. GVT is computed by piggybacking the next smallest
timestamp in the event message sent per cycle. This algorithm has the advantage of removing
the cost of broadcasting and reducing cost, but it needs to send a null message to other

processors when there is no event for some cycles.

Each processor executes the following code:

while (GVT < MAXGVT) {
Execute all the events with GVT;
Send events to other processors with LVT attached;
Receive events from other processors;
Compute the GVT with piggybacked LVT;
}

Figure 3.1 Synchronous algorithm our model based on

This is a very simpliﬁed synchronous simulation model. There are optimizations to this model.

We are interested in this model because it enables us to derive a nice performance prediction

40

formula and we believe other complex models can be reduced to this simple model without
signiﬁcant loss of performance. This model can give us hint on how to improve the

performance of synchronous simulation.

3. 7.2.7 Alu/tinomial Dirtribution

\We proposed a different statistical model for synchronous simulation. We emphasized the
importance of the average number of active object: per simulation step rather than the activigl rate. A
system with a large number of objects with a low acting rate may perform better than a system
with a small number of objects but with a high activig rate because the system with a large
number of objects with a low activigr rate has a greater average number of active oly'ect: per cycle. For
simplicity, we will call the average number of active objects per cycle, the active number. Based
on the active number; we proposed a multinomial distribution model to model the distribution of
load and the effect of the load-balancing factor. An active object has the same probability to be
at any one of the partitions. There are an active number of active objects per cycle and they are
independent random variables. The cost of each cycle is determined by the maximum order
statistics of the counting variables that are the number of active objects distributed to different
processors. The multinomial distribution model reduces the number of parameters from two
in the binomial distribution model to just one variable — the active number. The prediction is
computationally easier for the multinomial distribution model because the number of active
objects is usually much smaller than the total number of objects in the system. We compared
the two models and the results were close. In part three, we discuss the multinomial

distribution model and compare simulation results with the benchmark results for several

ISCAS89 circuits.

41

The active number for one cycle is different than that of another cycle. The benchmark shows
that the active number varies from cycle to cycle signiﬁcantly in circuit simulation. For the
simplicity of prediction formula, we are not considering the distribution of the active number
over the cycles. We use the average active number to predict performance. The accuracy of the
prediction shows that the distribution factor of the active number in circuit simulation does not

have signiﬁcant contribution to the ﬁnal performance.

The active number is an inherent property of a circuit that describes the degree of synchronous
parallelism. It is the most important factor for our performance prediction model. If we
already know the activin rate of the circuit, we can estimate the active number by multiplying the
total number of objects with the activiy rate. If we do not have the activigr rate, we can do

sequential simulation with large enough cycles to estimate the value of the active number.

The general idea that allows us to predict the performance of synchronous simulation is based
on the following simple observation. If the objects are partitioned randomly enough and the
behavior of the discrete event system is random enough, we can assume that the active objects
for each cycle have multinomial distribution over partitions. An active object has the same
probability to be at any of the partitions. There is active number of independent random
variables each cycle. The cost of each cycle is determined by maximum order statistics of the
counting variables that are number of active objects distributed to different processors. If the
number of cycles is large enough, the average cost of each cycle will be close to its expected

value.

These are the terms we will use in this paper.

42

Notations and terms

o,,oZ,...,o,, : a list of n objects in the discrete event system

t,,t2,...,t ' a series of discrete time and numgrcle is the number of cycles.

numryclc '

Aﬂj : Number of objects that is active at time t,

A : The average ofA[tj, i= 7,2,. . .,max

5 = A[t,] + A[t2] + + A[tm] : Total number of activity

p : Number of partitions

cycle : a cycle is the duration of simulation for each discrete time t,

We assume the following three conditions.

Condition 1:

AM] > o, for all i

A [tj=0 means there is no object active at time t,. It is not necessary to have discrete time t,.

5 >>A[t.~/

This means that simulation has a wide range of discrete time and does not ﬁnish in a few

cycles.

Condition 2: Randomness

43

At each cycle, an object has equal probability of being active with all other objects.

Condition3: Uniform granularity

Each object takes a unit time to execute an event.

3. 7.3 Prediction Model due to Load Balancing factor

In this section we will derive the computation cost under the multinomial distribution model
and compare it with the benchmark result. We also propose a multiplication factor, which

describes the degree of load unbalancing.

3. 7.3.7 Total computation cost

The cost of each cycle is the maximum number of active objects among all partitions. We
denote the cost of each cycle by C[tj = maxtAkﬂj), where A‘ﬂj is number of objects active at
partition b. and A,[ti] + + Apﬂl] = A[t,./. p is the number of processors the simulation is
running on. The cost of the whole simulation is C[tL] + C[tz] + + C[t,m]. Since we have
assumed that each active object has equal probability of being present at any of the partitions,
A[tj objects that are active at time t,, are randomly distributed to p partitions. Therefore the
cost of each cycle can be estimated by the expected value of the maximum order statistics of
counting variables that are the number of active objects distributed to p partitions. The cost of
each cycle is a function of A[tj and p which denote as E(A[tj ,p). The closed formula for
function EMﬂj , )may not be easy to obtain. Instead of trying to ﬁnd the closed formula for

the function, we can use simulation to get the result of the function with good accuracy. Total

44

simulation costs due to execution of events should be Numgcler*E(A,p), where A is the

average of A [ti].

3. 1.3.2 A’lultzplication Factor

We want to measure the degree of load unbalancing with different average number of active
objects per cycle. We now deﬁne the multiplication factor as follows. Suppose that the average
activity number of objects per cycle of a circuit is A and the circuit is partitioned to P parts.
Suppose E(A,p) is expected value of maximum order statistics of counting variables that are

number of objects distributed to P partitions. We deﬁne multiplication factor M as:

l

M(A,P)=
E(A,P)*P/A

 

M is a function of A and P. Table 3.1 shows the value of the multiplication factor over
different values of A and P. The range of multiplication factor is from 0 to 1. The
multiplication factor shows the efﬁciency of parallel simulation. Larger multiplication factor
means better load balancing. The multiplication factor increases as A increases and P
decreases. Therefore, the synchronous parallel simulation will be more efﬁcient with a larger A

and smaller P.

 

Multiplication factor
A / P 2 4 8 16 32
1 0.50 0.25 0.13 0.06 0.03
10 0.80 0.60 0.41 0.27 0.17
100 0.93 0.83 0.70 0.57 0.43
1000 0.98 0.94 0.87 0.81 0.72
10000 0.99 0.98 0.96 0.93 0.89

Table 3.1 Multiplication factor for different A and P

 

 

 

 

 

 

 

 

 

 

 

 

 

 

45

The speedup can be defined using multiplication factor as:

.S‘(A,P) == P*M(A,P)

Table 3.2 shows the speedup for the A and P of above table.

 

 

 

 

 

 

 

 

 

 

 

 

 

Speedup
A/P 2 4 8 16 32
1 1.00 1.00 1.00 1.00 1.00
10 1.61 2.38 3.31 4.29 5.29
100 1.85 3.31 5.63 9.07 13.73
1000 1.95 3.76 6.94 13.02 23.06
10000 1.98 3.92 7.69 14.95 28.57

 

 

Table 3.2 Speedup table for diﬂ'erent A and P

Table 3.1 and 3.2 show that for large A there is little loss of efﬁciency, but for small A the loss
of efﬁciency is signiﬁcant. For example when A is 100 and P is 32, the load unbalancing will

cost 57% of the ideal speedup.

3. 7.3.3 Contoariron: wit/2 Bencbmané
Theoretical vs. Empirical

We ran several ISCAS89 circuits and one combinational circuit sequentially up to some large
enough time limits. We are able to get the total number of executions and number of cycles.
The activity number is derived from the quotient of number of executions divided by number

of cycles. Table 3.3 shows the results for 535932, 515850, 59234 and multi32.

46

 

Number of Executions, Cycles and Activity Number
Circuit 835932 15850 S9234 Multi32
Number of Executionsl32949202574571439652303858
Cycles 3499 4415 3258 1998
Avg. Activity Number 942 58 44 1 153

 

 

 

 

 

 

 

 

 

 

 

Table 3.3 Number of Executions, cycles and Average Activity Number

When these circuits are partitioned to a number of processors, the cost of each cycle is the
maximum number of executions among these processors. The sum of these maximum
numbers of executions for all cycles is the total amount of the computation. We can estimate

the total amount using our multinomial distribution model with the average activity number.

To see how accurate the estimation is, we randomly partitioned the objects into a different
number of parts where each part had the same number of objects. We then recorded the
number of executions for each of these parts and recorded the sum of then maximum number
of executions of all parts for all cycles. This experiment was done sequentially and there was
no communication cost involved. Table 3.4 shows comparisons of the benchmark
computation cost and the estimation with the different number of partitions. The estimation is
quite close to the benchmark. Most of the estimations have errors less than 2%. This shows
that many circuits have random behaviors that ﬁt well to our multinomial distribution model.
We compared the result of the multinomial distribution with binomial distribution. They both
have good estimations of the benchmark. The number of independent trials for multinomial
distribution model is much less than that of binomial distribution model because the number
of independent variables of the multinomial distribution model is much less than that of the
binomial distribution model. A more important advantage of the multinomial distribution
model is it only depends on one parameter — the average number of active object per grcle. The

binomial distribution model needs the activity rate and the total number of objects.
47

 

 

 

 

 

 

 

 

 

 

 

7 535932
‘P .13 l.
mpirical Multinomi inomi ultinomial Erro inomial Error
2 1701621 1690454 1687933 0.66% 0.80%
4] 877571 879266 877099 0.19% 0.05%
8 463184 466671 465514 0.75% 0.50%
1 6 254433 254690 253294 0.10% 0.45%

 

 

 

1
777
4851

14211

5010

325

$1 5850
ultinomial

1 423

4

2.440/
5.640/
3.280/
1 1.900/

2.630/

5.590

2.850/
11.820/

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

59234
lPEEmpiricallMultinorniallBinomiaWultinomial ErrorBinomial Error
2 81 135 80534 80564 0.74% 0.70%
4 45804 47469 47519 3.64% 3.74%
8 29635 29596 29515 0.13% 0.40%

1 19591 19670 19598 0.40% 0.04%
multi32
IPEEnﬁiiricallMultinomiallBinonnmultinomial Error inomial Error!
2 1 172691 1 178893 1176680 0.53% 0.34%
4 605081 61 1254 608103 1.02% 0.50%
8 317674 322804 319431 1.61% 0.55%
16 168821 174766 171944 3.52% 1.85%

 

 

 

Table 3.4 Comparison of theoretical and empirical results

3. 7.3.4 Speedup and Mult¢lication factor

From results of the above experiment we obtained the speedup for those circuits. Therefore,
the speedup was calculated from pure computation with no consideration of communication

COSt.

48

Figure 3.2 shows the speedup graph for the circuits we used above. The multi32 has the best
performance while 89234 has the worst. From Table 3.3 we can see that multi32 has the
highest average activity number and 59234 has the lowest average activity number. The

speedup is directly related to the average activity number of a circuit.

From the multinomial distribution model, we can see that the speedup is a non-decreasing
function of the number of processors. We ignored the communication cost for all previous
analysis. If the communication cost is considered, the speedup will not be a non-decreasing
function. If the number of partitions increases over a certain threshold, the performance will

be suffered because the communication cost will become higher than computation cost.

 

Speedup wlhtout communlcatlon

   

0 4 8 12 16
Number of Partltlons

 

 

 

Figure 3.2 Speedup without communication

49

3. 7 .4 Performance Model

3. 7.4.7 Communication Model

In this section, we describe a prediction model based on the communication cost and the
synchronization cost. The communication overhead of modern message passing parallel
computers includes the operating system overhead of dividing the messages into packets,
adding header information to the packets and then putting the packets into hardware. For
modern parallel computers with advanced routing technique and high bandwidths,
communication overhead takes up most of the total communication cost. Because of this we
are only interested in the communication overhead and will incorporate it into our prediction

model.

There are several parallel models proposed. The LogP model is good when the message is
short [31]. The LogGP model is proposed to incorporate long messages [2]. In the LogGP
model the message with size 13 is sent into network at time Ow+m(k— 7 ), where 0",, is overhead
and m is time per byte for a message. In this paper, we considered O..,,+m(,€-1) as the
communication overhead and, 0",, as the overhead for sending the null message. We ran the
experimental program on SGI Origin 2000. Figure 3.3 shows the change of communication

overhead of sending and receiving as size of message increases.

Figure 3.4 shows the least square ﬁt of communication overhead to a linear formula. We

assumed that the communication overhead O follows following linear formula:

0 = OM + m’kMrgSize

50

Where Onun is the overhead of sending a null message and m is the increasing factor as size of
the message increases. We used least square method to estimate at and 0",, The value of m

and 0",, are 12.9*10'3 microsecond and 37.3 microsecond separately on SGI origin 2000.

 

 

 

 

Cummleatlon War)
180
160
140
'5 13
g 8.
0
E 60
40
20
0
0 5000 10000
man
Figure 3.3 Message overhead

 

Marauder-am

 

 

 

 

180

160

140
'g 120 —-Experirra'rd
g 100
o 80 +LineaSqua
5 so Esirrdim
:

4o

20

o

0 maoooeoooaooowooo
mm

 

 

 

Figure 3.4 Least Square ﬁt to a linear formula

51

3.7.4.2 Predicted Speedup

Total execution time is the sum of computation time and communication time. Our
multinomial distribution model can estimate the computation cost. From the activity number
and the number of processors, we can compute the multiplication factor. The total
computation cost is Total Computation = G*N/ ﬂ’*M), where P is the number of processors; M
is multiplication factor; G is the granularity of computation that includes the cost of event
queue operation and event execution. N is the total number of event executions. The
multiplication factor measures the degree of load unbalancing and can be computed with other

random distribution models.

The communication time depends on number of events generated. Suppose that an average of
E events are generated per cycle and C is the total number of cycles. The total number of
cycles is total number of event executions divided by the active number. When the system is
partitioned into P parts, the number of events generated also follows multinomial distribution.
The cost of communication per cycle is decided by the maximum order statistics of these
multinomial random variables. Therefore, the number of events generated per cycle per
processor is (E/P)*7/M. These events will be sent to P processors. Therefore the event
sending and receiving cost of a cycle is P*((E/P)/ﬂ’*M)*m+OM). OM and m are the
communication parameters derived from part four of this paper. The total communication

cost is:

C *1”? (E/ P)/ ﬂ’*M)*m+ 0../1)=E..a/ (P*M)*”’+C *P *0“,

52

Where E is the total number of events generated. Not all event executions generate new

total
events. Depending on the circuit, we have EW=N*,€, wbere k is the probability that an event

execution generates a new event. The number of cycles C can be represented as N/A. The

reﬁned total communication cost is N / ﬂ’*M )’*(m/ k)+N / A *P *0,“

The speedup is the ratio of sequential simulation time and parallel simulation time. The

following gives the predicted speedup of synchronous simulation under our model.

Tm] _ N*G
+T N*G/(P*M)+N/(P*M)*(mlk)+N/A*P*0,m,,

COMP COMM

 

Speedup =

The above formula can be simpliﬁed to:

1
1+(m/k)/G + on,” *P
P*M G*A

 

 

G includes the event queue handling and event execution that can be considered to be much
larger than m and k is within a practical range. The value of m, the unit cost of message per
byte, is around 12.9"‘10'3 microsecond for SGI Origin 2000 and value of k, the event
generation rate, is dependent on the circuit and the values are around 0.1 for the circuit we
have tested. The value of G is dependent on the circuit too and it is in the order of
microseconds. Therefore, (m/k)/G is close to 0. This can be explained intuitively. The queue
overhead plus the object execution cost for generating one event is much larger than the cost
of sending a longer message with additional size of an event. We removed (m/k)/G from the

formula to get the simpler formula:

53

 

 

l +0null *
P*M G*A

M *P . Suppose load is

 

When P: j_1_ a G 1.3/Z, the maximum speedup achievable is
M Onto”

 

6*
JA
0

perfectly balanced, the maximum speedup achievable is _'"‘_"___ when the number of

processors is G *JX . The maximum speedup achievable is proportional to the square

null

 

root of the active number, which describes the degree of synchronous parallelism of a circuit.

3. 7.4.3 Bencbmark Result:

Table 3.5 shows the comparisons of predicted and experimental maximum speedup. The
circuits we tested include ﬁve large ISCAS89 sequential circuits and a 32-bit multiplier. Input
vectors come at every clock with random values. We benchmarked these circuits on the PSC

TCS. The error of the prediction is less than 10%.

’ um
Circuit
35932 17 5. 5.2 2.95°/
38584 2071 3. 3 4.91
17 2.7 2 2

15850 10 1. 1.9 8.3
13207 865 1.4 1.31 9.66
ulti32 695 7 6.4 6 6.380/

 

Table 3.5 The comparisons of predicted and experimental maximum speedup

54

We measured the performance of synchronous simulation on different parallel computers.
Benchmark results show that the PSC TCS is considerably faster than Origin 2000. The event
execution rate we measured for the PSC TCS is about 5 times faster than Origin 2000. That is,
the computation granularity (G) of the PSC TCS is about 5 times smaller than that of Origin
2000. However, the communication overhead of the PSC TCS is only 1.5 times faster. Thus,

according to the performance prediction formula, the Origin 2000 will achieve about 1.8

(JG/0m,” ) times better maximum speedup than the PSC TCS. Figure 3.5 compares the

speedup of Multi32 and 535932 on the PSC TCS and Origin 2000. The results match our

prediction model.

The largest circuit in ISCAS89 has around 20,000 objects. We multiplied 835932 and 538584
four, nine and sixteen times to test the effect of circuit size on the performance. Because
multiplicadon is implemented by juxtaposing the circuits, the new average active number: of the
multiplied circuits are about four, nine and sixteen times of the original ISCAS circuits. We
benchmarked these circuits on the PSC TCS. The maximum speedup achieved for these
circuits was about two, three and four times more than the original circuits. Figure 3.6 shows
the benchmark results of this experiment. This conﬁrms our formula for predicting the

maximum speedup achievable, which is proportional to the square root of active number.

55

 

TCS vs Q'lgln 2000 (M ul|l32)

 

 

—o—PSCTCS- Btper’nnntal +PSCTIS- Hedbted
+0r'gh2000- Btperhnntal +Orid‘12000- Hedicted

 

 

 

 

TCS vs Glgln 201!) (335932)

 

0 4 8 12 16
OPE

 

-o—PSCTCS- Btperhental +PSCTCS- Pl'edlcted
+Origh2000- Btper'rrental +Origh20m- Hedbted

 

 

Figure 3.5 PSC TCS vs. SGI Origin 2000

 

 

Effect of Active Minibar (838584)

 

1—0—338584
+338584x4

1+338584xs

1—I—338584x16

 

 

 

 

 

 

 

Effect of MN. Numbor ($35932)

 

1—0—335932
+335932x4
+53593239
+335932x16

 

 

 

 

 

 

 

Figure 3.6 The eﬂ'ect of circuit size

Figure 3.7 shows the effect of granularity. We put some loops to the object execution to study
the effect of granularity over speedup on 535932 and 515850. Each unit of extra granularity is
around 20us. There is improvement of speedup when extra G goes up from 0 to 1. But when

extra G is over 1, there is no signiﬁcant improvement on speedup. This can be explained by

 

 

our prediction formula: when G*A >> OM”‘P, [”1 dominates 05‘” *—E. For 515850, it

o
E

 

1 cannot overwhelm
P * M G

 

*-
>j'u

has much smaller A than $35932, although G is large

This caused the better speedup for larger G.

 

 

 

 

 

 

 

 

The Effect of Grenulerfty (835932)
12.00
10.00
8.00
g. -0—Comp Grain Size = O
i 6'00 -I-Comp Grain Size = 1
m 4.00 —e—Comp Grain Size: 2
+Comp Grain Size = 4
2.00
0.00
2 6 10 14
Number of Proceeeere
The Effect of Grenulerfty (815850)
7.00

6.00
5.00

 

-o—COmp Grain Slze = 0
+Comp Grain Size: 4
-e-Cornp Grain Size = 8

4.00
3.00

 

Speedup

 

2.00
1.00

 

0.00
2 4 6 6 10 12 14 16
NumberofPreceeeore

 

 

 

Figure 3.7 The eﬂ'ect of granularity.

Higher granularity can improve the performance because it can hide the increasing

communication cost as the number of processors increases. This can be seen best in the

speedup graph for 515850 with varying granularity.
58

When the granularity is large enough that communication takes a very small part of the total
execution time, increasing the granularity will not improve the performance. Inherent load
unbalancing as demonstrated by our multinomial distribution model restricts speedup. The
speedup of 535932 did not increase with the granularity starting from two. The predicted

communication cost of 535932 at granularity two takes less than 12% of the total cost.

The cost of computation decreases as number of processors increases, while the cost of
communication increases as the number of processors increases. If the activity number is small
or granularity is small, then the computation takes a small portion of the total cost and

performance will suffer.

3.7.4.4. Improving tbe performance of yncbmnour simulation

Reducing the communication overhead

Reducing the communication overhead is able to reduce the communication cost and speedup
the overall performance. Many methods are proposed to reduce the software overhead. Active
Messages [36] reduces communication cost by allowing the user level control over message
passing. Stanford FLASH Multiprocessor’s MAGIC is designed to completely off-load the
communication overhead from the CPU [50]. Virtual memory-mapped communication
supports direct data transfer between the sender's and receiver's virtual address spaces without

operating system’s support [35].

59

Load balancing

Another way of enhance the performance of the synchronous simulation is to improve the
load-balancing factor. Techniques [47,75] exist that balances the number of event executions.
In the next part of the paper we propose a different improved synchronous algorithm that is

also targeted to improve the load balancing factor. As revealed in the formula, the maximum

speedup improvement from multi-norninal distribution to perfect load balancing is 47M -1,

where M is the multiplication factor that describes the degree of load unbalancing. Therefore,

-1
the improvement from these techniques is limited to the factor of s/M . For a circuit with
average active number of 1000 on 32 processors, the improvement in speed up will be in the
factor of 1.2. If the average active number is 100, the speedup improvement factor on 32

processors is around 1.5.

Partitioning

Some circuit partitioning algorithms try to reduce the cost of message passing by making as
many events internal to its own partition as possible while still keeping the load well balanced.
However, our performance formula reveals that such partitioning techniques have little impact
on the performance of synchronous simulation. The cost of scheduling and executing an event
dominates the overhead of sending a message that is slightly larger in size. When we derived
the performance formula, we cancelled out the cost of sending extra messages. Soule’s
experimental results showed that there is no much difference in parallel performance with

several partitioning techniques [70].

60

Optimizing Communication

Our prediction model assumes that after each cycle, all processors exchange events with all-to-
all communication pattern. The value of next time step can be calculated by piggybacking the
local time to the event exchanges and there is no need for explicit barrier synchronization.
Therefore, when there are P processors, the cost of this communication is Oﬂ’). This
communication pattern becomes inefﬁcient when P becomes large and a processor only sends
events to a subset of processors. In this situation, the communication should be optimized that
a processor only sends messages to another processor when there are events to deliver.
However, additional O(logP) communication overhead is needed to calculate the next time

stamp.

3. 7.5 Improved Syncbronour Algoritbm

The original synchronous algorithm only executes the smallest time stamped events. The
performance degrades because of load-unbalancing factor. The load unbalancing can be
modeled by multinomial distribution model proposed in this paper. The cost of each cycle is

decided by the maximum number of active objects assigned to one of the processors.

The Improved Synchronous Algorithm tries to execute the same maximum number of active
objects per cycle by pre-executing :aﬁ’ events on the processors that are not assigned the
maximum number of active objects. For example, at a certain cycle processor 0 has 50 active
objects and processor 1 has 60 active objects. After executing 50 events, processor 0 tries to
execute 10 more rafe events. The safeness of an event execution can be decided by the look-

ahead technique used in conservative simulation.

61

Our technique is differs from previous techniques by its ability to predeterrnine the number of

extra event executions. No asynchronous barrier synchronization is required by our technique.

Each processor executes the following code:

while (GVT < MAXGVT) {

Execute all events with time stamp GVT

Execute safe events so that the number of
event executions this cycle is
max_num_schedule

Send and receive events.

Do barrier synchronization and calculate new
GVT and max_num_schedule.

}

Figure 3.8 Improved Synchronous Algorithm

Figure 3.8 shows the algorithm of the improved synchronous algorithm. The value of
max_num_rcbedule is the maximum number of active objects scheduled to be executed in a cycle
among all processors. The rest of processors will execute up to max_num_.tcbedule of event
executions by executing safe events. The new value of GVT and max_num_:cbedub can be
computed by piggybacking relevant information in the event exchange step. The outcome of
the improved algorithm generates better load balancing than the multinomial distribution
model. We also observed that the number of simulation cycle is reduced as the side effect of
load balancing effort. The performance gain comes from both of these factors. The following

ﬁgure compares the performance of the improved synchronous and synchronous algorithm.

62

 

Improved synchroune ve slnchronoue ($38584)

 

#PE

 

l+ Synchronous + hproved Synchronous]

 

 

 

Improved synchronous ve Syncrhonous (assume)

14
12
10

Speedup

010150503

 

0 4 812162024283236404448525660
IPE

 

—e— Syncrhonous -e— h'proved Synchronous

 

 

 

Improved Synchronous ve Synohroue ($13207)

 

0 5 10 15 20
#PE

 

 

—e— Synchronous -e- hproved Synchronous j

 

 

 

Figure 3.9 Benchmark results of improved synchronous algorithm

3. 7.6 Conclusion

We proposed a performance model for synchronous parallel discrete event simulation. From
the speed—up formula derived from our performance model, we can predict that synchronous
simulation performs better with a higher average active number of objects per cycle. It can both
reduce the load unbalancing factor and communication cost. We found that the maximum
speedup achievable by the synchronous algorithm is proportional to the square root of average
active number. The error of our prediction result compared with the benchmark result is small.
This result can give some hint on whether it is worthwhile to use synchronous simulation

before actually implementing it.

We proposed an improved synchronous algorithm that tried to improve the load balancing
factor. Benchmark results show that the improved synchronous algorithm generates better
load balancing than original synchronous algorithm. Although we observed performance

improvement from the improved algorithm, our performance prediction formula reveals that

the gain from improving load balancing is limited to w/M—Il , where M is the multiplication
factor that describes the degree of load unbalancing. For a reasonable range of active number and
number of processors, M is typically greater than 0.5. Therefore the maximum improvement in
speedup is typically less than 1.4. To further improve the performance of synchronous
simulation, a new algorithm must be able to improve the communication cost by reducing the

simulation cycle signiﬁcantly.

64

3.2 Cycle reducing simulation

3 .2. 7 Introduction

Logic simulation is an important method for assuring the correctness of a VLSI system design.
For large designs, logic simulation is very time consuming. Therefore, parallel discrete event
simulation techniques are used to accelerate the simulation. There are three major categories of
parallel simulation: synchronous, conservative and optimistic. A survey on the topic of parallel
discrete event simulation is described by Fujimoto [42]. A survey on the topic of parallel logic

simulation is described by Chamberlain [16].

Synchronous simulation is the simplest of all parallel simulation techniques. At every cycle only
the smallest time stamped events are executed. The main advantage of synchronous simulation
is its low overhead compared with other asynchronous simulation techniques. Soule and
Gupta [72] evaluated the conservative asynchronous simulation for digital logic simulation.
They found that the overheads of conservative asynchronous simulation overwhelm its
advantage and the performance is about three times slower than synchronous parallel event-
driven algorithm [72]. Konas and Yew compared the performance of a synchronous simulation
to the conservative and optimistic algorithms in simulating a synchronous multiprocessor
system [47]. Their results show that the synchronous method is considerably faster than both
of the asynchronous approaches. The main reason is both of the asynchronous simulation

algorithms introduce signiﬁcant overhead.

65

The main problem of synchronous simulation is poor load balancing and frequent
synchronization. There is a synchronization step between each cycle that exchanges messages
between processors and decides the next smallest time stamp. Inside each cycle, the number of
event executions varies among different processors. The processor that ﬁnishes event
execution early in a cycle has to wait for the latest processor to complete the execution. There
exists statistical model [85] that describes this load unbalancing factor. The performance of
synchronous simulation does not scale well because the cost of synchronization rises as the

number of processors increases.

There are some variations of synchronous algorithm that try to overcome the load imbalancing
problem of synchronous simulation. AdvanCE SpaDES [47] executes safe events after the
smallest time stamped events ﬁnished execution and before the synchronization point. They
observed some performance improvements on logic simulation when the size of circuit is
small. Steinmen [75] proposed Breathing Time Buckets algorithm that executes unsafe events
before the synchronization point. States of the objects that execute unsafe events are saved to
be able to recover to the states before the incorrect executions. These techniques do not solve

another problem of synchronous algorithm: frequent synchronization.

Xu and Chung [85] proposed a prediction formula for the synchronous simulation. The
formula considers factors including load balancing and computation communication ratio.
Logic simulation has very ﬁne computational granularity. Because of this, even if we are able to
make the load perfectly balanced at each cycle, the improvement on the performance is very
limited. In this section, we propose an improved performance formula for synchronous

simulation that considers the number of cycles. Then we describe an optimistic synchronous

66

algorithm that is targeted to reduce the number of simulation cycles instead of improving the
load balancing factor. We applied the algorithm to ISCAS89 sequential and ISCASSS
combinational circuits. We observed signiﬁcant improvement on the performance compared

with synchronous simulation.

3.2.2 Performance Evaluation Model

We extend Xu and Chung’s performance prediction model of synchronous simulation by

including the number of simulation cycles.

Terms:

C: Total number of simulation cycles with traditional synchronous algorithm.

A: Average number of event executions per cycle.

G: Computational granularity of an event execution. For cycle reducing algorithm, this number

also includes the state saving overhead.

0,: Communication overhead for sending and receiving one message.

P: Number of processors.

M: Multiplication factor that describes the degree of load balancing. For cycle reducing
algorithm this number includes the extra cost of rollback. 0 < M _<. 1. P * M describes the

speedup without communication cost. When M = 1, the load is perfectly balanced and there is

no rollback.

C ’: The number of simulation cycles of cycle reducing algorithm.. C S C.
67

Analysis:

The cost of sequential simulation is: C * A * G

For traditional synchronous simulation, the costs of computation and communication are:

Computation = C*A *G */ (P*M)

Communication = C ”*P * O,

The cost of synchronous simulation is:

C*A*G*/ﬂ°*M)+C’*P*O,

+_* f *—

M*PCGA

 

 

_ 1 c' 0 P “
The speedup formula 18

 

l j C G
The maximum speedup is achieved when P = * —, * — *«lA .
«l M C 0,

The maximum speedup is 0.5 * JM * ll; * jog *w/A.

Consider the case when C = C (traditional synchronous algorithm) and M = 1 (perfect load

G
* «l; . We can see that the upper

 

balance). The maximum speedup achievable is 0.5 *

null

limit speed up is determined by the average number of active objects per cycle for the

68

synchronous simulation. When this number is small, it is hard to achieve good speedup even if

the load is perfectly balanced.

There are techniques that try to improve the multiplication factor. The upper limit of

improvement from balancing the load is limited to the factor of WI}. The multinomial
distribution model predicts the value of this load balancing factor. For average active number
of 1000 and 32 processors, the value of M is around 0.72. For average active number of 100
on 16 processors, the value of M is around 0.57. The improvement factor from balancing the

load to the optimal for these cases are 1.2 and 1.3 separately.

The cycle reducing algorithm is able to reduce the simulation cycles. The performance of this
algorithm depends on how much we can improve simulation cycles without causing signiﬁcant
degrades of the multiplication factor. We can determine the value of the multiplication factor
and reduction in communication cycles by running a sequential simulator. For a given circuit, if
the number of processors and the number of events to execute in a cycle are given, the
sequential simulator can determine the number of rollbacks, the multiplication factor and the
reduced number of cycles. Together with other easily obtainable factors, the performance of
the cycle reducing algorithm can be predicted on a parallel computing platform. With the
prediction formula, it is possible to determine the optimal number of events to execute in a

cycle that generates the best speedup.

3 .2. 3 Optimistic Syncbmnous Algoritbm
To reduce the number of cycles, the number of event executions per cycle should be larger

than the average number of active objects per cycle. Besides executing the smallest time
69

stamped events each cycle, the algorithm optimistically executes a certain number of unsafe
events. Therefore, we call the algorithm: Optimistic Synchronous Algorithm. If the number of
events scheduled to execute is large, the number of cycles will be small. On the other hand,
there is a risk of higher rate of incorrect event executions. Therefore, the number of event
executions per cycle should be decided adaptively during the simulation. It is important that all
the processors schedules the same number of event executions per cycle to avoid the bad load
balancing. State saving is necessary for unsafe event executions. Garbage collection is done
each cycle. Because incorrect event execution may occur, rollback mechanism should be built

into the algorithm. The following is the description of the algorithm.

Each processor executes the following code:

GVT=0;
while (GVT < MAXGVT) I
A = Number of events with time stamp GVT;
Execute all events with time stamp GVT;
If (K > A), Execute K — A unsafe events;
Do garbage collection of the events and states
with smaller time stamp than GVT;
Send events to other processors with LVT
attached;
Receive events from other processors;
Compute the GVT with piggybacked LVT;
Adaptively adjust the value of K.
}

3 .2.4 Evgoerimental Results

We implemented the algorithm and tested it on Terascale Computing System (T CS) at the
Pittsburgh Supercomputing Center. We chose four large ISCAS89 [11] sequential circuits and
one large ISCAS85 [10] combinational circuit as our target system. Entities are randomly

partitioned. Input vectors come at every clock with random values. Clock is 10,000 times larger

70

than the unit delay of a gate. We simulated the all circuits with 800 input vectors. Table 3.6

shows the characteristics of ﬁve circuits.

We observed signiﬁcant improvement of our optimistic synchronous algorithm over the
synchronous algorithm. The number of cycles is reduced up to only 7% of the original number
of cycles. Figure 3.10 shows the speedup graph of ﬁve ISCAS circuits of both synchronous
and optimistic synchronous algorithm. For small number of processors, synchronous
simulation performs better than optimistic synchronous simulation. This is caused by overhead
for state saving cost of optimistic synchronous simulation. As the number of processors
increases, optimistic synchronous simulation performs much better than synchronous
simulation. This is caused by the signiﬁcant reduction in the number of simulation cycles. The
percentage of incorrect event executions is usually smaller than 15% when the number of

events scheduled to execute is not very large.

Synchronous simulation performs much worse when the circuits has non-unit delay. This is
caused by the reduction of the average number of active objects per cycle and increment in the
number of cycles. We observed that the optimistic synchronous algorithm is not sensitive to
the delay model. Figure 3.11 shows the speedup comparison of the optimistic synchronous
algorithm with unit delay and non-unit delay. The performance of non-unit delay is only

slightly worse than unit delay.

71

 

Table 3.6 The characteristics of circuits

 

 

Speedup ($38584)

 

12345678910111213141516
OPE

Fo— Synchronous + Optim'stic SynchronousJ

 

 

 

72

 

 

 

Speedup ($35932)

 

12345678910111213141516
OPE

 

 

 

Speedup ($15850)

 

0
12345678910111213141516
OPE

 

 

 

 

Speedup ($13207)

 

 

 

 

 

 

 

 

 

 

 

 

o
12345678910111213141516
res
Speedup (c7552)

 

0
12345678910111213141516

SPE

 

 

 

Figure 3.10 Synchronous and optimistic synchronous simulations

74

 

Speedup (C7552)

 

0
12345678910111213141516

 

 

—.— Unit Delay —.— 'Non um Delayrl

 

 

 

Figure 3.11 The speedup comparison of unit and non-unit delay.

3.2.5 Cont/«don

In this section, we proposed a performance evaluation formula for synchronous simulation.
The formula considers factors including load balancing, computation communication ratio and
simulation cycles. The formula reveals that reducing the number of simulation cycle is one of
the most important keys to enhance the performance. We proposed an optimistic synchronous
algorithm that is targeted to reduce the number of cycles. The experimental results Show that
the algorithm is able to reduce the number of cycles signiﬁcantly. The performance of the

optimistic synchronous algorithm continues to improve when synchronous algorithm is not

75

able to do so. For non-unit delay models, the optimistic synchronous algorithm performs only

slightly worse than the corresponding unit delay models.

76

Cbapter 4

4. OPTIMISTIC SIMULATION

4.1 Reducing the overhead of Optimistic simulation
4.1.7 Introduction

Time Warp algorithm allows optimistic asynchronous simulation of discrete event systems. It
can explore the parallelism in a system by allowing out of order event execution. Some of these
event executions may not be correct and need to be cancelled. When an event with time stamp
t arrives in a logical process (LP) and the LP already executed an event with a time stamp
greater than or equal to t, the current LP needs to be recovered to the state before time stamp t
to be able to correctly execute the new event. This requires frequent state saving. A great
amount of memory and CPU cycles will be consumed doing state saving. GVT (Global Virtual
Time) is used to do garbage collection of the states that are no longer necessary to stay in
memory. GVT is deﬁned as the minimum time stamp of all events in transit or in the input
queues of all LP. No new event generated can have a smaller time stamp than GVT. Therefore

all states except one that has smaller time stamp than GVT can be discarded safely.

State saving is one of the major overheads of Time Warp algorithm. State saving requires lots
of computation overheads. It also has more memory requirement, which may exceed the

physical memory size. Reducing the cost of state saving is essential to enhance the
77

performance of Time Warp. Hardware [43] and software approaches exist to reduce the cost.
There are three major software techniques that try to reduce this overhead. They are
incremental state saving [8,78], checkpoint or infrequent state saving [9,41,54,65] and reverse

computation [15].

In this paper, we present an algorithm, which estimates the lower bound of LP. This lower
bound together with GVT will be used to reduce the number of state savings and maximum
memory size needed for Time Warp. The algorithm embeds the lower bound estimation in
Time Warp event execution, so it has minimal overhead. The technique increases the
performance of Time Warp in most cases. The worst-case performance of the algon'thm is

nearly the same as normal state saving algorithm.

We incorporate the lookahead technique in conservative algorithm to reduce the number of
state savings. In conservative simulation, each generated event is accompanied with the LVT
(Local Virtual Time) of the sending LP. The receiver LP knows that it will not receive any
event in the future with smaller time stamp than the LVT from the sender. Each LP can safely
execute the events with time stamp smaller than the LVT value of all input channels. Only safe
events are executed in conservative simulation. Deadlock may occur when there is a cycle of
block LPs each waiting for its predecessor to update its clock. Null-message is used to avoid
deadlock but it can add signiﬁcant overhead on the performance. Chandy and Misra proposed
a two-phase method to detect and resolve the deadlock. Deadlock can be broken by the
observation that events with the smallest time stamp are always safe to process. jha and
Bagrodia [46] did research that tried to unify the conservative and optimistic algorithms. A

process can choose autonomously to be optimistic or conservative. A process is chosen to be

78

conservative to limit the parallelism and avoid generating a large number of incorrect
messages. The decision to be optimistic or conservative is hard to make. If a bad decision is
made, the performance may be worse than both conservative and optimistic simulation.
Varghese, Chamberlain and Weihl [80] used local guaranteed time to reclaim the memory with
time stamp greater than the GVT. It did not address the possibility of skipping the state

savings.

In our approach, whenever an event is generated, it sends along the lower-bound time stamp the
sender may send to the receiver in the future. The lower-bound time stamp of an LP is deﬁned as
the minimum of lower-bound time stamp of all input channels. The lower-bound time stamp of an
event updates the value of the lower-bound time stamp of an LP and vice versa. An event
execution is safe if the current time stamp is smaller than the lower-bound of the LP. In a safe
execution mode, state saving is not necessary and all saved states can be discarded. In an
unsafe execution mode, we need to save states, but some saved states with time stamp smaller
than the lower-bound time stamp of the LP can be discarded. During the simulation, an LP may
execute in safe mode or unsafe mode. If all event executions are in safe mode, the technique
will perform similar to Chandy-Misra algorithm without blocking. If all event executions are in
unsafe mode, the performance will be similar to the Time Warp. In other cases, it may perform
better than Time Warp by having fewer state savings. It may also perform better than Chandy-

Misra algorithm by exploiting more parallelism.

Our technique requires the simulation to have ﬁxed interconnections. The algorithm of our
technique and the proof of correctness are stated in section 4.2. The performance issues of the

algorithm are discussed in section 4.3. The proposed technique is especially good when the

79

cost of state saving is high compared with the cost of event execution. Logic simulation has
very ﬁne computational granularity. We applied our algorithm on logic simulation and tested it
on Terascale Computing System (T CS) at the Pittsburgh Supercomputing Center. The

performance result is presented in the section four of the paper.

4. 7.2 Aégorz't/Jm

4. 7.2.7 The Model

We assume that the system to be simulated has a ﬁxed interconnection. The applications that
have a ﬁxed interconnection include hardware and network simulation. The algorithm is not
applicable to war game simulation where objects communicate without ﬁxed connections. The
algorithm also requires communication channels of parallel machine to be FIFO (First In First
Out). Popular communication mechanisms including MP1 and TCP satisfy this condition. We
call a channel that has initial events, a primary input channel. We require that a primary input
channel is not an output channel of any process. In summary, we have the following

assumptions about the model.

Interconnection is fixed.

Communication channels are FIFO.

Initially, there are only events on the primary input channels.

4. 7.2.2 Notation:

We use the following notations in our algorithm.

[171(C): Lower bound of the time stamp of an event that may be received on channel C. To

specify the value at simulation time of 2', we use the notation lbtJC).

80

llztﬂ3): Lower bound of the time stamp of an event that may be received on process P. It is

deﬁned as the minimum value of lbt(C) of all input channels of the process.

3(a): Time stamp of event e.

5 Minimum delay of a process.

lbt(e): Lower bound of the time stamp of the next event (after 8) that may be sent from the

sender to the receiver. It is defined as min(/btﬂ’),t5(ew))+5, where e is the smallest time

next

stamped event in the unprocessed event queue of process P.

Anti-message: We use the —e to denote the anti-message of event e.

4. 7.2.3 Aégoritbm
The algorithm updates the value of lbt(C) and lbtﬂ’) with the value of lbt(e) every time a normal

event e arrives on channel C of process P. When the time stamp of an event, which is
scheduled to execute on process P, is less than lbtﬂ’), we can skip the state saving. We can also
discard the states that have smaller time stamp than lbt(P), which may be larger than the value

of GVT. When a new event 6 is generated on process P, we attach lbt(e) to the message.

Initialization Step:

For any channel C, if C is a primary input channel set lbt(C)mx, else lbt(C)=0.
For each primary input, the lbt(e) for each event e is set to be the time stamp
of the next event of e.

The lbt(e) of the last primary input event is set to «a

simulation seep:
if (e arrives on channel C of process P) {
lbt(C) max(lbt(C),1bt(e));
lbt(P) max(GVT, min({1bt(Ci)| C, is an input
channel of P )l):

}
if (e is scheduled to execute on process P) {
if (ts(e) >= lbt(P)) save current state;
Execute the event.
For each new event ex generated on output channel k,

81

lbt(ek) = min(lbt(P),ts(en,,¢)) + 5, enext is the next
event to be scheduled by P.
If enext == NULL, ts(ene,¢) = 0°.
Discard all the states that have time stamp smaller
than lbt(P) except the one that has the nearest
time stamp to lbt(P),-
}

 

(5.0%. fl... (49°) (2,...)
(2.5) a a (' ) (2,4) ‘4 5
15’“ bezoo
8:5 P1 5:1 P2
(6,7) /,5)
(6’7) x ¥ (6,7) (3,5)
(6,7) is Straggler. P3
executed (3,5) ls
at p3 executed
at P3

 

 

 

Figure 4.1 An illustration of the algorithm (GVT=1)

Figure 4.1 illustrates the algon'thm. The dashed arrow represents primary input channels. We

use a tuple (t:(e),lbt(e)) to represent an event with time stamp t.r(e) and lb! value lbt(e). Suppose

GVT is 1 during the following event executions. Process P1 has two primary inputs. Event

(1,00) is scheduled to execute. The lot of process P1 is 00. The time stamp of the event being

executed is less than the 112:. The algorithm skips state saving. The execution of the event

generates a new event (6,7) that goes to the ﬁrst input channel of process P3. Event (2,4) and

(2,00) are scheduled to execute on process P2. The execution of the events generates a new

event (3,5) that goes to the second input channel of process P3. Again, no state saving is

82

necessary. Suppose event (3,5) generated by P2 is a straggler. Event (6,7) is scheduled to
execute on process P3 before the arrival of event (3,5). No event updated the lb! of the second
channel of P3. The lbt of P3 is set to l, which is the maximum of GVT and minimum of lbt of
input channels. State saving is done because the event time stamp is greater than the lbt of the
process. When the straggler event (3,5) arrives, state is restored to the state before event (6,7)
was executed. The new lb! of the process is updated to 4. Since the time stamp of event

execution is 3, no state saving is done.

4. 7.2.4 Proof of Commie“

Note that the algorithm may skip state saving and discard states that have a larger time stamp
than the GVT. To prove that the algorithm is correct, we need to show that whenever a
straggler event arrives, there exists a state that has a smaller time stamp than the straggler
event. Thus, the Time Warp can be rolled back to the correct state. The straggler event may be
a normal event or an anti-message. We also assume cancellation schemes meet the following

cancellation condition.

(Cancellation Condition) For an event e tbat it generated [yr prom: P and an} ”Image e ’ generated between e

and anti-message —e, lbt(e) must not be greater tban ti(e).

Later in this section, we are going to prove that both aggressive and lazy cancellation scheme
satisfy the above cancellation condition. The following is some trivial relationships from the

algorithm.

lbtﬂ’) is deﬁned as the minimum value of lbt(C) of all input channels.

83

lbt(P) = min({/bt(C,) | C, i; an input ebonne/ of P} ) --- (1)

GVI does not have an effect on the correctness of the algorithm. Therefore, we will not

consider the GVT in the proof.

If event e is a primary input event, lbt(e) is pre-set to be the time stamp of the next event on its

primary input channel. No new event may be generated on the primary input channel. The

lbt(e) of the last primary event is set to 00. When a new event 6 is generated, we set the lbt(e) to

/bt{e)‘—’Inin(/bt(P),t5(e ”+5 (2)

”(A

e is the next event to be executed. If e does not exist, t:(em,)=°°.

next next

Initial value of lbt(C) is 0 if C is a non-primary input channel or 00 if C is a primary input
channel. Primary input events are pre-inserted and no event will arrive on a primary input

channel. lbt(C) of a non-primary input channel is updated when an event arrives on it.

/bt,,,.(C)= max-(1124,10, lbt(e) ) (3)

From (1) and (3) we have following simple observations.

Observation 1: lbtr(C) S/btr'(C), if TS If

Observation 2: lbttﬂJ) flint/(P), if 15 2".

The following lemma 1 is the foundation of the proof.

84

Lemma 7. At rinIu/ntion timez; if event e or —-e arrive: at an input ebanne/ C of oly’eet P, tben t:(e)2/th (C)

and /bt(e)2/btr (C).

(Proof)

We prove Lemma 1 by induction on 2:

(Boris) At simulation time 0, no event arrives at any input channel. Lemma 1 is trivially correct.

(Inductive ijotbetis) Suppose that the claim is true for all objects for simulation time up to 1"<

I:

(Inductive Step) Let us prove that the claim is true at simulation time 21 Note that lbt(C) is
updated every time a normal event e is received on C. Let e be an event received on channel C

at simulation time 2: We need to prove that

t5(e) 21th (C)

lbt(e) Zlbt, (C), if e is a normal event.

Note that if e is the ﬁrst message that P has received on channel C during the simulation time,

then lbt(C) '—" 0 at the time C receives e, so t:(e)2 lbt(C), lbt(e) 2/bt(C) and the claim holds.

Thus, we need to consider the case when there is a message placed on channel C before e.

Event e could be a normal event or an anti-message. Let us ﬁrst consider when e is a normal

event. Let e0 be the last normal message before e is placed on the channel C. Since the

85

interconnection is ﬁxed, e and e0 are generated by the same processor P'. Moreover, because

the communication channel is FIFO, e, is generated before e. Let e be generated by executing
event e' at simulation time 1", and e0 be generated by executing eo' at time 25'. Let C' be the

channel where P' receives e'.

Note that at P, after receiving e, at time To, the new lbt of channel C will be lbt(C) = max(/bt(ea),

lbtm(C)). Event e, is a normal event. By inductive hypothesis, lbt(ea)2 lbt10(C), so the new lbt(C)

after receiving e0 will be lbt(ea), and it will remain until e arrives at C.
That is, lbt,(C)‘—'/bt(ea), for 1;, < t_< z' ..- (4)
There are two cases:

(Case 7) Event e' was received by P' after P' executed eo' and generated e0.

e' arrives
I
‘0

l le'
8
$80 i
\ ea \1
arrives e
70' to z" 1" 2’ Physical
time

 

Figure 4.2 Event (3' was received by P' after P' executed e,’ and generated e,

86

Let 1'" be the simulation time P' receives e', 2" be the time P' executed e', and 25' be the time of

P' executed eo'. We have T> 2’ > 2'"> 2;].

By inductive hypothesis,

“(6') 2 In... (C') 2 1a,, (C) 2 1177.011?) (5)

Therefore,

t5(e) 2 t5(e') + 52 lbtmﬂy) + 6 _>_ lbt(eo) = 1th (C) --- (6)

Note that the last inequality follows from equation (2).

Consider lbt(e) = min (lbtr (P'), t.r(em,)) + 5, where ewis the next event after (2' at P'.

From (5), we have t:(e,,m) _>.t5(e') 2 lbtm.(P').

Since r> 2’ > 2'"> 70', lbt, (P') 2 lbtm.(P')

Thus, lbt(e) = min ( 1th a”), t5(€m1)) + 62 lbtwﬂy) + 62 lb: (ea) = lbt, (C) (7)
Note that the last equality follows from equation (4).

From (6) and ('7), the claim is true for case 1.

(Cate 2) The event e' was received by P' before P' executed eo' and generated e0.

Note that case 2 includes the case that e' is a primary input event. There are two sub cases.

87

(Care 2A) t5(e') _<t5(e0')

In this case, the execution of e' at time 2’ must be at least the second execution of e'. There

must exist another event e" such that t:(e") _<t5(e') and e" arrived at P' at 2", such that 25' < Z" <

1’.

e arrives e arrives

55...... O

to' to I" z’ 2' Physical
time

 

Figure 4.3 Event 43' was received by 1" before I" executed c,’ and generated e,

By inductive hypothesis,

t5(e") 21195. (P') 211nm. a”).

Therefore,

t5(e) 2 t:(e') + 52 t5(e") + 52/btm.(P') + 52/bt(e0) = lbt,(C) --- (10)

Consider, lbt(e0)= min(/btw (P) t:(ew)) + 6.

We know that,

t!(e..-....) > me.) > ate") >lbtro (P7
88

Therefore,

lbt(e)) = 1127,, (P') + 6

Consider, lbt(e),

lbt(e) 2172;, a?) + 6 2 lbt,0.(P') + 6 = lbt(ea) = lth(C) (11)

From (10) and (11), the claim is true for case 2A.

(Care ZB) tr(e') > tr(e0')

In this case, e' was in the unprocessed event queue of P' when P' executed eo'. Therefore,

tr(e') 2tr(em,), where e

next

is the next event scheduled after eo' at the execution of eo'.

Note that lbt(ea) = min (lbtto. ﬂ”), t:(em,)) + 5 _< tr(em) + 5

Therefore,

t5(e) 2 t5(e') + 52 1563”“) + (5 2 lbt(ea) = lbtf(C) --- (8)

Let e" be the event scheduled next after e' at P. Note that t:(e") > t.r(e') > tr (em).

Thus,

lbt(e) = min(lbtp (P'), t.r(e")) + 5 2 min (Ibtm.(P'), tr(em ) + 5 = lbt(eO) -- (9)

From (8) and (9), the claim is true for case ZB.

89

Therefore the claim is true for any event e.

Now let us consider anti-message —e.

The proof is trivial for tr(-e)=tr(e)2 1th (C). The value of lbt(C) is not greater than tr(e) when
event e arrived. By the cancellation condition, the value of lbt(C) is not increased before —e is
received. The proof of lbt(—e)2 lbt,(C) is similar to the techniques we used above and we are

going to skip it.

Thus, the claim is true during all simulation time. D

Lemma 2. For any stragler event scheduled to execute, there is at least one saved

state that has smaller time stamp than the time stamp of the straggler event.

(Proof)

Suppose e or —e is a straggler event. The time stamp of the last event execution must be greater

than or equal to tr(e). Therefore, e or —e must have been arrived after the last event execution

(note that execution does not change the lbt value of a process). Suppose e’is the ﬁrst event
executed by the same process such that tr(e) _<t5(e'). By Lemma 1, the lbt of the process at all

moments before the arrival of e or -e is less than or equal to tr(e). Therefore, t:(e') 2 lbtﬂ’), t is

the time event 6" is executed.

From the algorithm we know that a state s with time stamp less than tr(e) must have been

saved before the execution of event of If r is discarded before the arrival of e , there must exist

another state r’with time stamp less than the lbt(P) of that moment. Because lbt(P) at any

90

moment of simulation before the arrival of event e is less than or equal to tr(e), a state with time
stamp less than tr(e) must exist. Since there is no event execution between the arrival of e and
the execution of e, no state will be discarded by the value of lbt(P) event if the value may have
been updated to be larger than t.r(e) during this period. Therefore, when e or —e is scheduled to

execute, there must exist a saved state with smaller time stamp than t.r(e).

Lemma 2 guarantees the simulation runs correctly. I]

Lemma 3. Both aggressive cancellation and lazy cancellation satisfy the cancellation

condition we proposed.

(Proof)

(Aggrerrive Cancellation)

We are going to prove by contradiction that ﬁrst violation of the cancellation condition cannot

occur.

Suppose —e is the ﬁrst violation of the cancellation condition. Suppose e, is the message that

generated e at process P. Suppose e2 is the ﬁrst event execution after el at process P, such that
tr(ej _< tr(el). Suppose e’ is any event generated between e and —e. By Lemma 1 and the

assumption, we have,

t5(e) 2 tr(e,)+§2tr(e2) + (SZ/btﬂ’) + 5 = lbt(e)

Therefore, the time stamp of e is greater than or equal to lbt(e) of any event e’generated

between e and —e. This contradicts the assumption that -e is the ﬁrst violation.

91

(Iggy Cancellation)

It can be proved similarly to aggressive cancellation scheme. E

Theorem 1. Under both Wssive cancellation and lazy cancellation, the algorithm is

correct.
The proof is trivial from Lemma 2 and Lemma 3. l]

4. 7.3 Performance Anabolic

4. 7.3. 7 Individual Procerr Performance
The proposed technique makes Time Warp perform differently with three possible cases. The

best case is when the time stamp of the current event execution is less than the lbt of the
process. The Time Warp process has the least overhead in this case. No state saving is needed.
If there are states saved with smaller time stamp than current lbt, they can be discarded earlier
without having to wait for the GVT catch up with lbt of the process. The second case is when
the lbt of current executing process is greater than the GVT but the time stamp of the current
event execution is greater than or equal to the lbt of the process. State saving is necessary but
some saved states can be discarded earlier. The third case is when the minimum lbt of the input
channels is less than or equal to the GVT. In this case, current state has to be saved and saved

states cannot be discarded earlier. Following graph illustrates the three cases.

3,8 (5,8) (8.7)\ 98) (3a) 42
ibrz)8 leB le7\ @114: 8 lbt: 2\ 55,32

case 1 case 2 case 3

Figure 4.4 Three performance cases of the algorithm (GVT=2)

92

The GVT of the example is 2. For the case 1, event (3,8) will be selected to execute. There is
an event (5,8) on the other channel. When event (3,8) is being executed, the lbt of the ﬁrst
channel is 8 and the lbt of the second is 8. Therefore, the lbt of the process is 8. Since the time
stamp of current event is 3, which is less than the lbt of the process, no state saving is
necessary. Moreover, some states with smaller time stamp than 8 can be discarded. For the
case 2, current event has time stamp 8, which is larger than the lbt of the process. We need to
save the state but some saved states with smaller time stamp than 7 can be discarded. The
minimum lbt of the two input channels is 2 for the third case. State saving is necessary and no

state can be discarded earlier than the GVT.

The performance of the algorithm beneﬁt from GVT computation. The following ﬁgure

illustrates an example.
le0\ /(3.3)
lbt=3

6,4)
(3'4) arrives

lbt—‘4 @

Figure 4.5 Performance gains from GVT computation (GVT=2)

Above example shows a snapshot of two processes during a simulation. The GVT is 2 at that
moment. The ﬁrst input channel of the ﬁrst process never received an event, so the lbt of that
channel is 0. Event (3,3) of the ﬁrst process will be scheduled to execute. The process will have
the lbt value of GVT. The process generates a new event (6,4). When the event (3,4) is
scheduled to execute for the second process, the lbt of the process is going to be 4, which is

93

less than the current event time stamp. This is the case 1 of the algorithm and state saving can

be skipped.

4. 7.3.2 Overall Performance

We deﬁne the rollback rate ,0 as the number of cancelled event executions over the total
number of event executions. From the proof of the correctness of the algorithm we know
that, for every event execution that will be cancelled later, there is a state saving operation. The
best-case scenario is when all correct event executions do not save the states. In this case, the
total number of state saving is )0 times the total number of event executions. If the minimum
lbt of input channels is not greater than GVT, we need to save the state for every event
execution. The number of state savings ranges from p"N to N, where N is the total number of
event executions. When there is no state saving, the algorithm achieves the optimal
performance. Since event executions that are going to be cancelled guarantee state saving, it is
expected that performance of low rollback rate simulation is more promising than high
rollback rate simulation. The algorithm relies on the lbt of a process to progress faster than the
GVT to have good performance. It can reduce the total number of states saved as well as the
maximum memory usage of the simulation. The number of state savings is reduced by only
saving the state when the event time stamp is larger than the lbt. The maximum number of

saved states can be reduced by discarding the states earlier without having to wait for the

GVT.

4. 7.3.3 Algorithm Overhead

We allocate a word memory size for each input channel. The update of lbt happens to every

event arrival. It is a tiny fraction compared with event queue handling. The one word extra

94

memory for each channel is small compared with event queue size and memory size of state
saving. The algorithm also needs one word size memory for lbt to be associated with an event.
It takes small portion in the event data structure that includes value of the event, destination
information and time stamp. Sending a message with an additional word has little impact on
the communication performance. Efﬁcient implementation of the algorithm may make the

overhead of the algorithm take very small portion of the total cost.

4. 7.4 Bencbmané Result:

We implemented the algorithm and tested it on Terascale Computing System (T CS) at the
Pittsburgh Supercomputing Center. We chose four large ISCA589 benchmark circuits as our
target system. We executed circuits with up to 32 processors. Entities are randomly '
partitioned. Input vectors come at every clock with random values. Clock is 10,000 times larger
than the unit delay. of a gate. Table 4.1 shows the result of the algorithm on four different
circuits. We observed signiﬁcant improvement on the number of state savings and maximum

state queue sizes on all of the circuits with tested.

We also observed that the algorithm is able to reduce the execution time and increase the
speedup of the simulation. The improvement in execution time comes from less number of
state savings and smaller event queue size. Figure 4.6 compares the speedup of four ISCAS89
circuits. We observed that 5-30% better speedup from our propose technique compared with
original Time Warp. Table 4.2 compares the maximum speedup number achieved by original

Time Warp and our improved technique with less than 32 processors.

 

’ We compared performance of models with unit delay and non-unit delay. The difference is very small.

95

838584
/ Original
umber of
tate '

$35932
/ Original
umber of
tate '

0

815850
/Original

umber of

90/0
50/0
0/0
30/0
0/o

81 3207
/ Original
umber of
tate '
/o
1 %
%
8%
50/0

96

ize
/o
5%
0
5%
5%

70/0
80/0
80/0
40/0

/0

/ Original
State Queue

Original
State Queue

Original
State Queue

Original
State Queue

 

Table 4.1 Benchmark results of the algorithm on circuit simulation

 

 

M
a:

10 14 16 22 26 30
mam

 

 

 

 

NumberofProeeeeon

—0—Original —I- Improved

 

97

 

 

815050

 

2 6 10 14 18 22 26 30
MulberofProeeaeore

—o—Origind +Impmved

 

 

 

813207

 

 

 

 

 

NumberofProeeeeors

—o-Orlginal + Improved

 

Figure 4.6 Improvement on the speedup of the algorithm.

98

 

38584 . .61
35932 1.31
15850 .74 .75
13207 .32 .00

 

Table 4.2 Best speedup comparisons

4. 7.5 Conclurion

We proposed a technique that can reduce the number of state savings and the maximum
memory size required of Time Warp algorithm. The performance may vary with different
domains of simulation. We conducted some experimental study on logic simulation. The
results show the signiﬁcant improvement in the number of state savings as well as maximum
state queue size. We also observed improvement in execution time of simulation caused by less
number of state savings. Our benchmark shows that the proposed technique improves

memory usage 18-85% and maximum speed-up up to 5-30%.

Because the algorithm has small overhead and easy implementation, it is worthwhile to
integrate the algorithm in the Time Warp system. Combination of our algorithm with other
state saving algorithms may further improve the Time Warp performance. It is also interesting

to benchmark the algorithm on other applications of discrete event simulation.

99

4.2 Performance Analysis

4.2.7 Introduction

Performance evaluation of Time Warp is one of the least studied areas of parallel discrete
event simulation as it is one of the hardest research areas. Nicol [60] gives performance
bounds of self-initiating models. It considers state saving and rollback cost. Cascading rollback
is not considered. Tay gives performance evaluation model of Time Warp including state
saving and communication cost. Parallel performance can be predicted by sequentially
analyzing the sequential model. Lin use critical-path method to show that TW always performs
at least as good as conservative if state saving and rollback cost is negligible. If simulator rolls
back only false computations, the simulation is optimal Gupta uses Marchov-chains to model
the performance of TW. Performance results are approximated using numerical methods.
Lubachevsky analyze the stability of TW using Branch and bound method. Dickens analyze
bounded TW based on probabilistic approach and numerical techniques. So far, these
performance evaluation models either only gives bounds of Time Warp or have very limited

applications.

Time Warp has the ability to exploit the parallelism inherent in a simulation but not without
extra costs. The state of objects has to be saved to be able to rollback to the correct state when
the events are executed out of order. There are also extra costs of computing the Global
Virtual Time, recovering from rollback and garbage collection, etc. Researchers have proposed
many interesting ideas to improve the performance of Time Warp. In many cases,
improvement in one factor may deteriorate another factor. Our goal is to ﬁnd a way to

measure those performance factors.

100

In this part, we propose a simple performance analysis model. We divide the total execution
time into computation, communication and idle time. Communication and idle time can be
separated from the total execution time by measuring certain parameters. The total number of
event executions is closely related to the rollback rate. If there is no rollback, the number of
event executions is optimal and predetermined for a given simulation. Because of the rollback
in Time Warp, the actual number of event executions is larger than the optimal value. The
higher number in event executions means higher rate of rollback Knowing the remaining
computation time, we can get the average Time Warp overhead for executing one event by
measuring the total number of event executions. Time Warp overhead and rollback rate are

essential for Time Warp to perform better than other algorithms.

It is possible that during the execution of Time Warp, a Time Warp process does not have any
event to execute. Depending how Time Warp is implemented, the Time Warp process can
suspend itself and be awakened by the operating system or continuously probes for a new
message to arrive. We chose the second option because we are not interested in the
multitasking and assume that the whole processor is dedicated to the Time Warp process. The

idle time can be measured by the number of probes executed by the Time Warp process.

We can easily measure following parameters of a Time Warp execution. They are total number
of event executions, total number of messages sent, total length of messages sent and the total
number of probes. The parameters to calculate the communication cost can be measured
independently and it only depends on the system the Time Warp runs on. In part two of the
paper, we propose a communication model. The cost of each probe can be measured

independently and it depends on the system the Time Warp runs on as well.

101

It is hard to measure the Time Warp overhead. This is because the system many not have
uniform granularity and there are different overhead for state saving and garbage collection,
etc. In this paper, we investigate the system with uniform event execution granularity. We
make an assumption that the costs of state saving, garbage collection, event queue operations
and other overhead are proportional to the number of event executions. Therefore, we can
average Time Warp overhead to each event execution. The cost associated to each execution
has two parts. One is the granularity of event execution and the other is the average Time
Warp overhead for one execution. By inserting some granularity into the every event
execution, we are able to separate the Time Warp overhead and event execution time. We also
study the effect of packing several events into one single message before sending them to
other processors. Event packing can reduce the communication cost, but it may increase the

rollback back rate and idle time.

In section 4.3, we derive a formula for the performance analysis of Time Warp. We have
inserted various granularities to 535932 circuit and run our Time Warp algorithm on SGI
Origin 2000 up to 32 processors. After separating communication time and idle time from the
total execution time, we can get average computation time for each event execution. We can
estimate the Time Warp overhead per event execution by doing least square estimation on the
empirical results. The very small error shows the accuracy of our performance analysis model.
As an application of our performance analysis model, we have studied the effect of event
packing. We tried to optimize the communication cost by packing several events into one
message before sending them to other processors. We have benchmarked 535932 and multi32
circuits packing from one up to 256 events into one message. The empirical results show that

while the communication cost decreases, both idle time and rollback rate are increasing.

102

Adding the factor that computation cost dominates over communication cost, there is not
signiﬁcant gain in the ﬁnal performance. The performance analysis explains the empirical

results.

4.2.2 Communication Model

We use the same communication model as we used in predicting the performance of

synchronous simulation. Therefore, we skip this part.

4.2.3 Performance Anaboir

4.2.3. 7 Formula

The total execution time is the sum of computation, communication and idle time.

T=T +T

mnqﬁutation communication

+ TM,

The computation cost is the total number of event executions times the sum of unit event

execution time and the average Time Warp overhead for one event execution.

— 3k
Tmnniuuttion _ (Tarn! + Tannin-ad ) N amthz-cutr'on

The communication cost is the total number of messages sent times the cost of sending one

message of average message length.

— if '1’
’I-iommuniratron — (OﬂIl/l + ”I NMtIIdgr_btgf/I) N manage

The idle time is the total number of message probes times the cost of each probe.

103

T,d,c=T *N

probc_cost probes

All the parameters except 71M“, can be measured either independently or from the execution

of the simulation. In the next part, we estimate the Time Warp overhead.

4.2. 3 .2 Estimating tbe Time Warp overhead
Given a simulation, the Time Warp overhead can be measured by inserting granularity into the

event execution. According to the formula, the cost of each execution consists of event
execution and Time Warp overhead. The cost of each execution can be measured and the
Time Warp overhead does not change as event execution granularity changes. If we vary the
event execution granularity, we should expect the linear relationship between the cost of each
execution and the granularity. We can use least square ﬁt to a linear formula to estimate the
Time Warp overhead. We have benchmarked $35932 circuit on SGI Origin 2000. Table 4.3
shows the results of Tm+ﬁm with various granularities. The least square estimation gives
the Time Warp overhead with value of 53 microseconds and coefﬁcient with value of 0.97.
The small error shows the accuracy of our performance analysis model.
T +T

0 55.3
30 80.9

60 109.8
120 169.2
240 286.8

Table 4.3 Twm,+To,,,,had of 335932 with various granularities

 

104

4.2.3.3 Eﬂect of event packing

We tried to improve the performance of Time Warp by reducing the communication cost.
One way to reduce the communication cost is to combine several events into one message. We
expected that less number of messages were sent, which could reduce the communication cost.
We also expected the rollback rate to increase as more events were packed into one message,
for it takes longer for events to arrive at destinations. We tried different runs with different
number of packing ranging from one to 256. We did not see signiﬁcant improvement of the
overall performance. Two of circuits we picked are 535932 and multi32. The best performance
of message packing compared with the performance of non-packing is less than twenty
percent. We observed that there is an optimal number that produces the best performance. We
applied our performance analysis model to the empirical results of 335932 and multi32. Figure
4.7 shows the breakdown of total execution into computation, communication and idle time.
The empirical results show that while the communication cost decreases, both idle time and
rollback rate increase. Adding the factor that computation cost dominates over
communication cost, there is not signiﬁcant gain in the ﬁnal performance. The performance

analysis explains the empirical results.

105

 

Analysis of 535932

 

—o— Connunication
—I— Conpuation

+ Ids

—x— Total Execution Tm

Seconds

 

 

 

0 5 10
Logﬂhmber of Events per esssge)

 

 

Analysis of Multi32

 

—O—Communication

Seconds

+Total Execution Time

 

 

 

0 2 4 6 8
Log(Number of Events per Message)

 

 

 

Figure 4.7 The eﬁ'ect of event packing on s35932 and multi32
4.2.4 Average quendeng' Gap
In this section, we propose a preliminary measure that describes the parallelism that can be
exploited by Time Warp. Our Time Warp results show that multi32 has higher rollback rate
than 835932 and it performs worse than 535932. Figure 4.8 shows their Time Warp
performance results. We know that they have almost the same average activity number per

106

cycle. They are 942 and 1153 for $35932 and multi32. Since multi32 has slightly higher
activity number than s35932, it performs a little better in synchronous simulation. They
have the same computational granularity. Why does multi32 perform worse than $35932?

This question leads us to search for an inherent property that explains this fact.

 

Speedup Compulsion (Tlmewsrp)

+335932
+multi32

 

0 1 0 20 30 40
Number or Processors

 

 

 

Figure 4.8 Performance comparison of 335932 and multi32

Critical path analysis exists that computes the maximum parallelisms available in the system.
However, such parallelism is extremely had to explore for a Time Warp scheduling algorithm.

We propose a practical measure of parallelism that can be exploited easily by Time Warp.

We modiﬁed sequential algorithm to record the average gap between the time when an event is
generated and when the event is executed. We used the number of executions to measure this
gap. For example, at the i’th execution an event E is generated and at jth execution the event
E is scheduled and then executed. Then the gap is j-i. We got the average of all gaps for all
events. We call this number to be the Average Dependency Gap (ADG). Can ADG help us to

explain Time Warp performance?
1 O7

Intuitively, ADG is related the degree of parallelism of a system. On average, an event is to be
scheduled after ADG number of executions. Therefore, roughly ADG number of events can

be executed at the same time. Higher ADG means higher degree of parallelism.

The ADG is calculated by the random scheduling. Different scheduling may generate different
ADG. The maximum value of ADG exists for a given system with a certain scheduling. We
are more interested in the ADG calculated by random scheduling because following reasons.
We believe that general purpose Time Warp system we are interested in does not have
scheduling capability to raise the ADP significantly. Getting the maximum value of ADG is

hard problem even when the system is given.

Now back to the results we obtained for the ADG values of 535932 and multi32. The ADC of
535932 is 3556 and ADG of multi32 is 1657. We see that 535932 has considerably more
parallelism than multi32. It is trivial that ADG 2 AAN. For a given system, if the larger the
ratio between ADG and AAN, the better performance improvement from synchronous
simulation to Time Warp. If we consider the optimistic synchronous algorithm we proposed in
chapter three, we can increase the active number of 535932 roughly four times. According to
the performance formula, the maximum speedup we can achieve for 535932 is two times more

than the traditional synchronous algorithm.

 

 

 

AAN ADG
835932 942 3556
Multi32 1 153 1657

 

 

 

 

 

Table 4.4 AAN and ADG of 335932 and multi32

108

4.2.5 Conclusion

In this section we analyzed the performance of Time Warp simulation. We broke down the
performance of Time Warp into three parts: computation, communication and idle time. We
proposed average dcpendeng gap (ADG) that measures the parallelism that can be exploited by
Time Warp algorithm. Compared with critical path analysis, this measure gives the value of a

more practical parallelism.

In this future, we need to develop a mathematical performance model of Time Warp
simulation based on ADP. We are interested in understanding how rollback rate can be

predicted with this model.

109

C/Japter 5

5. UNIFIYING SIMULATION TECHNIQUES

5.1 Introduction

There are three major parallel simulation techniques. They are synchronous simulation,
conservative simulation and optimistic simulation. Synchronous simulation is the simplest
simulation technique. At each cycle, LPs with the smallest time stamp are scheduled to
execute. After execution step the next smallest time stamp is calculated by either centralized or
decentralized global clock calculation mechanism [5,81]. The advantages of synchronous
simulation include its low overhead, easy of implementation and performance predictability
[85]. Synchronous simulation is more prone to poor load balancing and high communication
cost caused by synchronization steps between cycles. These problems will get worse with
increasing number of processors. Synchronous simulation is suited for the simulation that has
very small computational granularity and reasonably large cycle parallelism. Konas [48] and
Soul ['72] compared the performance of synchronous with other asynchronous simulations on
the logic simulation. They found that synchronous algorithm gives the best performance

compared with non—optimized asynchronous algorithms.

110

Chandy and Misra and Bryant independently developed conservative simulation algorithm.
Each generated event is accompanied with the LVT of the sending LP. The receiver LP knows
that it will not receive any event in the future with smaller time stamp than the LVT from the
sender. Each LP can safely execute the events with time stamp smaller than the LVT value of
all input channels. Deadlock may occur when there is cycle of block LPs each waiting for its
predecessor to update its clock. Null-message is used to avoid deadlock but it can add
signiﬁcant overhead on the performance. Chandy and Misra also proposed a two-phase
method to detect and resolve the deadlock Deadlock can be broken by the observation that
events with smallest time stamp are always safe to process. Good lookahead is critical for the

performance of conservative algorithm.

Time Warp algorithm allows optimistic asynchronous simulation of discrete event systems. It
can explore the parallelism in a system by allowing out of order event execution. Some of these
event executions may not be correct and need to be cancelled. Rollback mechanism is used to
recover from error executions. When a straggler event arrives, the LP needs to be recovered to
the state before time stamp of the straggler. This requires frequent state saving. GVT (Global
Virtual Time) is used to do garbage collection of the states that are no longer necessary to stay
in memory. GVT is deﬁned as the minimum time stamp of all events in transit or in the input
queues of all logical processes. No new event generated can have a smaller time stamp than
GVT. Therefore all states except one that has smaller time stamp than GVT can be discarded

safely.

Each of these techniques has pros and cons. Each of them can out-perform one another

under different situations. Our goal is to develop an efﬁcient framework for unifying

111

simulation techniques. The framework is able to exploit the advantages of each simulation
technique. We choose optimistic technique as basis of our framework. The existing optimistic
technique has the best framework for the unifying simulator that is able to exploit both cycle
parallelism and conservative parallelism. The rest of the paper is organized as follows. In part
two of this chapter, a technique to exploit the cycle parallelism is introduced. In part three, we
discuss techniques to exploit conservative parallelism. In part four, we introduce a framework
to unify the simulation techniques. In the last part, we present an application of the unifying
framework on logic simulation. We compared the experimental results of unifying technique

with synchronous, conservative and optimistic simulations.

5.2 Exploiting Cycle Parallelism

5.2.1 Cycle Parallelism
The synchronous algorithm exploits cycle Parallelism. Each cycle, the smallest time stamped

events are guaranteed to be safe to execute. It has following advantages and disadvantages.

* Advantages:

Low overhead

Easy implementation

Performance predictability

112

* Disadvantages:

Limited parallelism

Bad load balancing

Synchronous simulation is well suited to simulate systems with large cycle parallelism. Because
it has low overhead, synchronous simulation out performs optimistic simulation protocols in

following situations:

Large number of cycle parallelism, and Small computation granularity

Experimental studies show that when number of processors is small synchronous simulation
outperforms optimistic simulation [47]. Studies show that state saving cost of optimistic
simulation needs to be reduced signiﬁcantly to have better performance than synchronous
simulation. If we can exploit the synchronous parallelism and skip the state saving, optimistic

may perform as good as synchronous simulation in most situations.

5.2.2 Tweaking the deﬁnition of GVT

GVT is deﬁned as the minimum time stamp of events in the unprocessed event queue and the
events in transit. The idea of GV I is proposed to reclaim memory of saved states and commit

output events. Many GVT computation algorithms are proposed. The goals of these

113

algorithms include the better estimation of real GVT and efﬁciency. In this paper, we ﬁnd

another purpose of GVT computation: exploiting cycle parallelism.

It is not safe to execute the events with time stamp GV I because another event with the same
time stamp may still be in transit and it may arrive at the other port of the same object. This

forbids us from exploiting the cycle parallelism.

The deﬁnition of GVT has to be changed to exploit the cycle parallelism. Let us deﬁne GVT

as follows:

GVT 1' n1in({tin1e stamp of unprocessed event} U {send time stamp of events in transit })

This new GVT deﬁnition does not regard the events in transit as the same identity as the
events that are already placed in the event queue. The events in transit are regarded as the
unﬁnished products of an execution. An execution is not completed until the new events
generated are inserted into the receiver’s queue. Therefore, event in transit contributes the
send time stamp to the GVT. We gave a new meaning to GVT. With this new deﬁnition, we
are able to safely execute all the events with the time stamp of GVT without doing state

saving.

5.2.3 Performance Considerations
Ideally, if the accurate value of GVT is available at all times, optimistic simulation can exploit

the maximum cycle parallelism. In practice, we need to balance between the GVT
114

computation and exploiting cycle parallelism. Frequent GVT computation can help LPs skip
state saving by exploiting more parallelisms, but it will increase the communication cost of the
system. Infrequent GVT computation has smaller communication cost but it exploits less cycle
parallelism. The overhead caused by frequent GVT computation should not be larger than the
event execution cost. On the other hand, GVT computation needs not be frequent if there

exists large cycle parallelism.

We can consider synchronous simulation as a special case of optimistic simulation. GVT is
computed when all events with time stamp GVT ﬁnished execution. All event executions are
safe and no state saving is necessary. The frequency of GVT computation is completely
decided by the size of cycle parallelism. If there exists large cycle parallelism, the performance
is good because there is no state saving and rollback, and the overhead of GVT computation is
small compared with event execution. When there are small cycle parallelism or large number

of processors, GVT computation cost will degrade the performance.

5.3 Exploiting Conservative Parallelism

5.3.7 Conservative Parallelism

Studies have shown that conservative and optimistic simulation can outperform each other. If
optimistic simulation can exploit conservative parallelism, it may achieve the comparable
performance of conservative simulation at its best while still outperforming conservative

simulation in the cases that favor optimistic simulation.

115

5 .3 .2 Related Work

Jha and Bagrodia [46] proposed a framework that uniﬁes the conservative and optimistic
algorithms. A process can choose autonomously to be optimistic or conservative. A process is
selected to be conservative to limit the parallelism and to avoid generating large number of
incorrect messages. The decision to be optimistic or conservative is hard to make. If bad
decision is made, the performance may be worse than both conservative and optimistic
simulation. Chung and Xu [27] exploited lookahead further to reduce the overhead of
optimistic simulation. In their approach, an LP executes in conservative mode only when the
execution is guaranteed to be safe. An event execution is in optimistic mode if the event
execution is not safe. Therefore, their mode switching is passive. They applied their technique
to logic simulation where the computational granularity is very small and state saving cost is
relatively high. They observed number of state saving is reduced up to 60% for some circuits.
Their worst-case performance is similar to pure Time Warp when no lookahead exists.
Optimally, if all the LPs are executed in conservative mode, the performance will be equal to
no rollback Time Warp without state saving. Both approaches embed the lookahead
computation into the optimistic simulation. Chung and Xu’s approach does not generate null-

messages to advance the lookahead and puts little overhead to optimistic simulation.

5.3.3 Performance Considerations

From the viewpoint of conservative simulation, the optimistic execution of an LP can be

regarded as improved conservative simulation that executes unsafe events instead of blocking

116

to wait for the deadlock recovery. We can also consider conservative simulation as a special
case of optimistic simulation. GVT is computation is used as a deadlock recover mechanism.
GVT is computed when there is no safe event to execute. Simulation will have the least
overhead when GVT computation is not necessary. If the model has lookahead, GVT
computation is less frequent than synchronous simulation. If the model has no lookahead, it

will become synchronous simulation.

If there exits good lookahead in the model, the optimistic simulation can beneﬁt by exploiting
conservative parallelism. State saving cost can be reduced. Scheduling safe events to execute

before unsafe events may reduce rollback rate.

5.4 Unifying Framework

An LP has three modes of execution: synchronous mode, conservative mode and optimistic

mode.

117

 

Synchronous Mode

 

 

 

 

 

Conservative Mode ‘___, Optimistic Mode

 

 

 

 

 

 

Figure 5.1 An LP can switch back and forth from three execution modes

The ﬁgure above shows that an LP can switch back and forth from synchronous, conservative
and optimistic mode. An LP is eligible to execute in synchronous mode when the local virtual
time is equal to GVT. An LP is eligible to execute in conservative mode when the local virtual
time is smaller than the lower bound time stamp of future event the LP may receive. Both
synchronous and conservative mode have low overhead with no state saving necessary. All
previous states can be reclaimed for synchronous and conservative mode executions. An LP
executes in optimistic mode when it does not satisfy the conditions to be executed in
synchronous mode or conservative mode. In optimistic mode current state needs to be saved

because future rollback is possible.

The following ﬁgure shows the module diagram of the simulation kernel.

118

 

 

 

communication 4——> scheduler ‘——p simulation
manager object list

 

 

 

 

 

 

 

 

 

 

GVT manager

 

 

 

Figure 5.2 Module diagram of simulation kernel

Scheduler is the center of simulation kernel. It decides which LP to execute and in which
mode to execute. GVT Manager is responsible for computing the value of GVT. The
frequency of GVT computation is decided by the scheduler. Communication manager is
responsible for sending and receiving messages from other processors. Scheduler decides how

many events to be buffered before sending events.

The advantage of our unifying framework is it can be easily adapted to execute in almost all of
known simulation techniques. In the following table, we map our unifying framework to

known simulation techniques.

119

 

Simulation
Techniques

Mapping

 

Synchronous

Scheduler only allows an LP to execute in
synchronous mode. GVT is used to decide the
smallest time stamp.

 

Conservative

Scheduler only allows an LP to execute in
conservative and synchronous mode. GVT
computation is used as the deadlock recovery
mechanism.

 

Optimistic

Scheduler only allows an LP to execute in optimistic
mode.

 

Breathing Time
Buckets

Scheduler allows an LP to execute in synchronous
mode and optimistic mode when no synchronous
mode execution is available. GVT computation is
used to compute the event horizon.

 

AdvanCE SPaDES

Scheduler allows an LP to execute in synchronous
mode and conservative mode when synchronous
mode execution is not available. GVT is used to
decide the smallest time stamp.

 

Optimized Optimistic

 

 

Scheduler gives priority to events that can be
executed in either synchronous mode or
conservative. When no synchronous mode or
conservative mode execution exists, scheduler picks
an LP to execute in optimistic mode.

 

Table 5.1 Mapping of simulation techniques.

We implemented the optimized optimistic simulation that exploits cycle parallelism and
conservative parallelism. We implemented the algorithm and tested it on Terascale Computing
System (T CS) at the Pittsburgh Supercomputing Center. We chose four large ISCASS9
benchmark circuits as our application. We executed circuits with up to 16 processors. Entities
are randomly partitioned. Input vectors come at every clock with random values. Clock is

10,000 times larger than the unit delay of a gate. We observed that the performance of the

5.5 Experimental Results

120

 

optimized optimistic approach is almost as good as synchronous when number of processors
is small and usually outperforms synchronous and when the number of processors is large.
The optimized optimistic always performs better than optimistic simulation. Table 5.2 shows

the maximum speedup achieved for these circuits with up to 16 processors.

 

 

 

 

 

O 4 8 12 16
Number of Processors

 

 

 

121

 

 

‘
O

 

 

 

Speedup

MubUIQVGO

-e—Optlmized Optimistic

 

 

4 8 12 16
Number oi Processors

0

 

 

 

 

$15850

 

. —e—Synchronous
+ Optimistic
—e—Optimized Optimistic

 

 

 

 

o 4 8 12 16
Number of Processors

 

 

 

513207

 

—e— Synchronous
-l— Optimistic
+ Optimized Optimistic

 

 

 

 

Number oi Processors

 

 

 

Figure 5.3 The comparison of experimental results

Maximum

 

Table 5.2 Maximum speedup with up to 16 processors

5.6 Conclusion

In this chapter, we introduced a technique that enables optimistic simulation to exploit cycle
parallelism and conservative parallelism. It gives us a way to avoid optimistic simulation to

perform much worse than other techniques because of state saving overhead and had
123

scheduling that executes non-safe events first instead of safe events. We also proposed a
uni ring framework. In this framework, synchronous, conservative and other simulation

techniques are special cases of the unified simulation kernel.

124

C/Japter 6

6. SUMMARY

This dissertation focused on evaluan'ng and improving the performance of parallel simulation.

First, we proposed a model to predict the performance of synchronous logic simulation. For
the performance of synchronous simulation, the load balancing and communication cost are
the most important factors. The effect of load balancing in a synchronous simulation was
computed using probability distribution models. We derived a formula that shows the cost of
synchronous simulation by combining the communication model called LogGP and
computation granularity. The formula is simple, yet captures the most important factors for
the synchronous simulation. The performance formula predicts that the maximum speed up
achievable by a synchronous simulation is proportional to the number of average active
objects per simulation cycle. The performance prediction model we have developed can be
used to decide whether it is worthwhile to use synchronous simulation before actually
implementing it. We benchmarked several ISCAS circuits on SGI Origin 2000 and Terascale
Computing System. The developed performance model was used to analyze the effect of
several factors that may improve the performance of synchronous simulation. We proposed a

technique that improves the load balancing and simulation cycles. We evaluated the limitation

125

of improving the load balancing factor and indicated that reducing the simulation cycles is a

more promising approach for synchronous logic simulation.

Next, we proposed a cycle reducing synchronous algorithm. The algorithm was based on the
assumption that simulation cycles can be reduced by pre-executing events that should normally
be executed in future simulation cycles. There are two approaches for selecting the events to
be pre-executed. One is to pre-execute only the safe events. The safeness of event was
determined by lookahead computation. The other one is to execute events more aggressively
to execute unsafe events. Our experimental results showed that pre-executing only the safe
events was not able to reduce the simulation cycles signiﬁcantly. We implemented an
aggressive cycle reducing algorithm that executes a certain number of unsafe events in addition
to the smallest time stamped events. The benchmark results showed that this algorithm was
able to reduce the simulation cycle signiﬁcantly and achieve several times better speedup than
the traditional synchronous algorithm for several circuits. However, there is an issue about
selecting the number of extra events to execute in a simulation cycle. If this number is large,
there will be fewer simulation cycles. But, number of incorrect event executions will be higher.
The performance loss from incorrect event executions may become more signiﬁcant than the
gain from reduced simulation cycles. Therefore, this number should be decided adaptively
during the simulation. The communication pattern of cycle reducing algorithm is synchronous.
Therefore, the performance of the algorithm is deterministic. The optimal number of extra

events to execute can be alternatively determined by sequential analysis.

Time Warp algorithm is able to exploit the parallelisms that could not be found by

conservative algorithms. However, this algorithm requires frequent state savings and more

126

memory than other simulation techniques. We proposed a technique that reduced the number
of state savings as well as the memory requirement for Time Warp. The lookahead technique
is used to compute the lower bound time stamp of logical processes. The time stamp of any
future event a logical process receives will be greater than or equal to this lower bound.
Therefore, if the time stamp of an event execution is smaller than this lower bound, the
execution is guaranteed to be correct and no state saving is necessary. In addition, the states
and processed events that have smaller time stamp than this lower bound can be discarded
safely. Each logical process may have different value of lower bound. The lower bound of a
logical process is updated by events that carry the lower bound information of its predecessors.
This algorithm can be easily implemented with minimal overhead. Our benchmark results on
logic simulation showed this algorithm reduced the number of state savings and maximum
state queue sizes signiﬁcantly. We also observed considerable improvement in the parallel
execution time. The improvement came from skipping the state savings as well as the shorter

queue sizes.

Finally, we proposed a unifying framework for parallel simulation that combines the
advantages of synchronous, conservative and optimistic simulation techniques. Synchronous
simulation has the lowest overhead but it does not scale well with the number of processors.
Conservative simulation performs well if good lookahead exits, but the cost of deadlock
avoidance and detection is high. Optimistic simulation is able to exploit more parallelisms, but
the cost of state saving and rollback is signiﬁcant. Under different situations, each of these
techniques may be able to perform better than the rest. The uniﬁed approach gives priority to
the event executions that can be executed in synchronous or conservative mode. An event

execution is eligible to execute in synchronous mode when the time stamp of the event

127

execution is equal to the global virtual time. The traditional deﬁnition of global virtual time was
modiﬁed to guarantee the execution of events with time stamp the same as the global virtual
time cannot be rolled back in the future. The event executions in conservative mode can be
detected by our proposed overhead reducing algorithm. When there is no candidate to be
executed in synchronous and conservative mode, an event is picked to execute in higher
overhead optimistic mode. Therefore, this approach is able to perform as well as the
synchronous and conservative simulations when they perform well and perform better than
these techniques when they perform poorly. The uniﬁed approach always performs better than
the original optimistic approach by skipping state saving and having smaller queue sizes. We
applied this technique to logic simulation. The results conﬁrmed our analysis. The uniﬁed
approach can be easily adapted to execute as known parallel simulation techniques, eliminating

the fundamental differences between these techniques.

128

p—\

10.

BIBLIOGRAPHY

V. D. Agrawai and S. Chakradhar, Performance Analysis of Synchronized Iterative
Algorithms on Multiprocessors Systems. In IEEE Transactions on Parallel and
Distributed Systems, Vol. 3 No. 6, pp. 739-749, 1992.

Alexandrov, M. Ionescu, K. E. Schauser, C. Scheiman, LogGP: Incorporating Long
Messages into the LogP model - One step closer towards a realistic model for parallel
computation, 7th Annual Symposium on Parallel Algorithms and Architecture, pp. 95-105,
1995.

Peter J. Ashenden, The Designer' Guide to VHDL, Morgan Kaufmann, San Francisco,
1996.

P. J. Ashenden, H. Detmold and W. S. McKeen, Execution of VHDL Models using
Parallel Discrete Event Simulation Algorithms, VLSI Design, Vol. 2, No. 1, pp. 1—16, 1994.

D. Baik and B. P. Zeigler, Performance Evaluation of Hierarchical Distributed Simulators,
pp, 421-427, In Proc. of the 1985 Winter Simulation Conference, 1985.

R. Baldwin, M.J. Chung and Y. Chung, Overlapping Window Algorithm for Computing
GVT in Time Warp, Proceedings of International Conference on Distributed Computing
Systems, pp. 534—541 , 1991.

Rajive L. Bagrodia, Designing Efﬁcient Simulations Using Maisie, UCLA Technical Reort,
1997.

H. Bauer et al., Reducing Rollback Overhead in Time Warp Based Distributed Simulation
with Optimized Incremental State Saving, pp. 12-20, Proceedings of the 26th Annual
Simulation Symposium, 1993.

S. Bellenot, State Skipping Performance with the Time Warp Operation System,
Proceedings of 6th Workshop on Parallel and Distributed Simulation, 1992.

F. Brglez and H. Fujiwara, A neutral netlist of 10 combinational benchmark circuits and

target translator in Fortran, Proceedings of the International Symposium on Circuits and
Systems, pp. 695-698, 1985.

129

11.

12.

13.

14.

15.

16.

17.

18.

19.

20.

21.

22.

23.

24.

F. Brglez, D. Bryan and K. Kozminski, Combinational proﬁles of sequential benchmark
circuits, Proceedings of the International Symposium On Circuits and Systems, pp. 1924-
1934, 1989.

J.V. Briner, Jr., J.L. Ellis, and G. Kedem. Breaking the barrier of parallel simulation of
digital systems. In Proc. of the 28th Design Automation Conf., pages 223—226, 1991.

RE. Bryant, Simulation of Packet Communications Architecture Computer Systems, MIT-
LCS-TR-188, Massachusetts Institute of Technology, 1977.

Cabrera, M.J. Chung and Y, Chung, A parallel VHDL Simulator on the Connection
Machine,‘ Proc. of VHDL spring 92 Conf. pp.83-94, 1992.

CD. Carothers, KS. Perumalla and RM. Fujimoto, Efﬁcient Optimistic Parallel
Simulations using Reverse Computation, Proceedings of 13th Workshop on Parallel and
Distributed Simulation, pp. 126-135, 1999.

R. Chamberlain, Parallel logic simulation of VLSI systems, Proceedings of 32"‘1 Design
Automation Conference, pp. 139-143, 1995.

KM. Chandy, and J. Misra, Distributed Simulation: A Case Study in Design and
Veriﬁcation of Distributed Programs, IEEE Transactions on Software Engineering, vol.
SE-S, no. 5, pp. 440—452, September 1979.

K M. Chandy and J. Misra, Asynchronous Distributed Simulation via a Sequence of
Parallel Computations, Communications of the ACM, 24(11), pp. 198-206, 1981.

Chi-Chao Chang, Grzegorz Czajkowski, Thorsten von Eicken, Design and Performance
of Active Messages on the IBM-SP2, Cornell CS Technical Report 96-1572, Feb. 1996

E. Chung and M. J. Chung, An Important Factor for Optimistic Protocol on Distributed
Systems: Granularity, In Proceeding of Winter Simulation Conference, pp. 642-649, 1995.

M. J. Chung, Department of Computer Science, Michigan State University. Parallel VHDL
Performance Simulation Cost and Schedule Report, 1997.

M.J. Chung and Y. Chung, Performance Prediction to Gate to Processor Ratio, Proc. 1992
International Conference on Parallel Processing, pp. 246-253, 1992.

M.J. Chung and Y.M. Chung, An Experimental Analysis of Simulation Clock
Advancement in Parallel Logic Simulation on an SIMD Machine, Advances in Parallel and
Distributed Simulation, Vol. 23, No. pp. 125-133.

M..J Chung and Y. Chung, "Efﬁcient Parallel Logic Simulation Techniques for the
Connection Machine," Proc. of Supercomputing, pp. 606-614, 1990.

130

25.

26.

27.

28.

29.

30.

31.

32.

33.

M.J. Chung and Y. Chung, Data Parallel Logic Simulation using Time Warp on the
Connection Machine, 1989 Design Automation Conference, pp. 98-103, 1989.

Moon Jung Chung and Jiashen Zhou, Version Control and Conﬁguration Management in
Simulation, Technical Report, Michigan State University, 1998.

Moon Jung Chung and Jinsheng Xu, An Overhead Reducing Technique for Time Warp,
Proceedings of IEEE International Workshop on Distributed Simulation and Real Time
Systems, pp. 95-102, 2002.

Moon Jung Chung, Jinsheng Xu and Hee Chul Kim, Parallel PVHDL Simulation Engine,
High Performance Computing Symposium, 1999.

JP. Cohoon and J..W Davidson, C++ Program Design - An Introduction to
Pro ' and Ob'ect-Oriented Desi Times Mirror Hi her Education Grou , 1997.
mg l gn, g P

R. M. Cubert and P. A. Fishwick. MOOSE: An Object-Oriented Multimodeling and
Simulation Application Framework Submitted to Simulation, June 1997.

D. Culler, et. al., [.0ng Towards a Realistic Model of Parallel Computation, 4‘h ACM
SIGPLAN Symposium on Principal and Practice of Parallel Programming, pp. 1-12, 1993.

D. Culler, LT. Liu, RR Martin, and C. Yoshikawa, LogP Performance Assessment of Fast
Network Interfaces, IEEE Micro, November 1995

P. M. Dickens, D. M. Nicol, P. F. Reynolds, JR. and J. M. Duva, Analysis of Bounded
Time Warp and Comparison with YAWNS, ACM Transaction on Modeling and
Computer Simulations, 6(4), pp. 297-320, 1997.

34. J. Duato, S. Yalamanchili and L. Ni, Interconnection Networks: An Engineering

35.

36.

37.

38.

Approach, 1 996.

C. Dubnicki, et al, Software Support for Virtual Memory-Mapped Communication, 10th
International Parallel Processing Symposium, pp. 372-381, 1996.

T. Eicken, et. al., Active Messages: a Mechanism for Integrated Communication and
Computation, 19th International Symposium on Computer Architecture, pp. 256-266,
1992.

J. Engelsma, Y. Chung and M.J. Chung, Distributed Token-Driven Logic Simulation on a
shared-memory multiprocessor, Proc. 6th Workshop in Parallel and Distributed
Simulation, pages 197-198, 1992.

R. E. Felderman and L. Kleinrock, An ‘Upper Bound on the Improvement of
Asynchronous versus Synchronous Distributed Processing, Proc. of the SCS Multiconf. on
Distributed Simulation, 22(1), pp. 131-136, 1990.

131

39.

40.

41.

42.

43.

45.

47.

48.

49.

50.

51.

Ferscha, Parallel and Distributed Simulation of Discrete Event Systems. Handbook of
Parallel and Distributed Computing, McGraw-Hill 1995.

P. Fishwick, A Visual Object-Oriented Multimodeling Design Approach for Physical
Modeling Revision of tr96-026 send to ACM Transactions on Modeling and Computer
Simulation, April 1997.

J. Fleischmann and PA. Wilsey, Comparative Analysis of Periodic State Saving Techniques
in Time Warp Simulators, Proceeding of 9th Workshop on Parallel and Distributed
Simulation, pp. 50-58, 1995.

RM. Fujimoto, Parallel discrete-event simulation, Communications of the ACM, 33(10),
pp. 33-53, 1990.

R. Fujimoto, J. Tsai and G. Gopalakrishnan, The roll back chip: Hardware support for
distributed simulation using Time Warp, Distributed Simulation Conference, pp. 81-86,
1988.

. C. Holt, M. Heinrich, J. Pal Singh, E. Rothberg, and J. Hennessy, The Effects of Latency,

Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors, Stanford
University Technical Report CSL-TR-95-660, 1995.

DR. Jefferson, Virtual Time, ACM Transaction on Programming Languages and Systems,
7(3), pp. 404-425, 1995.

. V. Jha and R L. Bargrodia, A Uniﬁed Framework for Conservative and Optimistic

Distributed Simulation, Proceedings of the 8* Workshop on Parallel and Distributed
Simulations, pp. 12-19,1994.

P. Konas, Parallel Architectural Simulation on Shared-Memory Multiprocessors, PhD
Thesis, UIUC, 1994.

P. Konas and P.-C. Yew. Parallel Discrete Event Simulation on Shared-Memory
Multiprocessors. In Proceedings of the 24th Annual Simulation Symposium, pp. 134—148,
April 1991.

V. Krishnaswamy and P. Banerjee, Actor Based Parallel VHDL Simulation Using Time
Warp, Proceedings of the 1996 Workshop on Parallel and Distributed Simulation, pp. 135-
142, 1996.

J. Kuskin, et. AL, The Stanford FLASH Multiprocessor, 21St International Symposium on
Computer Architecture, pp. 302-313, 1994.

K. Lee and PA. Fishwick. Dynamic Model Abstraction, In 1996 Winter Simulation
Conference, December, San Diego, CA, pp. 764-771, 1996.

132

52.

53.

54.

55.

56.

57.

58.

59.

60.

61.

62.

63.

64.

65.

66.

CC. Lim et. al. Performance Prediction Tools for Parallel Discrete-Event Simulation, 13th
Workshop on Parallel and Distributed Simulation, pp. 148-156, 1999.

Y.B. Lin, Parallelism Analyzers for Parallel Discrete Event Simulation, ACM Transaction
of Modeling and Computer Simulation, 2(3), pp. 239-264, 1992.

Y.-B.Lin, B. R. Preiss, W.M. Loucks, E.D. Lazowska, Selecting the Checkpoint Interval in
Time Warp Simulation, Proceedings of 7th Workshop on Parallel and Distributed
Simulation, pp. 3-10, 1993.

Y. Lin and ED. Lazowaska and ML Bailey, Comparing Synchronization Protocols for
Parallel Logic-Level Simulation, Proc. of the 1990 International Conference on Parallel
Processing, pp. 223-227, 1990.

B.D. Lubachevsky, A. Weiss and A. Shwartz, An Analysis of Rollback-Based Simulation,
ACM Transaction of Modeling and Computer Simulation, 1(2), pp. 154-193, 1991 .

Message Passing Interface Forum, MP1: A Message Passing Interface Standard, June 1995

RA. Meyer, J..M Martin and KL. Bagrodia, Slow Memory: The Rising Cost of Optimism,
Proceedings of 14th Workshop on Parallel and Distributed Simulation, pp. 44-52, 2000.

J. Misra, Distributed discrete—event simulation, Computing Surveys, 18(1), pp. 39-65,
1995.

D. M. Nicol, Performance Bounds on Parallel Self-Initiating Discrete Event Simulations,
ACM Transaction of Modeling and Computer Simulation, 1(1), pp. 24—50, 1991.

J. K Peacock, J. W. Wong and E. C. Manning, Distributed Simulation using a Network of
Processors, Computer Networks, pp. 44—56, 1979.

D. Perllerin and D. Taylor, VHDL Made Easy, Printice Hall, Upper Saddle River, 1997.
BR. Preiss, The YADDES Distributed Discrete Event Simulation Specﬁcation Language
and Execution Environments, 1989 SCS Multiconference on Distributed Simulation, pp.

139-144, 1989.

BR. Preiss and W.M.Loucks, Memory management techniques for Time Warp,
Proceedings of 9‘h Workshop on Parallel and Distributed Simulation, pp. 95-99, 1995.

F. Quaglia, Event history based Sparse State Saving in Time Warp, Proceedings of 12th
Workshop on Parallel and Distributed Simulation, pp. 72-79, 1998.

R. Radhakrishnan, T. J. McBrayer, K. Subramani, M. Chetlur, V. Balakrishnan, and P. A.
Wilsey, A Comparative Analysis of Various Time Warp Algorithms Implemented in the

133

67.

68.

69.

70.

71.

72.

73.

74.

75.

76.

77.

78.

79.

80.

WARPED Simulation Kernel, Proceedings of the 29th Annual Simulation Symposium, pp.
107-116, March 1996.

H. Rajaei, SIMA: An environment for parallel discrete-event simulation, The 25th Annual
Simulation Symposium, pp. 147-149, April 1992.

K Rohanimanesh, Generating Map File by SAVANT, Technical Document of Parallel
VHDL Simulation Project, Department of Computer Science, Michigan State University,
July, 1998.

C. Schaubschlaeger, Measurements of SKaMPI, Version 2.2 of SGI Origin 2000 at
GUP/ZID, University of Linz, Austria, 1999.

L Soule, Parallel Logic Simulation: An Evaluation of Centralized-Time and Distributed-
Time Algorithms. PhD thesis, Stanford University, June 1992.

L Soule and T. Blank. Parallel Logic Simulation on General Purpose Machines. In
Proceedings of the 25th ACM/IEEE Design Automation Conference, pages 166—171,
1988.

L. Soule and A. Gupta, 1991. An Evaluation of the Chandy-Misra-Bryant Algorithm for
Digital Logic Simulation, ACM Transaction of Modeling and Computer Simulation, 1(4),
pp. 408-347, 1991.

CB. Stunkel, et al., The SP2 High-Performance Switch, 1995.
CB. Stunkel, et al., The SP2 Communication Subsystem, 1994.

J. Steinman, SPEEDES: A Multiple-Synchronization Environment for Parallel Discrete-
Event Simulation, International Journal in Computer Simulation, 2(3), pp. 251-286, 1992.

J. Steinman, Scalable Parallel and Distributed Military Simulations using the Speedes
Framework, NASA, Jet Propulsion Laboratory, Technical Report, 1996.

T. Tabe, Janis P. Hardwick, Quentin F. Stout, Statistical Analysis of Communication Time
on the IBM SP2, Computing Science and Statistics 27, 1995.

B.W. Unger, J.G. Cleary, A. Convington and D. West, An external state management
system for optimistic parallel simulation, Proceedings of the 1993 Winter Simulation
Conference, pp. 750-755, 1993.

LG. Valiant, A Bridging Model For Parallel Computation, Communications of ACM,
33(8), pp.103-111, 1990.

G. Varghese, R.D. Chamberlain, W.E. Weihl, The pessimism behind optimistic simulation,
Proceedings of 8th Workshop on Parallel and Distributed Simulation, pp. 126-131, 1994.

134

81.

82.

K. Venkatesh, T. Radhakrishnan, and H. F. Li, Discrete Event Simulation in a Distributed
System. In IEEE COMPSAC. IEEE Computer Society Press, 1986.

K. Westgate and D. McInnis, Reducing Logic Veriﬁcation Time with Cycle Simulation.
SpeedSim, http://www.speedsim.com/tech/cycsim.htm, 1996.

83. Jinsheng Xu, An Object Oriented Model for Developing Event-Driven Systems,

Department of Computer Science, Michigan State University, 1998.

84. Jinsheng Xu, Parallel VHDL Simulation Result Report, Department of Computer Science,

Michigan State University, 1997.

85. Jinsheng Xu and Moon Jung Chung, Predicting the performance of synchronous discrete

event systems, Proceedings of International conference on Computer Aided Design, pp.
18-23, 2001.

86. Jinsheng Xu and Moon Jung Chung, Efﬁciently unifying simulation techniques, to be

87.

88.

submitted to Workshop on Parallel and Distributed Simulation.

M. Yang, IIRthdl Layer Source Code, Department of Computer Science, Michigan State
University, 1998.

B. Zeigler, Object-Oriented Simulation with Hierarchical Modular Model, Academic Press,
1990.

135

 

illWilli?WilliIIHIIWIETI
3 1293 02327