va-_

v'v—_

 

.u-uq‘ u‘l 1

.ra.

 

.r'ch-L-r _ u ‘

 

W' 'V

 

 

 

 

\
4 an

 

 

«Cam

v.

 

 

 

)‘_ w.’

Vue’m
Id“ ,.§:u:- we?!“
3,"? . ....4...°*'
5;?“ .L‘:.‘.‘§;.

'¢.‘ ..
Hﬁmzsimfw
.;.:.....J"z§¥ 3“" “'
{git-.1»
am

.egf‘r‘. .-.. ‘47? .:~
:55» '3‘- » ' to. 134‘ “a. “I" 4‘, a?
v n- 3.. J"; y ”5%; ‘ ,_
f.‘ a” niaéé'WE“ _. ﬁrs-mkm '
fﬁgﬁgaz :5"; ... (can “REWW” m.:fm=
K I ‘2». #- Ar-K‘af-f“: “1%; ”3% "Xg;
-- "3‘" -
we ‘
Mum

L'

.MAEJ-
.3 .4\ . ‘
wt‘Il' 1::‘1.Vr5'r *w
d 2. '4»- a
uquxxtmguunlwr
N I . I
g

_ ”12L? ““34; '
. .u .éi;3‘n 4?;W r
J"

‘ $3933. “"
"$23? If"!

I. V
a.“

 

 

:. ‘ 42¢ 0'55; . m r .
.‘uu.’ s ,_x n q“ ‘. ,'.r ‘Ag'u' .. a
r "uh-1-:- 1-. r .-. > ~' r, . ‘V ~ ‘ :u. “ﬂan"r -. .-
a: A ‘ .' 7 v ‘ , I‘ an: . t-n ‘ \ »3 J 1 , -. ~n..‘ a}.
. z; 3'.

 

r
"aft“! .. .fn-u ow-

pr
A." u‘r-‘Lrn 1 ~vy~a
"an“

- ..
n“! a,
w fiat. :

rrr'n 21

“y
,.-. ,
N . -..._ .4
rr; .i{v§;111 {L'T
(w
r ”"-

 

Chi“,

 

.\.'m
(I? g. n.4,.
_. w!.....,——.--

a. — -y m —o ,
, 4.," ..~

.r ~-
.34..

u”:

- ‘1".3'. I“;.~.I.,

-
“V"
, .:.:r...:,-:T.~:s
.,... WW-
.:..,....._ _

 

THESIS

was srrv LleaAales

\llzll ll l? l l

 

 

 

 

 

 

 

 

 

ll‘gll lllllllnll

This is to certify that the

dissertation entitled

LOGIC SIMULATION
ON MASSIVELY PARALLEL SIMD MACHINES

presented by

Yunmo Chung

has been accepted towards fulﬁllment
of the requirements for

Ph.D. degreein Computer Science

49—week

Major professorU

 

 

Date 2/ / 01/9 2"

MS U is an Afﬁrmative Action/Equal Opportunity Institution 0- 12771

 

LIBRARY
Michigan State
University

 

 

 

PLACE IN RETURN BOX to remove this checkout from your record.
TO AVOID FINES return on or before date due.

“ DATE DUE DATE DUE DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

”—l l

_l
MSU Is An Affirmative Action/Equal Opportunity institution
61W

 

 

W313“

 

 

O.\'

LOGIC SIMULATION
ON MASSIVELY PARALLEL SIMD MACHINES

By

Yunmo Chung

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science
1991

Sim:
VLSI d:
driven s
In this I
Efﬁcient;
Parallel !
the WOW
1 (MasPa
qU€UC sjz.
Pﬁforman

gate. The
increase a
logic simu]
Consen'ativ
techniqu‘eSJ
the new tec]
tolten drll'en

Simulatiol1 v

theSlS preSen

6 80-5 7911

ABSTRACT

LOGIC SIMULATION
ON MASSIVELY PARALLEL SIMD MACHINES

By

Yunmo Chung

Simulation is the primary tool for validation and analysis of digital circuits in
VLSI design. As the complexity of VLSI circuits has increased, distributed event-
driven simulation has attracted considerable interest for providing fast simulation.
In this thesis, based on the characteristics of gate-level logic simulation, we study
efﬁcient paradigms and data structures for fast parallel logic simulation on massively
parallel SIMD (Single Instruction Multiple Data) machines. The performance of
the preposed schemes is measured on the CM-2 (Connection Machine) and the MP-
1 (MasPar) in terms of number of simulation cycles, parallelism, maximum event
queue sizes, and execution times. We present a probabilistic model to estimate the
performance of parallel logic simulation when a processor contains more than one
gate. The performance estimation shows that parallelism and processor utilization
increase as the number of gates per processor increases. We propose new parallel
logic simulation techniques by giving a clock advancement window to each gate in
conservativelsimulation and optimistic simulation. Experimental results show that the
techniques give much better performance than traditional simulation techniques since
the new techniques enhance simulation clock advancements. We present a distributed
token driven logic simulation technique as a parallelized version of compiled-code logic
simulation which is used to verify the functional correctness very efﬁciently. This
thesis presents broad studies of parallel logic simulation in massively parallel SIMD
environments. The research results are useful in the implementation of parallel logic

simulation, as a part of VLSI design, in SIMD processing environments.

To my parents and my wife

iii

lv'
his con
suppor

Iva
DR Rh
Endant

lly
duﬁng

especial
my rese
danghte
days. I (
lhdrpe,

I W01
LabOrat c

ACKNOWLEDGMENTS

I wish to express my appreciation to my thesis advisor Dr. Moon Jung Chung for
his consistent guidance and encouragement right from the beginning, and his financial
support. I am grateful for the many discussions and invaluable comments he provided.

I would like to thank my guidance committee members, Dr. Anthony S. Wojcik,
Dr. Richard Enbody, and Dr. Joseph C. Gardiner, for their help, encouragement and
guidance.

My sincerest thanks goes to the Korean government for their ﬁnancial support
during the beginning of my Ph.D. program. I thank friends in our department,
especially Mr. J. Engelsma, Mr. R. Baldwin, and Mr. P. Wolberg, for helping me do
my research and write my dissertation. I thank my wife Yunwha and my two pretty
daughters, Youjin and Yuri, for their support, patience, and love during many long
days. I express my special thanks to my father-in-law and mother-in-law in Korea for
their persistent spiritual encouragement and help.

I would like to acknowledge Thinking Machines Corporation, Argonne National
Laboratory, and the National Center for Supercomputing Applications, University of

Illinois at Urbana-Champaign, for allowing the use of their resources.

iv

List of

List of

1 Intr

1.1

1.3

1.4

Table of Contents

List of Figures ix
List of Tables xi
1 Introduction 1
1.1 Computer Simulation in VLSI Design .................. 2
1.2 Gate-Level Logic Simulation ....................... 3
1.2.1 Characteristics of Logic Simulation ............... 4

1.3 Parallel Simulation Paradigms ...................... 6
1.3.1 Event-Driven Simulation ..................... 6

1.3.2 Demand-Driven Simulation ................... 8

1.3.3 Token-Driven Simulation ..................... 8

1.4 Massively Parallel SIMD Machines ................... 8
1.4.1 Considerations on Massively Parallel SIMD Machines ..... 9

1.5 Thesis Motivation ............................. 10
1.6 Thesis Overview .............................. l2

2 Parallel Logic Simulation in SIMD Environments 14
2.1 Prior Work ................................ 15
2.2 Basic Data Structures .......................... 16
2.2.1 Memory Layout for Parallel Simulation ............. 18

Perf

3.1

3.3 (

3.4 (

Perfo
4.1 I,
4.2 I
4-3 P.

4.4 CO

2.2.2 Event Queue Scheme .......................

2.2.3 Table Lookup ...........................
2.3 Parallel Logic Simulation Protocols ...................
2.3.1 Distributed Synchronous Logic Simulation ...........
2.3.2 Conservative Logic Simulation ..................
2.3.3 Optimistic Logic Simulation ...................
2.3.4 Moving Simulation Bound(MSB) ................
2.4 Conclusions ................................

Performance of Parallel Logic Simulation

3.1 Experimentation ........ p .....................
3.1.1 Circuit Transformation ......................
3.1.2 Additional Considerations ....................
3.2 Performance Evaluation .........................
3.2.1 Performance Metrics .......................
3.2.2 Experimental Results .......................
3.3 Comparisons of Massively Parallel Machines ..............
3.4 Conclusions ................................

Performance Prediction Based on Gate-to-Processor Ratio

4.1 Introduction ................................
4.2 The Model .................................
4.3 Performance Estimation .........................
4.3.1 Parallelism .............................
4.3.2 Number of Simulation Cycles ..................
4.3.3 Execution Times .........................
4.4 Comparisons with Experimental Results ................

vi

18
19
2O
20
22
24

33

37
46
50

51
52
52
54
54
56

62

5.3

5.4

Dist
6.1

6.3 1
6.4 1

4.5 Conclusions ................................ 66

Enhancement of Simulation Clock Advancements 68
5.1 Multiple Events in a Message ...................... 69
5.2 Clock Advancement Windows ...................... 71
5.2.1 Advanced Conservative Logic Simulation ............ 71
5.2.2 Advanced Optimistic Logic Simulation ............. 74
5.3 Performance Evaluation ......................... 76
5.3.1 Advanced Conservative Logic Simulation with SCAW ..... 78
5.3.2 Advanced Optimistic Logic Simulation with SAAW ...... 81
5.3.3 Performance on Circuits with Feedback ............. 84
5.3.4 Communication Costs According to Message Length ...... 88
5.4 Conclusions ................................ 89
Distributed Token-Driven Logic Simulation 91
6.1 Introduction ................................ 92
6.2 Token-Driven Simulation ......................... 93
6.2.1 Additional Considerations .................... 95
6.2.2 Simulation on Circuits with Feedback Loops .......... 97
6.3 Performance Evaluation on the CM-2 .................. 100
6.4 Experimentation on a Shared Memory Multiprocessor ......... 104
6.4.1 Performance Evaluation on the BBN TC-2000 ......... 105
6.5 Conclusions ................................ 109
Conclusions and Future Research 110
7.1 Summary and Major Contributions ................... 111
7.2 Future Research .............................. 113

vii

 

Bibliog

1 1 7
Bibliography

viii

Lo
QL

Ca

Pe
Fe
.\'1
De

In.
Pr
Pr
Cc
Co
Co

2.1
2.2
2.3
2.4
2.5

3.1
3.2
3.3
3.4

4.1
4.2
4.3
4.4
4.5
4.6
4.7

5.1

5.2

List of Figures

Local simulation clock advancements .................. l7
Queue assignment to a gate ....................... 19
Diagram for possible state changes of a process ............ 21
Cancellation techniques .......................... 26
A CBSQ structure ............................ 28
Performance based on MSB size in conservative simulation ...... 38
Performance based on MSB size in optimistic simulation ....... 39
Number of simulation cycles for 89234 ................. 42
Degree of parallelism ........................... 43
Parallelism depending on GP ratio ................... 56
Increase ratio of simulation cycles ................... 59
Predicted execution times for C1355 and C1908 ............ 60
Predicted execution times for S9234 and 335932 ............ 61
Comparisons for C7552 in synchronous simulation ........... 63
Comparisons for C1355 in conservative simulation ........... 64
Comparison for C1355 in optimistic simulation ............ 65
A multi-event ............................... 70
An example of an advancement window ................. 73

ix

53 Re
5.4 A1
5.5 C<
5.6 Cc

5.8 A
5.9 A
5.10 A
5.11 A

5.12 C

6.1 A
6.2 s
6.3 p
6.4 S
6.5 s

6.6 E

5.3
5.4
5.5
5.6
5.7
5.8
5.9
5.10
5.11
5.12

6.1
6.2
6.3
6.4
6.5
6.6

Relationship between MCAWs and a SCAW .............. 73

An example of SAAW with size 2 .................... 76
Conservative simulation with different SCAW sizes (1) ........ 79
Conservative simulation with different SCAW sizes (2) ........ 81
Conservative logic simulation with SCAW = 4 ............. 82
Advanced optimistic simulation with different SAAW (1) ....... 83
Advanced optimistic simulation with different SAAW (2) ....... 85
Advanced conservative simulation for sequential circuits . . . . . . . . 86
Advanced optimistic simulation for sequential circuits ......... 87
Communication costs according to message lengths .......... 89
An example for simultaneous evaluation ................ 98
Simulation on circuits with feedback loops ............... 99
Performance according to token size ................... 103
Speedup obtained for various numbers of input vectors ........ 106

Speedup comparison for consecutive and random circuit partitioning . 107

Bit packing comparisons ......................... 107

Cha

Intr

Simulati.
circuit si
difﬁcult ‘
number
circuit 5
needed 1
Thus. g;
time am
Trad
ﬁcation
tiou usi:
In an a
traded
the 50b
AS (
Special.
de‘flo;
simulat
As ‘
prOCeSs
has b e,
InstruC
SIMD

Simula'

Chapter 1

Introduction

Simulation is the primary tool for validation in VLSI design. Though low-level analog
circuit simulation is used, such circuit simulation has several drawbacks [36]. It is
difﬁcult to simulate the entire system for most digital integrated circuits with a large
number of logic gates because extensive computer time is required. Also, analog
circuit simulation generates traces of voltage levels, which are more detailed than
needed to validate digital designs. For digital designs, binary signals are sufﬁcient.
Thus, gate-level logic simulation has been developed, as an effort to reduce simulation
time and provide efficient validation of digital circuits.

Traditional event-driven simulation techniques have been widely used for the veri-
ﬁcation of timing and functional correctness of circuit designs. However, logic simula-
tion using these techniques becomes slow as the complexity of VLSI circuits increases.
In an attempt to cope with the problem, parallel logic simulation has recently at—
tracted a considerable amount of interest. Simulation time can be reduced based on
the substantial amount of parallelism achievable.

As one approach for fast logic simulation, hardware logic simulation, which uses
special-purpose hardware that executes many simulation steps in parallel, has been
developed to speed up the simulation process [2, 45]. But, since hardware logic
simulation is very expensive, it is not applicable as a common tool for simulation.

As the computer technology has been developed for parallel processing on multi-
processor environments, research on parallel simulation techniques for fast simulation
has been active. So far, most of that research has been related to MIMD (Multiple
Instructions Multiple Data) processing environments. Recently, massively parallel
SIMD machines have become available. In this thesis, characteristics of both logic

simulation and massively parallel SIMD machines are investigated to develop efﬁcient

 

paradigms
lation is 51
small-grai
Par) and

machines

1.1 t

In VLSI .
particular
correctne:
siderable

can be (I

0 Cir

ofa

0 Sw

for

‘ SW
01’

me

0 Ga
rat
O]

26:

o Fu
Of

paradigms and data structures for fast parallel logic simulation. Gate-level logic simu-
lation is suitable for massively parallel SIMD machines since logic simulation requires
small-grain operations and may need many processors. Currently, the MP-l (Mas-
Par) and the CM-2 (Connection Machine) are available as massively parallel SIMD

machines with enough local memory for logic simulation.

1.1 Computer Simulation in VLSI Design

In VLSI design, simulation results are produced for a given circuit design with a
particular set of input vectors. Then, these results are used to verify the functional
correctness and timing of the circuit design. To achieve reliable circuit design, con-
siderable time is often invested in computer simulation. Types of digital simulation

can be classified as follows [14, 36, 72]:

0 Circuit simulation : This simulation is used for accurate simulation in the design

of analog circuits, such as filters, comparators, and operational ampliﬁers.

o Switch-level simulation : Transistors with constant delay are used as primitives

for simulation.

0 Switch—level with timing : To increase the chances of catching race conditions
or to compute propagation delays in switch-level simulation, some timing infor-

mation is used.

0 Gate-level logic simulation : Gate—level simulation is faster and nearly as accu-
rate as switch-level with timing. A small set of functions, such as NOT, AND,
OR, NAND, and NOR gates are supported. Gate-level simulation supports

zero-delay, unit-delay, or ﬁxed-delay.

0 Functional and register-transfer simulation : A designer describes the behavior

of components of a circuit design at a high level. Simulation is performed based

 

on t
tion

sucl

0 Mix
gate
of a

spec

0 Mix

anal

1.2 1

This thes
There are
informati
The f
10ng sim
on delay
descripti,
langUage
Each Prc
indepenc
incrEaseg
of execu
manage,
function
The

1051C SlII

on these high-level components. The high-level functions indicate transforma-
tions on data as the data moves from one storage device or register to another

such device.

0 Mixed-level simulation : Different levels of simulation—switch, switch-timing,
gate, and function—are allowed to be run simultaneously. This allows portions
of a design to be tested in detail, while other portions are tested with greater

speed but less detail.

0 Mixed-mode simulation : Simulation is simultaneously performed using both

analog and digital modes.

1.2 Gate-Level Logic Simulation

This thesis investigates how to use parallel simulation paradigms for logic simulation.
There are two kinds of logic simulation techniques classified according to how timing
information is handled.

The first approach, compiled-code logic simulation, is used for fast and simple
logic simulation where the delay of logic gates is not critical, i.e. simulation based
on delay is not required. Zero—delay or unit-delay logic simulation is performed. The
description of a circuit to be simulated is translated into a conventional computer
language, such as assembly or Pascal code. The code is compiled and then run [59].
Each process is evaluated once for every input vector during each simulation cycle
independent of the input patterns. Compiled-code simulation provides signiﬁcant
increases in performance over event-driven simulation by means-of the high speed
of execution of compiled-code. Since this technique does not require event queue
management, it is extremely efficient. Compiled-code simulation can be used for
functional simulation of combinational and synchronous circuits only.

The other simulation technique, event-driven simulation, is used for multi-delay

logic simulation. So far, event-driven simulation has been used to implement several

 

ti’pes of sin
tion is efhci
the selectiv

timing anal

1.2.1 C

logic simul
queuing me

be develOpc
Gate I

We can
value of ([-
l'HDL [1‘1 -
port delay
the delay 1
acteristic <
switching.
either the
PUlse dura

Only trans
Non‘l

During
us to 3531

Sch r.
times [I
med

In 10%.,
determlm

types of simulators. Processes communicate by sending events. Event-driven simula-
tion is efficient since it works by evaluating only the active elements, which is called
the selective trace approach. An additional advantage is that we can easily deal with

timing analysis and asynchronous designs.

1.2.1 Characteristics of Logic Simulation

Logic simulation has special characteristics which are not usually displayed in general
queuing network simulation. In this thesis, event scheduling and data structures will

be developed which utilize these properties.
Gate Delay

We can assume that the delay of each gate is independent of the input and output
value of the gate. In other words, each gate has a ﬁxed delay during simulation.
VHDL [l] recognizes two common types of gate delay: transport and inertial. Trans-
port delay is a characteristic of hardware devices whereby a pulse is transmitted after
the delay no matter how long the duration of the pulse is. Inertial delay is a char-
acteristic of switching circuits whereby a pulse whose duration is shorter than the
switching time of the circuit will not be transmitted at all. Most logic simulators use
either the unit delay or the transport delay model [24, 62, 71]. In our simulation,
pulse durations are assumed to be no shorter than the switching time. Therefore,

only transport delay is considered.
Non-preemption

During the execution of a process on a processor, there is no interruption of
the process in execution before its termination. The transport delay model allows
us to assume that there is no preemption of output events, like other simulation

schemes [25, 32, 46].
Fixed Propagation Routing

In logic simulation, each active gate propagates the same events to all of its pre-

determined successors. This property simpliﬁes the data structures in distributed

 

event-drf

tion for 1
Look

Lookz
happen i
is allowe<
tion appl
mance. C
to achievn

capabilit)
Fine

In logi
ality and
etc, is do
nlque app
managem.

Operations
Large

A circL
tion, a gat
of a large
become rm

5' .- .
Mel) Para

event-driven logic simulation protocols since we do not need to consider any condi-

tion for propagation directions to successors.

Lookahead Capability

Lookahead refers to the ability to predict what will happen and what will not
happen in the simulated future [32, 33]. Due to the fixed delay of each gate, a gate
is allowed to predict the output event time when it receives an event. If a simula-
tion application has good lookahead capabilities, we can often obtain better perfor-
mance. Conservative simulation relies on lookahead more than optimistic simulation
to achieve good performance [34]. A gate in logic simulation has a good lookahead

capability since the gate has a predefined delay.
Fine Grain Operations

In logic simulation, each gate performs event evaluation according to its function-
ality and event propagation. Event evaluation for logic gates, such as AND, OR,
etc., is done by a simple table lookup operation. Depending on the simulation tech-
nique applied, some additional operations, such as queue manipulation, and rollback
management, are required. Event evaluation operations take less time than other

operations in a simulation cycle.

Large Number of Elements

A circuit to be simulated consists of gates and links. In event-driven logic simula-
tion, a gate is considered a process. When a coarse grain machine is used, simulation
of a large circuit leads to partitioning and load balancing problems. The problems
become more serious as the complexity of VLSI circuits increases. If we use a mas-
sively parallel SIMD machine with a large number of processors, the problems may

not be serious.

 

 

1.3

In thi

simul;

1.3.]

There
asyn cl
clock.

queue:
smalle
accord

into I\

Syncl

hlsyn
eVents
tﬁnest
The ex
8X8cu(
62}. H
limitir.

replau

1.3 Parallel Simulation Paradigms

In this section, distributed simulation paradigms which can be used for parallel logic

simulation are discussed.

1.3.1 Event-Driven Simulation

There are two approaches for distributed event-driven simulation: synchronous and
asynchronous. In synchronous simulation, all processes are synchronized by a global
clock. In asynchronous parallel simulation, each process maintains its own message
queues and local simulation clock. The local simulation clock of a process is the
smallest timestamp of events to be processed. A process executes arriving events
according to a certain scheduling policy. Asynchronous simulation can be divided

into two common simulation paradigms: conservative and optimistic.

Synchronous Simulation

In synchronous simulation, there is a central event queue, which contains all of the
events during simulation. A Global Virtual Time (GVT) is computed as the smallest
timestamp of all unprocessed events. All processes are synchronized by the GVT.
The events with timestamps equal to the GVT are chosen from the event queue, and
executed in parallel. Similar approaches have been explored for logic simulation [55,
62]. However, the centralized nature of the queue causes a communication bottleneck
limiting the performance. For efﬁcient parallel logic simulation, the central queue is

replaced by distributed event queues.

Conservative Simulation

The Chandy-Misra approach [20, 21] is the most well-known conservative simulation
paradigm. In the Chandy-Misra approach, each process executes events according
to the input waiting rule so that earlier events are guaranteed not to occur. The
method has an inherent deadlock management problem. Two strategies [20] have been

proposed to handle the deadlock problem: deadlock avoidance protocol and deadlock

 

def

RIG

ant

hat

In

eac
def
a I
rol
eve
sta
dis

for

M
or

en

M

M.

an

111
to

to

detection and recovery paradigm. In the deadlock avoidance protocol, null messages
are used to propagate clock information to avoid deadlock. In the deadlock detection
and recovery paradigm, simulation is allowed to deadlock, the deadlock is detected,
and then the deadlock is broken. Several other conservative simulation approaches

have been developed in [6, 22, 30, 54].

Optimistic Simulation

In Time Warp [42, 43], the most popular optimistic paradigm, a process may execute
each event as soon as the event arrives. A local virtual time (LVT) at a process is
deﬁned as the smallest timestamp of unprocessed events at the process. Whenever
a process receives an event with a timestamp less than the LVT of the process, it
rolls back immediately to the state just before the timestamp of the newly received
event. GVT acts as the ﬂoor on all future rollbacks. To handle the rollback, previous
states must be saved. Two ways of controlling the cancellation of side effects given are
discussed in [43]. They are lazy cancellation, which cancels side effects in the next
forward simulation, and aggressive cancellation, which cancels side effects as soon
as they are found. Fossil collection is concerned with recovering storage associated
with past simulation times by discarding processed events with timestamps less than
or equal to GVT. A variation of Time Warp, which runs in a synchronous parallel

environment, has been proposed in [64].

Moving Time Window

Moving Time Window (MTW) has been proposed in [61] to reduce rollback frequency
and the space overhead of the Time Warp mechanism. The MTW technique uses a
global window to limit the maximum difference between the local virtual simulation
time of communicating processes. In other words, the global time window is applied
to allow only those simulation events whose execution times fall within the window

to be considered for execution.

 

1.3.2

In demz
demand
in distri
the over

particul

1.3.3

Token-d
no timi
of even1
only ca
may be
Simulat
SChedul
least on
Signal \

Of mair

1.3.2 Demand-Driven Simulation

In demand-driven simulation [60], gates are only evaluated when their outputs are
demanded. This eliminates many of the unnecessary events which are generated
in distributed event-driven simulation. The main disadvantage of this approach is
the overhead incurred by a recursive backtracking routine. The recursion becomes

particularly inefﬁcient when the circuit is deep.

1.3.3 Token-Driven Simulation

Token-driven simulation [28] can be used to verify functional behavior, but provides
no timing information. In token—driven simulation, tokens are propagated instead
of events. Unlike events, which are time-stamped with a simulation time, a token
only carries the output signal from a process without a timestamp. An identiﬁer
may be assigned to each token, if necessary, for synchronizing messages for correct
simulation. Instead of evaluating processes based on the occurrence of a previous
scheduled timestamped event, process evaluation is triggered by the presence of at
least one token in each of the gate’s input queues. A token may contain one or more
signal values. By eliminating timestamps, we also eliminate the non-trivial problem

of maintaining global time over multiple processes.

1.4 Massively Parallel SIMD Machines

Parallel simulations on MIMD machines have some advantages over SIMD machines
in several aspects: general computational capabilities; larger-grain parallelism and
asynchronous computation on each processor. However, one signiﬁcant drawback
of MIMD machines is that they usually have much fewer processors than massively
parallel SIMD machines. SIMD architectures are the only alternative, with present
technology, to achieve massive parallelism. They offer signiﬁcant advantages over
MIMD architectures for parallel logic simulation. Logic simulation requires small-

grain operations, such as NOT, AND, and OR. A large circuit requires a large number

 

of proces
architeci
of logic :
balancin
fewer ga
instructi

and datz

1.4.1

There a:
Processc
Chi-2 am
has enor
“"0 pro
using th
15K prc

Local 11

The Ci:
of the IC

of processors, which is well suited to a massively parallel SIMD environment. SIMD
architectures also offer simpler synchronization, which is crucial to the performance
of logic simulation. A SIMD environment has no serious circuit partitioning or load
balancing problems since, in general, each processor of a SIMD machine will simulate
fewer gates than for a MIMD machine. In a SIMD architecture, however, different
instructions cannot be issued at the same time, which limits simulation techniques

and data structures.

1.4.1 Considerations on Massively Parallel SIMD Machines

There are several massively parallel SIMD machines, such as the DAP (Data Array
Processor), the MPP (Massively Parallel Processor), the CM-2, and the MP-l [5]. The
CM-2 and the MP-l are used in this thesis as target machines since each processor
has enough local memory for event queues and direct communication between any
two processors is possible. The majority of our experimental results were obtained

using the CM-2. The CM-2 [66] and the MP-l [11] consist of 4K to 64K and 8K to

16K processors, respectively.

Local Memory

The CM-2 and the MP-l have 64K and 128K bits of local memory, respectively. Most
of the local memory capacity on each physical processor of a massively parallel SIMD
machine is available for user data. In SIMD, no program resides in the nodes while
each node in MIMD needs memory space to store some system software, such as an
operating system. But the size of each local memory in SIMD is relatively small for
simulation of a large number of input vectors. Thus, efﬁcient data structures are

required to maximize available queue size.

Related Instructions

Some critical instructions related to logic simulation are explained below.

 

Ch,-1

10

0 direct send

This instruction sends a message from every selected processor to a speciﬁed
destination processor. Each selected processor may specify any processor as
the destination, including itself. A destination processor may receive messages
even if it is not selected. As one of the most common and time-consuming
instructions, this instruction is used to propagate a generated event from a
node to its successor nodes. Many processors sending to the same place results

in conﬂicts.

o global-min ~

This instruction, called reduction, is for computation or communication between
a front-end processor and local processors. One value is examined in every
selected processor, and the smallest of all these values is returned to the front
end. This instruction is used to compute the global clock by obtaining the

minimum local virtual time of all active gates.

O array access

The CM-2 supports two array access instructions, aref32 and aset32, to read
and write an element of an array in each processor respectively. Since each
processor of the CM-2 is bit-serial, these take a relatively long time compared
with the read / write operations of non-array variables. They are frequently used

for the manipulation of event queues.

The performance of the instructions on the CM-2 and the MP-l will be given in
Chapter 3.

1.5 Thesis Motivation

Today, simulation constitutes at least half of the design cycle. As the VLSI designs
become larger, the simulation cost can only grow. Of course, the amount of validation

required increases due to more interconnections between elements.

 

The p
structures
and massi
massively
Therefore
is studied

Currer
tations. C
machines
To cope v
gates to tl
performar

We in!
ﬁcient eve
Simulatior
a Window
window a:
are sent a;
she 0n pe
Cle (0T lte
propagatit

Thong
SimUIalIOI
of digital
behal'ior i
code Simu

tokens to ‘

11

The primary purpose of this thesis is to propose efficient paradigms and .data
structures for parallel logic simulation based on the characteristics of logic simulation
and massively parallel SIMD machines. For example, an event queue manipulation on
massively parallel SIMD machines is critical for the performance of logic simulation.
Therefore, an efﬁcient event queue structure for each distributed simulation protocol
is studied.

Currently, each processor can simulate only one gate due to local memory limi-
tations. One of the critiques of parallel logic simulation on massively parallel SIMD
machines is that parallelism is low compared with the number of processors used.
To cope with this problem, we investigate the effect of the ratio of the number of
gates to the number of processors. We develop a probabilistic model to estimate the
performance of parallel logic simulation when we increase this ratio.

We investigate the enhancement of local simulation clock advancements for ef-
ﬁcient event evaluation and propagation in conservative simulation and optimistic
simulation. We present advancement window concepts, in which each gate computes
a window containing events to be involved in event evaluation. All events within the
window are evaluated during a simulation cycle, and the resulting sequence of events
are sent as a single message for fast simulation. We investigate the effects of window
size on performance. More than one event evaluation is allowed for a simulation cy-
cle (or iteration) to enhance local clock advancements. In addition, more than one
propagation event can be sent in a message to save communication cost.

Though compiled-code simulation has been proposed for fast zero (or unit)-delay
simulation, such a simulation works only on sequential machines. As the complexity
of digital circuits has increased, faster logic simulation for the veriﬁcation of circuit
behavior is required. We investigate the possibility of a parallel version of compiled-
code simulation. As one approach, we propose a parallel simulation technique using

tokens to synchronize all the signals from the same input vectors.

 

1.6

The rerr
ﬁcient si
on distri
machine
ulation.
simulati<
In Cl
Chapter
lation cy
of the te
CAI-2 an
In C1
Can beco
mOdel is
on the g
OI Simul;
With exp
Simulatit
In CI
Perform;
Window ,-
POSsible.
is given
in terIns
1“ additi
how a Set
In Ch
drive,1 10%

12

1.6 Thesis Overview

The remainder of this thesis is organized as follows. In Chapter 2, we present ef-
ficient simulation paradigms and data structures for parallel logic simulation based
on distributed discrete event-driven simulation protocols on massively parallel SIMD
machines. The event-driven simulation protocols investigated are synchronous sim-
ulation, conservative simulation, and optimistic simulation. Some variations of the
simulation protocols are proposed to improve performance.

In Chapter 3, the performance of parallel logic simulation paradigms proposed in
Chapter 2 will be measured on the CM-2 and the MP-l, in terms of number of simu-
lation cycles, parallelism, maximum queue sizes, and execution times. Characteristics
of the test circuits will be discussed. In addition, a performance comparison of the
CM-2 and MP-l is presented.

In Chapter 4, we prove that the parallelism on massively parallel SIMD machines
can become high if we increase the gate—to-processor ratio. In addition, a probabilistic
model is proposed to estimate the performance of parallel logic simulation based
on the gate-to-processor ratio. The model is used to predict parallelism, number
of simulation cycles, and execution times. The predicted performance is compared
with experimental results for three distributed simulation protocols, i.e, synchronous
simulation, conservative simulation, and optimistic simulation.

In Chapter 5, we present techniques to enhance clock advancement for better
performance in conservative simulation and optimistic simulation. An advancement
window at each gate is computed to allow more than one local clock advancement if
possible. A method to compute and use windows in the above simulation protocols
is given. The performance, as a function of window size, is measured on the CM-2,
in terms of number of simulation cycles, parallelism, queue size, and execution times.
In addition, communication costs according to message lengths are measured to see
how a sequence of events sent as a single message can save communication costs.

.In Chapter 6, we present a new parallel logic simulation protocol called token-

driven logic simulation, which is efﬁcient for zero-delay or unit-delay logic simulation.

 

Many p9
simulatic
simulatic
driven SI]
We Show
mental rt
SIMD m2

problems

13

Many performance considerations are discussed. The performance of token-driven
simulation is measured on the CM-2. We compare the performance of token—driven
simulation and compiled-code simulation. In addition, the performance of token-
driven simulation on the BBN TC-2000, a shared-memory MIMD machine, is given.
We show that, with a reasonable partitioning scheme, the simulation gives experi-
mental results comparable to those obtained by a similar algorithm running on an
SIMD machine. Finally, we conclude our thesis and suggest some interesting research

problems which remain.

Chapter 2

Parallel Logic Simulation in SIMD

Environments

This chapter presents efﬁcient simulation paradigms and data structures for paral- .
lel logic simulation based on the characteristics of gate level logic simulation. The
event-driven simulation protocols investigated are synchronous simulation, conserva-
tive simulation, and optimistic simulation. Some variations are considered to imple-
ment the simulation techniques efﬁciently on massively parallel SIMD machines.

Efﬁcient event queue manipulations are critical on massively parallel SIMD ma-
chines since array access Operations are, in general, slow. Circular FIFO (First-In-
First-Out) lists are used as event queues in both synchronous and conservative sim-
ulation. As an efﬁcient event queue structure for optimistic simulation, we present a
circular binary search queue structure which allows binary search on a circular list.
Advantages and disadvantages of the proposed techniques will be discussed.

In optimistic simulation, both lazy and aggressive cancellation techniques work
poorly in SIMD processing environments. Thus, a new cancellation scheme, called
immediate cancellation, is also presented.

The remainder of this chapter is organized as follows. Section 2.1 discusses the
literature survey of distributed event-driven simulation, with special emphasis on par-
allel logic simulation. Basic data structures for designing efﬁcient parallel simulation
paradigms on massively parallel SIMD machines will be discussed in Section 2.2.

Section 2.3 presents efﬁcient simulation paradigms for logic simulation.

14

ch
sir

te<

0V1

tio
me
an.

011‘

15

2.1 Prior Work

A simple approach to parallel implementation of logic simulation is to use vector ma-
chines. However, it has been shown in [18, 58] that vectorization techniques for logic
simulation can achieve very limited parallelism. Distributed event-driven simulation
techniques offer the most promise.

In the well-known Chandy-Misra conservative simulation algorithm, deadlock
management is a major problem in MIMD environments, and has been investigated
in [30, 32, 58, 63]. Soule and Gupta [63] classiﬁed the types of deadlocks in digital
circuit simulation to reduce deadlock occurrence, and reported simulation results on
a Multimax. They observed that conservative simulation with null messages is inefﬁ-
cient on the MIMD machine. In [74], a new conservative simulation algorithm, called
YADDES, is proposed, which uses a dataflow network to avoid deadlocks.

In Time Warp, several techniques have been proposed in [47, 50] to reduce the
storage overhead and rollback problems. Moving Time Window (MTW) was proposed
in [61] as a semi-optimistic approach to reduce the rollback frequency and the space
overhead of Time Warp.

Comparisons of the performance of synchronous simulation, conservative simula-
tion, and optimistic simulation techniques have been reported based on actual imple-
mentation on MIMD machines: Transputer-based multiprocessor with 8 nodes [57]
and BBN Butterfly with 64 processors [34]. Fujimoto [34] reported that Time Warp
outperforms the Chandy-Misra algorithm in his implementation. Lin et al., proposed
a performance comparison model of parallel logic simulation in a MIMD environ-
ment [48]. They showed that Time Warp always outperforms the Chandy-Misra
algorithms under the assumptions of zero overhead of rollback and state saving. It
has been known that simulations using Time Warp on MIMD machines give promis-
ing performance in battleﬁeld simulation [37], digital hardware simulation [8], and
producer/ consumer simulation workloads [4].

When optimistic simulation is implemented on MIMD machines, fast computation

of Global Virtual Time (GVT) is very important, but it is not easy. Preiss suggests

16

in [56] a token-ring calculation of GVT. Briner presents in [14] a GVT approximation
algorithm using a tree of processors. Baldwin et al. present a GVT computation
algorithm with overlapping window concept in [7].

A few results are available for parallel logic simulation in SIMD environments.
Results have been published on parallel simulation of queuing models [49], circuit
simulation [71], and switch-level simulation on the Connection Machine [15]. We
have investigated logic simulation schemes on a massively parallel SIMD machine.
Since queue structures for optimistic simulation on massively parallel SIMD machines
are so important, we have developed several schemes as event queue structures. For
example, a data parallel queue scheme was presented in [24]. In this scheme, an event
is assigned to an event processor and event evaluation is done based on a data parallel
approach. Next, a single queue scheme was presented which assigns an event queue
to a gate in [27, 29]. In the implementation, a variation of Time Warp was presented
which uses a single queue at each process and computes a lower bound of rollback
based on immediate cancellation to reduce space overhead [29]. The analysis of local
clock advancement in a SIMD environment, and the effect of moving time window
size on execution time were also reported in [27]. In other words, each simulation
technique can be characterized by how much each process is allowed to advance
its simulation clock beyond the current global virtual time. This is represented in
Figure 2.1 [27]. The figure shows that synchronous simulation does not allow local
clock advancements beyond GVT at all, while Time Warp allows local simulation
clocks to advance optimistically. The Chandy-Misra algorithm allows local simulation
clocks to conservatively advance based on input waiting rules, while simulation clocks

advance within a given window in Moving Time Window.

2.2 Basic Data Structures

As basic data structures for logic simulation on massively parallel SIMD machines,

we consider memory layout, event queue schemes, and table lookup.

17

 

 

input waiting rule

 

”I'I'I'lml" | | ‘

 

 

 

p1p2p3p4..pn plpzpsruam
a) Synchronous Simulation b) Chandy-Misra algorithm
A

ml? l I mll l 1.1-W”

 

 

 

in p2 p3 p4 ..pn p1 p2 93 p4 "in
c) Time Warp (1) Moving Time Window
when pi indicates precast

Figure 2.1: Local simulation clock advancements

 

 

at
161
\‘a

of

18

2.2.1 Memory Layout for Parallel Simulation

Without loss of generality, we can assume that one gate is assigned to one processor
since massively parallel SIMD machines have a large number of physical processors
and allow users to define as many virtual processors as possible within available mem-

ory capacity. A process (gate) contains the following information:

MEMORY LAYOUT

0 Local Virtual Time (LVT): the smallest simulation time of unprocessed events in

the event queues of the process.

Gate information: the information about the process, such as pointers to its suc-

cessors .

Function table: the table for computing output signals from input signal values.

Input signals: the current input signals at the LVT.

Input buffer: the buffer for storing just arrived events before putting them into its

event queues.

Event queues: the queues for storing all events to be processed.
o Other local variables.

In the above memory layout, other local variables include general temporary vari-
ables, such as variables for an active bit, an output signal, etc. as well as simulation
technique-dependent variables, such as link clocks and minimum link clocks in conser-
vative simulation. In addition, a global virtual time (GVT), the smallest timestamp

of unprocessed events in all processes, is computed in the front-end processor.

2.2.2 Event Queue Scheme

In our simulation, a distributed event queue scheme is used in which each input port of

a gate has its own event queue. In SIMD, the maximum queue size over all processors

19

 

 

 

 

 

 

 

 

 

 

 

inbits(b2b1bo) 0001001 010 [011 1m 101 110 | 111 ||
ﬂ-input AND 0 o o 0 o o o 1
[Bﬂpr 1—input OR 0 1 1 1 1 1 1 1
r2-input AND 0 0 0 1 x x x x
I] l-input AND 0 1 x x x x x x
[I Invert 1 0 x x x x x x

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 2.1: Function tables for lookup operations

should be estimated and allocated in advance since dynamic allocation is not allowed,
in general. In this thesis, each gate has at most 3 input ports and 2 output ports as

shown in Figure 2.2.

 

 

j; 33335 Input Port Queues

 

 

----- One Gare

 

 

 

V

Figure 2.2: Queue assignment to a gate

 

 

 

2.2.3 Table Lookup

To evaluate different functions of all active gates at the same time, the table lookup
method is used as in [24]. Each gate contains a table for the corresponding function.
For example, if there are at most 3 inputs to each gate, 8 bits are enough to store

each operation as shown in Table 2.1. In the table, “x” represents “don’t care” signal.

20

2.3 Parallel Logic Simulation Protocols

In this section, several variations of distributed event-driven simulation protocol are
considered. A logic circuit can be considered to be a directed graph in which nodes
represent gates called processes, and arcs (links) indicate connections between the
gates. Each gate propagates its output signals by sending event messages. Each

event is time-stamped with the simulation time at which it should be executed.

Definition 1 Simulation Cycle : All processors on massively parallel SIMD machines
are synchronized. Therefore, each process repeats the same procedure, which consists
of LVT computation, choosing active gates, event evaluation, new event propagation,
and queue manipulation. Optimistic simulation may require an additional step for
rollback management. The time period to perform this procedure is called a simula-

tion cycle.

The diagram for possible state changes of a process is given in Figure 2.3. The
idle state in the ﬁgure represents that a process is not involved in any of the above

activities. Each simulation cycle is synchronized in SIMD environments.

2.3.1 Distributed Synchronous Logic Simulation

Distributed synchronous simulation is a synchronous simulation with distributed
event queues. In distributed synchronous logic simulation, each input port of a pro—
cess receives events in non-decreasing timestamp order. A circular FIFO queue, rather
than a priority queue, can be used as an event queue of each input port, facilitating

queue manipulation. During every simulation cycle, the following steps are performed.

Algorithm SYNCHRONOUS

1. Each process computes its LVT.
2. GVT is computed by taking the minimum LVT of all processes.

3. Each process whose LVT is equal to the GVT performs event evaluation and event

propagation.

21

 

 

 

\
\

I[Queue manipulation I
' I

, New event propagation]

 

 

 

 

l LVT computation"

khoosing active gatesl

 

I
I
I

I
1

Event evaluation

 

b-------d---
‘

 

 

 

——> typical path for a simulation cycle
- - - - , possible path

Figure 2.3: Diagram for possible state changes of a process

 

 

22

4. Each receiving process performs event insertion.

2.3.2 Conservative Logic Simulation

In conservative simulation, a process executes events only if it is certain that no event
with an earlier timestamp can arrive. Link clocks and minimum link clock are used
as deﬁned in conservative simulation [20, 25]. A link clock of each input port of a
process is deﬁned to be the timestamp of the most recently arrived event along that
input link. The minimum link clock of a process is the minimum of link clocks of all
its input ports.

Since the delay of each process is fixed and there is no preemption in logic simula-
tion, events arrive at an input link in nondecreasing order of timestamps. A circular
FIFO list is assigned to each input port of a process as an event queue. The follow-
ing modiﬁed input/output waiting rules of the Chandy-Misra algorithm are proposed
which exploit gate-level logic simulation properties: a process executes the events‘with
timestamps equal to its LVT when its minimum link clock is greater than or equal to the
LVT; the output events generated can be immediately sent to their destinations.

Null messages in Chandy-Misra algorithms [19, 20] are used to avoid deadlock. In
contrast to MIMD environments, where null messages may signiﬁcantly decrease the
performance of Chandy-Misra algorithms [63], the null messages in SIMD environ-
ments can be efﬁciently used [25] because all null messages are sent at the same time
real events are pr0pagated.

In addition, GVT can also be used to further reduce simulation cycles in conser-
vative logic simulation. All events whose time stamps are equal to GVT are executed
even though the Chandy-Misra input waiting rule is not satisfied. On a SIMD ma-
chine, GVT can be computed easily.

The proposed algorithm exploits features of the SIMD architecture: easy compu-
tation of GVT and negligible overhead of null messages. For each simulation cycle,

the following steps are performed for each process:

23

Algorithm CONSERVATIVE
1. Each process computes its LVT.
2. GVT is computed.

3. Each process executes the events with timestamps equal to its LVT when

3) its minimum link clock is not less than LVT or

b) the GVT is equal to LVT;

4. Each active gate performs event propagation. Each gate which has an updated
minimum link clock and does not propagate any event sends a null message with

the updated minimum link clock.

5. Each receiving process sets the corresponding link clock to the timestamp of the
just arrived message along the link. If the message is non-null, event insertion is

performed.

There are three things to be considered to obtain good performance when we
implement conservative logic simulation using null messages and GVT. First, all pro-
cesses whose minimum link clocks are updated by null messages or events need to
propagate new null messages or events to update link clocks of its successors in time.
Second, at every simulation cycle, all link clocks should be updated to GVT if they
are smaller than GVT. Finally, null messages with inﬁnite timestamps need to be
sent at the end of each input vector to inform successors that no further events will
be generated.

The proposed simulation technique can also be used for logic simulation on MIMD
processing environments. Since GVT computation on MIMD machines is expensive,

Steps 2) and 3b) in the above algorithm might be discarded.

24

2.3.3 Optimistic Logic Simulation

Optimistic logic simulation uses LVT differently from conservative logic simulation.
The difference is that all events whose timestamps equal LVT are executed in opti-
mistic simulation, while the same events may not be executed in conservative sim-
ulation if the execution condition is not satisﬁed. Optimistic logic simulation does
not need any link clocks. A simulation cycle of optimistic simulation performs the

following steps:

Algorithm OPTIMISTIC
1. Each process computes its new LVT.

2. If the new LVT is not greater than its previous LVT, then rollback procedure, i.e.

state restoration and cancellation, is performed.
3. Each process with unprocessed events performs event evaluation and propagation.
4. Each receiving process performs an event insertion operation.
5. Fossil collection is performed if necessary.

The following difﬁculties must be considered in implementing the optimistic sim-
ulation on a massively parallel SIMD machine. First, optimistic simulation has a
storage overhead problem because information about all events whose timestamps
are greater than GVT must be kept. Moreover, each process has three queues: input,
output, and state queues. Second, processors of most massively parallel SIMD ma-
chines have relatively small local memory capacity. Finally, array access time on the
CM-2, our target machine, is slow [25] because its processors are bit-serial. Therefore,

efﬁcient storage management is very important.

Immediate Cancellation

Aggressive and lazy cancellation techniques have been proposed to undo incorrect

event propagation [41, 43]. However, these are difﬁcult to implement on massively

25

parallel SIMD machines for the following two reasons. First, there is not enough
memory space for both state and output event queues. Second, additional simu-
lation cycles are required to send antimessages, i.e. one simulation cycle for each
antimessage.

To cope with the problems, we have proposed the “immediate cancellation” tech-
nique which eliminates use of antimessages and does not require both state and output
queues [29]. In this cancellation technique, a gate in rollback immediately propagates
a replacement event to its successors. Immediate cancellation is a variation of aggres-
sive cancellation which sends antimessages for all incorrect event propagation. Since
logic simulation has ﬁxed event propagation routing and no preemption, one replace-
ment event is enough to nullify all incorrect event propagation. With the immediate
cancellation scheme, we can achieve significant reduction in space and make both
queue manipulation and rollback management fast and easy.

We explain the immediate cancellation technique with both aggressive and lazy
cancellation techniques together where the latter two concepts were described in
Chapter 1. In Figure 2.4, three cancellation techniques, immediate, aggressive, and
lazy cancellation, are explained from the situation given in Figure 2.4.(a), where pro-
cess C has processes D and E as its successors. Suppose that rollback occurs at process
C because there is a new event with timestamp 45 when the current LVT is 60. In
immediate cancellation, after the ﬁrst event, which caused rollback, is processed, a
rolled back process propagates a replacement event to its successors. In other words,
as soon as the event evaluation ﬁnishes at the simulation time 45, a replacement
event with a new output signal is propagated to both processes D and E. Here, any
antimessage is not needed for the outputs already sent to both D and E at simulation
times 50 and 60, since the event sent at the simulation time 45 causes a rollback at
both processes D and E. The simulation continues forward again. We do not have to
consider the rollback caused by the new event any more.

In aggressive cancellation, antimessages at simulation times 50 and 60 must be
sent to both D and E as shown in Figure 2.4.(c), which may cause rollback at both

D and E. But the antimessages are useless if a new event at simulation time 45 is

26

 

 

 

 

 

 

Simulation time Simulation time
60
50 _ ,To D and E 5024 Elgagncgignt event
40.JoDandE 23 PH"
Process C PrEeEs—C

a) Before the rollback 5) Immediate Cancellation

Simulation time Simulation time

60 ._------>Anti-1 60.,1----->Anti-2

 

50 3. . ---->Anti-2 503. .---->Anti-l
45 45"
40 40

Process C m C
c) Aggressive Cancellation d) Lazy Cancellation

note) "Anti-k" indicates a possible k-th antimessage
for cancellation sent to D and B

 

Figure 2.4: Cancellation techniques

 

 

27

propagated to both D and E and both the processes have to roll back again. In this
case, extra time is wasted sending the useless antimessages. In addition, processes D
and E also need extra time to take care of the received antimessages.

In lazy cancellation, as shown in Figure 2.4.(d), even though an event is sent at
simulation time 45, antimessages at simulation times 50 and 60 may be sent. But
these two antimessage can be avoided, since once a process propagates an event to its
successors, the process does not have to propagate any more events (or messages) for
lazy cancellation. For example, as shown in Figure 2.4.(b), if an event is propagated
at simulation time 45, then we do not need to send any more messages at simulation

times 50 and 60 for cancellation.

Circular Binary Search Queue

The FIFO queue structure which has been used for both synchronous simulation
and conservative simulation is not suitable for optimistic logic simulation because
timestamps of events along each input link are not monotonic. Several existing queue
structures for optimistic simulation have been suggested: a splay tree based on a self
adjusting binary tree [65]; a timing wheel using a priority queue structure [67]. In
case of rollback, however, all the events in the queue may have to be searched to
determine which to remove [14].

As an attempt to achieve fast queue manipulation, we use an event queue scheme,
called CBSQ (Circular Binary Search Queue). A CBSQ is a data structure which
allows a binary search on a circular list. As shown in Figure 2.5, a CBSQ has two
pointers, Front and Rear, to indicate events in non-decreasing order of timestamp
contained in the queue. Front points to the element with the smallest timestamped
event, while Rear points to the next available element in the queue. An algorithm
for ﬁnding a key in a CBSQ Q with n elements numbered 0 - - - n - 1 is described as

follows:

Algorithm CBSQ_search(key)

if( Front < Rear) then perform binary search from Front to ( Rear — 1)

 

else if ( F1
ii I
else

else

else the q

 

A CBS
two POInte

which h as

CBSQ Strt
dOes 1101. n

The fol
0 ﬁnd E
0 insert

' (Islete

28

else if ( Front > Rear) then

if ( key < Q(n — 1)) then perform binary search from Front to (n — 2)
else if ( key > Q(n — 1)) then perform binary search from 0 to ( Rear — 1)

else found

else the queue is empty

 

 

 

 

 

 

 

   

rear
b) case front > rear

Figure 2.5: A CBSQ structure

 

 

A CBSQ is assigned to each port of a process as an event queue. In addition to the

two pointers, another pointer called SimPtr is used to point to the unprocessed event

which has the smallest timestamp and is ready to be evaluated for simulation. The

CBSQ structure can be used for event queues since immediate cancellation, which

does not need use of antimessages and state queues, is used.

The following basic operations can be done on CBSQs with O(log n).
a ﬁnd an event with timestamp t,
0 insert an event into a queue.

o delete all events with timestamps greater than t,

29

o delete all events with timestamps less than t, and
0 ﬁnd the next event after the event pointed to by SimPtr.

In the above operations, all operations except the ﬁrst can be done with 0(1)
operations after the ﬁrst operation is done for the same timestamp, since the ﬁrst
operation gives a result index from binary search. The second and third operations
are needed for event insertion and elimination of all events with timestamps greater
than the timestamp of a new event on the same queue. The fourth operation is used
for fossil collection. Finally, the ﬁfth operation is used to increase the pointers,Front,
Rear, and SimPtr.

The following operations using the above basic operations are required for each

simulation cycle.

0 New event insertion: When a new event is inserted into a CBSQ, all the events
from Rear to the timestamp of the new event are automatically eliminated by
just moving Rear to the element after the new event. That is, we do not need
any additional operations for deletion following rollback. If the timestamp of
the new event is less than the timestamp of the event pointed to by the SimPtr
( rollback ), SimPtr is moved to the new event. In case of rollback, signals from
events with timestamps just less than the timestamp of the new events are used
to restore the state of the gate. Hence, we do not need additional state queues

for state saving.

o Rollback adjustment: When the rollback occurs, CBSQs of all other inputs in
the rolled back process are also adjusted. That is, SimPtrs are moved down to

the event with timestamp equal to or just greater than the timestamp of the

new LVT.

0 Fossil collection: Binary search is used to ﬁnd an event whose timestamp is just
less than GVT, and adjust Front to point to that event. But fossil collection
can be limited by some condition. For example, at every 10 simulation cycles, a

check is made as to whether there is any queue which exceeds a predetermined

30

limit of queue size. If there is at least one queue which satisﬁes the above

condition, fossil collection is performed.

On the average, it takes 0(log n) time to do a CBSQ search operation on each
queue, where n is the number of events on the queue. At least two binary search
operations are required on each queue for a simulation cycle. For a circuit with
maximum fanin p, we need (2 + q)p binary search operations on the average at each
simulation cycle where q is a real number (0 S q S 1) representing an average
frequency of fossil collection. q varies depending on available queue size. For example,
if we assume that unlimited sized queues are used, q is zero. If the queue size is
2" where n is a positive integer, boundary check operations are not required. For

example, for a 512 element CBSQ, 9-bit unsigned integers can be used as pointers.

2.3.4 Moving Simulation Bound(MSB)

On a massively parallel SIMD machine, a single queue overﬂow will halt simulation.
The Moving Time Window technique has been used to control queue space overhead
in Time Warp. The same technique can be applied to conservative simulation to
reduce the possibility of queue space overﬂow.

In this thesis, for consistency, Moving Simulation Bound (MSB) refers to the
bound in which active processes determine events to be executed in conservative sim-
ulation, as well as the moving time window in Time Warp. The MSB technique is
a controlling mechanism for traditional event-driven simulation techniques to ﬁnish
simulation successfully, not a simulation technique itself. This technique is applied
when event evaluation is performed in each simulation cycle of both conservative sim-

ulation and optimistic simulation by giving a bound B as follows.

Algorithm MSB(B)

1. GVT is computed

2. Each process with new LVT is chosen as a candidate (to be active) based on event

scheduling of a given simulation technique.

31

3. Only each candidate process with LVT < (GVT + B) is considered an active process

for simulation at every simulation cycle.

2.4 Conclusions

This chapter studied logic simulation as one application of distributed event-driven
simulation on massively parallel SIMD processing machines. Data structures to imple-
ment several simulation techniques efﬁciently in SIMD environments were discussed.
Some variations of distributed event-driven simulation have been proposed to improve
performance of logic simulation. The performance of the considered logic simulation

techniques will be given in the next chapter.

Chapter 3

Performance of Parallel Logic

Simulation

In this chapter, we present experimentation and performance of the logic simulation
protocols considered in Chapter 2. The logic simulation protocols were implemented
on the CM-2 with 32K processors and the MP-l with 16K processors. Some ISCAS’85
and ISCAS’89 benchmark circuits are used as test circuits. The performance of the
protocols are measured for 1,000 randomly generated input vectors

The remainder of this chapter is organized as follows. Section 3.1 describes the
transformation of test circuits. The performance of the parallel logic simulation tech-
niques proposed in the previous chapter is measured in Section 3.2 in terms of number
of simulation cycles, parallelism, maximum queue sizes, and executions times. Sec-
tion 3.3 presents the performance comparisons of logic simulation on both the CM-2

and the MP-l.

3.1 Experimentation

To get reasonable test results for performance analysis, we use three kinds of bench-
mark circuits: ISCAS’85 benchmark circuits [13], ISCAS’89 benchmark circuits [l2],-
and a 32-bit array multiplier [40]. They have different characteristics as test circuits.
ISCAS’85 circuits are combinational circuits, while ISCAS’89 circuits are sequential
circuits which have ﬂip-ﬂops and feedback. In addition, a 32-bit array multiplier with
8,256 gates is used as a test circuit to compare the execution times between a VHDL

simulator [1] and a distributed event-driven simulation.

32

33

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Kinds Circuit Original Transformed
Names No. of Gates No. ojLGates
ISCAS’85 (31355 546 ‘ 1001
C1908 880 1340
C6288 2416 3880
C7552 3513 5848
ISCAS’89 S9234 5896 7830
813207 8651 11860
$15850 10470 13782
835932 18148 25359
Other [ 32bit Mul [[ 8256 [ 8256]]

 

 

 

Table 3.1: Test circuits used for simulation

3.1.1 Circuit Transformation

In SIMD environments, the time period taken by the slowest (or busiest) pro-
cess determines the simulation cycle time since all the processors are synchronized.
Maximum fanin and fanout in a circuit signiﬁcantly affects the performance of logic
simulation since the number of required send operations per cycle depends on the
maximum fanin and fanout. In addition, if a gate has many input ports, a dis-
tributed event queue scheme cannot be used on the CM-2 due to the small size of
local memory. For good performance, we need to transform a given circuit to be
simulated into an equivalent circuit with limited fanin and fanout. The names of
circuits used as testbeds and their ﬁnal number of gates after the transformation are
given in Table 3.1. The names of ISCAS’85 circuits start with “C”, while those of
ISCAS’89 start with “S”. The change of circuit characteristics after transformation
is shown in Table 3.2. Tree structures using dummy gates with zero-delay are used to
transform gates which exceed the fanin or fanout limit. It must be guaranteed that
the transformed circuit generates the same results as the original circuit.

In our simulation, the maximum fanin and fanout are 3 and 2, respectively. In
this case, at most 6 (= 2x3) send operations are required at every simulation cycle

because all combinations of input ports and output ports have to be considered during

event propagation. Array multipliers do not need the transformation procedure since

 

 

 

 

 

 

 

 

 

I Before Transformation_— After Transformation [[
Circuit Critical Path Max Critical Path Max
Names Length (fanout,fanin) Length (fanout,fanin)
C1355 24 12,5 36 2,3
C1908 40 16,8 60 2,3
C6288 124 16,2 174 2,2
(37:552 43 15,5 55 2,3 g“

 

 

 

 

 

 

 

 

 

 

 

r —

Table 3.2: Circuit characteristics

the maximum fanin and fanout of each gate is 2. Reducing the maximum fanin and
fanout also reduces both queue size and the number of queues at a gate as well as
communication costs. But, as side effects, we can see that the transformed circuits
have about 27 - 50 percent longer critical path lengths [10], and about 32 - 83 percent
more gates than the original circuits as shown in Table 3.2. Therefore, more simulation

cycles and processors are required.

3.1.2 Additional Considerations

In the simulation of sequential circuits, we need to consider global clock generation for
synchronization of elements. Simulation results may be different depending on how
global clocks are given to the circuit being simulated. In our simulation, without loss
of generality, it is assumed that the global clock period is the same as the simulation
time interval between successive input vectors. That is, whenever a new input vector
enters the circuit being simulated, the next global clock is activated.

In SIMD environments, it is difﬁcult for storage elements like D ﬂip-ﬂops to use
clocks from a clock generator, since one send operation is required to propagate to
each element. Therefore, there are two ways of handling clock generation. In the ﬁrst
approach, each storage element contains clock values in advance in its event queues.
In this case, each clock signal is treated as an event. In the other approach, each
storage element computes clock signals, which it is supposed to receive from the clock
generator, based on its local virtual time. In this thesis, the second approach is used.

Let us discuss queue space based on available memory size of the CM-2. To use

as

aref32 and aset32, the fastest array access operations of the CM-2, we need to deﬁne
32 bits as the event size. For an event size of 32 bits and 1.6K bits of space for
local variables, we can store up to about 1950 events, i.e. 62.4K / 32 = 1950, when
a processor contains and simulates one gate only. Since each gate has three event
queues, one for each input port, 1950 is divided by three. In both synchronous and
conservative simulation techniques, the maximum size of a FIFO queue is 650 events.
However, in optimistic simulation, the queue size is restricted to a power of 2 to avoid
checking a boundary while manipulating a CBSQ. The maximum size is 512 (:29).
In this scheme, if each input gate has initial input events at port 0 only, the
maximum number of input vectors to be used for simulation is the same as the event
queue size. We used an event feeding technique, which feeds events to port 0 from

event queues at other ports only when GVT exceeds a certain simulation time.

3.2 Performance Evaluation

Experimental results on the performance of the considered simulation protocols have
been obtained for the benchmark circuits on the CM-2 with 32K processors. Although
simulation with multi-delay can be done, a unit-delay was assigned to each gate for
the consistency of experimental results. In the measurements, Moving Simulation
Bound (MSB) is applied to both conservative simulation and optimistic simulation

techniques to prevent queue overﬂow.

3.2.1 Performance Metrics

As performance metrics, we use the number of simulation cycles, parallelism, maxi-
mum queue size, and execution time.

Number of Simulation cycles

As deﬁned in Section 2.3, a simulation cycle is synchronized for processors in SIMD.
The execution time is proportional to the number of simulation cycles. That is, a large

number of simulation cycles means slow speed. Based on the relationship between

36

the number of simulation cycles and execution time, we can measure the average time

taken per simulation cycle.

Parallelism

The degree of parallelism (sometimes called activity level [63]) is the ratio of active
processors to assigned processors at a given simulation cycle. An active gate at a
simulation cycle is deﬁned as a gate which has at least one event to be executed
during the simulation cycle. As the number of active gates increases, the concurrency
becomes higher. An active processor is deﬁned as a processor which contains at least
one active gate.

Let H be the number of assigned processors. To measure concurrency, we deﬁne

degree of parallelism and parallelism as follows.

Deﬁnition 2 Degree of parallelism (D,- ) at simulation cycle i is the ratio of the total
number of active processors to the total number of assigned processors. That is,

D,- = C.-/ H, where C,- is the number of active processors at simulation cycle i.

When a processor simulates one gate only, the number of active gates is equal to

the number of active processors.

Deﬁnition 3 Parallelism is the average ratio of the total number of active processors

to the total number of assigned processors.

Parallelism can be computed as follows:

i=5
Parallelism = %Z D,-

i=1

where S is the number of simulation cycles.

Maximum Queue Size and Fossil Collection Requency

The maximum queue size is related to the memory requirement and speed. If there is

a queue overﬂow, simulation cannot continue. In SIMD, the maximum queue size over

37

all processors should be estimated and allocated in advance. In optimistic simulation
using CBSQ data structures, fossil collection is performed whenever there is at least
one queue which exceeds a certain limit. The frequency of fossil collection signiﬁcantly

affects performance.

Execution Times

Some results in this thesis are not ideal (or not general) on massively parallel SIMD
machines because some constraints of evaluation advancement were applied to limit
queue size. In other words, if we had used a massively parallel SIMD machine with

enough local memory size, different performance would have been obtained.

3.2.2 Experimental Results

For performance evaluation 1,000 input vectors were used. We gave 200 and 512
as timestamp intervals between successive input vectors for ISCAS’85 and ISCAS’89
benchmark circuits, respectively. The VP ratio (deﬁned later) was 1 in the measure-

ments.

Effect of Moving Simulation Bound

Let us ﬁrst see the effect of Moving Simulation Bound (MSB) on simulation tech-
niques. C1908 and C7552 are used as benchmark circuits to demonstrate the effect.
Synchronous simulation does not require MSB since large queue size is not necessary
due to event processing based on each input vector.

The effect of MSB on conservative simulation is shown in Figure 3.1. The same
simulation effects were obtained for both C1908 and C7552. Let us explain the effect
on C7552 in detail. When the size of MSB is greater than 10,000, queue overﬂow
occurs. As MSB increases, the required queue size increases while both the execution
time and the number of simulation cycles decrease. Here we can ﬁnd an interesting
fact that, for MSB > 2,500, the number of simulation cycles and execution time do

not change while the queue size increases. The reason can be explained as follows.

 

 

. T. -hLAJ—A
r—‘tQIwa
p—a

ﬁ—Jh—Jﬁ—a
pap—A

Figur

As MSB inc:
fast Chongh
a Ceftain M!
we can ﬁnd
Parallel 5le
is difﬁcult Si,
The Perfo
ilar simulati

effect 011 C13-

38

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

I I I I I I I

25000 =2 Number of nguéaztégg Cycles —
20000 Q .....+ C1908 ‘
15000 ”i?- o <> <> <> ‘
10000 —+- ......... a
5000 - l ' 1 + 1 1 f l l +_

2000 4000 6000 8000 10000 12000 14000
450 I I I I 6 I I '.+d
ggg : .....O C7552 0 .................................... :
ggg [_ .....+ C1908 ............................... _.
200 — "'0' ....................... +-- A
150 T. 0.... .......................... _
158 : ...-1:53;? """"""""""" Maximum Queue Sizes ( 1n events ) 2

,,,,,,,,, 1

2000 4000 6000 8000 10000 12000 14000
180 I I I I I I I -
160 a, Execution Times in seconds ) a
140 4 .. 7552 _
120 9 .....+ 01908 ...
100 ’3.) ......... Q 0 0 0 ‘l
80 ”is. a
60 ~ ---- m. . n
40 ' 1 I 1 I l 1 i 1 1 +

2000 4000 6000 8000 10000 12000 14000

Size of MSB

Figure 3.1: Performance based on MSB size in conservative simulation

As MSB increases, queues receive more events. But event evaluation cannot be done
fast enough to consume all the events due to the input waiting rule. That is, beyond
a certain MSB size, it is not necessary to increase MSB for efﬁcient simulation. If
we can ﬁnd this MSB size in advance, large circuits can be simulated on massively
parallel SIMD machines without any queue space problem. However, ﬁnding this size
is difﬁcult since it depends on the circuit being simulated.

The performance of optimistic simulation with MSB is given in Figure 3.2. Sim-
ilar simulation effects were obtained for both C1908 and C7552. Let us explain the
effect on C1908 in detail. Queue overﬂow occurs when MSB is greater than 18,000.

 

 

 

160C
140C
120I
100C
SOC
60C

D—‘IQCADJhO‘Oﬁ-IC/J

",1 Q gm

'0 (\3 CAD
A t"' h kcjo

 

Flgur

16000
14000
12000
10000
8000
6000

0.4
0.35
0.3
0.25
0.2
0.15

0.1 V

80

60
50
40

20 ”

10

450
400

350
300
250
200

Figure 3.2: Performance based on MSB size in optimistic simulation

39

 

 

I I I I I l T I I

 

 

 

 

 

 

 

Em ....... 4) ........... 4) ............... o _
Ir Number of Simulation Cycles —
-‘=.. 0.. C7552 —
J} ..+.. C1908 _
+..
_ l+ ...... .1 + J l + l I IF l j4
2000 4000 6000 8000 1000012000140001600018000
I W I I I I T ' I ..L
+- ............... + ................ T T '—
_ ..+..-Fl" _,
j. Parallelism _
0.. C7552
‘ +.. C1908 ‘
T09 <> <> 0 ‘
' 1 1 n J 1 J 1 1 1

 

 

2000 4000 6000 8000 1000012000140001600018000

 

 

 

 

 

 

 

 

 

I I l O I I I I I I
C Fossil Collection Frequency a
” ' ...<>.. C7552 ‘
“ ........ 9 .....+ 01908 ..... .r
”wuw-O ............... ._
T .1. ................................... +"'
' TIT 1 I 1 I i l l l 1‘
2000 4000 6000 8000 1000012000140001600018000
I I I I I r I I I

.1

-

Execution Times ( in seconds )

—

 

 

 

; 0.. 07552 _
+.. C 1908
.42 a
.— +""-..L l l . _L—l
1 l 1 T 1 1 T 1 l T 1 I
2000 4000 6000 8000 10000 12000 14000 16000 18000
Size of MSB

40

As the size of MSB increases, the number of simulation cycles decreases while both
parallelism and fossil collection frequency increase. Optimum execution time is ob-
tained at MSB=5,000 where the number of simulation cycles (=5855) and the fossil
collection frequency (=11) are appropriately compromised. The parallelism at this

point is high as well.

Number of Simulation Cycles

Table 3.3 shows the number of simulation cycles for 1,000 randomly generated input
vectors. The MSB sizes to get the corresponding numbers of simulation cycles are
also given in the table. According to the experimental results, synchronous simulation
requires the largest number of simulation cycles, while optimistic simulation needs the
least. Synchronous simulation activates only the slowest processes whose LVTs are
equal to GVT at each simulation cycle. Therefore, a lot of simulation cycles are
needed because a relatively small fraction of gates are involved in event evaluation.
On the other hand, optimistic simulation advances the local clock of each process as
far as possible. If there is an event whose timestamp is equal to or less than LVT of a
gate, the gate is rollbacked immediately without loss of any simulation cycles. When
rollback does not occur, the process gets as much gain as possible. In conservative
simulation, event evaluation at a gate is done based on information derived from its
ancestors.

In combinational circuits, synchronous simulation has many more simulation cy-
cles than the other two techniques, while the numbers of simulation cycles for conser-
vative simulation and optimistic simulation are close. On the other hand, optimistic
simulation for sequential circuits has fewer simulation cycles than the other two tech-
niques. The reason is that event propagation is not frequently performed in sequential
circuits since ﬂip-ﬂops control event ﬂow. Flip-ﬂops execute received events according
to the event scheduling policy of a technique, but they send messages only at clock
times. In this case, link clocks cannot be properly updated in conservative simulation.
For example, consider a gate which has a D ﬂip-ﬂop as its parent. In conservative

simulation, the input link clock of the gate from the ﬂip-ﬂop is not updated in time

41

 

t

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Circuits Sﬁlchronous Conservative Optimistic
Cycles Cycles [ MSB Cycles ] MSB
C1355 581 17 6686 00 6678 20000
C1908 68345 5675 28000 5717 20000
C6288 177366 67658 2500 67527 1500
C7552 83784 13304 10000 13844 7500
S9234 31886 21072 00 16310 00
513207 41886 30297 00 33568 40000
315850 58874 22794 00 1320 00
335932 26062 16044 00 1550 00 .

 

 

 

 

 

 

Table 3.3: Number of simulation cycles

because local clock computation must wait until the ﬂip-ﬂop sends a message to the
gate. In optimistic simulation, the gate can execute event evaluation as soon as
possible and go ahead with virtual time advancement.

Figure 3.3 shows the number of simulation cycles for S9234 as a function of the
number of input vectors. According to our experimental results, in general, optimistic

simulation has the smallest number of simulation cycles, while synchronous simulation

has the largest number of simulation cycles.

Parallelism

Table 3.4 shows the parallelism for 1,000 randomly generated input vectors. Opti-
mistic simulation has the highest parallelism since it includes unnecessarily active
gates which will be rolled back later. Synchronous logic simulation has the lowest
parallelism. The reason is as follows. The time difference between successive input
vectors is larger than the critical path length of the circuit being simulated. Since
only the gates with the smallest LVTs are involved in event evaluation, any two suc-
cessive vectors cannot be overlapped. In other words, input vectors are processed one
by one. Therefore, parallelism is very low.

Combinational circuits have higher parallelism than sequential circuits since ﬂip-
ﬂops in sequential circuit control (or reduce) the ﬂow of events. We can see, in the

table, that parallelism becomes low if MSB is used.

42

(simulation cvcle)T

 

 

 

 

I I I I
30000 e ...-O _
25000 .. .....O Synchronous d
+.. Conservative .<>'
....... .0 ..+ "1:1
15000 - ............................ _
...-0" ...+ ........... 13""
10000 - , " ,,,, 1:1 .......... _
5000 ...9"",.....:;:::ZIES ..........
...;:i::::Il§:: iiiiiiii
0 .. .::. .... l l l l l
0 200 400 600 800 1000

Number of input vectors

Figure 3.3: Number of simulation cycles for S9234

Figure 3.4 shows the degree of parallelism in three simulation techniques for C1355
with 200 input vectors. Synchronous simulation, conservative simulation, and opti-
mistic simulation have 11525 (0.023), 1332 (0.19), and 1328 (0.29) simulation cycles
(parallelism), respectively. The ﬁgure shows the degree of parallelism for the ﬁrst 1400
simulation cycles of synchronous simulation. Synchronous simulation has a low wave-
form in the degree of parallelism due to its low activity level. Conservative simulation
and optimistic simulation in the ﬁgure follow the trend shown in Table 3.4.

A higher level of parallelism does not automatically imply better speed. For
example, in optimistic simulation there are some active gates which will be rolled
back later. For a simulation technique with a given circuit, however, high parallelism
indicates that good performance can be obtained as the circuit size increases.

Figure 3.4 shows that, in both optimistic simulation and conservative simulation,
highest parallelism will occur when the number of simulation cycles is less than or
equal to the input vector size plus the critical path length of the circuit being simu-

lated. But parallelism becomes low as the simulation continues.

43

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Circuit-s [[ Synchronous Conservative Optimistic [l

1 [I Parallelism Parallelism MSB Parallelism [ MSBJ]
01355 |] 0.0230 0.1900 00 0.324 20000
C1908 0.0230 0.2791 28000 0.380 20000
C6288 0.0860 0.1976 2500 0.260 1500
C7552 0.0160 0.1053 10000 0.133 7500
S9234 0.0020 0.0033 00 0.024 00
813207 0.0018 0.0021 00 0.0068 40000
315850 0.0024 0.0027 00 0.049 00
535932 0.0140 0.0230 00 0.24 00

Table 3.4: Parallelism
0.8 I I T I I I
0.7 -
Optimistic (upper line)

0'6 Conservative (middle line) ”

Degree of 0.5 '; Synchronous (lower line) a
parallelism 0.4 ; ‘ _

= D,-

( l 0.3 kiwi“ _
0.2 . hm _

0.1 ' , = -

 

0'1..- ~ 1111,...

200 400 600 800 1000 1200 1400
Simulation Cycle ( = i )

Figure 3.4: Degree of parallelism

44

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Circuits Synchronous Conservative Optimistic
Q Size Q Size [ MSB Q Size [ MSB
01355 2 647 00 346 20000
01908 2 630 28000 473 20000
C6288 2 632 2500 419 1500
C7552 2 436 10000 431 7500
89234 2 427 00 337 00
$13207 2 284 00 434 40000
315850 2 386 00 258 00
835932 2 271 00 289 00

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 3.5: Maximum queue size (in events)

Queue Size

Table 3.5 presents the required queue size of a gate at the possible maximum MSB
size for each test circuit with different simulation techniques. The measurement of
the maximum queue size considers all event queues during the entire simulation. The
smallest queue size is required in synchronous simulation compared with the other
two techniques since parallelism is low and simulation is done for input vectors one
by one. On the other hand, optimistic simulation needs a large maximum queue
size since events must be stored in the queue to cope with rollback. In sequential
circuits, however, conservative simulation might need larger queues than optimistic
simulation since conservative simulation must wait to execute events due to a rare
event (or message) propagation.

In actual simulation, to avoid frequent fossil collection and improve performance,
the following strategy can be used: fossil collection is performed if the queue
size exceeds the limit. But the experimental results in Table 3.5 were obtained by

applying fossil collection at each simulation cycle.

Execution Times

Table 3.6 compares the execution times of parallel logic simulation techniques for dif-

ferent test circuits with 1,000 randomly generated input vectors. In the measurement,

45

only the time period for simulation is considered. The ﬁgures listed do not include the
time required for translating circuit description, reading vectors, or printing output. _

We observed that conservative simulations are faster than synchronous and op—
timistic simulations for combinational circuits even though optimistic simulation re-
quires fewer simulation cycles. One reason is that, each cycle of optimistic simulation
involves more time-consuming operations, such as rollback and long queue manip-
ulation. This observation contrasts with the claims in [34, 48] that Time Warp
outperforms the Chandy-Misra algorithms. But, as shown in the experimental re-
sults, optimistic simulation may be better than any other techniques for sequential
circuits. But, for circuits with fewer ﬂip-ﬂops, such as 89234, Time Warp might be
slow because optimistic logic simulation has only slightly fewer simulation cycles than
conservative logic simulation. Even though each simulation cycle of synchronous sim-
ulation is very simple, it usually takes more time than the two techniques since it
needs many more simulation cycles.

Soule and Gupta [63] found that deadlock avoidance with null messages on the
Encore Multimax( shared memory MIMD machine with 16 nodes) is highly inefﬁcient.
They claimed that actual run times range from 30 times slower to more than 100
times slower than conservative simulation with deadlock detection and recovery. But
conservative simulation with deadlock detection and recovery is difﬁcult to implement
on SIMD machines. As we can see from the table, conservative simulation with null
messages works very fast.

Comparison of execution times for a 32-bit array multiplier with 8256 gates be-
tween Intermetrics VHDL simulator on a SUN3/ 260 and conservative simulation on
the CM-2 is shown in Table 3.7. All the schemes considered in this thesis are also
considerably faster (up to several hundred times) than the Intermetrics VHDL simu-

lator [1] which is a general purpose behavioral simulator.

46

 

 

 

 

 

 

 

 

 

 

 

 

 

Circuits Synchronous Conservative Optimistic
Times Times I MSB Times [ MSB
C1355 214 55 00 189 20000
C1908 265 41 28000 184 20000
C6288 735 437 2500 1667 1500
C7552 324 89 10000 402 7500
89234 119 100 00 443 00
513207 . 154 142 00 814 40000
515850 227 111 00 39 oo
S35932 100 80 oo 37 oo

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 3.6: Execution times (in seconds)

 

LNG. of input vectors I] 2 [ 4 [ 6] 8 [ 300 [ 500]
VHDL 203.3 420.2 700.2 1023.2 - -
Conservative Sim. 1.5 2.4 3.0 3.6 90.4 123.9

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 3.7: Execution times (in seconds) for a 32-bit array multiplier

3.3 Comparisons of Massively Parallel Machines

Architectures of the CM-2 and the MP-l are compared for parallel logic simulation.
In view of logic simulation, the critical features of a SIMD machine are the number of
processors and the memory per processor. The CM-2 is known as a parallel machine
with a large number of small processors, while the MP-l has a large processor memory
size relative to other SIMD machines. We discuss machine-dependent characteristics

in logic simulations on the two SIMD machines.

0 Performance of Related Instructions

The performance of the major instructions related to logic simulation is mea-
sured. MPL(Massively Parallel Language) is used for the measurement on the
MP-l, while C/ Paris is used on the CM-2. By knowing the execution time of

each instruction used in our implementation, we can estimate the performance

47

 

I Instruction II Global Router II Global Min II Array Write II Array Read I
I inbits I32I 64 II 32] 64_II32 64 "32] 64 I

CM-2 T60 1600 240 470 40 so 70 140
MP-l 422 562 92 142 12 13 12 13

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 3.8: Performance of instructions (in microseconds)

of the simulation protocols and use this information to ﬁnd better data struc-
tures and algorithms for the protocols by avoiding the most time-consuming
instructions. The ratio of the number of virtual processors to physical proces-
sors is referred to as the VP ( Virtual Processor) ratio. The VP ratio was 1
in these measurements. The performance of the instructions on the CM-2 and
MP-l will be given Table 3.8 [23]. In the table, non-conﬂict send operations
are considered in the measurement. According to the measurements, the MP-l

instructions execute 2 to 10 times faster than the CM-2.

Queue Space

One of the important factors for efﬁcient parallel logic simulation is to have
enough memory space for event queues. Each processor on the MP-l and the
CM-2 has 16K and 8K bytes of local memory, respectively. Moving simula-
tion bounds can be used to prevent overﬂow in event queues for conservative

simulation and optimistic simulation.

Gate to Processor Ratio

The number of available processors is another signiﬁcant factor when each pro-
cessor is assigned to a gate. The CM-2 provides up to 64K physical processors,
and offers the virtual processor concept for circuits larger than 64K gates, which
indicates how many times each physical processor must perform a certain task

in order to simulate the appropriate number of virtual processors [66].

The MP-l has fewer processors than the CM-2. The number of physical proces-
sors of the MP-l family ranges from 8K to 16K. Using the large local memory,

48

the virtual processor concept can also be implemented, although it must be

explicitly written as part of the program.

0 Communication Scheme

The CM-2 [66] uses a Hypercube interprocessor communication topology, while
MasPar’s topology is a 2D mesh and external router. Each CM-2 processor chip
contains one router node, which serves the 16 data processors on the chip. The
router nodes on all the processor chips are wired together to form the complete
router network. The algorithm used by the router can be broken into stages
called petit cycles. The delivery of all the messages for a send operation might
require only one petit cycle if fewer processors are active, but, if every processor

is active, then many petit cycles are typically required.

In MasPar, the Global Router is a bidirectional communication path and a cir-
cuit switched style network organized as a 3 stage hierarchy of crossbar switches
[11]. Each square matrix of 16 processors in the processor array is called a clus-
ter. The system can communicate with all PE clusters simultaneously, but it

can communicate with only one processor per cluster at one time [51].
For the comparison of performance on the machines, a term is deﬁned as follows.

Deﬁnition 4 Speed ratio is deﬁned as the ratio of execution time on the CM-2 to

that on the MP-l.

Table 3.9 shows the results of the performance of synchronous, conservative, and
optimistic logic simulation for 1,000 randomly generated input vectors [23]. In the
measurements, the same MSB size and input vectors which were used in Section 3.2.2
are applied to prevent queue overﬂow. The MP-l runs 2 to 2.5 times faster. The
obtained results are reasonable when we consider the speed of related instructions in
Table 3.8.

As you can see, the optimistic protocol gives more advantages than synchronous
and conservative protocols. We can consider two analytical factors, array access

frequency and congestion. The ﬁrst factor, array access frequency, produces very

49

 

 

 

 

 

 

 

 

 

I Benchmark Gigs II Synchronous—I Conseﬁtive Optimistic I
01355 T 1.89 2.12 2.52
C1908 1.84 1.95 2.36
C6288 1.89 1.92 2.52
C7552 1.86 2.07 2.48

 

 

 

 

 

 

 

 

Table 3.9: Speed ratios of simulation protocols

manifest effects and wide differences. In optimistic simulation, a lot of event searching
causes frequent array accesses in each simulation cycle. As shown in 3.8, the array
access time is much faster on the MP-l than the CM-2. The performance of optimistic
simulation on the MP-l is almost two and half times better than that on the CM-2.

Thus, for optimistic simulation, the performance difference for queue manipula-
tions dominates the difference resulting from high congestion. The performance of
conservative simulation on the MP-l is also affected by this domination, producing
a higher speed ratio than synchronous simulation. Thus, we can see the congestion
problem does not have an important effect in ISCAS’85 circuits. The magnitude of
the congestion problem, however, depends on how gates are allocated to processors.

Based on the experimental results, several factors have been considered for the
analysis of the two machines. First, as the number of propagation events increases,
the MP-l has a more serious congestion problem than the CM-2 since the MP-l uses
an external global router. Among the three logic simulation approaches, optimistic
simulation has the heaviest congestion overhead because it requires many propaga-
tion events. This problem can be avoided by assigning gates to processors properly.
Second, the MP-l provides bigger local memory size than the CM-2. So, the MP-l is
good for simulation protocols which require large queue sizes, such as optimistic sim-
ulation. Third, due to faster array access of the MP-l, optimistic simulation, which
requires a lot of event handling, can get good performance. Finally, the CM-2 has a
larger number of processors. Therefore, a circuit with a large number of gates can be

simulated on the CM—2 with VP ratio 1.

50

3.4 Conclusions

We have experimentally analyzed the effect on performance of logic simulation de-
pending on several factors, such as target machine used, simulation technique applied,
event queue structures implemented, and test circuit simulated.

The performance of logic simulation also depends on the speed of several instruc-
tions. These related instructions are faster on the MP-l than the CM-2. Three logic
simulation algorithms are also evaluated and compared on both machines. Experi-
mental results show that the MP-l is about 2 to 2.5 times faster than the CM-2 for
all three simulation techniques.

We observed that, despite theoretical arguments to the contrary, optimistic simu-
lation such as Time Warp is not the best technique for all applications on massively
parallel SIMD machines. This is attributed to its inherent rollback and queue man-
agement overhead. We also observed that, in contrast to MIMD environments, con-
servative simulation with null messages works very fast on massively parallel SIMD
machines. Finally, we conclude massively parallel SIMD machines can be efﬁciently

used for parallel logic simulation if we utilize the limited local memory efﬁciently.

Chapter 4

Performance Prediction Based on

Gate-to-Processor Ratio

One of the critiques of parallel logic simulation on massively parallel SIMD machines
is that the parallelism is low. In this chapter, we show that the parallelism on mas-
sively parallel SIMD machines can be increased if we increase the gate-to-processor
ratio (ratio of the number of gates to the number of processors). A probabilistic model
is proposed to estimate the performance of parallel logic simulation based on the gate-
to—processor ratio. Using this model, parallelism, number of simulation cycles, and
execution times are predicted as the gateto-processor ratio increases. To compare
the predicted performance estimation with experimental results, three distributed
simulation protocols for logic simulation, synchronous simulation, conservative sim-
ulation, and optimistic simulation, were implemented. The model generally predicts
more improvement in performance than experimental results indicate. We found out
that the correlation between the predicted performance and the experimental results
depends on both simulation protocols applied and circuits simulated. The correlation
is best for optimistic simulation, and worst for conservative simulation.

The remainder of this chapter is organized as follows. Section 4.1 explains why
performance estimation at various gate-to—processor ratios is needed. In Section 4.2, a
probabilistic model for estimating performance based on the gate-to-processor ratio is
proposed. Parallelism, number of simulation cycles, and execution times are predicted
using the model in Section 4.3. In Section 4.4, the experimental results are compared

with the predicted performance.

51

52

4.1 Introduction

It may be necessary to simulate a circuit with an enormous number of gates, while the
number of available processors on a particular machine is limited. In MIMD environ-
ments, many gates are usually assigned to a processor since the number of processors
is relatively small (compared to massively parallel SIMD machines) and each proces-
sor has powerful processing capabilities. But, even though massively parallel SIMD
machines have a lot of processors, we may need to assign more than one gate to a
processor to accommodate large circuits. Currently, it is difﬁcult to do simulation
with multiple gates in a processor, because existing massively parallel SIMD machines
do not have enough local memory. According to manufacturer announcements [53],
machines with large local memories will be released in the future. However, we can
use the currently available machines by controlling event queue sizes with some mov-
ing simulation bound, such as MTW (Moving Time Window) [61]. In this case, in
general, more simulation cycles and longer execution times are required.

As long as the memory capacity of a processor allows, we may assign many gates
to a processor to increase the efﬁciency (or utilization) of the machine even though
we have enough processors available. In this case, the performance can be predicted
using a probabilistic model. In this thesis, how the ratio of the number of gates to the
number of processors affects the performance of logic simulation on massively parallel
SIMD machines will be investigated based on a probabilistic model and experimental

results.

4.2 The Model

In [3], Agrawal and Chakradhar presented a statistical model of parallel processing
for evaluating the performance of several synchronized iterative algorithms on multi-
processor systems. Based on their model, the speedup was estimated as the number
of processors increases. Logic simulation of several VLSI circuits was used as a test

problem to get experimental results.

53

We propose a model for analyzing the performance of parallel logic simulation on
massively parallel SIMD machines. Our performance model uses the same assump-
tions as those in [3]. Based on our model, parallelism and the number of simulation
cycles are estimated as the gate-to-processor ratio increases. Finally, we can estimate
the execution time for a simulation using a known cycle time.

An active gate at a simulation cycle is deﬁned as a gate which has at least one
event to simulate during the simulation cycle and is ready for event evaluation. In

our simulation model, we use the following model.
0 Gates are assumed to be statically distributed among processors.
0 Approximately the same number of gates are assigned to each processor.

0 All the physical processors are synchronized and compute independently in be-

tween the successive simulation cycles.

0 The amount of work performed on an active gate during a simulation cycle is
called an atom. All atoms are statistically independent and have the same level

of activity.

0 The probability of a gate being active is independent of the gate-to-processor

ratio.
0 The active probability in steady state is constant.

Let n,- denote the number of active gates among the N,- gates which belong to
processor i. Then n,- is a random variable that can only assume the values 0, 1, 2, ..., N5.
The number of active gates within a physical processor during a particular simulation
cycle is a random variable with a binomial distribution. Thus, the probability that a
processor has :1: active gates is computed as follows.

Ni

Plni = a31= p“(1 - P)”“‘ (4.1)

where p is the probability that a particular gate is active.

54

4.3 Performance Estimation

Based on the probabilistic model proposed in the previous section, performance with
respect to parallelism, number of simulation cycles, and execution times will be esti—

mated in this section.

4.3.1 Parallelism

Let H and N be the number of assigned physical processors and the number of gates,

respectively. Gate-to—processor ratio and parallelism are deﬁned as follows.

Deﬁnition 1 Gate-to-processor ratio (GP ratio or g) is deﬁned as the ratio of the
total number of gates to the number of assigned processors. In this case, g = IN/H] .

For simplicity, it is assumed that gates are equally distributed among processors.

Deﬁnition 2 Parallelism (P9) is deﬁned as the ratio of the average number of active

processors to the number of assigned processors at GP ratio g.
The symbol E (Pg) denotes the expected value of P9.

Theorem 1 Suppose that only one of the active gates in a processor is involved in
event processing during a simulation cycle. E(Pg) is 1 — (1 — p)“, where p is the

probability that a particular gate is active.

Proof: Suppose we have H processors. Each processor has 9 gates. From
Equation 4.1 in the model of Section 4.2, the number n,- of active gates in processor
i is a binomial random variable. A processor becomes active if the processor has at
least one active gate. Therefore, the probability that processor i is active is PIn; Z 1].

Since all processors were assumed to compute independently, parallelism at GP

ratio g is computed as follows:

in” .
Pg _ z . l P[n.21]

_ H

Therefore, the expected value of P9 is

55

E(Pg) = MW)

=P[n,-_>_1]
=1-P[n,°=0]
=1- 9 (l-P)”
0
=1-(1-p)’

We can get E (P1) = p from the model. To estimate E'(Pg), we can use p = P1 as
a special case.

If a processor evaluates events of all active gates during each simulation cycle,
the number of active gates may be different from processor to processor since static

allocation is assumed. We get the following corollary:

Corollary 1 When a processor executes events of all active gates during each simu-

lation cycle, E(Pg) is also 1 — (1 — p)9

Proof: We already assumed that in steady state the active probability p is
constant regardless of GP ratio. As the proof in Theorem 1, a processor is active if

it has at least one active gate. D

But, the case that only one gate is involved in event evaluation is more efﬁcient
than the case that all gates in a processor are involved during each simulation cycle.
The ﬁrst case gives better performance on massively parallel SIMD machines. In the
second case, a simulation cycle time is the longest processing period of all processors
since all processors must wait until the slowest processor ﬁnishes the cycle. That is,
the load balancing problem would be more severe during a simulation cycle because
the number of active gates ranges from 0 to 9.

We implemented parallel logic simulation using the three simulation protocols on

the CM-2 and measured the performance in Chapter 3. In the measurement, GP

56

 

 

 

 

1.2 I I I I I
'- .- -- "0‘0” ' ” 3 '° " z: u -

1 0.9.590 é:$.+..+-$.$%é 44‘? 44> M
0.8 r 9 *..+""l'"+ .1

P O $31". 0.. Optimistic Simulation
9 0 6 T <2 +‘ +.. Conservative Simulatioon
0.4 3:: +‘ CI. Synchronous 8211;331:103 _
0.2 r" B..D--B“D"Qﬂ _
mﬂm..D«Q”D'
” D” l 1 1 1 1
5 10 15 20 25

GP ratio ( = g)

Figure 4.1: Parallelism depending on GP ratio

ratio was 1, i.e. a processor contains one gate only. ISCAS’85 circuits were used
as test circuits. According to experimental results, synchronous logic simulation,
conservative logic simulation, and optimistic logic simulation with no simulation time
bound have average active ratio 0.021, 0.18, and 0.33, respectively. We can use these
values in our model to predict the parallelism for other GP ratios. Figure 4.1 shows

the predicted parallelism of three logic simulation protocols for GP ratio up to 25.

4.3.2 Number of Simulation Cycles

The number of simulation cycles is very important since it is proportional to the
execution time. In this section, the number of required simulation cycles at a certain
GP ratio is computed based on the proposed model.

Let Wg be the amount of work, i.e. the total number of atoms, for a simulation
at GP ratio g. For the prediction of number of simulation cycles, we need to use the

following assumption.
Assumption 1 W1 = Wg for any g

The assumption means that the total amount of work to be done is the same

57

regardless of GP ratio g.
Let Sg be the'number of simulation cycles for a simulation at GP ratio g. The

following lemma can be obtained:

Lemma 1 23'9 is bounded as follows:
51 S 59 S 951

Proof: Proof is done by induction on g. Basis is established when 9 = 1.
Suppose that the inequality is true for g. We must prove that the inequality is true
for g + 1. If we increase g by 1, each processor contains one more gate. Therefore, in
the worst case, i.e. all gates at GP ratio 1 are active during simulation, the number
of simulation cycles is increased by SI according to Assumption 1.

Therefore

51359“559+513951+51=51IQ+U

We use P9 and E ( Pg) interchangeably throughout the remainder of this chapter.

Theorem 2 Suppose that only one active gate is involved in event processing during

a simulation cycle. When P1 = p is given for a simulation, Sg is pgSl/Pg.

Proof: Let N be the number of gates to be simulated. We can compute W1

and W, as follows:

W, = Sle and

W9 = SgIN/glpy-
From Assumption 1, we get W1 = Wg for any g. Thus,
SIN? = SgIN/glpg

From the equation, we can get the following solution.

58

59 = PQSl/Pg

Corollary 2 Suppose that all active gates in a processor are involved in event eval-

uation during a simulation cycle. In this case, S5. for any 9 is 51.

Proof: All active gates are determined at the beginning of each simulation
cycle. That is, any active gate does not make another gate active by sending events
during a simulation cycle. The amount of work done during each simulation cycle
at GP ratio g is the same as that done in the corresponding simulation cycle at GP

ratio 1. Therefore, the total number of simulation cycles is S. D

To see the change of the number of simulation cycles with a simulation protocol

as GP ratio increases, we deﬁne a new term as follows.

Deﬁnition 3 Cycle increase ratio (R9) is deﬁned as the ratio of the number of sim-

ulation cycles at GP ratio g compared with GP ratio 1, i.e. R9 = 59/51

Note that, from Theorem 2,
R9 = 59/51 =Pg/Pg

Figure 4.2 shows R9 for each simulation protocol as a function of GP ratio. For
a simulation protocol with high active probability, such as optimistic simulation, the

cycle increase ratio increases rapidly.

4.3.3 Execution Times

The execution time for a simulation is predicted if we know the number of simulation
cycles and the time taken for a simulation cycle. If the time taken for a simulation
cycle, denoted by r, is assumed to be independent of GP ratio, the execution time at

GP ratio g, E9, is computed as

59

 

 

 

9 I I I I I
8 - ..0.. Optimistic Simulation 09'9“
7 _ .....+ Conservative Simulation 9.9” _]
6 13.. Synchronous Simulation 0.9.0"
I ..o--- ' ‘
R 5 " .o 9'9 + r
..0 +--+”
3 '- 0... + + 'l" —I
.9... ”+”+..-+"'
2 - .59 +,+..+.+-+ ..
1 mafia—:3315250.0-QuLII-.13.-Q~QG~09~D~QB~DQ~DD~D~D _
0 I - l l

 

L L
5 10 15 20 25
GP ratio ( =g )

Figure 4.2: Increase ratio of simulation cycles

E9 =59 X T=IP951/Pg) X szgTsl/Pa

Let us estimate the execution times of three different simulation protocols on
benchmark circuits, C 1355, C1908, 39234, and S35932. All the data for the estimation
was obtained from the experimental results in Chapter 3. For example, the time taken
for a simulation cycle is equal to execution time divided by the number of simulation
cycles. But, we could not get the data for C1908 since MSB was used to prevent
queue overﬂow. So, we wrote programs and ran them on a SUN workstation to get
parallelism and the number of simulation cycles without using MSB.

The predicted execution times as a function of GP ratio are shown in Figures 4.3
and 4.4. According to the ﬁgure, as GP ratio increases, in general, optimistic simu-
lation gives worse performance than the other two protoCols because it has a higher
active ratio and requires more time per simulation cycle. As a good example, we can
observe that optimistic simulation for S35932 is faster than the other two simulation
techniques when GP ratio is small as in Figure 4.4. However, as the GP ratio in-
creases, the predicted execution times increase drastically, since parallelism is much
higher than the other two techniques.

To analyze the decrease of speed as the GP ratio increases, we introduce a new

term as follows.

60

 

 

 

 

 

 

 

 

 

 

 

 

((second)
1 00 1 l , I 0 <5
_ C1355 . " _
1400 0.. Optimistic Simulation "9.0 O
1200 ”...+.. Synchronous Simualtion 9.9-9 M+,..+"“l' -
1000 _... Cl. Conservative Simulation 0 9.0“ +ﬂ+..'+"'+ .
E9 800 '- ovO"Q.:"+.-+T+ —
600 '- $$.$;+,.J+' .+r _
400 '- 9 @¢,,Q::$2. .. _
" " . .. .3943“
200 I: In In m..c1..m..m..m..n-ca..m..m..m-a..m.o.a-a..c31:1 :1 y
0 .. .. .. i a 1 1 :
5 10 15 20 25
(sec and) I I T T ,. I
1400 -...0.. Opgmistic Simulation 0,09 _
a...+.. Synchronous Simualtion .0 d
1200 . . . 0.0
1000 _... 13.. Conservative S1mulat1on Q _
.0'"
E9 800 ‘ 9"909 _
600 — 09-9" _
Q...
400 " 9.9 . 1 I l 1 l 1 I 1 _L -'
esters I+I++I I+:.1T..1T.Tr
200 PlznnnnnnnnnQDDDQQL—JQQEDDQQ‘
0 '- — .. .. — — .. — — — i I
5 10 15 20 25

GP ratio ( = g)

Figure 4.3: Predicted execution times for C1355 and C1908

61

 

 

 

 

 

 

 

 

 

 

ébﬁcond)
I I I Q,Q..O»$~Q«Q-G“O “ I
500 r- 00 o 4;. 0.0-090" _
400 ' 89234 ‘
E9 300 _ ..<>.. Optimistic Simulation _
+.. Synchronous Simualtion
200 b i .....U Conservative Simulation _‘
100 ”'11!“inrdzﬂmﬁzﬁrab"i1"ri1"i1'Ti1”de.T'd.1”ird_W_b'TirliTT_b“dnt
l 1 l l 1
5 10 15 20 25
£second)

20 I I j I OMJ
200 r S35932 0.9-" ..
180 - ".0.. Optimistic Simulation 9.9” _
150 _ +.. Synchronous Simualtion 9.9" q
140 _ C1. Conservative Simulatign0.9” _

_ 9" . . _L d
E9133 .1 . - . 'éA?:i l+l H H I ILIIJ
~+l l I I ' ' ' 1... "Y .. &.D~DD«Q

80 29-0.0.0.5.gﬁﬂti84:3»9-9-9-5'5“9’9 m ..

60 L 9.9.0" _

9.9-
40 .—... 1 i 4 1 4 1
5 10 15 20 25

GP ratio ( = g)

Figure 4.4: Predicted execution times for S9234 and S35932

62

Deﬁnition 4 Speeddown (09) is deﬁned as the increased ratio of the execution time

at GP ratio g compared with E1, i.e. 0,, = Eg/E1
Theorem 3 For a simulation, 09 = R9

Proof:

0,, = Eg/El = 759/751 = 59/51.
By Theorem 2,

59 = PQSI/Pg

Therefore,

09 =(P951/Pg)/Sl = PSI/Pg = R9.

In the worst case, speeddown is g. It is very important to maximize the difference

between 9 and actual speeddown.

4.4 Comparisons with Experimental Results

Since the currently available massively parallel SIMD machines do not have enough
local memory to run logic simulation with a high GP ratio, sequential programs were
written in C and run on SUN workstations. From the programs, we can obtain the
number of simulation cycles and parallelism, but not execution times.

As test circuits, C 1355 and C7552 circuits were used. Based on the experimental
results, the predicted performance of each simulation protocol is analyzed. Gates were
statically and randomly assigned to processors. Among active gates in a processor,
an active gate with the smallest local simulation time is chosen as a ﬁnal active gate
for a simulation cycle. Figures 4.5, 4.6, and 4.7 show the comparisons of experimental
results and predicted performance in synchronous simulation, conservative simulation,
and optimistic simulation, respectively. In the graphs, the predicted performance is
compared with the experimental results as the GP ratio increases.

In general, the parallelism obtained from the experiments is a little lower than the

predicted performance. But, optimistic simulation has almost the same performance

63

 

 

 

 

 

 

 

 

0.6 - —
I I I I
0.5 - ‘
0.4 r Parallelism (= P9) _
0.. Predicted Parallism
0.3 - ....+ Experimental Results .0 0 d
0 9 9.9-"
0.2 _ 90.9.1) OM ........................... + a
0....0 ..................... +
O 0'" .................. + ------
0'1 ' o---€;?.2::: ....... + ......... %
©3ng .........
..4>-'"
0 ' 1 l 1 I
5 10 15 20
2 r I I T
1.8 b Cycles Increase Ratio (:- R9) _
....0. Predicted Ratio + ........... +
1 6 __ .....+ Experimental Results +..}. ........... _
...-”l" ........... ..F'
1.4 " M+++ ----- -
...+'+' o o
1.2 - ..+....4+ "Omomo 0 04>»- ‘
”VA. 4)..<>...4>...<>--~<> 0"0'
l l 1 J
5 10 15 20

GP ratio ( = 3)

Figure 4.5: Comparisons for C7552 in synchronous simulation

64

 

 

 

 

 

 

 

 

1.2 r , I I
1 — Parallelism (= P9) 0...<>...o...o...o...o...o...o...o...<> -
<>...<>---'<>"'
0.8 ~ 0»- _
.0
.0 + ........... + ........... +
0'6 _ .0". + ................... +...+... _
0.4 “ Q --°+m+m+ _
..-+ .....0 Predicted Parallelism
0 2 .-;'--'+ +.. Experimental Results _
' 1 l J J
5 10 15 20
7 T I I I
6 P Cycles Increase Ratio (2 R9) _
.....0 Predicted Ratio "..+. ........... +
5 _ +.. Experimental Results +.--wk .......... _
4 + ........ O
” ' -- ....o ‘
3 _ 9....0 a
2 - . ...<>...O -
. ""Z...+..-<>---<> 0
l ‘2" l l l l -
5 10 15 20

GP ratio ( =g)

Figure 4.6: Comparisons for C1355 in conservative simulation

65

 

1.1 I I I I
1 .aaassssesss a a
0.8 ~ _.+'<> , _
0.7 — +9" Parallelism (= P,) -

0'6 bs'G .....0 Predicted Parallelism '-
0 5 -555. +.. Experimental Results

0.4 _
0.3 E l l 1
10 15 20

I 1 1 I

-

 

 

Cn—

 

r- Cycles Increase Ratio (= R9) ........... +

0.. Predicted Ratio +...+ ...........
+.. Experimental Results

I

I
+
"'<.>
l

 

 

a. up on p. or c: «I a: ¢>
I
4.
+
G
G

 

5 10 15 20
GP ratio(=g)

Figure 4.7: Comparison for C1355 in optimistic simulation

66

as measured by parallelism, as predicted. Parallelism depends on how gates are
selected among active gates in a processor for execution. According to experience, the
selection of a gate with the smallest simulation time gives much better performance
than selection based on round-robin. The reason is that the former technique reduces
the frequency of rollback in optimistic simulation and facilitates the update of link
clocks in conservative simulation. Among the three simulation protocols, conservative
simulation shows the biggest difference between the predicted parallelism and the
observed performance. We can explain this observation as follows. When the GP
ratio increases in the conservative simulation with null messages, link clocks are not
updated as soon due to the following two reasons. First, the number of active gates
which will not be chosen for execution during a simulation cycle increases as GP ratio
increases. The gates cannot propagate events to update corresponding link clocks for
the simulation cycle. Second, null messages cannot be propagated in a proper fashion
in each simulation cycle because all except one gate are not allowed to send even
null messages to successors. Thus, as GP ratio increases, the number of gates which
cannot send any events or null messages increases.

Let us compare the predicted increase ratio of simulation cycles and the experi-
mental results. According to the graphs, the observed increase ratios of simulation
cycles are much higher than the prediction. In optimistic simulation, the predicted
ratio and the observed ratio become close when GP ratio is high. Among the three
simulation protocols, conservative simulation gives the biggest difference between the
predicted ratio and the observed ratio. The reason for this is the same as for the

difference in parallelism. The number of simulation cycles is reciprocal to parallelism.

4.5 Conclusions

There are no massively parallel SIMD machines in which a processor can contain many
gates because currently available systems have local memory limit problems. We
presented a model to compute parallelism and number of simulation cycles depending

on the gate-to-processor ratio when we know the active ratio. Based on the model,

67

we can predict the performance of parallel logic simulation for a circuit with more
gates than processors.

The predicted performance was compared with the experimental results for some
benchmark circuits. The predicted curves in the above ﬁgures have the same shape as
observed results. We can also predict execution times based on the predicted number

of simulation cycles and time taken per simulation cycle.

Chapter 5

Enhancement of Simulation Clock

Advancements

In parallel logic simulation using traditional distributed event-driven simulation pro-
tocols, such as the Chandy-Misra algorithm and Time Warp, an event is used to carry
a value associated with a timestamp. Each gate computes a local virtual time (LVT)
which is the smallest timestamp of unprocessed events. All the events with times-
tamps equal to LVT are involved in event evaluation at each gate during a simulation
cycle (or iteration) based on the event selection (or scheduling) policy of a simulation
protocol. Event evaluation means that an output value is computed based on the
gate type with all the chosen events. After event evaluation, an output event may be
propagated to successors.

We propose efficient event evaluation and propagation techniques to enhance sim-
ulation clocks of conservative simulation and optimistic simulation on parallel pro-
cessing environments. The main idea of the techniques is to allow more than one
event evaluation per simulation cycle and to pack more than one propagation event
in a single message. In other words, evaluation of many events in each simulation
cycle is allowed, reducing the number of simulation cycles, communication costs, and
execution times. A sequence of events to be propagated as a single message is called
a multi-event in this chapter. In parallel processing environments, communication
overhead becomes very significant as the number of processors increases. With multi-
events, we can significantly reduce communication cost since a set of events is sent as
a single message.

In conservative logic simulation, each gate computes an advancement window
containing events which can be safely executed without violating event execution

precedence. In optimistic simulation, an aggressive advancement window is given

68

69

which contains events to be aggressively evaluated. More than one evaluation is
allowed per simulation cycle even though rollback frequency increases.

The proposed advancement windows and multi-event techniques were imple
mented for some ISCAS’85 and ’89 benchmark circuits on the CM-2. Good per-
formance, measured by parallelism and execution time, was obtained for some bench-
mark circuits. We investigate the effects of window sizes on performance. As the
window size increases, execution time decreases initially, but then increases since
time taken for each simulation cycle increases due to the increased window size.

The remainder of this chapter is organized as follows. Section 5.1 explains how a
set of events are sent as a single message. Both conservative simulation and optimistic
simulation with advancement windows are presented in Section 5.2. Section 5.3 gives
the performance of the proposed techniques in terms of number of simulation cycles,
parallelism, maximum queue size, rollback frequency, and execution times. In addi-
tion, communication costs will be measured as a function of message length to see

how much multi-events can improve communication costs.

5.1 Multiple Events in a Message

Gate-level logic simulation is different from other simulations, such as queueing net-
work simulation, high level circuit simulations, etc., since simulation is performed
based on ﬁxed propagation routing, ﬁxed delay, good lookahead capabilities, and
non-preemption. The ﬁxed propagation routing allows a gate to execute more than
one event in a simulation cycle based on a speciﬁc simulation protocol: optimistic or
conservative.

We discuss multi-events, in detail, to be contained in a single message. A message
is an object which transfers information between gates. The information carried by

the message represents a set of events as follows.
Deﬁnition 5 A multi-event contains a sequence of events to be propagated.

A multi-event contains a sequence of timestamps in non-decreasing order to prop-

agate the results of the whole event evaluation within an advancement window. A

70

multi-event has the form as shown in Figure 5.1.(a). In the ﬁgure, ts represents the
timestamp of the ﬁrst event of a multi-event, and v indicates the signal value of the
ﬁrst event. In addition, d.- contains the timestamp difference between the ﬁrst event
and the (i + 1)-th event in the multi-event, and corresponding signal v,- follows. In
event-driven simulation, only events whose output signal is different from the output
signal of the previous event evaluation must be propagated.

Depending on what parallel machine is used, the format of a multi-event may
vary. For example, if the computation of the difference in timestamps takes too much
time, the timestamp itself .is sent instead of the difference. The signal ﬁelds (1) or 1);)
may not be necessary in a multi-event if binary signals, either 0 or 1, are used and
we can make sure that the signal value of the last event in the most recently sent
multi-event is different from the value of the ﬁrst event in the multi-event to be sent.

Time Warp has an antimessage used to nullify previous incorrect event propaga-

tion. In this case, the message contains only a timestamp as its information.

II Illa J

a) the format of multi-event

 

 

 

 

 

Its

l10000L1l 10 40 50

 

 

 

b) an example of multi-event

Figure 5.1: A multi—event

 

 

 

For example, suppose binary signals are used and that there are propagation events
e(10000,l), e(10010,0), e(10020,0), e(10030,0), e(10040,1), and e(10050,0). The mes-
sage to be sent will contain the multi-event shown in Figure 5.1.(b). In traditional

distributed event-driven simulation, at least 6 simulation cycles and 4 send operations

71

for the above event evaluation and propagation are required. With the pr0posed ad-
vancement windows and multi-events, however, all the above evaluations may be per-
formed and their results will be propagated to successors during the same simulation

cycle.

5.2 Clock Advancement Windows

In this section, the concepts of clock advancement windows in conservative simulation
and optimistic simulation are discussed. Since the advancement window techniques
are implemented on a massively parallel SIMD machine, terminology needed for im-

plementation on the machine will also be introduced.

Deﬁnition 6 Event evaluation at simulation time t is deﬁned to be safe only if it is
certain that no event with timestamp less than t can arrive in the future. Otherwise,

event evaluation is called unsafe.

5.2.1 Advanced Conservative Logic Simulation

Link clocks and minimum link clocks are used as defined in Chapter 2. That is, a
link clock of each input port of a process is deﬁned to be the timestamp of the most
recently arrived event along the input link. The minimum link clock of a process is
the minimum of link clocks of all its input ports.

Based on the concept of safety, we deﬁne the following advancement window in

conservative simulation.

Deﬁnition 7 In conservative simulation, a window at a gate can be deﬁned from
LVT (=1) to minimum link clock (= c) where l S c. Such a window, which contains

the events to be involved in safe event evaluation, is called a Maximum Conservative

Advancement Window (MCAW).

On MIMD machines, all (or some) events within the maximum conservative ad-

vancement window can be executed for a simulation iteration (or cycle). All gates

72

are synchronized in SIMD processing environments. Hence, all gates must wait until
the slowest (or busiest) gate ﬁnishes its event evaluations. To avoid that situation,

we limit the number of event evaluations.

Deﬁnition 8 A Synchronized Conservative Advancement Window (SCAW) is de-
ﬁned as the synchronized window which contains only events to be involved in safe

event evaluation. The size of the window is deﬁned as the number of event evaluations

from LVT.

Definition 9 If advancement windows are applied in logic simulation, each simula-
tion cycle has a repeated procedure which consists of LVT computation, choosing
active gates, and event evaluation. This procedure is called an advancement cycle.

Only one event evaluation is allowed in an advancement cycle.

A simulation cycle has one or more advancement cycles. For example, F ig-
ure 5.2.(a) shows the state of a gate during simulation. The minimum link clock
of the gate is 60. All events from LVT 10 to the minimum link clock can be involved
in safe event evaluation, as shown in Figure 5.2.(b). In this case, an event evalua-
tion means that an output value is computed based on all the events with the same
timestamp. For example, the event evaluation at 10 in Figure 5.2.(b) involves two
events stored in queues Q1 and Q2, respectively. All the events in the MCAW can
be conservatively evaluated without violating event processing precedence. In other
words, the MCAW consists of a sequence of safe event evaluations. In the ﬁgure,
since SCAW size 3 is used, the maximum number of LVT computation and event
evaluations at a simulation cycle is three.

Each gate might have a different MCAW at a certain simulation cycle, as shown
in Figure 5.3. Gates 1, 2, and 3 have MCAW of sizes 5, 2, and 4, respectively.
We assumed that SCAW size is 4. A gate whose MCAW is not smaller than a
SCAW, as shown in gates 1 and 3 of the ﬁgure, performs event evaluation at each
advancement cycle of the simulation cycle, as shown in gate 3 of the figure. Otherwise,
the gate may be idle after the ﬁrst few advancement cycles. It is very important to

choose an appropriate SCAW size for good performance. The performance will be

73

 

 

Simulation time

   

 

 

 

 

 

90
. . . 7O
Mlnlmum link ClOClL -- 6O
50
so 90 4° EMCAW
4° 7° 30 ESCAW
3° 3° LVT - 10”:
10 10
1 V 2
Q Q 0 indicates unsafe event evaluation

. indicates safe event evaluation

(a) A gate with two input ports (b) SCAW with 3 evaluations as its size

Figure 5.2: An example of an advancement window

 

 

 

 

A
SCAW
o
m. ..... 2 .....
Gate numbers 1 2 3

0 indicates unsafe event evaluation
. indicates safe event evaluation

Figure 5.3: Relationship between MCAWs and a SCAW

 

 

 

74

analyzed according to the the SCAW size in Section 3. In advanced conservative logic
simulation with advancement windows, the following steps are performed at each gate

for each simulation cycle.

Algorithm ADVANCED CONSERVATIVE

1. The minimum link clock is computed.

2. The following advancement cycle is repeated as many times as the predetermined

advancement window size.

(a) LVT is computed.

(b) if LVT is less than or equal to the minimum link clock, event evaluation is

performed.
3. Event propagation is performed.
4. Each link clock is set to the last timestamp in the multi-cvent received on the link.

5. Queue manipulation is performed if necessary.

In addition, we can use a global clock to choose active gates for better perfor-
mance if the above algorithm is implemented on SIMD machines, as given in [25].
For the above algorithm, a multi-event may also carry a virtual time which is the
delay of the gate plus the last LVT. This value is useful in setting the link clocks
of successors. Conservative simulation with advancement windows can be applied to
the null message strategy as well as the deadlock detection and recovery strategy. It
is known that conservative simulation with deadlock detection and recovery is very
efficient on MIMD machines [20, 63], while the simulation with null messages works
efﬁciently on massively parallel SIMD machines [25]. The techniques proposed in this

section can be applied effectively in both cases.

5.2.2 Advanced Optimistic Logic Simulation

In traditional optimistic simulation such as Time Warp, only one LVT increment is

allowed at each simulation cycle. In the proposed approach, each gate can advance

75

its LVT one or more times based on a predetermined advancement window size. We

deﬁne a term for the window.

Deﬁnition 10 A Synchronized Aggressive Advancement Window (SAAW) is deﬁned
as the synchronized window which contains some events from LVT for each gate as
far as there are available events to be executed. The size of the window is deﬁned as

the number of event evaluations.

If a MIMD machine is used as a target machine, a predetermined advancement
window, which need not be synchronized, is applied to each gate during simulation.
Event evaluation in optimistic simulation is allowed even when ordering could be vi-
olated. In the proposed technique, event evaluation is performed for all events within
a SAAW even if rollback may occur later. In advanced optimistic logic simulation
with advancement windows, the following steps are performed at each gate for each

simulation cycle.
Algorithm ADVANCED OPTIMISTIC

1. The following advancement cycle is repeated as many times as the predetermined
advancement window size.
(a) LVT is computed.

(b) For the first advancement cycle only, rollback is performed if the LVT is less

than or equal to the previous LVT.

(c) Event evaluation is performed.
2. Multi-event propagation is performed.

3. Queue manipulation and fossil collection are performed, if necessary.

Since the timestamps of events within a simulation cycle never decrease, only the
ﬁrst advancement of a simulation cycle at a gate can cause rollback. Figure 5.4 shows
a simulation example with SAAW size 2. That is, all events within the window are
involved in event evaluation at the same simulation cycle. Gates 1 and 2 are busy

during the simulation cycle, while gate 3 is idle after one advancement cycle.

76

 

 

A
SAAW
LVTs ---- ----------- (Du-...?
Gate numbers 1 2 3

0 indicates unsafe event evaluation

Figure 5.4: An example of SAAW with size 2

 

 

 

When the proposed approach is used, there are some advantages in queue manip-
ulation. All events along the same link will be inserted near each other in the same

queue. A disadvantage is that rollback frequency increases.

5.3 Performance Evaluation

Both conservative and optimistic logic simulations with advancement windows are
implemented on the CM-2 with 32K processors. Both simulation protocols were
implemented based on the same data structures proposed in Section 2. As test input,
1,000 randomly generated input vectors were used. In our simulation, we use the
number of event evaluations as an advancement window size.

Let us discuss the performance metrics for the proposed techniques. The per-
formance of the proposed simulation technique will be evaluated by means of the
comparison to traditional distributed event-driven simulation techniques. We deﬁne

the following performance metrics for the evaluation.

0 Simulation Cycle Ratio (Cw), which is the ratio of the number of simulation
cycles at window size w to the number of simulation cycles at window size
1. That is, simulation at window size 1 indicates the traditional distributed

simulation techniques.

77

0 Maximum Queue Ratio (Qw), which is the ratio of the maximum queue size at

window size w to maximum queue size at window size 1.

c There are two ways of deﬁning parallelism based on how to deﬁne an active
gate. When an active gate is deﬁned as a gate which performs event evaluation

at least once during a simulation cycle, parallelism is computed as
22:." MM)

where T,- is the total number of simulation cycles for which gate i is active, N

is the number of gates, and S is the number of simulation cycles.

Parallelism Ratio (Pm) is deﬁned as the ratio of parallelism at window size w

to parallelism at window size 1.

If we deﬁne an active gate as a gate which performs event evaluation at an

advancement cycle, adjusted parallelism at window size w is computed as
1:3" Ti/ ( N 5 W)

where T,- is the total number of advancement cycles for which gate i is active,

and W is the synchronized advancement window size.

When the window size is equal to 1, parallelism is the same as adjusted par-
allelism. The average number of clock advancements (or event evaluations) at
window size w for a simulation cycle can be computed by multiplication of

adjusted parallelism and the window size.

0 Execution Time Ratio (Em), which is the ratio of the execution time at window

size w to execution time at window size 1.

e Rollback Frequency Ratio (Rw), which is the ratio of the rollback frequency
at window size w to rollback frequency at window size 1 in optimistic logic

simulation.

78

In terms of the ratios proposed above, we compare the performance of traditional
event-driven simulation and proposed simulation techniques with advancement win-
dows. The maximum number of event evaluations is used as a unit which determines
advancement window size. In addition, there are several other ways to deﬁne the unit

as follows.

0 the maximum number of additional ﬁelds in a multi-event or

o the maximum number of events to be processed.

5.3.1 Advanced Conservative Logic Simulation with SCAW

The performance of advanced conservative logic simulation with different SCAW sizes
is evaluated in terms of simulation cycle ratio, maximum queue size ratio, parallelism
ratio, and execution time ratio. Figure 5.5 shows the performance comparison with
traditional simulation for C1355, C1908, and C7552 with 1,000 randomly generated
input vectors as a function of SCAW size. The performance in the ﬁgure was obtained
based on the performance of traditional conservative logic simulation which evaluates
gates once per simulation cycle.

As a measurement of how much local simulation clocks advance, the number of
required simulation cycles is used. As the amount of clock advancement increases,
the number of simulation cycles decreases, since local simulation clocks move ahead
quickly. As shown in Figure 5.5.(a), as the SCAW size increases, the number of
simulation cycles decreases rapidly and then converges to a certain point.

In Figure 5.5.(b), as the SCAW size increases, the maximum queue size decreases
initially, but then increases. We can explain this as follows. Events in gates are
properly processed and consumed as the SCAW size approaches a certain value in
the circuit. In this case, the required maximum queue size would become small.
But, as the SCAW size increases beyond this value, some gates may have large ac-
tually processed advancement windows, while other gates might have small windows.

Therefore, some input ports might have a lot of events while other input ports have a

79

 

1 I I I

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

1
0.8 3‘ Simulation Cycle Ratio (=Cw) .....O C7552 -
0 6 fa. .....+ C1908
' b. 0.. 01355
0.4 - ' ...... ..
0.2 " unumeaagu$ ....... -l
0 l """""" {immaéumaaﬁuu """" ma::::::d;:::::::¢:::::::d;z:::::::b
5 IIO 15 20 25
1.6 I , , I
1.4 - Maximum Queue Size Ratio (=Qw) .....0 07552-
1.2 *- .....+ C1908-
1 ".6. .....D Cl355_
0.8 _ ........ $2:-s$mm.$1:::g:$ ..... ”mug ...... ..+. ....... + ....... +—
0.6 - "0- ....... Q ""“:13'::::::{bZIIII..$....... Mg... 8_
0.4 - """"" 0 0 00.941) .......... _
l l L
5 10 15 20 25
5'? <> 07'552 ' ' '
b ' ...Q ....... "
5 :.+.. 01908 ....... 9"" 0- ------- 4;. ....... o ....... Q ...... 4;. ....... o“
3% i D 01355 ----- P all 1 Rat' P I
.3 .. 9 ar e ism 10 ( u,) -
2.5 .. o g ....... £3 ....... m ....... c1 ....... g ..-m... _
1 g :$.::3,.sz: .... + ....... + ....... + ....... ... ....... + ....... it . ...i; 51.....9:
.1 g i l ! l ' I 1 Ir ...... +...
5 10 15 20 25
2 I I l I
i2 : ..<>.. C7552 Execution Time Ratio ( Ew) ....... m:
1.4 ...+.. C1908 D ....... _
1.2 r- ... 0.. 01355 MB ....... G '''''' o ...... Q-
1 r B". ....... 9"” ..... .+ —
- ................ .....¢====::: .. + -------+ _
(0):: Emmoammi8iiiiiii8:::uguuUNI8"” '¢ ‘-
0.4 l l 1
5 1i) 15 20 25

SCAW Size (=w)

(a)

(b)

(C)

(d)

Figure 5.5: Conservative simulation with different SCAW sizes (1)

80

small number of events. There might be big differences between link clocks at a gate.
Hence, there is a high possibility of having large maximum queue sizes.

In Figure 5.5.(c), the parallelism increases initially, but then decreases. We can
explain this as follows. In the case of a large SCAW size, gates close to primary input
ports process many events at each simulation cycle since the gates in general have
large MCAW, while gates near primary output evaluate a relatively small number
of events due to their small MCAW. That is, if a large SCAW is used, gates in the
front part of the circuit complete simulation early, but gates in the rear part may
have congestion problems. Therefore, as SCAW size increases beyond a certain point,
parallelism might decrease.

In Figure 5.5.(d), the execution times decrease initially, then increase. The best
performance is given when the SCAW size is 4. As the SCAW size increases, the time
taken for a simulation cycle becomes longer since we need as many LVT computations
and event evaluations as the SCAW size at each simulation cycle. Therefore, the best
performance is given at the point where the time spent for a simulation cycle is not
long, as well as the number of simulation cycles is not large.

Figure 5.6 shows the performance with respect to adjusted parallelism, and average -
number of advancements per simulation cycle. According to the graphs, as SCAW
size increases, adjusted parallelism decreases since the number of idle advancement
cycles increases. The average number of advancements increases, and converges to a.
certain value since there are no more useful advancement cycles if SCAW size exceeds
MCAW sizes of all the gates in the circuit being simulated.

Figure 5.7 shows the performance of advanced conservative logic simulation as
a function of the number of input vectors. SCAW size 4 is used since the best
performance was given at that SCAW size. As the number of input vectors increases,
the number of simulation cycles, the maximum queue sizes, parallelism, and execution

times also increase.

81

 

 

 

 

 

 

 

 

0.3 + I I I u
0.25 'L'.’ """" + ......... ....0. C7552 —
2 + ......... .....+ C1908 _
0. ”"Q ....... g. ....... Q ....... $12.... .....U 01355
0 15 " memﬁzzxxzdixz .... F
0 l “O ....... 0 ....... 0 ....... Q ....... 0 ....... Q ....... Q ...... $3» iguana-mag ....... -
0.05 _ Adjlusted Paralllelism 1 ....... o ....... (I) ....... Q ....... _
0
5 10 15 2O 25
2 5 i I i I m ..... g
2 - g13:33:13},...:::+:::::::sz ..... + ....... +-
1 5 _ ..... +...::::$ ....... -
...-['3 ...... 1'3"" Average Aggrangemeats <>
1 ’ E; ..... .- 09%87559 _
0.5 agiv-“Q ....... ~<>"" .....+ 01908 _
0 33:. ------ 1 1....0 C1555 1
5 10 15 20 25

SCAW Size (=w)

Figure 5.6: Conservative simulation with different SCAW sizes (2)

5.3.2 Advanced Optimistic Logic Simulation with SAAW

Figure 5.8 shows the performance for C1355, 01908, and C7552 as a function of
SAAW size. In our simulation, the maximum number of event evaluations was ﬁxed
in advance as a synchronized aggressive advancement window. The performance
comparison with traditional‘optimistic simulation was measured in terms of ratios for
number of simulation cycles, parallelism, rollback frequency, and execution times.

As the SAAW size increases, the number of simulation cycles decreases and con-
verges to a certain point, as shown in Figure 5.8.(a). The number cannot decrease
further because rollback requires some number of additional simulation cycles even
though clock advancements are enhanced by giving a larger window size.

In Figure 5.8.(b), as SAAW increases, rollback frequency rapidly increases. But,
it decreases as the SAAW increases beyond a certain size. The reason is as follows.
For a large SAAW, many events are processed in a simulation cycle, even though
only the ﬁrst advancement cycle of each simulation cycle may have rollback and the

number of simulation cycles also decreases. At any rate, optimistic simulation with

-_-_ 71'

 

82

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

3500 h ' ' ' ' F
3000 ~Number of Simulation Cycles <> ......... 9"” .-
_ ”Q .......... .-
2500 ”.0.. C7552 ...... Q """"" _
2000 Q...
1500 ...+.. 01908 O ........ $ $ ......... $
7.. 13.. C1355 ------- , 333;": ===::::: ......... a
1000 ’ «""O spasm ----- «ID-”"43 -
500 Ema“: ..... .....Q ....... g 1 1 1 q
0 200 400 600 800 1000
500 — I ' 1.1-'4 1 1 ,
400 _ 5.::3;gﬁ2:22::::d::::::::::d1:::::::::t5 ........ a ......... a ......... 1:1 _
300 - <>----:::gs.&?=" '''' M 3%;5111‘3"Ma‘§aa“'s%;§"‘ 1523311130 4
.. .-;;I--' .....<> C7552 _
200 ...-3156‘ .....+ C1908
“’0 2:511” ...U.. (31355 -
3' 1 l 4 I l
0 200 400 600 800 1000
0.5 1 I I 1
0042 : Parallelism 9H,“,waggzxzdgrizizz: ::::::::ﬁ2........fﬁ :
003:5; _ ,..-d;::::::===3““""' ..
0.25 - ,. "if: .......... o ........ (.> O 0 0 <> <> -
02 '- <>o~°.'.'-.":::'a ...... O .....0 C7552 J
0-15 ‘ .- 1:" + C1908 d
0.1 ,_ a ..... _
0.03 ;.-==*"' 1 1 INCL. 013155 . -
0 200 400 600 800 ' 1000
70 1 1 r I ".9?
60 -Execution Times (in seconds) ....... Q """" _
50 - Q ......... Q"
. .0.. C7552 Q .........
:3 7+ 813g: 9 ..... <> ------- a ...... 1:1 ......... 1:1 -
b0 “'""“ I ................ "
20 h . "”0 ......... $‘~::::::$=========$= """"" Q """"" + + +
10 Emmggﬁiiigi:$3333Q: .........
0 " L 1 1 L 1
0 200 400 600 800 1000

Number of Input vectorss

Figure 5.7: Conservative logic simulation with SCAW = 4

 

I I I I
0.8 X Simulation Cycle Ratio (=Cw) -
0 6 _ .....O C7552 _
' ,g +.. 01908
0.4 - ~33, ...D.. 01355 _
0 2 - «alumina ....... a
0 l ' om.....:?33:::::::$muum¢m 1m..¢m”¢.....?
5 10 15 20
4 Gui I I I
3.5 r- 8 """"" Q ---------- QRollback Frequency Ratio (=Rw) 1
3 __ . :2 .............. Q ...... <> <> . ...»... ‘
2.5 ~§ "51 ...... .55.. 875520 _
2 j ""121 ........ £1 .....+ 81908 _
3 :1. ....................................... a ........ D" 1355
1.5 :3: + + ""‘+ -------- 2 ....... +.."..Zg” ...... Q ........ ('3 ...... .q] ‘1
1 i 1 1 L T T -
5 10 15 20
6 T A I I
5 _ ..... O --------- é V 0 o O ...... .0 _
0"” 0.. C 7552
4 - O ..... " .....+ C1908 "
3 _ 13.. 01355 _
2 1. Q “$3333: , ....... Q ..... 1,? ........ E ........ 1? ....... (p ....... E _
1 gill-mewmw Parallelism Ratio (=Pw) _
I I I I
5 10 15 20
1.1 I 1 1 T
1 g. .....0 C7552 " ..+ _
2.2: .. . +.. 01908 ”.9 ...... + ......
0-9 ”8.13.. 01355.. ........ 453;: ..... .. ..n _
0.8 - 1:232:31. ......... + --------- 4"” .---""I..m ....... 1:1» ------ ‘3 """" -
0.7 1 ‘9"5"“:332II::::::8::::::::8‘“""“e """"" ‘
0-6 ' Execution Time Ratio (=Ew) ‘
0.5 I I I I
5 10 15 20

83

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

SAAW Size (=w)

(a)

(b)

(C)

(d)

Figure 5.8: Advanced optimistic simulation with different SAAW (1)

84

advancement windows has much higher rollback frequency than traditional optimistic
simulation.

As shown in Figure 5.8.(c), parallelism increases, since more events are involved
in event evaluation at each simulation cycle as SAAW size increases. But parallelism
decreases ﬁnally, since the number of active gates will become smaller for a large
SAAW. With a large window size, the gate is more likely to consume its entire queues,
and therefore be idle on the next cycle. For example, in the worst case, if the SAAW
size is equal to the number of input vectors, parallelism will be extremely small.

In Figure 5.8.(d), the execution time of optimistic simulation with the proposed
technique decreases initially for the following reasons. First, the number of simulation
cycles decreases and only one rollback manipulation is required in a simulation cycle
even though SAAW size increases. Second, queue manipulations are fast on circular
binary search event queues. Finally, rollback manipulation is very simple, therefore
high rollback frequency does not affect the execution time that much. But the execu-
tion time finally increases as the SAAW size increases, since the number simulation
cycles converges to a certain point and each simulation cycle has more advancement
cycles.

Figure 5.9 shows the performance measured in adjusted parallelism, and aver-
age number of advancements per simulation cycle with 1,000 randomly generated
input vectors. Adjusted parallelism increases until SAAW size reaches 10, but the
adjusted parallelism decreases since the number of idle advancement cycles increases
beyond that window size. The average number of advancements per simulation cycle

increases.

5.3.3 Performance on Circuits with Feedback

Fast simulation on circuits with feedback loops is one of the major issues in logic sim-
ulation. We measured the performance of conservative logic simulation with SCAW
to see how the proposed window concepts affects the performance for sequential cir-
cuits. As test circuits, 89234, 813207, and 815850 are used which have D ﬂip-ﬂops
and feedback loops.

85

 

 

 

 

 

 

 

 

 

0.05? j ¢ 0 I I
0-45 h. -------- 3:222:15:22331::3fffj;.¢ ......... 6.222222353332322; 2:21;;agggggi‘6 -
0.4 PB .............. 0 q
0033 -... 9 Adjusted Parallelism ‘
0.2'5 - ...<>.. 07552 -
0.2 ”,0 .....+ C1908 ‘
001? F" 1 ...CLl C1355 1 1 ‘
' 5 10 15 20
10 l 1 I .
8 _ Average Advancements a:::;;:$;;;;;;;$2:313:75 '1
6 r- 88""..3 _'
4 * ....Qggsssiif """ ...<>.. 07552 «
2 _ ...-""3"”. .....+ C1908 _
mast?” ...D.. 01355
0 13.3%"1 1 L 1
5 10 15 20

SAAW Size (=w)

Figure 5.9: Advanced optimistic simulation with different SAAW (2)

Figure 5.10 shows the performance for the three test circuits as a function of
SCAW size. According to the graphs in the ﬁgure, as the SCAW size increases,
the number of simulation cycles decreases little. Parallelism decreases and execution
times increase. There is no beneﬁt to using the advancement window for circuits
with feedback loops since advancement windows cannot work well for gates which
have feedback inputs. The reason is that the gates with feedback inputs cannot.
advance their simulation clocks until they get input signals from feedback loops. How
to efficiently use advancement windows on circuits with feedback is left for further
study.

In addition, the performance of optimistic simulation with SAAW for sequential
circuits was measured as shown in Figure 5.11. Depending on the benchmark circuits,
the performance in the number of simulation cycles, parallelism, rollback frequency,

and execution times, is different.

86

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

1165 I I ' ' '
0 '- . I A Id
1 'th'mmmttit:::::;¢::::::.......:::::::II'$ ................. (n) 9315850 ¥ Q} ..J
095 _ .. _1
0 9 _ .....+ 313207 _
0.85 _ TD, 1:; ....ClQ 89234 .. 1:] ﬁ
0.8 - _
0.75 - Simulation Cycle Ratio (=Cw) -1

0.7 l l 1 L 1

l 2 3 4 5 6

1 "Kg... I j . . I I
0ng : ...... 0" Parallel-ismjﬂatlo (=Pw) :
0.85 - + ----- 55¢. ...................... o... T + 1
0.8 F El ..................... Q ...................... 4) a
0.75 '- ------------ —1
0.7 - ...<>.. 815850 '9 .................. ‘
0062 r; ”...+ 513207 ""0 ...................... Q ........... :

0.5-5 ... D" 89234 ........... D

0.5 J l 1 l L

1 2 3 4 5 6

4 I l l 1 7
3.5 - Execution Time Ratio (=Ew) .....;:::==‘Q —
3 '- ...O.. 815850 "...Qﬁzgﬁiz zzzzzz —1
2 5 1. .....+ 813207 _:::::;;;;::Z;:: ------ B ...................... E] d

'2 :1. 39234 Mixgggzgggggsﬁ ......................

15 r ......:$“ifiﬁifﬁffffiﬁ::::::§ ------- -

l “52:33:: ........... l l l 1

l 2 3 4 5 6

SCAW Size (=w)

Figure 5.10: Advanced conservative simulation for sequential circuits

87

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

I I I I I I-
2 Simulation Cycle Ratio (=Cw)
1.5 - -<> 0 O.
................. ....O. 89234
1 ~53: ...... +.. 315850 _
mmm:::::::::::::; ........ . . . D. . S35932
0 5 _ ”33$ ......................................... 5'1 I?“
' T
0 I I I I I I
l 1 5 2 2 5 3 3.5 4
4.5 I I I I I I
4 . . _ (3..
3.5 — Pmud’sm Rm" (_ ...) <> -
3 r ..................... ,..<>.. $9234 —
2.5 - ..... ..<> """ ...+.. 815850 «
2 ~ .............. ...D.. S35932 -
1'5 h ............................ I11 fﬁ $—
1 ................... a. _
l I I I I I
1 1.5 2 2.5 3 3.5 4
5 l 1 1 , I
4.2 _ 0.. 89234 Rollback Frequency Ratio (—R,,,) ;
3.5 — .....+ 815850 ................................... £1 5‘
3 - .....D S3593_2---D"" 9 <>‘
2.5 - ...................... _
2 - ........................ q
15 i- ........................................... + +—
1 $2: ------------ —
0.5 _ 1 1 1 1 1 1
1 1.5 2 2.5 3 3.5 4
4 5 _ I I I I I
'4 - Execution Time Ratio (=Ew) -------------------- _
3.5 h ........ O ................ .-
2g _ "-0 .............................. MO 89234 :
2 - ..................... +.. 815850 -
1? ' ...................... 13.. 335932 55
0.5 _ 1 ............ T 1 I? 1 T
l 1.5 2 2.5 3 3.5 4

SAAW Size (=w)

Figure 5.11: Advanced optimistic simulation for sequential circuits

 

88

5.3.4 Communication Costs According to Message Length

The performance of logic simulation using advancement windows will be different
depending on the architecture of the parallel machine used for the simulation. Simu-
lation with the windows can be done more efficiently in MIMD than SIMD environ-
ments. In SIMD machines, the longest time among all gates for event evaluations and
multi-event propagation dominates a simulation cycle time, since all processors are
synchronized. In MIMD machines, however, each processor can perform event evalu-
ation and propagate multi-events at its own pace without affecting other processors.

The big advantage of using multi-events is to reduce the communication costs
by grouping multiple events into a multi-event. In parallel processing environments,
each message between processors contains some additional information, such as packet
header and tail.

Figure 5.12 shows the communication cost increase ratio on several parallel ma-
chines as message lengths increase, where the cost increase ratio is deﬁned‘as the ratio
of communication time for a 4n byte message to communication time for a 4 byte
message [9, 23, 44, 73]. In the ﬁgure, separate sending means that a single event is
sent in a message, as in traditional distributed event-driven simulation. In this case,
communication cost is exactly obtained by multiplying the number of events sent and
time per event. In the ﬁgure, the CM-2 and the MP-l are massively parallel SIMD
machines and the others are MIMD machines.

We can achieve good performance in logic simulation with advancement windows
on MIMD machines for the following reasons. First, according to Figure 5.12, com—
munication costs on modern parallel machines such as the N CUBE-2 can be signif-
icantly saved by sending long messages rarely, rather than sending short massages
frequently. Moreover, in SIMD processing environments, there might be many idle
processes which have smaller advancement windows than the predetermined maxi-
mum advancement window due to synchronization of all processors. In contrast, in
MIMD processing environments, there might not be many idle processes for smaller

advancement windows.

 

L

89

 

 

 

 

25 I I T 1 9,7
0.. separate sending (>9.
+.. CM-2- ,9"
20 ~ r1. NCUBE-l 0.0 . -
x.. MP-l Q"
...A. BBN 0.0" . a
15 _ .....* NCUBE-2 g- "$55 -
Communication 0.9 ,ﬁ'ﬁ' .
Cost Ratio __<> mwﬁ;
10 — .9 it} {lid} . T
0 . -:
,9 ﬁlli'd} war-3‘ X x
9'9 It!“ : x...><-" "'xﬂ'x
5 _ O 0 53:3 a} xmx x...x---><"'XNXM A _
.‘BEB , . .A-A-A-
.EB . . ,A.A.A-AAA
“AAAA
ggéﬁ ééééé****+**+*++++a
20 4O 60 80 100

Number of Bytes

Figure 5.12: Communication costs according to message lengths

5.4 Conclusions

We have proposed new efﬁcient logic simulation approaches with advancement win-
dows to enhance local clock advancement for fast simulation in parallel processing
environments. According to experimental results for some combinational benchmark
circuits on the CM-2, both conservative and optimistic logic simulation protocols with
advancement windows and multi-events require fewer simulation cycles and achieve
higher parallelism than traditional conservative and optimistic simulation protocols.
In addition, execution times also are smaller for some window sizes. Simulation with
advancement windows is very promising when there are many input vectors, because
average events per cycle will be almost at the maximum during steady state.

The good performance would be enhanced even more in MIMD environments than
the SIMD environment of the CM—2. The technique can be applied to improve the
performance of other simulation schemes, such as YADDES [74].

We can use the proposed technique to enhance local clock advancements, while a

90

global moving time window can be simultaneously applied to prevent queue overﬂow.
That is, advancement windows make local clocks go ahead as fast as possible, but any
local clock too far beyond a global clock is not allowed to advance to prevent queue
overﬂow.

As future studies, we aim to implement conservative and optimistic simulation
approaches with advancement windows and multi-events on MIMD machines and to
compare with the performance obtained on a SIMD machine. In addition, we are
going to continue research on how we can efﬁciently use advancement windows and

multi-events for circuits with feedback loops.

 

Chapter 6

Distributed Token-Driven Logic

Simulation

In this chapter, we present a distributed token-driven technique, which uses the con-
cept of a dataﬂow network [68], for zero-delay logic simulation. Each signal is as-
sociated with a token. Tokens are used to synchronize all signals driven from the
same input vector. A gate executes signals based on the readiness of tokens or
comparison of token identifiers. Zero-delay logic simulation verifies functional cor-
rectness, not timing. The technique was implemented on the CM-2, and the BBN
TC-2000, a shared memory MIMD machine. Experimental results indicate that the
distributed token-driven technique outperforms both distributed event-driven simula-
tion and compiled-code simulation for some benchmark digital circuits on the CM-2.
A higher degree of parallelism can be achieved than with other compiled-code sim—
ulation techniques. With a reasonable partitioning scheme on the BBN TC-2000,
the technique gives experimental results comparable to those obtained by a similar
algorithm running on the CM-2.

The remainder of this chapter is organized as follows. Background and prior
work for token-driven simulation is given in Section 6.1. Section 6.2 presents basic
concepts of distributed token-driven logic simulation and explains how to efficiently
apply the technique to parallel logic simulation on massively parallel SIMD machines.
Section 6.3 presents performance evaluation for some benchmark circuits on the CM-
2. In Section 6.4, the performance of token-driven simulation on a shared memory

MIMD machine is presented.

91

 

92

6.1 Introduction

Due to the large amount of time required to simulate circuits at the gate level, much ef-
fort has been focused on ﬁnding more efﬁcient simulation methodologies. Distributed
event-driven simulation techniques have been studied for parallel logic simulation be-
cause they have the ability to easily handle asynchronous designs and timing analysis
in parallel processing environments. Traditionally, digital logic simulation has been
accomplished using the event-driven approach [14, 25, 35]. However, it has been
shown that as much as 26 times more events are generated than necessary [70]. Re-
cently, however, as the demands for maintainability, testability, portability, and high
speed increase, synchronous circuits have been preferred for VLSI applications [69].
In an attempt to eliminate these extra events, techniques based on a concept known
as compiled-code logic simulation have been proposed [16, 39, 69, 70].

Compiled-code logic simulation usually begins with a levelization process, which
arranges the gates in levels, based on the minimum distance to the primary inputs.
Once the levels are identiﬁed, then the logic at each level is compiled into executable
code. The output vector for each input vector can be obtained by executing the
code for each level sequentially from the ﬁrst level to the last level. This method is
faster than the event-driven strategy because there is no need to maintain a global
event heap. LECSIM [70] is a simulator that incorporates both the compiled-code
logic simulation approach and the event-driven approach. Another approach to logic
simulation is the demand-driven approach [60]. In this methodology gates are only
evaluated when their outputs are demanded. This eliminates many of the unnecessary
events which were mentioned earlier. The main disadvantage of this approach is the
overhead incurred by a recursive backtracking routine. 7

As discussed in [70], intermediate events in unit-delay simulation are used for
detecting timing anomalies like hazards and race conditions. As the complexity of
synchronous circuits increases, however, testing the functional behavior of a circuit is
done before performing timing analysis. Zero-delay logic simulation is usually used for

such functional testing. We are concerned with testing only the functional behavior

 

93

of the circuit.

6.2 Token-Driven Simulation

In conventional distributed event-driven simulation, each simulation output from a
process is carried by an event, which is timestamped with simulation time at which
it should be executed. The term token used in this chapter is conceptually different
from an event. In distributed token-driven simulation, a token carries an output
signal from a process without being timestamped. An identiﬁer may be assigned to
each token, if necessary, to synchronize messages for correct simulation.

In distributed event-driven simulation on a massively parallel SIMD machine, each
process repeats the same procedure which consists of event selection, event evaluation,
new event propagation and queue manipulation, as deﬁned in Chapter 2. The time
period to perform such a procedure is called a simulation cycle [25]. Each simulation
cycle is synchronized in SIMD environments. In a similar way, the simulation cycle
can be deﬁned in the distributed token-driven simulation by substituting event with
taken in the above procedure.

Depending on the type of circuit to be simulated, the distributed token-driven
technique for zero-delay logic simulation can be divided into two schemes: without
identiﬁers and with identiﬁers. The scheme without token identiﬁers is the most
efﬁcient, but works only for combinational circuits, while the scheme with identiﬁers
works for sequential circuits with feedback loops.

Let us consider the scheme without identiﬁers. An active port at a gate is deﬁned
to be a port which is supposed to receive tokens during simulation. Each active port
has its own queue to store tokens. During every simulation cycle, the following steps

will be performed:

Algorithm TOKEN-DRIV EN SIMULATION

1. For all gates, the gate is chosen as an active gate if each active port has at least

one token.

94

2. Each active gate produces an output signal from input signals of the tokens.

3. Each active gate propagates a token with the output signal to all the successors of

the gate.
4. Each gate which receives new tokens puts them into its corresponding queues.

Since simulation without identiﬁers is performed for a combinational circuit, the
simulation terminates only if all the active ports in the simulation network have empty
queues. There are no intermediate output signals produced from simulation. Hence,
it is very easy to recognize the resulting vectors without output control. 9

The intuition behind the token-driven approach is quite simple [26]. Instead of
evaluating gates based on the occurrence of a previous scheduled timestamped event,
gate evaluation is triggered by the presence of at least one token in each of the gate’s
input queues. Although the token—driven approach and the event-driven approach
appear to be quite similar, they are, in actuality, quite different. In the event-driven
approach a gate is evaluated whenever an event occurs which references the gate. The
occurrence of an event is based on its timestamp. In logic simulation, a single signal
is usually associated with each event.

Tokens are used to synchronize all the messages from the same input vector.
When a massively parallel SIMD machine is used as a target machine, the number of
simulation cycles [25] required can be predicted and is smaller than any distributed
event-driven simulation techniques.

Although a token may be augmented with an identiﬁer to aid in signal synchro-
nization, a timestamp is not needed. In the token-driven approach, gate evaluation
occurs whenever a token is ready at every input port of the gate. By eliminating
timestamps, we also eliminate the non-trivial problem of maintaining global time
over multiple processes. This is a key factor in the performance improvements re-
alized by the token-driven approach. An explicit timing mechanism can be ignored
because only the functional behavior of the circuit is being tested. That is, given
the input vectors, determine what the corresponding output vectors are. Timing

anomalies such as hazards and race conditions are not considered. The token-driven

 

95

approach also differs from the levelized compiled-code techniques. In the levelized
compiled-code techniques the circuit description is compiled directly into code. Input

vectors are then evaluated by executing the code once for each vector.

6.2.1 Additional Considerations

There are many factors which affect the performance of the proposed token-driven

simulation technique. The remainder of this section will discuss these factors.

Token Size

In most distributed event-driven simulation techniques, events, each with a digital
signal, traverse a simulation network model for a circuit under simulation. On a
parallel machine which has processors and slow communication, the communication
overhead may cause signiﬁcant degradation of the simulation performance. To cope
with the overhead in the distributed token-driven simulation technique, digital signals
are encapsulated into tokens. A token may contain one or more signal values. In this
case, both the number of simulation cycles and the simulation time can be signiﬁcantly
reduced.

As the number of input vectors increases, good performance can be obtained if the
token size increases accordingly. However, the performance goes down if the token
size exceeds a certain limit since there is a trade-off between communication time
reduction and evaluation delay. The optimal token size for the best performance

depends on both the critical path length of a circuit and the number of input vectors.

Simultaneous Signal Evaluations

In distributed event-driven simulation on massively parallel machines, each process
has a function table which contains each output signal for all possible input signal
combinations according to the function of the process [25]. Only signals carried by
events with the same timestamp can be evaluated at a process during each simulation

cycle since only one signal is associated with an event. In other words, signals for at

a. '2‘." ' I I

 

96

most one input vector are involved with signal evaluation.

In the proposed distributed token-driven simulation technique, however, each ac-
tive gate can simultaneously evaluate as many signals as carried by a token. In other
words, signals produced from several input vectors can be simultaneously evaluated
at a process because a token carries more than one signal and each output signal is
independent of previous output signals.

A function table is modiﬁed for simultaneous signal evaluation in the distributed
token-driven simulation technique. Let m and n be the number of input ports at a
process and the token size in bits, respectively. Since each input port is associated
with a circular queue to store tokens, a process needs m queues, Q0, Q1,... Qm_1.
Let t,,- be the j-th token in Q,. In addition, bpq, is deﬁned as the r-th input signal in
token tpq. In this case, the ﬁrst tokens to (i = 0..m — 1) in each queue contain the

following input signals:

too (boon, boon 5002, .., bOO.n—1)a
110 (5100, 5101, 5102, .., blO,n-1)a

tin—1,0 (bm-1,00a I’m-1.019 "y bm—1,0,n—l)-

Here, let 3 be the number of input vectors which can be involved with simulta-
neous signal evaluation. In other words, all signals derived from 3 consecutive input
vectors can be simultaneously evaluated at a process if each input port of the process
has a token. The ﬁrst 5 signals of the ﬁrst tokens in each queue are involved with
simultaneous signal evaluation using a function table lookup operation. A function
table contains 3 output signals in each element. An index into the function table for

the table lookup operation is deﬁned to be the concatenation of the following bits:

bm-l,0,s-1 a bm—2,0,a-1 9 “a boos-1,
bm—1,0,s—21 bm-2,0,s-2a "a boas—23

bin-1,003 bin-2.003 "a b000

97

We now discuss the data structure for a function table, and how each output
signal of an element of the function table must be computed. The function table is
a two-dimensional array. An element of the table is deﬁned to be a binary value f,-,-,
( i=0,1,2,..,(2("‘”) — l) and j = s - 1,.., 1,0). For an index i of a function table, 3

elements (to be stored), s,o,s,1,..,s,,,_1 are computed as follows:

1. index i is converted into a (m =4: s)-bit binary number, say

am-1,a-1 yam-2,8-l 3“9a0,5-l?

am-1,1 aam-2,13"9a0,19

am-1,030m-2.0’°°3a0g0'

2. fij is computed by applying the gate function with bits am-1,j,am-2,j,..,ao,,-,
where the gate function indicates a digital function of the gate like AND, OR,

etc.

Given an index i in a table lookup operation, output signals f,-,- (j = s — 1,s —
2, ..., 1,0) come out from the table.

Figure 6.1 shows the data structures for queues and a function table where s = 3
and m = 3 at a process. For best performance, the token size should be a multiple of
s so that each signal evaluation is involved with 3 input vectors. The simultaneous
signal evaluation strategy can be optionally used depending on the target machine.
This strategy is recommended for parallel machines whose array access time is less

than its functional access time.

6.2.2 Simulation on Circuits with Feedback Loops

The proposed distributed token-driven simulation technique can be applied to any
circuit with feedback loops if each loop contains at least one ﬂip-ﬂop. The feedback
loop can then be broken at the ﬂip-ﬂops by treating ﬂip-ﬂop inputs as primary outputs

and their outputs as primary inputs, as is done in unit-delay compiled simulation [52].

98

 

 

 

 

 

 

 

 

 

 

 

_L _l_ index.‘i222.?1°3'i.°9%’2°1‘11211’2°1’322°?2°.‘3°99
I I
0 £02 5 f0] 5 foo
: 1 1 1 f12 : fll : f10
I I
: : : 2 f22 l f21 l f20
! 9 ' l l
: : ' : :
t21 I11 I01 . . l . .
. t t . . I l l l l l
4&2. iv: . = - 5 -
® ‘ '
a) Queues at process D b) A function table at D
Figure 6.1: An example for simultaneous evaluation

 

 

 

In the distributed token-driven technique with identiﬁers, simulation ends when a
predetermined identiﬁer limit is reached.

A scheme using tokens with identiﬁers, which can simulate sequential circuits, is
more general than the scheme without identiﬁers. During every simulation cycle for

synchronous sequential circuits, the following steps will be performed:

Algorithm SIMULATION ON SEQUENTIAL CIRCUITS

1. For all gates, a gate is chosen as an active gate if each active port has at least one

token.

2. Each active gate produces an output signal from input signals of tokens with the

same identiﬁer.

3. Each active gate propagates a token with both the output signal and an identifier

to all the successors of the gate.
4. Each gate which receives new tokens puts them into its corresponding queues.

In the second step of the above algorithm, all active ports must have tokens with

 

99

the same identiﬁer. All ﬂip-ﬂops should generate a token with the same identiﬁer as
the ﬁrst input vector at the beginning of simulation. Figure 6.2 shows a 0-1 detector
with a D flip-flOp, an AND gate, and an inverter, as a sample sequential circuit, which
detects all the 0—1 sequences from input vectors. At the begining of simulation, the
D ﬂip-ﬂop generates a token with identiﬁer 0. For each input token to gate A from
primary input vector, the D ﬂip-ﬂop generates the same identiﬁer as that of the token.
For example, for input token with identiﬁer 1 and signal 0, the D ﬂip-ﬂop generates

the token with identiﬁer 1 from the previous input vector.

 

 

 

 

fo‘
_/_J

a) 0-1 detector

 

 

 

 

 

 

 

 

1 r— —
0 __4 _ .
3 2 1 O 3 2 1 0
21) input vectors b) output vectors

Figure 6.2: Simulation on circuits with feedback loops

 

 

 

Before starting simulation, it is very important to determine how an identiﬁer is
assigned to a token. Numbers used as token identiﬁers must be non-decreasing. For
example, timestamps of input vectors can be used as token identiﬁers. A flip-ﬂop can
be considered a gate in logic simulation. In step 3, the assignment of a new identiﬁer
is based on the function of the active gate. For example, a normal gate like AND
or OR does not change the identiﬁer, while a new identiﬁer at a D ﬂip-ﬂop is set to
the current identiﬁer plus the clock interval. Asynchronous designs need more careful
generation and assignment of identiﬁers than synchronous designs since asynchronous

circuits are complicated.

 

100

The above procedure may not end since there might be some unprocessed tokens
forever. To ﬂush the unprocessed tokens at the end of the simulation, a token with
inﬁnity as its identiﬁer is added to the end of input at each input gate. In this special

case, a token with inﬁnity as its identiﬁer is considered to match any other token.

6.3 Performance Evaluation on the CM—2

The scheme for logic simulation proposed in [25] was used to implement the dis-
tributed token-driven simulation technique. The following data structures are con-
sidered. First, each gate in a circuit being simulated is assigned to a processor in
the target machine. Second, each input port of a gate is associated with a circular
FIFO queue to store tokens sent to the port since each port receives tokens in pro-
cessing order. Each element of the queue has a link ﬁeld to point to the next element;
All queues belonging to a process must be contained in a processor. Finally, event
evaluation is performed based on a table lookup operation as in [25].

The distributed token-driven simulation technique without identiﬁers was imple-
mented on the CM-2 with 32K processors. A 32 x 32 array multiplier and ISCAS’85
and ISCAS’89 benchmark circuits [12, 13] are used as test circuits for the performance
evaluation.

One of the advantages of distributed token-driven simulation is that the number

of simulation cycles can be predicted. We can make the following observations.

Observation 1 Suppose that token size 1 is used. Let T and N be the critical path
length (in nodes) of a test circuit and the number of input vectors for the simulation,

respectively. The number of simulation cycles for the simulation is exactly
T + N — 1.
The above number can be obtained since no unnecessary messages (or events) are
involved in simulation at all.

Observation 2 When a token contains more than one signal, the number of

simulation cycles is computed as

101

T + r%1 - 1
where S is a token size in signals.

Comparing experimental results of token-driven simulation to those in Chapter 3,
token-driven logic simulation runs much faster than traditional event-driven simu-
lation techniques. There are several reasons for the good performance. First, the
number of simulation cycles is much less than in any other distributed event-driven
simulation techniques. Second, communication costs are reduced since the size of a
token is smaller than that of an event. In addition, grouping consecutive output sig-
nals into a token reduces communication costs. Third, queue manipulation for tokens
is much simpler than for events. There are no unnecessary events during simulation.
Finally, a simulation cycle is very simple.

Since there are no parallel zero-delay simulation schemes in existence, the experi-
mental results of compiled-code simulation are considered for comparison. Table 6.1
shows the comparison of the simulation techniques for some ISCAS’85 benchmark
circuits with 5,000 randomly generated input vectors. As a unit of measurement, CM
time is used which is the time it takes to execute parallel instructions on the sequencer
of the CM-2. In the measurement, only the time period for simulation is considered,
not the time required for reading vectors, printing output, or translating the circuit
description. LECSIM was measured on a SUN 3/260 [70]. The experimental results
of both the PC-set and the parallel technique in [52] are shown in the table. The
experimental results for token-driven simulation were obtained with token size of 21,
i.e. each token contains 21 signals. According to the experimental results in the table,
distributed token-driven simulation runs much faster than any other compiled logic
simulation technique. Since the execution is pipelined, the execution times are pro-
portional to the number of input vectors rather than the size of test circuits for a large
number of input vectors. The execution time of token-driven simulation with up to
10,000 randomly generated input vectors is given in Table 6.2. In the measurements,
a 32-bit array multiplier with 8256 gates and ISCAS’85/C6288 are considered as test

circuits.

102

 

Test 1' LECSIM PC-set Parallel Token-

 

 

 

 

 

 

 

 

 

Circuits (SUN3 / 260) driven
C1355 309 84.9 9.8 2.8
C1908 500 162.7 54.3 3.0
C6288 1487 1757.3 369.3 4.6

 

 

 

Table 6.1: Performance comparisons (in seconds)

 

No. input
vectors 2100 4200 8400 10000

C6288 3.59 4.99 7.77 83?
Mul-32 3.38 4.70 7.16 8.10

 

 

 

 

 

 

 

 

 

 

 

Table 6.2: Execution times (in seconds)

We now compare the proposed distributed token-driven simulation with tradi-
tional distributed event-driven simulation. Output control is not required in dis-
tributed token-driven simulation since no unnecessary messages occur. Deadlock
never occurs during simulation since each active gate unconditionally propagates a
token to all the successors of the gate. Determining active gates based on the selective-
trace approach does not cause any rollback and state-saving overhead as required in
Time Warp. The computation of a global clock is not necessary because each gate
works based on the readiness of tokens in its queues independently of processing of
other gates. One of the great advantages of the distributed token-driven simulation
technique is that queue manipulation is simple. In addition, the size of each queue is
relatively small due to smaller token sizes compared with event sizes.

Depending on the token size, the execution time may vary since the size affects
the number of simulation cycles, communication costs, and signal evaluation time.
Figure 6.3 shows the performance as a function of token size. Token sizes 21, 53, and
117 are used for C7552 with up to 10,000 input vectors. The number of simulation
cycles increases based on Observation 2. That is, token-driven simulation with larger

token size requires smaller number of simulation cycles and larger communication cost

 

 

 

103

 

 

 

 

 

 

 

 

600 I I I T I I I I I

500 _ No. of Simulation Cycles <> .......... ’9 q

0.. Size 21 .0 ..........
400 - .....‘I" Size 53 O ........ _
C1. Size 117 "Q ........

300 - O O ...... ﬁ
...................... +

200 *- .0 ,,,,,,,,,,,,, +.."...MI- -------- + _
MO ............ 5 """"" + .......... g ......... a .......... a .......... Q. .......... E]

100 :::::::::::::['..IZZIZZII..$ .......... 1 .......... cl; ......... 1 1 1 I L g

1000 2000 3000 4000 5000 6000 7000 8000 900010000
9 I I I T I I r T r
8 i 1: ......... m

7 ”Execution Times (in seconds) ...... E} """""

6 I 13.-.. ....m---' o q

5 ’— ..... D """"""" Q .......... 09+ —1

4 b -------- 9 """"" Q ..... ...,Q:::ZZIIILI.. .......... 4" ..

3 '—'" 45 .......... 41> """""" 4)....in Size 21 d

2 "' ..... $..:;;;::::-®" ......... +.. Size 53 -‘

1 £333 ......... C1,. Size117 _
0 l J l l l l l I l

H

000 2000 3000 4000 5000 6000 7000 8000 900010000
Number of input vectors

Figure 6.3: Performance according to token size

for each token. For a large number of input vectors, simulation with token size 53 has
the best performance. Simulation with token size 21 requires more simulation cycles,
while simulation with token size 117 has higher communication costs compared with
other cases. In the CM-2, the most efﬁcient token size is at least 21, since the optimal
length of each element in an event queue for the best performance is a multiple of 32
bits and a part of an element is used as a link ﬁeld in the queue.

We also measured the performance of token-driven simulation on sequential cir-
cuits with feedback loops. In the circuit with feedback, we cannot use the encap-
sulation of signals into a token. In other words, a token has a single signal and an
identiﬁer. In this case, the number of simulation cycles is much larger. Since event-
driven simulation propagates events only when the output signal is different from

the previous output, token-driven logic simulation has many more simulation cycles

104

 

 

 

 

 

 

 

 

 

 

Test Simulation Execution
Circuits Cycle Ratio Time Ratio
89234 2.39 2.1
813207 2.00 1.77
_Sl5850 2.86 2.55 __

 

 

Table 6.3: Performance comparisons to conservative logic simulation

than distributed event-driven simulation. Table 6.3 shows the performance compar-
ison to conservative simulation with null messages. The second column contains the
ratio of the number of simulation cycles for token-driven simulation to the number
of simulation cycles for conservative logic simulation. The third column contains the
corresponding ratio for execution time. According to the table, token-driven simu-
lation runs much slower than conservative logic simulation for the given sequential
circuits. This indicates that token-driven simulation has no advantages over event-
driven simulation in circuits with feedback loops. However, token-driven simulation

is easier to implement on a SIMD machine than event-driven simulation.

6.4 Experimentation on a Shared Memory Mul-
tiprocessor

Engelsma and we investigated the performance on a shared memory MIMD ma-
chine [31]. As was the case for the massively parallel SIMD algorithm, experimental
results indicate that good performance can be obtained. Moreover, if we use a sim-
ple non-random partitioning scheme, performance comparable to that realized by the
massively parallel SIMD algorithm can be obtained.

The token-driven simulation algorithm for MIMD machines is very similar to the
algorithm used on SIMD machines. The only major difference is that, on MIMD
machines, the processors will use a round-robin or time-slice mechanism to distribute
simulation cycles among its set of assigned gates.

In order to obtain good performance with the token-driven approach in a shared

 

 

 

105

memory multiprocessor environment, data structures must be strategically dis-
tributed across the memories. For each gate, a simple data structure is allocated
in the memory of the processor which has been assigned to the gate. This structure
contains a gate type, pointers to the input queues of the gate’s successor gates, and
pointers to its own input queues. All gate input queues reside in the same memory
that the gate structure resides in. A gate’s output queues are another gate’s input
queues, which means that the output queue may be in a remote memory, or a local
memory, depending on which processor the successor gate has been assigned to. Thus,
the token queues can be thought of as mail boxes where processors can deposit tokens
for another processor, or receive tokens from another processor. Each queue has a set

of lock variables associated with them, which are used to ensure data consistency.

6.4.1 Performance Evaluation on the BBN TC-2000

The algorithm was implemented and run on a 44 node BBN TC-2000 Butterﬂy [31].
The circuits tested include 16x16 bit array multipliers, ISCAS’85/C1355, C1908, and
C6288 [13], all of which are combinational logic circuits.

The most interesting results are the speedup curves which are illustrated in Figure
6.4. The speedup measurements are based on simulation runs on a 16x16 bit array
multiplier using the consecutive gate partitioning scheme. In all cases except that of
l, 000 inputs, considerable speedup is obtained by using between 1 and 8 processors.
When 10, 000 input vectors were used, the speedup curve is still relatively steep. This
indicates that increasing the number of inputs will result in greater speedup. On the
other hand, using less then 5, 000 inputs resulted in inverse linear speedup after 8
processors. This is due to the overhead involved with generating the extra processes on
the additional processors. Allowing for a longer steady state run time allows recovery
from the initialization overhead. The maximum speedup attained for this circuit was
approximately 8.5 by using 32 nodes and 10, 000 input vectors. As the number of
input vectors increases, we expect that the speedup can be improved as shown in
Figure 6.4. From the ﬁgure, we can relate the Gustafson’s scaled speedup [38]. In

other words, the parallel portion of token—driven simulation is scaled up linearly with

106

 

 

9 I I. l I I I
.....0 10000 inputs .............. <>
8 T..+.. 5000 inputs .......................... ..
13.. 3000 inputs ,0 """"""" 4
7 7.. x.. 1000 inputs-”f "..+
6 ~ , ........................................ ﬂ
5 — .9" .................. 4r” _
Speedup 5+ ........
4 .. [J ...... Q g .7
5‘: ............ x .............. -r
2 pg...)( ............. x ..............................
1 E3 ......................... x ..1

 

 

 

5 10 15 20 25 30 35

Number of processors

Figure 6.4: Speedup obtained for various numbers of input vectors

the number of input vectors, and the speedup increases linearly with the number of
input vectors.

The simple consecutive gate partitioning scheme was implemented and compared
to a random partitioning scheme. The experimental results indicate that by employing
a very straight forward partitioning scheme, such as the simple consecutive gate
scheme, performance is signiﬁcantly improved. Figure 6.5 gives the speedup curves for
16 bit multipliers. For this circuit, the consecutive gate partitioning scheme reduced
the execution time by approximately half. The maximum obtained speedups were 8.5
and 4.8 for 10, 000 input vectors.

Representing signals as actual bits, and then using 31 bit logical operations to
evaluate multiple sets of signals at a gate in parallel, resulted in signiﬁcant perfor-
mance increases. Figure 6.6 shows the difference in raw execution time of the bit
packing approach compared to representing a single logical signal in an 8 bit byte. In
general, speedup was not effected signiﬁcantly. In addition to reducing the execution
time, bit packing also allowed more input vectors to be processed, since less memory

was required per signal.

 

107

 

 

 

 

: I I I I ' ................. ‘ ..... o
7 - ....‘Q ............................ _
6 ’- .1
Speedup 5 I O ........................... + —
4 _ .......... + ............................ 4
3 7 .9" + ............ + _
2 2),;- ............ <> consecuﬁve ﬂ
1 rah-"4‘ ,_,,,+ Random -
0 l l 1 1 l l

5 10 15 20 25 30 35

Number of processors

Figure 6.5: Speedup comparison for consecutive and random circuit partitioning

 

 

 

80 I I I I I I
70 - Q, a
60 - .....O No bit packing ~
. 50 — +. With bit packing -
Time _ _
in Seconds OR.
30 ” ""~<>. ...... 0 ‘
.20 __ ............................................................. 0‘
IO ”m _L_‘
""11. 1 1 I T
O I T l T lJ l l l

 

 

 

L
5 10 15 20 25 30

Number of processors

Figure 6.6: Bit packing comparisons

 

108

 

 

[F Circuits ] BBN TC-2000 I CM-2 [I

 

 

 

 

 

 

8 bit multiplier 1.4 2.2
16 bit multiplier 4.3 2.7
C1355 2.9 2.9
C1908 3.6 3.1
C6288 6.2 5.3

 

 

 

 

 

 

 

Table 6.4: Performance comparison on MIMD and SIMD machines (in seconds)

The experimental results show that the MIMD implementation of the token-driven
simulation approach delivers performance comparable to the SIMD implementation.
The BBN TC-2000 and the CM—2 are considered as target systems for the MIMD
and SIMD implementations, respectively. The BBN TC-2000 can support up to 256
processors, while the CM-2 Supports up to 64K processing elements. It must be noted
that, in these experiments, the TC-2000 used had only 44 processors. Table 6.4 lists
the execution times (excluding input/output) of the various test circuits with 6,000
input vectors for both the MIMD and the SIMD implementations. With bit packing,
each token carries 31 bits and 21 bits for the MIMD and SIMD implementations, re-
spectively. Currently we can simulate the circuits with only 6,000 randomly generated
input vectors due to the limit of total memory size.

Table 6.4 lists the execution times of the various test circuits with 6,000 input
vectors for both the MIMD and the SIMD implementations. In the measurement,
the execution time includes only the time period taken for simulation, but excludes
time for reading vectors, printing output, or translating the circuit description.

The performance of token—driven simulation in MIMD environments depends on
both the length of critical path [10] and the number of gates. On the CM-2, the
number of processing elements is large enough to allocate one processing element for
each gate for the benchmark tests. As the number of gates increases, the execution
time on the CM-2 does not increase as much as on the BBN. On the BBN, however,
the execution time increases since the number of available processing elements is much

smaller than the CM-2.

 

109

6.5 Conclusions

The distributed token-driven simulation presented herein is a new efficient parallel
zero-delay logic simulation technique in massively parallel processing environments,
which integrates the advantages of distributed event-driven simulation and compiled-
code simulation. The experimental results demonstrate that the presented technique
in this chapter outperforms both traditional distributed event-driven simulation and
compiled-code simulation when we consider only functional simulation rather than
timing simulation.

The token—driven technique for zero-delay logic simulation has the following ad—
vantages: (1) high parallelism is achieved, (2) compared with distributed event-driven
simulation, output control as well as unnecessary intermediate messages are not
needed, and (3) a levelization procedure is not required.

A token-driven approach for parallel logic simulation on shared-memory multipro-
cessor machines has been presented. Using a simple partitioning scheme, experimental
results indicate that signiﬁcant speedup can be obtained for large circuits with a large
number of inputs. According to the experimental results given in the above tables,
distributed token-driven logic simulation is much faster, in both SIMD and MIMD
environments, than any other compiled-code logic simulation techniques [52, 70].

As future work, we aim to investigate the optimal token size for a given critical
path and the number of input vectors, and measure the performance of the distributed
token-driven simulation technique with identiﬁers for sequential circuits with feedback
loops based on the technique proposed in [26]. The optimal token size may vary for
different circuits, possibly depending on the circuit topology. Further studies will
consider the extension of the proposed distributed token-driven simulation technique

to circuit designs with multi-delay.

 

Chapter 7

Conclusions and Future Research

In this thesis, we studied the performance of parallel logic simulation in massively
parallel SIMD processing environments. The main objective was to model and im-
prove the performance of gate-level logic simulation on SIMD environments, using
distributed discrete event-driven simulation or complied-code simulation. We com-
pared the performance of logic simulation based on parallel processing architectures,
such as SIMD and MIMD. We found that massively parallel SIMD machines have
many advantages over MIMD machines for logic simulation.

Our studies of parallel logic simulation in massively parallel SIMD environments

involved

0 experimental studies to evaluate the performance of parallel logic simulation
using distributed event-driven simulation. The performance was evaluated with
respect to number of simulation cycles, parallelism, maximum queue size, and
execution times. Our experimental results provided an understanding of the

performance of parallel logic simulation

0 experimental analysis to evaluate the effect of clock advancement on perfor-
mance of existing event-driven logic simulation protocols and ways to enhance

clock advancements for better performance.

0 probabilistic analysis to model and estimate the performance of parallel logic
simulation as the ratio of the number of gates to the number of processors
increases. The estimation was done in terms of number of simulation cycles,
parallelism, and execution times. The estimated performance was compared
with experimental results for three distributed logic simulation protocols, syn-

chronous simulation, conservative simulation, and optimistic simulation.

110

111

e parallelization of a sequential simulation technique for fast zero-delay or unit-
delay simulation to improve its performance. Our studies provided an efficient

parallel version of compiled—code simulation.

We summarize the main contribution of our thesis now.

7.1 Summary and Major Contributions

This thesis presented broad studies of parallel logic simulation on massively parallel
SIMD processing environments. We regard the followings as our contributions:

1. So far, most parallel logic simulation using discrete event-driven approaches
have been implemented on MIMD machines. Three major distributed simulation
protocols, synchronous simulation, conservative simulation, and optimistic simula-
tion, have received the most attention. We examined how logic simulation can be
done efficiently in massively parallel SIMD processing environments. Performance
evaluations were given in terms of the number of simulation cycles, maximum queue
size, parallelism, and execution times. We investigated efficient queue structures in
each simulation protocol and proposed a new cancellation technique for optimistic
simulation which is good for SIMD processing environments.

2. We found out that, despite theoretical arguments, optimistic simulation is
not necessarily the best technique on massively parallel SIMD machines because of
its inherent rollback and queue management overhead. Experimental results showed
that optimistic simulation is much slower than conservative logic simulation, even
though the number of simulation cycles is slightly smaller than conservative simu-
lation. The reason is that optimistic simulation requires more time per simulation
cycle. In addition, we observed that, in contrast to MIMD environments, conservative
simulation with null messages is very fast on massively parallel SIMD machines.

3. We compared and analyzed the performance of massively parallel SIMD
machine architectures for parallel logic simulation. The CM-2 and the MP-l were
target systems for the comparison. Experimental analysis was performed in terms

of related machine instructions, local memory sizes, communication schemes, and

112

virtuality. The experimental results showed that the MP-l is two to three times
faster than the CM-2.

4. To reduce the rollback frequency and space overhead, the moving time win-
dow concept was applied to optimistic simulation. Use of this technique is necessary
to prevent queue overﬂow on massively parallel SIMD machines. We found out that
we can also use the technique in conservative simulation to avoid queue overﬂow.

5. We considered the assignment of multiple gates to a processor to increase
both parallelism and system utilization. This is necessary since the number of phys-
ical processors is limited even though the circuits to be simulated may have a huge
number of gates. On the current massively parallel SIMD machines, such as the CM-
2 and the MP-l, only one gate is assigned to a processor since local memory size is
small. According to announcements from supercomputer manufacturing companies,
massively parallel SIMD machines with a large number of processors and large local
memories will be available in the near future. In this thesis, we have proposed a statis-
tical model to estimate the performance of parallel logic simulation when a processor
of a massively parallel SIMD machine contains more than one gate. The performance
was predicted in terms of the number of simulation cycles, parallelism, and execution
times. We showed how to estimate the performance when the performance with one
gate per processor is given. The predicted performance was compared with simula-
tion results. The comparison showed that the performance estimation agrees with the
simulation results for some simulation protocols. We contributed to the prediction of
performance for a large circuit on future large massively parallel SIMD machines.

6. In traditional distributed event-driven simulation, processes communicate
by sending events. An event is timestamped with a virtual time and a digital signal.
Only one local clock advancement is allowed. In this thesis, we proposed new event
evaluation and propagation techniques in both conservative and optimistic simula-
tions to improve (or facilitate) clock advancements and reduce communication costs.
In both techniques, an advancement window is ﬁrst determined and then all events
within the window are executed and propagated. Experimental results showed that

the number of simulation cycles can be signiﬁcantly reduced and that the techniques

. 1.2.137

 

 

113

also works well on MIMD processing environments. The technique can be applied to
improve the performance of other simulation schemes, such as YADDES.

7. So far, only compiled-code simulation with zero-delay or unit-delay has been
used to check the functional correctness of a digital circuit in VLSI design. Compiled-
code simulation works only in sequential processing environments. Distributed token-
driven simulation has been proposed for fast zero-delay logic simulation on paral-
lel processing environments. Experimental results showed the proposed distributed
token-driven simulation on massively parallel SIMD machines works much faster than
compiled-code simulation. Analysis also showed that the distributed token-driven
technique works well on MIMD machines if appropriate circuit partitioning schemes

are used.

7 .2 Future Research

In this section, we present some topics for future studies based on the research given

in this thesis.

Strategies for Better Performance

We can consider the following strategies for better performance on a massively par-
allel SIMD machine. First, to reduce both required memory size and event queue
manipulation time, we can investigate and compare efﬁcient data structures for logic
simulations on a speciﬁc machine, such as the CM-2 and the MP-l, based on its ar-
chitecture. Second, to cope with the congestion problem on the MP-l, assignment of
logic gates to processors is an important factor in optimizing performance. Finally,

partitioning a circuit is also important for the efﬁcient use of parallel virtuality.

Combination of Both Conservative and Optimistic Simulation Protocols

Conservative simulation allows each process (gate) to execute events as long as roll-

back is avoided, while optimistic simulation allows each process to evaluate events

 

'J‘L.’ '

114

even if rollback may occur. The above techniques each have their own inherent draw-
backs, which are mentioned in Chapter 1. As future research, hybrid techniques
could be developed. A process may be allowed to do either conservative or optimistic
simulation.

We can consider two kinds of decision procedures for how a process selects one of
the two simulation techniques. First, before simulation starts, one of the simulation
techniques is selected for each process based on its characteristics, such as critical
path length from input gates. Alternatively, a process may obtain the information
for the decision during simulation.

In addition, instead of using a global moving time window for all active gates,
we may apply local windows to each gate based on the information obtained during

simulation.

Logic Simulation on Feedback Loops

Efficient simulation of feedback loops is one of the major issues in VLSI design.
We consider use of the proposed simulation techniques on feedback loops. Of course,
token-driven logic simulation and event-driven simulation with advancement windows

also should be considered for the efﬁcient simulation of feedback loops.

Extended Simulation Results

We used the CM-2 and the MP-l as our target systems in this thesis. According
to the manufacturer of the CM-2, Thinking Machines Corporation, CM-5 will be
released in the near future. We have heard that the CM-5 is much more powerful
than the CM-2. We hope to obtain more extended experimental results on the new
machine to compare with results obtained on the CM-2, to analyze the effects of
system architecture and performance.

We have claimed that our proposed simulation techniques, both conservative logic
simulation with conservative advancement window and optimistic simulation with
aggressive advancement window, will give give good performance in MIMD processing

environments. We hope to analyze the performance of both proposed techniques on

 

115

MIMD machines, such as the BBN TC-2000 and the NCUBE-2.

Extension to Other Levels of Simulation

We can consider several levels of simulation in VLSI design. It could be interesting
to study how to extend the parallel simulation techniques proposed for gate-level
logic simulation to simulations in other levels, such as circuit level simulation and

behavioral simulation.

Generalized Performance Model for Logic Simulation in SIMD

Several performance models of parallel logic simulation as a synchronized iterative
algorithm have been proposed and compared with experimental results on MIMD
machines. As future research, a generalized performance model of parallel logic sim-
ulation in an SIMD environment could be developed. We found out that, among
three distributed event-driven logic simulation techniques, optimistic logic simula—
tion has the smallest number of simulation cycles and the highest parallelism , while
synchronous simulation has the largest number of simulation cycles and lowest paral-

lelism. We did not propose any performance model to explain the simulation results.

More Accurate Performance Model for the Gate-to-Processor Ratio

We assumed steady state conditions in the model to predict the performance of logic
simulation as the gate-to-processor ratio increases. To get more accurate performance
predictions, the exact active ratio of gates should be statistically computed for dif-
ferent gate-to—processor ratios. Gates as well as processors were assumed to work
independently in the model. We could ﬁnd some relationship for accurate modeling.
Reﬁnement of our model remains as future research.

In the model, we assumed that the time taken for a simulation cycle is independent
of GP ratio. But, for accurate estimation, we need to consider some factors which
affect the simulation cycle time. For example, depending on the machines used, the
number of required send operations might be different. In addition, as the GP ratio

increases, the time for LVT computation, event insertion, etc. may increase.

 

116

As the gate-to—processor ratio increases, each processor contains more gates. That
is, memory space allocated to a gate decreases. We cannot implement simulation tech-
niques on current massively parallel SIMD machines to get accurate experimental
results for large gate-to-processor ratio since the machines have memory size prob—
lems. When a massively parallel SIMD machine with enough local memory becomes

available, actual execution times on the machine can be measured.

Parallel VHDL Simulator

In [17], we have studied a parallel VHDL simulator using parallel logic simulation. The
VHDL description for a circuit is translated into an intermediate form for simulation.
The intermediate form is simulated with discrete event simulation techniques, such
as Time Warp or the Chandy-Misra algorithm, on the Connection Machine. Signal
assignment statements in the VHDL description are transformed into gate processors,
and component instantiations are handled by their expansion into a combination of
basic elements. As future study, we consider more extensive and practical low level
interfaces for a parallel VHDL simulator.

We plan to include behavioral descriptions and to extend the set of available data
types. We also plan to evaluate the use of other paradigms available for parallel
VHDL simulation, such as process-oriented Time Warp schemes and the Chandy-
Misra algorithm. Note that our simulation scheme, which decomposes the circuits
into basic gate elements and runs the simulation speciﬁcation on a parallel machine,
can also be applied to other types of parallel machines such as the BBN Butterﬂy, a
shared memory MIMD machine. Hence performance on other machine architectures

will also be explored.

Bibliography

Bibliography

[1] VHDL Language Reference Manual, 1987.

[2] M. Abramovici, Y. H. Levendel, and P. R. Menon. A logic simulation ma-
chine. In Proceedings of the 19th Design Automation Conference, pages 65—73.
ACM/IEEE, 1982.

[3] V. D. Agrawal and S. T. Chakradha. Performance estimation in a mas-
sively parallel system. In Proceedings of the Supercomputing ’90, pages 306—313.
ACM/IEEE, November 1990. '

[4] J. R. Agre. Simulations of time warp distributed simulations. In Proceedings of
the SCS Multiconference on Distributed Simulation, pages 85—90, March 1989.

[5] G. S. Almasi and A. Gottlieb. Highly Parallel Computing, pages 301-351.
Addison-Wesley, 1989.

[6] R. Ayani. A parallel simulation scheme based on the distance between objects.
In Proceedings of the SCS Multiconference on Distributed Simulation, pages 113—
118, March 1989.

[7] R. Baldwin, M. J. Chung, and Y. Chung. Overlapping window algorithm for
computing GVT in Time Warp. In Proceedings of the 11th International Con-
ference on Distributed Computing Systems, pages 534—541. IEEE, May 1991.

[8] D. Ball and S. Hoyt. The adaptive Time Warp concurrency control algorithm.
In Proceedings of the SCS Multiconference on Distributed Simulation, pages 174—
177, January 1990.

[9] BBN Advanced Computers Inc. BBN GP1000 Switch Tutorial, March 1989.

[10] O. Berry and D. R. Jefferson. Critical path analysis of distributed simulation. In
Proceedings of the SCS Multiconference on Distributed Simulation, pages 57-60,
January 1985. '

[11] T. Blank. The MasPar MP-l architecture. In Proceedings of the 35th IEEE
COMPCOM Spring 1990, pages 20—24, February 1990.

[12] F. Brglez, D. Bryan, and K. Kozminski. Combinational proﬁles of sequential
benchmark circuits. In Proceedings of International Symposium on Circuits and

Systems, pages 1929-1934. IEEE, May 1989.

117

 

118

[13] F. Brglez, P. Pownall, and R. Hum. Accelerated ATPG and fault grading via
testability analysis. In Proceedings of International Symposium on Circuits and

Systems, pages 695—698. IEEE, June 1985.

[14] J. Briner. Parallel Mixed-Level Simulation of Digital Circuits Using Virtual
Time. PhD thesis, Duke University, 1990.

[15] R. E. Bryant. Data parallel switch-level simulation. In Proceedings of the 1988
International Conference on Computer Aided Design, pages 354-357, 1988.

[16] R. E. Bryant, D. Beatty, K. Brace, K. Cho, and T. Shefﬂer. COSMOS: A com-
piled simulator for MOS circuits. In Proceedings of the 24th Design Automation
Conference, pages 9—16. ACM/IEEE, 1987.

[17] A. D. Cabrera, M. J. Chung, and Y. Chung. A parallel VHDL simulator on

the Connection Machine. Technical report, Department of Computer Science,
Michigan State University, October 1990.

[18] A. Chandak and J. C. Browne. Vectorization of discrete-event simulation. In
Proceedings of the 1983 International Conference on Parallel Processing, pages
359—361, August 1983.

[19] K. M. Chandy and J. Misra. Distributed simulation: A case study in design and
veriﬁcation of distributed programs. IEEE Transactions on Software Engineer-
ing, 5(5):440—452, September 1979.

[20] K. M. Chandy and J. Misra. Asynchronous distributed simulation via a sequence
of parallel computations. Communications of the ACM, 24(11):198-206, April
1981.

[21] K. M. Chandy and J. Misra. Parallel Program Design, A Foundation. Adison-
Wesley, 1988.

[22] K. M. Chandy and R. Sherman. The conditional event approach to distributed

simulation. In Proceedings of the SCS Multiconference on Distributed Simulation,
pages 93-99, March 1989.

[23] E. M. Choi, M. J. Chung, and Y. Chung. Comparisons and analysis of massively
parallel SIMD architectures for parallel logic simulation. In Proceedings of the
Sixth International Parallel Processing Symposium, March 1992.

[24] M. J. Chung and Y. Chung. Data parallel simulation using Time Warp on the
Connection Machine. In Proceedings of the 26th Design Automation Conference,

pages 98-103. ACM/IEEE, June 1989.

[25] M. J. Chung and Y. Chung. Efﬁcient parallel logic simulation techniques for the
Connection Machine. In Proceedings of the Supercomputing ’90, pages 606—614.
ACM/IEEE, November 1990.

 

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

[34]

[35]

[36]

[37]

119

M. J. Chung and Y. Chung. A distributed token-driven technique for parallel
zero-delay logic simulation on massively parallel machines. In Proceedings of

1991 International Conference on Parallel Processing, pages 391-394, August
1991.

M. J. Chung and Y. Chung. An experimental analysis of simulation clock ad-
vancement in parallel logic simulation on an SIMD machine. In Advances in
Parallel and Distributed Simulation, volume 23, pages 125—132, January 1991.

M. J. Chung and Y. Chung. Parallel logic simulation on a massively parallel
SIMD machine. Technical report, Department of Computer Science, Michigan
State University, October 1991.

Y. Chung and M. J. Chung. Time Warp for efﬁcient parallel logic simulation
on a massively parallel SIMD machine. In Proceedings of the Tenth Annual
International Phoenix Conference on Computers and Communications, pages

183-189. IEEE, March 1991.

R. C. De Vries. Reducing null messages in Misra’s distributed discrete event
simulation model. IEEE Transactions on Software Engineering, 16(1):82—91,
January 1990.

J. R. Engelsma, M. J. Chung, and Y. Chung. Distributed token-driven logic
simulation on a shared-memory multiprocessor. In Proceedings of the 6th Work-
shop on Parallel and Distributed Simulation, pages 197—198. ACM/IEEE/SCS,
January 1992.

R. M. Fujimoto. Lookahead in parallel discrete event simulation. In Proceedings
of 1988 International Conference on Parallel Processing, volume 3, pages 34—41,
1988.

R. M. Fujimoto. Performance measurements of distributed simulation strategies.
Transactions of the Society for Computer Simulation, 6.2:89-132, April 1989.

R. M. Fujimoto. Time Warp on a shared memory multiprocessor. In Proceedings

of 1989 International Conference on Parallel Processing, volume 3, pages 242—
249, 1989.

R. M. Fujimoto. Parallel discrete event simulation. Communications of the
ACM, 33(10):30-53, October 1990.

R. L. Geiger, P. E. Allen, and N. R. Strader. VLSI Design Techniques for Analog
and Digital Circuits, pages 909—917. McGraw-Hill, 1990.

J. B. Gilmer. An assessment of Time Warp parallel discrete event simulation
algorithm performance. In Proceedings of the SCS Multiconference on Distributed
Simulation, pages 45-49, July 1988.

 

120

[38] J. Gustafson. Reevaluating Amdahl’s law. Communications of the ACM, 31:523-
533, May 1988.

[39] H. Hansen. Hardware logic simulation by compilation. In Proceedings of the 25th
Design Automation Conference, pages 712—715. ACM/IEEE, 1988.

[40] J. P. Hayes. Computer Architecture and Organization, pages 187-190. McGraw-
Hill, 1978.

[41] D. R. Jefferson. Fast concurrent simulation using the Time Warp mechanism. In
Proceedings of the 1985 Multiconference on Distributed Simulation, pages 63—69.
SCS, January 1985.

[42] D. R. Jefferson. Virtual time. ACM Trans. Programming Languages and Sys-
tems, 7(3):404—425, July 1985.

[43] D. R. Jefferson et al. Implementation of Time Warp on the Caltech Hypercube.
In Proceedings of the SCS Multiconference on Distributed Simulation, pages 70—-
75, January 1985.

[44] Y. Lan. Interprocessor Communication in Distributed Memory Multiprocessors.
PhD thesis, Michigan State University, 1988.

[45] Y. H. Levendel, P. R. Menon, and S. H. Patel. Special-purpose computer for
logic simulation using distributed processing. The Bell System Technical Journal,
61(10):2873-2909, December 1982.

[46] Y. Lin and E. D. Lazowaska. Determining global virtual time in a distributed
simulation. Technical report, 90-01-02, Department of Computer Science, Uni-
versity of Washington, December 1989.

[47] Y. Lin and E. D. Lazowaska. Reducing the state saving overhead for Time
Warp parallel simulation. Technical report, 90-01-03, Department of Computer
Science, University of Washington, February 1990.

[48] Y. Lin, E. D. Lazowaska, and M. L. Bailey. Comparing synchronization proto-
cols for parallel logic-level simulation. In Proceedings of the 1990 International
Conference on Parallel Processing, volume 3, pages 223-227, August 1990.

[49] B. D. Lubachevsky. Efﬁcient distributed event-driven simulations of multiple-
loop networks. Communications of the ACM, 32(1):111-123, January 1989.

[50] V. Madisetti, J. Walrand, and D. Messerschmitt. WOLF: A rollback algorithm
for optimistic distributed simulation systems. In Proceedings of the 1988 Winter
Simulation Conference, December 1988.

[51] MasPar Computer Corporation. MasPar MP—I, July 1990.

 

121

[52] P. M. Maurer and Z. Wang. Techniques for unit-delay compiled simulation.
In Proceedings of the 27th Design Automation Conference, pages 480—484.
ACM/IEEE, 1990.

[53] W. Myers, Editor. Massively parallel systems break through at supercomputing
’90. IEEE Computer, 24(1):121—126, January 1991.

[54] J. K. Peacock, J. W. Wong, and E. G. Manning. Distributed simulation using a
network of processors. Comput. Networks, 3(1):44-—56, February 1979.

[55] D. L. Pitts and D. L. Smith. Central concurrency control for distributed simula-
tions. In Proceedings of 1988 Summer Computer Simulation Conference, 1988.

[56] B. R. Preiss. The Yaddes distributed discrete event simulation speciﬁcation
language and execution environment. In Proceedings of the SCS Multiconference
on Distributed Simulation, pages 139-144, January 1989.

[57] B. R. Preiss. Performance of discrete event simulation on a multiprocessor using
optimistic and conservative synchronization. In Proceedings of the 1990 Inter-
national Conference on Parallel Processing, pages 218—222, August 1990.

[58] D. A. Reed, A. D. Malony, and B. D. McCredie. Parallel discrete event simulation
using shared memory. IEEE Transaction on Software Engineering, 14(4):541—
553, April 1988.

[59] A. E. Ruehli and G. S. Ditlow. Circuit analysis, logic simulation, and design
veriﬁcation for VLSI. Proceedings of the IEEE, 71(1):34—48, January 1983.

[60] S. P. Smith, M. R. Mercer, and B. Brock. Demand driven simulation:BACKSIM.
In Proceedings of the 24th Design Automation Conference, pages 181—187.
ACM/IEEE, 1987.

[61] L. M. Sokol, B. K. Stucky, and V. S. Hwang. MTW: A control mechanism for
parallel discrete simulation. In Proceedings of the 1989 International Conference
on Parallel Processing, volume 3, pages 250—254, August 1989.

[62] L. Soule and T. Blank. Parallel logic simulation on general purpose ma-
chines. In Proceedings of the 25th Design Automation Conference, pages 166—171.
ACM/IEEE, June 1988.

[63] L. Soule and A. Gupta. Characterization of parallelism and deadlocks in dis-
tributed digital logic simulation. In Proceedings of the 26th Design Automation
Conference, pages 81—86. ACM/IEEE, June 1989.

[64] J. S. Steinman. SPEEDEstynchronous parallel environment for emulation and
discrete event simulation. In Advances in Parallel and Distributed Simulation,
volume 23, pages 95-103, January 1991.

[65] R. E. Tarjan and D. D. Sleator. Self-adjusting binary search trees. Journal of
the ACM, 32(3):652—686, July 1985.

122

[66] Thinking Machines Corporation. The Connection Machine System, May 1988.

[67] G. Varghese and T. Lauck. Hashed and hierarchical timing wheels: Data struc-
tures for the efﬁcient implementation of a timer facility. In Proceedings of .
the Eleventh Symposium on Operating System Principles, pages 25—33. ACM,
November 1987.

[68] A. H. Veen. Dataﬂow machine architecture. ACM Computing Surveys,
18(4):365—396, December 1986.

[69] L. Wang et al. SSIM: A software levelized compiled-code simulator. In Proceed-
ings of the 24th Design Automation Conference, pages 2—8. ACM/IEEE, 1987.

[70] Z. Wang and P. M. Maurer. LECSIM: A levelized event driven compiled logic
simulator. In Proceedings of the 27th Design Automation Conference, pages 491-
496. ACM/IEEE, 1990.

[71] D. M. Webber and A. Sanggiovanni-Vincentelli. Circuit simulation on the Con-
nection Machine. In Proceedings of the 24th Design Design Automation Confer-
ence, pages 108—113. ACM/IEEE, June 1987.

[72] N. Weste and K. Eshraghian. Principles of CMOS VLSI design, pages 255—256.
Adison-Wesley, 1985.

[73] H. Xu, P. K. Mckinley, and L. M. Ni. Efﬁcient implementation of barrier syn-
chronization in wormhole routed hypercube multicomputers. Technical report,
MSU-CPS-ACS-47, Department of Computer Science, Michigan State Univer-
sity, October 1991.

[74] M. Yu, S. Ghosh, and E. DeBenedictis. A non- -deadlocking conservative asyn-
chronous distributed discrete event simulation algorithm. In Proceedings of the
SCS Multiconference on Advances in Parallel and Distributed Simulation, pages

39—43. ACM/IEEE/SCS, January 1991.

 

I‘lICl-IIGQ

 

N 57an UNIV. LIBRﬁI‘T’lIES
llllllll][Ill]ll]lllllllUlllllllllllllll]Hl
3 1 009022934

293