This is to certify that the dissertation entitled "Parallel Discrete Event Simulation and Its Application on Logic Simulation" presented by Jinsheng Xu has been accepted towards fulfillment of the requirements for the Doctoral degree in Computer Science.

Major Professor's Signature

Date

PARALLEL DISCRETE EVENT SIMULATION AND ITS APPLICATION ON LOGIC SIMULATION

By

Jinsheng Xu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

2002

ABSTRACT

PARALLEL DISCRETE EVENT SIMULATION AND ITS APPLICATION ON LOGIC SIMULATION

By Jinsheng Xu

The purpose of this PhD dissertation is to investigate new and efficient paradigms for parallel discrete event simulation on massively parallel processors and their application to parallel logic simulation. We first develop performance prediction models of parallel simulation. Using the performance models, we identify critical factors for improving the performance of simulation; among them, we consider load balancing, communication cost, computational granularity, activity rate, and rollback rate. Based on the importance of these factors, we investigate new and improved approaches to parallel simulation. The first approach we consider is an aggressive synchronous algorithm that reduces the number of simulation cycles. This method reduces the communication cost and improves the speedup of the simulation significantly. Next, a look-ahead technique is investigated to reduce the state savings and the event queue size in Time Warp. The technique reduces the number of state savings by skipping state savings that are not necessary. Event queue and state queue sizes can be reduced by discarding unnecessary events and states immediately. Finally, we develop a unifying simulation framework that exploits the advantages of the synchronous and conservative simulation techniques while keeping the optimism. The performance of this approach is always better than Time Warp and comparable to the best-case performance of synchronous and conservative simulation.

DEDICATION

To my grandparents, parents, wife and son.

ACKNOWLEDGMENTS

I wish to thank my adviser, Dr. Chung, for his generous guidance during my Ph.D. study. I learned many things from him. I also would like to thank my committee members, Dr. Esfahanian, Dr. Wojcik and Dr. LePage. I truly enjoyed the classes from Dr. Esfahanian and Dr. Wojcik. With the help of Dr. LePage, I found a serious flaw in one of my papers. I would like to thank my entire family for their support and love. Without them, I could not have come this far.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1. Introduction
  1.1 Background
  1.2 Introduction to Parallel Discrete Event Simulation
    1.2.1 Discrete Event Simulation
    1.2.2 Synchronous Simulation
    1.2.3 Conservative Simulation
    1.2.4 Optimistic Simulation
  1.3 Introduction to Parallel Logic Simulation
  1.4 Contributions and Outline of Thesis
    1.4.1 Object Modeling
    1.4.2 Estimating Communication Parameters
    1.4.3 Synchronous Performance Prediction Model
    1.4.4 Cycle Reducing Synchronous Algorithm
    1.4.5 An Overhead Reducing Technique for Time Warp
    1.4.6 Time Warp Performance Analysis
    1.4.7 Efficiently Unifying Simulation Techniques

Chapter 2. Preparatory Research
  2.1 Object Modeling
    2.1.1 Object Model of the Simulation Engine
    2.1.2 Simulation Engine: Logical Process and Simulation Object
  2.2 Measuring the Communication Cost
    2.2.1 Introduction
    2.2.2 Measuring the Parameters of the LogP Model
    2.2.3 All-to-All Communication
    2.2.4 Minimum Hardware Support for Multicasting?
    2.2.5 Conclusion

Chapter 3. Synchronous Simulation
  3.1 Performance Prediction
    3.1.1 Introduction
    3.1.2 The Model
    3.1.3 Prediction Model due to Load Balancing Factor
    3.1.4 Performance Model
    3.1.5 Improved Synchronous Algorithm
    3.1.6 Conclusion
  3.2 Cycle Reducing Simulation
    3.2.1 Introduction
    3.2.2 Performance Evaluation Model
    3.2.3 Optimistic Synchronous Algorithm
    3.2.4 Experimental Results
    3.2.5 Conclusion
Chapter 4. Optimistic Simulation
  4.1 Reducing the Overhead of Optimistic Simulation
    4.1.1 Introduction
    4.1.2 Algorithm
    4.1.3 Performance Analysis
    4.1.4 Benchmark Results
    4.1.5 Conclusion
  4.2 Performance Analysis
    4.2.1 Introduction
    4.2.2 Communication Model
    4.2.3 Performance Analysis
    4.2.4 Average Dependency Gap
    4.2.5 Conclusion

Chapter 5. Unifying Simulation Techniques
  5.1 Introduction
  5.2 Exploiting Cycle Parallelism
    5.2.1 Cycle Parallelism
    5.2.2 Tweaking the Definition of GVT
    5.2.3 Performance Considerations
  5.3 Exploiting Conservative Parallelism
    5.3.1 Conservative Parallelism
    5.3.2 Related Work
    5.3.3 Performance Considerations
  5.4 Unifying Framework
  5.5 Experimental Results
  5.6 Conclusion

Chapter 6. Summary

BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Result of All-to-All communication for message length 1
Table 3.1 Multiplication factor for different A and P
Table 3.2 Speedup table for different A and P
Table 3.3 Number of executions, cycles and average activity number
Table 3.4 Comparison of theoretical and empirical results
Table 3.5 The comparisons of predicted and experimental maximum speedup
Table 3.6 The characteristics of circuits
Table 4.1 Benchmark results of the algorithm on circuit simulation
Table 4.2 Best speedup comparisons
Table 4.3 TWW+Toverhead of s35932 with various granularities
Table 4.4 AAN and ADG of s35932 and multi32
Table 5.1 Mapping of simulation techniques
Table 5.2 Maximum speedup with up to 16 processors

LIST OF FIGURES

Figure 2.1 Object Models
Figure 2.2 Example of a hierarchical model
Figure 2.3 Logical Process
Figure 2.4 Time diagram for benchmark 1
Figure 2.5 Time diagram for benchmark 2 with X > 2*L+Or+Os
Figure 2.6 Experimental results of Benchmarks 1 and 2
Figure 2.7 Result of Benchmark 3 - No. 1
Figure 2.8 Result of Benchmark 3 - No. 2
Figure 2.9 Result of Benchmark 3 - No. 3
Figure 2.10 Time diagram for Algorithm 1
Figure 2.11 Execution sequence for processor 0
Figure 2.12 Time diagram for Algorithm 2
Figure 3.1 Synchronous algorithm our model is based on
Figure 3.2 Speedup without communication
Figure 3.3 Message overhead
Figure 3.4 Least squares fit to a linear formula
Figure 3.5 PSC TCS vs. SGI Origin 2000
Figure 3.6 The effect of circuit size
Figure 3.7 The effect of granularity
Figure 3.8 Improved synchronous algorithm
Figure 3.9 Benchmark results of improved synchronous algorithm
Figure 3.10 Synchronous and optimistic synchronous simulations
Figure 3.11 The speedup comparison of unit and non-unit delay
Figure 4.1 An illustration of the algorithm (GVT=1)
Figure 4.2 Event e' was received by P' after P' executed e0' and generated e0
Figure 4.3 Event e' was received by P' before P' executed e0' and generated e0
Figure 4.4 Three performance cases of the algorithm (GVT=2)
Figure 4.5 Performance gains from GVT computation (GVT=2)
Figure 4.6 Improvement on the speedup of the algorithm
Figure 4.7 The effect of event packing on s35932 and multi32
Figure 4.8 Performance comparison of s35932 and multi32
Figure 5.1 An LP can switch back and forth among three execution modes
Figure 5.2 Module diagram of simulation kernel
Figure 5.3 The comparison of experimental results

Chapter 1

1. INTRODUCTION

1.1 Background

With the increasing complexity of microelectronics systems, validating various models (or virtual prototypes) becomes critically important. There are two approaches to validating the correctness of a VLSI (Very Large Scale Integration) design: verification techniques and simulation. Of these two approaches, simulation is still the primary tool. Simulation can be used in a hierarchical fashion. At the top level, performance models are used to explore the trade-offs between various hardware components and architectures. Behavioral simulations are used to prove the correctness of the system specification. At a lower level, timing simulation is used to validate timing information such as set-up and hold times, and is the only practical tool available at this level. For large microelectronics systems, simulation has become a very time-consuming and critical part of VLSI design. Typically, simulation constitutes about 80% of the design cycle. VHDL (Very High Speed Integrated Circuit Hardware Description Language) [3] is an IEEE standard hardware description language developed by the DOD. It can be used to describe systems at both the behavioral and structural levels. It can also be used to describe a performance model. It is expected that we need to simulate a system of up to 100,000 VHDL processes to accurately model a state-of-the-art system.
Simulating such a large system can be extremely slow on a sequential machine. Parallel logic simulation has recently attracted a considerable amount of interest. However, most research is restricted to MIMD (Multiple Instruction/Multiple Data) machines with a small number of processors. Moreover, there are only a few benchmark results available based on actual simulation of large circuits. The few existing empirical results are based on MIMD machines with only a small number of processors, typically tens of processors. Thus, the speed-ups attained compared to sequential simulation are severely limited. With the hardware resources available for High Performance Computing, such as the 336-node Intel Paragon, the 400-node IBM SP2 and the 2000-node TCS, parallel simulation on Massively Parallel Processors (MPP) is now feasible and can achieve a speed-up of up to several hundred times compared to sequential simulation. Fast parallel simulation of large VLSI systems enables the designer to validate the correctness of the design. The use of behavioral simulation in an earlier stage of VLSI system design can prove functional correctness. By speeding up performance model simulation, designers can carry out extensive HW/SW trade-off analysis and select the best hardware architecture. Structural level simulation can be used to validate correctness at the logic level and in timing. A fast simulator will allow designers to validate various virtual prototypes and to carry out complete fault coverage analysis. Thus, the design cycle can be shortened, and the number of real prototypes to be constructed can be reduced. Although massively parallel simulation is a promising technique, it is not widely used in validating VLSI designs. One of the main reasons is the performance uncertainty of this technique. This thesis concentrates on investigating new efficient techniques for parallel simulation.

1.2 Introduction to Parallel Discrete Event Simulation

A comprehensive introduction to parallel discrete event simulation can be found in [39]. In this section, we introduce some basic concepts in this field.

1.2.1 Discrete Event Simulation

In time driven discrete simulation, simulated time is advanced in fixed time steps. If events are dispersed over time, time driven simulation is inefficient. In event driven systems, evaluation is driven by the events. The discrete event simulation technique is widely used in VLSI validation, telecommunication, military training and many other areas. In sequential simulation, the simulation engine keeps a list of events sorted by time stamp. At each step the event with the smallest time stamp is retrieved and processed. New events are inserted into the event queue. This process continues until the event queue is empty or the simulation has reached a pre-defined end time. Parallel simulation consists of a collection of Logical Processes (LPs) that communicate by exchanging time stamped events. There are three categories of parallel simulation techniques: synchronous, conservative and optimistic.

1.2.2 Synchronous Simulation

Synchronous simulation is the simplest parallel simulation technique. At each simulation cycle, all processors execute the events with the smallest time stamp. After the execution of these events, all processors exchange new events and decide the next smallest time stamp. This technique is referred to as the global-clock algorithm.
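As a rough illustration, one cycle of the global-clock algorithm might look like the following sketch (an outline invented here, not the dissertation's implementation; executeEventsAt, exchangeEvents and localMinTimestamp are hypothetical placeholders):

#include <mpi.h>

// Hypothetical placeholders standing in for the simulator's internals.
void executeEventsAt(double t) { /* execute all local events stamped t */ }
void exchangeEvents() { /* send and receive newly generated events */ }
double localMinTimestamp() { return 1.0; /* smallest pending time stamp */ }

void synchronousCycle(double& globalClock) {
    executeEventsAt(globalClock);   // step 1: execute current-cycle events
    exchangeEvents();               // step 2: exchange new events
    double localMin = localMinTimestamp();
    // step 3: all processors agree on the next smallest time stamp
    MPI_Allreduce(&localMin, &globalClock, 1, MPI_DOUBLE, MPI_MIN,
                  MPI_COMM_WORLD);
}

The MPI_Allreduce call is also where the synchronization cost discussed next is incurred: every processor must wait for the slowest one in every cycle.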
This approach has very small overhead compared with other, more advanced simulation techniques. However, because the synchronization cost is high, it does not scale well as the number of processors increases.

1.2.3 Conservative Simulation

Conservative simulation is also referred to as the Chandy-Misra-Bryant [13,17] algorithm. Conservative and optimistic simulation techniques are asynchronous algorithms. Unlike synchronous simulation, conservative simulation advances simulation time at the individual logical processes. Conservative simulation preserves the causality of event executions by processing events in strictly non-decreasing time stamp order within a logical process. The safety of an event execution is decided by the calculation of the local virtual time (LVT). Each generated event is accompanied by the LVT of the sending LP. The receiver LP then knows that it will not receive any future event from that sender with a time stamp smaller than the LVT. Each LP can safely execute the events with time stamps smaller than the minimum LVT value over all of its input channels. Deadlock may occur when there is a cycle of blocked LPs, each waiting for its predecessor to update its clock. Null messages are used to avoid deadlock, but they can add significant overhead. Chandy and Misra [18] also proposed a two-phase method to detect and resolve the deadlock. Deadlock can be broken by the observation that the events with the smallest time stamp are always safe to process. Good lookahead is critical for the performance of the conservative algorithm.

1.2.4 Optimistic Simulation

The original optimistic simulation was proposed as the Time Warp algorithm by Jefferson [45]. It can exploit more parallelism than conservative simulation by allowing out-of-order event executions. An event can be executed immediately. Some of these event executions may not be correct and need to be cancelled. A rollback mechanism is used to recover from erroneous executions. When a straggler event arrives, an LP needs to be restored to the state it had before the time stamp of the straggler. This requires frequent state savings. GVT (Global Virtual Time) is used to discard the states that no longer need to stay in memory. GVT is defined as the minimum time stamp of all events in transit or in the input queues of all logical processes. No newly generated event can have a time stamp smaller than GVT. Therefore all saved states with time stamps smaller than GVT, except the most recent of them, can be discarded safely.
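In outline, the rollback and state-discarding mechanisms just described behave as in the following toy sketch (invented here for illustration, not the dissertation's simulator; anti-message handling and the actual state copying are elided):

#include <deque>

struct State { double time; /* a copy of the LP's state would go here */ };

class LP {
public:
    double lvt = 0;                          // local virtual time
    std::deque<State> savedStates;           // ordered by time stamp

    void onEvent(double ts) {
        if (ts < lvt) rollback(ts);          // straggler: undo later work
        savedStates.push_back({ts});         // save state before executing
        lvt = ts;                            // then execute the event
    }
    void rollback(double ts) {
        while (!savedStates.empty() && savedStates.back().time >= ts)
            savedStates.pop_back();          // restore; send anti-messages here
        lvt = savedStates.empty() ? 0 : savedStates.back().time;
    }
    void fossilCollect(double gvt) {
        // keep only the most recent state with time stamp below GVT
        while (savedStates.size() > 1 && savedStates[1].time < gvt)
            savedStates.pop_front();
    }
};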
1.3 Introduction to Parallel Logic Simulation

Simulation is a significant bottleneck in the development of VLSI systems. There have been considerable efforts to accelerate simulation using parallel discrete event simulation techniques. Ashenden et al. [4] investigated the feasibility of synchronous simulation of VHDL models. They concluded that high level models with large event execution granularity are suitable for small scale multicomputers, and that low level models with small event execution granularity are not suitable for parallel execution. They used very small benchmarks, the largest with only around 150 objects; the available parallelism was limited by the model sizes themselves. Soule [70] studied the parallelism of digital logic simulation. For the synchronous parallel simulation method, the element evaluation concurrency ranged from 156 to 275. For the conservative method, the concurrency ranged from 42 to 196. Soule and Gupta [72] evaluated the Chandy-Misra algorithm for digital logic simulation. They found that the overheads of the Chandy-Misra algorithm overwhelm its advantages, and its performance was about three times slower than the traditional parallel event-driven algorithm. Konas [47] studied the inherent parallelism of logic circuits with the Event Precedence Graph (EPG) and compared it with synchronous parallelism. For the seven largest ISCAS-89 [11] circuits, the average EPG parallelism ranged from 2600 to 23873 and the average synchronous parallelism ranged from 957 to 13481. Konas and Yew [48] compared the performance of synchronous simulation to the conservative and optimistic algorithms in simulating a synchronous multiprocessor system. Their results show that the synchronous method is considerably faster than both of the asynchronous approaches. Briner [12] implemented optimistic simulation and achieved a speedup of 23 with 32 processors. Konas [47] concluded that the state saving cost has to be reduced significantly for optimistic simulation to perform better than the synchronous algorithm. There is no known implementation that performs consistently well independent of the circuits, and it is not clear which parallel simulation technique is most appropriate for logic simulation. A survey on this topic can be found in Chamberlain's paper [16].

1.4 Contributions and Outline of Thesis

1.4.1 Object Modeling

We developed an object model for parallel simulation. This object model describes the behavior of the systems to be simulated independently from the simulation engines. This model enables object-oriented modeling of complex systems. The modeler is able to model the objects in stepwise refinement. Unlike many modeling techniques that use a new modeling language, with this modeling technique the modeler can efficiently model the objects with our C++ class libraries. Our object modeling technique is described in chapter two of this thesis.

1.4.2 Estimating Communication Parameters

The LogP model [31] captures the major parameters of a distributed memory multiprocessor. We propose a technique that measures the parameters of the LogP model. These parameters are measured by carrying out several benchmarks. All the benchmarks are executed on an IBM SP2 using the MPI (Message Passing Interface) [57] library. The results show high overhead and relatively small hardware latency. Exploiting these characteristics, an All-to-All [57] communication algorithm, which requires a much smaller number of steps, is proposed. The better result of this algorithm over MPI's MPI_Alltoall() implementation confirms that the software overhead is high on the SP2. The results of this research give us an idea of how significant the communication cost is for parallel simulation and how to reduce the communication overhead. In the second part of chapter two, we give a detailed explanation of this topic.

1.4.3 Synchronous Performance Prediction Model

We propose a model to predict the performance of synchronous logic simulation. This model considers factors including load balancing, event execution granularity and communication cost. We derive a single formula that predicts the performance of synchronous simulation. We have benchmarked several ISCAS [10,11] circuits on an SGI Origin 2000. The benchmark results show that the prediction model explains more than 90% of the parallel simulation execution time. We propose a formula that predicts the upper limit of the achievable speedup. We found that the maximum speedup achievable is proportional to the square root of the average number of active objects per cycle.
This model can be used to predict the speed-up expected for synchronous simulation, and to decide whether it is worthwhile to use synchronous simulation before actually implementing it. We also propose an improved synchronous algorithm that is able to improve the load-balancing factor. The benchmark results show that the improved algorithm is able to reduce the load-balancing factor as well as the number of simulation cycles. We elaborate on this research in the first part of chapter three.

1.4.4 Cycle Reducing Synchronous Algorithm

Logic simulation has very small computation granularity. The communication cost is the significant factor that limits the speedup of synchronous logic simulation. The benchmark results showed that the performance improvement from load balancing of synchronous simulation is very limited. To significantly accelerate the performance of synchronous simulation, we have to reduce the communication cost significantly. Based on the prediction model we developed previously, we propose an improved prediction model that incorporates the number of communication cycles. We found that the number of simulation cycles can be reduced by pre-executing safe or unsafe events that would otherwise be executed in later cycles. Experimental results show that pre-executing only safe events cannot reduce the number of simulation cycles significantly. We therefore propose an optimistic synchronous algorithm that executes unsafe events optimistically. Experimental results show that this approach significantly reduces the number of communication cycles while keeping the rollback rate within a small range. The performance is significantly better than that of the traditional synchronous algorithm. Detailed research on this topic can be found in the second part of chapter three.

1.4.5 An Overhead Reducing Technique for Time Warp

We introduce a technique that reduces the number of state savings and the event queue size of Time Warp [27]. By reducing the state savings and the sizes of event queues, we decrease the overhead of Time Warp and the maximum memory requirement. We exploit the look-ahead technique to get a lower bound on the time stamp of the next event and determine whether an event is safe to execute. If an event execution is safe, no state saving is carried out. The saved states that have time stamps smaller than this lower bound can be safely discarded without waiting for the updated value of Global Virtual Time (GVT). We prove that the technique is correct under both the aggressive and the lazy cancellation schemes. This technique can be implemented with minimal additional overhead. Benchmark results on logic simulation show that the mechanism can reduce the number of state savings and the maximum memory size significantly. We conducted experimental studies on logic simulation. The results show a significant improvement in the number of state savings as well as in the maximum state queue size. We also observed an improvement in the execution time of the simulation caused by the smaller number of state savings. Our benchmarks show that the proposed technique improves memory usage by 18-85% and maximum speed-up by 5-30%. The detailed research on this topic is discussed in chapter four of this thesis.

1.4.6 Time Warp Performance Analysis

We analyze the performance of Time Warp simulation. We break down the performance of Time Warp into three parts: computation, communication and idle time. We propose the Average Dependency Gap (ADG), a measure of the parallelism that can be exploited by the Time Warp algorithm.
Compared with critical path analysis, this measure gives the value of a more practical parallelism. The detailed research on this topic is given in the second part of chapter four.

1.4.7 Efficiently Unifying Simulation Techniques

We study a unifying technique that is able to exploit the advantages of synchronous, conservative and optimistic simulation [86]. The unifying framework is based on the optimistic simulation technique. The optimized optimistic simulation exploits synchronous parallelism by tweaking the definition of GVT. Conservative parallelism is exploited by embedding lookahead computation into the optimistic simulation. In the unified framework, a logical process may execute in synchronous mode, conservative mode or optimistic mode. Synchronous mode and conservative mode executions have smaller overhead than optimistic mode execution because state saving can be skipped and all saved states can be discarded. An LP can switch back and forth among the synchronous, conservative and optimistic modes. An LP is eligible to execute in synchronous mode when its local virtual time is equal to GVT. An LP is eligible to execute in conservative mode when its local virtual time is smaller than the lower bound on the time stamps of future events the LP may receive. An LP executes in optimistic mode when it does not satisfy the conditions for synchronous or conservative mode. In optimistic mode the current state needs to be saved because a future rollback is possible. The simulation kernel consists of a scheduler, a communication manager and a GVT manager. The scheduler decides which LP to execute and in which mode to execute it. The GVT manager is responsible for computing the value of GVT. The frequency of GVT computation is decided by the scheduler. The communication manager is responsible for sending and receiving messages from other processors. The scheduler decides how many events are buffered before events are sent. In optimized optimistic simulation, the scheduler gives priority to events that can be executed in either synchronous or conservative mode. When no synchronous mode or conservative mode execution exists, the scheduler picks an LP to execute in optimistic mode. When large cycle parallelism exists, the optimized optimistic simulation will execute in a low overhead mode and will perform as well as synchronous simulation. The same holds for the case of conservative simulation. When neither synchronous nor conservative simulation performs well, the optimized optimistic simulation can exploit the parallelism of optimistic simulation. Selecting a good GVT computation frequency is critical for exploiting the cycle parallelism. The frequency of GVT updates should be adaptive. Frequent GVT computation can help LPs skip state saving by exploiting more parallelism, but it will increase the communication cost of the system. Infrequent GVT computation has smaller communication cost but exploits less cycle parallelism. The overhead caused by frequent GVT computation should not be larger than the event execution cost. On the other hand, GVT computation need not be frequent if there exists large cycle parallelism. We benchmarked our unified framework on the Terascale Computing System at the Pittsburgh Supercomputing Center. We found that when the number of processors is small, the performance of the unified approach is close to synchronous simulation and much better than optimistic simulation. When the number of processors is large, the performance of the unified approach is significantly better than both the synchronous algorithm and the optimistic simulation.
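The mode eligibility rules above amount to a simple test per LP; schematically (with names invented here, not the kernel's actual interface):

enum class Mode { Synchronous, Conservative, Optimistic };

// lvt is the time stamp of the LP's next event; safeBound is the lower
// bound on the time stamps of future events the LP may still receive.
Mode selectMode(double lvt, double gvt, double safeBound) {
    if (lvt == gvt) return Mode::Synchronous;        // no rollback possible
    if (lvt < safeBound) return Mode::Conservative;  // provably safe; skip state saving
    return Mode::Optimistic;                         // must save state first
}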
Detailed explanations of this research are given in chapter five of this thesis.

Chapter 2

2. PREPARATORY RESEARCH

2.1 Object Modeling

In practice, object modeling aims to be suitable for the representation and manipulation of the complex objects of engineering design. In most current simulation systems, users must know the parallel platform and the simulation technique to develop a model. As new simulation techniques are invented and new hardware is introduced, the models they developed are not reusable and must be modified. Moreover, the modeling techniques are different for each application domain. There has been much research work on modeling for simulation. Several simulation languages have been proposed, including Simula, GASP, GPSS, CSIM, and Sim++. In general, they can be classified into three different approaches:

- an approach which is limited to a specific application domain,
- a fixed model for discrete event simulation,
- a new simulation language to model the objects.

In these approaches, the application domain is limited and the system is not extensible. Modifying simulation models frequently requires changing the simulation scheme employed. Moreover, adding a new simulation mechanism or modifying data structures may affect the models already developed. For example, the Tyvis simulator is specifically targeted for parallel VHDL simulation using Time Warp. In this approach, all objects are a subclass of Time Warp objects. Therefore, adding a new simulation scheme requires changes to all models already developed. DEVS [88] has been proposed by Zeigler as a model of distributed event simulation. All objects must be modeled under the DEVS formalism. Even though this approach may have the advantage of introducing the formalism, it puts a burden on the modeler in certain application domains. Simulation languages such as Yaddes [63], Maisie [7], SIMA [67] and MOOSE [30,40] have been proposed to aid the user in developing models. These languages are also associated with a fixed, pre-selected set of simulation mechanisms. Yaddes is a distributed event driven specification language that resembles Yacc and Lex. A Yaddes program is translated into a C program which is later linked together with a run time support library. SPEEDES [75] is a C++ based simulation environment developed at the Jet Propulsion Lab which supports sequential, Time Warp [45], and Time Bucket [75] algorithms. In SPEEDES, end users may adjust a predefined set of parameters to improve the speed. Only a few research results have been reported on the modeling and the performance of parallel simulators. Maisie was developed for efficient execution of the simulator. Depending on the hardware platform, the complexity of the model, and the frequency of messages generated, the performance of parallel simulation varies greatly. Most of the parallel simulators discussed above require the end user to provide their own models as the simulation scheme changes. The only exception is Maisie. In Maisie, objects must be defined using the Maisie language. Experience has shown that simulation is evolutionary in nature. While requirements change, the system being simulated also changes. The modeling should be less resistant to such changes, so that the maintenance of the evolving system will be much easier [26].
In this section, we outline our simulation engine, which is developed using an object-oriented and layered approach. Our proposed object modeling technique has the following features.

- It allows users to model complex objects that consist of different types of objects with various levels of complexity.
- It is simple, so that the modeler is able to model the objects in stepwise refinement. The object modeling technique must provide a way of hiding information so that modelers and users are not overwhelmed by the details of the objects.
- It separates models and simulation schemes, so that modelers are able to develop models without knowing the simulation scheme employed.
- It provides a mechanism for parallel programmers to develop library modules and data structures independent of simulation models.

2.1.1 Object Model of the Simulation Engine

Our models are cleanly separated from the execution environment and the parallel platform. There are three types of objects in our proposed modeling: Application Objects, Simulation Objects, and Simulation Scheme Objects. Application Objects model the behavior of objects in the application domain to be simulated. They can be developed by modelers of a specific application domain, and are independent from the simulation schemes and hardware platforms.

[Figure 2.1 Object Models: Simulation Scheme Objects (TW Kernel, CM Kernel, Sync Kernel) developed by the parallel programmer; Application Objects (CPU, MMX, Pentium, Intel 486DX, 166MHz, 200MHz) developed by the model developer; and Simulation Objects (TWObj, CMObj, SyncObj) combining the two through multiple inheritance.]

Simulation Scheme Objects depend on the simulation scheme used, and include all library modules necessary to run the simulator on a parallel platform. They form the core of the simulation engine, and are developed by parallel programmers with expertise in Time Warp and Chandy-Misra schemes [59]. The system to be simulated consists of many Application Objects. An end user selects the simulation scheme to be used, and instantiates Simulation Application Objects (Simulation Objects in short). As shown in Figure 2.1, a Simulation Object inherits behaviors and properties from Application Objects, and parallel run time methods from the Simulation Scheme Objects. For example, the 166MHz CPU and 200MHz CPU inherit their instruction set from the Pentium class object. To simulate a computer system using Time Warp, an end user instantiates an object that inherits its behavior from the 200MHz Pentium object, and its run-time environment and data structures from TWObj, a Time Warp object. The modelers populate the libraries of object models independent of the simulation scheme. The rationale of our approach is to allow the front end user to use the application objects developed by the modeler freely. Our object modeling technique provides freedom to the three parties of the simulation arena: modelers, parallel programmers, and front end users. Parallel programmers (simulation scheme experts) may concentrate on providing various parallel simulation methods. In other words, once the modeler develops Application Objects, the same objects can be simulated with other simulation schemes, without remodeling them for the new simulation scheme. Application Objects are organized using the inheritance mechanism among them. Therefore the modeler, with minimal knowledge of the modeling language and object oriented modeling, is able to create a library of objects.
The front-end user, who simulates the system, mixes and matches the models developed by modelers and the simulation schemes developed by parallel programmers. The simulation scheme class defines the data and methods that each Simulation Object needs to operate within the system. This class can be viewed as the kernel of the simulation, and any simulation domain can use that scheme via its interfaces. Depending on a particular simulation scheme, a simulation scheme object is created. The simulation scheme objects include all the necessary methods for that specific scheme. A Simulation Application Object (Simulation Object) is an instance of a class that is obtained from the integration of the Application Object class and the Simulation Scheme class. It contains the implementation of the methods of the simulation scheme inherited from Simulation Scheme Objects. It inherits its behavior and attributes from Application Objects. For simplicity of implementation and portability, we used the template class approach rather than the multiple inheritance mechanism. A Simulation Object is responsible for simulating a single object and generating events.

The following class definition shows the skeleton of an Application Object. Other instance variables, member functions, and class relationships can be added to the definition. If the parent class of the object has the member functions or instance variables, the object does not need to include these fields. Consider the following example. We want to model an 8-bit ALU in the design of an 8-bit microprocessor. All components of an 8-bit microprocessor share common characteristics, such as design rules, bus width, minimum gate delays etc. Therefore it is a subclass of 8-bit components. Suppose that the behavior of the 8-bit ALU has the following specifications: 8-bit input port and output port, 6-bit control line, and 8-bit addition and subtraction. Later ALU8M can be modeled, which is a special case of ALU8, by providing multiplication and division operations. This specialization can be accomplished by making ALU8M a subclass of ALU8, and inheriting the operations of ALU8M from ALU8.

[Figure 2.2 Example of a hierarchical model: 8-bit components at the root, with Shift Register and ALU8 as subclasses, and ALU8M as a subclass of ALU8.]

Here is the sample code for the ALU object:

class ALU8 {
public:
    // Methods that simulate the behavior of the object
    void add();
    void subtract();
    void exception();
    void executeProcess();
};

class UserState : public BasicState {
    // input and output ports of the object
    InSignal x[8], y[8], opcode[2];  // opcode denotes the operation
    OutSignal z[8];
};

The method executeProcess() simulates the behavior of the object. This is the main part of the modeling, and the modeler just needs to plug the behavior of the object into the class definition, conforming to the names that are declared within the class. In this sample code, we declare two other functions, add() and subtract(), that are used within the executeProcess() method. To create a model of an ALU with more functions, such as multiplication and division, we can simply model it by inheriting from ALU8:

class ALU8M : public ALU8 {
public:
    void multiply();
    void divide();
    void executeProcess();
};

To simulate the ALU8M, the behaviors of the 8-bit ALU such as add, subtract, and exceptions are inherited from the class ALU8. ALU8M has its own functions, multiply and divide, which are not defined in ALU8. Moreover, ALU8M can implement its own methods for ADD and SUB. The sample C++ code is provided as follows:

void ALU8M::executeProcess() {
    switch (opcode) {
    case 0: add(); break;
    case 1: subtract(); break;
    case 2: multiply(); break;
    case 3: divide(); break;
    default: exception(); break;
    }
}
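To combine such an Application Object with a Simulation Scheme, the template class approach mentioned above can be sketched as follows (a hypothetical illustration with invented, simplified names; it is not the actual class library):

#include <iostream>

struct Model {                       // stand-in for an Application Object
    void executeProcess() { std::cout << "model behavior\n"; }
};

template <class AppObject>
class TimeWarpObject {               // a Simulation Scheme wrapper (Time Warp)
public:
    void executeEvent() {
        saveState();                 // scheme-specific overhead
        model.executeProcess();      // behavior supplied by the model
    }
private:
    void saveState() { /* push a state copy for possible rollback */ }
    AppObject model;
};

int main() {
    TimeWarpObject<Model> simObj;    // mix and match model and scheme
    simObj.executeEvent();
    return 0;
}

Replacing TimeWarpObject with, say, a conservative or synchronous wrapper leaves the Model class untouched, which is the point of separating the two class hierarchies.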
2.1.2 Simulation Engine: Logical Process and Simulation Object

Each processor runs an instance of a Logical Process (LP), which is the simulation engine of that processor. The Logical Process is responsible for the global flow control of the simulation.

[Figure 2.3 Logical Process]

It instantiates the simulation objects assigned to that particular processor. During the execution of the simulator, the LP handles the communication of simulation objects via messages, schedules events by selecting the simulation objects to be executed in the next simulation cycle, and computes the Global Virtual Time. The LP object depends on the simulation scheme. It consists of Simulation Scheme Objects. Figure 2.3 shows the LP object organization. The Scheduler schedules events based on a scheduling policy. The Dispatcher reads input messages from the buffer and places events into the event queue of the corresponding simulation object. Messages to be sent are handled by the Collector, which aggregates messages and sends them according to the communication policy.

2.2 Measuring the Communication Cost

2.2.1 Introduction

Several models have been proposed to analyze the complexity of parallel applications. One of the early models is PRAM (Parallel Random Access Machine). It cannot predict the cost of parallel applications accurately because it does not model the architectural features of distributed memory systems, which use messages to access remote data. The LogP model [31] captures the major parameters of a distributed memory multiprocessor. The LogP model separates the overhead and the latency of message passing. Overhead is the time spent by the CPU to transmit or receive a message, while latency is the time for the first bit of a message to arrive at the destination. This separation is important for analyzing asynchronous applications. For asynchronous applications, a shorter overhead is desired even at the cost of a longer latency. The Stanford FLASH multiprocessor [50] has a custom node controller called MAGIC that implements all the data transfers. Because the communication overhead (parameter o) can be off-loaded to MAGIC, occupancy [44] replaces overhead, where occupancy means that the controller is occupied while the processor can continue computation. If the overhead and occupancy times are the same, from the application's point of view the network has a lower overhead and a longer latency. Although controllers are off-loading more communication tasks from the processor, in most currently available multiprocessors the processor still has to perform some message passing tasks. One method of measuring the parameters of the LogP model is to assume the correctness of the model, construct several benchmark programs, and analyze them using the LogP model. From the resulting equations and the experimental results, all the parameters can be decided [32]. In this section, the same technique is used with a much simpler Ping-Pong test that measures the round trip time of a message. All the benchmarks were run on the IBM SP2 machine at the Cornell Theory Center. First, messages of length 0 are tested. The result is quite surprising: most of the communication time on the IBM SP2 is the overhead for sending and receiving messages, while the hardware latency is very small. To test the effect of message length on the overhead, another benchmark is constructed.
As expected, the overhead increases slowly as the message length increases, with a slight jump every 200 to 300 bytes that may be caused by the packetization cost. The long software overhead, together with the usually short latency and high bandwidth, makes another approach to implementing All-to-All communication possible. In All-to-All communication, all processors exchange messages with one another. Currently most All-to-All communication is implemented in software and takes O(n) communication steps for an n-processor All-to-All communication. Since the overhead is not very sensitive to the message length and the bandwidth is usually high, a new All-to-All communication algorithm is proposed later in this section. It has only O(log n) steps, but each step sends an n/2 times longer message. By the results from the SP2 machine, the new algorithm is much better than MPI's MPI_Alltoall() subroutine for short messages. This confirms that the overhead is too large on some parallel computers, including the SP2. Many methods have been proposed to reduce the overhead, including Active Messages [36], which allows user level control over message passing. In the last part of this section the hardware support of multicasting is discussed. A pure software implementation is too costly because there is much unnecessary memory copying. With special hardware that can recognize a multicasting message and send the same message in both directions, one to the local memory and the other to another node, the unnecessary memory copying may be avoided.

2.2.2 Measuring the Parameters of the LogP Model

2.2.2.1 Communication Benchmarks

Two communication benchmarks are constructed. The first one is the pure Ping-Pong test that is used to measure the round trip time of a message. The second benchmark embeds a fixed amount of computation between the send and the receive. All the messages are of fixed length 0 bytes. The details of the benchmarks, time diagrams and analysis are listed below.

Benchmark 1 (Ping-Pong test)

Processor 1:
  Start Timer
  repeat N times
    send a message of size 0 to processor 2
    receive a message of size 0 from processor 2
  Stop Timer

Processor 2:
  repeat N times
    receive a message of size 0 from processor 1
    send a message of size 0 to processor 1

[Figure 2.4 Time diagram for benchmark 1: (a) processor 1 alternates Os, idle, Or; (b) processor 2 alternates idle, Or, Os.]

Analysis using the LogP model: the cost of each iteration is Os+L+Or+Os+L+Or = 2*(Os+Or+L).

Benchmark 2 (Ping-Pong test with computation)

Processor 1:
  Start Timer
  repeat N times
    send a message of size 0 to processor 2
    compute for X time
    receive a message of size 0 from processor 2
  Stop Timer

Processor 2:
  repeat N times
    receive a message of size 0 from processor 1
    send a message of size 0 to processor 1
    compute for X time

[Figure 2.5 Time diagram for benchmark 2 with X > 2*L+Or+Os: (a) processor 1 runs Os, X, Or back to back; (b) processor 2 runs Or, Os, X with the idle time eliminated.]

Analysis using the LogP model: the cost of each iteration is Os+MAX(X, 2*L+Or+Os)+Or. When X < 2*L+Or+Os the cost is the same as in benchmark 1.

2.2.2.2 Computing Os, Or and L

Make X large enough to ensure that X > 2*L+Or+Os. The total execution time for benchmark 2 will be T2 = N*(Os+Or+X). We have

  Os+Or = T2/N - X    (1)

Let T1 be the execution time of benchmark 1. T1 = N*2*(Os+Or+L). We have

  L = T1/(2*N) - (Os+Or)    (2)

2.2.2.3 Experimental Results on the IBM SP2

Benchmarks 1 and 2 are implemented on the IBM SP2 using the MPI library. The program is written in C.
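A minimal version of Benchmark 1 might look like the following (an illustrative sketch, not the original program; the iteration count N is arbitrary):

#include <mpi.h>
#include <cstdio>

#define N 10000                      /* number of iterations (illustrative) */

int main(int argc, char* argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        double t0 = MPI_Wtime();                 /* start timer */
        for (int i = 0; i < N; i++) {
            MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        double t1 = MPI_Wtime();                 /* stop timer */
        /* the per-iteration cost approximates 2*(Os+Or+L) */
        std::printf("time per iteration: %g us\n", (t1 - t0) / N * 1e6);
    } else if (rank == 1) {
        for (int i = 0; i < N; i++) {
            MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}

Benchmark 2 differs only in the block of computation inserted between the send and the receive on each side.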
Figure 2.6 shows the results of Benchmarks 1 and 2. When the computation time X is 0, that is, the pure Ping-Pong test, the time per iteration is around 117 microseconds. X is increased by increasing the number of procedure calls to a floating-point computation subroutine. The time spent on one subroutine call is defined as a unit. At first, when X increases, the execution time per iteration does not increase. After X is increased past a certain amount, the execution time increases almost linearly. The pure computation time also increases linearly in the unit computation. When X is larger than a certain amount, the difference between the two lines is almost constant; that difference is (Os+Or). From the experimental results, we observed that the difference increases very slowly as X increases. This may be caused by more context switching operations due to more interrupts for message passing as X increases. To know the exact reason, more detailed research should be done in the future. In this experiment, (Os+Or) is estimated by computing the arithmetic mean of the differences between the execution time for Benchmark 2 and the pure computation when X is between 35 and 95 units. The result is (Os+Or) = 47 microseconds. The parameter L is then decided by formula (2). The result is L = 11.5 microseconds. During the experiment, we found that the execution time varies significantly even for a fairly large number of iterations. The error of this experiment may not be small.

[Figure 2.6 Experimental results of Benchmarks 1 and 2: time per iteration (microseconds) versus computation time X (units), for Benchmark 2 and for pure computation.]

2.2.2.4 The Effect of Message Length

To test the effect of message length on (Os+Or), benchmark 3 is designed. Starting from a message length of 0, it increases the message length after executing the same code as benchmark 2. The detailed benchmark and results are shown below.

Benchmark 3:

Processor 1:
  For (MessageLength = 0; MessageLength ...

...

Condition 1: A[ti] > 0, for all i. A[ti] = 0 means there is no object active at time ti, and it is not necessary to have such a discrete time ti. S >> A[ti]: this means that the simulation has a wide range of discrete times and does not finish in a few cycles.

Condition 2: Randomness. At each cycle, each object has an equal probability of being active.

Condition 3: Uniform granularity. Each object takes a unit time to execute an event.

3.1.3 Prediction Model due to Load Balancing Factor

In this section we derive the computation cost under the multinomial distribution model and compare it with the benchmark results. We also propose a multiplication factor, which describes the degree of load unbalancing.

3.1.3.1 Total computation cost

The cost of each cycle is the maximum number of active objects over all partitions. We denote the cost of each cycle by C[ti] = max_k A_k[ti], where A_k[ti] is the number of objects active in partition k, and A_1[ti] + ... + A_P[ti] = A[ti]. P is the number of processors the simulation is running on. The cost of the whole simulation is C[t1] + C[t2] + ... + C[tm]. Since we have assumed that each active object has an equal probability of being present in any of the partitions, the A[ti] objects that are active at time ti are randomly distributed over the P partitions. Therefore the cost of each cycle can be estimated by the expected value of the maximum order statistic of the counting variables that give the number of active objects assigned to each of the P partitions. The cost of each cycle is thus a function of A[ti] and P, which we denote E(A[ti], P). A closed formula for the function E(A[ti], P) may not be easy to obtain. Instead of trying to find a closed formula, we can use simulation to evaluate the function with good accuracy.
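For instance, the following small Monte Carlo program (written for illustration here; it is not the dissertation's code) estimates E(A, P) by repeatedly distributing A active objects uniformly at random over P partitions and averaging the maximum partition load:

#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

// Estimate E(A, P) by simulation; the trial count is an arbitrary choice.
double estimateE(int A, int P, int trials = 100000) {
    std::mt19937 gen(12345);                  // fixed seed for repeatability
    std::uniform_int_distribution<int> part(0, P - 1);
    double sumMax = 0.0;
    for (int t = 0; t < trials; ++t) {
        std::vector<int> count(P, 0);
        for (int i = 0; i < A; ++i)
            ++count[part(gen)];               // place one active object
        sumMax += *std::max_element(count.begin(), count.end());
    }
    return sumMax / trials;                   // expected maximum order statistic
}

int main() {
    int A = 100, P = 32;
    double E = estimateE(A, P);
    // M is the multiplication factor defined in the next subsection.
    std::cout << "E(A,P) = " << E << ", M(A,P) = " << A / (E * P) << "\n";
    return 0;
}

With A = 100 and P = 32, for example, this yields an E(A, P) of roughly 7.3, consistent with the multiplication factor of 0.43 in Table 3.1 below.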
The total simulation cost due to the execution of events is then Num_cycles * E(A, P), where A is the average of A[ti].

3.1.3.2 Multiplication Factor

We want to measure the degree of load unbalancing for different average numbers of active objects per cycle. We define the multiplication factor as follows. Suppose that the average number of active objects per cycle of a circuit is A and the circuit is partitioned into P parts. Suppose E(A, P) is the expected value of the maximum order statistic of the counting variables that give the number of objects distributed to the P partitions. We define the multiplication factor M as:

  M(A, P) = A / (E(A, P) * P)

M is a function of A and P. Table 3.1 shows the value of the multiplication factor for different values of A and P. The range of the multiplication factor is from 0 to 1. The multiplication factor shows the efficiency of parallel simulation: a larger multiplication factor means better load balancing. The multiplication factor increases as A increases and as P decreases. Therefore, synchronous parallel simulation will be more efficient with a larger A and a smaller P.

  A \ P        2      4      8     16     32
  1         0.50   0.25   0.13   0.06   0.03
  10        0.80   0.60   0.41   0.27   0.17
  100       0.93   0.83   0.70   0.57   0.43
  1000      0.98   0.94   0.87   0.81   0.72
  10000     0.99   0.98   0.96   0.93   0.89

Table 3.1 Multiplication factor for different A and P

The speedup can then be defined using the multiplication factor as:

  S(A, P) = P * M(A, P)

Table 3.2 shows the speedup for the same values of A and P as the table above.

  A \ P        2      4      8     16     32
  1         1.00   1.00   1.00   1.00   1.00
  10        1.61   2.38   3.31   4.29   5.29
  100       1.85   3.31   5.63   9.07  13.73
  1000      1.95   3.76   6.94  13.02  23.06
  10000     1.98   3.92   7.69  14.95  28.57

Table 3.2 Speedup table for different A and P

Tables 3.1 and 3.2 show that for large A there is little loss of efficiency, but for small A the loss of efficiency is significant. For example, when A is 100 and P is 32, the load unbalancing costs 57% of the ideal speedup.

3.1.3.3 Comparisons with Benchmarks: Theoretical vs. Empirical

We ran several ISCAS-89 circuits and one combinational circuit sequentially up to some large enough time limit. From these runs we obtained the total number of executions and the number of cycles. The activity number is the quotient of the number of executions divided by the number of cycles. Table 3.3 shows the results for s35932, s15850, s9234 and multi32.

  Circuit                  s35932    s15850     s9234    multi32
  Number of Executions    3294920    257457    143965    2303858
  Cycles                     3499      4415      3258       1998
  Avg. Activity Number        942        58        44       1153

Table 3.3 Number of executions, cycles and average activity number

When these circuits are partitioned over a number of processors, the cost of each cycle is the maximum number of executions among these processors. The sum of these maximum numbers of executions over all cycles is the total amount of computation. We can estimate this total using our multinomial distribution model with the average activity number. To see how accurate the estimation is, we randomly partitioned the objects into different numbers of parts, where each part had the same number of objects.
We then recorded the number of executions for each of these parts, and the sum over all cycles of the maximum number of executions among the parts. This experiment was done sequentially and there was no communication cost involved. Table 3.4 shows comparisons of the benchmark computation cost and the estimation for different numbers of partitions. The estimation is quite close to the benchmark. Most of the estimations have errors of less than 2%. This shows that many circuits have random behaviors that fit our multinomial distribution model well. We compared the results of the multinomial distribution with the binomial distribution. They both give good estimations of the benchmark. The number of independent trials for the multinomial distribution model is much smaller than that of the binomial distribution model, because the number of independent variables of the multinomial distribution model is much smaller than that of the binomial distribution model. A more important advantage of the multinomial distribution model is that it depends on only one parameter, the average number of active objects per cycle. The binomial distribution model needs the activity rate and the total number of objects.
For modern parallel computers with advanced routing techniques and high bandwidths, this communication overhead takes up most of the total communication cost. Because of this, we are interested only in the communication overhead and incorporate it into our prediction model. Several parallel computation models have been proposed. The LogP model is good when messages are short [31]. The LogGP model was proposed to incorporate long messages [2]. In the LogGP model, a message of size k is sent into the network at time O_null + m(k-1), where O_null is the overhead and m is the time per byte of a message. In this paper, we consider O_null + m(k-1) as the communication overhead, with O_null the overhead of sending a null message.

We ran the experimental program on an SGI Origin 2000. Figure 3.3 shows how the communication overhead of sending and receiving changes as the size of the message increases. Figure 3.4 shows the least-squares fit of the communication overhead to a linear formula. We assume that the communication overhead O follows the linear formula:

O = O_null + m * MsgSize

where O_null is the overhead of sending a null message and m is the increase per byte as the size of the message grows. We used the least-squares method to estimate m and O_null. On the SGI Origin 2000, the values of m and O_null are 12.9*10^-3 microseconds and 37.3 microseconds, respectively.

Figure 3.3 Message overhead

Figure 3.4 Least-squares fit to a linear formula
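As an illustration of the fitting step, the following minimal sketch computes the least-squares estimates of O_null and m from (message size, overhead) pairs. The measurement values in main are hypothetical stand-ins, not the Origin 2000 data.

#include <cstdio>
#include <vector>

struct Fit { double oNull, m; };

// Ordinary least-squares fit of O = O_null + m * size.
Fit fitOverhead(const std::vector<double>& size, const std::vector<double>& us) {
    double n = size.size(), sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (size_t i = 0; i < size.size(); ++i) {
        sx  += size[i];            sy  += us[i];
        sxx += size[i] * size[i];  sxy += size[i] * us[i];
    }
    double m = (n * sxy - sx * sy) / (n * sxx - sx * sx);   // slope: cost per byte
    return { (sy - m * sx) / n, m };                         // intercept: O_null
}

int main() {
    // Hypothetical overhead measurements (microseconds) vs. message size (bytes).
    std::vector<double> size = {0, 1000, 2000, 4000, 8000};
    std::vector<double> us   = {37.3, 50.2, 63.1, 88.9, 140.5};
    Fit f = fitOverhead(size, us);
    std::printf("O_null = %.1f us, m = %.4f us/byte\n", f.oNull, f.m);
}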
3.1.4.2 Predicted Speedup

The total execution time is the sum of the computation time and the communication time. Our multinomial distribution model can estimate the computation cost: from the activity number and the number of processors, we can compute the multiplication factor. The total computation cost is

Total Computation = G * N / (P * M),

where P is the number of processors, M is the multiplication factor, G is the granularity of computation (which includes the cost of event queue operations and event execution), and N is the total number of event executions. The multiplication factor measures the degree of load unbalancing and can also be computed with other random distribution models.

The communication time depends on the number of events generated. Suppose that an average of E events are generated per cycle and C is the total number of cycles; the total number of cycles is the total number of event executions divided by the activity number. When the system is partitioned into P parts, the number of events generated also follows a multinomial distribution, and the communication cost per cycle is decided by the maximum order statistic of these multinomial random variables. Therefore, the maximum number of events generated per cycle per processor is (E/P) * (1/M). These events are sent to P processors, so the event sending and receiving cost of a cycle is P * ((E/P)/(P*M) * m + O_null), where O_null and m are the communication parameters derived above. The total communication cost is:

C * P * ((E/P)/(P*M) * m + O_null) = E_total/(P*M) * m + C * P * O_null,

where E_total is the total number of events generated. Not all event executions generate new events. Depending on the circuit, we have E_total = N * k, where k is the probability that an event execution generates a new event. The number of cycles C can be represented as N/A. The refined total communication cost is

N/(P*M) * (m*k) + (N/A) * P * O_null.

The speedup is the ratio of the sequential simulation time to the parallel simulation time. The following gives the predicted speedup of synchronous simulation under our model:

Speedup = T_seq / (T_comp + T_comm)
        = N*G / (N*G/(P*M) + N*(m*k)/(P*M) + (N/A)*P*O_null)

The above formula can be simplified to:

Speedup = 1 / ((1 + (m*k)/G)/(P*M) + O_null*P/(G*A))

G includes the event queue handling and event execution, which can be considered much larger than m, and k is within a practical range. The value of m, the unit cost of a message per byte, is around 12.9*10^-3 microseconds for the SGI Origin 2000, and the value of k, the event generation rate, depends on the circuit; it is around 0.1 for the circuits we have tested. The value of G also depends on the circuit and is on the order of microseconds. Therefore, (m*k)/G is close to 0. This can be explained intuitively: the queue overhead plus the object execution cost of generating one event is much larger than the cost of sending a message that is longer by the size of one event. Removing (m*k)/G from the formula gives the simpler formula:

Speedup = 1 / (1/(P*M) + O_null*P/(G*A))

The maximum speedup is achieved when

P = sqrt((1/M) * (G/O_null) * A),

and the maximum speedup achievable is (1/2) * sqrt(M * (G/O_null) * A). Suppose the load is perfectly balanced (M = 1); then the maximum speedup achievable is (1/2) * sqrt((G/O_null) * A) when the number of processors is sqrt((G/O_null) * A). The maximum speedup achievable is therefore proportional to the square root of the activity number, which describes the degree of synchronous parallelism of a circuit.

3.1.4.3 Benchmark Results

Table 3.5 shows the comparison of predicted and experimental maximum speedup. The circuits we tested include five large ISCAS89 sequential circuits and a 32-bit multiplier. Input vectors arrive at every clock with random values. We benchmarked these circuits on the PSC TCS. The error of the prediction is less than 10% (2.95% for s35932, 4.91% for s38584, 8.3% for s15850, 9.66% for s13207 and 6.38% for multi32).

Table 3.5 The comparisons of predicted and experimental maximum speedup

We measured the performance of synchronous simulation on different parallel computers. Benchmark results show that the PSC TCS is considerably faster than the Origin 2000: the event execution rate we measured on the PSC TCS is about 5 times that of the Origin 2000, i.e., the computation granularity (G) of the PSC TCS is about 5 times smaller. However, the communication overhead of the PSC TCS is only 1.5 times smaller. Thus, according to the performance prediction formula, the Origin 2000 will achieve about 1.8 times better maximum speedup than the PSC TCS (the ratio of the sqrt(G/O_null) terms). Figure 3.5 compares the speedup of multi32 and s35932 on the PSC TCS and the Origin 2000. The results match our prediction model.

The largest circuit in ISCAS89 has around 20,000 objects. We multiplied s35932 and s38584 four, nine and sixteen times to test the effect of circuit size on performance. Because multiplication is implemented by juxtaposing the circuits, the new average activity numbers of the multiplied circuits are about four, nine and sixteen times those of the original ISCAS circuits. We benchmarked these circuits on the PSC TCS. The maximum speedup achieved for these circuits was about two, three and four times that of the original circuits. Figure 3.6 shows the benchmark results of this experiment. This confirms our formula for predicting the maximum speedup achievable, which is proportional to the square root of the activity number.
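The simplified formula is easy to evaluate for a given machine and circuit. The sketch below uses hypothetical parameter values, and it holds M fixed for illustration even though M in fact varies with P (Table 3.1); it evaluates the predicted speedup and the optimal processor count P_opt = sqrt(G*A/(M*O_null)).

#include <cmath>
#include <cstdio>

// Predicted synchronous-simulation speedup with the (m*k)/G term dropped:
//   S(P) = 1 / ( 1/(P*M) + O_null*P/(G*A) )
double speedup(double P, double M, double G, double A, double oNull) {
    return 1.0 / (1.0 / (P * M) + oNull * P / (G * A));
}

int main() {
    // Hypothetical values: G and O_null in microseconds, A and M per Table 3.1.
    double G = 5.0, oNull = 37.3, A = 942, M = 0.8;
    double pOpt = std::sqrt(G * A / (M * oNull));
    std::printf("P_opt ~ %.1f, predicted max speedup ~ %.2f\n",
                pOpt, speedup(pOpt, M, G, A, oNull));
    for (int P = 2; P <= 16; P *= 2)
        std::printf("P=%2d  S=%.2f\n", P, speedup(P, M, G, A, oNull));
}

With these example numbers the two denominator terms are equal at P_opt, and the computed maximum agrees with the closed form 0.5 * sqrt(M * (G/O_null) * A).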
Figure 3.5 PSC TCS vs. SGI Origin 2000

Figure 3.6 The effect of circuit size

Figure 3.7 shows the effect of granularity. We added loops to the object execution to study the effect of granularity on the speedup of s35932 and s15850. Each unit of extra granularity is around 20 microseconds. There is an improvement in speedup when the extra G goes from 0 to 1, but when the extra G is over 1 there is no significant further improvement. This can be explained by our prediction formula: when G*A >> O_null*P, the term 1/(P*M) dominates O_null*P/(G*A). s15850 has a much smaller A than s35932, so although G is large it cannot completely overwhelm the communication term; this is why larger G continues to give better speedup for s15850.

Figure 3.7 The effect of granularity

Higher granularity can improve performance because it hides the increasing communication cost as the number of processors grows. This can be seen best in the speedup graph for s15850 with varying granularity. When the granularity is large enough that communication takes a very small part of the total execution time, increasing the granularity does not improve performance further; the inherent load unbalancing demonstrated by our multinomial distribution model restricts the speedup. The speedup of s35932 did not increase with granularity starting from two: the predicted communication cost of s35932 at granularity two is less than 12% of the total cost. The cost of computation decreases as the number of processors increases, while the cost of communication increases. If the activity number is small or the granularity is small, computation takes a small portion of the total cost and performance will suffer.

3.1.4.4 Improving the Performance of Synchronous Simulation

Reducing the communication overhead

Reducing the communication overhead reduces the communication cost and speeds up the overall performance. Many methods have been proposed to reduce the software overhead. Active Messages [36] reduce communication cost by allowing user-level control over message passing. The Stanford FLASH multiprocessor's MAGIC is designed to completely off-load the communication overhead from the CPU [50]. Virtual memory-mapped communication supports direct data transfer between the sender's and receiver's virtual address spaces without operating system support [35].

Load balancing

Another way to enhance the performance of synchronous simulation is to improve the load-balancing factor. Techniques exist that balance the number of event executions [47,75]. In the next part of this chapter we propose a different improved synchronous algorithm that is also targeted at improving the load-balancing factor.
As revealed in the formula, the maximum speedup improvement from the multinomial distribution to perfect load balancing is 1/sqrt(M) - 1, where M is the multiplication factor that describes the degree of load unbalancing. Therefore, the improvement from these techniques is limited to a factor of 1/sqrt(M). For a circuit with an average activity number of 1000 on 32 processors, the improvement in speedup will be a factor of about 1.2. If the average activity number is 100, the speedup improvement factor on 32 processors is around 1.5.

Partitioning

Some circuit partitioning algorithms try to reduce the cost of message passing by making as many events as possible internal to their own partition while still keeping the load well balanced. However, our performance formula reveals that such partitioning techniques have little impact on the performance of synchronous simulation: the cost of scheduling and executing an event dominates the overhead of sending a message that is slightly larger in size. When we derived the performance formula, we cancelled out the cost of sending the extra messages. Soule's experimental results also showed that there is not much difference in parallel performance among several partitioning techniques [70].

Optimizing Communication

Our prediction model assumes that after each cycle, all processors exchange events in an all-to-all communication pattern. The value of the next time step can be calculated by piggybacking the local time on the event exchanges, so there is no need for an explicit barrier synchronization. Therefore, when there are P processors, the cost of this communication is O(P). This communication pattern becomes inefficient when P is large and a processor only sends events to a subset of the processors. In this situation, the communication should be optimized so that a processor sends a message to another processor only when there are events to deliver. However, an additional O(log P) communication overhead is then needed to calculate the next time stamp.

3.1.5 Improved Synchronous Algorithm

The original synchronous algorithm executes only the smallest time stamped events. The performance degrades because of the load-unbalancing factor, which can be modeled by the multinomial distribution model proposed in this paper. The cost of each cycle is decided by the maximum number of active objects assigned to one of the processors. The Improved Synchronous Algorithm tries to execute the same maximum number of active objects per cycle by pre-executing safe events on the processors that are not assigned the maximum number of active objects. For example, suppose that at a certain cycle processor 0 has 50 active objects and processor 1 has 60. After executing its 50 events, processor 0 tries to execute 10 more safe events. The safeness of an event execution can be decided by the lookahead technique used in conservative simulation. Our technique differs from previous techniques in its ability to predetermine the number of extra event executions; no asynchronous barrier synchronization is required. Each processor executes the following code:

while (GVT < MAXGVT) {
    Execute all events with time stamp GVT;
    Execute safe events so that the number of event executions
        this cycle is max_num_schedule;
    Send and receive events;
    Do barrier synchronization and calculate the new GVT
        and max_num_schedule;
}

Figure 3.8 Improved Synchronous Algorithm

Figure 3.8 shows the improved synchronous algorithm.
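The synchronization step of Figure 3.8 can be realized without a separate barrier. The following sketch assumes an MPI back end (the dissertation does not prescribe one here): two reduction calls compute the new GVT and max_num_schedule, and the collectives themselves serve as the synchronization point.

#include <cstdio>
#include <mpi.h>

struct SyncResult { double gvt; long maxNumSchedule; };

// localMinTime: smallest time stamp in this processor's event queue
//               (a very large value if the queue is empty);
// localCount:   number of local events carrying that time stamp.
SyncResult cycleSync(double localMinTime, long localCount) {
    double gvt;
    MPI_Allreduce(&localMinTime, &gvt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
    // Only processors whose local minimum equals the new GVT contribute a count.
    long mine = (localMinTime == gvt) ? localCount : 0, maxCount;
    MPI_Allreduce(&mine, &maxCount, 1, MPI_LONG, MPI_MAX, MPI_COMM_WORLD);
    return { gvt, maxCount };
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    // Stand-in values; in the real loop these come from the event queue.
    SyncResult r = cycleSync(42.0, 5);
    std::printf("GVT=%g max_num_schedule=%ld\n", r.gvt, r.maxNumSchedule);
    MPI_Finalize();
}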
The value of max_num_schedule is the maximum, over all processors, of the number of active objects scheduled to be executed in a cycle. The remaining processors execute up to max_num_schedule event executions by executing safe events. The new values of GVT and max_num_schedule can be computed by piggybacking the relevant information on the event exchange step. The improved algorithm generates better load balancing than the multinomial distribution model predicts. We also observed that the number of simulation cycles is reduced as a side effect of the load-balancing effort; the performance gain comes from both of these factors. Figure 3.9 compares the performance of the improved synchronous and the original synchronous algorithm.

Figure 3.9 Benchmark results of the improved synchronous algorithm (s38584, s35932 and s13207)

3.1.6 Conclusion

We proposed a performance model for synchronous parallel discrete event simulation. From the speedup formula derived from our performance model, we can predict that synchronous simulation performs better with a higher average number of active objects per cycle: a higher activity number reduces both the load-unbalancing factor and the communication cost. We found that the maximum speedup achievable by the synchronous algorithm is proportional to the square root of the average activity number. The error of our prediction compared with the benchmark results is small. This result can give some hint on whether it is worthwhile to use synchronous simulation before actually implementing it.

We also proposed an improved synchronous algorithm that tries to improve the load-balancing factor. Benchmark results show that the improved synchronous algorithm generates better load balancing than the original synchronous algorithm. Although we observed performance improvement from the improved algorithm, our performance prediction formula reveals that the gain from improving load balancing is limited to a factor of 1/sqrt(M), where M is the multiplication factor that describes the degree of load unbalancing. For a reasonable range of activity numbers and numbers of processors, M is typically greater than 0.5, so the maximum improvement in speedup is typically less than 1.4. To further improve the performance of synchronous simulation, a new algorithm must reduce the communication cost by reducing the number of simulation cycles significantly.

3.2 Cycle Reducing Simulation

3.2.1 Introduction

Logic simulation is an important method for assuring the correctness of a VLSI system design. For large designs, logic simulation is very time consuming; therefore, parallel discrete event simulation techniques are used to accelerate the simulation. There are three major categories of parallel simulation: synchronous, conservative and optimistic. A survey of parallel discrete event simulation is given by Fujimoto [42], and a survey of parallel logic simulation by Chamberlain [16]. Synchronous simulation is the simplest of all parallel simulation techniques: at every cycle only the smallest time stamped events are executed. The main advantage of synchronous simulation is its low overhead compared with the asynchronous simulation techniques.
Soule and Gupta [72] evaluated conservative asynchronous simulation for digital logic simulation. They found that the overheads of conservative asynchronous simulation overwhelm its advantage, and that its performance is about three times slower than a synchronous parallel event-driven algorithm [72]. Konas and Yew compared the performance of synchronous simulation to the conservative and optimistic algorithms in simulating a synchronous multiprocessor system [47]. Their results show that the synchronous method is considerably faster than both of the asynchronous approaches, mainly because both asynchronous simulation algorithms introduce significant overhead.

The main problems of synchronous simulation are poor load balancing and frequent synchronization. There is a synchronization step between cycles that exchanges messages between processors and decides the next smallest time stamp. Within each cycle, the number of event executions varies among processors; a processor that finishes its event executions early in a cycle has to wait for the last processor to complete. A statistical model exists that describes this load-unbalancing factor [85]. The performance of synchronous simulation does not scale well because the cost of synchronization rises as the number of processors increases.

There are some variations of the synchronous algorithm that try to overcome its load-imbalance problem. AdvanCE SPaDES [47] executes safe events after the smallest time stamped events have finished execution and before the synchronization point; some performance improvement was observed on logic simulation when the circuit is small. Steinman [75] proposed the Breathing Time Buckets algorithm, which executes unsafe events before the synchronization point; the states of the objects that execute unsafe events are saved so that they can be recovered to the states before the incorrect executions. These techniques do not solve the other problem of the synchronous algorithm: frequent synchronization.

Xu and Chung [85] proposed a prediction formula for synchronous simulation. The formula considers factors including load balancing and the computation-to-communication ratio. Logic simulation has very fine computational granularity; because of this, even if we could make the load perfectly balanced at each cycle, the improvement in performance would be very limited. In this section, we propose an improved performance formula for synchronous simulation that considers the number of cycles. We then describe an optimistic synchronous algorithm that is targeted at reducing the number of simulation cycles instead of improving the load-balancing factor. We applied the algorithm to ISCAS89 sequential and ISCAS85 combinational circuits and observed significant improvement in performance compared with synchronous simulation.

3.2.2 Performance Evaluation Model

We extend Xu and Chung's performance prediction model of synchronous simulation by including the number of simulation cycles.

Terms:

C: Total number of simulation cycles with the traditional synchronous algorithm.
A: Average number of event executions per cycle.
G: Computational granularity of an event execution. For the cycle reducing algorithm, this number also includes the state saving overhead.
O_s: Communication overhead for sending and receiving one message.
P: Number of processors.
M: Multiplication factor that describes the degree of load balancing.
For the cycle reducing algorithm this number also includes the extra cost of rollback. 0 < M <= 1, and P * M describes the speedup without communication cost. When M = 1, the load is perfectly balanced and there is no rollback.
C': The number of simulation cycles of the cycle reducing algorithm. C' <= C.

Analysis:

The cost of sequential simulation is C * A * G. For traditional synchronous simulation, the costs of computation and communication are:

Computation = C * A * G / (P * M)
Communication = C' * P * O_s

The cost of synchronous simulation is:

C * A * G / (P * M) + C' * P * O_s

The speedup formula is:

Speedup = 1 / (1/(P*M) + (C'/C) * (O_s*P)/(G*A))

The maximum speedup is achieved when

P = sqrt(1/M) * sqrt(C/C') * sqrt(G/O_s) * sqrt(A),

and the maximum speedup is

0.5 * sqrt(M) * sqrt(C/C') * sqrt(G/O_s) * sqrt(A).

For example, with G/O_s = 10, A = 100 and perfect balance, reducing the cycle count by a factor of four (C/C' = 4) doubles the achievable maximum speedup from 0.5 * sqrt(10 * 100), about 16, to about 32. Consider the case when C' = C (the traditional synchronous algorithm) and M = 1 (perfect load balance). The maximum speedup achievable is 0.5 * sqrt(G/O_s) * sqrt(A). We can see that the upper limit on speedup is determined by the average number of active objects per cycle for synchronous simulation: when this number is small, it is hard to achieve good speedup even if the load is perfectly balanced. There are techniques that try to improve the multiplication factor, but the upper limit of improvement from balancing the load is limited to a factor of 1/sqrt(M). The multinomial distribution model predicts the value of this load-balancing factor. For an average activity number of 1000 and 32 processors, the value of M is around 0.72; for an average activity number of 100 on 16 processors, the value of M is around 0.57. The improvement factors from balancing the load to the optimum in these cases are 1.2 and 1.3, respectively.

The cycle reducing algorithm is able to reduce the number of simulation cycles. The performance of this algorithm depends on how much we can reduce the simulation cycles without significantly degrading the multiplication factor. We can determine the value of the multiplication factor and the reduction in simulation cycles by running a sequential simulator: for a given circuit, if the number of processors and the number of events to execute in a cycle are given, the sequential simulator can determine the number of rollbacks, the multiplication factor and the reduced number of cycles. Together with other easily obtainable factors, the performance of the cycle reducing algorithm can then be predicted for a parallel computing platform. With the prediction formula, it is possible to determine the optimal number of events to execute in a cycle that generates the best speedup.

3.2.3 Optimistic Synchronous Algorithm

To reduce the number of cycles, the number of event executions per cycle should be larger than the average number of active objects per cycle. Besides executing the smallest time stamped events in each cycle, the algorithm optimistically executes a certain number of unsafe events; we therefore call it the Optimistic Synchronous Algorithm. If the number of events scheduled to execute, K, is large, the number of cycles will be small; on the other hand, there is a risk of a higher rate of incorrect event executions. Therefore, K should be decided adaptively during the simulation (one possible policy is sketched below). It is important that all processors schedule the same number of event executions per cycle to avoid load imbalance. State saving is necessary for unsafe event executions, and garbage collection is done each cycle. Because incorrect event executions may occur, a rollback mechanism must be built into the algorithm.
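The dissertation leaves the adaptation rule for K open at this point. Purely as an illustration, the following sketch shows one hypothetical policy: grow K while the observed fraction of rolled-back executions stays below a target, and shrink it otherwise. The structure name, fields and thresholds are all assumptions.

#include <algorithm>

// Hypothetical controller for K, the number of events executed per cycle.
struct KController {
    long k;                       // current events-per-cycle budget
    double targetRollbackRatio;   // e.g. 0.15, matching the observed <15% rate

    void update(long executed, long rolledBack) {
        double ratio = executed > 0 ? double(rolledBack) / executed : 0.0;
        if (ratio < targetRollbackRatio)
            k += k / 8;                        // cautious multiplicative growth
        else
            k = std::max(1L, k - k / 4);       // back off on excess rollbacks
    }
};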
The following is a description of the algorithm. Each processor executes the following code:

GVT = 0;
while (GVT < MAXGVT) {
    A = number of events with time stamp GVT;
    Execute all events with time stamp GVT;
    if (K > A) execute K - A unsafe events;
    Do garbage collection of the events and states with
        time stamp smaller than GVT;
    Send events to other processors with LVT attached;
    Receive events from other processors;
    Compute the new GVT from the piggybacked LVTs;
    Adaptively adjust the value of K;
}

3.2.4 Experimental Results

We implemented the algorithm and tested it on the Terascale Computing System (TCS) at the Pittsburgh Supercomputing Center. We chose four large ISCAS89 [11] sequential circuits and one large ISCAS85 [10] combinational circuit as our target systems. Entities are randomly partitioned. Input vectors arrive at every clock with random values, and the clock is 10,000 times larger than the unit delay of a gate. We simulated all circuits with 800 input vectors. Table 3.6 shows the characteristics of the five circuits.

Table 3.6 The characteristics of circuits

We observed significant improvement of our optimistic synchronous algorithm over the synchronous algorithm. The number of cycles is reduced to as little as 7% of the original number of cycles. Figure 3.10 shows the speedup graphs of the five ISCAS circuits for both the synchronous and the optimistic synchronous algorithm. For a small number of processors, synchronous simulation performs better than optimistic synchronous simulation; this is caused by the state saving overhead of the optimistic synchronous algorithm. As the number of processors increases, optimistic synchronous simulation performs much better than synchronous simulation, because of the significant reduction in the number of simulation cycles. The percentage of incorrect event executions is usually smaller than 15% when the number of events scheduled to execute is not very large.

Synchronous simulation performs much worse when the circuits have non-unit delays; this is caused by the reduction in the average number of active objects per cycle and the increase in the number of cycles. We observed that the optimistic synchronous algorithm is not sensitive to the delay model. Figure 3.11 shows the speedup comparison of the optimistic synchronous algorithm with unit delay and non-unit delay; the performance with non-unit delay is only slightly worse than with unit delay.

Figure 3.10 Synchronous and optimistic synchronous simulations (s38584, s35932, s15850, s13207 and c7552)

Figure 3.11 The speedup comparison of unit and non-unit delay (c7552)

3.2.5 Conclusion

In this section, we proposed a performance evaluation formula for synchronous simulation. The formula considers factors including load balancing, the computation-to-communication ratio and the number of simulation cycles. The formula reveals that reducing the number of simulation cycles is one of the most important keys to enhancing the performance. We proposed an optimistic synchronous algorithm targeted at reducing the number of cycles. The experimental results show that the algorithm is able to reduce the number of cycles significantly. The performance of the optimistic synchronous algorithm continues to improve when the synchronous algorithm is no longer able to do so.
For non-unit delay models, the optimistic synchronous algorithm performs only slightly worse than for the corresponding unit delay models.

Chapter 4

4. OPTIMISTIC SIMULATION

4.1 Reducing the Overhead of Optimistic Simulation

4.1.1 Introduction

The Time Warp algorithm allows optimistic asynchronous simulation of discrete event systems. It can exploit the parallelism in a system by allowing out-of-order event execution. Some of these event executions may not be correct and need to be cancelled. When an event with time stamp t arrives at a logical process (LP) and the LP has already executed an event with a time stamp greater than or equal to t, the LP needs to be recovered to its state before time stamp t in order to execute the new event correctly. This requires frequent state saving, which consumes a great amount of memory and CPU cycles. GVT (Global Virtual Time) is used to garbage-collect the states that no longer need to stay in memory. GVT is defined as the minimum time stamp of all events in transit or in the input queues of all LPs. No newly generated event can have a time stamp smaller than GVT; therefore, all states with time stamps smaller than GVT, except the most recent one, can be discarded safely.

State saving is one of the major overheads of the Time Warp algorithm. It requires significant computation and increases the memory requirement, which may exceed the physical memory size. Reducing the cost of state saving is essential to enhancing the performance of Time Warp. Hardware [43] and software approaches exist to reduce this cost. There are three major software techniques that try to reduce the overhead: incremental state saving [8,78], checkpointing or infrequent state saving [9,41,54,65], and reverse computation [15].

In this chapter, we present an algorithm that estimates a lower bound on the time stamps an LP may receive. This lower bound, together with GVT, is used to reduce the number of state savings and the maximum memory size needed for Time Warp. The algorithm embeds the lower bound estimation in Time Warp event execution, so it has minimal overhead. The technique increases the performance of Time Warp in most cases, and its worst-case performance is nearly the same as the normal state saving algorithm.

We incorporate the lookahead technique of the conservative algorithm to reduce the number of state savings. In conservative simulation, each generated event is accompanied by the LVT (Local Virtual Time) of the sending LP. The receiver LP knows that it will not receive any future event with a time stamp smaller than the LVT from that sender. Each LP can safely execute the events with time stamps smaller than the LVT values of all input channels; only safe events are executed in conservative simulation. Deadlock may occur when there is a cycle of blocked LPs, each waiting for its predecessor to update its clock. Null messages are used to avoid deadlock, but they can add significant overhead. Chandy and Misra proposed a two-phase method to detect and resolve the deadlock; deadlock can be broken by the observation that the events with the smallest time stamp are always safe to process.

Jha and Bagrodia [46] tried to unify the conservative and optimistic algorithms. A process can choose autonomously to be optimistic or conservative; a process is chosen to be conservative to limit the parallelism and avoid generating a large number of incorrect messages. The decision to be optimistic or conservative is hard to make.
If a bad decision is made, the performance may be worse than both conservative and optimistic simulation. Varghese, Chamberlain and Weihl [80] used a local guaranteed time to reclaim memory with time stamps greater than the GVT; they did not address the possibility of skipping state savings.

In our approach, whenever an event is generated, the sender attaches the lower-bound time stamp of any event it may send to the receiver in the future. The lower-bound time stamp of an LP is defined as the minimum of the lower-bound time stamps of all its input channels. The lower-bound time stamp of an event updates the value of the lower-bound time stamp of an LP and vice versa. An event execution is safe if its time stamp is smaller than the lower bound of the LP. In safe execution mode, state saving is not necessary and all saved states can be discarded. In unsafe execution mode, we need to save states, but saved states with time stamps smaller than the lower-bound time stamp of the LP can be discarded. During the simulation, an LP may execute in safe mode or unsafe mode. If all event executions are in safe mode, the technique performs similarly to the Chandy-Misra algorithm without blocking; if all event executions are in unsafe mode, the performance is similar to Time Warp. In other cases, it may perform better than Time Warp by doing fewer state savings, and better than the Chandy-Misra algorithm by exploiting more parallelism. Our technique requires the simulated system to have fixed interconnections.

The algorithm and the proof of its correctness are stated in Section 4.1.2. The performance issues of the algorithm are discussed in Section 4.1.3. The proposed technique is especially good when the cost of state saving is high compared with the cost of event execution. Logic simulation has very fine computational granularity; we applied our algorithm to logic simulation and tested it on the Terascale Computing System (TCS) at the Pittsburgh Supercomputing Center. The performance results are presented in Section 4.1.4.

4.1.2 Algorithm

4.1.2.1 The Model

We assume that the system to be simulated has a fixed interconnection. Applications with fixed interconnections include hardware and network simulation; the algorithm is not applicable to war game simulation, where objects communicate without fixed connections. The algorithm also requires the communication channels of the parallel machine to be FIFO (First In First Out); popular communication mechanisms including MPI and TCP satisfy this condition. We call a channel that has initial events a primary input channel, and we require that a primary input channel is not an output channel of any process. In summary, we have the following assumptions about the model:

- The interconnection is fixed.
- Communication channels are FIFO.
- Initially, there are events only on the primary input channels.

4.1.2.2 Notations

We use the following notations in our algorithm.

lbt(C): Lower bound of the time stamp of an event that may be received on channel C. To specify the value at simulation time τ, we use the notation lbt_τ(C).

lbt(P): Lower bound of the time stamp of an event that may be received by process P. It is defined as the minimum value of lbt(C) over all input channels of the process.

ts(e): Time stamp of event e.

δ: Minimum delay of a process.

lbt(e): Lower bound of the time stamp of the next event (after e) that may be sent from the sender to the receiver.
It is defined as min(lbt(P), ts(e_next)) + δ, where e_next is the smallest time stamped event in the unprocessed event queue of process P.

Anti-message: We use -e to denote the anti-message of event e.

4.1.2.3 Algorithm

The algorithm updates the values of lbt(C) and lbt(P) with the value of lbt(e) every time a normal event e arrives on channel C of process P. When the time stamp of an event scheduled to execute on process P is less than lbt(P), we can skip the state saving. We can also discard the states that have time stamps smaller than lbt(P), which may be larger than the value of GVT. When a new event e is generated on process P, we attach lbt(e) to the message.

Initialization step:

For any channel C, if C is a primary input channel, set lbt(C) = ∞; otherwise set lbt(C) = 0. For each primary input, the lbt(e) of each event e is set to the time stamp of the next event after e; the lbt(e) of the last primary input event is set to ∞.

Simulation step:

if (e arrives on channel C of process P) {
    lbt(C) = max(lbt(C), lbt(e));
    lbt(P) = max(GVT, min({lbt(Ci) | Ci is an input channel of P}));
}
if (e is scheduled to execute on process P) {
    if (ts(e) >= lbt(P)) save the current state;
    Execute the event;
    For each new event ek generated on output channel k,
        lbt(ek) = min(lbt(P), ts(e_next)) + δ,
        where e_next is the next event to be scheduled by P
        (if e_next == NULL, ts(e_next) = ∞);
    Discard all states with time stamps smaller than lbt(P),
        except the one with the time stamp nearest to lbt(P);
}

Figure 4.1 An illustration of the algorithm (GVT = 1)

Figure 4.1 illustrates the algorithm. The dashed arrows represent primary input channels. We use a tuple (ts(e), lbt(e)) to represent an event with time stamp ts(e) and lbt value lbt(e). Suppose GVT is 1 during the following event executions. Process P1 has two primary inputs. Event (1, ∞) is scheduled to execute. The lbt of process P1 is ∞; the time stamp of the event being executed is less than the lbt, so the algorithm skips state saving. The execution of the event generates a new event (6,7) that goes to the first input channel of process P3. Events (2,4) and (2, ∞) are scheduled to execute on process P2. The execution of these events generates a new event (3,5) that goes to the second input channel of process P3; again, no state saving is necessary. Suppose event (3,5) generated by P2 is a straggler, and event (6,7) is scheduled to execute on process P3 before the arrival of event (3,5). No event has updated the lbt of the second channel of P3, so the lbt of P3 is set to 1, which is the maximum of GVT and the minimum of the lbt values of the input channels. State saving is done because the event time stamp is greater than the lbt of the process. When the straggler event (3,5) arrives, the state is restored to the state before event (6,7) was executed. The new lbt of the process is updated to 4; since the time stamp of the event execution is 3, no state saving is done.
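To make the bookkeeping concrete, the following is a minimal sketch of the per-LP lbt maintenance described above. The LP structure and function names are hypothetical, not the dissertation's engine code.

#include <algorithm>
#include <limits>
#include <vector>

constexpr double INF = std::numeric_limits<double>::infinity();

struct Event { double ts, lbt; };

struct LP {
    std::vector<double> channelLbt;   // lbt(C) for each input channel
    double delta;                     // δ, the minimum delay of the process

    // Called when event e arrives on input channel c; returns the new lbt(P).
    double onArrival(int c, const Event& e, double gvt) {
        channelLbt[c] = std::max(channelLbt[c], e.lbt);
        double lbtP = *std::min_element(channelLbt.begin(), channelLbt.end());
        return std::max(gvt, lbtP);
    }

    // State saving is skipped exactly when the execution is provably safe.
    bool needStateSave(const Event& e, double lbtP) { return e.ts >= lbtP; }

    // lbt attached to a newly generated event; tsNext is the time stamp of
    // the next unprocessed event (INF if the queue is empty).
    double outgoingLbt(double lbtP, double tsNext) {
        return std::min(lbtP, tsNext) + delta;
    }
};

Initialization would set channelLbt to INF for primary input channels and 0 for all others, matching the initialization step above.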
4.1.2.4 Proof of Correctness

Note that the algorithm may skip state savings and discard states that have larger time stamps than the GVT. To prove that the algorithm is correct, we need to show that whenever a straggler event arrives, there exists a saved state with a smaller time stamp than the straggler event, so that Time Warp can roll back to a correct state. The straggler event may be a normal event or an anti-message. We also assume that the cancellation scheme meets the following cancellation condition.

(Cancellation Condition) For an event e that is generated by process P and any message e' generated between e and the anti-message -e, lbt(e') must not be greater than ts(e).

Later in this section, we prove that both the aggressive and the lazy cancellation schemes satisfy the above cancellation condition. The following are some immediate relationships from the algorithm. lbt(P) is defined as the minimum value of lbt(C) over all input channels:

lbt(P) = min({lbt(Ci) | Ci is an input channel of P})   --- (1)

GVT does not affect the correctness of the algorithm, so we will not consider GVT in the proof. If event e is a primary input event, lbt(e) is pre-set to the time stamp of the next event on its primary input channel; no new event may be generated on a primary input channel, and the lbt(e) of the last primary event is set to ∞. When a new event e is generated, we set

lbt(e) = min(lbt(P), ts(e_next)) + δ   --- (2)

where e_next is the next event to be executed (if e_next does not exist, ts(e_next) = ∞). The initial value of lbt(C) is 0 if C is a non-primary input channel and ∞ if C is a primary input channel. Primary input events are pre-inserted, and no event will arrive on a primary input channel. lbt(C) of a non-primary input channel is updated when an event arrives on it:

lbt_new(C) = max(lbt_old(C), lbt(e))   --- (3)

From (1) and (3) we have the following simple observations.

Observation 1: lbt_τ(C) <= lbt_τ'(C), if τ <= τ'.
Observation 2: lbt_τ(P) <= lbt_τ'(P), if τ <= τ'.

The following Lemma 1 is the foundation of the proof.

Lemma 1. At simulation time τ, if event e or -e arrives at an input channel C of object P, then ts(e) >= lbt_τ(C) and lbt(e) >= lbt_τ(C).

(Proof) We prove Lemma 1 by induction on τ.

(Basis) At simulation time 0, no event has arrived at any input channel, so Lemma 1 is trivially correct.

(Inductive Hypothesis) Suppose that the claim is true for all objects for all simulation times τ' < τ.

(Inductive Step) We prove that the claim is true at simulation time τ. Note that lbt(C) is updated every time a normal event is received on C. Let e be an event received on channel C at simulation time τ. We need to prove that ts(e) >= lbt_τ(C), and that lbt(e) >= lbt_τ(C) if e is a normal event. If e is the first message that P has received on channel C during the simulation, then lbt(C) = 0 at the time C receives e, so ts(e) >= lbt(C), lbt(e) >= lbt(C) and the claim holds. Thus, we need to consider the case where a message was placed on channel C before e. Event e may be a normal event or an anti-message; let us first consider a normal event e.

Let e0 be the last normal message placed on channel C before e. Since the interconnection is fixed, e and e0 are generated by the same process P'; moreover, because the communication channel is FIFO, e0 was generated before e. Let e be generated by executing event e' at simulation time τ'', and e0 be generated by executing e0' at time τ0'. Let C' be the channel on which P' received e'. Note that at P, after receiving e0 at time τ0, the new lbt of channel C is lbt(C) = max(lbt(e0), lbt_τ0(C)). Event e0 is a normal event, so by the inductive hypothesis lbt(e0) >= lbt_τ0(C); hence the new lbt(C) after receiving e0 is lbt(e0), and it remains so until e arrives at C. That is,

lbt_t(C) = lbt(e0), for τ0 < t <= τ   --- (4)

There are two cases.

(Case 1) Event e' was received by P' after P' executed e0' and generated e0.

Figure 4.2 Event e' was received by P' after P' executed e0' and generated e0
e' arrives I ‘0 l le' 8 $80 i \ ea \1 arrives e 70' to z" 1" 2’ Physical time Figure 4.2 Event (3' was received by P' after P' executed e,’ and generated e, 86 Let 1'" be the simulation time P' receives e', 2" be the time P' executed e', and 25' be the time of P' executed eo'. We have T> 2’ > 2'"> 2;]. By inductive hypothesis, “(6') 2 In... (C') 2 1a,, (C) 2 1177.011?) (5) Therefore, t5(e) 2 t5(e') + 52 lbtmfly) + 6 _>_ lbt(eo) = 1th (C) --- (6) Note that the last inequality follows from equation (2). Consider lbt(e) = min (lbtr (P'), t.r(em,)) + 5, where ewis the next event after (2' at P'. From (5), we have t:(e,,m) _>.t5(e') 2 lbtm.(P'). Since r> 2’ > 2'"> 70', lbt, (P') 2 lbtm.(P') Thus, lbt(e) = min ( 1th a”), t5(€m1)) + 62 lbtwfly) + 62 lb: (ea) = lbt, (C) (7) Note that the last equality follows from equation (4). From (6) and ('7), the claim is true for case 1. (Cate 2) The event e' was received by P' before P' executed eo' and generated e0. Note that case 2 includes the case that e' is a primary input event. There are two sub cases. 87 (Care 2A) t5(e') _ me.) > ate") >lbtro (P7 88 Therefore, lbt(e)) = 1127,, (P') + 6 Consider, lbt(e), lbt(e) 2172;, a?) + 6 2 lbt,0.(P') + 6 = lbt(ea) = lth(C) (11) From (10) and (11), the claim is true for case 2A. (Care ZB) tr(e') > tr(e0') In this case, e' was in the unprocessed event queue of P' when P' executed eo'. Therefore, tr(e') 2tr(em,), where e next is the next event scheduled after eo' at the execution of eo'. Note that lbt(ea) = min (lbtto. fl”), t:(em,)) + 5 _< tr(em) + 5 Therefore, t5(e) 2 t5(e') + 52 1563”“) + (5 2 lbt(ea) = lbtf(C) --- (8) Let e" be the event scheduled next after e' at P. Note that t:(e") > t.r(e') > tr (em). Thus, lbt(e) = min(lbtp (P'), t.r(e")) + 5 2 min (Ibtm.(P'), tr(em ) + 5 = lbt(eO) -- (9) From (8) and (9), the claim is true for case ZB. 89 Therefore the claim is true for any event e. Now let us consider anti-message —e. The proof is trivial for tr(-e)=tr(e)2 1th (C). The value of lbt(C) is not greater than tr(e) when event e arrived. By the cancellation condition, the value of lbt(C) is not increased before —e is received. The proof of lbt(—e)2 lbt,(C) is similar to the techniques we used above and we are going to skip it. Thus, the claim is true during all simulation time. D Lemma 2. For any stragler event scheduled to execute, there is at least one saved state that has smaller time stamp than the time stamp of the straggler event. (Proof) Suppose e or —e is a straggler event. The time stamp of the last event execution must be greater than or equal to tr(e). Therefore, e or —e must have been arrived after the last event execution (note that execution does not change the lbt value of a process). Suppose e’is the first event executed by the same process such that tr(e) _ scheduler ‘——p simulation manager object list GVT manager Figure 5.2 Module diagram of simulation kernel Scheduler is the center of simulation kernel. It decides which LP to execute and in which mode to execute. GVT Manager is responsible for computing the value of GVT. The frequency of GVT computation is decided by the scheduler. Communication manager is responsible for sending and receiving messages from other processors. Scheduler decides how many events to be buffered before sending events. The advantage of our unifying framework is it can be easily adapted to execute in almost all of known simulation techniques. In the following table, we map our unifying framework to known simulation techniques. 
In the following table, we map our unifying framework to known simulation techniques.

Simulation Technique: Mapping

Synchronous: The scheduler only allows an LP to execute in synchronous mode. GVT is used to decide the smallest time stamp.

Conservative: The scheduler only allows an LP to execute in conservative and synchronous modes. GVT computation is used as the deadlock recovery mechanism.

Optimistic: The scheduler only allows an LP to execute in optimistic mode.

Breathing Time Buckets: The scheduler allows an LP to execute in synchronous mode, and in optimistic mode when no synchronous mode execution is available. GVT computation is used to compute the event horizon.

AdvanCE SPaDES: The scheduler allows an LP to execute in synchronous mode, and in conservative mode when no synchronous mode execution is available. GVT is used to decide the smallest time stamp.

Optimized Optimistic: The scheduler gives priority to events that can be executed in either synchronous or conservative mode. When no synchronous or conservative mode execution exists, the scheduler picks an LP to execute in optimistic mode.

Table 5.1 Mapping of simulation techniques

5.5 Experimental Results

We implemented the optimized optimistic simulation, which exploits cycle parallelism and conservative parallelism, and tested it on the Terascale Computing System (TCS) at the Pittsburgh Supercomputing Center. We chose four large ISCAS89 benchmark circuits as our application and executed the circuits with up to 16 processors. Entities are randomly partitioned. Input vectors arrive at every clock with random values, and the clock is 10,000 times larger than the unit delay of a gate. We observed that the performance of the optimized optimistic approach is almost as good as synchronous simulation when the number of processors is small, and usually outperforms synchronous simulation when the number of processors is large. The optimized optimistic approach always performs better than optimistic simulation. Figure 5.3 shows the comparison of the experimental results, and Table 5.2 shows the maximum speedup achieved for these circuits with up to 16 processors.

Figure 5.3 The comparison of experimental results

Table 5.2 Maximum speedup with up to 16 processors

5.6 Conclusion

In this chapter, we introduced a technique that enables optimistic simulation to exploit cycle parallelism and conservative parallelism. It gives us a way to keep optimistic simulation from performing much worse than other techniques because of state saving overhead and bad scheduling that executes non-safe events before safe events. We also proposed a unifying framework in which synchronous, conservative and other simulation techniques are special cases of the unified simulation kernel.

Chapter 6

6. SUMMARY

This dissertation focused on evaluating and improving the performance of parallel simulation. First, we proposed a model to predict the performance of synchronous logic simulation. For the performance of synchronous simulation, load balancing and communication cost are the most important factors. The effect of load balancing in synchronous simulation was computed using probability distribution models. We derived a formula for the cost of synchronous simulation by combining the LogGP communication model with the computational granularity.
The formula is simple, yet captures the most important factors for synchronous simulation. The performance formula predicts that the maximum speedup achievable by synchronous simulation is proportional to the square root of the average number of active objects per simulation cycle. The performance prediction model we developed can be used to decide whether it is worthwhile to use synchronous simulation before actually implementing it. We benchmarked several ISCAS circuits on the SGI Origin 2000 and the Terascale Computing System. The developed performance model was used to analyze the effect of several factors that may improve the performance of synchronous simulation. We proposed a technique that improves the load balancing and the number of simulation cycles. We evaluated the limitation of improving the load-balancing factor and showed that reducing the number of simulation cycles is a more promising approach for synchronous logic simulation.

Next, we proposed a cycle reducing synchronous algorithm. The algorithm is based on the assumption that simulation cycles can be reduced by pre-executing events that would normally be executed in future simulation cycles. There are two approaches to selecting the events to be pre-executed. One is to pre-execute only the safe events, whose safeness is determined by lookahead computation. The other is to execute events more aggressively, including unsafe events. Our experimental results showed that pre-executing only the safe events was not able to reduce the simulation cycles significantly. We implemented an aggressive cycle reducing algorithm that executes a certain number of unsafe events in addition to the smallest time stamped events. The benchmark results showed that this algorithm was able to reduce the number of simulation cycles significantly and to achieve several times better speedup than the traditional synchronous algorithm for several circuits. However, there remains the issue of selecting the number of extra events to execute in a simulation cycle: if this number is large, there will be fewer simulation cycles, but the number of incorrect event executions will be higher, and the performance loss from incorrect event executions may become more significant than the gain from reduced simulation cycles. Therefore, this number should be decided adaptively during the simulation. The communication pattern of the cycle reducing algorithm is synchronous, so the performance of the algorithm is deterministic; the optimal number of extra events to execute can alternatively be determined by sequential analysis.

The Time Warp algorithm is able to exploit parallelism that cannot be found by conservative algorithms. However, this algorithm requires frequent state savings and more memory than other simulation techniques. We proposed a technique that reduces the number of state savings as well as the memory requirement of Time Warp. The lookahead technique is used to compute a lower-bound time stamp for each logical process: the time stamp of any future event a logical process receives will be greater than or equal to this lower bound. Therefore, if the time stamp of an event execution is smaller than this lower bound, the execution is guaranteed to be correct and no state saving is necessary. In addition, the states and processed events that have time stamps smaller than this lower bound can be discarded safely. Each logical process may have a different value of the lower bound.
The lower bound of a logical process is updated by events that carry the lower-bound information of its predecessors. This algorithm can be implemented easily and with minimal overhead. Our benchmark results on logic simulation showed that this algorithm reduced the number of state savings and the maximum state queue sizes significantly. We also observed considerable improvement in the parallel execution time; the improvement came from skipping state savings as well as from the shorter queue sizes.

Finally, we proposed a unifying framework for parallel simulation that combines the advantages of the synchronous, conservative and optimistic simulation techniques. Synchronous simulation has the lowest overhead, but it does not scale well with the number of processors. Conservative simulation performs well if good lookahead exists, but the cost of deadlock avoidance and detection is high. Optimistic simulation is able to exploit more parallelism, but the cost of state saving and rollback is significant. Under different situations, each of these techniques may perform better than the rest. The unified approach gives priority to the event executions that can be executed in synchronous or conservative mode. An event execution is eligible to execute in synchronous mode when the time stamp of the event execution is equal to the global virtual time; the traditional definition of global virtual time was modified to guarantee that the execution of events with time stamps equal to the global virtual time cannot be rolled back in the future. The event executions in conservative mode can be detected by our proposed overhead reducing algorithm. When there is no candidate to be executed in synchronous or conservative mode, an event is picked to execute in the higher-overhead optimistic mode. Therefore, this approach is able to perform as well as the synchronous and conservative simulations when they perform well, and to perform better than these techniques when they perform poorly. The unified approach always performs better than the original optimistic approach by skipping state savings and having smaller queue sizes. We applied this technique to logic simulation, and the results confirmed our analysis. The unified approach can easily be adapted to execute as any of the known parallel simulation techniques, eliminating the fundamental differences between them.

BIBLIOGRAPHY

1. V. D. Agrawal and S. Chakradhar, Performance Analysis of Synchronized Iterative Algorithms on Multiprocessor Systems, IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 6, pp. 739-749, 1992.

2. A. Alexandrov, M. Ionescu, K. E. Schauser and C. Scheiman, LogGP: Incorporating Long Messages into the LogP Model - One Step Closer Towards a Realistic Model for Parallel Computation, 7th Annual Symposium on Parallel Algorithms and Architectures, pp. 95-105, 1995.

3. P. J. Ashenden, The Designer's Guide to VHDL, Morgan Kaufmann, San Francisco, 1996.

4. P. J. Ashenden, H. Detmold and W. S. McKeen, Execution of VHDL Models Using Parallel Discrete Event Simulation Algorithms, VLSI Design, Vol. 2, No. 1, pp. 1-16, 1994.

5. D. Baik and B. P. Zeigler, Performance Evaluation of Hierarchical Distributed Simulators, Proc. of the 1985 Winter Simulation Conference, pp. 421-427, 1985.

6. R. Baldwin, M. J. Chung and Y. Chung, Overlapping Window Algorithm for Computing GVT in Time Warp, Proceedings of the International Conference on Distributed Computing Systems, pp. 534-541, 1991.
7. R. L. Bagrodia, Designing Efficient Simulations Using Maisie, UCLA Technical Report, 1997.

8. H. Bauer et al., Reducing Rollback Overhead in Time Warp Based Distributed Simulation with Optimized Incremental State Saving, Proceedings of the 26th Annual Simulation Symposium, pp. 12-20, 1993.

9. S. Bellenot, State Skipping Performance with the Time Warp Operating System, Proceedings of the 6th Workshop on Parallel and Distributed Simulation, 1992.

10. F. Brglez and H. Fujiwara, A Neutral Netlist of 10 Combinational Benchmark Circuits and a Target Translator in Fortran, Proceedings of the International Symposium on Circuits and Systems, pp. 695-698, 1985.

11. F. Brglez, D. Bryan and K. Kozminski, Combinational Profiles of Sequential Benchmark Circuits, Proceedings of the International Symposium on Circuits and Systems, pp. 1924-1934, 1989.

12. J. V. Briner, Jr., J. L. Ellis and G. Kedem, Breaking the Barrier of Parallel Simulation of Digital Systems, Proc. of the 28th Design Automation Conference, pp. 223-226, 1991.

13. R. E. Bryant, Simulation of Packet Communication Architecture Computer Systems, MIT-LCS-TR-188, Massachusetts Institute of Technology, 1977.

14. Cabrera, M. J. Chung and Y. Chung, A Parallel VHDL Simulator on the Connection Machine, Proc. of the VHDL Spring 92 Conference, pp. 83-94, 1992.

15. C. D. Carothers, K. S. Perumalla and R. M. Fujimoto, Efficient Optimistic Parallel Simulations Using Reverse Computation, Proceedings of the 13th Workshop on Parallel and Distributed Simulation, pp. 126-135, 1999.

16. R. Chamberlain, Parallel Logic Simulation of VLSI Systems, Proceedings of the 32nd Design Automation Conference, pp. 139-143, 1995.

17. K. M. Chandy and J. Misra, Distributed Simulation: A Case Study in Design and Verification of Distributed Programs, IEEE Transactions on Software Engineering, Vol. SE-5, No. 5, pp. 440-452, September 1979.

18. K. M. Chandy and J. Misra, Asynchronous Distributed Simulation via a Sequence of Parallel Computations, Communications of the ACM, 24(4), pp. 198-206, 1981.

19. C.-C. Chang, G. Czajkowski and T. von Eicken, Design and Performance of Active Messages on the IBM SP-2, Cornell CS Technical Report 96-1572, February 1996.

20. E. Chung and M. J. Chung, An Important Factor for Optimistic Protocol on Distributed Systems: Granularity, Proceedings of the Winter Simulation Conference, pp. 642-649, 1995.

21. M. J. Chung, Parallel VHDL Performance Simulation Cost and Schedule Report, Department of Computer Science, Michigan State University, 1997.

22. M. J. Chung and Y. Chung, Performance Prediction to Gate to Processor Ratio, Proc. 1992 International Conference on Parallel Processing, pp. 246-253, 1992.

23. M. J. Chung and Y. M. Chung, An Experimental Analysis of Simulation Clock Advancement in Parallel Logic Simulation on an SIMD Machine, Advances in Parallel and Distributed Simulation, Vol. 23, pp. 125-133.

24. M. J. Chung and Y. Chung, Efficient Parallel Logic Simulation Techniques for the Connection Machine, Proc. of Supercomputing, pp. 606-614, 1990.

25. M. J. Chung and Y. Chung, Data Parallel Logic Simulation Using Time Warp on the Connection Machine, 1989 Design Automation Conference, pp. 98-103, 1989.

26. M. J. Chung and J. Zhou, Version Control and Configuration Management in Simulation, Technical Report, Michigan State University, 1998.

27. M. J. Chung and J. Xu, An Overhead Reducing Technique for Time Warp, Proceedings of the IEEE International Workshop on Distributed Simulation and Real Time Systems, pp. 95-102, 2002.
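The priority rule among the three modes can likewise be summarized in a short C++ sketch; the names (selectMode, Mode) are again ours, and the eligibility predicates are reduced to the two tests described above.

    using Time = long;

    enum class Mode { Synchronous, Conservative, Optimistic };

    // gvt is assumed to be computed under the modified definition, so an
    // execution with time stamp equal to gvt can never be rolled back.
    Mode selectMode(Time ts, Time gvt, Time lpLowerBound) {
        if (ts == gvt)         return Mode::Synchronous;   // lowest overhead
        if (ts < lpLowerBound) return Mode::Conservative;  // provably safe
        return Mode::Optimistic;                           // must save state
    }

Only the optimistic branch pays for state saving; the other two execute exactly as a synchronous or conservative simulator would, which is why the unified approach degenerates gracefully to either technique when that technique alone would perform best.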