This is to certify that the dissertation entitled PARALLEL ASYNCHRONOUS PROTOCOLS ON PARALLEL AND DISTRIBUTED SYSTEMS presented by Eunmi Choi has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

PARALLEL ASYNCHRONOUS PROTOCOLS ON PARALLEL AND DISTRIBUTED SYSTEMS

By

Eunmi Choi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1997

ABSTRACT

PARALLEL ASYNCHRONOUS PROTOCOLS ON PARALLEL AND DISTRIBUTED SYSTEMS

By

Eunmi Choi

Parallel discrete event simulation has been widely used across a broad range of applications, especially for complex problems that require high performance computing. In this thesis, we perform experimental, theoretical, and simulation studies on parallel discrete event simulation of VLSI circuits. The primary goals are to study and to improve the performance of parallel asynchronous protocols on various parallel and distributed systems. Depending on the machine platform, the major focus of implementing parallel protocols changes, and a different perspective is applied to each architectural characteristic.

For parallel logic simulation, the significant architectural limitations of the MIMD platform are a relatively small number of processors and interprocessor communication overhead. Thus, our research work related to the MIMD platform concentrates on computation granularity, a factor that can be used to control the ratio of the computation amount to communication overhead in parallel processing. We analyze the effects of computation granularity on the performance of an optimistic protocol, and present an analytic formula to predict the total execution time in terms of computation granularity. In order to understand
various aspects of computation granularity, we perform experiments on the IBM SP2, and interpret the performance effects of computation granularity in terms of execution activity, rollback situations, communication overhead, and simulation progress. In addition, we propose efficient event scheduling schemes for the optimistic protocol. These schemes can balance the progress of logical processes, so that unnecessary computations and storage overheads can be reduced.

The SIMD platform is better suited to the parallel simulation of small logic circuits due to its massive number of processing elements and its faster interprocessor communication and barrier synchronization. On the SIMD platform, we investigate the effects of the characteristics of major machine operations on the performance of synchronous, conservative, and optimistic protocols. The experimental features are compared on several different machine architectures and the performance results are studied. We also study the effects of the machine-independent characteristics, which depend on the parallel protocols and application problems.

Through this thesis research, parallel asynchronous protocols are implemented and studied on various architecture platforms of distributed-memory systems: the SP2 as a parallel multicomputer system, a cluster of DEC Alpha workstations as a distributed system, and the MasPar MP-1/MP-2 as a SIMD system. According to the characteristics of the protocols and machine architectures, a different aspect of the study is emphasized on each platform.

© Copyright 1997 by Eunmi Choi
All Rights Reserved

To The Living God, My Parents, and My Husband

ACKNOWLEDGMENTS

Praise the Lord who lives and has guided us, whoever trusts in the name of our God! Through my long hardship and sorrow, the living God has provided all things to me abundantly. There is nothing that I can do but give thanksgiving and praise to Him!

I thank all of my committee members for their efforts, time, discussions, and challenging guidance through many meetings: Dr. Moon Jung Chung as the committee chair and advisor, Dr. Matt Mutka for his kindness and fair guidance, Dr. Herman Hughes for providing my studying place in the HSNP lab, Dr. James Harman for his wisdom and insight, and the late Dr. Carl Page for his warm care. Special thanks I want to give to Dr. Anil Jain, who is the Department Chairperson and an outstanding researcher, for his help, wisdom, and kindness when I had a hard time. Also, I thank Dr. Lionel Ni for supporting the facility of the ACS lab and for his kindness and insight, Dr. Richard Enbody for his guidance as a professor and boss when I was a student and a TA, and Dr. Donald Weinshank for guidance when I was a novice TA. In particular, I give many thanks to Dr. Rawle Hollingsworth in the Biochemistry department for his kindness, understanding, care, and financial support, especially during my difficulty at the end of the Ph.D. program, although I didn't deserve to receive his many favors.

I give many thanks to my parents, Mr. Yo-Sup Choi and Mrs. Sun-Ok Chai, for their deep and steadfast love over my life, their prayers, and their support. Without their love, I cannot be the person who I am. I respect them with all my heart and give all this honor to them. Also I give thanks to my elder brothers Mr. Yong Nam Choi and Mr. Gyu Nam Choi and sisters-in-law Mrs. Yun Hee Cho and Mrs. Bong Rim Choi for their encouragement and love.
I give many thanks to brothers and sisters in New Hope Baptist Church: Pastor Young Ho Cho and Mrs. Hyun Sim Cho for their love, sacrifices, and prayers; Dr. Sukjung Chang and Mrs. Boduk Chang for being the best living example to me and for their prayers; Dr. Hakjin Kim and Mrs. ija Kim for their care; my best friend Miss Kyungsook Lee for being with me, for her care, and for her prayers whenever I was hurt or lonely during my husband's absence; my friend Ms. Jisoo Hong for her prayers and our good friendship; Mr. Henry Kim, Mr. Eric Chang, and Mr. Thomas Kim for their prayers and ministries to the second generation of Korean Americans; my elder friend Dr. Sea Hwan Choi for his abundant love, care, and prayers; Mr. Young-Bong Kim and Mrs. Young-Hee Kim for their unceasing prayers and care; my friend Miss Barbara Czerny for her beautiful heart and our sharing in the Lord; and all His other people who cared, helped, and prayed for me.

There is a Great and Wonderful Gift in my life from God. He is my husband, Dr. Dugki Min, who has the warmest, most beautiful, and purest heart in the world. Every single moment my husband has been with me, giving unfailing love, encouragement, and comfort. Without him, I could not have tasted the beauty of life and the love of God. He is my happiness. He is my forever friend, a fellow Christian in one faith, the father-to-be of our children, the head of our family, and my husband. I want to give thanks to him always while we live, until God calls us to His Heaven.

This work was partially supported by DoD HPCMP Program GEN-3, Contract No. F33615-96-C-1913. The experiments on the SP2 were performed by using the machine at the Maui High Performance Computing Center. The MP-1/MP-2 of the Scalable Computing Laboratory of the Department of Energy's Ames Laboratory were used for the MasPar experiments.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

1 Introduction
1.1 Parallel Synchronization Protocols of PDES
1.1.1 Synchronous Protocol
1.1.2 Asynchronous Protocol
1.2 PDES on the MIMD Platform
1.2.1 Computation Granularity
1.2.2 Event Scheduling
1.3 PDES on the SIMD Platform
1.4 Thesis Outline

2 Research Issues and Related Work
2.1 Research Issues of Optimistic Protocol
2.1.1 Data Structures for Event Queues
2.1.2 GVT Computation
2.1.3 Cancellation Mechanism for Rollback Handling
2.1.4 Fossil Collection
2.2 Other Synchronization Protocols
2.2.1 Variations of Time Warp
2.2.2 Other Conservative Protocols
2.3 Considerations on Parallel Machine Architecture
2.3.1 Existing Parallel Computer Platforms
2.3.2 Number of Processors
2.3.3 Interprocessor Communication Overhead
2.3.4 Local Memory Space on a Processor
2.4 Parallel Logic Simulation
2.5 Partitioning Problem

3 Analytic Prediction of the Effects of Computation Granularity
3.1 The Model
3.2 Analysis
3.2.1 Execution Time per Simulation Cycle
3.2.2 Number of Simulation Cycles
3.3 Selecting the Target Machine Architecture
3.3.1 Characteristics of the Target Machines
3.3.2 Comparison of Experimental Results on the Target Machines
3.4 The Results
3.4.1 Workload Conservation Property
3.4.2 Comparison of Experimental and Analytical Results
3.4.3 Effects of Computation Granularity on Distributed Systems
3.5 Summary

4 Experimental Study of the Effects of Computation Granularity on the Performance of an Optimistic Protocol on the IBM SP2
4.1 The Simulation Model
4.1.1 Logical Process Allocation
4.1.2 Simulation Cycle
4.1.3 Event Scheduling
4.1.4 Data Structure of Event Queues
4.1.5 GVT Computation
4.1.6 Rollback Handling
4.2 Experimental Results
4.2.1 The Environment
4.2.2 Total Execution Time
4.2.3 Analysis of Experimental Results
4.2.4 Speedup
4.2.5 The Effect of a Circuit Size
4.3 Performance Comparison Between the Optimistic Protocol and the Synchronous Protocol
4.3.1 Timing Granularity
4.4 Summary

5 Event Scheduling Schemes for an Optimistic Protocol on Distributed Systems
5.1 Event Scheduling
5.1.1 Balancing Progress by Executing Chance (BPEC)
5.1.2 Balancing Progress by Virtual Time Window (BPVTW)
5.2 Experimental Results
5.2.1 The Implementation Model
5.2.2 Performance Measurement with the BPEC Scheme
5.2.3 Performance Measurement with the BPVTW Scheme
5.3 Summary

6 Comparison and Analysis of Parallel Asynchronous Protocols on Massively Parallel SIMD Architectures
6.1 Event Queue Manipulation
6.2 Related Instructions and their Performances on SIMD Machines
6.3 Comparisons of Massively Parallel SIMD Machines
6.3.1 Queue Space
6.3.2 Parallel Virtuality
6.3.3 Communication Scheme
6.3.4 Processor Capability
6.4 Experimental Results
6.5 Summary

7 A Study of Machine-Independent Characteristics of Parallel Logic Simulation
7.1 Critical Path Length
7.1.1 The CG Value
7.1.2 Effect of Critical Path Length
7.2 Parallelism of Simulation
7.2.1 A Refined Conservative Protocol
7.3 MTW Mechanism for Memory Management
7.4 Summary
8 Conclusion

BIBLIOGRAPHY

LIST OF TABLES

Performances of Communication and Array Access Operations on the MP-2 and Alphas
Experimental Results on the MP-2 and a Cluster of Alpha Workstations
Experimental Results on a Cluster of Alpha Workstations
Comparisons of Communication Measurements
Comparison Between the Sequential Program and the Optimistic Protocol
Comparison Between the Optimistic Protocol and the Synchronous Protocol
Performance of Instructions
Comparison of Congestion Effects
Experimental Results (Speed Ratio)
Fractions of Execution Time
Critical Path Lengths and the Numbers of Gates on the Benchmark Circuits
Performance of Parallel Protocols on the MP-1
Values of CG
Average Parallelism
Performance Comparisons on the Refined Conservative Protocol
MTW Sizes Applied to the Experiments

LIST OF FIGURES

2.1 A Configuration of Processes in a Logic Simulation
3.1 A Layout of Logical Processes
3.2 (a) The Workload Varying k When P = 5; (b) The Workload Varying P When k = 100
3.3 E(m) vs. k, Varying k from 0 to the LP Ratio
3.4 Comparisons of (a) S_r^k and (b) T^k
3.5 Experimental Total Execution Times with the s13207 Circuit
3.6 Experimental Total Execution Times with the s38417 Circuit
4.1 Total Execution Time
4.2 Total Execution Time (Close-up Between 1 and 300)
4.3 Activity Rate
4.4 Actual Activity Rate
4.5 Rollback Frequency (Close-up Between 1 and 300)
4.6 Rollback Frequency
4.7 Rollback Rate
4.8 Communication Time Varying the Number of PEs
4.9 Computation Fraction
4.10 Computation Fraction (Close-up Between 1 and 300)
4.11 Computation Fraction Varying the Number of PEs
4.12 Comparison Between Total Execution Time and Computation Time
4.13 Number of Executed Events per Cycle in a Processor
4.14 Number of Simulation Cycles
4.15 GVT Jump
4.16 Speedup
4.17 Effect of a Circuit Size
5.1 Total Execution Time with the BPEC Scheme
5.2 Number of Rollbacks with the BPEC Scheme
5.3 Total Execution Time with the BPEC Scheme for the s13207 Circuit
5.4 Effect of Computation Granularity on Total Execution Time by Varying the LP Ratio
5.5 Total Execution Time with the BPVTW Scheme
5.6 Average Number of Executed Events with the BPVTW Scheme
6.1 A CBSQ Structure
7.1 Total Execution Time vs. the Number of Input Event Vectors
165 7.2 The Relationship between the Length of Critical Path and the Value of CG 171 7.3 Parallelism on the Conservative Protocol Excluding Null Messages (C1908 Circuit) .................................. 172 7.4 Parallelism on the Optimistic Protocol (C1908 Circuit) .......... 173 7.5 Parallelism on the Conservative Protocol Excluding Null Messages (C6288 Circuit) .................................. 175 7.6 Parallelism on the Conservative Protocol Excluding Null Messages (C7552 Circuit) .................................. 176 7.7 Parallelism on Conservative Protocol Including Null Messages (C1908 Cir— ' cuit) .................................... 178 f 7.8 The BTW Algorithm ............................. 180 \ 7.9 Average Parallelism of the Refined Conservative Protocol ......... 181 ; 7.10 Relationship between the Number of Simulation Cycles and the Number of Input Event Vectors with the MTW Mechanism ........... 184 7 .1 1 Relationship between the Total Execution Time and the Number of Input Event Vectors with the MTW Mechanism ............... 185 Chapter 1 Introduction As multiprocessor systems become popular and simulation methodologies become more efficient, parallel simulation becomes one of the most widely used research tools for high performance computing problems, such as the design of very large scale integrated (VLSI) circuits. Circuit simulation plays an important role in the design and verification of VLSI Circuits. Circuit simulators are used to verify circuit functionality and to obtain de— tailed timing information before the expensive and time—consuming fabrication pro— Cess takes place [9]. As the size of VLSI chips have grown, circuit simulations have beCOme a critical bottleneck in the design of complex VLSI systems. The simulation of VLSI systems is performed at different levels of abstraction, ranging from archi- tectuI‘al, gate- and switch—level to circuit and device simulation [69]. Among several levels of simulation, gate—level simulation is popular because it can be implemented efIiQiently by using simulation methods and algorithms, and because the gate model is i 11 _ dlrect correspondence to the Boolean equation representation of digital designs [9]. ¥ Since logic simulation is an extremely important part of the design and verification of any digital system, it is necessary to investigate parallel algorithms in order to reduce large computing times of logic simulation. Parallel algorithms of parallel discrete event simulation are the appropriate solution for logic simulation. As a simulation methodology on parallel and distributed systems, discrete event simulation has received a considerable amount of attention due to its ability to achieve a high degree of parallelism. The discrete event simulation [10] has been a modeling approach in which the state of system is described by state variables and is changed according to the events occurring at discrete points in simulation time. The simulation is progressed by the changes of the system state due to the events in consideration of timestamps of the events. The concurrency in event processing provides a possibility to achieve a high degree of parallelism. Parallel discrete event simulation which PI‘OVides the parallel processing of event handling on a number of processors and the USed simulation protocols are the research topics in this thesis. 
In Parallel Discrete Event Simulation (PDES) [37, 33], the major problem is syn— ChrOnization, because the simulation system is composed of a set of logical processes and the events in the simulation are processed asynchronously by logical processes. In Order to coordinate the logical processes, a number of synchronization protocols have been proposed for PDES on parallel and distributed systems. The major two protOcols to handle the synchronization are synchronous and asynchronous protocols. Thesh tl'ltl'l b‘ lbllkd ync ronous pro oco manipu a es ogica processes y usmg a g o a c oc , an Sirr“111ates the task by executing events in chronological order. The asynchronous pro- to Q01 handles logical processes by using local clocks, and allows logical processes to ¥_____ proceed in virtual time. The asynchronous protocol has the benefit of concurrent execution of events at different points in simulated time. To control synchronization of event execution among logical processes, we categorize the parallel asynchronous protocol into two schemes: conservative and optimistic. Various conservative and optimistic protocols with innovative data structures and mechanisms have been pro- posed [22, 41, 35, 37]. These protocols are used to guarantee the correctness of the concurrent execution by advancing the parallel simulation asynchronously. These protocols resolve the synchronization that is caused by the concurrent execution of events at different points in simulated virtual time. This thesis investigates several issues in employing parallel asynchronous protocol of PDES on several machine platforms of parallel and distributed systems. Depending on the machine platform, the major focus of implementing parallel protocols changes and a different perspective has to be applied to each architectural characteristic. Par— allel and distributed computer systems are classified into SIMD (Single Instruction Stream, Multiple Data streams) and MIMD (Multiple Instruction streams, Multiple Data streams) platforms. The SIMD machines have a single instruction stream over Inultiple data streams and synchronize all the parallel execution units. The MIMD rna'Chines are composed of two big configurations; shared—memory multiprocessors and distributed-memory multicomputers [11]. Shared-memory multiprocessors have a. Sirlgle address space, while multicomputers use distributed memories with multi- Dle a«ddress spaces. Distributed—memory systems consist of multiple computers that Communicate by passing messages through an interprocessor communication network. T he following sections describe several parallel protocol schemes in detail and intro— ¥ duce some important research issues of parallel discrete event Simulation on the MIMD and SIMD machine platforms. 1.1 Parallel Synchronization Protocols of PDES Parallel discrete event simulation is the parallel simulation of a discrete event—driven program on parallel and distributed systems. To achieve the benefit of high concur— rency in parallel simulation, parallel protocols are developed so that a set of logical processes computes and generates events of discrete timestamps and data in parallel. The following presents the details of synchronous and two asynchronous protocols, i.e., conservative and optimistic protocols. 1 - 1 . 1 Synchronous Protocol SynChronous protocol is the most conservative scheme in PDES. The processing of logical processes is determined according to the global clock, which is the simulation time of the latest logical process. 
The local clock of a logical process is determined by the Smallest unprocessed event in the logical process, say Local Virtual Time (LVT). In Order to find the global clock, the minimum global time is computed by collecting all 1Ogical processes’ LVTs. It is called Global Virtual Time (GVT). Only if the LVT of a. logical process is the same as the GVT, the logical process can be activated and a'llc’VVed to process its unprocessed event. The advantage of synchronous protocol is in the simplest feature in parallel pro— Qe - . . . . . SS111g. Since each logical process never receives the earlier timestamped events than ¥ its previously-processed events, the event processing pattern on a logical process is the FIFO and the later arriving events do not change the former processed state. In other words, each logical process keeps the local causality constraint [37] of event processing. However, the drawback of synchronous protocol is the necessity of com- puting GVT in the every step of processing events. When the synchronous protocol is used on parallel and distributed systems, it is essential to use architecture platforms which provide a fast global reduction over the processing elements, or necessary to use a developed algorithm for computing GVT. 1 . 1 . 2 Asynchronous Protocol Parallel asynchronous protocol mainly handles logical processes by using local clocks, and allows logical processes to proceed in virtual time so as to have the benefit 0f Concurrent execution of events at different points in simulated time. To control SyIIChronization of event execution among logical processes, we categorize the parallel asynchronous protocol into two schemes: conservative and optimistic. CoIFIServative Protocol CorlServative protocol processes events not only conservatively but also more con— Currently than the synchronous protocol. Each logical process allows events to be proc(Essed only if it is sure that no event with an earlier timestamp can arrive. In Other words, when a logical process assures that there is no further incoming events Whose timestamps are less than t1, it can process an event with timestamp t1. This CC) . . . r18Ervative property makes each logical process compute events In nondecreasmg ¥ timestamp order by keeping the local causality constraint. The communication connection between any two logical processes is established by a link. A logical process has its own clock which is updated by incoming events. Each input port has a link clock, which is decided by the timestamp of the recently arrived event in the port. Local clock is defined as the minimum link clock at the logical process. The local clock based on link clocks is the low bound of timestamp that the next incoming event may have. By satisfying local causality constraints, thus, there is no incoming event which has smaller timestamp than the local clock. The logical processes, which have unprocessed events whose timestamp is less than or equal to local clock, are able to active to process the events. Without waiting for any delayed events from parent logical processes, a logical process is able to work On safe processing and does not break the non-decreasing timestamp ordered event processing and sending. Since arrived events in each input port are in non—decreasing timestamp order, a FIFO list can be used as an event queue of each input port as is in S.Ynchronous protocol. 
Since each logical process sends an output event only if the condition of the local canSality constraint is satisfied, a logical process may infinitely wait for an incoming event which never occur. It results in the deadlock situations. For example, when ther e is a cycle in the path of connected logical processes, it produces an unavoidable dea“dlock situation. Chandy and Misra [22] proposed null message to avoid deadlock Sitllatfions. A null message with timestamp t2 sent from logical process 2' to logical pr()Cess j represents a contract that logical process i will not send logical process j a'l'ly event of timestamp smaller than t2. It provides the low bound of the next ¥n . incoming events on the link between the two logical processes, and makes logical processes proceed with the new link clock instead of an infinite waiting. The null messages are used to facilitate event processing as well as to avoid deadlocks. When a logical process does not have any sending event even if the local clock is changed, the logical process sends a null message with the timestamp of local clock, so that the child logical process could work on the next possible processing as much as it can. Local clocking and null messages, thus, increase the parallelism of processing in conservative protocol by accelerating parallel processing. Optimistic Protocol The optimistic protocol allows logical processes to proceed without considering the local causality constraint, and detects and recovers whenever any causality error is detected. The most well—known optimistic protocol is the Time Warp protocol by Jefferson [41, 43]. In Time Warp, the virtual time is used to define synchronization aInong logical processes by coordinating their simulation time. Logical processes execute their current queued events in increasing order by timestamp whenever there is an unprocessed event in one of their event queues without waiting for any other 81334118 or constraint. Since this optimistic characteristic makes each logical process propagate immature events, it can result in a situation where a logical process receives an event with an earlier timestamp than the previously—processed events. Thus, the prot()col has to recover the earlier system state from the erroneously—advanced state when an event arrives whose timestamp is smaller than the current LVT. This is (1 a'lled a rollback situation, and the event that causes a rollback situation is called a ¥ straggler. To handle the rollback situation, the optimistic protocol has to use a cancellation mechanism; correcting incorrectly generated events that were already sent and up- dating the related system status. The rollback situation yields another major aspect to be considered: a storage overhead problem. In order to cope with any unexpected rollback situations, logical processes must keep all previously processed events. Since the GVT is the smallest timestamp value that cannot be rolled back by a straggler, fossil collection is performed to reuse the storage of event queues by removing pro— cessed events with smaller timestamps than GVT from the queue so that a process can keep its memory usage to a minimum. More details about Time Warp are described in the rest of this thesis. 1 -2 PDES on the MIMD Platform In the MIMD machine platform, the systems employed in this thesis are distributed— ITlemory multicomputers. 
Distributed-memory systems consist of multiple computers interconnected by an interprocessor communication network and communicate by passing messages through the interconnection network. Not only on the distributed Systems, but also on the shared—memory systems, many researchers have studied of P DES [31, 36, 66, 47]. Efficient simulation of large application problems such as VLSI CirCuits on distributed—memory systems requires a large number of processors and is 1iIllited by the scalability of existing architectures. Due to this limitation on MIMD InEtchines, a physical processor must be assigned more than one logical process. The number of logical processes allocated per processor is determined by the LP ratio — the number of logical processes to the number of processors. The major consideration on a distributed—memory system is interprocessor com— munication overhead via message passing. Instead of sharing a memory over all pro— cessors, a distributed-memory system communicates between processors by means of sending and receiving messages. In order to minimize the communication overhead through the interconnection network, a processor has to control how many events are executed and decide how often the processor communicates with other proces- sors- The computation granularity, which is the computation amount of processing events between any two consecutive communication points, is proposed as a solution to resolve the issues. Event scheduling is another consideration of parallel discrete event simulation On the MIMD machine platform. Under the situation that the LP ratio is larger than 1, many logical processes have to be handled by a processor. Between any two Communication points, thus, each processor decides which events are first executed a'Inong many unprocessed events from the allocated logical processes. In the following Subsections, the research issues of computation granularity and event scheduling are d18cussed. 1 . 2.1 Computation Granularity Oomziutation granularity is the computation amount on a processor between any two Consecutive interprocessor communication points [23]. In reference to the quantity of ¥—_—_—¥ 10 the computation granularity, we use the terms computation grain size and computa- tion granularity interchangeably. The computation granularity is mainly used for the optimistic protocol in this thesis with several aspects of usage. Recall that due to the limited number of processors in parallel and distributed systems, we have to assign more than one logical process to a processor. After executing several events, proces- sors collect the generated events from the allocated logical processes into messages, each of which is sent to the destination processor, and send the messages through the communication network. In order to clarify the definition of computation granularity, it may be helpful to compare computation granularity to event granularity. Event granularity is de- tGBI‘IIIined by the computation amount of a single executed event, which is called the event grain size. Depending on the event grain size, event granularity can be cate- gorized into fine, medium, and coarse grain. The size of an event is decided by the nLlInber of instructions required to compute the event on a processor. For example, manipulating an event on a VLSI circuit usually requires around a hundred machine iIIStructions, thus the event granularity is small. 
If a processor only executes an event between two consecutive communication points, the computation granularity is the event granularity. In contrast, if a processor executes k events between any con— SeCutive communication points, then the computation granularity is approximately 1" times larger than the event grain size. Some researchers have studied the effect of event granularity on the performance of parallel simulations [73, 39]. In [19], an experimental study has investigated the effects of communication overhead on the performance of Time Warp, considering two types of simulation applications; one 9—; 11 with a large event granularity and the other with a small event granularity. In this study, the computation grain size was the event grain size by executing one event before communication. According to their performance results, the communication latency in distributed computing environments can significantly degrade the efficiency of Time Warp in the application problems which contain a large number of logical processes with a small event granularity. In contrast, for applications having large grained events, the communication latency appears to have little effect on the per— formance of Time Warp. Most implementations of Time Warp on shared memory systems compute an event and then communicate with other processors. Since the memory access time to event queues is almost the same as the event propagation time to other processors on a shared memory system, the ‘communication-per—event’ algorithm is common. As an exception, Das et al. [31] discuss the possibility of im— DI‘OVing performance by processing batches of k events, where k is the batch size. They claim that the batch processing approach reduces queue management overhead. In distributed—memory systems, the communication latency is long, and therefore the Computation granularity is an important issue to be studied. In addition to the control of interprocessor communication between processors, Computation granularity contributes to the control of the degree of optimism of event processing on the optimistic protocol. Adjusting the degree of optimism is the major SOlution to achieve better performance on Time Warp. High optimism could cause the progress of some logical processes to be too far advanced compared to other logical processes so that frequent rollback situations occur. In order to keep the proper degree of Optimism, each processor should not execute all possible logical processes although 12 the processor has more logical processes that contain unprocessed events. The degree of optimism can be controlled by setting the maximum bound of the number of executed logical processes before performing interprocessor communication. As for the controlling purpose, we use a parameter called the Maximum Computation Grain Size (MCGS). In view of the degree of optimism, the computation grain size should be small enough so as not to cause too far advanced logical processes and generate too many rollback situations. If the computation grain size is too small, however, the optimistic protocol be— haves like a conservative protocol or even like a synchronous protocol. In this case, Only a few events are processed out of many logical processes between consecutive Communications. It makes the progress of the whole simulation slow, and most ex— eCution time is spent in communication. Thus, the computation grain size should be large enough to prevent the communication overhead from dominating simulation performances. 
By using a large MCGS value, we can control the amount of computa- tion enough to avoid frequent communication situations. Maintaining the adequate Computation amount between communication points also decides the overall progress to complete the simulation. A small portion of computation requires a long series of CYCIes of communication and computation, which may yield more frequency of com— Inllnication until the simulation is complete. Therefore, the computation granularity is the parameter that should be tuned properly to achieve better performance. 13 1.2.2 Event Scheduling An efficient event scheduling on the optimistic protocol results in an appropriate execution order of events and reduces the unnecessary computations and storage overheads. In this thesis, two factors are considered in order to design an efficient event scheduling strategy for Time Warp on the MIMD machine platform. The first factor is the amount of computation to be processed in a processor between two con- secutive communication points. The computation granularity must be small enough to prevent Time Warp from processing events over-optimistically, but large enough to minimize the degrading effects of performance due to the communication over— head. The second factor is the balancing progress of logical processes in a processor. Within the range of the given computation granularity, logical processes assigned to a processor progress at different speeds and this may cause more rollback situations. Considering these two factors, we propose two event scheduling schemes: the Bal- ancing Progress by Execution Chance (BPEC) scheme and the Balancing Progress by Virtual Time Window (BPVTW) scheme. The BPEC scheme controls the progress of a process by limiting the number of chances of the process to schedule events, while the BPVTW scheme controls the progress by using a virtual time window for each process, which allows only events whose timestamps are within the interval of the virtual time window to be executed. 14 1.3 PDES on the SIMD Platform Parallel processing on the MIMD machines has advantages over the SIMD machines in several aspects, such as general computational capabilities and asynchronous compu— tation on each processor. However, one significant drawback of the MIMD machines is that they usually have a small number of processors, compared to the massively parallel SIMD machine. The SIMD machines can achieve massive parallelism with a. small-grain parallel processing elements (PES). Since the VLSI circuit simulation requires a small—grain operation of a gate, and a large number of gates, the simula— tion model is well suited for the massively parallel SIMD environment. The SIMD platform also offers a faster barrier synchronization and a communication router that can be performed by an instruction, which are crucial to several aspects of the major Performance of parallel simulation. Since the massively parallel PE nodes are able to SUStain the model that the LP ratio is 1, each processor is assigned a logical process so that the SIMD machine platform will not have any serious problems for circuit partitioning and load balancing. However, the SIMD machine platform also has limitations: the number of pro— ceSSOrs compared to the number of logical processes on a large application problem, and the local memory space on a processor. To compute a large application problem having hundreds of thousand of logical processes such as a VLSI circuit, the SIMD Ina"Chimes are short of processors. 
To cope with the problem, several logical processes may be assigned into a processor. The small local memory space, however, has the li - . mltation to carry many loglcal processes. Thus, only small circuits, which have a ‘n 15 smaller number of logical processes than the number of processors or have the almost same number of logical processes, are suitable for parallel simulation on the SIMD platform. To study the parallel synchronization protocols on the SIMD machine, the ma— jor features considered are the processor capability, the interprocessor communica— tion scheme, and the global synchronization. By comparing computation times of the related instructions, the performance of parallel synchronization protocols on a SIMD machine can be examined. In addition, another factor that affects simula— tion performance is the use of an appropriate data structure with consideration to the architectural characteristics and the simulation model. Also, the improved algo- rithms according to the machine architecture affect the performance results of parallel Protocols. 1 -4 Thesis Outline The remainder of this thesis is organized as follows. Chapter 2 presents the general I“asearch issues and their related work by focusing on the important research topics of PD ES. First, the research issues of parallel optimistic protocol and the related work are dEscribed. Next, major features of existing machine architectures, which affect the Simulation performance on PDES, are discussed: the number of processors, the interprocessor communication via the interconnection network, and the local memory Space per processor are the features considered. Finally, the application problem 0 . . . . . . . . . . f t-1‘11s thes1s, the VLSI c1rcu1t system, Is 1ntroduced W1th characterlstlcs of logic -+ 16 simulation. Chapter 3 presents an analytic model to study the effects of computation gran- ularity on parallel and distributed systems. Analytical formulas are derived, based on the model, to predict the performance metrics, such as the total number of sim- ulation cycles and the total execution time. The formulas can be used to analyze the performance as a function of computation granularity when each processor is as- signed many logical processes and a barrier synchronization is used for computing GVT. The predicted results, verified by the experimental results on a cluster of DEC Alpha workstations, show that the computation granularity significantly affects the simulation performance. Chapter 4 performs a set of experiments on the IBM SP2 to study the effects of computation granularity. We measure and interpret the performance effects of com— putation granularity in perspectives of execution activity, rollback situation, com— munication overhead, and simulation progress. According to the analysis, a large Computation granularity increase the degree of optimism, resulting in frequent roll— back situations, while a small computation granularity causes a high communication overhead and a slow simulation progress. Chapter 5 studies the issue of event scheduling strategy for the optimistic protocol on parallel and distributed systems. Efficient event scheduling can make an appro- priate execution order of events and reduce unnecessary computations and storage Overheads. In this chapter, we propose two efficient event scheduling schemes, called the BPEC scheme and the BPVTW scheme, by considering computation granularity a fig} the balancing progress of logical processes. 
The BPEC scheme restricts the num— ‘ 17 ber of executed events per logical process between interprocessor communications, while the BPVTW scheme does not allow an event to be executed if its timestamp is far ahead of those on the other events in different processes. In Chapter 6, as target machine architectures for logic simulation of small circuits, two massively parallel SIMD machines are compared by their machine-dependent characteristics. The employed parallel protocols of PDES are synchronous, conserva— tive, and optimistic protocols. The objective of the comparison is to understand the effects of machine characteristics of SIMD machines on the performance of parallel Simulations. For the efficient implementation on SIMD machines, new data structures 0f eVent queues and the queue manipulation mechanism are proposed. Chapter 7 presents the effects of machine—independent factors on performances of PDES and discusses the relationship between the factors. The characteristics consid— ered are the length of critical paths, the parallelism of simulation, data structures of event queues and the related schemes, the number of processes, and the number of inp ut event vectors. This study of the inter—relationship among machine—independent factors helps to understand the performance characteristics of the given application problem and the employed parallel protocols. Finally, Chapter 8 summarizes the major contributions of this research. Chapter 2 Research Issues and Related Work Parallel discrete event simulation (PDES) consists of the simulation protocols, the application problem, and the parallel or distributed machine system. The perfor- mance of PDES is affected by the characteristics of the three components. This Chapt er presents the general research issues and their related work by focusing on the iInDOI‘tant research topics of PDES. Among several parallel synchronization protocols, Time Warp is mainly focused SinCe it gives many challenging research issues. The data structures for event queues, the GVT computation, the cancellation mechanisms for rollback handling, and the fossil collection are studied in this chapter. The variations of Time Warp and con- SerViitive protocols that have been proposed by other researchers are also introduced. Since this thesis studies the parallel protocols on various architecture platforms, the maul or features on existing machine architectures, which mainly affect the performance on 1DDES, are studied: the number of processors, the interprocessor communication v‘ . . 1a the interconnect1on network, and the local memory space per processor are the 18 ¥_——_— 19 features considered. In order to understand characteristics of the application problem, the unique characteristics of logic simulation are described. Finally, the related work on the partitioning problem in simulating VLSI circuits on parallel and distributed systems is investigated. 2. 1 Research Issues of Optimistic Protocol The most well—known optimistic protocol is Time Warp [41]. In Time Warp, when- ever logical processes have unprocessed events, the logical processes execute events 1" egardless of causality constraints [37] of events. Since each logical process executes and propagates immature events, the protocol has to recover the earlier system state from the erroneously—advanced state when genuine events arrive. The recovery in— VOIVeS undoing execution of immature events. The genuine events are executed and propagated to other logical processes by correcting the erroneously—sent events. 
Each logical process progresses by processing the received events. The current progl‘ess of a logical process is indicated by the Local Virtual Time (LVT) of the logical ply"()(3ess, which is determined by the minimum timestamp of unprocessed events in the logical process. A rollback situation occurs when an event called a straggler arrives, which has a timestamp smaller than the current LVT of a logical process. When a Stra'ggler arrives, the logical process has to cancel wrong events, recover the earlier System state from the erroneously-advanced state, and execute events from the point of the straggler’s timestamp. To cope with rollback situations, each logical process must store all events even after the events are executed and sent. Fossil collection is ‘____—__— 20 necessary to reuse the storage of event queues; when an event queue is full enough, the logical process removes from the queue the processed events whose timestamps are smaller than the Global Virtual Time (GVT), which is the minimum timestamp of all LVTs. 2.1.1 Data Structures for Event Queues In P DES, logical processes contain events that have been generated or received but not yet executed, and are pending in the input event queues. Logical processes manage the pending events and let the processors schedule and execute the events according to the event scheduling scheme. The data structure implementation of event queues has an important influence on the simulation performance. G”013811 Event Queue Global event queue is the common event queue which contains all unprocessed events of 1c)gical processes. The global event queue is usually used for sequential program and Synchronous protocol. On the sequential program and synchronous protocol, the event with the smallest timestamp is always executed first among many unprocessed events. According to this algorithm, which is known as smallest—event-first processing, the priority queue is used for the global event queue. There are several queue structure for eVent queues of PDES, such as heap, splay tree [71], calendar queue [17], and skew heap [44]. Moreover, the priority queue for the global event queue is used for Time Vva'I‘p especially on the shared-memory machine platform. Jefferson [42] proposed (1 . a'IlCelback protocol for storage management and presented that an effic1ent data ‘—______—__ . 21 structure of event queue reduces the number of rollbacks in Time Warp. Ronngren et al. [68] proposed an improved version of the skew heap for event queues on a shared-memory implementation. Individually Distributed Event Queues In contrast to the global event queue, individually distributed event queues have logical processes keep their own event queues in which all unprocessed events are pending. The distributed event queue is suitable for PDES in distributed-memory machine platform. Each logical process uses the local memory in a processor, stores eVents in the local area, and processes them by accessing and proceeding concurrently. USing the individual event queue per logical process also simplifies migration of a logical process to another processor in a dynamic load management scheme. When the individually distributed event queue is used, all the events are sorted in the increasing order by timestamp in each logical process’s event queue. Also, a SeIDaJate output event queue is not necessary by keeping the previously—processed events into the event queues as well. 
Each event queue has three pointers; the bottom to point out the very first event, the front to point out the first unprocessed event, and the rear to point out the last event. After executing an event placed in the front, the front pointer moves to the next unprocessed event. The processed event stays in tlie queue until a fossil collection performs. The already processed event may be Ilolled back by stragglers. Instead of using separate output event queues, using a single event queue which can be used as both input and output event queues can reduce the qIlene manipulation time when a straggler arrives, by eliminating the corresponding ‘_—_—— 22 event moving and queue handling from output event queues. Output Event Queues In Time Warp, an output event queue contains negative copies of events that a logical process has recently sent, which is kept in virtual time order [41]. It is used in case of rollback situation. The output event queue, however, is a necessary structure when the global event queue is used in Time Warp. It is because the previously—processed event is to be dequeued from the event queue in order to let the other unprocessed events be in order in the global priority queue. The previously-processed events are en_ k. Thus, fmi(a:) 2: Pr[n,- Z :13] if :1: = k. Note that m,- has the truncated distribution [57] at k. E] The cumulative distribution function (c.d.f.) of m,- is given by Fm,(:c) = 1—Pr[m,->a:]=1—Pr[n,~/\k >12] 1—Pr[n,->:r]=-—Fm(:1:) ifx_k If m1,m2, . . . , mp have the identical distribution, say Fm(a:), and they are indepen— dent, then the c.d.f. of mmam is computed as F mmaz (:13) = Pr[mmaxgx] = PTle’mz‘ S Ivll 55 = Pr[m1 3 cc] Pr[m2 S 1:] - - -Pr[mp g as] = Pr[m§:t]’D = Fm(:1:)’D (3.3) The assumption that m1, m2, . . . , mp are identically distributed may be true in our simulation model, because logical processes are randomly and evenly assigned to processors. The assumption that m1, m2, . . . , mp are independent may not be correct because of the task dependency among logical processes when real logic circuits are simulated. Due to the second assumption, the analytical results in the next section show a little difference from the experimental results. Since mmax is a non-negative integer value, E(mmax) : 2:0(1 — meu (1:)) [67]. Based on the c.d.f. of mmax, the expectation of mm“, E (mmam), is obtained as follows. E(mm...) f: <1 — Frame» k—l = an — Fm-..> k—l : k '" 230(Fm($)P) (3'4) Based on Equations (3.1) and (3.4), the expected execution time per simulation cycle is as follows: E(Tyfiax) : tproc + E(mmax) [tcomp + w (1 "' Q) tmovel + (P — 1) tcomm + “93:53) 56 3.2.2 Number of Simulation Cycles Another component of the total execution time is the number of simulation cycles, Sf. In this subsection, Sf is derived when the maximum computation grain size is k and the LP ratio is r. As a preliminary step, we define the workload of parallel simulation. Definition 1 Let W1." be the workload of the simulation when the maximum compu- tation grain size is k and the LP ratio is r. Then, er is defined as the total number of executed events by all processors during the simulation. By definition, er is computed by multiplying the total number of simulation cycles by the total number of executed processes in a simulation cycle, that is, Wr’“ = (Number of simulation cycles) * (Total number of executed processes in a cycle) P i=1 where m is the average of m1, m2, . . . , mp. Let E(m) be the expectation of m. When P is large, # is near 1. Hence, we can use E(m) for m. 
Since m1, m2, . . . , mp have the same distribution, we can use E(m) for E(m). Applying Equation (3.4) with P = 1, we have k—l E(m) = k — §Fm(x) (3.7) In order to derive an analytical formula for the number of simulation cycles, we use a property called the workload conservation property. (3.6) 57 Workload Conservation Property: The workload is preserved for all values of r and k, i.e. W11 2 er foranykandr,0§kgr This property says that the total number of events executed during the simulation is preserved regardless of the values of k and r. That is, the number of executed events when each processor has one logical process and at most one event is executed per logical process, is equal to the number of executed events when a processor has more than 1 logical process and executes at most one event from each logical process. Note that the value of W11 contains the total number of executed events including the immature events due to optimistic processing when as many processors are used as are required logical processes (the LP ratio 1). In [29], Chung and Chung derived the number of simulation cycles in the same manner. We prepare two figures in Section 3.4.1 which support this property. By using Equation (3.7) and the workload conservation property, the number of simulation cycles can be derived as follows. Theorem 2 When the maximum computation grain size is k and the LP ratio is r, the number of simulation cycles, S”c is computed as follows: T7 S’ar SkN 1 . N W (3'8) where a is the average activation probability of a logical process. 58 Proof: Ifr = 1, k =1, and P = N, then W,1 .—. 311Na (3.9) Ifr>1andk21,thenP-——|'¥] and Wk 2 5’mean N z Sf [71E(m)l(.,k) (3.10) Due to the workload conservation property, W11 2 Wf implies SllNa x S”? l¥l E(m)l(r,k)' Thug, 5’“ x S] N a = $11 a r T If] E(m)l(r.k) E(m)|(r.k) Cl Therefore, the total execution time can be derived from Equations (3.5) and (3.8): Tk = SkE(rk ) (3.11) 3.3 Selecting the Target Machine Architecture Before performing a set of experiments in order to evaluate our derived analytical formula, we compared two typical parallel machine platforms, the SIMD and MIMD 59 platforms, to determine a proper machine architecture for simulating large VLSI cir— cuits. SIMD machines have a single instruction stream over multiple data streams and all of their parallel execution units are synchronized. In contrast, the processor nodes in the MIMD machines asynchronously execute multiple instruction streams over multiple data streams. This difference in architectural philosophies necessitates different, respective implementations of the optimistic protocol. Due to such different machine characteristics, it may be difficult to compare the performance of the opti- mistic protocol on the two machine platforms. Thus, what we do is to attempt to prove that a cost-effective cluster of workstations is a promising platform for PDES of large VLSI circuits in comparison to SIMD architectures. As target machines, a cluster of DEC Alpha workstations as a MIMD machine is compared to the MasPar MP-2 SIMD machine. In the rest of this section, we compare the architectural char- acteristics of the two machines. Also, experimental results on both machines with several benchmark circuits from the ISCAS ’89 benchmark suite [15] are presented. 3.3.1 Characteristics of the Target Machines The employed target machines are the MasPar MP-2 with 4K PEs and a cluster of 6 DEC Alpha 3000 workstations interconnected by a DEC GIGAswitch via FDDI. 
The GIGAswitch supports a peak data transfer rate of 200 Mbits per second. For the MP-2, MPL (Massively Parallel Language on the MasPar) [53] is used. For parallel distributed computing on the distributed system, PVM 3.3 (Parallel Virtual Machine) [65] is employed. Among many architectural characteristics, only those related to parallel logic simulation are considered. Since performance results of parallel simulations are dependent on the execution time of machine instructions, it is essential to compare the major instructions on the target systems. We compare the instructions and the characteristics needed for interprocessor communication, global synchronization, and the access operations to event queues. In addition, the number of processors and the local memory space per processor are also considered.

Interprocessor Communication and Global Synchronization

The target machines have local memories distributed over processors without any globally shared memory space. Processors communicate with one another by passing messages over an interconnection network. One good characteristic of the MP-2 architecture is that it has built-in global router connections for interprocessor communication. The global router is a bidirectional communication instruction and a circuit-switched style network organized as a three-stage hierarchy of crossbar switches supporting point-to-point communication between any two processors [14]. Also, the reduction to compute the minimum value among all processors can be performed by a single instruction within a short amount of time via the global router.

In contrast to the built-in global router on the MP-2, a cluster of workstations uses a local area network (LAN), so that interprocessor communication is accomplished via system software, an operating system, and a message passing package. The collective communication time on a distributed system is much longer than any point-to-point communication time. Table 3.1 shows comparisons of communication and barrier synchronization performances between the machines.

                  Communication   Synchronization   Array Read   Array Write
    MasPar MP-2         385              71             5.47         5.86
    DEC Alphas         3900           11700             0.08         0.09

    Table 3.1: Performances of Communication and Array Access Operations on the MP-2 and Alphas

The processing time of 1-integer data is measured for communication, barrier synchronization, and array accesses on the MP-2. On the Alpha cluster, the round trip time of 1-integer messages is measured for the interprocessor communication time. To measure the barrier synchronization time on the Alpha cluster, the master workstation sends a 1-integer multicasting message by using pvm_mcast() after collecting 2-integer messages from five slave workstations. The time measured includes the time spent in the system call gettimeofday(). μseconds are used as the unit of measurement.
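The round-trip measurement on the Alpha cluster described above might look roughly like the following C/PVM sketch; the peer task identifier, its echoing behavior, and the message tag are assumptions made for illustration, not the thesis code.

```c
/* Sketch: round-trip time of a 1-integer PVM message, timed with
 * gettimeofday().  The peer task (peer_tid) is assumed to echo the
 * message back with the same tag. */
#include <pvm3.h>
#include <sys/time.h>

#define MSGTAG 7   /* arbitrary message tag (assumption) */

double round_trip_usec(int peer_tid)
{
    struct timeval t0, t1;
    int value = 1, echoed;

    gettimeofday(&t0, NULL);
    pvm_initsend(PvmDataDefault);      /* pack and send one integer   */
    pvm_pkint(&value, 1, 1);
    pvm_send(peer_tid, MSGTAG);
    pvm_recv(peer_tid, MSGTAG);        /* wait for the echoed message */
    pvm_upkint(&echoed, 1, 1);
    gettimeofday(&t1, NULL);

    return (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
}
```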
Computation Capability

The parallel discrete event simulation requires many event manipulations, such as event enqueuing, accessing of events, event comparisons, event dequeuing, and event queue status checking. Since the event processing time depends on the processor's capability, the simulation performance is affected significantly by the computational capability of the machine architecture. On the MasPar, most data movement within a processor occurs on the internal 4-bit nibble bus and the 1-bit bus [61]. Compared to other SIMD machines such as the Connection Machine CM-2, the MasPar has more computational capability through such means as the use of an on-chip register set per processor and the 4-bit nibble bus. The distributed system is composed of DEC Alpha 3000 workstations whose clock rate is 133 MHz. Table 3.1 shows the processing performance of array read and write operations on the MP-2 and the Alpha. The Alpha shows much faster processing times than the MP-2 on such operations.

Local Memory Space and Number of Processors

Most SIMD machines have small local memories. For example, the Connection Machine CM-2 has 8KB of local memory, the MP-1 16KB, and the MP-2 64KB. However, the number of processors in SIMD machines is quite large. The MasPar 1200 family has up to 16K PEs and the CM-2 has up to 64K PEs. In contrast, the distributed system has more local memory, but a small number of processors. Most workstations, such as the SUN Sparc 10 and the DEC Alpha 3000, have 32MB or more of memory. The interconnection network between processors in a distributed system with a LAN and sub-networks of the LAN usually connects only several (in the 10's) machines, resulting in a limited number of processors. In order to connect several hundred workstations to increase the number of processors, we have to use bridges and routers that link several sub-networks, requiring extra communication time.

In logic simulation, VLSI circuits have a large number of gates. When the number of gates is larger than the number of processors, many gates must be assigned to a single processor. In this case, SIMD machines are limited by their small local memory space. Distributed systems, however, can take advantage of their large amount of local memory to simulate huge circuits by accommodating many gates.

3.3.2 Comparison of Experimental Results on the Target Machines

In this section, we compare the experimental results of the optimistic protocol on both machines with several benchmark circuits. As benchmark circuits, we use several circuits from the ISCAS '89 benchmark suite. Through a pre-processing stage, the circuits are transformed to a configuration with a maximum of 3 input and 2 output ports by adding zero-delay gates. Other normal gates have unit delays. Note that the restricted number of input and output ports on a logical process is good for parallel processing, especially on the SIMD machine: each instruction on the massively parallel processors of a SIMD machine is synchronized by the front-end processor, so the restriction reduces time delays due to processor execution synchronization. The clock interval in a D f/f is the same as the interval of input events.

On the distributed system, the optimistic protocol is implemented as a master-slave model, where the master processor initially distributes the tasks to slave processors, checks the correctness of the GVT during the simulation, and collects information and results from the slave processors after completing the simulation. The master processor does not interrupt processors' executions during the simulation. The parallel processing of a given application problem runs on the slave processors, which compute events and communicate with other slave processors in the form of SPMD.

In order to interpret the performance results of the target machines, we describe the implementation and data structure for event queue manipulation. For the MP-2, a circular array data structure called the Circular Binary Search Queue (CBSQ) is used.
More details are described in Chapter 6. Since a processor of the MP-2 has only a small amount of local memory, event queues are implemented by a data structure that can reduce memory usage. The CBSQ not only conserves memory due to its circular structure, but also allows event queues to be handled more efficiently by using binary searches for rollback situations and fossil collections. The access time to the CBSQ is shown by the array read and write operations on the MP-2 in Table 3.1.

On a distributed system having few processors, the LP ratio must be larger than one. The large amount of memory of a workstation should be shared efficiently by many processes. For this purpose, the event queues are implemented by a linked structure on the Alpha cluster. The enqueuing time measured was 8.19 μseconds, including the time spent in the malloc() system call. The dequeuing time takes 3.18 μseconds. To improve the performance of event queue manipulation, a free space pool is used. Instead of freeing the reserved space when dequeuing, the free space pool temporarily holds the released space for future use to avoid frequent calls to malloc() routines. When the free space pool is used, the average enqueue operation time is reduced to 4.81 μseconds. Although the computation capability of Alpha workstations is much better than that of processors in the MP-2, as shown in Table 3.1, the queue manipulation times on the two machines do not differ greatly because of the different implementation and manipulation of event queues.

Table 3.2 shows the experimental results of Time Warp on the MP-2 and a cluster of Alpha workstations. The number of gates, the measured total execution times, and the performance ratio of total execution times on the target machines are shown for six benchmark circuits. Seconds are used as the unit of total execution time.

              No. of     Total Execution Time (T)     Performance Ratio
    Circuit   Gates        MP-2         Alphas         (Alphas / MP-2)
    S953        715        11.57         47.09              4.07
    S1196       967        16.41         70.16              4.28
    S1238       988        16.62         71.44              4.30
    S1423      1161        11.98         52.79              4.41
    S1488      1456         9.43         33.12              3.51
    S1494      1462         9.48         33.21              3.50

    Table 3.2: Experimental Results on the MP-2 and a Cluster of Alpha Workstations

Since the MP-2 has 4K PEs, the benchmark circuits whose numbers of gates are less than 4K are selected from the ISCAS '89 benchmark suite. The number of initial input event vectors for the simulation was 1000. On the distributed system, the computation grain size is maximized to keep the original Time Warp concept and to compare its results with those of the MP-2 under the same conditions. That is, all processes are allowed to execute events whenever they have unprocessed events. According to the results, the MP-2 produces about four times faster performance than the cluster of Alpha workstations. Since the queue manipulation times on those machines are similar, as explained above, the major difference in the results is caused by the high amount of computation due to having many assigned logical processes per processor, as well as the difference in interprocessor communication and barrier synchronization. When the distributed system has more processors, this performance gap will be reduced. Due to having many allocated logical processes in a processor of the distributed system, the amount of computation and the time spent managing memory for all logical processes lead to poorer processor performance. Moreover, the distributed system has the potential of being used for much larger VLSI circuits than the MP-2 can simulate.
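The free space pool used for event-queue management on the Alpha cluster, described above, can be sketched as follows; the node layout and function names are hypothetical and only illustrate the reuse idea.

```c
/* Sketch of a free space pool: dequeued event nodes are kept on a free list
 * and reused by later enqueues, so malloc() is called only when the pool is
 * empty.  Field names are illustrative. */
#include <stdlib.h>

typedef struct event_node {
    long               timestamp;
    int                value;            /* signal value carried by the event */
    struct event_node *next;
} event_node;

static event_node *free_pool = NULL;     /* released nodes waiting for reuse  */

event_node *alloc_event(void)
{
    event_node *e;
    if (free_pool != NULL) {             /* reuse a previously released node  */
        e = free_pool;
        free_pool = e->next;
    } else {
        e = malloc(sizeof *e);           /* fall back to malloc() when empty  */
    }
    return e;
}

void release_event(event_node *e)        /* called on dequeue instead of free() */
{
    e->next   = free_pool;
    free_pool = e;
}
```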
Convinced that a cluster of Alpha workstations is a feasible target machine, we perform, in the next section, a set of experiments based on this machine to evaluate the analytical formula derived in Section 3.2.

3.4 The Results

This section presents the analytical and experimental results of logic simulation with the optimistic protocol on the same cluster of six DEC Alpha workstations employed in the previous section. Several large circuits from the ISCAS '89 benchmark suite are used as the benchmark circuits. Analytical results are generated by the formula in Section 3.2 and compared to the experimental results.

3.4.1 Workload Conservation Property

In Section 3.2, we used the workload conservation property to compute the number of simulation cycles. The workload conservation property states that the number of executed events when each processor has one logical process and at most one event is executed per logical process, is equal to the number of executed events when a processor has many logical processes and executes at most one event from each logical process. To support the workload conservation property, we show experimental results in two figures. During the simulation, the total number of executed events is counted while varying the values of several parameters. The S13207 circuit with 100 input event vectors was used. Figure 3.2 (a) shows the workload at P = 5 for varying k values, and Figure 3.2 (b) presents the workload with a fixed k of 100, varying P values up to 5. Both figures show virtually the same values of the total number of executed events for all r and k, and thus support the workload conservation property.

[Figure 3.2: (a) The Workload Varying k When P = 5  (b) The Workload Varying P When k = 100]

3.4.2 Comparison of Experimental and Analytical Results

This subsection verifies the analysis in Section 3.2 by comparing experimental results on a cluster of Alpha workstations to the analytical results. The total execution time and its related components are compared for the S13207 circuit with 100 input event vectors. The number of processors, P, is 5. The S13207 circuit has 11689 gates: N is 11689. Since the 11689 gates are assigned to five workstations evenly, the LP ratio, r, is 2338. Both the interval of input events and the clock interval in a D f/f are 200. The computation grain size varies from 1 to r.

In order to derive the performance of E(m_max) and E(m) from Formulas 3.4 and 3.7, the distribution of n must be known, where n is a random variable for the number of candidate processes on a processor in a simulation cycle. The distribution of n may be determined by the characteristics of circuits and underlying simulation protocols. The binomial distribution has been used in [2] and [29] with the assumption that the gate activations at each simulation cycle are independent. Parallel simulations of real circuits will probably make it difficult to satisfy this assumption. Instead, we used the distribution of n that came from our experiments; during the simulation, the number of executed gates at each simulation cycle on all processors was counted to compute the p.d.f. and c.d.f. of n. Based on the measured c.d.f. of n, E(m) and E(m_max) can be computed by Formulas 3.7 and 3.4, respectively.
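The empirical distribution of n described above can be accumulated with a simple tally. The sketch below assumes the per-cycle counts of executed gates have already been gathered into an array; the names and array layout are hypothetical.

```c
/* Sketch: building the empirical p.d.f. and c.d.f. of n from the number of
 * executed gates counted at each simulation cycle.  counts[i] is the count
 * observed in cycle i; n is assumed to range over 0 .. max_n. */
void empirical_distribution(const int *counts, int num_cycles, int max_n,
                            double *pdf, double *cdf)
{
    for (int x = 0; x <= max_n; x++)
        pdf[x] = 0.0;
    for (int i = 0; i < num_cycles; i++)
        pdf[counts[i]] += 1.0 / num_cycles;   /* relative frequency of n = counts[i] */

    double running = 0.0;
    for (int x = 0; x <= max_n; x++) {
        running += pdf[x];
        cdf[x] = running;                     /* F_n(x) = Pr[n <= x] */
    }
}
```

The resulting c.d.f. is then fed into Formulas 3.4 and 3.7.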
Figure 3.3 shows the experimental and analytical results of E(m) when k varies. As k increases, E(m) increases until it reaches a bound. Once it reaches the bound, the value of E(m) is steady, even as k increases. The bound of the curves is delimited by the number of available candidate logical processes in a processor. Although the value of k may be large enough to allow more processes to be executed in a simulation cycle, since there are not enough candidate processes in the processor, the number of executed events is delimited by the number of available candidate processes. Since E(m_max) shows a similar trend to E(m), in that both curves increase as k increases and have upper bounds with steady performance, we omit the figure of E(m_max). The CDF curve in the figure uses the F_n(x) obtained from experiments by measuring the number of executed processes on processors at each simulation cycle and computing the p.d.f. and c.d.f. of n. Figure 3.3 also presents the curves of E(m) that are computed by the binomial distribution and the exponential distribution of n. The result of the exponential distribution is closer to the experimental one than that of the binomial distribution, but both are still far from the experimental result.

[Figure 3.3: The E(m) vs. k Varying from 0 to the LP Ratio]

The number of simulation cycles, S^k_r, can be obtained from the values of E(m) in Figure 3.3 by using Theorem 2. Figure 3.4 (a) shows the experimental results of S^k_r compared to the analytical results as a function of k. The dotted lines represent the analytical values, and the solid lines represent the experimental values. To obtain some parameter values of Theorem 2, we simulated the same circuit using a sequential simulator of Time Warp with LP ratio 1, where the sequential simulator is a simulator performing the original theme of Time Warp on a single processor and is not operated by a synchronous protocol. In the sequential simulator, thus, all available candidate logical processes are executed. Because there is no machine having as many processors as there are logical processes of a huge VLSI circuit, we measured S^1_1 by presuming to assign one logical process to a processor by means of the sequential simulator of Time Warp. From the sequential simulation, the value 222 is obtained for S^1_1, 0.021457 for the activation probability a, and 1.286383 for the average fanout of a gate, w.

Since S^1_1 is obtained under the condition of the original Time Warp when the LP ratio is 1, the parallelism of the simulation is maximal. That is, all candidate processes are executed. As shown in Figure 3.4 (a), the steady bottom line of S^k_r has the same value as S^1_1, the lower bound of the number of simulation cycles. This is because a large k allows all candidate processes to be executed and follows the condition of the original Time Warp. In contrast, as k becomes small, processors postpone execution of some candidate processes to the next cycle due to the limitation of computation grain sizes. This increases the number of simulation cycles as k decreases.

The experimental and analytical results of T^k are shown in Figure 3.4 (b) as a function of k. Seconds have been used as the unit of measurement.
When k is near 1, T^k becomes very high because the time spent for frequent communications and barrier synchronizations dominates the total execution time. This implies that very small computation grain sizes deteriorate the simulation performance significantly when the communication time is relatively large compared to the computation time. In contrast, on SIMD machines, which provide small communication and barrier synchronization times, the simulation performance may not be affected greatly even by small computation grain sizes when the LP ratio is larger than 1. In Figure 3.4 (b), when k is around 100, the simulation shows optimal performance. It is shown that good performance can be achieved on distributed systems by using at least a certain amount of computation grain size. When k reaches approximately 300, the simulation reaches a steady performance. This result implies that when k is larger than the optimal size, the variation of execution times among processors within a simulation cycle is large. The steady curve after k = 300 is delimited by the number of candidate processes in a simulation cycle. The analytical curve is shifted somewhat to the left compared to the curve of the experimental result. This is caused by the differences in the curves of S^k_r shown in Figure 3.4 (a). By adjusting the maximum computation grain size, the unbalanced load among processors can be adjusted so that idle times of processors can be reduced and simulation performance can be improved.

[Figure 3.4: Comparisons of (a) S^k_r and (b) T^k]

[Figure 3.5: Experimental Total Execution Times with S13207 Circuit]

3.4.3 Effects of Computation Granularity on Distributed Systems

In the previous subsection, we have studied how performance can be affected by adjusting computation granularity for a given circuit when the LP ratio is fixed. Even on the same distributed system, however, the effects of computation granularity may vary according to machine-independent factors. By varying the number of processors, we have performed similar experiments to examine the performance effects of computation granularity using several circuits from the ISCAS '89 benchmark suite.

Figures 3.5 and 3.6 show T^k for circuits S13207 and S38417, respectively, when the number of processors varies from 3 to 5. The S13207 circuit has 11689 gates and the S38417 circuit has 30367 gates. For circuit S38417, both the interval of input events and the clock interval of a D f/f are 500.

[Figure 3.6: Experimental Total Execution Times with S38417 Circuit]

From the experimental results, some aspects of parallel discrete event simulation affected by machine-independent factors, such as circuit characteristics, can be observed. The first observation is that a certain range of computation grain size can produce better performance and that the locations of the optimal grain size within the range differ between those circuits.
Circuit S13207 shows better performance when k is between 50 and 300, while S38417 shows better performance between 30 and 500. The range of optimal computation grain sizes for S13207 is smooth and the optimal point is located in the middle of the range, while S38417 has a sharp valley of an optimal range and its optimal point is at the beginning. The next observation is that the amount of performance improvement affected by computation granularity varies from circuit to circuit. Through Figures 3.5 and 3.6, we can observe the amount of improvement in T^k obtained by adjusting the maximum computation grain size. Table 3.3 gives more detailed information in this respect for four different circuits.

     P    k                S13207    S35932    S38417    S38584
     2    LP ratio          286.28   8493.81   1501.87   5175.72
          Optimal           239.30   7153.13    731.59   4432.14
          Improved Rate       1.20      1.19      2.05      1.17
     3    LP ratio          147.81   3992.81    698.05   2829.21
          Optimal           131.96   3539.53    361.57   2469.10
          Improved Rate       1.12      1.13      1.93      1.15
     4    LP ratio           98.15   2281.65    413.34   1921.02
          Optimal            71.15   1965.55    244.72   1672.98
          Improved Rate       1.38      1.16      1.69      1.15
     5    LP ratio           61.77   1459.43    284.52   1244.56
          Optimal            45.21   1326.58    198.35    995.01
          Improved Rate       1.37      1.10      1.43      1.25

    Table 3.3: Experimental Results on a Cluster of Alpha Workstations

The total execution time when the value of k is equal to the LP ratio is compared to that when the value of k is equal to the optimal size that yields the minimum total execution time. The Improved Rate is computed as

    Improved Rate = (T^k when k = r) / (T^k when k = optimal).

The Improved Rate shows the performance improvement when the computation granularity is appropriately adjusted, compared to the results of the original Time Warp. Depending on the circuit, the experimental results show various Improved Rates of up to 2.05. Even on the same circuit, the Improved Rate varies as P increases.

3.5 Summary

Computation granularity is one of the adjustable variables for improving the performance of parallel discrete event simulation when the number of logical processes of the application problem is larger than the number of physical processors on parallel and distributed systems. In this chapter, we have studied the effect of computation granularity on the performance of the optimistic protocol on distributed systems for benchmark circuits by using an analytical model. We summarize the contributions made by this research. First, analytical formulas have been developed to predict several performance metrics, such as the number of simulation cycles and the total execution time, for given machine-dependent and machine-independent parameters. Analytical results derived from the formulas have been verified against experimental results on the distributed system. Second, the architectural characteristics of a cluster of DEC Alpha workstations were compared to those of the MasPar MP-2 with respect to parallel discrete event simulation. Finally, the effects of computation grain size on distributed systems have been observed with various machine-independent factors, such as the number of processors and different circuits. The experimental results indicated that the effects of computation granularity on the performance of distributed logic simulation vary depending on the characteristics of circuits. According to the results, by adjusting the computation grain size, the improved rate of the total execution time was increased up to 2.05.
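As a quick arithmetic check of the Improved Rate definition, the largest value quoted above can be reproduced from the entries of Table 3.3 (circuit S38417, P = 2):

```latex
\text{Improved Rate} \;=\; \frac{T^{k}\big|_{k=r}}{T^{k}\big|_{k=\text{optimal}}}
\;=\; \frac{1501.87}{731.59} \;\approx\; 2.05
```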
Chapter 4

Experimental Study of the Effects of Computation Granularity on the Performance of an Optimistic Protocol on the IBM SP2

In order to focus on the major effects of computation granularity on the performance of Time Warp, an experimental study of computation granularity is conducted by performing a set of experiments on the IBM SP2. The performance aspects considered in this study are execution activity, rollback situations, communication overhead, and simulation progress. Execution activity shows the amount of actual event computation between communication points, which is compared to the amount of possible event computation on logical processes. Rollback situations are represented by two performance metrics: a rollback frequency and a rollback rate. To study communication overhead under nonblocking message communication, the computation fraction of the total execution time of the simulation is used as the performance metric. Simulation progress shows the overall progression of simulation status until the termination condition is satisfied. Also, the speedup of parallel processing on Time Warp is presented as a function of the number of processors.

Among the many parallel architectural platforms, we consider multicomputer systems as prominent ones for optimistic simulation of VLSI circuits, which requires frequent communication and a large local memory space. The target machine is the IBM SP2. It provides the High-Performance Switch as its communication subsystem [1, 83, 81] and can scale to a couple hundred or a thousand SP nodes according to the configuration. Each SP node uses a RISC System/6000 processor and can address up to 512MB or 2048MB of local memory. With the topology of a bidirectional MIN, buffered wormhole routing is used for interprocessor communication.

The following section describes the model and implementation by explaining the employed schemes and design structures of the optimistic protocol. Section 4.2 describes the experimental results on the SP2 and discusses the effects of computation granularity. The performance results are compared to those of a synchronous protocol in Section 4.3. The concluding remarks are presented in Section 4.4.

4.1 The Simulation Model

In this section, we present how the optimistic protocol is implemented by showing the employed schemes, design strategies, data structures, and characteristics of the application problem. The optimistic protocol is implemented in a fashion of event-driven activation: incoming events in the event queues activate logical processes, and the logical processes are executed with their events by processors according to the protocol strategies.

4.1.1 Logical Process Allocation

The number of logical processes in large application problems is greater than the number of processors on parallel and distributed systems. Partitioning of logical processes onto processors can therefore greatly affect VLSI simulations, reducing communication overhead by placing frequently communicating logical processes onto the same processor, and adjusting the balance of computation among processors. Bailey et al. wrote an excellent survey [7] of parallel logic simulation, in which they explained and compared several partitioning schemes. Most of the partitioning techniques discussed in their paper take linear time in the size of the circuit.
They questioned whether, if it takes longer to partition a circuit than to simulate it, reducing only the simulation time could really accelerate the overall simulation performance. In this chapter, in order to emphasize the effects of computation granularity, we use a random partitioning scheme rather than any sophisticated partitioning algorithm. Logical processes are randomly partitioned among processors and evenly assigned to processors with the same LP ratio.

4.1.2 Simulation Cycle

Each processor repeats computation of events and communication with other processors until the simulation termination condition is satisfied. Each cycle in this series of repeated cycles of computation and communication is called a simulation cycle. A simulation cycle is composed of several processing steps: LVT computation of logical processes, event scheduling, event computation, event propagation, GVT computation, and fossil collection if necessary. Because each processor is allowed to proceed with its own work by repeating the simulation cycle, the simulation cycles of processors advance asynchronously.

The number of simulation cycles is one of the key simulation performance metrics. A large number of simulation cycles represents a long series of work required to complete the simulation. However, it is hard to say that a small number of simulation cycles always yields the fastest performance, because the amount of computation in a simulation cycle also determines performance speed. We may have better performance results if the number of simulation cycles is not too large and each cycle has an appropriate amount of computation. This is because too many simulation cycles with very small amounts of computation incur extra cost due to the repetition. Thus, the number of simulation cycles is a criterion of simulation performance that shows the simulation progress toward completing the required computation amount for the whole simulation.

4.1.3 Event Scheduling

When each processor is assigned one logical process, i.e., the LP ratio is 1, event scheduling is straightforward: the smallest unprocessed event of the logical process is executed first. When the LP ratio is larger than one and computation granularity is applied, processors have to decide which logical processes' events are executed first and how many events are executed before communication. To make the simulation progress rapidly, processors need to execute not only a few small events whose timestamps are near the GVT, but also more events that can proceed optimistically.

After communication in a simulation cycle, processors may activate logical processes that have newly arriving events or unprocessed events. Since not all logical processes having unprocessed events are actually activated in our model, we use two different terms to describe a logical process' status: a candidate logical process and an active logical process. Candidate logical processes are those that have unprocessed events, whereas active logical processes are those whose events are executed. In the original Time Warp, all candidate logical processes should execute their smallest unprocessed events, so all candidate logical processes become active logical processes. In our model, some candidate logical processes become active under the control of computation granularity, and the other candidate logical processes are postponed to be considered in the next simulation cycle.
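Before turning to the concrete implementation, the selection of active logical processes under the MCGS limit can be sketched as follows in C. The heap routines and the event-execution call are hypothetical placeholders, not the thesis code.

```c
/* Sketch: activate at most MCGS candidate logical processes per simulation
 * cycle, smallest LVT first.  heap_pop_min() and execute_smallest_event()
 * are assumed to exist in the simulator. */
typedef struct { int lp_index; long lvt; } sched_item;

extern int  heap_pop_min(sched_item *out);      /* returns 0 when the heap is empty */
extern void execute_smallest_event(int lp_index);

void run_computation_phase(int mcgs)
{
    sched_item item;
    int activated = 0;

    while (activated < mcgs && heap_pop_min(&item)) {
        execute_smallest_event(item.lp_index);  /* candidate becomes active */
        activated++;
    }
    /* candidates still in the heap are postponed to the next simulation cycle */
}
```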
In our implementation of a simulation cycle, each processor keeps all the candidate logical processes in the central scheduling priority queue, built from the LVTs of the candidate logical processes and the index numbers of the logical processes. Up to the limit of the maximum computation grain size (MCGS), candidate logical processes are selected from the scheduling heap and become the active logical processes. The smallest unprocessed event in each active logical process is then executed, and the generated output events are prepared for propagation.

4.1.4 Data Structure of Event Queues

In PDES, logical processes contain events that have been generated or received but not yet executed; these events are pending in the input queues. Each processor schedules the pending events according to the event scheduling algorithm. The design of the event queue structure has an important influence on the simulation performance.

Individually Distributed Event Queues

In the model, each logical process has its own event queue, in which all unprocessed events are pending and the previously processed events are stored as well. This type of implementation, with an individual event queue per logical process, is better than one central event queue shared by all logical processes in a processor, especially when the LP ratio is large. If all logical processes shared one event queue within a processor, it would take a long time to search and manipulate the event queue for rolled back events according to the logical process's index number. That is, an individual queue can reduce the queue manipulation time when a straggler arrives, by eliminating the corresponding heavy event moving and queue handling in rollback situations. Using one event queue per logical process also makes it easy to migrate a logical process to another processor in a dynamic load management situation.

Input Event Queues

In our implementation of a logical process' event queue, all events are sorted in increasing order by timestamp. Each event queue has three pointers: the bottom, which points to the very first event; the front, which points to the first unprocessed event; and the rear, which points to the last event. After executing the event at the front, the front pointer moves to the next unprocessed event. The processed event stays in the queue until a fossil collection is performed. The processed event may be rolled back by stragglers. Instead of using separate output event queues, using a single event queue that serves as both input and output event queue can reduce the queue manipulation time when a straggler arrives, by eliminating the corresponding event moving and queue handling from output event queues.

Central Priority Queue

The event scheduling algorithm generally follows the 'smallest-timestamp-first' selection scheme. Since each logical process has an individually distributed event queue, a processor has to schedule the assigned logical processes to select the smallest timestamped event first. We use a central priority queue structure for efficient event scheduling in a processor. Several priority queue structures have been proposed, such as a heap with an array structure, a calendar queue, and a skew heap. We used the heap structure for the priority queue. Each item in the priority queue has its logical process's index number and its LVT (the timestamp of the first unprocessed event) as the key to be sorted.
Only candidate logical processes that receive events and have unprocessed events are enqueued into the scheduling queue with their LVT values. If the logical process's index number is already in the scheduling queue, only the LVT value is changed and the heap structure is left untouched. This happens when a straggler arrives with a smaller LVT value. From the scheduling heap, each processor dequeues the smallest item and executes the event on the corresponding logical process. When computation granularity is applied to event scheduling, the scheduling heap is kept after event computation and communication, and is re-organized with newly received events.

4.1.5 GVT Computation

In the optimistic protocol, the GVT is used to perform fossil collections and to check the termination condition of the simulation. Since logical processes can advance their virtual time asynchronously by executing events, the GVT does not have to be computed whenever processors communicate with one another. Thus, the GVT computation in our model is implemented asynchronously in order to keep the benefit of asynchronous processing of processors. We use a token passing mechanism to collect logical processes' virtual times asynchronously from processors and compute the GVT value. In this scheme, the master processor sends a token to its destination processor, receives the circulated token from its source processor, and computes the GVT by using the collected minimum local clocks of logical processes from the processors. The token is passed over the processors in a designated logical order. There are two values in the token: the previous GVT value, which had been computed with the information collected during the previous cycle, and the current GVT value. Each processor maintains the minimum LVT among its assigned logical processes, the Processor Virtual Time or PVT. Whenever a processor receives the GVT token, it takes the stored GVT value from the token and writes its PVT value to the token: if the on-going GVT on the token is larger than the PVT, the PVT value becomes the on-going GVT value, and the token carries the information to the next processor. The advantage of this scheme is that it is not necessary to send acknowledgments to make sure in-transit messages are received for computing a correct GVT. Also, this scheme keeps simulation cycles asynchronous by allowing each processor to proceed on its own without any barrier synchronization. Detailed token passing algorithms are presented by Baldwin et al. [8].
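The per-processor token handling step of the GVT scheme described above might look like the following sketch; the structure layout and function name are assumptions made for illustration.

```c
/* Sketch: handling the GVT token on one processor.  The token carries the
 * GVT finished in the previous token cycle and the running minimum of PVTs
 * collected in the current cycle. */
typedef struct {
    long previous_gvt;   /* GVT computed during the previous token cycle      */
    long ongoing_gvt;    /* running minimum of PVTs seen in the current cycle */
} gvt_token;

/* pvt is this processor's minimum LVT over its assigned logical processes.
 * The caller forwards the token to the next processor in the logical order. */
long handle_gvt_token(gvt_token *token, long pvt)
{
    long new_gvt = token->previous_gvt;  /* adopt the GVT finished last cycle */

    if (token->ongoing_gvt > pvt)        /* fold this processor's PVT into    */
        token->ongoing_gvt = pvt;        /* the on-going minimum              */

    return new_gvt;
}
```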
Because the GVT is the smallest timestamp that cannot be rolled back by a straggler, processed events whose timestamp is smaller than the current GVT can be removed. The important consideration on fossil collection is when and how often to perform it during the simulation. Frequent fossil collections keep the used memory at a minimum, but the processing time can dominate the simulation time. Rare fossil collections can cause a shortage of memory in a processor. In our implementation, we apply fossil collection to each event queue when the queue contains more events than the maximum allowed queue size. Because there may be a situation in which only unprocessed events are in an event queue and reach the maximum allowed queue size, a threshold value is used to check that the gap between the first event in the event queue and the GVT value is large enough to perform fossil collection. 4.2 Experimental Results With several sets of experiments, we observe the performance of the optimistic pro— tocol on the SP2. In this section, the simulation environment and results of our ex- periments are presented. The simulation environment includes the employed features of the target machine, benchmark circuits, and message passing communication. The 87 general trend of the major performance results, the total execution time, is presented in Section 4.2.2 as a function of the MCGS. To understand the effects of computation granularity, the performance metrics of the optimistic protocol have been measured and analyzed in Section 4.2.3 from several perspectives: execution activity, rollback situation, communication overhead, and simulation progress. In Section 4.2.4, the parallel speedup is studied when the number of processors varies. The effect of LP ratio is observed in Section 4.2.5. 4.2.1 The Environment The Target Machine The target machine is the IBM SP2. On the SP2, it is possible to have up to 512 or a thousand processors, up to 512MB of local memory for thin nodes up to 2048MB for wide nodes; each node has a RISC System/ 6000 processor. The SP2’s High-Performance Switch provides a fast communication environment: the High— Performance Switch provides a peak bandwidth of 40 MB per second. The MH- PCC [54] SP2 that has been used for our experiments is composed of about 400 SP nodes which comprise 30 frames. Each frame contains either 8 wide nodes or 16 thin nodes, and has a 16-way switch board. Among the nodes, the ones used for our experiments were the thin nodes which had 64MB of local memory. Between any two nodes, the High-Performance Switch [82] provides a packet—switching, multistage Omega network with buffered wormhole routing for interprocessor communication. The network has the topology of a bidirectional MIN. The SP2 divides messages into 88 packets, each Of which divides into 8-bit flow control digits (flits). That is, nodes send messages to other nodes by breaking messages into packets and injecting the packets into the network. Each packet contains routing information that is examined by switching elements to correctly forward the packet to its destination, where flow control is performed by a flit. Job scheduling on the SP2 is governed by a job sched- uler called loadleveler. Using the loadleveler, a job can be simulated on a number of SP nodes dedicated to the job without sharing processors with other jobs. How- ever, there is some performance degradation due to delays and contentions on the communication network, as it is shared with other jobs on other processors. 
Benchmark Circuits

The benchmark used was the ISCAS '89 sequential circuit suite [15]. For our experiments, we use the three biggest circuits from the suite: S38584, S38417, and S35932. The numbers of gates in the circuits are 20717, 23843, and 17828, respectively. The gate types of the sequential circuits are AND, NAND, OR, NOR, inverter, and D flip-flop. The clocking interval between two input event vectors is 200. The clocking interval of a D f/f is also 200. Each gate has a unit delay.

Message Communication

We use the Message Passing Interface (MPI) [88] for message passing between processors on the SP2. The details of communication performance on the SP2 with MPI have been studied in [90, 55]. In our model, to reduce communication overhead, we use MPI_Isend(), which is a nonblocking send routine for asynchronous message passing. Thus, computation can be overlapped with the transmission time in communication, from the point of issuing a nonblocking send to the checkpoint of the completeness of the issued nonblocking send. As the receive routine, the blocking receive routine MPI_Recv() is used.

For message communication between processors, two methods have been considered. The first one is the all-to-all communication pattern. That is, during the communication of a simulation cycle, each processor sends messages to all the processors except itself. This all-to-all communication pattern injects many messages into the network, even when a processor does not need to send messages to other processors. The second method is the some-to-some communication pattern, in which a processor sends messages to some processors only if they are the destinations of the messages. According to our experiments with tens of processors on the SP2, the all-to-all communication pattern shows better performance than the some-to-some pattern. This is because there is a high possibility for a destination processor to receive the sent messages in the next simulation cycle, not in the same simulation cycle. Since a processor does not know how many messages have been sent to it from other processors, the processor may give up waiting for more messages after repeating the procedure of probing for message arrival, receiving and handling the messages, and waiting. The late messages may be buffered on the receiver side, but not received by the receiver processor until the next cycle; thus, this may slow the overall simulation progress. Since the SP2 provides fast communication, the overhead of sending extra messages in the all-to-all communication pattern is less than the overhead of the delay for the completion of sent messages in the some-to-some communication pattern.
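A minimal sketch of the nonblocking exchange described above is given below; the buffer contents, counts, and message tag are illustrative assumptions, not the thesis code.

```c
/* Sketch: overlap event computation with message transmission using
 * MPI_Isend / MPI_Recv / MPI_Wait. */
#include <mpi.h>

#define EVENT_TAG 1   /* arbitrary tag for event messages (assumption) */

void exchange_events(int *send_buf, int send_count, int dest,
                     int *recv_buf, int recv_count, int src)
{
    MPI_Request req;
    MPI_Status  status;

    /* start the nonblocking send; computation may proceed while it is in flight */
    MPI_Isend(send_buf, send_count, MPI_INT, dest, EVENT_TAG, MPI_COMM_WORLD, &req);

    /* ... event computation overlapped with the transmission would go here ... */

    /* blocking receive of the events destined for this processor */
    MPI_Recv(recv_buf, recv_count, MPI_INT, src, EVENT_TAG, MPI_COMM_WORLD, &status);

    /* make sure the send buffer can be reused before the next cycle */
    MPI_Wait(&req, &status);
}
```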
Performance Metrics and Parameters

In order to study the effect of computation granularity in the optimistic protocol, we use the following performance metrics.

- Total execution time: the wall-clock time interval from the starting point of the simulation to the terminating point.
- Speedup: the ratio of the total execution time of a sequential program on one processor to the total execution time of the parallel program on multiple processors.
- Number of simulation cycles: the number of repeated cycles of computation and communication until the simulation completes.
- Computation time: the amount of computation time out of the total execution time.
- Computation fraction: the ratio of computation time to the total execution time.
- Number of executed events per cycle: the number of events executed by a processor in a simulation cycle.
- Activity rate: the ratio of the number of active logical processes to the LP ratio in a processor per simulation cycle.
- Actual activity rate: the ratio of the number of active logical processes to the number of candidate logical processes in a processor per simulation cycle.
- Rollback frequency: the ratio of the number of stragglers to the number of received events in the simulation.
- Rollback rate: the ratio of the number of rolled back events to the total number of computed events in the simulation.
- GVT jump: the increment of the GVT when the GVT value is re-computed in a token cycle.

In addition, the parameters used in the experiments are as follows:

- MCGS: the maximum computation grain size, that is, the maximum number of events between any two consecutive communication points.
- Number of PEs: the number of processors used to process the given application problem in parallel.

By varying the MCGS, we observe the effects of computation granularity on the performance results with the performance metrics above. The following subsections show the experimental results of the optimistic protocol on the SP2 when the number of processors is six. The major performance result is introduced first and compared to the related effects of the other performance metrics. Also, the speedup and the effects of the LP ratio are studied as the number of processors varies.

4.2.2 Total Execution Time

The total execution time is the wall-clock time interval from the starting point of the simulation to the point of termination. This major performance metric shows how fast the simulation can complete a given task. Figure 4.1 shows the total execution times to simulate three sequential circuits, S38584, S38417, and S35932, with 500 input event vectors on the SP2.

[Figure 4.1: Total Execution Time]

The right ending point of each curve in the figure shows the total execution time when the MCGS is the LP ratio of each circuit. In the case that the MCGS is equal to the LP ratio, each processor makes all candidate logical processes active, and their events are executed before communication. When the MCGS is the LP ratio, the simulation is performed according to the theme of the original Time Warp, where all candidate logical processes can execute one of their events and propagate the newly generated events to the associated logical processes in the simulation cycle. As shown in the figure, however, the performance is better at a smaller MCGS than at the value of the LP ratio. Rather than processing all possible logical processes, adjusting the number of active logical processes yields better performance. Thus, the graph of the total execution time shows that we can achieve better performance results by using the MCGS instead of applying the original Time Warp directly.

[Figure 4.2: Total Execution Time (Close-up Between 1 and 300)]

Each curve in Figure 4.1 has a steady performance as the MCGS becomes larger than a certain large value, that is, around 2000.
The performance remains steady up to the MCGS of the LP ratio. By using other experimental metrics in the following section, the steady performance at a large MCGS is explained in more detail. To observe the performance trend near the optimal MCGS, Figure 4.2 focuses on the left side of Figure 4.1. Each curve of a circuit has its own valley, with a different slope on either side of the valley, a different width, and a different optimal location of computation granularity depending on the circuit's characteristics. In order to understand the appearance of the curves more clearly, we interpret the performance results in depth with other performance metrics in the following section.

4.2.3 Analysis of Experimental Results

To explain the performance results of the total execution time, several other aspects of Time Warp are studied and explained by other performance metrics. These performance aspects are categorized into four parts. First, to observe the trend of event execution affected by computation granularity, execution activity is studied with the performance metrics of activity rate and actual activity rate. The second is the rollback situation. Since the inherent optimistic characteristic yields rollback situations, it is the main concern in the optimistic protocol. Rollback frequency and rollback rate are used as the performance metrics. Third, communication overhead is considered. Communication overhead is an unavoidable consideration for parallel processing on parallel and distributed systems. The computation fraction of the total execution time is used to estimate the communication fraction. The last aspect is the progress of the simulation. Because a given problem terminates when the required tasks are done, we consider how fast the simulation is able to proceed to reach termination. The number of simulation cycles and the GVT jump are the performance metrics. Therefore, the major aspects of performance analysis for the optimistic protocol on parallel and distributed systems are execution activity, rollback situation, communication overhead, and simulation progress.

Execution Activity

By observing the event execution of parallel processing, an overall trend and an estimated performance can be predicted. The amount of major computation on a processor is determined by how many events it executes. Although the optimistic protocol has other interesting concerns to study, we start with the study of the effect of execution activity on parallel processing.

[Figure 4.3: Activity Rate]

- Activity Rate

The activity rate is defined as the ratio of the number of active logical processes to the LP ratio in a processor per simulation cycle, where an active logical process is one that executes an event in a simulation cycle. The activity rate indicates how many gates are active to compute an unprocessed event in a simulation cycle on a processor. The rate depends on the number of processors used, the parameter value of the MCGS, and the characteristics of the circuit, such as the number of gates in the circuit and the connectivity configuration of the gates.

Figure 4.3 shows the average rate of activity on a processor as the MCGS changes. With the activity rate we can explain several aspects of the graph of total execution time shown in Figure 4.1.
The activity rate at the rightmost MCGS is the maximum value of each curve. This point represents the case in which a circuit is simulated by the original Time Warp protocol. A general trend we can recognize from the figure is that the activity rate decreases as the MCGS decreases to the left. This is because the MCGS limits the maximum number of active logical processes, each of which processes an event per simulation cycle, even if there are more candidate logical processes with unprocessed events.

An interesting phenomenon is that the curve of the activity rate becomes flat when the MCGS is between approximately 2000 and the LP ratio. This phenomenon implies that the number of active logical processes does not increase linearly with computation granularity at a large MCGS up to the amount of the LP ratio. This is because the number of active logical processes is decided by the characteristics of the circuit. Although the MCGS allows more logical processes to execute their events, there are not as many candidate events as the MCGS permits. This makes the number of executed events be delimited by the number of available candidate logical processes, not by the MCGS. Since the number of active logical processes is the same as the number of executed events in a simulation cycle, the activity rate determines the amount of execution time per simulation cycle.

According to Figure 4.3, the rightmost value of the activity rate, at the LP ratio, is an inherent characteristic of a circuit. Circuit S35932 has the highest activity rate at the LP ratio, S38584 is the next one, and the last one is S38417. Compared to Figure 4.2, the circuit having a higher activity rate has a wider valley of optimal computation granularity; the circuit with a low activity rate has a narrow valley. Because the activity rate shows the required amount of tasks on a circuit to complete the simulation, the circuit with a higher activity rate contains many unprocessed events, although those events are delimited by the MCGS. At the optimal value of the MCGS, even as the MCGS increases a little more, the performance of the circuit with the higher activity rate is not degraded, because executing more events fulfills the required event amount without causing a bad effect. Meanwhile, the circuit with a low activity rate has fewer events to execute. When the MCGS is larger than the optimal point, it has a greater likelihood of generating rollback situations, because the good events have already been executed and some immature events may be executed, degrading the performance.

- Actual Activity Rate

Another performance metric related to event activity is the actual activity rate. The actual activity rate is the ratio of the number of active logical processes to the number of candidate logical processes in a processor in a simulation cycle. This metric presents how many logical processes out of the candidate logical processes are actually active and executing events. In contrast to the activity rate, the actual activity rate can reach one, depending on the MCGS. When the actual activity rate is one, all candidate logical processes are allowed to be active and to execute an event in the simulation cycle.

[Figure 4.4: Actual Activity Rate]

Figure 4.4 shows the average actual activity rate per simulation cycle.
The starting point of reaching 1 is around 2000, which is the same as the starting point of the steady state in the curve of the activity rate in Figure 4.3 and that in the curve of the total execution time in Figure 4.1.

Rollback Situation

Rollback is an unavoidable situation in the optimistic protocol that is caused by optimistically sent immature events. Varying the MCGS, the performance effect of rollbacks and the performance trend are studied in this subsection. The experimental results of rollback frequency and rollback rate are used as the performance metrics of rollback.

[Figure 4.5: Rollback Frequency (Close-up Between 1 and 300)]

[Figure 4.6: Rollback Frequency]

- Rollback Frequency

A unique characteristic of the optimistic protocol is the overhead of rollback situations. Figures 4.5 and 4.6 show the rollback frequency varying with the MCGS. The rollback frequency is the ratio of the number of stragglers to the number of received events in the simulation. It shows the possibility of having a straggler when a logical process receives an incoming event. On the S38584 and S38417 circuits in Figure 4.5, the rollback frequency is very low, i.e., smaller than 0.03, when the MCGS is between 1 and 60. When the MCGS goes to around 80 or 100, there is a large jump in the rollback frequency for each of the two circuits; the rollback frequency greatly increases in the interval between 80 and 100. In contrast, S35932 has a higher rollback frequency than the other circuits and is almost steady until the MCGS reaches 100. It begins to jump at around 200. These large jumps in rollback frequency imply that many stragglers start to be generated out of the incoming events. This means that around the jump point the simulation performance is degraded due to increasing rollback situations. The degrading effect of rollback on the total execution time is shown in Figure 4.2: the execution times of S38584 and S38417 degrade after 100, and that of the S35932 circuit after 200. From the observation of the rollback frequency and the associated total execution time, we can say that by adjusting the number of stragglers the MCGS is a controlling factor of the degree of optimism.
While the rollback frequency is decided only by the number of stragglers, a straggler usually results in more than one rolled back event and requires extra processing for the queue adjustment over all input event queues of the logical process. The queue adjustment moves the front pointer of each event queue according to the straggler's timestamp, after the straggler is enqueued into the destined input-event queue.

• Rollback Rate

The rollback rate is the ratio of the number of rolled back events to the total number of computed events in the simulation. Compared to the rollback frequency, which is decided by the number of stragglers, the rollback rate relies on the number of events rolled back by the stragglers. Since a straggler usually yields more than one rolled back event, a rollback frequency of about 20% produces a rollback rate of 50% up to 90% in the simulation, as shown in Figure 4.7.

Figure 4.7: Rollback Rate

When a straggler arrives on an input port and is enqueued into the event queue, the other event queues in the logical process have to be adjusted to re-compute from the straggler's point. The adjusted and rolled back events in the other event queues are counted, as well as the rolled back events in the event queue where the straggler occurs. The rollback rate mainly depends on the characteristics of the circuits, especially the layout of connectivity among gates.

Communication Overhead

Another major issue related to the performance of the optimistic protocol on parallel and distributed systems is interprocessor communication overhead. In our implementation of interprocessor communication on the SP2, nonblocking message passing routines from MPI are used in order to improve performance by overlapping communication and computation. A nonblocking send routine initiates the message sending operation, but returns before the message is received by the receiver processor. A separate wait routine verifies that the send has completed and the send buffer can be reused. We use the nonblocking send MPI_Isend(), the blocking receive MPI_Recv(), and MPI_Wait() to complete the nonblocking send. It is impossible to measure the exact communication time for nonblocking communication because it operates together with computation and the transmission time is overlapped with the computation time. In order to measure the communication time roughly, the following strategies can be used:

1. Lower Bound Communication Time: To measure the lower bound communication time, only the time spent in the send and receive routines is measured. Each processor reads the wall-clock time just before the MPI_Isend() operations, and then reads the clock again after the send operations finish. In the same way, it reads the clock before the MPI_Recv() operations and reads it again after the receive operations finish. It thus measures the elapsed wall-clock times for send and receive operations. This lower bound communication time measures the time spent for message sending and receiving between invoking the MPI communication routines and returning from them, without considering any transmission between processors. It contains no computation time.

2. Upper Bound Communication Time: The upper bound communication time is the time spent on communication until it is certain that the message sending is done.
Each processor reads the clock before a send operation and reads it again after making sure that all the sent messages have been received at the destination processor, and measures the elapsed wall-clock time. The upper bound communication time contains the computation time that falls between the MPI_Isend() operation and the MPI_Wait() operation. Since MPI_Wait() is usually called as late as possible, just before re-using the send buffer, the included computation time may take a larger portion than the communication-related time.

3. Approximate Communication Time: The approximate communication time is a roughly measured communication time. The method used is to compute (Total Execution Time − Total Computation Time): while measuring the total execution time, each processor measures the computation time for all computing steps in the simulation and subtracts the computation time from the total execution time. The event handling time in event queues is included in the computation time. According to our experiments, this communication time is the same as (Lower Bound Communication Time + Waiting Time), where the waiting time is measured just before and after the MPI_Wait() operation, gathering only the waiting time for the completion of the nonblocking message sending. However, since the computation time may overlap with the transmission time in communication, this communication time may be smaller than the exact time spent for interprocessor communication.

Figure 4.8: Communication Time Varying the Number of PEs

Figure 4.8 shows the experimental results of the several communication measurements explained above. The difference between the lower bound and approximate communication times is small; the waiting time for making sure of the completeness of sending is not a large portion. Those two measured communication times generally increase as the number of processors increases, paying more communication overhead. In contrast, since the upper bound communication time contains some portion of computation, it is high when a small number of processors are used, becomes low as the number of processors increases, and becomes high again when more than enough processors are used, due to communication overhead. Figure 4.8 also presents the computation time as well as the communication time. As the number of processors increases, the amount of computation per processor decreases because the tasks are distributed over more processors, so the computation time decreases. At 12 PEs, the computation and upper bound communication times cross over; with more than 12 PEs, the computation time is smaller than the upper bound communication time.

  No. of PEs  MCGS   Lower Bound        Upper    Approximation            Computation
                     Sending  Low B.    Bound    Tot-Comp  Low B.+Wait    Time
  P=2          85     2.68    12.50    120.34     13.94     13.29         207.34
  P=4          85     3.62    14.52     76.19     15.77     15.39         118.61
  P=6          60     4.39    11.71     53.87     13.05     12.70          79.64
  P=8          45     5.32    14.95     48.03     16.48     16.12          59.77
  P=10         35     6.60    16.51     43.51     18.26     17.91          48.06
  P=12         35     7.57    18.06     43.47     19.88     19.55          44.88
  P=14         30     7.77    19.00     41.67     21.01     20.68          38.94
  P=16         20    10.15    24.45     43.88     26.92     26.56          30.73

Table 4.1: Comparisons of Communication Measurements
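The following C fragment is a minimal sketch of how the lower bound and the waiting time of the approximate measurement could be collected with MPI_Wtime() around the nonblocking calls. Buffer sizes, tags, and the surrounding simulation loop are placeholders; this is not the dissertation's instrumentation code.

#include <mpi.h>

double t_send = 0.0, t_recv = 0.0, t_wait = 0.0;   /* accumulated per processor */

void exchange_events(void *out, int out_n, int dest,
                     void *in, int in_n, int src, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Status  status;
    double t0;

    t0 = MPI_Wtime();                                  /* lower bound: send side    */
    MPI_Isend(out, out_n, MPI_BYTE, dest, 0, comm, &req);
    t_send += MPI_Wtime() - t0;

    t0 = MPI_Wtime();                                  /* lower bound: receive side */
    MPI_Recv(in, in_n, MPI_BYTE, src, 0, comm, &status);
    t_recv += MPI_Wtime() - t0;

    /* ... event computation overlapped with the outstanding send ... */

    t0 = MPI_Wtime();                                  /* waiting time for the Isend */
    MPI_Wait(&req, &status);
    t_wait += MPI_Wtime() - t0;
}

/* lower bound  = t_send + t_recv
 * approximate  = t_send + t_recv + t_wait  (observed to match total − computation) */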
This crossover means that although the computation time per processor is small, due to the small number of computed events per processor, communication becomes the dominant factor. Table 4.1 shows the detailed measured results of Figure 4.8, giving the values of the lower bound, the upper bound, and the two approximate communication times as the number of processors changes. In the lower bound communication, the time spent for sending is smaller than the time for receiving because the nonblocking send operation does not wait for the arrival of messages. The two approximate communication times, (Total Execution Time − Total Computation Time) and (Lower Bound Communication Time + Waiting Time), are almost the same. In this measurement, the program includes the performance measuring statements, so the execution time is large compared to the total execution time in Table 4.2, which is measured with only the core program, without the measuring statements, to compute the speedup of parallel processing.

To remove the effect of the computation amount on the upper bound communication time and to account for the effect of nonblocking message passing, we use the third measurement method, the approximate communication time, in the following performance metrics of communication overhead.

• Computation Fraction

Figure 4.9: Computation Fraction

Figure 4.9 shows the computation fraction of the total execution time varying with the value of the MCGS. The computation is composed of computing LVTs, checking the D f/f clock, maintaining the central priority queue, executing events, processing received events, manipulating event queues, computing the GVT, and performing fossil collection. Based on the computation fraction, the communication fraction can be roughly estimated as the ratio of the roughly measured communication time to the total execution time, where the communication time is derived from (Total Execution Time − Total Computation Time).

When the MCGS is very small, the computation fraction is small; only a small number of events are executed in a cycle per processor while yielding frequent interprocessor communication. Thus, the communication overhead dominates most of the total execution time, since only a small number of events are computed between frequent communications. As the MCGS increases a small amount, the computation fraction increases rapidly, since the number of executed events increases accordingly and the time spent for communication is balanced with computation. When the MCGS is large enough, the communication overhead is no longer the major factor of the total execution time; the computation time due to rollback situations is the major factor.
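In compact form (our notation, not the dissertation's), the fraction estimates used in this subsection are:

    computation fraction   f_comp  = T_comp / T_total
    communication fraction f_comm ~= (T_total - T_comp) / T_total = 1 - f_comp
    rollback fraction      f_rb   ~= f_comp * (rollback rate)

where T_total is the total execution time, T_comp the measured total computation time, and the rollback fraction is the approximation introduced in the next paragraphs.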
Therefore, the communication overhead mainly applies to small MCGS values, especially on the left side of the optimal computation granularity.

Figure 4.9 also shows the rollback fraction in dotted lines varying with the MCGS, compared to the computation fraction. The rollback fraction is the ratio of the approximate rollback handling time to the total execution time, derived by multiplying the computation fraction by the rollback rate. This approximation shows the computation amount spent on rolled back events out of the overall event computation. At a large MCGS, the rollback fraction accounts for a large share of the computation fraction. A 90% computation fraction may suggest that it is possible to improve the performance by increasing the number of processors. However, as shown in the figure, within that 90% computation fraction the rollback fraction is a large portion. When more processors than the appropriate number are used, the performance does not increase, because processors will perform unnecessary event processing, resulting in more rollback situations. Thus, using more processors is not always the answer.

Figure 4.10: Computation Fraction (Close-up Between 1 and 300)

Figure 4.10 focuses on the left side of Figure 4.9, and is also able to explain the optimal valley in Figure 4.2. Figure 4.10 shows the higher communication overhead through the computation fraction between MCGS values 1 and 30 in solid lines. It also shows a heavier rollback situation through the rollback rate in dotted lines, starting from an MCGS of 100 on the S38584 and S38417 circuits and from 200 on the S35932 circuit, respectively. The area between those two points, that is, between the point of reaching the steady computation fraction around an MCGS of 30 and the point of increasing rollback situations, is the same as the optimal valley of total execution time in Figure 4.2. The area in between provides the optimal computation granularity because the high communication overhead is reduced and the overhead due to heavy rollback situations is not yet incurred. Thus, in this area the simulation mainly performs genuine event execution without being dominated by the high communication and rollback overheads.

• The Limit of Parallel Processing

Figure 4.11: Computation Fraction Varying the Number of PEs

Figure 4.11 shows the computation fraction according to the number of PEs. As the number of PEs increases, the computation fraction decreases, that is, the communication fraction increases. Thus, we can see that more PEs result in a greater communication overhead and a loss of the benefit of parallel processing performed by many PEs.

Figure 4.12: Comparison Between Total Execution Time and Computation Time

• Total Execution Time vs. Total Computation Time

To explain the effect of communication overhead on overall performance in more detail, Figure 4.12 compares the total execution time in solid lines to the computation time in dotted lines, varying with the MCGS. When the MCGS decreases from the LP ratio to the optimal point of execution time, both the total execution time and the computation time proceed in parallel, keeping a small gap equal to the communication time.
This small gap shows that the SP2 provides fast interprocessor communication and that nonblocking message passing does not cause great loss, since the communication overlaps with the computation of the processors. When the MCGS is between 1 and the optimal point, the trend is different: the gap between the total execution time and the computation time increases as the MCGS decreases. This bigger gap represents the overhead of communication; that is, at very small MCGS values processors spend more time on communication while doing only a small amount of computation between subsequent communications.

• Number of Executed Events per Cycle

Another interesting observation on the left side of the optimal point in Figure 4.12 is that the computation time still increases although very few events are actually computed and the communication cost is not included. In order to know how many events are executed per simulation cycle, Figure 4.13 shows the average number of executed events between communications on a processor. On the very left side, with a small MCGS, the number of executed events is very small due to the limitation of the MCGS. Although each processor works on a very small number of events at a small MCGS, Figure 4.12 shows that the computation time increases greatly even after accounting for the communication overhead. This is because of the effect of the simulation progress, which is presented next. If the simulation proceeds very slowly with few executed events, it yields a very large number of simulation cycles, degrading the total execution time.

Simulation Progress

Simulation progress represents the proceeding status until the simulation reaches the termination condition. Even when many events are executed per simulation cycle, it is difficult to say that the simulation will finish quickly. Simulation progress is thus different from the concept of execution activity. We introduce two performance metrics to show the simulation progress: one is the number of simulation cycles and the other is the GVT jump. The number of simulation cycles represents the number of repeated cycles needed to complete the given task, and the GVT jump shows the increment of the GVT whenever it is computed.

Figure 4.13: Number of Executed Events per Cycle in a Processor

• Number of Simulation Cycles

Figure 4.14 shows the total number of simulation cycles during the entire simulation until the task of a given circuit with a given set of input events is completed. By repeating a series of simulation cycles, the simulation reaches its termination condition. Thus, the number of simulation cycles represents the simulation progress, i.e., how many times each processor needs to repeat the computation and communication routine to finish its own work. When the MCGS is large enough to have a steady total execution time, as shown in Figure 4.1, the curve of simulation cycles is also in the steady state.
Figure 4.14: Number of Simulation Cycles

This is because almost all active logical processes are executed among the candidate logical processes and the number of executed events is almost the same. The benchmark circuits used have different steady-state values in the number of simulation cycles: S38584 starts at around 14000, S35932 at around 10000, and S38417 at around 8500. This is affected by the characteristics of the circuit, especially its critical path length. As the critical path length increases, in general, the number of simulation cycles increases. Another possible characteristic of a circuit is the circuit size, i.e., the number of gates in the circuit. According to the research in [4], circuit size has a much more complex relationship with circuit parallelism than a simple linear one. Thus, it is hard to determine the performance effect of circuit size.

In Figure 4.14, the starting points of the steady state occur at different places for each circuit and begin earlier than the optimal points of total execution time. The starting point is related to the starting point of heavy rollback situations. In Figure 4.5, the places having the largest jumps are approximately the positions where the steady curves of the number of simulation cycles start. This means that the MCGS allows more than enough logical processes to execute events, causing an increase of rolled back events. The number of simulation cycles keeps the same value up to the LP ratio while holding the steady performance, because more than enough event executions per cycle do not contribute to the simulation progress, but degrade the execution time. In contrast, as the MCGS becomes smaller than the steady point, the number of simulation cycles increases greatly. This is because the genuine events that do not generate rollback situations cannot be executed due to the very tight limitation of the MCGS, and the small number of executed events makes the simulation progress slow down, increasing the number of simulation cycles.

• GVT Jump

Another metric of simulation progress is the GVT jump. The GVT jump shows how much the GVT value increases whenever it is computed. Since we use an asynchronous GVT computation method which circulates a token around all processors, only the master computes the GVT value, after collecting the PVTs from the processors, and measures how much the GVT value jumped from the previous GVT value. Figure 4.15 shows the average GVT jump during the simulation on the master processor, varying the MCGS.

Figure 4.15: GVT Jump

The interesting observation is that the peak point in the curve of the GVT jump occurs around the optimal point of the total execution time. This means that at the optimal MCGS the simulation progress is best: good events are executed among the candidate logical processes, so that rollback situations do not degrade the simulation progress and there is no severe loss due to communication overhead. The appropriate MCGS encourages lagging logical processes to execute events and hinders advanced logical processes from doing so, keeping the overall simulation progress optimal.
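A minimal sketch of the master-side bookkeeping for this metric is shown below. It assumes the token has already gathered one PVT per processor and that the GVT is taken as their minimum, ignoring in-transit messages, which the real token scheme must also account for; the names are illustrative.

#include <float.h>

static double prev_gvt   = 0.0;
static double jump_sum   = 0.0;
static long   gvt_rounds = 0;

double master_compute_gvt(const double *pvt, int nproc)
{
    double gvt = DBL_MAX;
    for (int i = 0; i < nproc; i++)      /* GVT = minimum collected PVT         */
        if (pvt[i] < gvt)
            gvt = pvt[i];

    jump_sum += gvt - prev_gvt;          /* GVT jump since the previous round   */
    gvt_rounds++;
    prev_gvt = gvt;
    return gvt;
}

double average_gvt_jump(void)
{
    return gvt_rounds ? jump_sum / gvt_rounds : 0.0;
}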
The GVT jump also represents simulation progress, as does the number of simulation cycles. The S38417 circuit has the highest GVT jump value, S35932 has the next, and S38584 has the smallest. The higher the GVT jump, the smaller the number of simulation cycles.

4.2.4 Speedup

Speedup is the ratio of the time to execute an efficient serial program for a calculation on a uniprocessor to the time to execute a parallel program for the same calculation on N processors that are identical to the uniprocessor [80]. The serial program used in our experiment is a sequential program employing a synchronous protocol. The sequential program of the synchronous protocol always executes the smallest timestamped event first among the unprocessed events, following the concept of the synchronous protocol. When the sequential program simulates a large circuit with a large number of input events, the simulation may take a long time. To make the sequential program more efficient, a global common priority queue is used as the event queue for all logical processes instead of a distributed queue for each logical process, and an array heap is implemented to reduce pointer handling overhead.

Compared to the sequential program, the parallel program uses the optimistic protocol. As the data structure for the optimistic protocol, a linked list is used because it uses space efficiently in the heavy rollback situations that occur dynamically during the optimistic simulation. In Section 4.1, more details of the event queue implementation are presented. The difference in the event queue data structures of the optimistic protocol and the sequential program may produce some difference in performance. Thus, the speedup that we use in this chapter is the ratio of the total execution time of an efficient sequential program to the total execution time of the parallel program of the optimistic protocol.

In contrast to the sequential program, the optimistic protocol has many complicated features such as parallel processing, optimistic event processing, rollback situations, communication overhead, and computation granularity. In order to accurately compare the total execution time of the sequential program with that of the optimistic protocol, an optimal MCGS must be used. Since the LP ratio on a processor changes as the number of processors varies, the optimal MCGS also changes with the number of processors. Table 4.2 shows the total execution times of the sequential program (SEQ) and the optimistic protocol (OPT) for the S38584, S38417, and S35932 circuits. The table presents the approximately optimal MCGS values and the total execution times at those values for different numbers of processors, obtained after performing several experiments. Because we search for the optimal value with an MCGS interval of 5, the values shown in the table may not be the exact optimal MCGS. In addition, to compare the total execution time of the sequential program and that of the optimistic protocol, this set of the optimistic protocol's experiments does not contain the measurement part of the program which records the other performance metrics. Thus, the total execution time of the optimistic protocol is shorter here than in Figure 4.1. The effect of removing the performance measurement part from the program also changes the optimal MCGS value.
This is because the extra computation per event and per simulation cycle for measuring the other performance metrics is no longer included in the computation time after removing the performance measurements. As the number of processors increases, the total execution time decreases by achieving the benefit of parallel processing, as shown in Figure 4.16. The S38584 circuit has the highest speedup when the number of processors is 12, S38417 when it is 8, and S35932 when it is 14. After reaching the high point, the speedup degrades even with a large number of processors. The major reason is the communication overhead, as described in Figure 4.11. Due to this, parallel processing does not offer much performance improvement on a very large number of processors.

             S38584            S38417            S35932
  Program    MCGS    Time      MCGS    Time      MCGS    Time
  SEQ  P=1     -    105.21       -     50.51       -    153.1
  OPT  P=2    85     81.22      95     41.15     100    108.08
       P=4    85     44.98      60     24.30     120     63.57
       P=6    60     33.10      60     18.93      85     45.48
       P=8    45     25.02      50     14.58      65     37.41
       P=10   35     23.56      40     15.08      50     29.53
       P=12   35     23.20      35     14.97      50     27.01
       P=14   30     25.60      30     15.20      60     24.30
       P=16   20     25.37      25     15.43      50     27.27

Table 4.2: Comparison Between the Sequential Program and the Optimistic Protocol

Figure 4.16: Speedup

Another important consideration is that the optimal MCGS decreases as the number of processors increases, as shown in Table 4.2. This is because the LP ratio decreases as the number of processors increases, and the appropriate number of events per simulation cycle also needs to decrease. However, there are some exceptions: when the number of processors is 2, and when the number of processors is too large for a circuit with a high activity rate. If the number of processors is 2, the internal messages within a processor are the major portion of the messages, and the computation granularity is not determined only by the LP ratio. When the S35932 circuit uses 14 processors, the optimal MCGS does not decrease, because a small amount of event execution per simulation cycle makes the simulation progress slower and the performance is degraded. Also, circuit partitioning may affect the performance results as the number of processors changes.

4.2.5 The Effect of Circuit Size

By varying the number of PEs, we have studied performance in terms of the speedup of parallel processing. With a given circuit, the LP ratio changes as the number of PEs varies, so that the speedup can be observed for different numbers of PEs. In this section, we study how circuit size affects the performance when the number of PEs varies. If we use a different circuit to compare results, circuit characteristics such as the critical path length are different and may result in a different effect. Thus, in order to keep the same characteristics of a circuit, we use a double circuit and a triple circuit of a given circuit, built by replicating the circuit without any connection between the copies, that is, by placing two or three of the same circuit in parallel. Compared to the single circuit S38584, we call the double circuit D38584 and the triple circuit T38584. The replicated circuit is evenly and randomly partitioned onto processors.

Figure 4.17: Effect of Circuit Size
Figure 4.17 shows the effect of circuit size on the performance results with the S38584, D38584, and T38584 circuits. As the circuit size increases through replication, each processor is assigned more logical processes and therefore has a higher LP ratio. As more logical processes are assigned to a processor, the parallel processing speedup increases on a larger number of PEs. Because the multiplied tasks of the double and triple circuits require greater amounts of computation to complete the simulation, distributing the task over more PEs can yield greater parallel processing, resulting in a higher speedup. If more than enough processors are used, the communication overhead degrades the performance results, as described in the speedup section.

4.3 Performance Comparison Between the Optimistic Protocol and the Synchronous Protocol

Performance results of the employed optimistic protocol are compared to those of a synchronous protocol in this section. While optimistic protocols allow logical processes to execute events optimistically, synchronous protocols make logical processes execute events only if the LVTs of the logical processes are equal to the GVT value. Thus, in synchronous protocols, computing the GVT is the major task for progressing the overall simulation, as it enables the next smallest event to be executed. In contrast to the complicated operations of the optimistic protocol, the synchronous protocol has simple operations to process events. The smallest timestamped event is always executed first in a processor, and each logical process handles events with a FIFO queue.

As for the model implementing the synchronous protocol in this section, a global common priority event queue is used per processor. All pending events from the assigned logical processes are kept in the priority queue. After finding the GVT value, the events having that timestamp are dequeued and executed by each processor. In order to achieve better performance in the synchronous protocol, the global priority queue is implemented as an array structure, as in the sequential program used for the speedup measurements.

                               P=2     P=4     P=6     P=8     P=10    P=12    P=14
  Synchronous Protocol
    Total Execution Time      60.83   38.61   34.20   31.35   32.30   33.53   34.07
    GVT Comp. Time             7.15    8.55   11.16   11.38   13.41   14.53   15.46
    Activity Rate (%)          1.3858 (identical for all processor counts)
  Optimistic Protocol
    MCGS                         85      85      60      45      35      35      30
    Total Execution Time      81.22   44.98   33.10   25.02   23.56   23.20   25.60
    GVT Comp. Time             2.61    1.17    0.79    0.66    0.60    0.52    0.48
    No. of Token Cycles       23560    6677    4225    3087    2469    1913    1657
    Activity Rate (%)         0.8205  1.6398  1.7358  1.7351  1.6860  2.0169  2.0171
    Rollback Rate (%)         0.1076  0.9982  1.6842  1.6861  1.5098  5.3299  5.3830

Table 4.3: Comparison Between the Optimistic Protocol and the Synchronous Protocol

Table 4.3 shows the performance results of the synchronous protocol varying with the number of processors on the S38584 circuit. Because the synchronous protocol uses a barrier synchronization to compute the GVT per simulation cycle, the GVT computation takes longer as the number of processors increases. In contrast to the synchronous protocol, the optimistic protocol needs the GVT value only to check the termination condition and to perform fossil collection. Thus, computing the GVT is not as significant for simulation progress as it is in the synchronous protocol. Since the GVT is computed asynchronously over processors by passing a token, the optimistic protocol spends very little GVT computation time except in the case of P = 2 in Table 4.3.
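For concreteness, the following C sketch outlines one simulation cycle of a synchronous protocol of the kind just described: a barrier-style GVT computation followed by executing every pending event whose timestamp equals the GVT. The flat event array stands in for the array-heap priority queue used in the actual implementation, and MPI_Allreduce is only one possible realization of the barrier GVT; the dissertation does not prescribe this code.

#include <mpi.h>
#include <math.h>

#define MAX_EVENTS 1024
typedef struct { double timestamp; int gate; } event_t;

static event_t pending[MAX_EVENTS];
static int     npending = 0;

static double min_pending_timestamp(void)
{
    double m = INFINITY;
    for (int i = 0; i < npending; i++)
        if (pending[i].timestamp < m) m = pending[i].timestamp;
    return m;
}

static void execute_events_at(double t)
{
    int kept = 0;
    for (int i = 0; i < npending; i++) {
        if (pending[i].timestamp == t) {
            /* evaluate the gate and schedule its output events here */
        } else {
            pending[kept++] = pending[i];
        }
    }
    npending = kept;
}

void synchronous_cycle(MPI_Comm comm)
{
    double local_min = min_pending_timestamp();
    double gvt;

    /* Barrier-style GVT: the global minimum of all pending timestamps. */
    MPI_Allreduce(&local_min, &gvt, 1, MPI_DOUBLE, MPI_MIN, comm);

    /* Only events whose timestamp equals the GVT are executed this cycle. */
    execute_events_at(gvt);

    /* ... exchange newly generated events with the other processors ... */
}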
In the P = 2 case, the GVT token is passed frequently between the two processors, the number of token cycles increases greatly, and the GVT computation time is rather large compared with that for a larger number of processors. Although the optimistic protocol uses a linked list structure for event queue handling, it achieves better performance results than the synchronous protocol when the number of processors is larger than four. The differences lie in the GVT computation time and the activity rate, as shown in Table 4.3. The optimistic protocol at the optimal MCGS values has better activity rates than the synchronous protocol when the number of processors is larger than two, whereas the synchronous protocol has the same activity rate regardless of the number of processors, because it has the same number of executed events per simulation cycle according to the GVT. The table also shows the rollback rates at the optimal MCGS values. If the number of processors exceeds a certain number (approximately 12), the rollback rate increases greatly.

A challenging question in performance comparisons of parallel synchronization protocols is how much better a performance the optimistic protocol can achieve than the synchronous protocol. There are three major considerations in comparing the performances: the benefit of the asynchronous protocol, the benefit of parallel processing, and the event processing cost. The first consideration, the benefit of the asynchronous protocol, is whether the application problem is suitable for simulation with an asynchronous protocol. The synchronous protocol is good for application problems which have many processable events with the same timestamp value. For example, if a high portion of logical processes have LVT values that are equal to the GVT value, the synchronous protocol may be as efficient as the asynchronous protocol. In contrast, an asynchronous protocol is good for application problems whose events have various timestamp values. When only a few logical processes have LVTs that are equal to the GVT, asynchronous protocols can achieve better performance by executing events according to the local clock of each logical process.

The second consideration, the benefit of parallel processing, concerns the communication overhead relative to the computation amount. If an application problem incurs too much communication overhead compared to the computation amount, the parallel processing of many processors is no longer a benefit. The final consideration, the event processing cost, is how much other processing is necessary during the simulation. The data structures to handle events and an appropriate scheme for event queues are among these considerations. The optimistic protocol includes the extra cost of rollback handling and storage management. The conservative protocol contains the extra processing of link clocks, local clocks, and null messages to maintain the correct local causality constraints.

4.3.1 Timing Granularity

When we compare performance results between the optimistic and synchronous protocols, the optimistic protocol does not give a dramatically higher performance than the synchronous protocol in Table 4.3, although it allows processors to execute events asynchronously and optimistically. Not only the employed synchronization protocol, but also the assumptions applied to the application problem can affect simulation performance.
Under a different assumption, the performance of the optimistic protocol can improve greatly compared to that of the synchronous protocol. In particular, the way the timestamps of events are built up according to the time delays in logical processes affects the simulation performance.

In the domain of logic simulation, the delay of a gate is the time required for an input signal change to be reflected at the output. Timing granularity [7] is a delay-based model for logic circuit simulation, ranging from very fine-grained timing (such as a 0.1 ns delay on a gate) to coarse-grained timing (such as unit delay and zero delay on a gate). Fine-grained timing has small resolutions of simulation time and a large number of possible delays. In [7], Bailey reported that the timing granularity becomes coarser as either the number of possible delay values decreases or the time resolution increases. Unit-delay timing is extremely coarse-grained timing. Bailey [5, 6] also developed two formal models to compare the effects of timing granularity on circuit parallelism, and showed that coarser-grained timing could result in higher levels of circuit parallelism.

Considering the effects of timing granularity and the characteristics of protocols, according to [7], using a synchronous protocol is sufficient if coarse timing models are used; if fine timing is used, an optimistic protocol should be strongly considered. As the number of processors increases, a synchronous protocol may suffer, while an optimistic protocol is likely to scale better. Given the definite relationship between timing resolution and circuit activity, an application problem with a specific delay model can affect the performance of parallel protocols. Therefore, the performance is not only affected by parallel processing, considering the barrier synchronization and the communication overhead, and by the event processing cost of the protocol design, but is also influenced by timing granularity.

4.4 Summary

In this chapter, we presented experimental results of Time Warp to study the effects of computation granularity. As the target machine for parallel simulation of VLSI circuits, the IBM SP2 has been used, since it provides a large number of fast processors with a large amount of local memory and has a fast interprocessor communication subsystem. We measured several performance metrics and categorized them into total execution time, execution activity, rollback situation, communication overhead, simulation progress, and speedup. According to the experimental results, computation granularity significantly affects the performance of Time Warp by controlling the degree of optimism and the communication overhead. A large computation granularity increases the degree of optimism, resulting in frequent rollback situations. In contrast, a small computation granularity causes a high communication overhead and a slow simulation progress. Therefore, we conclude that computation granularity is a very important factor that must be tuned properly to achieve the optimal performance of Time Warp on parallel and distributed systems.

Chapter 5

Event Scheduling Schemes for an Optimistic Protocol on Distributed Systems

In simulating large VLSI circuits on distributed-memory systems, one of the important considerations for an optimistic protocol is how to schedule events, especially when the LP ratio is larger than 1. This chapter considers two major factors in designing an efficient event scheduling strategy for Time Warp on distributed-memory systems.
The first factor is computation granularity, which is the amount of computation to be processed in a processor between two consecutive interprocessor communication points. As shown in the previous chapters, by adjusting the computation granularity to be larger than a certain degree, we can prevent communication overhead from dominating simulation performance. A large computation granularity, however, could increase the number of rollbacks, wasting simulation time on handling rollbacks. That is, while processors execute a large number of events without communication, processes could propagate many immature events, and the protocol must then recover the earlier system state from the erroneously-advanced state when an event arrives whose timestamp is smaller than the current LVT. A considerable amount of overhead can occur in rollback situations because the past state of each process must be saved and recovered by re-computation. As a consequence, the appropriate computation granularity should be small enough to prevent Time Warp from progressing events over-optimistically, while remaining large enough to minimize communication overhead.

The second factor considered is balancing the progress of logical processes. Some logical processes may advance too far ahead of others, possibly leading to inefficient use of memory and excessive rollbacks, especially for large application problems. By adjusting the computation granularity, we can control the degree of optimism appropriately, as described for the first factor. However, within the range of the given computation granularity, logical processes assigned to the same processor may progress at different speeds, and this may cause more rollback situations. If we control the speed of each process by giving more chances to execute events to slower processes, we may reduce the frequency of rollback situations. We study two schemes for balancing the progress of processes: one restricts a process to executing events no more than a given number of times, and the other does not allow an event to be executed if its timestamp is far ahead of the events in other processes. The details of these schemes are studied in Section 5.1, which was published in [25].

In Section 5.1, the event scheduling schemes are presented, considering computation granularity and the balancing of process progress. Section 5.2 shows the experimental results of the effects of computation granularity and the event scheduling schemes on a distributed system. The concluding remarks are presented in Section 5.3.

5.1 Event Scheduling

As one of the popular simulation applications, VLSI circuits involve a large number of objects performing a small amount of computation for each communicating event. A VLSI circuit usually has several tens of thousands to hundreds of thousands of gates (or logical processes). The manipulation of an event requires around a hundred machine instructions; its event granularity is very small. In contrast, distributed-memory systems consist of a much smaller number of large processors, and the interprocessor communication latency is high. In general, a distributed-memory system has up to hundreds of processors. Thus, it is anticipated that most VLSI circuit simulations will place more than thousands of logic gates on a processor. Therefore, each processor should schedule the unprocessed events of a large number of assigned processes efficiently in order to achieve good performance.
As we mentioned at the beginning of this chapter, there are two important considerations in designing an event scheduling strategy: the computation granularity and the progress balancing. By adjusting the computation granularity, the number of events executed in a simulation cycle can be controlled. Besides controlling the computation granularity, an efficient event scheduling strategy should be a scheme that can balance the progress of logical processes by considering which events are executed first and deciding the order of event executions.

The major characteristic of Time Warp is that each process can asynchronously execute its unprocessed events without considering causality effects. Due to this characteristic, some logical processes may advance too far ahead of others, possibly leading to inefficient use of memory and excessive rollbacks. This situation can arise particularly for a large-scale simulation application with a small event granularity, such as VLSI circuits. By using a scheduling scheme that balances the progress of logical processes, we can prevent the simulation from propagating incorrect computations too far ahead and reduce seriously excessive rollback situations. We consider two balancing schemes: the Balancing Progress by Execution Chance (BPEC) scheme and the Balancing Progress by Virtual Time Window (BPVTW) scheme. The BPEC scheme controls the progress of a process by limiting the number of events executed in a process between interprocessor communications, while the BPVTW scheme controls the progress by using a virtual time window in a process, which allows only those events to be executed whose timestamps are within the interval of the virtual time window.

5.1.1 Balancing Progress by Execution Chance (BPEC)

The BPEC scheme balances the progress of logical processes by limiting the number of chances (opportunities) to execute events in a logical process. In this scheme, we have two controllable parameters: one parameter for the computation granularity and the other for balancing progress.

As the first parameter to control the computation granularity, this scheme sets the maximum number of events that can be executed by a processor between communication points. This quantity is called the Maximum Computation Grain Size (MCGS). Within the limit of the MCGS, the processor scheduler selects events from logical processes through the central scheduling priority queue. That is, each processor selects and executes events up to the limit of the MCGS. The scheduler counts the number of events executed in the processor since the last communication point while selecting events from the priority queue. Once the number of executed events reaches the MCGS, the processor stops scheduling events and performs the interprocessor communication. The second parameter is the maximum number of chances to execute events in a logical process. When the scheduler selects events from the logical processes through the priority queue, it also counts the number of events executed in each logical process. The purpose of counting events per process is to balance the progress of processes. By giving the appropriate execution chances to processes, we can achieve the benefit of the parallel optimistic protocol. As well as boosting only the processes that are far behind, the overall progress of processes can advance by optimistically executing all processes.
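The following is a minimal C sketch of a BPEC-style scheduling loop under the two parameters just described: the processor-wide MCGS and the per-process chance limit, which the next paragraph names the BPC. The central priority queue, the logical process type, and all function names are illustrative placeholders, not the dissertation's implementation.

/* Illustrative BPEC scheduling loop for one simulation cycle. */
typedef struct lp {
    int executed_this_cycle;        /* chances used since the last communication */
    /* ... event queues, LVT, ... */
} lp_t;

extern lp_t *dequeue_smallest_timestamp_lp(void);   /* from the central priority queue */
extern void  execute_one_event(lp_t *lp);
extern void  enqueue_if_has_unprocessed_event(lp_t *lp);
extern void  interprocessor_communication(void);

void bpec_simulation_cycle(int mcgs, int bpc)
{
    int executed = 0;

    while (executed < mcgs) {                        /* MCGS: processor-wide limit   */
        lp_t *lp = dequeue_smallest_timestamp_lp();
        if (lp == NULL)                              /* no candidate events left     */
            break;
        execute_one_event(lp);
        executed++;
        lp->executed_this_cycle++;
        if (lp->executed_this_cycle < bpc)           /* per-process chance limit     */
            enqueue_if_has_unprocessed_event(lp);
    }
    interprocessor_communication();                  /* exchange events, reset counters */
}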
The maximum number of chances that a logical process can execute events between interprocessor communications is called the Balancing Progress Chances (BPC). If a logical process no longer has unprocessed events, or it has already executed as many events as the BPC, then the logical process is not allowed to enqueue its events into the priority queue.

By controlling the maximum allowable chances of executing unprocessed events per logical process, the BPEC scheme not only balances the relative progress speed of the processes, but also controls the computation granularity. In the case that the MCGS is set larger than the optimal grain size, the computation granularity can be adjusted by using a proper BPC value. When all logical processes are individually controlled by the BPC, the total number of executed events in a processor cannot increase too much, and the granularity between two communication points is adjusted properly. The BPC is also able to control the event schedule by giving an equal number of execution chances to the logical processes. Instead of executing events only on a specific logical process, every process has the same chance to execute its events. Thus, the BPEC scheme contributes to balancing the progress of logical processes and reducing the consequent rollback situations. Because it imposes a strict upper bound on the number of events that can be executed by a processor between interprocessor communications, this scheme is a static method of controlling computation granularity.

5.1.2 Balancing Progress by Virtual Time Window (BPVTW)

Unlike the BPEC scheme, the BPVTW scheme does not impose a strict upper bound on the execution chances of each logical process before interprocessor communication. Rather, the relative progress of logical processes is controlled with regard to the timestamps of events. That is, the scheduler prevents a process from executing events if the process is far ahead of the other processes in virtual time. For this purpose, we employ a virtual time window whose base is the GVT, such that logical processes can execute only events having timestamps within the interval of the virtual time window. This virtual time window is called the Balancing Progress Window, and the size of the window is denoted by BPW. Thus, the Balancing Progress Window covers the interval [GVT, GVT+BPW].

In this scheme, the overall progress is controlled per logical process, not per processor. As in the BPEC scheme, the processor scheduler selects the smallest timestamped event from the central scheduling queue. The scheduler, however, does not count the number of executed events. When a logical process has an unprocessed event and its timestamp is in the interval [GVT, GVT+BPW], the event is enqueued into the central scheduling queue. If there is no unprocessed event whose timestamp is in the interval [GVT, GVT+BPW], the process does not have any chance to execute events until the next cycle after communication. In other words, the scheduler handles events until there are no available unprocessed events in the scheduling queue. Once the scheduling queue is empty, the processor communicates with the other processors.

The computation granularity in the BPVTW scheme is indirectly controlled by the BPW value. Since there is no static limit on the maximum number of events to be executed by a processor between any two communication points, we do not employ the MCGS. Instead, the BPVTW scheme is concerned with the real progress of processes in terms of their timestamps, in order to balance the progress of logical processes. The computation granularity is consequently adjusted when the scheduler restricts the scheduling of events according to the BPW value. Because there is no limit on the number of executed events, the actual computation granularity may vary depending on the progress of the simulation. In this sense, the BPVTW scheme is considered a dynamic method of controlling the computation granularity.
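For comparison with the BPEC sketch above, a minimal sketch of the BPVTW eligibility test and scheduling loop follows. Only events whose timestamps fall inside [GVT, GVT+BPW] are eligible, and the processor communicates once no eligible event remains. The placeholder type and function names are independent of the previous sketch and are not the dissertation's code.

/* Illustrative BPVTW scheduling loop for one simulation cycle. */
typedef struct lp lp_t;

extern double next_unprocessed_timestamp(const lp_t *lp);   /* INFINITY if none */
extern lp_t  *dequeue_smallest_timestamp_lp(void);           /* central queue    */
extern void   execute_one_event(lp_t *lp);
extern void   enqueue_lp(lp_t *lp);
extern void   interprocessor_communication(void);

static int bpvtw_eligible(const lp_t *lp, double gvt, double bpw)
{
    double ts = next_unprocessed_timestamp(lp);
    return ts >= gvt && ts <= gvt + bpw;        /* inside the window [GVT, GVT+BPW] */
}

void bpvtw_simulation_cycle(double gvt, double bpw)
{
    lp_t *lp;

    while ((lp = dequeue_smallest_timestamp_lp()) != NULL) {
        execute_one_event(lp);
        if (bpvtw_eligible(lp, gvt, bpw))       /* re-enqueue only while still in window */
            enqueue_lp(lp);
    }
    interprocessor_communication();             /* scheduling queue empty: communicate   */
}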
Instead, the BPVTW scheme concerns the real progress of processes with their timestamps in order to balance the progress of logical processes. The computation granularity is consequently adjusted when the scheduler restricts scheduling events according to the BPW value. Because there is not a limited number Of executed events, the actual computation granularity may vary depending on the progress of the simulation. In this sense, the BTVTW scheme is considered a dynamic method 135 of controlling the computation granularity. A similar concept has been proposed as Moving Time Window (MTW) [75]. Although the MTW mechanism didn’t provide considerable improvement on some cases [37], as it will be shown from the experimental results in Section 5.2, the BPVTW scheme achieves a good performance improvement for large-scale and small- event-granularity applications on parallel and distributed systems due to its ability to control the computation granularity. In addition, for those applications with a large LP ratio, the BPVTW scheme provides the balancing effects as well as control over the degree of Optimism. 5.2 Experimental Results This section presents the experimental results on a cluster of six DEC Alpha work- stations interconnected by a DEC GIGASwitch through FDDI. The DEC Alpha 3000 workstation has a 133 MHz clock rate and a 32 MB memory. The GIGAswitch sup- ports a peak data transfer rate of 200Mbits per second. As benchmark circuits, we use several circuits from the ISCAS ’89 benchmark suite [15] and apply a pre-processing step to re-configure the circuits into a maximum of 3 input and 2 output ports as in Chapter 3. The added gates due to the pre—processing procedure have a zero delay and the normal gates have a unit delay for the logic simulation. In the circuits, we have the D f / f clocking interval set to be the same as the input clocking interval. The logical processes are randomly partitioned into processors. For parallel processing and the interprocessor communication on the distributed system, we use PVM (Parallel 136 Virtual Machine) 3.3 [65]. Time Warp is implemented on the distributed system as a master-slave model. 5.2.1 The Implementation Model The basic model of implementation for Time Warp in this chapter is the same as the model on the distributed-memory system in Chapter 3 except the scheme of GVT computation and the model of simulation cycle. We use the same token passing mechanism as in Chapter 4 to collect PVT values asynchronously from processors and to compute the GVT value. Without any barrier synchronization per simulation cycle, the simulation cycle repeats asynchronously on each processor. Processors are allowed to execute events according to their own progresses. As the data structure of event queues like the model of previous chapters, each logical process has its own event queues according to the input ports. In the event queues, all unprocessed events are pending and the previously-processed events are stored as well. Instead of using a separate output event queue, the event queues are used for both input and output events. The central priority queue per processor is used for scheduling events efficiently. The detailed event scheduling depends on the scheduling schemes that are applied. 5.2.2 Performance Measurement with the BPEC Scheme This subsection explains the experimental results when the BPEC scheme is applied as the event scheduling technique to Time Warp on the distributed system. 137 900 ’ I I I ' I I : S38584, BPC = 1 4— 800 :- S38584, BPC = 00 4.. 
Figure 5.1: Total Execution Time with the BPEC Scheme

Effects of the MCGS

To study the effect of varying the MCGS, we have performed experiments with three different circuits. Figure 5.1 shows the total execution time for the S38584 (curves with stars), S35932 (curves with triangles), and S38417 (curves with dots) circuits as a function of the MCGS. The solid curves represent the case of BPC = 1 and the dotted curves the case of BPC = ∞. The number of input vectors is 50, and the BPC is set to 1 and to the unlimited value. The three circuits have the following LP ratios: 6733, 5357, and 6074, respectively.

As we mentioned in the previous section on the BPEC scheme, the computation granularity is controlled by the values of the MCGS and the BPC. To focus only on the effects of the MCGS, in this subsection we consider only the dotted lines in Figure 5.1, where the BPC has the unlimited value. The solid lines are considered in the next subsection. As shown by the dotted lines, the total execution time can be minimized when the MCGS is tuned properly. When the MCGS is near 1, the total execution time increases very sharply as the MCGS decreases. In this situation, the computation granularity determined by the MCGS is so small that the simulation spends most of its time in frequent communications. This implies that the MCGS should be larger than such a small value, so that a substantial amount of computation can be performed between interprocessor communication points. An appropriate MCGS will minimize the degrading effects of the communication overhead. In the extreme case of MCGS = 1, the computation granularity is the same as the event granularity.

Figure 5.2: Number of Rollbacks with the BPEC Scheme

On the contrary, the total execution time increases linearly as the MCGS increases beyond approximately 200. This is because a large computation granularity may increase the number of rollbacks. To observe this relationship, Figure 5.2 shows the number of rollbacks varying with the value of the MCGS. Again, the dotted lines are for the case when the BPC is set to the unlimited value and the solid lines are for the case when the BPC is set to 1.

Figure 5.3: Total Execution Time with the BPEC Scheme for the S13207 Circuit
As a consequence, the appropriate value of MCGS should be as small as possible, but larger than a certain value so that the communication overhead can be reduced as much as possible. As for the cases in Figure 5.1, the optimal performance occurs when the MCGS is around 200. Effects of the BPC In this subsection, we consider all the curves in Figure 5.1 to study the effects of the BPC. When the MCGS is much larger than the optimal value, the simulation 140 performance depends on the BPC value. As we discussed above, when the BPC has the unlimited value, the total execution time increases as the MCGS increase. In this case, the computation granularity is controlled only by the MCGS. When the value of BPC is small, say 1, the curve of total execution time goes to a steady performance regardless of the MCGS. Figure 5.3 shows the simulation results with’the 813207 circuit. In the simulation, 100 input event vectors are used. Like the circuits in Figure 5.1, the curves in Figure 5.3 also reach steady performance when the BPC has small values, such as 1, 2, and 5. At this steady state, the BPC value controls the computation granularity because of the limited number of executed events per process. With a small BPC, processes have limited chances to execute events although they have more unprocessed events. Thus, a fast process is prevented from executing too many events so that the process cannot go too far ahead of the others, and we are then able to obtain the balancing effect with a small BPC value. In the interval between the optimal and the steady states, the computation granularity is controlled not only by the BPC, but also by the MCGS. The depth of the valley of each curve at the optimal performance is deeper as the value of the BPC becomes larger. As shown in Figure 5.3 for the 813207 circuit, the simulation results with the BPC = I produce better performances than when the BPC is larger. Effects of the LP Ratio In order to study the effects of the LP ratio, we stage a set of simulations with three variations of the 835932 circuit. The experimental results are given in Figure 5.4. The D35932 circuit and the T 35932 circuit are the circuits that are constructed by 141 600 550 500 450 Total 400 Execution350 Time 300 S35932 +- - D35932 A— T35932 +— 7 250 r 200 r r 150 r 100 l L J l l 0 1000 2000 3000 4000 5000 6000 MCGS Figure 5.4: Effect of Computation Granularity on Total Execution Time by Varying LP ratio connecting the original S35932 circuit double and triple times, respectively. Therefore, the T35932 circuit has around a hundred thousand logic gates. Since the two circuits have logical gates multiple times of the 835932 circuit, they have multiple times higher LP ratios than the 835932 circuit. For all three curves, the BPC is fixed at 1. As before, each curve shows the optimal and the steady performances as the MCGS varies. However, unlike the 813207 circuit which has the steady performance with the optimal performance on all the ranges of MCGS when the BPC is 1, the steady performances of the three circuits in Figure 5.4 are much higher than their optimal performances. The larger the LP ratio is, the larger the difference is. Interestingly, even if tens of thousands of logic gates are assigned in a processor, the simulation shows reasonably good performance on distributed systems if the MCGS is tuned appropriately. These results imply that the effects of computation granularity on the performance become more significant as the LP ratio becomes large enough. 
Figure 5.5: Total Execution Time with the BPVTW Scheme

5.2.3 Performance Measurement with the BPVTW Scheme

Additional sets of experiments have been performed with the BPVTW scheme on the distributed system. Figure 5.5 presents the experimental results in terms of the total execution time, varying the BPW. The three circuits, S35932, D35932 and T35932, described in Section 5.2.2, are used. In the figure, all three circuits show good performance up to a BPW value of 2000.

Unlike the BPEC scheme, which shows sharp increases in total execution time for small values of MCGS, the BPVTW scheme shows good performance even with very small values of BPW. This is because each process adds the unit delay of virtual time in executing events, so many events exist within a narrow virtual time interval. The interval of BPW which contains the optimal performances is quite wide in comparison to the sharp optimal range in the BPEC scheme. Thus, the progress balancing between logical processes with the virtual time window is effective and works well over a wide interval of BPW. This is because the BPVTW scheme considers the actual progress of processes in terms of their timestamps.

On the contrary, when the BPW is larger than 2000, the curves of total execution time increase at different rates as the BPW increases. In this larger window interval, the virtual time window no longer plays an important role in balancing the progress of logical processes because there are many executed events in this interval. Also, the computation granularity is not restrained by the BPW. Thus, too many unprocessed events are executed, propagating many immature events. The larger the LP ratio is, the more unprocessed events are generated. In the figure, the T35932 circuit shows the worst performance when the BPW is larger than 2000, and the D35932 circuit is worse than the S35932 circuit. Therefore, when the BPVTW scheme is used as the event scheduling method, the BPW must be tuned properly. Otherwise, the simulation performance will be as bad as if there were no limit on the MCGS and the BPC in the BPEC scheme.

As discussed in Section 5.1, the BPVTW scheme does not control the computation granularity in a direct way as the BPEC scheme does. Rather, by using the virtual time window per logical process, the computation granularity is indirectly controlled while the progress of logical processes is balanced. In order to observe the effects of the BPW on the computation granularity, the average number of executed events between interprocessor communications is also shown in Figure 5.6. In the figure, where the simulation shows almost optimal performance, the average number of executed events per communication is around 400 or 700, depending on the circuit. This amount is similar to the optimal grain sizes for the same circuits as shown in Figure 5.4.

Figure 5.6: Average Number of Executed Events with the BPVTW Scheme

5.3 Summary

Event scheduling is an important design issue of optimistic protocol on distributed systems.
When a huge application problem has a high LP ratio, the number of computed events between any two communication points can be controlled by the computation granularity and the event execution order. By balancing the amount of computation with the frequency of communication and controlling the optimistic tendency, computation granularity significantly affects the performance efficiency of the distributed optimistic protocol. In addition to the computation granularity, the event scheduling scheme, which adjusts the progress balance among the assigned logical processes, should be well designed to improve the performance.

By considering the characteristics of the application problem of logic circuits and the optimistic protocol, we propose two efficient event scheduling schemes for the optimistic protocol: the BPEC scheme and the BPVTW scheme. The BPEC scheme limits the number of executed events per logical process between interprocessor communications by using the BPC parameter. The BPVTW scheme controls the progress of processes in virtual time by using the BPW parameter, and prevents processes from executing events far ahead of other processes.

We experimented with Time Warp, considering the design issues of the control of the computation granularity and the event selection schemes, on a cluster of DEC Alpha workstations. According to our experimental results, both schemes show good performance when the computation granularity is adjusted properly. Otherwise, communication overheads or excessive rollback situations may significantly degrade the simulation performance. The results also showed that the progress balancing among logical processes has significant effects on obtaining the optimal performance. In addition, it is shown that the LP ratio and the characteristics of the simulated circuit affect the simulation performance.

Chapter 6

Comparison and Analysis of Parallel Asynchronous Protocols on Massively Parallel SIMD Architectures

In general, MIMD machines have been used as parallel processing environments for distributed event driven simulation. Parallel processing on MIMD machines has advantages over SIMD machines in several aspects, such as general computational capabilities, larger-grain parallelism, and asynchronous computation on each processor. However, one significant drawback of a MIMD machine is that it usually has a much smaller number of processors than a massively parallel SIMD machine and pays a higher communication cost for message passing. The SIMD architecture is the only alternative under the currently available technology to achieve massive parallelism with several thousands of processors. It offers significant advantages over the MIMD architecture in VLSI circuit simulation in the case that the number of gates is smaller than the number of processors on the SIMD machine. VLSI circuit simulation requires a small-grain operation per gate, and a large circuit requires a large number of processors, making it well suited for a massively parallel SIMD environment. The SIMD architecture also offers simpler synchronization, which is crucial to several aspects of the performance of parallel simulation. In contrast to a MIMD environment, a SIMD environment has no serious problem with circuit partitioning and load balancing because a logical process is assigned to a processor.

In this chapter, architectures of existing massively parallel SIMD machines are analyzed for parallel gate-level logic simulation.
The parallel protocols of parallel discrete event simulation considered here are the synchronous, conservative, and optimistic protocols. The Connection Machine CM-2, having up to 64K processing elements, and the MasPar MP-1, having up to 16K processing elements, are used as the target SIMD machines.

The objective of this chapter is to understand the effect of the machine characteristics of SIMD machines on the performance of parallel simulation and to help choose a massively parallel SIMD architecture for efficient parallel logic simulation. In the comparison of the two machine architectures, broad aspects of the machines are considered, such as interprocessor communication methods and array access times. According to the experimental results, the MP-1 gives approximately two times faster performance than the CM-2 for benchmark circuits of up to 16K gates, while the CM-2 has the advantage over the MP-1 for a circuit with a much larger number of gates. The experimental work was published in [24]. One severe problem for parallel logic simulation on the massively parallel SIMD machines is that the size of the local memory is not enough for event queues. To cope with this problem, the Moving Time Window (MTW) technique [75] is applied to the conservative protocol and the optimistic protocol as a memory handling mechanism.

The remainder of this chapter is organized as follows. Section 6.1 describes the data structure of event queues for the optimistic protocol to handle rollback situations efficiently on the SIMD machines. Section 6.2 presents the performance of the instructions related to logic simulation on both the CM-2 and the MP-1. In addition, comparisons and descriptions of architectures of massively parallel SIMD machines are discussed in Section 6.3. Section 6.4 gives experimental results on both the CM-2 and the MP-1 by using the synchronous, conservative, and optimistic protocols. Based on the experimental results, the SIMD machine architectures are analyzed in view of parallel logic simulation.

6.1 Event Queue Manipulation

Efficient queue manipulation is essential for good performance of the optimistic protocol due to rollback manipulation and space overhead problems. As an approach, a single queue scheme based on immediate cancellation has been proposed in [28], but it has been shown that it is not sufficient due to its inherent fossil collection overhead and sequential search problems. An event queue handling scheme, called CBSQ (Circular Binary Search Queue), is used in this chapter as a data structure for fast event queue manipulation. The data structure is based on the scheme of immediate cancellation.

A CBSQ is a data structure which allows binary search on a circular list. There are three pointers for each CBSQ: Bottom, Front, and Rear. Bottom is a pointer to the oldest remaining event in the CBSQ. Front points to the position of the first unprocessed event in the CBSQ, while Rear is a pointer to the next available element for a newly arriving event. A binary search operation is performed on a CBSQ to find the position of an event.

Figure 6.1: A CBSQ Structure

A binary search is needed to insert a new incoming event into a CBSQ at an input port, as shown in case A in Figure 6.1; the insertion is completed by simply moving the Rear pointer.
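As an illustration of this layout, a minimal C sketch of a CBSQ and its case-A insertion is given below; case B, the rollback situation, is described next. The structure and function names are hypothetical, the fixed array capacity and the shifting of later unprocessed events are assumptions made for the sketch, and capacity checks are omitted; this is not the exact implementation used on the SIMD machines.

/* Sketch of a Circular Binary Search Queue (CBSQ) for one input port.
 * Events are kept in timestamp order on a circular array; Bottom marks
 * the oldest kept event, Front the first unprocessed event, and Rear
 * the next free slot. */
#define CBSQ_CAP 64

typedef struct { long timestamp; int value; } event_t;

typedef struct {
    event_t slot[CBSQ_CAP];
    int bottom;   /* oldest event still kept (for fossil collection) */
    int front;    /* first unprocessed event */
    int rear;     /* next free slot */
} cbsq_t;

/* Binary search between Front and Rear for the insertion point of ts
 * (logical offsets from Front are mapped onto the circular array). */
static int cbsq_find(const cbsq_t *q, long ts)
{
    int lo = 0, hi = (q->rear - q->front + CBSQ_CAP) % CBSQ_CAP;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (q->slot[(q->front + mid) % CBSQ_CAP].timestamp < ts)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (q->front + lo) % CBSQ_CAP;
}

/* Case A: the new event is not earlier than the first unprocessed one,
 * so it is placed at its sorted position and Rear is advanced. */
void cbsq_insert(cbsq_t *q, event_t ev)
{
    int pos = cbsq_find(q, ev.timestamp);
    /* shift the (few) later unprocessed events up by one slot */
    for (int i = q->rear; i != pos; i = (i - 1 + CBSQ_CAP) % CBSQ_CAP)
        q->slot[i] = q->slot[(i - 1 + CBSQ_CAP) % CBSQ_CAP];
    q->slot[pos] = ev;
    q->rear = (q->rear + 1) % CBSQ_CAP;
}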
If the timestamp of the new event is less than the timestamp of the event pointed to by Front, that is, a rollback situation, all of the events placed from Front to Rear are automatically eliminated by simply moving Front to the new event and Rear to the place just after the element that contains the new event. This case is shown as case B in Figure 6.1. That is, no additional operation is needed to delete the wrong events when the CBSQ is used. When a rollback occurs, binary searches on event queues are performed for rollback adjustment in CBSQs which are not even directly involved in the rollback. In this case, the Fronts are moved down to the event with timestamp equal to or just greater than the timestamp of the new LVT. Fossil collection also requires the binary search to find a boundary for deleting unnecessary processed events. The Bottoms are moved to point to the event having the smallest timestamp which is greater than the current GVT, whenever there is at least one full queue.

On average, it takes O(log2 n) time to perform a CBSQ search operation on each queue, compared with O(mn) for the case of one linear queue per gate [28], where n is the number of events in a queue and m is the number of ports per gate. For a circuit with a maximum fan-in p, we need (2 + q)p binary search operations on average at each simulation cycle, where q represents the average frequency of fossil collection (0 ≤ q ≤ 1). The value of q varies depending on the situation. For example, if the system has an unlimited memory size, Time Warp does not need to perform fossil collection and q is zero. Since a CBSQ resides in each port of a gate, it greatly reduces the cost of queue management in Time Warp for finding the proper position, especially when an event with a smaller timestamp arrives in an event queue and causes a rollback situation. Using the CBSQ is efficient for parallel logic simulation on massively parallel SIMD machines, because the number of accesses in a queue is logarithmically smaller than in the single event queue. Thus, Time Warp with immediate cancellation can be handled efficiently by the CBSQ of each port.

Table 6.1: Performance of Instructions

6.2 Related Instructions and their Performances on SIMD Machines

Since the performance of logic simulation considerably depends on the execution times of machine instructions, it is essential to compare their performance on the MP-1 and the CM-2. The main instructions are interprocessor communication, global minimum, and array access operations.

• Interprocessor communication instruction: Interprocessor communication is used to propagate events between any two gates by sending a message from a selected processor to a destination processor. The router instruction is used for global communication on the MP-1, while the send instruction is used on the CM-2.

• Global minimum instruction: The global minimum is used to compute GVT by obtaining the minimum local virtual time of all active gates. The SIMD machines can achieve the global minimum by using a single instruction within a short amount of time compared to other machine architectures.

• Array access operations: To use the given local memory efficiently and improve the performance, event queues are implemented by array structures. Array access operations affect the performance of manipulating event queues by reading and writing events of an array.

Table 6.1 shows the execution times of the instructions.
MPL (Massively Parallel Language on MasPar) [53, 26] is used for the measurement on the MP-1, while C/Paris is used on the CM-2. The unit of measurement is the microsecond. The destination processors are randomly determined to obtain the performance of the routing instruction in Table 6.1. The MP-1 is about 2 and up to 6 times faster than the CM-2.

6.3 Comparisons of Massively Parallel SIMD Machines

The unique characteristic of the SIMD machine is that every instruction is synchronized over all processors, with different data, by the front-end processor. The simulation cycle, which is repeated to complete the given simulation tasks, is synchronized over all of the processors. In view of logic simulation, when the number of processors and the memory space for event queues are sufficient to simulate a circuit and each processor of a SIMD machine has good capability, better performance can be obtained for a circuit with a large number of gates. The CM-2 is known as a parallel machine with a large number of small-grained processors, while the MP-1 is a machine with a relatively large processor memory size compared to other SIMD machines. In this section, we discuss machine-dependent characteristics in logic simulations on the two SIMD machines.

6.3.1 Queue Space

One of the important factors for efficient parallel logic simulation is to have enough event queue space because, in general, queue space determines the number of events that can be used for the simulation. Each processor on the MP-1 and the CM-2 has 16K and 8K bytes, respectively, as the local memory size. The MP-1, when MPL is used, provides only about 10KB out of 16KB for user space, while the CM-2, when C/Paris is used, supports almost 8KB as user space. Under this situation of limited local memory space, processing a large number of input vectors may cause overflow in the event queues for conservative simulation and optimistic simulation. The Moving Time Window (MTW) technique [75] is used to prevent the queue overflow.

6.3.2 Parallel Virtuality

The number of available processors is another significant factor when each processor is assigned one gate. The CM-2 provides up to 64K physical processors and offers the VP (Virtual Processor) ratio concept for circuits larger than 64K gates, which indicates how many times each physical processor must perform a certain task in order to simulate the appropriate number of virtual processors [85]. The MP-1 has a smaller number of processors than the CM-2: the number of physical processors of the MP-1 family is only from 8K up to 16K. Using the larger local memory on the MP-1, the virtual processor concept may also be implemented manually instead of by using system support. When implementing the virtual processor concept, an important consideration is the partitioning of a benchmark circuit into the processors of the MP-1.

Table 6.2: Comparison of Congestion Effects

           32 bits    64 bits    Congestion effect
  CM-2       140        190      7.7 (6.9, 8.4) times
  MP-1        28         43      14.1 (15.1, 13.1) times

If one processor can be assigned several gates, the MP-1 may also cover a circuit with a large number of gates. However, the available memory size per virtual processor is cut in half whenever the VP ratio doubles, and the entire processing time is slowed by processing the assigned gates one by one. On a SIMD machine, assigning several gates to a processor may result in a high overhead due to the small memory space and the synchronized characteristic.
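The overhead of such manual virtualization can be pictured with the following C-style sketch of one simulation cycle on a single processing element that hosts several gates. It is only an illustration under the assumption of a simple per-PE loop; the names (gate_t, evaluate_gate, global_barrier) are hypothetical and do not come from the MPL implementation.

/* Sketch: one SIMD simulation cycle when a PE emulates several
 * virtual processors (gates).  Every PE must loop over all of its
 * local gates before the machine-wide cycle can proceed, so the
 * cycle time grows with the VP ratio even if only a few gates are
 * active.  Illustrative only. */
typedef struct gate gate_t;

int  gate_is_active(const gate_t *g);   /* assumed helpers */
void evaluate_gate(gate_t *g);          /* consume/produce events */
void global_barrier(void);              /* implicit on a SIMD machine */

void simd_cycle(gate_t *local_gates, int num_local_gates)
{
    for (int i = 0; i < num_local_gates; i++) {
        if (gate_is_active(&local_gates[i]))
            evaluate_gate(&local_gates[i]);
        /* Inactive gates still cost a pass through the loop because
         * all PEs execute the same instruction stream. */
    }
    global_barrier();                   /* end of the simulation cycle */
}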
6.3.3 Communication Scheme

The CM-2 [84] is a SIMD machine whose interprocessor communication topology is a hypercube, while the MasPar topology is a mesh plus a global router. Each CM-2 processor chip contains one router node which serves the 16 data processors on the chip. The router nodes on all the processor chips are wired together to form the complete router network. The algorithm used by the router can be broken into stages called petit cycles. The delivery of all the messages for a send operation might require only one petit cycle if only one processor is active, but if all processors are active then the execution time is about 8 times slower than one petit cycle time.

In the MasPar, the Global Router is a bidirectional communication instruction and a circuit-switched style network organized as a 3-stage hierarchy of crossbar switches [14]. Each nonoverlapping square matrix of 16 processors in the processor array is called a cluster. The system can communicate with all PE clusters simultaneously, but it can communicate with only one processor per cluster at a time [53]. A congestion problem occurs when more than 2 processors in the same cluster send data through the outer routing port at the same time. Even though the router requires a very short time, there is some delay when several messages are sent from a cluster. The case in which all 16 processors in a cluster want to send data at the same time is up to 16 times slower than the case in which only one processor per cluster communicates.

Even when the worst case of activating 16 processors occurs, the router instruction on the MP-1, using the external router network, is about two times faster than the send instruction on the CM-2. Table 6.2 shows the performance of the routing instruction when only one processor in a cluster sends a message to a randomly assigned destination processor. From the measured performance, we can see that the MP-1 is more sensitive to the congestion problem in a cluster.

6.3.4 Processor Capability

A massively parallel SIMD machine contains numerous small-grain processors. On the MP-1, each processor has a large on-chip register set and all computations operate on the registers [61]. Most data movement within each processor occurs on the internal processor 4-bit Nibble bus and the Bit bus. For example, a 32-bit integer instruction requires a series of operations on successive 4-bit nibbles to generate the full-precision result, which takes 8 clocks. In contrast, each processor of the CM-2 is 1-bit serial, which causes an extremely long time to perform read or write operations in an array. These processor capabilities affect the array access time on SIMD machines, and the CM-2 in particular takes a long time to access an element of an array. Since logic simulation requires a lot of event handling, that is, accessing events, comparing events, and checking the queue boundary, the performance of the two machines is significantly affected by processor capability.

6.4 Experimental Results

We simulated combinational circuits from the ISCAS '85 circuit suite [16] as benchmark circuits, including C1355, C1908, C6288, and C7552. Through a pre-processing step, the benchmark circuits are re-configured to a maximum of 3 input ports and 2 output ports per gate in order to take advantage of the characteristics of the SIMD machine. The synchronous, conservative, and optimistic protocols are implemented on the MP-1 and the CM-2, and experimented with 1000 input event vectors by using the MTW mechanism to manage the local memory efficiently.
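The MTW mechanism referred to here, which is described in detail in Section 7.3, simply gates each process's activation on a window above GVT. A minimal C sketch of that check is shown below; the names (proc_t, lvt, mtw_size, mtw_active) are illustrative rather than the actual MPL or C/Paris code.

/* Sketch of the Moving Time Window activation test: a process may
 * execute an event only if its LVT lies in [GVT, GVT + mtw_size],
 * which keeps far-ahead processes from overflowing the event queues. */
typedef struct {
    long lvt;          /* local virtual time of this process (gate) */
} proc_t;

int mtw_active(const proc_t *p, long gvt, long mtw_size)
{
    return p->lvt >= gvt && p->lvt <= gvt + mtw_size;
}

/* Example use inside a simulation cycle (execute_next_event assumed):
 *     if (mtw_active(&procs[i], gvt, mtw_size))
 *         execute_next_event(&procs[i]);
 */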
The experimental results on the performance of the synchronous, conservative, and optimistic protocols for the benchmark circuits are shown in Table 6.3. The speed ratio is the ratio of the total execution time on the CM-2 to that on the MP-1. All MTW sizes for the synchronous protocol are infinite because only a few events are executed and propagated in the synchronous protocol, based on the GVT value, and the event queues do not overflow. Among the parallel protocols, the conservative protocol achieves better performance than the other protocols on the SIMD machines. In addition, the MP-1 runs 2 to 2.5 times faster than the CM-2, and the optimistic protocol gives a better speed ratio than the synchronous and conservative protocols on the MP-1.

Table 6.3: Experimental Results (Speed ratio = T(CM-2)/T(MP-1))

            Synchronous        Conservative                Optimistic
  Circuit   Speed ratio     Speed ratio   MTW size     Speed ratio      MTW size
  C1355     1.89 (214/113)  2.12 (55/26)      ∞        2.52 (189/75)       3000
  C1908     1.82 (265/146)  1.95 (41/21)    28000      2.36 (184/78)      20000
  C6288     1.89 (735/388)  2.10 (437/208)   2500      2.52 (1667/662)      500
  C7552     1.74 (324/186)  1.78 (89/50)    10000      2.32 (402/173)      1000

From the experimental results, we can consider two analytical factors: array access frequency and the congestion problem. The first factor produces several noticeable effects and wide differences. The optimistic protocol requires more event searching in a simulation cycle, causing more array accesses. Since the array access time, as shown in Section 6.2, is much faster on the MP-1 than on the CM-2, the performance of the optimistic protocol on the MP-1 is almost two and a half times better than that on the CM-2. This gives the optimistic protocol the highest speed ratio among the three protocols. The conservative protocol is also influenced by this array access frequency effect, since there are more array accesses for the link clocks and local clock than in the synchronous protocol. Thus, the conservative protocol has a higher speed ratio than the synchronous protocol.

For the second factor, the MP-1 is more sensitive to the PE cluster concept than the CM-2, as mentioned in Section 6.3. For the synchronous protocol, only a few processors are activated in each cycle according to GVT, so little time is wasted going through the global router because the congestion is very low (about 1.75 μseconds/cycle on average for communication on the MP-1). The conservative protocol, however, has a small number of simulation cycles. This results in an increased degree of parallelism and communication time per cycle (about 1.97 μseconds/cycle). The optimistic protocol allows all gates to be activated whenever they receive new incoming events, resulting in a very high probability that each processor in a cluster tries to propagate events simultaneously. Its interprocessor communication takes the longest time (about 2.12 μseconds/cycle). In spite of this high congestion probability, the performance of the optimistic protocol on the MP-1 is almost two and a half times better than that on the CM-2.

This situation can be explained by the fraction of execution time for each major part of the logic simulation processing relative to the total execution time. Processing of a logic simulation consists of three main parts: interprocessor communication, queue accessing and handling, and GVT computation. The fraction of execution time for each activity varies by protocol and circuit. It is measured in terms of the percentage of its execution time, as shown in Table 6.4.
Since the performance difference in array access instructions between the two machines is wide, the speed ratio increases as the queue handling percentage increases. If the congestion problem is severe, the percentage of communication time per cycle will increase.

Table 6.4: Fractions of Execution Time

  Interprocessor Communication
    Synchronous     78.22% to 79.73%
    Conservative    50.59% to 53.85%
    Optimistic       9.20% to 17.04%
  Event Queue Manipulation
    Synchronous     20.27% to 21.78%
    Conservative    46.15% to 49.41%
    Optimistic      82.96% to 90.80%
  GVT Computation
    Synchronous     less than 0.0001%
    Conservative    0.0003%
    Optimistic      0.0001%

Thus, for the optimistic protocol, the performance difference in queue manipulation dominates the difference resulting from high congestion. The performance of the conservative protocol on the MP-1 is also affected by this domination, producing a higher speed ratio than the synchronous protocol. Thus, we can see that the congestion problem does not cause a significant effect on the ISCAS '85 circuits. The influence of the congestion problem, however, depends on how gates are allocated to processors. In the case of a multiplier circuit, the conservative protocol is more sensitive to the congestion problem than to array access frequency (speed ratio of syn. = 2.13, speed ratio of con. = 1.95, speed ratio of opt. = 2.72). The optimistic protocol still has a high speed ratio due to the effect of queue manipulations caused by the heavy rollback in the multiplier circuit.

As we see in the above experiments, the array access frequency and the communication scheme are the major factors that make one simulation protocol better than the others. One method to lower the congestion effect depends on how to assign logic gates to processors based on advance knowledge of gate activity. Array access frequency can be reduced by using an efficient queue data structure for logic simulation.

6.5 Summary

The characteristics of machine architectures decide the performance of parallel protocols. When VLSI benchmark circuits are used as application problems, the performance of logic simulation depends on that of the related instructions in the employed machine architecture. In order to compare the machine architectures in terms of PDES, the following characteristics and performances of the machines are considered: computation capability of a processor, interprocessor communication, global synchronization, local memory space, and the number of processors.

The two target massively parallel SIMD machines are the MP-1 and the CM-2. The performances of the major instructions considered are faster on the MP-1 than on the CM-2. Three logic simulation protocols, that is, the synchronous, conservative, and optimistic protocols, are evaluated and compared on both of the machines. The experimental results show that the MP-1 is about 2 to 2.5 times faster than the CM-2, with some variance, for all three simulation techniques. The results reflect the fact that the different performances of the major instructions cause the faster execution times on the MP-1. The variance of performance among the three protocols is also explained by investigating how array access frequency and the communication scheme affect the logic simulation protocols. In general, even though there is a congestion problem in communications, the fraction of execution time spent in queue manipulations exceeds the fraction spent in interprocessor communications for the conservative and optimistic protocols.
This results in the optimistic protocol having the highest speed ratio.

Chapter 7

A Study of Machine-Independent Characteristics of Parallel Logic Simulation

The factors that affect the performance of parallel protocols can be classified into machine-dependent factors and machine-independent factors. Machine-dependent factors are ones that rely on the performance of processor operations and the capabilities supported by the machine system. The related performance metrics are the instruction processing time, the array manipulation time, the interprocessor communication time, the global synchronization time, and so on, when a parallel protocol is employed on a machine. In contrast, machine-independent factors are independent of machine performance. Rather, they are determined by the characteristics of circuits and parallel protocols. The critical path length of a circuit, the simulation parallelism of parallel protocols on a given circuit, the number of simulation cycles, data structures, and other schemes used are in this category.

In this chapter, we investigate the effects of two machine-independent factors on the performance of parallel protocols by observing how the factors are related to the characteristics of parallel protocols, application problems, and other machine-independent factors. The first one is the critical path length of an application problem. We find a linear relationship between the critical path length and another machine-independent factor. With this relationship, when the critical path length of an application problem is known, we discuss how the number of simulation cycles and the total execution time of a protocol can be estimated without any actual experiments. As the second factor, we study the parallelism of simulation, which is defined as the ratio of the activated processes to the total processes during the simulation. In particular, the conservative protocol with null messages is studied in view of parallelism, and a refined conservative protocol is proposed that reduces useless null messages. Since sending and cascading unnecessary null messages make all processors busy and increase communication delay wastefully, the refined protocol employs a mechanism, called the Bounded Time Window, which bounds the timestamps of the null messages to be sent. The effects of this mechanism on the experimental results of the parallelism of simulation and the number of simulation cycles are studied. Finally, the MTW mechanism is studied as a mechanism to manipulate event queues under the situation of limited memory space. When we simulate a huge circuit with a large number of input event vectors, utilizing the limited local memory for event queues is an important concern on the SIMD machine. In order to prevent event queues from overflowing without loss of performance, the MTW mechanism is applied to the conservative and optimistic protocols.

As the target machine, the MasPar MP-1 [53] is used for the experiments. The MP-1 has 16K processors and a 16KB local memory per processor. Since the SIMD machine has massively parallel processors and supports faster global synchronization, it is appropriate to simulate logic circuits in parallel by assigning a gate to a processor. The model used for the experiments is the same as that in Chapter 6. Each processor is assigned only one logical process.

In Section 7.1, the relationships of the critical path length with another machine-independent factor are studied.
Section 7.2 presents the effect of null messages on the parallelism of the conservative protocol and shows a refined conservative protocol using the BTW mechanism. Section 7.3 shows the MTW mechanism as a tool for managing memory space. Concluding remarks are in Section 7.4.

7.1 Critical Path Length

Producing the proper output results within a short time is the most important criterion for reducing the simulation cost of huge application problems. Total execution time is the wall clock time spent for the simulation, which is measured from the beginning of the simulation until the simulation is complete. The following simple formula expresses the total execution time, T, as a function of machine-dependent and machine-independent factors:

T = (number of simulation cycles) x (execution time per simulation cycle),

where the number of simulation cycles is a machine-independent factor and the execution time per simulation cycle depends on the machine.

Figure 7.2: The Relationship between the Length of Critical Path and the Value of CG ((a) Synchronous Protocol, (b) Conservative Protocol, (c) Optimistic Protocol)

Figure 7.3: Parallelism on the Conservative Protocol Excluding Null Messages (C1908 Circuit)

7.2 Parallelism of Simulation

Parallelism of simulation is another important factor in simulation performance. This factor is the ratio of the active processes to the total processes at a given point while simulating an application problem. The parallelism changes dynamically, depending on the system state and event processing, while the simulation runs. In this section, we study the patterns of parallelism for the three simulation protocols and the effect of null messages on the parallelism of the conservative protocol.

Figure 7.3 shows the parallelism of the conservative protocol on the C1908 circuit. The number of input event vectors is 200. The active processes which execute events are included in the parallelism, and the null messages are excluded. As we see in the figure, there are three different parts in the parallelism of the simulation. The first 200 simulation cycles produce the highest parallelism. The 200 input event vectors enqueued into the circuit are processed and propagated to other processes in this first part. The middle part of the simulation, between cycles 400 and 1000, is the interval of ongoing event processing. According to the rule of the protocol and the characteristics of a given circuit, the state of activation varies. Finally, the last part shows the tail of the simulation, in which the remaining events are processed. Although the locations of these three parts vary depending on the characteristics of circuits and protocols, the general trend of parallelism contains the three parts while simulating a combinational circuit.

Figure 7.4: Parallelism on the Optimistic Protocol (C1908 Circuit)

As shown in Figure 7.4, the optimistic protocol also has the three different parts of the parallelism level.
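Measured this way, the parallelism of a cycle is simply the fraction of processes that were active in that cycle. A small C sketch of the bookkeeping is shown below; the structure and field names are illustrative, and whether processes activated only by null messages are counted is made a parameter, matching the two variants reported in this section.

/* Sketch: per-cycle parallelism = active processes / total processes.
 * 'include_null' selects whether processes that were active only for
 * a null message are counted. */
typedef struct {
    int executed_event;     /* 1 if the process executed a real event */
    int sent_null_only;     /* 1 if it was active only for a null msg */
} proc_state_t;

double cycle_parallelism(const proc_state_t *procs, int total, int include_null)
{
    int active = 0;
    for (int i = 0; i < total; i++) {
        if (procs[i].executed_event ||
            (include_null && procs[i].sent_null_only))
            active++;
    }
    return (double)active / (double)total;
}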
Because the optimistic protocol activates processes without considering the causality effect, its parallelism is higher than that of the conservative protocol.

Table 7.4 presents the average parallelism of some ISCAS '85 circuits. Since the C6288 circuit has a long critical path length compared to its total number of gates, it produces memory overflow in the event queues for a large number of input event vectors. To obtain the proper parallelism excluding the effect of MTW, we use a relatively small number of input event vectors, 30. While the C1355 circuit has the same trend as the C1908 circuit, the C6288 circuit gives a gentle slope after the feeding part since it has a long critical path. The C7552 circuit produces a much longer tail than the other circuits. This property may result from its larger number of total gates compared to the other circuits. To compare the differences in parallelism according to the characteristics of the circuits, Figures 7.5 and 7.6 show the parallelism on the C6288 and C7552 circuits, respectively. The C1355 circuit has almost the same behavior as the C1908 circuit, so we omit its figure.

Table 7.4: Average Parallelism

                             Conservative
  Circuits  Synchronous   Without Null  With Null   Optimistic
  C1355      0.017030       0.123270    0.444010     0.193137
  C1908      0.015966       0.127395    0.442529     0.218135
  C6288      0.067693       0.157134    0.242911     0.188740
  C7552      0.012733       0.069311    0.190080     0.094628

Figure 7.5: Parallelism on the Conservative Protocol Excluding Null Messages (C6288 Circuit)

Figure 7.6: Parallelism on the Conservative Protocol Excluding Null Messages (C7552 Circuit)

7.2.1 A Refined Conservative Protocol

The conservative protocol uses null messages to facilitate processing as well as to avoid deadlock situations. Figure 7.7 shows the parallelism of the conservative protocol for the same simulation as in Figure 7.3, except that Figure 7.7 includes the null messages in the parallelism. The two figures show that the null messages of the conservative protocol greatly increase the degree of parallelism. In other words, the conservative protocol generates many null messages between processes even though the number of active processes is small. If the target machine system has a high cost for interprocessor communication, many null messages yield a longer communication time. On the MP-1, however, interprocessor communication is performed by a single instruction through a circuit-switched style network organized as a three-stage hierarchy of crossbar switches [14]. Thus, the high parallelism due to the null messages does not make a big difference in the performance result on the MP-1. The null message is a significant factor for a machine architecture which has a long communication time per message, such as a MIMD machine or a distributed system.

In the conservative protocol, null messages are used to activate the child processes even when a process does not have events to send. The process sends null messages with the timestamp of its current Local Clock to the child processes. When the child processes receive the null messages, they can be sure that there will be no incoming event with a smaller timestamp than the null messages' timestamp.
If they have unprocessed events whose timestamps are smaller than that of the null messages and the events satisfy the activation condition, the events can be executed. However, when a process generates null messages frequently, sending the null messages may cause bad effects. When the child processes do not have unprocessed events or they are waiting for events on other input ports, the frequently incoming null messages keep the processes busy without any useful processing, spending communication time. Another case is the situation in which the null messages waste execution time without accelerating event computations. In this situation, null messages can be cascaded to descendent processes, but there are only a few events to be executed on the descendent processes. Note that the purpose of sending a null message to the child processes is that the child processes can execute their events up to the null message's timestamp. However, if there are very few events to be processed, most of the null messages perform unnecessary work, increasing the communication time without any benefit to the simulation progress.

Figure 7.7: Parallelism on Conservative Protocol Including Null Messages (C1908 Circuit)

The BTW (Bounded Time Window) Algorithm

To reduce the overhead due to null messages in the conventional conservative protocol, we propose a refined conservative protocol. This protocol employs a window per process to restrict the null messages to be sent according to the values of their timestamps. That is, null messages whose timestamps fall within the window are not sent. The window is called the Bounded Time Window (BTW). This window can reduce the frequent sending of null messages.

A characteristic of the refined protocol is that the window of the BTW mechanism does not depend on a global characteristic of the simulation, but on a locality characteristic of each process. Since the BTW of a process starts from the timestamp of the last sent event, each process has a different window interval with a different lower bound. Thus, each process can avoid sending null messages which reside within its window. By using the BTW per process, we locally control the number of null messages to be sent so that all processes can proceed at their own pace without waiting for any global control or for others' late processing. Figure 7.8 describes the BTW algorithm in detail.

Experimental Results

Table 7.5 compares the performance results of the refined conservative protocol with those of the original conservative protocol. The average parallelism including null messages is measured by varying the BTW size of the refined conservative protocol. The numbers of input event vectors used are 200 and 1400. The result of the original conservative protocol is the special case of the refined conservative protocol when the BTW size is 0. As the BTW size increases, the average parallelism decreases without any large difference in the number of simulation cycles. This implies that the BTW reduces unnecessary null messages by preventing processes from sending them repeatedly. Figure 7.9 shows the parallelism as a function of the BTW size on the C1908 circuit. Although the BTW mechanism decreases the parallelism, the total execution time of the refined protocol on the MP-1 is almost the same as that of the original conservative protocol.
This is because interprocessor communication on the MP-1 is performed by a single instruction by means of the built-in router, and the extra communication overhead due to the unnecessary null messages does not cost much on the MP-1. However, when the BTW is applied to a MIMD machine or a distributed system, the refined conservative protocol may achieve a better performance improvement by reducing the number of null messages.

integer GVT
plural integer LVT, Local_Clock, Link_Clock[No_InputPorts]

Allocate processes into processors.
Read input event vectors and put them into initial processes.
Repeat
    Compute LVT per process.
    Get Local_Clock from the minimum Link_Clock of the input ports per process.
    if it is a new Local_Clock then Set the Local_Clock changed.
    Get GVT from the minimum LVT over all processes.
    if ( LVT <= Local_Clock || LVT == GVT ) then begin
        Activate the process.
        Compute the event.
        Generate an event with new data and a timestamp based on the current status.
        Check whether the event has the same data that was sent before.
        if yes then Set the event as a null message.
        else Set the event as a real message.
    end
    else if Local_Clock is changed then make a null message with the timestamp of Local_Clock.
    end if
    if the event is a null message then
        if ( timestamp of the event - timestamp of the last sent event > BTW ) then
            Send the null message to the successors.
        else Discard the null message.
    else Send the real message to the successors.
    Activate the processes that have received new events.
    Change the Link_Clock according to the incoming events.
    Insert the new events into event queues.
Until ( No more events in all event queues. )

Figure 7.8: The BTW Algorithm

Table 7.5: Performance Comparisons on the Refined Conservative Protocol

                  Conservative Protocol      Refined Conservative Protocol
  n = 200       Without Null   With Null    BTW=100    BTW=200    BTW=300
  Parallelism     0.182373     0.479925     0.470730   0.409726   0.407387
  Sim. Cycles       1221         1221         1225       1231       1232

  n = 1400      Without Null   With Null    BTW=100    BTW=200    BTW=300
  Parallelism     0.198162     0.664161     0.641622   0.496218   0.493873
  Sim. Cycles       7932         7932         7933       7940       7941

Figure 7.9: Average Parallelism of the Refined Conservative Protocol

7.3 MTW Mechanism for Memory Management

One of the important considerations in parallel discrete event simulation is the memory size needed to simulate an application problem. Each process requires an amount of memory space to store incoming events in its input event queues. Depending on the employed protocol and the number of processes per processor, the required memory size may vary. If the size of the application is large or the parallel system does not have enough memory, the limitation of memory might be a significant problem. On SIMD machines, especially, the local memory size per processor is relatively small compared with MIMD machines or distributed systems. Thus, controlling the required memory size per process without loss of performance is an important issue for utilizing the limited system capacity efficiently.

The MTW (Moving Time Window) mechanism is employed to handle the memory problem on the SIMD machine. In [75], MTW was used for Time Warp to solve the over-optimistic processing and storage overhead problems as a hybrid parallel protocol scheme. In this chapter, we use MTW for a different purpose: to halt the advance of events in order to prevent overflows in event queues. Since the parallel protocols are implemented by array structures or CBSQs on the MP-1, each event queue has a size limit once it is statically allocated. If there are more events than the size of the event queue, an overflow occurs in the queue. By preventing events that are too far advanced from being sent, the parallel protocols can avoid overflows in event queues and are able to simulate huge logic circuits even on machines having the memory limitation. It is employed for the simulation of the conservative and optimistic
Since the parallel protocols are implemented by array structures or CBSQ on MP-l, each event queue has a limitation of queue size after statically being allocated. If there are more events than the size of event queue, an overflowing situation occurs in the queue. By controlling the advanced events not to be sent, parallel protocols can avoid overflows in event queues and are able to simulate huge logic circuits even on machines having the memory limitation. It is employed for simulation of conservative and Optimistic 183 Circuit Number of Input Event Vectors 200 I 400 I 600 l 800 I 1000 [ 1200 l 1400 01355 00 00 20000 20000 19000 19000 19000 C7552 10000 9000 9000 8000 8000 8000 8000 Table 7 .6: MTW Sizes Applied to the Experiments protocols. The MTW mechanism operates with the current GVT value and a MTW size If the LVT of a process is within the interval of [GVT, GVT + M TW size], the process is active and executes an event. If the LVT is beyond the interval, the process does not execute an event. Experiments with a large number of input vectors need to use the MTW mech— anism to prevent overflow situations on MP—l. Table 7.6 shows the applied MTW size to the experiments. When the numbers of input event vectors are 200 and 400 on C1355, the simulation can run without controlling the queue overflow. As the number of input event vectors increases, the used MTW size decreases to prevent the overflow situation in event queues. Figures 7.10 and 7.11 show the relationship between the number of input vectors and either the number of simulation cycles or the total execution time when the MTW size in the table is applied the optimistic protocol on C1355 and C7552, respectively. Although the MTW size decreases to control the queue overflow as shown in Table 7.6, the linear relationship can be con- tinuously kept in the figures. For example, the input event vectors 600 has a big difference in the MTW size on the C1355 circuit which changes from 00 to 20000. However, the corresponding performance differences are not significant as shown in Figures 7.10 and 7.11: the figures show a small variation of the linear lines between 400 and 600, producing a small increase on the lines of the number of simulation 184 l l l l l 20000 C7552 +— C1355 é;— 15000 Number of Simulation 10000 Cycles 0 l l l l l 200 400 600 800 1000 1200 1400 Number of Input Event Vectors Figure 7 .10: Relationship between the Number of Simulation Cycles and the Number of Input Event Vectors with the MTW Mechanism cycles and the total execution time. Thus, applying the MTW mechanism to parallel asynchronous protocols can achieve the control effect for the event queue overflow without significant increases of the number of simulation cycles and the total execu- tion time. The experimental results also present that the applied MTW size depends on the characteristics of circuits. 7 .4 Summary We study two machine-independent factors and their effects on performance of parallel protocols for logic circuits. The first factor is the critical path length of logic circuits. By showing a linear relationship with the CG value, we discuss how the CG value and the total execution time are related to the length of critical path. As the second factor, the parallelism of simulation is studied. 
With some minor variations depending on application problems, the conservative protocol shows the highest parallelism when it uses null messages. The optimistic protocol is next, and the synchronous protocol shows the lowest parallelism. For the conservative protocol, if null messages are not counted toward parallelism, the parallelism is lower than that of the optimistic protocol. We propose a refined conservative protocol that can reduce unnecessary null messages without loss of performance. The refined protocol may achieve a better improvement effect on MIMD machines or distributed systems, which have a high communication overhead.

Figure 7.11: Relationship between the Total Execution Time and the Number of Input Event Vectors with the MTW Mechanism

The last factor that we considered is the MTW mechanism. When we simulate a huge circuit with a large number of input event vectors, the MTW mechanism is used to utilize the limited local memory for event queues. By applying the MTW mechanism to the conservative and optimistic protocols, we could prevent event queues from overflowing without loss of performance.

Chapter 8

Conclusion

In this thesis, we have studied and implemented parallel asynchronous protocols of PDES on various distributed-memory systems for the application domain of VLSI logic circuits. The target machine architectures are the IBM SP2 as a parallel multiprocessor system, a cluster of DEC Alpha workstations as a distributed system, and the MasPar MP-1/MP-2 as a SIMD machine. On the various types of machine architecture platforms, we have investigated several machine-dependent and machine-independent factors in order to improve the performance of parallel asynchronous protocols. As for the machine-dependent factors, studied are the number of processors, the performance of machine instructions or operations, the interprocessor communication overhead, and the local memory space. As the machine-independent factors, considered are computation granularity, event scheduling strategies, protocol characteristics, critical path lengths, parallelism of simulation, etc. In this chapter, we summarize the contributions of this thesis work to parallel discrete event simulation on parallel and distributed systems.

• The Major Research Factor: Computation Granularity

As one of the important factors for optimistic protocols on parallel and distributed systems, computation granularity is proposed in this thesis. By combining several event granularities, the computation granularity can adjust the amount of computation between any two interprocessor communication points. Although many researchers have studied the event granularity as a significant factor which affects the performance of optimistic protocols of PDES, the computation granularity proposed in this thesis has been neglected. The major functions of the computation granularity are to reduce the communication overhead between processors and to control the degree of optimism in processing events on parallel optimistic protocols. By controlling the computation granularity appropriately, the performance of parallel protocols can be optimized on parallel and distributed systems.
• Analytic Prediction of Computation Granularity

An analytic model is developed to study the effects of the computation granularity on the performance of parallel protocols on parallel and distributed systems. Based on the analytic model, prediction formulas for the number of simulation cycles and the total execution time are derived. The formulas can analyze the performance as a function of computation granularity, especially when the LP ratio is greater than one. A global synchronization is used in the model to compute GVT. Our analytic results, verified by the experimental results, show that the computation granularity significantly affects the performance results on parallel and distributed systems.

• Interpretation of Effects of Computation Granularity

Based on the experimental results obtained on the IBM SP2, the effects of computation granularity are interpreted in various aspects of simulation performance. The performance aspects considered are execution activity, rollback situation, communication overhead, and simulation progress. Execution activity shows the actual amount of event computation between communication points, which is comparable to the possible event computation on logical processes. The rollback situation is interpreted through rollback frequencies and rollback rates. To study communication overheads with nonblocking message communication, the computation fraction of the total execution time is used as the performance metric. Simulation progress shows the overall progress of the simulation status until the simulation terminates. The speedup of parallel processing is also presented as a function of the number of processors.

• Implementations on Various Types of Machine Platforms

Several machine architecture platforms on parallel and distributed systems are used to study parallel protocols: the SIMD machine platform, and a distributed system and a multiprocessor computer for the MIMD machine platform.

1. SIMD Machines: Three parallel protocols, the synchronous, conservative, and optimistic protocols, are implemented by using MPL on the massively parallel SIMD machines MasPar MP-1 and MP-2. The performance results are compared to those on the Connection Machine CM-2 in terms of the major performance criteria of PDES. The MP-1 achieves better performance results than the CM-2. To improve the performance of PDES on the SIMD machines, the CBSQ data structure and a refined conservative protocol are proposed.

2. Distributed Systems: On a cluster of DEC Alpha workstations interconnected by a DEC GIGASwitch, the optimistic protocol is implemented with different design perspectives from those on the SIMD machines. Since the number of processors (or machine nodes) is small, each processor has to be assigned more than one logical process to accommodate a large application problem. The distributed system allows processors to maintain logical processes asynchronously and uses PVM as the message-passing mechanism between processors. Event queues are distributed to each input port of the logical processes, and each processor has a central scheduling priority queue to schedule the events of the assigned logical processes.

3. Multiprocessor Computers: The optimistic protocol is implemented on the IBM SP2 by using the message-passing library MPI to support asynchronous processing over processors. The SP2 provides the High-Performance Switch with the topology of a bidirectional MIN and buffered wormhole routing.
Moreover, due to its high scalability and good processor capability, the SP2 is a prominent machine for parallel simulation with a large application. The event queue manipulation is similar to that on the DEC Alpha distributed system, with a slightly enhanced data structure. Asynchronous GVT computation based on a token passing algorithm is used. To reduce communication overheads, we employ nonblocking send routines.

• Study of Event Scheduling Schemes

To develop a good event-scheduling scheme, several considerations are necessary: the data structures of event queues, control of the communication overhead, and progress control of individual logical processes. In this thesis, we propose two event scheduling schemes for the optimistic protocol: the BPEC scheme and the BPVTW scheme. The BPEC scheme limits the number of executed events per logical process between interprocessor communications by using the BPC parameter. The BPVTW scheme controls the progress of processes in virtual time by using the BPW parameter, and prevents processes from executing events far ahead of other processes. According to the experimental results, both schemes show good performance when the computation granularity is adjusted properly. Otherwise, communication overheads or excessive rollback situations may significantly degrade the simulation performance. The results also indicate that the progress balancing among logical processes has significant effects on obtaining the optimal performance.

• Study of Machine-Independent Factors

The performance effects of the characteristics of parallel protocols and application problems, which are independent of machine characteristics, are also investigated. The machine-independent factors considered are the length of the critical path, the parallelism of simulation, the MTW mechanism, the total number of logical processes, the number of input event vectors, and data structures. The relationships among these factors are studied by using experimental results.

Bibliography

[1] T. Agerwala, J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias, and M. Snir, "SP2 system architecture," IBM Systems Journal, vol. 34, no. 2, pp. 152-184, 1995.

[2] V. D. Agrawal and S. T. Chakradhar, "Performance analysis of synchronized iterative algorithms on multiprocessor systems," IEEE Transactions on Parallel and Distributed Systems, vol. 3, pp. 739-746, November 1992.

[3] R. Bagrodia, Z. Li, V. Jha, Y. Chen, and J. Cong, "Parallel logic level simulation of VLSI circuits," in Winter Simulation Conference (WSC '94), pp. 1354-1361, December 1994.

[4] M. L. Bailey, "How circuit size affects parallelism," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 11, pp. 208-215, February 1992.

[5] M. L. Bailey, "A time-based model for investigating parallel logic-level simulation," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 11, pp. 816-824, July 1992.

[6] M. L. Bailey, "A delay-based model for circuit parallelism," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 12, pp. 1903-1912, December 1993.

[7] M. L. Bailey, J. V. Briner, and R. D. Chamberlain, "Parallel logic simulation of VLSI systems," ACM Computing Surveys, vol. 26, pp. 255-294, September 1994.

[8] R. Baldwin, M. J. Chung, and Y. Chung, "Overlapping window algorithm for computing GVT in Time Warp," in Proceedings of the 11th International Conference on Distributed Computing Systems, May 1991.

[9] P. Banerjee, Parallel Algorithms for VLSI Computer-Aided Design. PTR Prentice Hall, 1994.

[10] J. Banks and J. S. Carson, Discrete-Event System Simulation. Prentice Hall, 1984.

[11] G. Bell, "Ultracomputers: A teraflop before its time," Communications of the ACM, vol. 35, pp. 27-47, August 1992.

[12] S. Bellenot, "Global virtual time algorithms," in Proceedings of the SCS Multiconference on Distributed Simulation, vol. 22, pp. 122-127, January 1990.

[13] C. Benveniste and P. Heidelberger, "Parallel simulation of the IBM SP2 interconnection network," in Winter Simulation Conference (WSC '95), pp. 584-589, December 1995.

[14] T. Blank, "The MasPar MP-1 architecture," in Proceedings of the 35th IEEE COMPCON Spring 1990, pp. 20-24, February 1990.

[15] F. Brglez, D. Bryan, and K. Kozminski, "Combinational profiles of sequential benchmark circuits," in Proceedings of the IEEE International Symposium on Circuits and Systems, May 1989.

[16] F. Brglez, P. Pownall, and R. Hum, "Accelerated ATPG and fault grading via testability analysis," in Proceedings of the IEEE International Symposium on Circuits and Systems, June 1985.

[17] R. Brown, "Calendar queues: A fast O(1) priority queue implementation for the simulation event set problem," Communications of the ACM, pp. 1220-1227, October 1988.

[18] R. E. Bryant, "Simulation of packet communications architecture computer systems," Technical Report MIT-LCS-TR-188, Massachusetts Institute of Technology, 1977.

[19] C. D. Carothers, R. M. Fujimoto, and P. England, "Effect of communication overheads on Time Warp performance: An experimental study," in Proceedings of the 8th Workshop on Parallel and Distributed Simulation, pp. 118-125, July 1994.

[20] R. D. Chamberlain and M. A. Franklin, "Hierarchical discrete-event simulation on hypercube architectures," IEEE Micro, vol. 10, pp. 10-20, August 1990.

[21] K. Chandy and J. Misra, "Distributed simulation: A case study in design and verification of distributed programs," IEEE Transactions on Software Engineering, vol. 5, pp. 440-452, September 1979.

[22] K. Chandy and J. Misra, "Asynchronous distributed simulation via a sequence of parallel computations," Communications of the ACM, vol. 24, pp. 198-206, April 1981.

[23] E. Choi and M. J. Chung, "An important factor for optimistic protocol on distributed systems: Granularity," in Winter Simulation Conference (WSC '95), pp. 642-649, December 1995.

[24] E. Choi, M. J. Chung, and Y. Chung, "Comparisons and analysis of massively parallel SIMD architectures for parallel logic simulation," in Sixth International Parallel Processing Symposium, pp. 671-674, March 1992.

[25] E. Choi and D. Min, "Event scheduling schemes for Time Warp on distributed systems," in Winter Simulation Conference (WSC '96), pp. 661-668, December 1996.

[26] P. Christy, "Software to support massively parallel computing on the MasPar MP-1," in Proceedings of the 35th IEEE COMPCON Spring 1990, pp. 29-33, February 1990.

[27] Y. Chung, Logic Simulation on Massively Parallel SIMD Machines. PhD thesis, Michigan State University, 1991.

[28] Y. Chung and M. J. Chung, "Time Warp for efficient parallel logic simulation on a massively parallel SIMD machine," in Proceedings of the Tenth Annual IEEE International Phoenix Conference on Computers and Communications, March 1991.

[29] M. J. Chung and Y. Chung, "Performance estimation based on gate-to-processor ratio in parallel logic simulation," in Proceedings of the 1992 International Conference on Parallel Processing, pp.