This is to certify that the dissertation entitled A SCALABLE SNOOPY CACHE-COHERENCE SCHEME ON A MULTIPLE BUS MULTIPROCESSOR presented by Tyan-Shu Jou has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science, Michigan State University.

A SCALABLE SNOOPY CACHE-COHERENCE SCHEME ON A MULTIPLE BUS MULTIPROCESSOR

By

Tyan-Shu Jou

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1993

ABSTRACT

A SCALABLE SNOOPY CACHE-COHERENCE SCHEME ON A MULTIPLE BUS MULTIPROCESSOR

By Tyan-Shu Jou

Maintaining data coherence among multiple caches is one of the most important problems in building shared-memory multiprocessors. Snoopy and directory coherence schemes are the two main approaches for solving the cache-coherence problem on shared-memory multiprocessors. Traditional snoopy schemes work efficiently on small-scale, single-bus-connected systems, but the bottleneck caused by the shared bus significantly limits the allowable number of processors. A system which adopts a directory scheme can scale relatively better due to the possibility of using high-bandwidth interconnection networks, but at the expense of longer memory access latency.

This dissertation proposes a scalable, snoopy cache-coherent multiprocessor based on a multiple bus network, and introduces a novel concept of partial snooping, which can simplify the snooping mechanism when the number of processors and busses is large. To reduce the hardware complexity, each processor snoops on a dynamically changing subset of the busses. The new scheme extends the number of processors that can be supported using a simple snooping mechanism before more complex consistency mechanisms become necessary. The physically-distributed, logically-shared memory model is chosen to let processors take advantage of local memory accesses for private data. The proposed coherence solution is particularly attractive considering the high bandwidth and channel-tuning capability of fiber-optic networks.

The performance of the proposed scheme versus a directory scheme is evaluated in three ways: qualitative discussion, simulation experiments, and queueing analysis. The discussion shows that the proposed system can support low access latency and high communication bandwidth at a moderate hardware cost. The comparative evaluation by simulation shows that even though the experiments were designed to give advantages to the directory scheme, the proposed scheme can usually perform better. The simple and effective analytical method can evaluate systems with up to hundreds of processors in seconds.

To my parents, for the love they have given in thirty years

ACKNOWLEDGMENTS

I have been fortunate to have the help and support of many people while pursuing my Ph.D., and I would now like to give them my sincere thanks. First and foremost, my thanks go to my thesis advisor, Dr. Richard Enbody.
His intelligent ideas and our long discussions have laid the foundation for my work and guided me through the years of my Ph.D. program. Special thanks also go to Dr. Lionel M. Ni for all his support and advice. I am grateful for the invaluable suggestions provided by Dr. Philip McKinley.

I would not have been able to complete the Ph.D. program if it were not for the financial support from the teaching assistantship assigned by Department Chairperson, Dr. Anthony Wojcik. In addition, I highly appreciate the guidance and trust of the professors that I have worked for. They are Professors Chester R. Forsythe, Donald Weinshank, Matt Mutka, and M. V. Ramakrishna. I learned a lot and had a good time being a teaching assistant for each of them. I especially treasure the friendship with Professor Forsythe.

Heartfelt thanks also go to my parents and parents-in-law for the love and care they have given, and to the family of my elder brother, Tien-Yen, for their continuous encouragement. My wife Tswei-Ping and my son Howen gave my life outside of computer science fullness and meaning; they put up with my absences, and they cheered me on when I needed it. I would have had no chance to complete my studies without their love and support.

Over the years, my officemates in Engineering Building Room #324 and many other friends of mine not listed here have all encouraged me and given me technical assistance on many occasions. Thanks go to all of them. Finally, I thank the authors of the Tango simulation system, Stephen Goldschmidt and Helen Davis at Stanford University, for offering us the code.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Motivation
1.2 Dissertation Outline
2 Cache Coherence Problems
2.1 Introduction
2.2 Directory-Based Coherence Protocols
2.3 Snoopy Cache Coherence Protocols
2.4 Coherence Schemes on Network Architectures
2.5 Consistency Ordering
3 Bus-Based Networks
3.1 Introduction
3.2 A Classification of Bus-Based Networks
3.3 Applying Snoopy Schemes to Multiple Bus Networks
3.4 Optical Network Technology
3.5 Embedding Topologies in Optical Networks
3.5.1 Single-Bus Networks
3.5.2 Point-to-Point Connection
3.5.3 The k-ary n-cube Topology
3.5.4 Multistage Interconnection Networks
3.5.5 The OMP Network
3.5.6 Optical Networks in Distributed-Memory Systems
3.5.7 Using Optical Networks
4 The Tango Tracing and Simulation System
4.1 Introduction
4.2 Trace-Driven vs. Execution-Driven Simulation
4.3 Tango Simulation Approach
4.4 Experience in Using Tango
5 Address-Separated Multiple Bus Network
5.1 Introduction
5.2 Multiple Bus Networks
5.3 Performance Evaluation by Simulation
5.4 Performance Evaluation by Queueing Models
5.4.1 An Analytical Model for Model 6
5.4.2 An Analytical Model for Model 1
5.4.3 Analytical Results
5.5 Summary
6 The Cache Coherent Architecture
6.1 The MSU system
6.2 Coherence Protocols
6.2.1 Under a Sequential Consistency Model
6.2.2 Under a Weak Consistency Model
6.3 Qualitative Performance Expectation
6.3.1 Pros and Cons of Coherence Schemes
6.3.2 Impact from Memory References
6.3.3 Performance Expectation
6.4 Implementation by Optical Networks
7 Performance Evaluation
7.1 Simulated Directory-Based System
7.2 Architectural and Timing Assumptions
7.3 Benchmark Programs
7.4 Simulation Results
7.4.1 Comparison of Full-Snooping MSU and Dir∞NB
7.4.2 Various Degrees of Snooping on MSU
7.4.3 Impact of Faster Communication
7.4.4 Impact of Shorter Directory Access Latency
7.4.5 Impact of Memory Placement
7.4.6 Impact of Larger Cache Size
7.4.7 Performance on Various System Sizes (Scalability)
7.5 Mean-Value Performance Analysis
7.5.1 Workload Parameters
7.5.2 Mean Times for the MSU System
7.5.3 Mean Times for the Directory-Based System
7.5.4 Parameter Values Used in the Experiments
7.5.5 Analytical Results
7.6 Summary
8 Conclusions
A An MVA Algorithm for a Network of Queues with a State-Dependent Server
B Workload Parameters Used in Analysis
BIBLIOGRAPHY

LIST OF TABLES

6.1 Actions for weak consistency when an access comes in (from the local processor or from remote processors) while an access for the same block is pending locally. (R: read access. W: write access, which may be either a write miss or a write hit.)
6.2 Pros and cons of the three cache coherence schemes.
6.3 Impact of reference patterns and system parameters on the performance of cache-coherent systems.
7.1 Default architectural assumptions.
7.2 Default memory access latency for primitive operations in processor cycles.
7.3 Latency of various operations in the absence of bus contention and cache line replacement.
7.4 Descriptions of the benchmark programs and the default problem sizes.
B.1 Probabilities used in the analytical experiments.

LIST OF FIGURES

1.1 (a) A shared memory multiprocessor with caches. (b) A NUMA architecture with caches.
2.1 An example of operations for a cache write hit under a directory-based coherence scheme.
2.2 An example of write hit invalidation on a snoopy bus.
2.3 In a sequentially ordered system, the Y will be printed as 1 by Processor 2. In a weakly ordered system, the value of Y may be printed as 0.
2.4 An example to distinguish the sequential consistency and others. In a sequentially ordered system, the X will be printed as 1. In other systems, the X may be printed as 0.
3.1 Four topologies of bus-based multiprocessors.
3.2 A 4 × 4 star coupler with input signals under different frequencies.
3.3 The FOX architecture: tunable transmitters are used to determine destinations.
3.4 A star topology with one wavelength (a) is logically equivalent to a single-bus system (b).
3.5 One transmitter and one receiver are used in one node. Logically this network is a point-to-point directed ring topology.
3.6 Two transmitters and two receivers are used in one node for this example. Logically this network is a multiple-bus-connected k-ary n-cube topology.
3.7 Two transmitters and two receivers, each attached to different channels, are used in one node. Logically this network is a MIN.
3.8 (a) Configuration of the Orthogonal Multiprocessor (OMP). (b) MHMT representation of OMP.
3.9 Configuration of a distributed-memory system connected by a star-coupled optical network with tunable transmitters and fixed receivers.
4.1 Structure of a trace-driven simulator.
4.2 Structure of Tango, an execution-driven multiprocessor simulator.
4.3 Tango application process state diagram.
4.4 Production of a simulation system using Tango.
5.1 Three models of the Address-Separated multiple bus network.
5.2 Multibus Models 4, 5 and 6.
5.3 Ratio of execution time of Model 1 (single M-B connected, N × B × B) to Model 5 (fully connected, N × ∞ × ∞).
5.4 Ratio of the execution time of Model 6 (fully connected, N × ∞ × B) to Model 5.
5.5 Difference in execution time ratio between Model 1 and Model 6.
5.6 Execution time ratio of different values of r on Model 4 (fully connected, N × rB × B) to Model 5 when N = 16. The r = 1 curve also represents Model 1, and the r = ∞ curve also represents Model 6.
5.7 A queueing model for Model 6.
5.8 State transition diagram for the network servers in the queueing model of Model 6.
5.9 A queueing model for Network Model 1 (Address-Separated multiple bus network).
5.10 An MVA algorithm for the analytical solution of Model 1.
5.11 Analytical performance results for Models 1 and 6 when N = 16 and 1/μ1 = 10.
5.12 Analytical performance results for Models 1 and 6 when N = 128 and 1/μ1 = 10.
6.1 An illustration of the dynamic snooping concept of MSU. Each node has to snoop on the bus(ses) that it has blocks cached from.
6.2 The MSU system with Nbus = 8 and Nsnp = 4. The first bus is for private and read-only memory accesses (Nread = 1); every node has a fixed receiver connected to it. Each node also has a fixed receiver connected to a bus for receiving requests for the local main memory, and two tunable receivers which can connect to any two of the busses at a time.
6.3 Transition diagram of the Berkeley Ownership Protocol.
6.4 Node architecture for implementing weak consistency.
6.5 The expected comparative performance of the two schemes under different system sizes.
6.6 Optically-connected MSU system. One tunable transmitter, two fixed receivers and several tunable receivers are used in each node.
7.1 The simulated network for a directory-based system with Nnodes = 4 and Nbus = 2.
7.2 Average shared data cache hit rate for the three benchmarks for various numbers of busses when Nnode = 32.
7.3 Normalized execution time of full-snooping MSU and Dir∞NB under various numbers of busses for the Mat benchmark.
7.4 Normalized execution time of full-snooping MSU and Dir∞NB under various numbers of busses for the Gauss (a) and Heat (b) benchmarks.
7.5 Effect of various degrees of partial snooping of MSU when Nnode = 32 on normalized execution time (a) and cache hit rate (b) for the Mat benchmark under default parameters. The dashed line in (a) shows the execution time of Dir∞NB with Nsnp busses for comparison.
7.6 Effect of various degrees of partial snooping of MSU when Nnode = 32 on normalized execution time (a) and cache hit rate (b) for the Gauss benchmark under default parameters. The dashed line in (a) shows the execution time of Dir∞NB with Nsnp busses for comparison.
7.7 Effect of various degrees of partial snooping of MSU when Nnode = 32 on normalized execution time (a) and cache hit rate (b) for the Heat benchmark under default parameters. The dashed line in (a) shows the execution time of Dir∞NB with Nsnp busses for comparison.
7.8 Comparison of full-snooping MSU and Dir∞NB for different numbers of busses when bus communication latency is reduced to half. Normalized execution times for the Mat (a) and Heat (b) benchmarks are shown.
7.9 Effect of various degrees of partial snooping of MSU when the communication latency is reduced to half. Normalized execution time when Nnode = 32 for Mat (a) and Heat (b) are shown. The dashed lines show the execution time of Dir∞NB with Nsnp busses.
7.10 Execution time of Dir∞NB when the directory look-up and update latency is reduced to half for Mat (a) and Heat (b). Dashed lines show the execution time under default parameters for comparison.
7.11 Impact of memory placement on the memory-mapped Heat benchmark when Nnode = 32. (a) shows the comparison of full-snooping MSU and Dir∞NB on different numbers of busses. (b) shows the effect of various degrees of partial snooping of MSU.
7.12 Performance of full-snooping MSU and Dir∞NB under various numbers of busses when the cache size is increased to 16K bytes per node for the Mat (a) and Heat (b) benchmarks.
7.13 Effect of various degrees of partial snooping of MSU when the cache size is increased to 16K bytes per node. Normalized execution time (a) and cache hit rate (b) for Mat when Nnode = 32 are shown. The dashed line in (a) shows the execution time of Dir∞NB with Nsnp busses.
7.14 Effect of various degrees of partial snooping of MSU when the cache size is increased to 16K bytes per node. Normalized execution time (a) and cache hit rate (b) for Heat when Nnode = 32 are shown. The dashed line in (a) shows the execution time of Dir∞NB with Nsnp busses.
7.15 Impact of larger problem size with larger cache size. Normalized execution time (a) and cache hit rate (b) for Mat with 200×200 matrices and 16K bytes of cache when Nnode = 32 are shown.
7.16 Impact of larger cache size and better memory mapping. Normalized execution time (a) and cache hit rate (b) for Heat with optimum memory mapping and 16K bytes of cache when Nnode = 32 are shown.
7.17 Performance of MSU and Dir under various numbers of processors (Nnode).
Nbus is set to a fixed fraction of Nnode and, for MSU, Nsnp to a fixed fraction of Nnode. (a) is from the Mat benchmark, and (b) is from memory-mapped Heat.
7.18 Analytical mean time between two memory requests of full-snooping MSU and Dir∞NB with Nnode = 32 under various numbers of busses.
7.19 Analytical mean time between two memory requests for various degrees of partial snooping on MSU when Nnode = 32. The dashed line is for Dir∞NB with Nsnp busses for comparison.
7.20 Analytical mean time between two memory requests for various degrees of partial snooping on MSU when Nnode = 128. The dashed line is for Dir∞NB with Nsnp busses for comparison.
7.21 Performance of MSU and Dir under various numbers of processors (Nnode). Nbus and Nsnp are again set to fixed fractions of Nnode. The p is the fraction of a program that can be parallelized.
A.1 An MVA algorithm for the analytical solution of Model 6.
B.1 Function f(x) = 1 - e^(-0.5x), used in deciding pseum.

CHAPTER 1

Introduction

1.1 Motivation

Parallel systems with a large number of processors offer the possibility of enormous computing power that can never be achieved by a uniprocessor system. With today's advanced microprocessor technology, this power can come at a low cost-performance ratio. However, simply extending existing system designs to large-scale systems may not be feasible, or may lead to significant per-processor performance degradation.

Parallel systems can be divided into shared- and distributed-memory machines. The primary distinction is the programming abstraction presented to the programmer. The shared-memory programming environment presents the abstraction of a single system image which is used for communicating data values and synchronization. In a distributed-memory programming abstraction, the programmer must specify interprocessor communication explicitly, through message passing or memory transfers. It is widely believed that shared memory is a more natural or convenient environment for the programmer. However, efficiently scaling shared-memory multiprocessors to large sizes remains a challenging task for computer architects.

Reducing the long shared-memory access latency on large-scale multiprocessors is one of the most difficult problems [1, 2], and it has become more significant because of the quickly improving microprocessor technology. Distributing main memory to processors, so that each node contains a portion of the global memory, is one technique to shorten local memory access time. This kind of multiprocessor has been classified as a non-uniform-memory-access (NUMA) machine. Examples of distributed shared-memory multiprocessors include the BBN Butterfly, the Stanford DASH system [3] and the MIT Alewife [4] machine. To further decrease memory access time, attaching a private cache to each processor is one solution which not only can shorten the access latency but also can reduce network contention [5] (Figure 1.1). Cache memory can significantly increase system performance because of the temporal and spatial locality of memory references. The possible existence of multiple copies of the same data, however, gives rise to the cache coherence problem. Various coherence protocols have been addressed in the literature, and the general solution is that when a processor needs to modify shared data in its cache, it has to invalidate or update the other processors which are holding the same data.
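A minimal two-processor fragment, written in the same pseudo-C style as the figures in Chapter 2, illustrates the hazard; the fragment is our own illustration and is only meant to show why an invalidate or update action is required.

    Shared int X = 0;        /* initially cached by both processors */

    Processor 1:  X = 5;     /* updates the copy in its own cache */
    Processor 2:  r = X;     /* without a coherence action, Processor 2 may
                                still read the stale value 0 from its cache */

Hardware cache-coherence schemes differ mainly in how Processor 1's write is made visible to Processor 2: by invalidating the other cached copies or by updating them in place.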
Figure 1.1. (a) A shared memory multiprocessor with caches. (b) A NUMA architecture with caches.

Snoopy schemes and directory schemes are two common hardware approaches for maintaining cache coherence. Snoopy schemes efficiently achieve cache coherence on single-bus-connected multiprocessors by letting each tapped processor snoop on the operations of the other processors. However, because all communication takes place on a bus, the bandwidth limitation of the bus prevents building a multiprocessor with many processors, e.g., more than about 30 processors [6, 7]. With today's high-performance processors, a few processors can easily saturate a bus. Directory schemes, on the other hand, can be applied to any interconnection network, and such systems are relatively more scalable. Most large-scale cache-coherent multiprocessors adopt directory schemes [4, 8, 9]. However, the overhead of directory maintenance and look-up time, plus the high latency of the communication networks, are significant drawbacks.

Snoopy schemes require that the broadcast be globally performed in one network cycle so that all processors can have current knowledge of the state of cached data and the broadcasting processor can ensure that all other processors have received the message. This kind of broadcast will be referred to as an atomic broadcast. To apply a snoopy scheme to a large-scale multiprocessor, a high-bandwidth and low-latency network which can also support atomic broadcast or atomic multicast is needed.

A natural extension of the single-bus connection is to employ multiple busses to increase the bandwidth, reliability and scalability at moderate cost. There are two common multiple-bus topologies. The first is the multiple-hop-connected system, such as the Wisconsin Multicube [10] and the Aquarius system [9]. Snoopy schemes cannot directly apply to this topology because the multiple-hop connections cannot efficiently support atomic broadcast. The other topology has a pool of busses and lets all nodes connect to all busses [11, 12]. Applying snoopy schemes to this latter topology is complex and costly, which limits the scalability. Details of the limitation will be presented later in this dissertation.

We believe that a bus-based architecture is still a viable architecture. While many existing parallel systems adopt interconnection networks other than busses, exciting new bus-based multiprocessors are being produced by such workstation companies as SUN and SGI. For example, the new SGI Power Challenge system promises 5.4 GFLOPS peak performance by using 18 processors connected by a backplane with a bandwidth of 1.2 Gigabytes per second.

To solve the dilemma of requiring a large-scale, cache-coherent, shared-memory multiprocessor without the long memory access latency spent in maintaining directories, this dissertation proposes a scalable, snoopy coherent multiprocessor based on a multiple bus network. This multiprocessor extends the performance of bus-based systems such as the SGI Power Challenge. In addition, the architecture can be applied to both electronic and optical networks, so it can take advantage of the high bandwidth of optics as it becomes more affordable.

1.2 Dissertation Outline

The dissertation starts by introducing the cache coherence problem in Chapter 2.
It introduces and compares the snoopy coherence scheme and the directory-based coherence scheme. This chapter also surveys various consistency ordering models as background for the coherence protocol of the proposed multiprocessor.

Many existing multiprocessors are bus-based systems because of the simplicity and short communication latency of busses. As suggested in the title of this dissertation, the proposed system is based on a multiple-bus network. Chapter 3 presents the main features of bus-based networks and the way to apply cache coherence protocols on bus-based systems. A novel classification of bus-based systems is presented to put our system in context with other systems. In addition, because of the progress in lightwave technology, it is now possible to implement a multiple-bus network by using the multiple channels of an optical network. This chapter also surveys the current technology in optical networks and presents various ways to implement multiple busses optically.

Chapter 4 introduces the tracing and simulation system used to collect the performance results of the proposed network and architecture. This system is named Tango, an execution-driven simulator which was developed at Stanford University; it has been used for analyzing the DASH system [8], the SPLASH benchmarks [13], and for many other hardware and software studies.

Chapter 5 presents a multiple-bus network, named the Address-Separated network, on which the choice of the transmitting bus depends on the memory address accessed. A variety of models for this network are studied, with other popular multiple bus models shown for comparison. The performance of this network is evaluated by both simulation and analysis. The results indicate the feasibility of applying it to a multiprocessor.

Chapter 6 introduces a novel snoopy cache-coherent multiprocessor, the MSU system, based on the Address-Separated multiple-bus network. Each node of this system contains a processor, cache memory, a portion of the globally addressable shared memory, I/O devices, and a network controller. The main feature of the system is partial snooping; that is, each node snoops on a subset of all busses. A qualitative discussion of this system is presented to show the advantages and disadvantages prior to quantitative evaluations.

The quantitative performance of the proposed multiprocessor versus a directory-based system is evaluated by using both simulation experiments and queueing analysis. The methodology and results are shown in Chapter 7. The sensitivity of various system parameters is also studied to show that the conclusions about the system do not depend on the particular choice of parameters. The results show that the proposed system is suitable for scaling to large sizes, and that it outperforms the directory-based system.

This dissertation is concluded in Chapter 8. This chapter also reiterates the contributions of the research and indicates future research directions.

CHAPTER 2

Cache Coherence Problems

2.1 Introduction

Recently, the progress of memory technology has allowed the speed of dynamic random access memory (DRAM) to grow at a rate of about 7% per year; however, the speed of processors has grown over 50% annually since 1985 [14]. The uneven growth of speed between processors and memory results in the necessary use of a small, high-speed private memory which operates as a cache for a processor and serves as a buffer to the larger, slower main memory.
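As a rough illustration of how quickly this gap widens (the arithmetic is ours and is only illustrative, derived from the growth rates just quoted): the processor-to-DRAM speed ratio grows by a factor of about 1.50/1.07, roughly 1.4, per year, so over the eight years from 1985 to 1993 the gap grows by approximately

    (1.50 / 1.07)^8 ≈ 15,

that is, by more than an order of magnitude.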
Cache memories are common in uniprocessor systems, but they are even more effective in shared-memory multipro- cessors because they can alleviate not only the access latency, but also the network contention [5]. For such a multiple-cache system, more than one copy of the same data may exist in different caches at the same time. This situation leads to the so called cache coherence problem—the problem of maintaining consistency (coherence) among the multiple copies of data stored in different caches. Shared data may have consistency problems if they are shared by more than one process. Furthermore, if processes are allowed to migrate among different processors, private data may also have consistency problems when the copies of the same data exist in two or more caches due to the migration. In general, shared-memory multiprocessors can be divided into those providing cache coherence in hardware, and those requiring software support. Cache-coherent systems usually refer to the systems using hardware solutions. On the other hand, software supported systems utilize the assistance from compilers and/or programs to either disallow shared-and-writable data to be cached [15] or to add coherence operations prior to accessing any shared data [16, 17]. In some systems, such as the BBN Butterfly, there exist only instruction caches but no data caches to avoid cache coherence problems. Coherence problems also exist in other applications, and similar solutions to cache coherence protocols can be applied. A local-memory coherence system can allow a NUMA-type multiprocessor (such as the BBN Butterfly) to replicate or migrate re- mote data into local memories to take advantage of faster memory accessing time [18]. Some systems, known as cache-only memory architectures, are designed to always mi- grate memory to local processors by hardware [19, 20, 21]. Recently some researchers are implementing shared-memory or shared-objects abstractions on top of message- passing systems to ease the programming on multicomputers [22, 23, 24, 25]. This dissertation focuses on hardware cache-coherent systems. The background of this issue is presented in following sections. 2.2 Directory-Based Coherence Protocols A number of different coherence solutions have been proposed, and a survey has been done by Stenstriim [26]. From a hardware perspective, these schemes can be classified into three main categories: directory schemes, snoopy schemes, and the extensions of above two schemes on hierarchical topologies. The directory schemes all maintain some type of table or directory in which the sharing information can be found [27, 28]. When a shared block is going to be modified in one cache memory of a processor, the update information or an invalidation signal will be sent to other caches which have the same block. These schemes differ in where the directories are located (centralized or distributed), how they are maintained (full-map, limited, single-link chained, double—link chained), and what information is stored inside. Various directory schemes have been suggested and studied for large— scale systems [28]; among them, two schemes are popular. The first uses a bit-vector in the home directory for each cache line to keep information on the locations of the cache line. One example of such protocols has been implemented on the DASH multiprocessor [8]. The other popular directory scheme employs a singly- or doubly- linked list for each cached line, and keeps a home directory to point to the head of the list. 
This scheme is used in the IEEE Scalable Coherence Interface standard (SCI) [29].

Figure 2.1 shows an example of the operations when a write hit occurs in a bit-vector directory scheme. When Processor 2 has a write hit on a shared datum, it must look in the directory to see whether other processors have also cached the same cache line. After receiving the request, the memory module that keeps the directory of the line informs the other processors holding the same line. When all acknowledgments from the other processors have been received, the memory module modifies the directory and acknowledges the requesting processor.

Figure 2.1. An example of operations for a cache write hit under a directory-based coherence scheme.

The greatest advantage of directory schemes is their insensitivity to the type of interconnection network used. Since the interconnection networks of many large-scale multiprocessors have multiple stages or hierarchical structures, directory schemes, rather than snoopy schemes, can be applied easily [29, 30, 31, 32]. The most critical drawback of a directory scheme is the long access latency. The overhead of directory look-up and maintenance, such as regenerating the sharing chain in SCI, is unavoidable. In addition, for a directory scheme, the advantage of independence from the broadcast capability of the network becomes a disadvantage when multiple copies of the same message need to be sent. Many researchers have tried to hide the long latency by relaxing the consistency ordering requirement to overlap memory accesses [33, 34, 35, 36, 37, 38]. Theoretically, relaxing execution ordering can gain a lot in performance by overlapping communication and computation [39]; but, to guarantee correctness, only allowing write requests to bypass outstanding read requests between any two synchronization points is practical [8], which limits the improvement in performance [40].

2.3 Snoopy Cache Coherence Protocols

The snoopy schemes rely on a broadcast bus on which all system communication takes place. Various snoopy protocols have been studied and surveyed [6, 41, 42, 43]; in general, every cache controller monitors other caches' requests by listening (snooping) on the bus, and then responds accordingly. Figure 2.2 shows an example of a write invalidation. When Processor 2 has a cache write hit on a shared datum, it broadcasts the invalidation signal on the bus. Processors 1 and 3 receive the invalidation by snooping on the bus all the time.

Figure 2.2. An example of write hit invalidation on a snoopy bus.

The main feature of snoopy cache coherence schemes is that the broadcast operation is atomic: either all processors receive the message at the same time, or the message cannot be sent because the bus is busy. This atomic broadcast operation can avoid an inconsistency when more than one processor tries to invalidate or update the same block simultaneously. Nevertheless, because all communication takes place on a single shared bus, the bus soon becomes the system bottleneck. The simulation results from other researchers [6, 44] show that the shared bus can become the bottleneck when as few as ten processors are sharing data.
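The per-line behavior of such a snooping controller can be summarized in a few lines of code. The following is a hedged sketch of a generic invalidation-style snoop handler; it is not the particular protocol used later in this dissertation (Chapter 6 adopts the Berkeley Ownership Protocol), and the 64-byte line size implied by the tag shift is an assumption made only for illustration.

    /* Illustrative sketch of a generic invalidation-based snoop handler. */
    typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

    typedef struct {
        unsigned long tag;      /* address tag of the cached block      */
        line_state_t  state;    /* coherence state of the cached block  */
    } cache_line_t;

    /* Called by every cache controller for every transaction observed on
     * the bus; the atomic broadcast guarantees all controllers see it.  */
    void snoop(cache_line_t *line, unsigned long addr,
               int remote_is_write, int *supply_data)
    {
        if (line->state == INVALID || line->tag != (addr >> 6))
            return;                    /* block not cached here: ignore  */

        if (remote_is_write) {
            line->state = INVALID;     /* a remote write: drop our copy,
                                          which is now stale             */
        } else if (line->state == MODIFIED) {
            *supply_data = 1;          /* a remote read of a dirty block:
                                          supply the data on the bus     */
            line->state = SHARED;
        }
    }

A write hit on a shared block by the local processor is the dual operation: the controller broadcasts an invalidation on the bus, as in Figure 2.2, before completing the write.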
Attempts to scale snoopy-coherent systems by replacing the bus with a higher-bandwidth communication network, such as a multistage interconnection network, may not work because the snooping scheme relies on low-latency, atomic broadcasts to maintain coherence. Unless a high-bandwidth and low-latency network which can supply atomic broadcast can be achieved, scalability is a limitation of the traditional snooping scheme.

2.4 Coherence Schemes on Network Architectures

In order to allow a larger number of processors in snoopy cache-coherent multiprocessors, several multiple-bus systems [10, 9] and hierarchical systems [32, 45, 47, 46, 48] have been proposed. A few levels of memory clusters are used in these architectures to relax the constraint on the number of processors per bus. In addition, the snoopy scheme and the directory scheme may be combined to provide consistency at different levels [9, 32]. These architectures are more scalable, but they usually suffer from hardware complexity and the overhead introduced by the multiple levels of coherence protocols. Some examples of cache-coherent network architectures are addressed below in more detail.

The Wisconsin Multicube [10] uses a grid of busses connecting the processors to memory. A processor with a large cache resides at each crosspoint on the grid, and a shared-memory module is connected to each vertical bus. Each cache controller snoops on both the vertical and horizontal busses. Invalidation messages are broadcast on every vertical bus, while global read requests are routed to the closest cache with a copy of the requested block. Each memory block has a home column corresponding to the memory module that stores the block. Each cache controller stores, in its column, the identification of all modified blocks, which serves as routing information for read requests. A read request is broadcast on the horizontal bus and routed to the vertical bus where the modified block is stored.

A similar topology is used in the Aquarius architecture [9], but this system differs in the cache coherence mechanism and in the distribution of memory modules. The system combines features of snooping cache schemes, to provide consistency on individual busses, and features of directory schemes, to provide consistency between busses. The memory is distributed per node, unlike the Wisconsin Multicube.

In hierarchical systems, clusters of processors are connected by a bus or an interconnection network. Wilson [45] studied a hierarchically connected multiple-bus system. The design explores a uniform memory architecture with global memory connected to the topmost bus. It uses hierarchical caches to reduce bus use at the various levels. A cache contains a copy of all blocks cached underneath it. Consistency among copies stored at the same level is maintained in the same way as for traditional snoopy cache protocols. However, an invalidation or a request must propagate vertically to a higher level if it cannot be served locally.

Erickson [47] studied a system consisting of a two-level hierarchical bus interconnect with physically distributed memory and private processor caches. A cluster consists of multiple processors sharing a single cluster bus, and multiple clusters are joined by a single global bus. A snooping coherence scheme provides a single, flat virtual address space for his design.

The VMP-MC multiprocessor [46] is another hierarchically connected multiple bus system.
The cluster bus has an interbus cache module that interfaces to the next bus in the hierarchy. This second-level bus has the main memory. In this system, larger clusters are connected Via a ring-based system to provide a large-scale, distributed— shared-memory system. 2.5 Consistency Ordering A serial uniprocessor generally executes instructions one at a time in the order speci- fied by the program. A load operation always returns the last value written to a given memory location. Likewise, a store operation fixes the value that will be returned by subsequent loads until the next store to the same location. For multiprocessors, 14 however, neither the memory system model nor the implementation is as straightfor- ward. The executing order of operations in different processors is difficult to define. Furthermore, the operations from the same processor may be performed in a different order because of different communication latencies involved. Consistency models are necessary to specify what accessing orderings are legal while several processes are accessing a shared location. Several memory consistency models have been proposed, and they define different levels of the ordering requirements. From the most strict requirement to the least, the common ones are: sequential consistency [49], processor consistency [50], weak consistency [51], and release consistency [39]. A survey of these ordering models can be found in [52]. The sequential consistency and weak consistency will be briefly introduced for further reference. Naturally, most programmers take for granted that the result of a sequence of write operations performed by a processor will be immediately observable by any other processor exactly in the same order as it was performed. Lamport [49] observed the above situation and defined sequential consistency as: A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each indi- vidual processor appear in this sequence in the order specified by its program. Thus, in a sequentially consistent system the end result of the program must be as if each instruction had finished executing before the next instruction is started and the in- structions are executed in program order. In this case, the synchronization among processors can be implicitly implemented by using shared variables. For example, the programmer of the program in Figure 2.3 will expect the Processor 2 to print 1 as the value of variable Y. In this program, variable X is used as a synchronization variable to define the order of the two processes. In a single-bus connected multiprocessor, sequential consistency can be easily 15 Shared int X=Y=O; Processor 1 Processor 2 O 0 O O O O Y = 1; while (X != l); x = 1; print Y; Figure 2.3. In a sequentially ordered system, the Y will be printed as 1 by Pro- cessor 2. In a weakly ordered system, the value of Y may be printed as 0. achieved if all memory accesses are issued orderly from any node and all copies of the same data can be updated or invalidated simultaneously [36]. For other network architectures, because different memory accesses may have different latency, it may be costly to maintain the sequential consistency. On the other hand, in a system other than a sequentially ordered one, explicit synchronization operations must be used whenever a specific order of accesses must be enforced. 
For example, in Figure 2.3 the value of Y may be printed as 0 by Processor 2 in a weakly ordered system, which is not the result expected by our programmer. A more complicated example can be found in Figure 2.4, where only a sequentially ordered system can guarantee that the value of X is printed as 1; all other consistency models may have 0 as the printed result.

The implementation of sequential consistency in parallel systems precludes the incorporation of many common performance-enhancing features from uniprocessor architectures. Weak consistency is essentially a trade-off between software flexibility and hardware performance. Weakly ordered systems depend on explicit, hardware-recognizable synchronization operations, such as the lock and unlock operations, to order the effects of events initiated by different processors in a system. The memory has to be consistent at those synchronization points, but the ordinary operations between two synchronization points can be executed in any order as long as the data dependence is still guaranteed.

    Shared int X = Y = 0;

    Processor 1:  X = 1;
    Processor 2:  while (X != 1);  Y = 1;
    Processor 3:  while (Y != 1);  print X;

Figure 2.4. An example to distinguish sequential consistency from the other models. In a sequentially ordered system, the X will be printed as 1. In other systems, the X may be printed as 0.

Weak consistency cannot guarantee the order of the completion times of sequential statements except for synchronization operations. Dubois et al. [51, 53] defined weak ordering as: In a multiprocessor system, storage accesses are weakly ordered if (1) accesses to global synchronizing variables are strongly ordered, (2) no access to a synchronizing variable is issued in a processor before all previous global data accesses have been performed, and (3) no access to global data is issued by a processor before a previous access to a synchronizing variable has been performed.

For weak consistency, the non-synchronizing memory references only need to follow the program's data dependencies; in all other cases, they can be executed in any order. In a weakly ordered system, an order that maximizes system performance can be used, which may include: issuing or completing instructions out of order, letting reads take precedence over writes, prefetching data, and, if the cache block contains only one word, delaying cache coherence invalidation signals after the write to the cache. As a result, lockup-free caches can be designed to avoid processor blocking on a cache miss. Despite all of the above, Baer and Zucker [40, 54] showed that on a shared-bus architecture, the weak consistency model cannot provide any significant performance benefit over a sequential consistency model. They also claimed that the overhead of a weakly ordered system would outweigh its benefit on a bus-based system.

CHAPTER 3

Bus-Based Networks

3.1 Introduction

One major factor in deciding the performance of a multiprocessor is the time spent in interprocessor communication. The communication latency of the network must be short so the multiprocessor can have high performance, and the bandwidth must be large to allow a large number of processors. A single-bus network has the advantages of simplicity and short communication latency, but the bandwidth limitation prevents the system from having a large number of processors.
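A rough, back-of-the-envelope calculation makes the limitation concrete; the numbers below are assumed for illustration and are not measurements from this dissertation. If each processor generates a bus transaction on average every Treq processor cycles, and each transaction occupies the bus for Tbus cycles, then the bus saturates when

    N * (Tbus / Treq) ≈ 1,   i.e.,   Nmax ≈ Treq / Tbus.

Assuming, for example, one miss-induced transaction every 200 cycles and a 20-cycle bus occupancy per transaction gives Nmax of about 10, consistent with the roughly ten-processor limit reported in the studies cited in Chapter 2 [6, 44].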
A natural extension of the single-bus network is to employ multiple busses to increase the bandwidth, reliability and scalability. The performance of multiple-bus interconnections for multiprocessors has been analyzed in the literature, and the results agree that multiple busses are very appealing [11, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65]. Simulation results have shown that multiple narrow busses can offer high bandwidth with more flexibility than a single wide bus [63]. It has also been claimed that a four-bus system can achieve almost the same performance as a 16 × 16 crossbar system while reducing the hardware cost significantly [11, 62]. The commonly used model is an N × M × B multiple-bus processor system, in which all the N processors and M memory modules are connected to all B busses. Many researchers have shown various schemes to reduce the number of connections on a multiple bus network [66, 15]; these schemes are especially significant when the number of busses is relatively large (i.e., more than half the number of processors). Lang [55] and Chen [65] proposed a partial bus network to alleviate the loads and the complexity of the arbitration hardware caused by the large number of busses. In the partial bus network, each processor connects to all busses and each shared memory module statically connects to a subset of the busses.

An alternative way to implement a bus-based network is to use optical devices instead of the traditional electronic busses [15, 67, 68, 69, 70]. The potentially huge bandwidth of an optical interconnect can dramatically reduce the communication contention in a multiprocessor.

This chapter starts by presenting a classification of bus-based networks, followed by a discussion of applying snoopy schemes on multiple bus networks. Furthermore, this chapter introduces the state-of-the-art technology of optical networks that can be applied in multiprocessors, and the use of optical networks to emulate several popular multiprocessor topologies.

3.2 A Classification of Bus-Based Networks

Classifying bus-based networks is necessary to put our proposed system in the proper context. A variety of multiple bus topologies have been proposed, which include one-dimensional, multiple-dimensional, and hierarchical topologies [71]. However, this classification scheme is incomplete because the multiple-dimensional topologies can also be represented and implemented in one dimension. Additionally, many multiple-dimensional systems are hierarchically partitioned into several clusters.

Our classification for bus-based multiprocessors is characterized by the number of hops between any two nodes and the number of transceivers (a pair of transmitters and receivers attached to the same bus) per node. A single-hop system is one in which messages can be transmitted between any two nodes without any intermediate switches (as in multistage-interconnected systems) and without forwarding by other nodes (as in multiple-hop systems) [72]. If we let Ntrans and Nrec represent the number of transmitters and receivers per node, respectively, a single transceiver system is one in which Ntrans = Nrec = 1 and the transmitter and receiver are on the same bus for all nodes. For simplicity, both Ntrans and Nrec are assumed to be constants for all nodes. The number of transceivers decides the number of busses that one node can access at a time. Listed below are the four topologies of bus-based multiprocessors (Figure 3.1):

Single Hop Single Transceiver (SHST).
Single Hop Multiple Transceiver (SHMT).
Multiple Hop Single Transceiver (MHST).
Multiple Hop Multiple Transceiver (MHMT).

Figure 3.1. Four topologies of bus-based multiprocessors: (a) single hop, single transceiver; (b) single hop, multiple transceivers; (c) multiple hops, single transceiver; (d) multiple hops, multiple transceivers.

SHST systems use a broadcast bus for connection. Several commercial systems such as the Sequent Symmetry, Encore Multimax, SGI Onyx, and Stardent Titan belong to this category. Shared-memory environments can be easily applied on them, and the cache coherence problem can be solved by using the snoopy scheme. Because of the bus bandwidth limitation, scalability is a weakness of these systems.

SHMT systems are multiple bus systems. Busses need not be physically separated; they can be obtained by dividing the bandwidth of a single physical connection. Systems such as the Synapse N+1, SGI Challenge and Tandem NonStop TXP have more than one shared bus. Most of the proposed SHMT systems fully connect each processor and memory module to all of the busses. One interesting application of SHMT is the cyclic shifter interconnection [66, 73, 74], in which each node connects to a constant-size subset of all busses. The other extreme case of SHMT systems is an N-node system with Ntrans = Nrec = N-1: it results in a point-to-point, fully connected topology. The performance of the SHMT topology for multiprocessors or local area networks has been analyzed in the literature [64, 72, 75, 76, 77]. Results show that using multiple low-speed busses rather than one high-speed bus can yield significant performance improvement and a number of implementation advantages. One feature of these systems is that multiple receivers tapped to the same bus can receive the same message at the same time. This feature makes atomic multicast, a requirement for a snoopy coherence scheme, possible on SHMT systems.

MHST systems usually have good scalability. A subgroup of nodes connects to a bus, and connecting switches are used to connect different busses. The hierarchical multiprocessors such as the machines proposed by Wilson [45] and Erickson [47], the University of Illinois' Cedar project, the University of Toronto's Hector [17], and the VMP-MC system [46] are examples of MHST systems. These systems are used to remedy the hardware complexity of crossbars or multistage networks in a large-scale multiprocessor. If cache coherence is required, different coherence schemes can be used at different levels: snoopy schemes may be adopted to maintain coherence among the nodes on the same bus, and directory schemes may be chosen for the higher levels.

MHMT topologies have been used for connecting systems into a k-ary n-cube, such as the Lattice-Mesh [78], the Spanning Multiaccess Hypercube [79], the Wisconsin Multicube [10] and the Aquarius system [9]. The advantage of this topology is scalability, but the disadvantages are the complexity of the hardware and the overhead from coherence protocols. Like MHST systems, multiple-level coherence schemes can be adopted if cache coherence is needed.

3.3 Applying Snoopy Schemes to Multiple Bus Networks

Applying a snoopy scheme on a single-bus system is straightforward: all communication appears on the shared bus, and every processor's cache monitors the bus and responds accordingly. The response to a data modification will be performed in the same bus cycle as the request issued on the bus.
Sequential consistency [49] can be easily achieved in this system. Furthermore, since only one request can be issued on the bus at a time, the arbitrator of the bus forces all memory accesses into some sequential order.

Both the hardware and the coherence protocol become complicated when traditional snoopy schemes are applied to fully node-bus-connected multiple bus networks. Because each node must now snoop on multiple busses, the nodes will have difficulty responding as fast as in single-bus systems. In addition, the arbitration for a free bus takes more time if there are more busses. A study has shown that in a multiple-bus system utilizing lookahead arbitration techniques, memory-bus arbitration becomes the performance-limiting factor for large numbers of memory banks [61]. Multiplexors and buffers may be used to temporarily store requests, but then the system consistency ordering must be relaxed [51], which will further complicate the design of cache controllers and place a burden on programmers. Some acknowledgment mechanism is needed because operations such as synchronizations will need to know when the operation is globally performed. Such operations will generate more communication traffic and increase communication latency. In addition, since requests for the same memory block may exist on different busses at the same time, some arbitration mechanism is needed. All of the above complexity, plus the requirement of multiple snooping devices (transmitters and receivers to the busses) for each processor, causes snoopy schemes to no longer be simple and efficient. The same reason also limits the scalability of the system, because the snooping mechanism for each processor becomes more complicated and costly as the number of nodes and busses increases.

To avoid the complications mentioned above, a multiple bus network that assigns a bus to a request according to the requested address will be presented and discussed in Chapter 5 of this dissertation. A new snooping scheme on this network will be presented in Chapter 6 to ease the complexity of a multiple bus multiprocessor.

3.4 Optical Network Technology

Traditionally, all busses were implemented electronically. However, in a large-scale multiprocessor, the heavy traffic from the processors can easily exceed the bandwidth of the electronic bus which interconnects the system. The high bandwidth of optical communication is a natural alternative [15, 67, 68, 70]. Theoretically, the bandwidth of optical fibers is about 30 Terahertz in the low-loss transmission wavelength windows near 1.3 and 1.5 μm.

A variety of configurations of optical networks, including bus, ring, tree and star, have been proposed [69, 80, 81, 82]. The bus and ring architectures suffer from a linear power loss (in dB) with the number of tapped nodes, while the power loss of the star architecture grows only logarithmically. A star topology constructed from integrated 2 × 2 modular couplers makes hundreds of nodes in one optical network possible. With current technology, a 64 × 64 passive star is available, and the extension to hundreds of ports, such as a 256 × 256 star, is foreseeable in the near future [80]. Star couplers also benefit from being passive devices. One example of a passive-star Ethernet LAN is used in the 44-story Southwestern Bell facility in St. Louis, Missouri [83]. This system uses a 3-level star hierarchy with 3 stars at the distribution center, 8-port stars serving groups of 5 floors, and each floor served by 2 16-port stars.
It contains 540 nodes, 92 miles of fiber, and supports over 4000 computer terminals 25 and workstations. The optical network can be partitioned into multiple channels. Two possible methods can be used: time division and wavelength division. The former suffers from the required fine-tuned clock synchronization for the entire system, which is especially difficult for a large-sized system. The following discussion concentrates on the latter method. As shown in Figure 3.2, every input signal of a star coupler can be broadcast to all output branches, and if they are under different light wavelengths (frequencies), no interference will occur. xxxx x xx x xx . xxxx it; t Figure 3.2. A 4 X 4 star coupler with input signals under different frequencies. If packets under the same wavelength are transmitted in the network simulta— neously, communication contention will occur. A number of contention control tech— niques have been used in different systems, which includes carrier sense multiple access with collision detection (CSMA/CD), demand assignment multiple access (DAMA), and token-passing schemes [70, 84]. Wavelength Division Multiple Access (WDMA) technique, also called Wavelength Division Multiplexing (WDM) by others, can partition the available optical band- width into multiple narrow channels. A connection is established via a dynamic 26 allocation of a frequency (wavelength) for the duration of the connection. Sev- eral modulation formats have been used, including amplitude-shift keying (ASK), frequency-shift keying (FSK), phase-shift keying (PSK), and polarization-shift key— ing (PolSK) [85]. In order to let one optical device handle data in different wave- lengths, a number of wavelength tunable transmitting lasers and tunable receivers were reported [67, 85, 86, 87]. There are two important parameters regarding the proper choices of these devices: (1) the number of available channels, and (2) the tuning speed. In different techniques, the number of tuning channels is varied from 10 (local-oscillator filters), and 100 (acousto-optic filters) up to 1000 (2 stage Fabry- Perot filters). The tuning speed is from the scale of nanoseconds (semiconductor lasers and local-oscillator filters), to microseconds (acousto—optic filters), and up to milliseconds (Fabry—Perot filters). To build a fast interconnection network for a mul- tiprocessor, devices which can be tuned to tens of wavelengths with at most several nanoseconds random-tuning times are needed. Essentially, the fast-tuning devices described by Kobrinski et al. [88] can fit the requirement. Experimentally, switching times of less than 15 nanoseconds have been demonstrated in three—section distributed Bragg reflector (DBR) laser transmitter with maximum continuous tuning range of 2.2 nm (FSK modulation), which can support more than 50 wavelength—addressable channels. Tunable double-section distributed feedback (DFB) optical amplifiers (fil- ters) were shown to be capable of selecting Gigabits per second data packets with nanosecond switching times with up to 20 wavelength channels. The capacity of opti- cal communication systems has continued to double every 18 to 24 months, both for research experiments and commercial systems [89]. Therefore, considerable improve- ments are expected in the near future. It is predicted that lasers with the ranges of up to 256 wavelengths, and switching times less than one nanosecond will soon be possible [90]. 
A few prototypes of optically-connected multiprocessors exist already [85, 68], and the FOX (Fast Optical Cross Connect) system [91] is illustrated below as an example. The FOX system is an experimental architecture for studying the potential of optically-connected parallel processing computers. It uses fast-tunable laser transmitters and fixed-wavelength receivers to communicate between processors and memory modules. Two separate passive star couplers were used to broadcast the transmitted packets to all branches, and only packets under the correct wavelength are filtered through the receivers. Figure 3.3 shows the architecture of the FOX system. If enough channels (lightwave frequencies) are available, each receiver can have its own channel, so the communication is self-routing: no destination address is included in data packets. Processors can keep requesting memory accesses, and receive the responses from memory modules later. Data collisions may occur if more than one packet travels toward the same memory module at a time. Collisions were resolved by letting the sending processor wait for a successful response from the corresponding memory module. If no response is received by the processor after a period of time, a backoff algorithm is used and the packet is retransmitted. It is interesting to note that in this architecture, if the response time of each memory access is the same and if each processor has a unique receiving wavelength, no collision will ever happen in the second star coupler.

Figure 3.3. The FOX architecture: tunable transmitters are used to determine destinations.

The FOX architecture can be modified by putting the entire burden of tuning on the processors' side. Each processor connects to a tunable laser transmitter and a tunable receiver, and the memory modules only connect to fixed ones. In this case, each memory module uses the same channel for both sending and receiving signals. After a processor sends a request to a memory module through a certain channel, it tunes its receiver to listen to the same channel for the response. A similar idea was presented in a proposal for HYPASS (optoelectronic Hybrid PAcket Switching System) [90], a packet switch which is capable of several hundred Gigabits per second of throughput.

A design of an optically-connected multiprocessor has been proposed by Wailes and Meyer [15]. In their Multiple Channel Architecture (MCA), hundreds or even thousands of processor nodes, memory nodes, and I/O nodes are connected by a star coupler. Each processor node uses three pairs of tunable transmitters and tunable receivers: one pair is for transmitting general data, another pair is for cache line write-back, and the final pair is reserved for executing barrier synchronization. Each memory node or I/O node has one tunable transmitter and tunable receiver pair. Thousands of channels are implemented via wavelength division multiplexing. The channel arbitration is through the CSMA/CD technique. Transceivers can flexibly tune to any channel, and the channel assignment is left to the compilers, the programmers, or the operating systems.
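The collision handling described above for FOX (wait for a response, back off, retransmit) can be pictured with the minimal sketch below. The binary exponential backoff policy, the slot-based waiting, and the callback names are assumptions of this sketch, not details of the published FOX design.

    import random

    def wait_slots(n):
        """Placeholder for a hardware delay of n slot times."""
        pass

    def fox_request(send, got_response, packet, max_attempts=8):
        """Send `packet` with the tunable transmitter via send(), then wait for
        the fixed-wavelength receiver to report a reply via got_response();
        retry with an (assumed) binary exponential backoff when no reply arrives."""
        for attempt in range(max_attempts):
            send(packet)
            if got_response():
                return True               # the memory module answered: success
            # No reply within the timeout: assume the packet was lost in a
            # collision at the star coupler, back off, and retransmit.
            wait_slots(random.randint(0, 2 ** attempt - 1))
        return False

A processor would supply `send` and `got_response` routines bound to its own transmitter and receiver hardware.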
3.5 Embedding Topologies in Optical Networks

A variety of network topologies can be implemented by star-coupled optical networks, including all classifications of multibus networks discussed in Section 3.2. This section shows the implementation of many popular topologies for multiprocessors.

3.5.1 Single-Bus Networks

If only one wavelength is used, an optical star topology is logically equivalent to a single bus system (Figure 3.4). This topology is classified into the SHST category.

Figure 3.4. A star topology with one wavelength (a) is logically equivalent to a single-bus system (b).

3.5.2 Point-to-Point Connection

Another use of the optical network is to treat the network as a multichannel system, with each available wavelength working as a single channel. Figure 3.5 shows an example of implementing an SHMT network which works as a point-to-point directed ring. Although the topology is represented as a multibus topology, each bus in the figure can be thought of as a channel in a WDMA optical network connected by one star coupler. Each node uses only one transmitter and one receiver. In order to communicate among different channels, the transmitter and the receiver of the same node attach to different channels to avoid conflicts.

Figure 3.5. One transmitter and one receiver are used in one node. Logically this network is a point-to-point directed ring topology.

3.5.3 The k-ary n-cube Topology

Figure 3.6 shows a network in which each node can transmit and receive data on two channels by using two transmitters and two receivers. Logically this topology is a multiple-bus mesh topology [92], and the same idea can be used to construct a 3D mesh or other k-ary n-cube topologies. If tunable devices are used, the number of transceivers per node can be reduced. For this topology, the number of hops in a transmission is the number of dimensions in the logical representation.

Figure 3.6. Two transmitters and two receivers are used in one node for this example. Logically this network is a multiple-bus-connected k-ary n-cube topology.

3.5.4 Multistage Interconnection Networks

Optical networks can also emulate non-bus-based systems. Figure 3.7 shows a perfect-shuffle connected network, in which each node has two transmitters and two receivers, all attached to different channels. In this figure, an 8-node network uses 16 channels to achieve the communication, which can be implemented by one star coupler. However, it is possible to reuse channels by using a star coupler for each stage, so the number of channels needed can be reduced to twice the number of nodes in one stage. Details of this topology have been studied by Acampora [93].

Figure 3.7. Two transmitters and two receivers, each attached to different channels, are used in one node. Logically this network is a MIN.
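One plausible channel assignment for the 8-node, 16-channel configuration of Figure 3.7 is sketched below. The shuffle pattern (node i feeds nodes 2i mod N and (2i + 1) mod N), the rule that every receiver owns a dedicated channel, and the numbering are all assumptions of this sketch; the exact assignment used in the figure is not recoverable from the scan.

    def shuffle_channel_map(num_nodes=8):
        """Give each of the 2*num_nodes receivers its own WDM channel, and list
        the two channels every node's transmitters must use to reach its two
        perfect-shuffle successors."""
        recv_channel = {(j, port): 2 * j + port
                        for j in range(num_nodes) for port in (0, 1)}
        tx_channels = {}
        for i in range(num_nodes):
            links = []
            for k in (0, 1):                        # the node's two transmitters
                dest = (2 * i + k) % num_nodes
                # Node `dest` has exactly two predecessors, dest//2 and
                # dest//2 + N/2; give the former receiver port 0 and the latter
                # port 1, so every channel ends up with exactly one sender.
                port = 0 if i == dest // 2 else 1
                links.append(recv_channel[(dest, port)])
            tx_channels[i] = links
        return tx_channels

    print(shuffle_channel_map())    # 8 nodes use 16 distinct channels in total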
3.5.5 The OMP Network

Now let us see the ability of optical networks to emulate more complicated configurations. The OMP (orthogonal multiprocessor) architecture proposed by Hwang [94] is used as an example. Figure 3.8 (a) shows the original design of the OMP with 3 processors, where processor i connects to the memory modules in row i and column i. Memory module (i, i) is the private memory of processor i, and memory module (i, j) where i ≠ j is shared by processor i and processor j. Figure 3.8 (b) represents the connection of the OMP by using an MHMT topology with Ntran = Nrec = 2. If an optical network is chosen to connect the system, one node can send and receive data on two channels by using one tunable transceiver or two fixed transceivers. For an n-processor OMP, either two n × n² star couplers or one (n + n²) × (n + n²) star coupler can interconnect the system. For this kind of partially-connected network, there may be several alternative paths between any two nodes.

Figure 3.8. (a) Configuration of the Orthogonal Multiprocessor (OMP). (b) MHMT representation of the OMP.

3.5.6 Optical Networks in Distributed-Memory Systems

If a local memory accompanies each processor, then compared with the FOX system described above, only half of the network devices are needed to construct a multiprocessor system. This architecture can represent a distributed-memory message-passing multicomputer or a shared-memory NUMA multiprocessor. A node may connect to a tunable transmitter and a fixed-wavelength receiver, as shown in Figure 3.9. The system can be self-routing if the number of tuning channels is sufficient to let each receiver have a unique channel. This system cannot use CSMA/CD to control communication collisions because the nodes cannot detect their own transmissions. An alternative is to use fixed-wavelength transmitters and tunable receivers, as opposed to what is shown in Figure 3.9. This configuration can offer atomic multicasting by letting multiple nodes receive signals from the same channel simultaneously. However, because a receiver will listen to channel λk only after agreeing to accept data from node nk, demand polling or some more complicated assignment technique must be used.

Figure 3.9. Configuration of a distributed-memory system connected by a star-coupled optical network with tunable transmitters and fixed receivers.

3.5.7 Using Optical Networks

Although optical networks can be used to implement a variety of network topologies, some of the topologies are not good choices. In choosing the best topology, the major factor is the optical-electronic conversion overhead. Until all-optical routing chips become available, when an intermediate node receives lightwave signals it must convert the optical signals into electronic signals so the destination address can be read to decide the outgoing channel for the signals. Then the signals are converted into lightwaves again under the proper wavelength channel for sending out to the next node. This process is relatively slow compared to the routing decision in traditional electronic networks. Because of this, the number of hops in an optical network cannot be large if a high-performance network is expected. Therefore, using an optical network to emulate a directed ring as shown in Figure 3.5 suffers from the possibly large number of hops, unless the messages are really long and the pipelining effect makes the distance invisible. The same limitation should be considered in any MHMT topology. For the above reasons, the SHMT topology is the most suitable topology for a large-scale multiprocessor connected by an optical interconnection network. However, the potentially large number of transceivers needs to be carefully limited. An example will be shown later in this dissertation.

CHAPTER 4

The Tango Tracing and Simulation System

4.1 Introduction

The network and the novel multiprocessor architecture proposed in this dissertation were both evaluated using a simulation system named Tango [95, 96]. This chapter describes the Tango simulation environment and our experience in using it.
The simulated architectures and the performance results will be addressed in subsequent chapters.

Tango is a software simulation and tracing system used to gather execution statistics for parallel programs and multiprocessor systems, or to generate traces of synchronization events and data references. Developed at Stanford University, Tango is being used in a wide range of investigations from algorithm studies to detailed hardware evaluations, such as the evaluation of the SPLASH benchmarks [13] and the DASH multiprocessor [8]. The system simulates a target multiprocessor by interleaving the execution of processes on a uniprocessor host. The application is compiled and augmented to perform monitoring and simulation tasks. During the simulation, the augmented application processes are time-multiplexed to run on a uniprocessor in a way that ensures memory and synchronization operations are executed in the correct order.

In using Tango to simulate a target architecture, the user has to supply the specific memory and network timing model of the target multiprocessor. The user can choose whether these routines are invoked at all memory references, at shared memory references only, or at synchronization points only. In other words, the tradeoff between the accuracy and the cost of the simulation experiments is determined by the information of interest to the user. The applications can be written in either C or FORTRAN using the m4 macros developed at the Argonne National Laboratory. The macros present a variety of abstractions that the programmer can use, such as locks, barriers, distributed loops, and messages. The macros are implemented using UNIX processes and semaphores, so the simulator is designed to run under an operating system that provides support for processes, shared memory, and semaphores. The simulation experiments we executed were run on a DECstation 3100 under Ultrix.

4.2 Trace-Driven vs. Execution-Driven Simulation

One common way to structure a simulation system is to use trace-driven simulation, which decomposes the simulation system into two components: an address generator and a timing simulator for the memory and the network (Figure 4.1). The generator produces the trace data of a program's execution; the timing simulator emulates the operations of the target machine. Trace-driven simulators have been used in several multiprocessor studies [6, 41, 97]. However, the most frequently criticized issue in trace-driven multiprocessor simulation is the accuracy of process interactions. Most parallel programs are nondeterministic: the execution path through the program, the order of events needed to maintain data consistency, and the latency of operations all depend on the relative execution timing of the processes. Although some timing adjustments may be possible, the interactions represented by multiprocessor trace-driven simulation generally do not correspond to a correct execution of the program on the target architecture, especially when traces generated on one machine are used to simulate another machine [98, 99].

Figure 4.1. Structure of a trace-driven simulator.

One major feature of Tango is that it allows execution-driven simulation. In an execution-driven simulator, sequences of application instructions on one processor are executed until a globally visible operation is encountered.
At that time, simulation code updates the simulated time, by either a simple increment or a more detailed emulation of the target architecture. As the simulation progresses, the correct ordering of program events can be ensured since the simulation times have been determined. So, unlike trace-driven simulation, execution-driven simulation ensures accurate ordering of program events and permits accurate simulation of contention and process interactions. The structure of Tango, which illustrates its execution-driven feature, is shown in Figure 4.2.

Figure 4.2. Structure of Tango, an execution-driven multiprocessor simulator.

4.3 Tango Simulation Approach

A simulation process is created for each process in the application and associated with a unique processor in the target system. Functional behavior is simulated by compiling the code and executing it. The costly instruction interpretation and emulation are avoided by directly executing user application code whenever possible. Novel target machine instructions that do not exist on the available machine can be implemented in libraries.

Each simulation process keeps a software clock, and the clocks are loosely synchronized while maintaining the order of the operations of interest. A clock is updated after executing a set of instructions, and at simulated memory references and synchronizations. Each assembly language instruction of the host machine is assigned a default delay corresponding to the latency of the target machine, and the delays are added to the simulated application during the compilation process. When a memory reference of interest to the user is encountered, either a subroutine or a process which acts as the timing simulator is responsible for calculating the reference latency at run-time. The execution order of the events of interest is preserved by distributed schedulers: each simulation process is augmented to reschedule itself before each event of interest. The process farthest behind in simulation time is always scheduled to execute next. Figure 4.3 shows the process state diagram that illustrates the scheduling method described above.

Figure 4.3. Tango application process state diagram.
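The scheduling rule just described, always resuming the simulated process that is farthest behind in virtual time, can be sketched as follows. This is a simplified illustration only: Tango itself time-multiplexes compiled, augmented UNIX processes, whereas the sketch models each processor as a generator of (elapsed virtual time, event) pairs.

    import heapq

    def run_simulation(processes):
        """processes: dict mapping a processor id to an iterator that yields
        (delta_virtual_time, event) pairs; each yield marks an event of interest
        (memory reference or synchronization) at which the process reschedules."""
        clock = {pid: 0 for pid in processes}       # per-process software clocks
        ready = [(0, pid) for pid in processes]     # (virtual time, pid) heap
        heapq.heapify(ready)
        trace = []
        while ready:
            t, pid = heapq.heappop(ready)           # farthest-behind process runs next
            try:
                delta, event = next(processes[pid]) # execute up to its next event
            except StopIteration:
                continue                            # this process has finished
            clock[pid] = t + delta
            trace.append((clock[pid], pid, event))
            heapq.heappush(ready, (clock[pid], pid))
        return trace

    # Example: two toy "processors" that each reach two events of interest.
    procs = {0: iter([(5, "load"), (7, "store")]),
             1: iter([(3, "load"), (9, "barrier")])}
    print(run_simulation(procs))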
In Tango, the user application is compiled into a simulation system that is customized to meet the accuracy and efficiency requirements. Figure 4.4 shows the production of a simulation system from a C program (ex.C is the application program name used in the figure) using Tango. The first step is to expand the macros which provide the machine and synchronization abstractions. The output of the macro expansion is a standard C language source file, and a standard compiler is used to produce an assembly language file from this source. An added augmentation step inserts code to increment the virtual time for each process, call the timing simulator at data references, and write trace records. The timing values for each operation provided by the host machine are stored in a processor parameter file. A standard assembler is then used to convert the augmented code into an object module. The final step is to link the object modules of the modified program with the timing simulator and library routines. When the simulation is executed, a timing parameter file containing synchronization delay and memory system parameters is read.

Figure 4.4. Production of a simulation system using Tango.

4.4 Experience in Using Tango

To customize the Tango system for our experiments, we implemented routines to handle the detailed operations of the architectures, such as the contention on the interconnection networks, the private cache for each processor, the cache coherence protocols, the memory placement policy, and the location of shared pages.

One major advantage of using Tango is its flexibility in choosing efficiency versus accuracy. For example, in simulating a uniform-memory-access machine, the timing simulator can be used for every memory access. But when Tango is used to simulate a NUMA machine, the details of private memory accesses are not as important as the shared memory accesses, so a simple calculation can be implemented for the private memory accesses and the simulation can focus on shared accesses. Detailed hardware simulation can be achieved by the routines supplied by the user. The other major advantage of using Tango is its execution-driven characteristic, which makes accurate simulation of coherence protocols possible. Applications can be easily implemented by using the m4 macros, which enabled our study of the reference patterns of different programs. Furthermore, real program execution results are generated, which can be used to verify the correctness of the program.

The disadvantages of using Tango include the difficulty of debugging and some hidden bugs inside the code. Because of the lack of detailed documentation, users must read the source code to understand the system. The simulation is not very efficient because the context switching among processes produces a lot of overhead. For example, running a 100 × 100 matrix multiplication simulation for a 32-processor target system took about two hours on a DECstation 3100, and a 200 × 200 matrix multiplication simulation took more than fifteen hours on the same host machine. The efficiency of Tango should be able to be significantly improved by using threads (lightweight processes) instead of UNIX processes. The size of the simulated target machine is limited by several system parameters of the host machine, such as the maximum number of processes a user can create at a time, the maximum number of semaphores a user can use, and the size of the shared memory a program can define.

CHAPTER 5

Address-Separated Multiple Bus Network

In this chapter we demonstrate the viability of an address-separated multiple bus network [100]. We will show three different representations of this network that have identical bandwidths. The network will be evaluated using both the Tango simulator and queueing analysis. The queueing models introduced in this chapter will be the basis for analyzing the multiprocessor systems proposed later in this dissertation.

5.1 Introduction

The performance of a multiprocessor system depends significantly on the efficiency of the interconnection network. Connecting all processors by a single bus was the initial architecture for commercial multiprocessors, and this architecture is still used on many multiprocessors.
One important reason for using a single bus on a multiprocessor is that snoopy coherence can be easily applied. Due to the bandwidth limitation of a single bus, employing multiple busses in the network became attractive.

One common way to utilize a multiple bus network is to build a multiple-hop-connected system (MHST or MHMT) in which the communication between two nodes (either processors or memory modules) may need to travel through intermediate nodes. This topology trades away the advantages of a single hop to allow a large number of processors in one system. The other common topology is a single-hop N × M × B multiple-bus multiprocessor system (SHST or SHMT), in which usually all of the N processors and M memory modules are connected to all B busses. For a message packet, the multiple busses work like a pool of resources: the packet will be transmitted when any one of the busses is available. This multiple bus network suffers from the fact that all processors and memory modules have to connect and listen to all busses all the time, because input data packets can come through any bus. This problem becomes more significant as the number of busses increases.

This chapter presents a new multiple bus interconnection network, the Address-Separated multiple bus network, on which a memory block can only be transmitted through a certain bus determined by the address of the block. One of the advantages of this model is that fewer connections between the memory modules and the busses are needed. Furthermore, the network can easily solve the interlock problem when two processors access one memory block at the same time. Because a memory block will be transmitted through a fixed bus, busses can be used as semaphores to guarantee mutual exclusion in accessing memory blocks. This feature is especially important in maintaining cache coherence, to avoid the problems of applying a snoopy scheme on a traditional multiple bus network, as addressed in Section 3.3. It will be shown that the Address-Separated multiple bus network can reduce the complexity of a system while maintaining good network bandwidth. The performance of the network was evaluated by both simulation and queueing analysis. The results of the study will indicate the feasibility of a new bus-based, cache-coherent multiprocessor architecture described in the following chapter, where each processor can statically or dynamically connect to a subset of the busses.

5.2 Multiple Bus Networks

The basic assumptions for the bus-based networks discussed in this chapter are:

1. A bus will be held only if the destination memory module is also available at that moment.

2. An arbitration scheme is used that allows one processor to succeed if multiple processors are competing for a bus.

3. A bus is held during the entire bus transaction, including completion of the memory cycles if memory was accessed.

The circuit switching technique is assumed for the following three reasons: (1) it can reduce the communication latency, because once a path is established there is no further delay for arbitration, decoding, and setting of the appropriate switches; (2) it eases the performing of atomic operations, which simplifies the implementation of synchronization and mutual exclusion primitives; and (3) it is simple to simulate and analyze.
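The address-separated assignment introduced in Section 5.1 amounts to a fixed, address-determined mapping from memory blocks to busses (and, in Model 3 below, to the memory modules behind each bus). A minimal sketch follows; the block size, the interleaving rule, and the function names are illustrative assumptions.

    BLOCK_SIZE = 32          # bytes per memory block (assumed)
    NUM_BUSSES = 8           # B
    MODULES_PER_BUS = 2      # k, so M = k * B

    def bus_for_address(addr: int) -> int:
        """A block is always carried by the same bus, which is what lets a bus
        act as a semaphore guaranteeing mutual exclusion on that block."""
        return (addr // BLOCK_SIZE) % NUM_BUSSES

    def module_for_address(addr: int) -> int:
        """Blocks on one bus are further interleaved over the k modules
        attached to that bus."""
        block = addr // BLOCK_SIZE
        bus = block % NUM_BUSSES
        slot = (block // NUM_BUSSES) % MODULES_PER_BUS
        return bus * MODULES_PER_BUS + slot

    for a in (0x0000, 0x0020, 0x0100, 0x0120):
        print(hex(a), "-> bus", bus_for_address(a), ", module", module_for_address(a))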
Network bandwidth (the number of successful memory accesses per time unit) will be used in this section in discussing various multibus networks. The goal of the discussion in this section is to find a network model which reduces memory-bus connections while maintaining good bandwidth. We denote multiple-bus networks by N × M × B to indicate the number of processors, the number of memory modules, and the number of busses. For brevity, and without losing any generality, it is assumed that M = kB where k is a positive integer. The case of k = 1 is discussed first.

Claim 5.1: For an N × B × B network, the following two models have exactly the same bandwidth (Figure 5.1).

• Model 1 (Single M-B connection): Memory Module i only connects to Bus i.

• Model 2 (Full M-B connection): Each memory module connects to all busses.

Figure 5.1. Three models of the Address-Separated multiple bus network.

Reasoning: Two types of contention (conflicts) may exist in a multiple bus system: contention on busses and contention on memory modules. In Model 1, the processor which holds Bus i is the only one that can use Memory Module i, so contention will only happen in accessing the busses, and no memory contention will occur. Therefore, the N processors compete for B busses to reach the memory modules. In Model 2, the processor which can successfully access Memory Module i should have had no trouble finding a bus, because when a memory module is free there must be a bus available due to M = B. On the other hand, when a processor is holding a bus, the destination memory module should also be available, otherwise the first assumption in this section is violated. So, again, N processors compete for B memory modules, which is equivalent to the contention occurring in Model 1.

Claim 5.1 shows that in N × B × B systems, one memory module need only connect to one bus. All other connections between memory modules and busses are redundant. Also, fixing the transmitting bus for each memory block in this model does not cost anything. Now we generalize the value of k to any positive integer.

Claim 5.2: Single M-B connected N × kB × B networks (Model 3) all have the same bandwidth for any k if k memory modules connect to one bus.

Reasoning: Model 1 is a special case of Model 3 with k = 1. We can show that Model 3 is equivalent to Model 1 by combining the k memory modules connected to Bus i into Memory Cluster i. The processor that is holding Bus i will be the one which can access Memory Cluster i. Additionally, only one module in a cluster can be accessed at a time. Therefore, these k memory modules bonded together simply form a bigger module; nothing else is different. In fact, the number of memory modules connected to each bus need not be the same for all busses in Model 3; the bandwidth of the network will always be identical. This model is similar to the N × M × B network with a single bus-memory connection discussed by Chen et al. [65], except that the two assumptions about arbitration schemes are not exactly the same.

We can conclude from the above two claims that the single M-B connected N × B × B network (Model 1), the fully M-B connected N × B × B network (Model 2), and the single M-B connected N × kB × B networks (Model 3) all have exactly the same network bandwidth. The above two claims let us evaluate a complex network by using a much simpler model. From the definition of the Address-Separated network, it should be represented by Model 1 or 3.
On the basis of the preceding discussion, our network can be represented by any one of the above three models. Our evaluation of the Address-Separated network will be based on Model 1 because of its simplicity.

To simplify the description in the following sections, the fully M-B connected N × M × B system is named Model 4 (Figure 5.2), where M = rB, and r is a real number greater than or equal to 1.0 such that rB is an integer. For example, if B = 8 and r = 1.5, then M = rB = 12. We always assume the number of memory modules (M) is a function of the number of busses (B), because when the number of busses increases the number of memory modules should also be increased; there is no reason to have more busses than memory modules. Model 4 represents general multibus systems, and it has been studied thoroughly in the literature [11, 56, 60, 65]. Because in this model memory accesses can freely travel on any bus, and processors have to compete for both busses and memory modules, different assumptions are used by other researchers, such as two-level arbitration, split transactions, and separate busses for memory requests and responses. However, discussion of those network protocols is not addressed in this dissertation.

Figure 5.2. Multibus Models 4, 5 and 6.

The number of connections needed in Models 1 and 3 is (BN + M), which equals B(N + 1) in Model 1 since M = B. In Model 4, B(N + M) = BN + rB² connections are needed, which is B(rB − 1) more connections than in Model 1. As mentioned in Section 5.1, in Models 1 and 3 the number of connections between processors and busses can be further reduced if a processor need not access certain memory modules, either because of the locality of application programs or because multiple users share the system and no interference is expected. The number of connections will also affect the hardware complexity of a system. The memory modules in a single M-B connected system are simpler than those in fully-connected multibus systems because each module only needs to connect to one bus. Also, in Models 1 and 3, contention will happen only on the busses, so no arbitration mechanism is needed for the memory modules. Furthermore, if a private cache is used in each processor, a set-associative cache management policy can be easily implemented because addresses from different memory modules will come from different busses.

Given the variety of multibus models, it will be interesting to investigate the optimum multibus networks and their performance. Before ending this section, the theoretical multibus networks which can achieve the maximum network bandwidth are presented. They will be named Models 5 and 6 (Figure 5.2).

Claim 5.3: For an N × M × B multibus system:

• Model 5: if N is fixed, the maximum network bandwidth will be derived from the fully-connected system with M = B = the total number of words (the smallest memory accessing unit) in the memory.

• Model 6: if both N and B are fixed, the maximum network bandwidth will be derived from the fully-connected system where M equals the total number of words in the memory.

Reasoning: The key to achieving the maximum bandwidth is to reduce the possibility of contention to the minimum. If the number of busses equals the total number of words in the memory, any memory request can always find an available bus.
Similarly, if the number of memory modules equals the number of words in the system, the memory module contention can be reduced to the minimum. The effect of accessing the same shared word from different places at the same time is ignored here because the operation will be different in different systems.

Models 5 and 6 are purely theoretical networks. No such systems will ever exist because of the cost of the hardware. However, these systems are useful for understanding the capabilities and limitations of the systems described earlier. In addition, Model 6 will be used in later chapters as the network of a directory-based cache-coherent system for comparison.

5.3 Performance Evaluation by Simulation

The results in this section are generated by using the Tango simulator described in the previous chapter to run a parallel minimum-cut program using simulated annealing. The minimum-cut problem is used extensively in Electronic Computer Aided Design. The test problem is a 16-node graph with an average edge degree of 4.5 per node. All memory accesses are assumed to take 5 CPU cycles. No specific memory mapping method is used; that is, memory is uniformly distributed across the different memory modules.

First we compare the performance of our architecture with the ideal models. Figure 5.3 presents the comparison between Model 1 (single M-B connected, N × B × B) and Model 5 (fully connected, N × ∞ × ∞). The latter model represents an ideal architecture. The curves show the ratio of the execution times of the two models when N = 4, 8 and 16. For Model 1, when there is only one bus, the execution time is about (3/4) × N times longer than in a contention-free system (Model 5). As the number of busses increases, the ratio decreases: quickly until the number of busses is N/2, and slowly after that point. The curves will converge to 1 as B approaches infinity.

Figure 5.3. Ratio of execution time of Model 1 (single M-B connected, N × B × B) to Model 5 (fully connected, N × ∞ × ∞).

Figure 5.4 shows the comparison of the two ideal models. The curves are the ratio of execution time for Model 6 (fully connected, N × ∞ × B) to Model 5. This figure presents the optimum performance gain from increasing the number of busses. The ratio halves as the number of busses doubles, and when B = N, Model 6 achieves the same performance as Model 5. The equivalence occurs because no bus contention arises when the number of busses equals the number of processors. This observation also verifies the correctness of our simulation.

Figure 5.4. Ratio of the execution time of Model 6 (fully connected, N × ∞ × B) to Model 5.

The comparison between Model 1 and Model 6 is shown in Figure 5.5. This figure indicates where the two models are closest, which can be used to determine the choice of the number of busses in our architecture. The curves show the difference between the execution time ratios of the above two models, each taken relative to Model 5. When the number of busses is one, there is no difference between Models 1 and 6, because all processors have to compete for the single bus no matter how many memory modules there are or what kind of connection is adopted.
When the number of busses increases, the difference increases at first and then decreases, and eventually the two curves will merge when the number of busses reaches infinity. The shape of the theoretical difference can be observed best on the curve for N = 8, on which the ratio difference starts to decrease when B = 4: quickly at first, and then slowly.

Figure 5.5. Difference in execution time ratio between Model 1 and Model 6.

Figure 5.6 shows the execution ratios for N = 16 under different values of r for Model 4, a fully connected multibus network with M = rB. When r = 1 the network is the same as Model 2, which has bandwidth equivalent to Model 1 according to Claim 5.1. The simulation results verify this claim again. The r = ∞ curve represents Model 6. The curves show that for a finite value of r, the execution ratio converges as B approaches N. This indicates that by increasing the number of busses, the number of transceivers in a multiple bus multiprocessor can be reduced. The difference between the Model 1 curve and the r = 2 curve is due to the additional memory modules and the free choice of busses. Considering the saving of connections in Model 1, I believe the performance is very reasonable. With the assistance of private caches in a multiprocessor system, the bandwidth difference between the Address-Separated multibus network and a fully connected multibus network should be small, but the former can realize cache coherence protocols at much lower cost.

Figure 5.6. Execution time ratio of different values of r for Model 4 (fully connected, N × rB × B) to Model 5 when N = 16. The r = 1 curve also represents Model 1, and the r = ∞ curve also represents Model 6.

5.4 Performance Evaluation by Queueing Models

Queueing models for Models 1 and 6 can derive approximate performance results very efficiently and can be used to verify the simulation results. These analytical models will be used again later in this dissertation to analyze two cache-coherent multiprocessor architectures. The discussion begins with Model 6 due to the simplicity of its queueing model.

5.4.1 An Analytical Model for Model 6

Figure 5.7 shows the queueing model for Model 6. This queueing system involves N processors which access a pool of B busses in closed form. The processors are represented by N processor servers; the busses are represented as a queueing system with B servers. As shown in the figure, there is no waiting queue for the processor servers, since the number of requests equals the number of processors in this closed-form queueing system. There is only one shared waiting queue for the B bus servers because the busses are operated as a pool of resources. Because M = ∞ in Model 6, no memory contention will ever happen. Thus no queueing system for the memory modules is necessary in the queueing model of Model 6; the memory delay can simply be added to the service time of the network servers. A single request (or customer, in the language of queueing theory) will move from its unique processor server to a network server and back to the same processor server.
The time spent by a request in the processor server represents the computation time between two adjacent memory accesses. The time spent by the request in the network server represents the communication latency plus the time spent in the memory modules. In total, N requests exist in the system.

Figure 5.7. A queueing model for Model 6.

It is assumed that the computation time in each processor is exponentially distributed with a mean of 1/μ1, and the service time of a network server is exponentially distributed with a mean of 1/μ2. No time is spent by a request in the waiting queue of the processors because the number of processor servers equals the number of requests. Consequently, the processor servers work like a queueing system with an infinite number of servers. Because the number of busses is less than or equal to the number of processors (B ≤ N), the service rate of the network servers depends on the number of requests inside: the service rate is n·μ2 when there are n requests in the server with n < B, and B·μ2 otherwise. The arrival rate for the network servers when there are n requests inside is (N − n)·μ1, because all the requests that are not in the network servers are being delayed in the processor servers, each with service rate μ1.

Even though there are really two queueing systems involved, the processor servers and the network servers, the state transition diagram of this queueing model can be determined by the network servers alone. This is because the model is a closed queueing network and the processor servers function only as a delay center. Figure 5.8 shows the state transition diagram of the network servers. The states are the number of requests inside, and the arrows represent the entering and leaving rates for each state.

Figure 5.8. State transition diagram for the network servers in the queueing model of Model 6.

We denote P_n to be the probability of n requests in the network servers. To analyze the queueing model, we have to calculate the state probabilities first. The steady-state equations can be derived by letting the entering rate of each state equal its exiting rate. Hence the equations (exiting rate = entering rate) are:

    state 0:            N·μ1·P_0 = μ2·P_1
    state n, 1 ≤ n < B: [n·μ2 + (N − n)·μ1]·P_n = (n + 1)·μ2·P_{n+1} + (N − n + 1)·μ1·P_{n−1}
    state n, B ≤ n < N: [B·μ2 + (N − n)·μ1]·P_n = B·μ2·P_{n+1} + (N − n + 1)·μ1·P_{n−1}
    state N:            B·μ2·P_N = μ1·P_{N−1}                                          (5.1)

Equation 5.1 can be solved recursively to represent P_n in terms of P_0, and yields

    P_n = C(N, n)·(μ1/μ2)^n·P_0                              for 1 ≤ n ≤ B
    P_n = [N! / ((N − n)!·B!·B^(n−B))]·(μ1/μ2)^n·P_0          for B < n ≤ N

where C(N, n) is the binomial coefficient, and P_0 is determined by the condition that the probabilities sum to one.
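The balance equations above are easy to evaluate numerically, as the sketch below shows; it computes the state probabilities directly from the birth-death recursion and reports μ2 times the mean number of busy bus servers as the bandwidth (successful memory accesses per time unit). The parameter values in the example are illustrative assumptions only.

    def model6_state_probs(N, B, mu1, mu2):
        """Steady-state probabilities P_0..P_N of the closed birth-death chain of
        Equation 5.1, via P_{n+1} = ((N - n) * mu1 / (min(n + 1, B) * mu2)) * P_n."""
        p = [1.0]
        for n in range(N):
            p.append(p[-1] * (N - n) * mu1 / (min(n + 1, B) * mu2))
        total = sum(p)
        return [x / total for x in p]

    def model6_bandwidth(N, B, mu1, mu2):
        """Expected number of busy bus servers times mu2, i.e. the mean rate of
        completed memory accesses."""
        probs = model6_state_probs(N, B, mu1, mu2)
        busy = sum(min(n, B) * pn for n, pn in enumerate(probs))
        return busy * mu2

    # Assumed example: mean compute time of 10 cycles between accesses (mu1 = 0.1)
    # and mean bus-plus-memory service time of 5 cycles (mu2 = 0.2), with N = 16.
    for busses in (1, 2, 4, 8, 16):
        print(busses, "busses:", round(model6_bandwidth(16, busses, 0.1, 0.2), 3))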
Since M was represented as kB previously, the value of k can now be simply ° N Nno e derived from ilvaA. bus The classification of the shared-and-writable and private—or-read-only blocks can be performed by the operating system, compilers, or programmers. Only the shared- and-writable blocks need to be snooped and maintained coherent. A processor can load local private and read-only blocks into its cache immediately without sending the request through the busses. If a required private-or-read-only block resides in another node, one of the Nread busses can be used to send the request and receive the data. No invalidate or update signals will be broadcast for private and read-only blocks, so both the communication traffic on these busses and the sn00ping processes of each processor are kept to a minimum. A processor can cache the shared-and-writable lines from at most (Nmp — Nread) different clusters simultaneously including the cluster it belongs to; because whenever these lines are cached, the processor must snoop on the corresponding busses. One of the dynamically switchable receivers will be devoted to that cluster. When a remote shared-and-writable line is needed by a processor, a request will be sent through the corresponding bus determined by the address of the line, and the response will also be received from the same bus. When there is no available receiver but a shared- and-writable line from a new cluster is needed, a replacement policy is used to decide which cluster should be replaced and then a cache—set replacement will occur. After replacement the receiver will snoop on the bus associated with the new cluster. A restriction of the system is that the shared-and-writable lines from only (N37,,D — Nmad) out of the (Nbus — Nrcad) clusters can be cached into a node at a time. We expect this restriction to be sufficient for most cases; that is, a processor only uses shared data from a few different clusters at a time. However, private and read-only data can be accessed from anywhere by using the Nread busses. 74 6.2 Coherence Protocols Because both a broadcasting ability and a snooping mechanism are supplied in the proposed architecture, many coherence protocols based on a shared—bus system can be directly applied in the MSU system. The protocol which can fit the architecture best must be a one that can: 1. minimize the intra-node actions to avoid local lockout between the processor, the cache and the main memory. 2. minimize the bus actions to let the system have better scalability. The needed coherence protocols are discussed separately for the sequential and the weak consistency ordering models mentioned in Section 2.5. 6.2.1 Under a Sequential Consistency Model According to the first suggested condition described above, the protocol should avoid writing cache blocks back to the main memory if it is not indeed necessary, to simplify the actions inside a node. With respect to the second condition, write-invalidate instead of write-update protocols should be adopted to make the messages short in a write operation. The Berkeley Ownership protocol [42] is one of the protocols that satisfies the requirements. The Berkeley protocol is a write-invalidate protocol that is based on the concept of cache block ownership. The owner of a block is either the main memory or one of the caches. A cache can obtain exclusive ownership of a block via two invalidation bus transactions. 
6.2 Coherence Protocols

Because both a broadcasting ability and a snooping mechanism are supplied in the proposed architecture, many coherence protocols based on a shared-bus system can be directly applied in the MSU system. The protocol which best fits the architecture must be one that can:

1. minimize the intra-node actions, to avoid local lockout between the processor, the cache and the main memory.

2. minimize the bus actions, to let the system have better scalability.

The needed coherence protocols are discussed separately for the sequential and the weak consistency ordering models mentioned in Section 2.5.

6.2.1 Under a Sequential Consistency Model

According to the first condition above, the protocol should avoid writing cache blocks back to the main memory unless it is truly necessary, to simplify the actions inside a node. With respect to the second condition, write-invalidate rather than write-update protocols should be adopted to make the messages in a write operation short. The Berkeley Ownership protocol [42] is one of the protocols that satisfies these requirements.

The Berkeley protocol is a write-invalidate protocol that is based on the concept of cache block ownership. The owner of a block is either the main memory or one of the caches. A cache can obtain exclusive ownership of a block via two invalidation bus transactions. One is used on cache write misses; it obtains the block for the requesting processor and at the same time invalidates copies in other caches. The second is an invalidation signal that is used on cache write hits. Once exclusive ownership has been obtained, a cache can update a block locally without initiating additional bus transfers. Owning a block obligates the owner to provide the data to other requesting caches and to update main memory when the block is selected for replacement. Cache-to-cache transfers are done in one bus transfer, with no memory update. A read miss lets the processor get a valid exclusive or shared copy of the block from the owner of the block. To summarize the above description, Figure 6.3 shows the state transition diagram for the protocol.

Figure 6.3. Transition diagram of the Berkeley Ownership Protocol.
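One compact way to read Figure 6.3 is as a transition table. The sketch below encodes the four Berkeley states and the transitions described above; it is a simplification in which bus responses (such as an owner supplying data) are reduced to annotations, and the state and event names are this sketch's own.

    # Berkeley Ownership protocol states (names chosen for this sketch).
    INVALID, VALID, SHARED_DIRTY, DIRTY = "I", "V", "SD", "D"

    # (current state, event) -> (next state, bus action generated, if any)
    TRANSITIONS = {
        # Actions of the local processor.
        (INVALID, "read"):        (VALID,        "bus_read_miss"),
        (INVALID, "write"):       (DIRTY,        "bus_write_miss"),  # also invalidates others
        (VALID, "read"):          (VALID,        None),
        (VALID, "write"):         (DIRTY,        "bus_invalidate"),
        (SHARED_DIRTY, "read"):   (SHARED_DIRTY, None),
        (SHARED_DIRTY, "write"):  (DIRTY,        "bus_invalidate"),
        (DIRTY, "read"):          (DIRTY,        None),
        (DIRTY, "write"):         (DIRTY,        None),              # exclusive owner: no bus traffic
        # Snooped transactions from other processors.
        (VALID, "bus_read_miss"):         (VALID,        None),
        (SHARED_DIRTY, "bus_read_miss"):  (SHARED_DIRTY, "supply_data"),
        (DIRTY, "bus_read_miss"):         (SHARED_DIRTY, "supply_data"),
        (VALID, "bus_write_miss"):        (INVALID,      None),
        (SHARED_DIRTY, "bus_write_miss"): (INVALID,      "supply_data"),
        (DIRTY, "bus_write_miss"):        (INVALID,      "supply_data"),
        (VALID, "bus_invalidate"):        (INVALID,      None),
        (SHARED_DIRTY, "bus_invalidate"): (INVALID,      None),
    }

    def next_state(state, event):
        return TRANSITIONS.get((state, event), (state, None))

    print(next_state(VALID, "write"))          # -> ('D', 'bus_invalidate')
    print(next_state(DIRTY, "bus_read_miss"))  # -> ('SD', 'supply_data')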
6.2.2 Under a Weak Consistency Model

Gaining Performance

The potential performance gain of weak consistency on the proposed architecture is especially high if the number of transmitters per node is less than the number of receivers. In this case, one transmitter must respond to the requests from several receivers and also must send requests for the local processor. If the transmitter can keep sending messages without waiting for the responses to previously sent requests, a potential bottleneck on transmission can be relieved. Furthermore, in the weakly ordered model, other processors need not stall while waiting for the responses delayed in this transmitter.

A Lockup-Free Cache Architecture

Because memory coherence has to be enforced at the synchronization points, buffers have to be used to keep track of outstanding requests. In addition, a few queues are needed for memory accesses. A lockup-free cache architecture for a single-bus-connected system has been proposed by Scheurich and Dubois [37]. A revised one for the MSU system is shown in Figure 6.4, and its operation is briefly described below.

Figure 6.4. Node architecture for implementing weak consistency.

The processor can continue placing memory access requests on the initializing queue whenever the queue is not full. The cache/bus controller can use the recycle line to reinitialize a request if the controller cannot handle the request because of some pending situation. The pending buffer is used to keep a record of the outstanding requests which have been sent to the issuing queue but have not yet been performed. On a cache miss, or on a local write hit when the block is not in the dirty state, the request has to be put on the issuing queue to be sent to the bus, and a record of the request has to be kept in the pending buffer. When a request from a write hit has been successfully issued on the bus, the request is assumed to be globally performed by all other processors, and the pending buffer can erase the record for the block. For a read or write miss, the record will be kept pending until the data of the block is received from one of the receivers. If a request out of the issuing queue is rejected by the bus arbitrator or by other processors, the recycle line of the issuing queue will be used to reissue the request.

Interference Problems

In addition to using the same protocol as in the sequential consistency model to serialize the write operations, the protocol for weak consistency has to address the problem that a memory access might occur while a request for the same block is pending. The second request may come from either the local processor or a remote processor. It is the programmer's responsibility to guarantee that two different processors will access the same word at the same time only if both accesses are read operations. However, accesses from different processors can still target different words in the same block. If the two accesses are both from the same processor, data dependence has to be maintained, so the cache controller (not the processor) may have to stop accepting any new request from the initializing queue until the pending request is performed. Table 6.1 summarizes the proper actions for these problems.

Table 6.1. Actions for weak consistency when an access arrives, from the local or a remote processor, while a request for the same block is pending (R: read access, W: write access).

If a local pending request has not yet been successfully issued and a remote request for the same block arrives from the bus, the remote request can be ignored; after the local request is later issued, the newest owner will offer the new data anyway. However, if the request has already been issued, the arriving request has to be rejected. To prevent the other processors from issuing a request, the controller can raise a special bus-busy line just after receiving the request. Thus, whenever a request is sent to a bus, the cache/bus controller has to check the busy line of that bus. If a request is rejected, the controller can simply put it back into the issuing queue again.

Synchronization Points

In order to keep coherence at the synchronization points, all previous accesses must be performed globally before a synchronization access is allowed to be performed. Therefore, before the processor puts the synchronization access into the initializing queue, the recycle line of the queue must be disabled. When the cache/bus controller finds that the next access is a synchronization access, the controller has to wait until the pending buffer is cleared, and only then can it receive the synchronization access. After this point, the controller cannot accept any other request until this access is finished. Finally, after the synchronization access is done, the recycle line for the initializing queue can be enabled. During the stall period of the controller, the processor can still proceed and put requests on the queue as long as the queue is not full.
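The queue discipline of Figure 6.4 and the synchronization-point rule can be summarized behaviorally as below. This is a single-threaded sketch only (no real busses or recycle lines); the method names, the FIFO structures, and the event labels are assumptions.

    from collections import deque

    class LockupFreeController:
        """Accesses flow from the initializing queue to the issuing queue; blocks
        with outstanding bus requests are remembered in the pending buffer until
        the request is globally performed."""

        def __init__(self):
            self.initializing = deque()   # filled by the processor
            self.issuing = deque()        # drained toward the busses
            self.pending = set()          # block addresses awaiting completion

        def processor_access(self, kind, block):
            self.initializing.append((kind, block))   # the processor never stalls here

        def step(self):
            if not self.initializing:
                return
            kind, block = self.initializing[0]
            if kind == "sync" and self.pending:
                return                    # stall: all earlier accesses must complete first
            self.initializing.popleft()
            if kind == "hit_owned":
                return                    # handled entirely inside the cache
            self.issuing.append((kind, block))        # miss, shared write hit, or sync
            self.pending.add(block)

        def bus_response(self, block):
            self.pending.discard(block)   # the request has been globally performed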
6.3 Qualitative Performance Expectation

This section presents the advantages and disadvantages of the cache coherence scheme of the MSU system. In particular, the effects of the memory references of various applications and the effects of system parameters on the performance of these schemes are addressed here. In the following chapter the arguments will be supported by simulation experiments and analysis.

6.3.1 Pros and Cons of Coherence Schemes

One main advantage of directory schemes is that they are not restricted by the bandwidth limitation of a particular network. However, unlike snoopy schemes, the directory look-up and maintenance make the access latency longer. For example, on a cache read miss to a shared line, the requesting node has to send the request to the directory of the line, and from there the request must either be forwarded to the real owner of the line or the requesting node must receive the directory of the line. Furthermore, either the bit-vector or the linked list has to be modified to reflect the new status.

The atomic broadcasting ability of the busses in a snooping scheme lets a message be sent to all nodes at the same time, which can significantly shorten the time needed to keep data coherent. By contrast, sending a signal to all nodes can only be executed serially in directory schemes on a general interconnection network. For example, on a write hit to a shared block, the invalidation signal can be sent atomically to all other caches on the bus. For the directory schemes, the invalidation process has to be done serially; moreover, in the linked-list scheme the sharing chain has to be broken and the location of the head has to be changed. Although part of the latency can be hidden under a weakly ordered model, any subsequent synchronization operation has to wait until all previous accesses are finished. In addition, a directory scheme generates a lot of internode communication traffic. For example, whenever a write hit occurs, individual invalidation messages must be sent to each processor which cached the same data, and all of them have to send an acknowledgment back to the directory or the requesting node before the write can complete. The writing processor can be idle for much of that time.

Generally speaking, although directory schemes can be applied to high-bandwidth interconnection networks, snoopy schemes on bus-based networks can achieve low-latency memory accesses; and latency, not bandwidth, is often the limiting factor in the performance of a multiprocessor. The bandwidth of bus-based systems can be increased by using multiple busses or by using multiple-channel optical networks.

Traditional single-bus-based snoopy schemes cannot scale with the growth of the number of nodes. If multiple busses are used without modifying the snoopy coherence scheme, as discussed in Section 3.3, the complexity of the coherence mechanism grows rapidly as the number of busses or processors increases, which can seriously restrict the scalability of the system. An address-separated multiple bus network has lower network bandwidth than a pool of multiple busses because the busses may not be evenly utilized. That uneven utilization is offset by a reduction in the complexity of a multiple bus snoopy scheme. The MSU system introduces the concept of partial snooping to simplify the snooping mechanism for each processor when the number of busses is large.
By using this concept, most of the advantages of snoopy schemes can be kept while improving the scalability of the system. An overhead introduced by the MSU system that does not exist in the traditional single-bus-based, snoopy-coherent system is the switching time among the busses. But with today's technology, this overhead is not significant [105].

The cache-set replacement overhead is a potential drawback of the proposed scheme. When a processor needs a new shared-and-writable block from a new block cluster and no cache set is available, one presently cached set must be flushed out and the receiver has to switch to a new bus. All dirty lines in the replaced set need to be written back to the main memory. Also, the cache hit rate might be low after the replacement if the lines in the replaced set are needed again later. Actually, the replacement of blocks is needed for any cache system unless the size of the cache is unlimited; the difference in the proposed scheme is that in addition to the normal block replacement, replacement also occurs with a whole set of blocks. If memory access patterns are uniformly distributed across the clusters, the probability that the next access will introduce a set-replacement can be calculated as

    P = (1 − h) × (1 − (Nsnp − Nread)/(Nbus − Nread)) × p_shared × p_writable

where h is the hit ratio, and p_shared and p_writable are the probabilities that the requested block is shared and writable, respectively. The term (Nsnp − Nread)/(Nbus − Nread) is the probability that a block is in one of the clusters whose memory is being snooped by the node, so 1 − (Nsnp − Nread)/(Nbus − Nread) is the probability that a block belongs to a cluster that the node is not snooping on. If we choose h = 0.9, p_shared = p_writable = 0.5, Nbus = 8, Nsnp = 4, and Nread = 1, the above equation results in a value of 0.014. However, we believe that the locality of memory access patterns will cause a processor to use shared blocks from only a few clusters at a time. If increased locality reduces the probability of a memory access falling in an uncached cluster to 0.1 instead of 1 − (Nsnp − Nread)/(Nbus − Nread), the above equation yields 0.0025. That is, the probability of a set-replacement should be very low for a reasonable value of Nsnp, so the overhead is not significant. The simulation results which will be shown later also indicate that in most cases the extra replacement time is not critical to performance. Increasing the reference locality by a different algorithm or a better memory placement can avoid this problem.
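The two numbers quoted above follow directly from the expression for P; the few lines below simply reproduce the arithmetic with the values given in the text.

    def set_replacement_prob(h, n_bus, n_snp, n_read, p_shared, p_writable,
                             p_uncached_cluster=None):
        # Probability that the next access misses, is to a shared-and-writable
        # block, and falls in a cluster the node is not currently snooping.
        if p_uncached_cluster is None:
            p_uncached_cluster = 1 - (n_snp - n_read) / (n_bus - n_read)
        return (1 - h) * p_uncached_cluster * p_shared * p_writable

    print(set_replacement_prob(0.9, 8, 4, 1, 0.5, 0.5))                          # ~0.014
    print(set_replacement_prob(0.9, 8, 4, 1, 0.5, 0.5, p_uncached_cluster=0.1))  # 0.0025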
Moreover, all nodes can be caching a single cache line if each of them is willing to use one receiver for snooping that bus.

Table 6.2. Pros and cons of the three cache coherence schemes.

However, what the MSU system cannot do effectively is caching a little bit of every node's memory, even if any particular cache line is only being used by some small number of nodes. On the other hand, a directory scheme can easily handle the case where every node is caching some small amount of memory from every other node, as long as any particular cache line is only being cached by a small number of nodes. However, it cannot easily handle the case where many nodes want to cache a particular piece of data.

Generally, in real applications most data are either being cached by a small number of nodes or being globally shared [106, 107]. The MSU system can handle the latter case very well; the same case causes problems for directory schemes. For the former case, MSU's behavior will depend on the distribution of the data, so the system may or may not handle the case effectively. However, because of the locality property of most applications and the help from the read-only bus in our system, we expect the MSU system to perform well in both sharing patterns. Also, distribution of data can be solved in software relatively easily.

Table 6.3 depicts the effects of reference patterns and system parameters on the performance of the different systems. In general, a snoopy scheme will perform better than a directory scheme when shared data are frequently accessed. For example, consider the length of a linked list in SCI for hundreds of nodes: if a modification in one cache has to inform all the others, hundreds or thousands of operations and messages will occur in the system. Many other conditions are also considered in Table 6.3. One interesting phenomenon of the partial snooping in the MSU system is that when Nsnp decreases, the number of cache sets decreases also, so the number of lines in each cache set increases, and that can increase the hit rate. Therefore, sometimes the system performance will be better when the number of snooped busses is reduced.
Table 6.3. Effects of reference patterns and system parameters on the performance of the three coherence schemes.

6.3.3 Performance Expectation

As a summary of this qualitative discussion, Figure 6.5 shows the expected comparative performance of a directory coherence system and a partial-snooped MSU system for various system sizes. When the system size grows, the number of processors, the network bandwidth, and the number of snooped busses in MSU increase together. When the system size is small, the directory coherence system is expected to perform better because the number of snooped busses in MSU is not yet large enough for caching the working set, and the frequent cache-set replacement may reduce the hit rate and generate overhead. But when the system size is large enough to support enough snooped busses, the performance of MSU will soon surpass that of the directory coherence system due to shorter memory access latency. Furthermore, when the system size keeps growing, the long memory access latency of a directory scheme prevents the system from gaining performance, and even causes performance to decrease because of the longer communication overhead. However, the MSU system can utilize the increased processors and network bandwidth to gain performance. This expectation will be verified later by both simulation and analysis, and the same relative performance shapes can be found in both Figures 7.17 and 7.21.

Figure 6.5. The expected comparative performance of the two schemes under different system sizes.

6.4 Implementation by Optical Networks

In addition to traditional electronic busses, the multiple busses of MSU can also be implemented by the multiple channels of a state-of-the-art optical network. The proposed coherence solution is particularly attractive considering the high bandwidth and channel-tunable capability of fiber optic networks. As mentioned in Section 3.4, current advances in lightwave technology make the potentially usable bandwidth of an optical fiber on the scale of terahertz. The large bandwidth can be efficiently exploited by wavelength division techniques to partition the available bandwidth into multiple channels.
With current technology, optical wavelength-tunable transmitters and receivers can tune to different optical frequencies (wavelengths) within tens of nanoseconds. Multiwavelength optical networks allow multiple packets on different wavelengths to travel in one optical fiber without interfering with each other [92]. A node on MSU can snoop multiple channels by using multiple tunable receivers. The star topology constructed by a star-coupler is chosen because of its scalability and the fact that it is a passive device. Figure 6.6 shows the architecture of an optically connected MSU system with tunable transmitters and tunable receivers.

Figure 6.6. Optically-connected MSU system. One tunable transmitter, two fixed receivers and several tunable receivers are used in each node.

CHAPTER 7

Performance Evaluation

This chapter compares the performance of the MSU system and a directory-based system by using simulation and queueing analysis. The simulation results were derived by implementing two architectural simulators on top of the Tango simulation system introduced in Chapter 4 [108]. The analytical results were derived by calculating the mean service times of the processor and network servers of the queueing models described in Section 5.4. The methodology and results from simulation are presented first.

7.1 Simulated Directory-Based System

Two architectural simulators were developed. The first models the MSU system described in Chapter 6. The other adopts a directory-based coherence scheme for comparison. A full-map, bit-vector directory scheme was chosen in our simulation because, in general, it achieves the best performance among directory schemes for large-scale systems [28].

The network of the simulated directory-based system has been chosen carefully for a fair comparison with a bus-based system. Adopting the address-separated multiple bus network would be unfair to the directory-based system because the constraint on the choice of the busses according to the address of an access should not be applied to directory schemes. The reasons for using the address-separated busses on the MSU system are to ease the implementation of the snoopy scheme and to make partial snooping possible. Nevertheless, as shown in Chapter 5, the address-separated network has smaller bandwidth than a fully-connected multiple bus network. Thus, for a fair comparison with the MSU system, the network of the directory-based system should possess the same bandwidth as a fully-connected multiple bus network, so that a transmission can be carried on any available bus.

Since a multistage interconnection network is a natural choice for a directory-based system, we can present the network of the directory-based system in the same way. Figure 7.1 shows the network of the directory-based system with four nodes and two channels. When a node needs to transmit data, it finds an available channel through either one of the two 4-to-1 switches. If it succeeds, a path is built and the path will be held for the reply, if any, until the end of the session. If it fails, the node will wait for one of the channels to become available. Since the channels are analogous to the busses on the MSU system, for simplicity, in the following only bus or busses will be used to refer to either the busses of the MSU system or the channels of the directory-based system; and Nbus will refer to the number of either the busses or the channels.
For example, the system in Figure 7.1 will be described as Nnode = 4 and Nbus = 2.

Multistage interconnection networks normally have neither broadcast nor combining capability built into the switches. To prevent the MSU system from gaining an advantage in the comparison from its use of broadcast, we chose communication parameters that keep the directory scheme's communication costs low. For example, the second and subsequent invalidation messages during an invalidation operation require half the time of the first message. This choice represents an implementation where most of the overhead of the multicast invalidation communication is incurred by the first message. Also, the switching time for the MIN-based directory system is assumed to be the same as the bus arbitration time in the MSU system.

Figure 7.1. The simulated network for a directory-based system with Nnode = 4 and Nbus = 2.

7.2 Architectural and Timing Assumptions

The architectural assumptions for both simulators are listed in Table 7.1. Since the MSU system uses a set-associative cache, the same was used for the directory scheme. The replacement policy for the cache lines and the cache sets on the MSU is a clock algorithm, which marks a space as unavailable for replacement when the space is accessed. The algorithm sequentially checks all places to look for an available space for the incoming item. If the currently pointed space is unavailable, the algorithm sets it available before checking the next one, so the next time this space is pointed to, it will be chosen for replacement if it has not been accessed before being pointed to again. As mentioned in previous chapters, the Berkeley Ownership protocol and the circuit switching technique were adopted. Although relaxing the consistency ordering requirement can improve performance in both systems, the optimal performance would be gained by different relaxation strategies in the two systems. To make the simulation experiments fair and simple, sequential consistency is used in the simulation for both systems.

Table 7.1. Default architectural assumptions.
  One word: 4 bytes
  One cache line: 16 bytes
  Main memory page size: 1K bytes
  Data cache capacity (per processor): 256 lines
  Cache placement policy: set-associative with degree of 256 / (number of snooped busses)
  Replacement policy: clock
  Coherence protocol: Berkeley Ownership (a write-invalidate protocol)
  Switching technique: circuit switching
  Consistency ordering: sequential

Because the network bandwidth is important in deciding the system performance, our simulation considered the contention in the networks of both systems. The MSU system has greater opportunity for bus contention because each access has to travel through a particular bus. Memory contention will never occur in the address-separated network for the same reason as in a single-bus network. However, memory contention is still possible for the directory-based system when two requests from different busses (channels) target the same memory module in a node, but this contention was ignored in our simulation for simplicity. Because sequential consistency is chosen in our simulation, each node in both the MSU and the directory-based system can only set up one path in the network at a time.
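For concreteness, the clock replacement policy described above can be sketched as follows. Only the marking and sweeping behavior is taken from the text; the class and method names are illustrative.

    class ClockSet:
        """One cache set (or the collection of cache sets) managed by the clock policy."""
        def __init__(self, n_ways):
            self.available = [True] * n_ways   # True means a candidate for replacement
            self.hand = 0

        def touch(self, way):
            # an access marks the space as unavailable for replacement
            self.available[way] = False

        def victim(self):
            # sweep: an unavailable space is set available and skipped, so it will be
            # chosen the next time around unless it is accessed again first
            while not self.available[self.hand]:
                self.available[self.hand] = True
                self.hand = (self.hand + 1) % len(self.available)
            way = self.hand
            self.hand = (self.hand + 1) % len(self.available)
            return way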
However, each node of the directory-based system in Figure 7.1 can handle multiple requests coming from the left side of the node at the same time, which gives a distinct advantage to the directory scheme. Note that the network of the directory-based system has the same bandwidth as the theoretical Network Model 6 in Figure 5.2. Even though the simulation gives an advantage to the directory scheme, the results will demonstrate that the MSU system can still perform better.

Following the notation used by Chaiken et al. [28], we denote the simulated directory scheme as DirooNB_Nbus, which means that in the directory scheme (Dir) there is no limitation ("oo") on the number of processors that can cache a line, and that when a node invalidates the same cache line in other nodes, it only informs the nodes which cached the line rather than broadcasting to all nodes (NB); otherwise a lot of traffic would be generated on the Nbus-channel network. The MSU system will be denoted by MSU_Nbus^Nsnp (Nbus as subscript, Nsnp as superscript) later in order to specify the values of Nsnp and Nbus. Because of the strong application-dependence, no read-only busses were simulated for the MSU system; that is, we always assumed Nread = 0 in the experiments. Elimination of those busses gives yet another advantage to the directory scheme.

The default numbers in processor clock cycles for memory access latency are listed in Table 7.2. These numbers were derived by assuming that in a system with N MHz processors, the bandwidth of a single bus is approximately N Megabytes per second. The numbers in the upper part of the table are used for both systems, and the lower part shows the overhead which only occurs in directory schemes. The table has only the default numbers; the sensitivity of performance to a variety of latency values has also been studied and the results will be discussed in the following sections. Table 7.3 lists the time needed for read and write operations when there is no network contention, replacement, or write back.

In the simulation experiments for evaluating the network bandwidth, described in Chapter 5, a constant communication time was applied to every memory reference. Now, in evaluating the multiprocessor systems, an option of Tango is turned on to invoke the corresponding architectural simulator only when a shared memory reference is detected. That is, without losing any generality in a NUMA system, we assumed that all instruction fetches and private data references always hit in the processor cache, so the simulator can focus on shared data accesses. This choice allows us to simulate the most critical characteristics while allowing a simulation run to complete in hours rather than days.

Table 7.2. Default memory access latency for primitive operations in processor cycles.
  Parameter | Operation                            | Size     | pcycles
  tarb      | Bus arbitration                      |          | 2
  tcache    | Read and write cache hit             | one word | 1
  tinv      | Invalidation                         | one line | 4
  treq      | Access request on bus                | one line | 4
  trpy      | Access reply on memory and bus       | one line | 32
  trpm      | Cache line replacement overhead      | one line | 2
  twbl      | Write back to local main memory      | one line | 5
  twbr      | Write back to remote main memory     | one line | 20
  Extra overhead for full-map bit-vector directory schemes:
  tdinv     | Invalidation after the first one     | one line | 2
  tdloc     | Local directory look up and update   |          | 8
  tdrmt     | Remote directory look up and update  |          | 40

7.3 Benchmark Programs

Three benchmark programs (matrix multiplication, Gaussian elimination, and heat transfer) were implemented in our simulation experiments, and their descriptions can be found in Table 7.4. The Mat and Gauss programs have been widely used as benchmarks in many memory architecture studies [54, 106, 109, 110]. The Heat program was selected because its memory access locality allows us to study the impact of data placement. All of the benchmarks are common applications and each one represents one category of applications. In addition, the programs and the reference patterns are quite straightforward, which allows better reasoning about the simulation results. As mentioned in Section 4.4, the time-consuming, instruction-level simulation limited the size of our benchmarks. The default problem sizes are shown in Table 7.4, but the sensitivity to larger problem sizes was also studied in some experiments.

Table 7.3. Time needed for the read and write operations of the two coherence schemes when there is no network contention, replacement, or write back.

Table 7.4. Descriptions of the benchmark programs and the default problem sizes.
  Mat: Multiplication of two 100 x 100 dense matrices. One processor takes the task of generating the results of one row at a time.
  Gauss: Using the Gaussian elimination method to transform a 100 x 101 matrix into an upper triangular matrix.
  Heat: Heat transfer calculation by the Jacobi method with double buffering on a 64 x 64 grid. Deriving the new value of a grid point takes data from its four nearest neighbors. Processors are synchronized after each round of calculation, and the global difference is accumulated every 20 local calculations. The convergence tolerance was set to 10^-3.

The characteristics of each of the benchmarks are presented below. For the Mat program, almost all accesses are for shared data. All the items in the two multiplying matrices are read-only, and every item of the result matrix is only cached in one processor. In other words, almost no invalidation operations occur in executing the Mat program.

The Gauss benchmark is an example of programs that are difficult to parallelize well. The load balancing among processors may become biased after calculating for a while. That is, some processors will be idle or only executing synchronization operations while others are busy in computation.
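Before discussing the Heat program's sharing behavior, its kernel (the Jacobi entry in Table 7.4) can be sketched as follows. This is a simplified, sequential sketch: the per-processor block partitioning, the barrier after each round, and the accumulation of the global difference every 20 local calculations are omitted, and the four-neighbor average and convergence test are the standard formulation rather than a transcription of the benchmark source.

    def jacobi_step(old, new):
        # new interior point = average of its four nearest neighbors in the old grid
        n = len(old)
        diff = 0.0
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                new[i][j] = 0.25 * (old[i-1][j] + old[i+1][j] + old[i][j-1] + old[i][j+1])
                diff += abs(new[i][j] - old[i][j])
        return diff

    def heat(grid, tol=1e-3):
        other = [row[:] for row in grid]        # double buffering: two copies of the grid
        while True:
            diff = jacobi_step(grid, other)
            grid, other = other, grid           # swap the roles of the two buffers
            if diff < tol:
                return grid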
For the Heat program, only the information of grid points on the boundary between two adjacent processors is shared; no grid point information has a high degree of sharing. The global difference is the only variable shared by all processors, but this shared variable is modified by each processor only once every 20 rounds of calculation because each processor first accumulates the local difference among its points.

As stated in Table 6.3, the characteristics of the selected programs ("no invalidation", "biased load balancing", and "low sharing") should all benefit the directory scheme. The data were stored in row-major order from the matrices into memory pages, and the pages were interleaved across nodes so each node could store a portion of the global data even for these small-size problems. A node may need data from several nodes at a time to complete some computation. This default memory placement can balance the network load for both architectural simulators, but causes problems for partial-snooping MSU when Nsnp is not large enough. In other words, in addition to the simulator itself, we again favored the directory scheme in the choice of the benchmark programs and the memory placement.

7.4 Simulation Results

This section provides the simulation results that show the performance of the MSU and the DirooNB systems. The purpose of the simulation experiments is twofold: one is to compare the two schemes, and the other is to study the behavior of the MSU architecture. This section begins by showing the comparison of the directory scheme with full-snooping MSU, and then the performance of MSU under various degrees of partial-snooping. The impact of various system parameters is then studied in the following subsections. Determining the sensitivity of the system to different parameter values is essential to demonstrate the validity of the simulations. One can take exception to individual simulation parameter choices, but the parameter sensitivity studies indicate that the conclusion is relatively independent of the parameter choices. In the last subsection the performance under various numbers of processors will indicate the scalability of the two coherence schemes.

7.4.1 Comparison of Full-Snooping MSU and DirooNB

Full-snooping MSU means that the system snoops on all busses, but the network is still the address-separated multiple bus. On the other hand, the DirooNB uses the network shown in Figure 7.1.

Figure 7.2 shows the cache hit rates for the three benchmarks used in the simulation experiments for both schemes. The data are collected from the simulation results of MSU, but the difference in hit rate between MSU and DirooNB is very small and can be ignored; the small difference is caused by the different memory access sequences in simulation. In Figure 7.2, when the number of busses increases, the hit rates drop. That decrease occurs because the degree of set-associativity in the cache placement policy decreases with a larger number of busses.

Figure 7.2. Average shared data cache hit rate for the three benchmarks for various numbers of busses when Nnode = 32.

The normalized execution time used in Figure 7.3 and the following figures was derived by dividing an execution time under a certain number of nodes (Nnode) using the specific scheme by the execution time of MSU_Nbus^Nsnp with Nsnp = Nbus = Nnode/2 for the same benchmark program.
For example, when Nnode = 16 the following calculation is used:

normalized execution time of DirooNB_16 = (exe. time of DirooNB_16) / (exe. time of MSU_8^8)

The reason for using normalized execution time is to scale the performance results so that results using different numbers of nodes can fit in the same figure. The relative shapes under the same number of nodes are preserved.

Figure 7.3 shows the performance for the Mat benchmark. The figure shows that MSU is better in all cases, but the difference is reduced as the number of busses increases. That reduction in difference occurs because the directory scheme can use the added busses more effectively, as stated in Table 6.3. When Nbus = 16 and Nnode = 32, the two schemes achieve almost the same performance. However, the fraction Nbus/Nnode = 1/2 is unusually high; 1/4 or 1/8 is probably a more practical number. With a more reasonable number of busses, the performance of the snoopy scheme is far better than the performance of the directory scheme.

Figure 7.3. Normalized execution time of full-snooping MSU and DirooNB_Nbus under various numbers of busses for the Mat benchmark.

Figure 7.4 shows the same comparison as Figure 7.3 but for the Gauss and Heat benchmarks. In these cases, when Nnode = 32, the directory scheme outperforms MSU when the number of busses equals 16 for Heat, and both 8 and 16 for Gauss. A similar pattern occurs when Nnode = 16. The better performance of the directory scheme in these instances should be mainly due to the low data sharing rate of these programs. However, in all other situations, the snoopy scheme on the MSU performs better. The above figures also indicate that the relative patterns of the MSU and Dir curves are generally similar under different numbers of nodes.

Figure 7.4. Normalized execution time of full-snooping MSU and DirooNB_Nbus under various numbers of busses for the Gauss (a) and Heat (b) benchmarks.

7.4.2 Various Degrees of Snooping on MSU

Figures 7.5, 7.6 and 7.7 show the execution time and hit rates on MSU for various degrees of partial-snooping (various Nsnp) when Nnode = 32 for Mat, Gauss and Heat, respectively. For comparison purposes, the execution time of a directory scheme with Nsnp busses is shown using a dashed line.

In the (a) parts of Figures 7.5, 7.6 and 7.7, the curves show that at some values of Nsnp, partial-snooping may achieve better performance than full-snooping on MSU. In these situations, the MSU with partial-snooping can benefit from both scalability and performance. This is a very encouraging experimental result because it shows the usefulness of the concept of partial-snooping. For Mat (Figure 7.5) and Heat (Figure 7.7) with Nnode = 32, MSU_16^Nsnp can outperform DirooNB_16 at some values of Nsnp, but not for Gauss (Figure 7.6).
This observation means that if the computation and communication load of an application cannot be balanced, such as in Gauss, the DirooNB can benefit from the freedom of using all the channels (busses), while the MSU system suffers from the heavy communication traffic caused by the low hit rates (shown in Figure 7.2).

The reason for the better performance of partial-snooping can be observed in the hit-rate figures (Figures 7.5(b), 7.6(b) and 7.7(b)). They show that the best hit rate usually occurs at a certain degree of partial-snooping, not at the point of full-snooping. One of the reasons for this phenomenon is that the degree of set-associativity increases as Nsnp decreases for a fixed cache size. The other reason is that while one node can only cache data from a few clusters at a time, the cache size for each cachable cluster is larger, which results in a better hit rate if the number of cache sets is sufficient. With the ability to dynamically cache from all clusters, the size of the cache is virtually larger with partial-snooping, which is similar to the effect of virtually larger main memory in a virtual-memory system.

Figure 7.5. Effect of various degrees of partial-snooping of MSU when Nnode = 32 on normalized execution time (a) and cache hit rate (b) for the Mat benchmark under default parameters. The dashed line in (a) shows the execution time of DirooNB with Nsnp busses for comparison.

Figure 7.6. Effect of various degrees of partial-snooping of MSU when Nnode = 32 on normalized execution time (a) and cache hit rate (b) for the Gauss benchmark under default parameters. The dashed line in (a) shows the execution time of DirooNB with Nsnp busses for comparison.

Figure 7.7. Effect of various degrees of partial-snooping of MSU when Nnode = 32 on normalized execution time (a) and cache hit rate (b) for the Heat benchmark under default parameters. The dashed line in (a) shows the execution time of DirooNB with Nsnp busses for comparison.

The thrashing of cache sets can be observed in Figure 7.7(b) when Nbus = 16 and Nsnp = 2. The number of cache sets is seriously insufficient in this case, so the hit rate drops to about 30%. It also makes the performance at that point poor, which can be observed in Figure 7.7(a). This observation indicates that when the value of Nsnp becomes very low, the extra overhead of the cache-set replacement can become significant. It can also be observed in Figure 7.7(a) that the 8-bus MSU outperforms the 16-bus MSU at that point, which indicates that when the number of snooped busses is insufficient on the MSU system, increasing the total number of busses may even decrease the hit rate and performance. Section 7.4.5 will show how memory mapping can solve this problem.
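The set-associativity effect underlying these hit-rate observations follows directly from the cache organization in Table 7.1: the fixed 256-line cache provides one set per snooped block cluster, so fewer snooped busses mean fewer but larger (more associative) sets. A small illustration, with a helper name of our own:

    def cache_geometry(n_snp, total_lines=256):
        # one cache set per snooped block cluster (Table 7.1 placement policy)
        sets = n_snp
        lines_per_set = total_lines // sets
        return sets, lines_per_set

    for n_snp in (2, 4, 8, 16):
        print(n_snp, cache_geometry(n_snp))   # e.g. Nsnp = 2 gives 2 sets of 128 lines each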
7.4.3 Impact of Faster Communication

The MSU system uses multiple busses instead of one high-speed, large-bandwidth bus to reduce communication contention. However, we still studied the impact of busses with larger bandwidth by using the Mat and Heat benchmarks. The choice of only two benchmarks is for brevity; the Gauss benchmark was discarded because of its uncommonly low hit rate. Figures 7.8 and 7.9 show the performance results for the Mat and Heat benchmarks when the default bus speed is doubled with Nnode = 32; that is, the latency values of tinv, treq, trpy, twbr, tdinv, and tdrmt are reduced to half of the default values in Table 7.2. Though the execution times changed, the performance patterns of the normalized execution time curves in the figures are almost identical to the patterns under the default bus bandwidth shown in the previous subsections, which indicates that the bandwidth of the network has little impact on the relative performance results, in either a full-snooping or a partial-snooping system.

Figure 7.8. Comparison of full-snooping MSU and DirooNB_Nbus for different numbers of busses when bus communication latency is reduced to half. Normalized execution time for the Mat (a) and Heat (b) benchmarks is shown.

Figure 7.9. Effect of various degrees of partial-snooping of MSU when the communication latency is reduced to half. Normalized execution time when Nnode = 32 for Mat (a) and Heat (b) is shown. The dashed lines show the execution time of DirooNB with Nsnp busses.

7.4.4 Impact of Shorter Directory Access Latency

Two major overheads of a directory scheme are the directory look-up and update latencies, neither of which is required in a snooping scheme. Although we believe the default values of the directory access latency reflect practical numbers, the performance of directory schemes with shorter directory access latency was also studied; reducing the directory access latency favors the directory scheme relative to the snoopy scheme. Figure 7.10 shows the performance of DirooNB when the directory access times, tdloc and tdrmt, are reduced to half of the default values. The results are again from the Mat and Heat benchmarks when Nnode equals 32. The dashed lines show the performance of Dir and MSU under default parameters for comparison purposes. It can be seen that the directory access latency does affect the performance of directory coherence schemes; however, reducing it by half is not sufficient to allow the directory scheme to outperform MSU in most cases. In general, the relative performance of Dir and MSU remains the same when the directory access latency is reduced by half.

7.4.5 Impact of Memory Placement

The placement of memory may significantly affect the performance of a NUMA machine, including the architecture discussed in this dissertation, due to the different access latencies of local and remote memory [111].
The impact of memory mapping on MSU and DirooNB is studied in this subsection. For some applications, such as matrix multiplication and Gaussian elimination, it is difficult to predict what the best memory mapping pattern would be because there is no obvious method to decide which processor will access a certain datum most frequently. The decision of memory placement is even more difficult if the application is well load-balanced among processors (the Mat and Gauss benchmarks, for example). In a well load-balanced application under a shared-memory environment, each processor takes a portion of the whole task at the beginning, and then the remaining subtasks are taken one by one by processors as they finish their computation. Since the task division is decided dynamically at run time, it would be very difficult to find a good memory mapping pattern beforehand. Dynamically migrating memory pages among nodes might gain performance for these applications, but the migration process becomes overhead itself. The issue of page migration on cache-coherent systems has been studied by Gupta et al. [112], also via the Tango simulation system.

Figure 7.10. Execution time of DirooNB_Nbus when the directory look up and update latency is reduced to half for Mat (a) and Heat (b). Dashed lines show the execution time under default parameters for comparison.

A different way has been chosen to study the memory placement problem: specifying the static location of a certain memory space in the programs, which is the trend supported by many high-level parallel languages. The heat transfer problem is a good example for studying memory mapping because each processor can compute the temperature of one block in the working domain and the temperature data of the block can be stored locally. Only the data of the boundary elements of the block are shared by neighboring processors, and one processor only shares temperature data with at most four neighboring processors. The Heat benchmark used previously does let each processor compute one block of the domain and share data with neighbors, but no particular mapping was done. The matrix elements were placed in row-major order in consecutive pages, so one processor may need data stored in several different nodes. This row-major memory mapping was chosen so we could study the effect of insufficient snooped busses on MSU, as shown in Figure 7.7. It is, in fact, an example of a program which should perform poorly on the MSU system. To study memory mapping, the Heat benchmark was modified into two new versions, mapped-Heat and optimally-mapped-Heat. Both programs store the memory of one matrix block consecutively in one node, but mapped-Heat does not guarantee the memory will be stored locally in the node that computes the block, while optimally-mapped-Heat does.

Figure 7.11 shows the execution time of the new versions of the Heat benchmark. To better indicate the relationship between the different versions, the actual execution time instead of normalized time is used; scaling was unnecessary here since the number of nodes is fixed. Figure 7.11(a) shows the execution time of full-snooping MSU and the DirooNB.
Compare the results with the performance of the previous row-major mapped Heat shown in Figure 7.4(b). Observe that after memory mapping the MSU always outperforms the DirooNB, even though the new versions generate higher hit rates, so the directory access latency becomes less significant in DirooNB. The directory scheme gains more in changing from mapped-Heat to optimally-mapped-Heat due to the high locality leading to both faster data and directory accesses. Figure 7.11(b) shows the performance of partial-snooping MSU under various Nsnp. Since in the new versions each node only needs temperature data from a few other nodes, reducing the value of Nsnp to four does not increase the execution time, which is especially obvious in the optimally-mapped-Heat. However, when Nsnp = 2 the insufficiency of the snooped busses increases the execution time on the MSU system.

Figure 7.11. Impact of memory placement on the memory-mapped Heat benchmark when Nnode = 32. (a) shows the comparison of full-snooping MSU and DirooNB on different numbers of busses. (b) shows the effect of various degrees of partial-snooping of MSU.

7.4.6 Impact of Larger Cache Size

We chose a small cache size of 4K bytes per node in the simulator to match our choice of small benchmarks so simulations could complete in reasonable time. The impact of larger cache memory is studied in this subsection. Figure 7.12 shows the comparison of full-snooping MSU and DirooNB when the cache size is increased to 16K bytes per node for the Mat and Heat benchmarks. The performance pattern is the same as that for the 4K-byte cache size (Figures 7.3 and 7.4(b)), but the directory scheme performs worse when the number of busses is small.

Figure 7.12. Performance of full-snooping MSU and DirooNB_Nbus under various numbers of busses when the cache size is increased to 16K bytes per node for the Mat (a) and Heat (b) benchmarks.

When the cache size is larger, the insufficiency of snooped busses on MSU affects the performance more significantly because the cache-set write back is more expensive: more dirty cache lines may need to be written back to memory in a set replacement. In addition, the frequent changing of the snooped clusters dominates the benefit of the larger cache size. Figures 7.13 and 7.14 show the performance of the MSU under various degrees of partial-snooping when the cache size is 16K bytes for Mat and Heat, respectively. The hit-rate figures illustrate that the best hit rates now occur at full-snooping, no longer at a certain degree of partial-snooping, due to the above reasons. However, partial-snooping can still benefit from larger caches in many situations. As we have mentioned, the problem size of our benchmarks is small in comparison to real-world problems.
Figure 7.15 shows the performance of MSU with 16K-byte caches under various Nsnp when the size of the matrices in Mat becomes 200 x 200. When Nbus is sixteen, the best performance is again found at Nsnp = 8, not at full-snooping. The hit rates (Figure 7.15) also reflect the performance trend. Furthermore, partial-snooping can benefit from larger caches when a better memory placement is adopted. Figure 7.16 shows the performance of MSU with 16K-byte caches under various Nsnp when optimally-mapped-Heat is executed. The best hit rates are found when the number of snooped busses is four, for both 8 and 16 busses.

Figure 7.13. Effect of various degrees of partial-snooping of MSU when the cache size is increased to 16K bytes per node. Normalized execution time (a) and cache hit rate (b) for Mat when Nnode = 32 are shown. The dashed line in (a) shows the execution time of DirooNB with Nsnp busses.

Figure 7.14. Effect of various degrees of partial-snooping of MSU when the cache size is increased to 16K bytes per node. Normalized execution time (a) and cache hit rate (b) for Heat when Nnode = 32 are shown. The dashed line in (a) shows the execution time of DirooNB with Nsnp busses.

Figure 7.15. Impact of larger problem size with larger cache size. Normalized execution time (a) and cache hit rate (b) for Mat with 200 x 200 matrices and a 16K-byte cache when Nnode = 32 are shown.

Figure 7.16. Impact of larger cache size and better memory mapping. Normalized execution time (a) and cache hit rate (b) for Heat with optimum memory mapping and a 16K-byte cache when Nnode = 32 are shown.

7.4.7 Performance on Various System Sizes (Scalability)

The performance of MSU and DirooNB under various system sizes, from 8 to 64 processors, is shown in this subsection to examine scalability. In our experiments, the number of busses is half the number of processors (Nbus = Nnode/2). For MSU, the number of snooped busses is half the number of busses (Nsnp = Nbus/2 = Nnode/4). Figure 7.17(a) shows the results from default Mat and from Mat with larger matrix size and 16K-byte caches, and Figure 7.17(b) shows the results from Heat with better memory mapping. For default Mat and Heat, the difference between Dir and MSU is
not significant when the number of processors is 32 or larger (Heat is not shown on the figure to avoid clutter). When there are eight processors, for all applications the execution time of MSU is always greater than that of the directory scheme because using only the two snooped busses is not enough to snoop on the working sets of the applications. For larger-size Mat and for Heat with memory mapping, on more than 8 processors the execution time of MSU drops fast, while the execution time of the directory scheme decreases slowly or even increases with more processors and busses.

Figure 7.17. Performance of MSU and Dir under various numbers of processors (Nnode). Nbus is set to Nnode/2; and, for MSU, Nsnp = Nnode/4. (a) is from the Mat benchmark, and (b) is from memory-mapped Heat.

7.5 Mean-Value Performance Analysis

Approximate mean-value performance analysis has successfully evaluated the performance of some cache-coherent multiprocessors [7, 44, 113, 114]. In general, these studies developed an equation for the mean processor response time based on mean-value equations for bus waiting time, memory interference and cache interference. In most cases direct calculation is difficult because these equations recursively depend on each other. The numerical quantities of the mean response times can be derived by supplying some initial values and then solving the equations iteratively until the results converge.

The mean-value method used in this section to analyze the MSU and the DirooNB systems is different from the method just mentioned. Since the queueing models of the networks of the two multiprocessor systems have been solved in Section 5.4, the performance of the multiprocessor can be easily evaluated by applying accurate mean service times to the processor and the network servers of the two queueing models. This section shows the calculation of the service times of the queueing servers and discusses the analytical results. Although the subtle details of the multiprocessors are missed in the analytical models, this approximate analysis leads to a very efficient evaluation for large-scale systems. Compared to tens of hours for running one simulation experiment, the time spent in computing one analytical result is dramatically reduced, to within one second.

7.5.1 Workload Parameters

The workload of a system is decided by probabilities and latency constants. The following basic constants are specified for the workload model:

- tcomp is the mean processor computation time between memory requests.
- The memory access latencies, including tarb, tcache, tinv, treq, trpy, trpm, twbl, twbr, tdinv, tdloc and tdrmt, have been defined in Table 7.2 for simulation. The same values are used again in the analysis.
- Nlines/set is the number of cache lines per cache set on the MSU system.

The following probabilities are used in the analysis:

- pprivate and pshared are the probabilities that a memory access is for private and shared memory, respectively.
- plocal and premote are the probabilities that a shared memory access is located locally and remotely, respectively.
- phit and pmiss are the probabilities of a cache hit and a cache miss.
- pwrite and pread are the probabilities that a shared access is a write access and a read access, respectively.
- pex and pn_ex are the probabilities that a node owns a cache line exclusively and non-exclusively, respectively, given that the node issues a write access on the line.
- prpm is the probability that there is no vacant space in the cache for a cache-miss access, so that a cache line replacement has to be invoked.
- pset_hit and pset_miss are the probabilities that the cache set of a new cache line does and does not exist on the MSU system, respectively, given that the access needs to invoke cache line replacement.
- pdirty is the probability that a replaced cache line is dirty.

7.5.2 Mean Times for the MSU System

The mean time between two memory requests issued by a processor, denoted as T, is the sum of the computation time and the memory access time. The mean memory access time is divided into private memory access time, Tprivate, and shared memory access time, Tshared. The equation for T is

T = tcomp + pprivate Tprivate + pshared Tshared.    (7.1)

As in the simulation experiments, private accesses are always assumed to be cache hits, thus

Tprivate = tcache.    (7.2)

The shared accesses can be divided into the time for cache-hit accesses and cache-miss accesses, and the cache-miss accesses may need to invoke replacement. So Tshared can be represented as:

Tshared = phit Thit + pmiss (Tmiss + prpm Trpm).    (7.3)

The time for a cache-hit access can be derived from

Thit = pread Tread_hit + pwrite (pex Tex_hit + pn_ex Tn_ex_hit).    (7.4)

The latencies Tread_hit, Tex_hit, Tn_ex_hit, and Tmiss have been defined in Table 7.3. They are listed below again for easy reference:

Tread_hit = Tex_hit = tcache,    (7.5)
Tn_ex_hit = tcache + tarb + tinv, and    (7.6)
Tmiss = tcache + tarb + treq + trpy.    (7.7)

Finally, the mean replacement time, Trpm, is the time trpm plus the write back time. The write back time depends on whether the set is hit or not. If it is hit, there is a probability that the replaced line is dirty. If not, all lines in a cache set need to be replaced, and again there is a probability for each line to be dirty. The mean time to write back a cache line to its main memory can be calculated as (plocal twbl + premote twbr). These observations about the mean replacement time yield

Trpm = trpm + pset_hit pdirty (plocal twbl + premote twbr) + pset_miss Nlines/set pdirty (plocal twbl + premote twbr)
     = trpm + pdirty (pset_hit + pset_miss Nlines/set)(plocal twbl + premote twbr).    (7.8)

Applying all the components of T to Equation 7.1 yields

T = tcomp + pprivate tcache + pshared { phit [pread tcache + pwrite (pex tcache + pn_ex (tcache + tarb + tinv))] + pmiss [tcache + tarb + treq + trpy + prpm trpm + prpm pdirty (pset_hit + pset_miss Nlines/set)(plocal twbl + premote twbr)] }.    (7.9)

To derive the mean service times for the queueing systems in Figure 5.9, the terms of T are separated into two sets, sprocessor and snetwork, which are the time spent in the processor and in the network, respectively. The sprocessor term should contain all terms with tcomp, tcache, tarb, trpm, and twbl in T; the snetwork term contains all terms with tinv, treq, trpy, and twbr in T.
Based on Equation 7.9, sprocessor and snetwork can be represented as

sprocessor = tcomp + pprivate tcache + pshared { phit [pread tcache + pwrite (pex tcache + pn_ex (tcache + tarb))] + pmiss [tcache + tarb + prpm trpm + prpm pdirty plocal (pset_hit + pset_miss Nlines/set) twbl] }, and    (7.10)

snetwork = pshared { phit pwrite pn_ex tinv + pmiss [treq + trpy + prpm pdirty premote (pset_hit + pset_miss Nlines/set) twbr] }.    (7.11)

Letting sprocessor be 1/μ1 and snetwork be 1/μ2 allows the queueing model in Figure 5.9 to generate the performance quantities for the MSU system.

7.5.3 Mean Times for the Directory-Based System

The equations for the DirooNB system can be derived similarly. Because of the differences between the MSU and the DirooNB, some equations have to be changed. As shown in Table 7.3, due to the overhead of the directory update, Tmiss and Tn_ex_hit should be redefined as:

Tmiss = tcache + tarb + treq + trpy + plocal tdloc + premote tdrmt, and    (7.12)
Tn_ex_hit = tcache + tarb + premote tdrmt + plocal tdloc + tinv + (nsharing - 1) tdinv,    (7.13)

where nsharing is the mean number of processors in which a shared line is cached at the same time.

Because no set miss will happen in DirooNB, Trpm simplifies from Equation 7.8 into

Trpm = trpm + pdirty (plocal twbl + premote twbr).    (7.14)

Applying the new Tmiss, Tn_ex_hit, and Trpm to Tshared in Equation 7.3, the T for DirooNB can be derived from Equation 7.1. Finally, the equations for sprocessor and snetwork for DirooNB can be calculated as:

sprocessor = tcomp + pprivate tcache + pshared { phit [pread tcache + pwrite (tcache + pn_ex tarb + pn_ex plocal tdloc)] + pmiss [tcache + tarb + plocal tdloc + prpm trpm + prpm pdirty plocal twbl] }, and    (7.15)

snetwork = pshared { phit pwrite pn_ex [premote tdrmt + tinv + (nsharing - 1) tdinv] + pmiss (treq + trpy + premote tdrmt + prpm pdirty premote twbr) }.    (7.16)

As mentioned in Section 7.2, the network of the DirooNB system has the same bandwidth as Model 6 in Figure 5.2. Letting sprocessor and snetwork be 1/μ1 and 1/μ2 of the queueing model in Figure 5.7 can generate the performance quantities for the DirooNB system.

7.5.4 Parameter Values Used in the Experiments

In the experiments reported in this section, we assumed pprivate = 0.5, phit = 0.75, pread = 0.75, and prpm = 0.8. The pex was assumed to be the value of (pwrite/nsharing), where nsharing was set to 2.0. According to the Berkeley protocol, pdirty should be the same as pex. To represent the spatial locality of memory accesses, plocal and pset_hit were chosen carefully. The plocal was set to l/(Nnode + l), where l is a locality factor and was assumed to be 10.0 in the experiments. The pset_hit was assumed to follow the cumulative function of an exponential distribution, and the value was generated by the function max(Nsnp/Nbus, 1 - e^(-λ Nsnp)), where λ was assumed to be 0.5. All other probabilities can be easily decided by using the above assumptions. All values of the probability parameters, together with the values of 1 - e^(-0.5 Nsnp), are listed in Appendix B for clarity. The same architectural assumptions and communication latencies used in the simulation experiments were used again in this analysis. One additional latency, tcomp, was assumed to be 10.
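Putting Equations 7.10 and 7.11 together with the parameter values above, the MSU service times can be computed directly, as in the sketch below. The complementary probabilities (pshared, pmiss, pwrite, pn_ex, premote, pset_miss) are taken here as one minus their counterparts, and Nlines/set = 256/Nsnp follows Table 7.1; the exact Appendix B values are not reproduced, so small differences are possible, and the function name is ours.

    from math import exp

    def msu_service_times(n_node=32, n_bus=16, n_snp=8, cache_lines=256):
        # latencies in processor cycles (Table 7.2) and tcomp = 10 (Section 7.5.4)
        t_comp, t_cache, t_arb, t_inv, t_req, t_rpy = 10, 1, 2, 4, 4, 32
        t_rpm, t_wbl, t_wbr = 2, 5, 20
        # workload probabilities (Section 7.5.4) and their complements
        p_private, p_hit, p_read, p_rpm = 0.5, 0.75, 0.75, 0.8
        p_shared, p_miss, p_write = 1 - p_private, 1 - p_hit, 1 - p_read
        n_sharing = 2.0
        p_ex = p_write / n_sharing
        p_n_ex = 1 - p_ex
        p_dirty = p_ex                                   # per the Berkeley protocol assumption
        locality = 10.0
        p_local = locality / (n_node + locality)
        p_remote = 1 - p_local
        p_set_hit = max(n_snp / n_bus, 1 - exp(-0.5 * n_snp))
        p_set_miss = 1 - p_set_hit
        n_lines_per_set = cache_lines / n_snp
        wb_lines = p_set_hit + p_set_miss * n_lines_per_set   # expected lines written back

        s_processor = (t_comp + p_private * t_cache + p_shared * (
            p_hit * (p_read * t_cache
                     + p_write * (p_ex * t_cache + p_n_ex * (t_cache + t_arb)))
            + p_miss * (t_cache + t_arb + p_rpm * t_rpm
                        + p_rpm * p_dirty * p_local * wb_lines * t_wbl)))   # Eq. 7.10
        s_network = p_shared * (
            p_hit * p_write * p_n_ex * t_inv
            + p_miss * (t_req + t_rpy
                        + p_rpm * p_dirty * p_remote * wb_lines * t_wbr))   # Eq. 7.11
        return s_processor, s_network

    print(msu_service_times())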
7.5.5 Analytical Results

Comparison of Full-Snooping MSU and DirooNB

The analytical results for full-snooping MSU and DirooNB with Nnode = 32 are shown in Figure 7.18. As in Section 5.4.3, the mean time between two memory requests is used to represent the ratio of execution times. Like the normalization procedure used for the simulation results, all performance quantities are divided by the mean time between requests of MSU_16^16. Observe that the results are very similar to the ones from simulation: the MSU system can outperform the Dir system, but the difference is reduced when the number of busses increases.

Figure 7.18. Analytical mean time between two memory requests of full-snooping MSU and DirooNB_Nbus with Nnode = 32 under various numbers of busses.

Various Degrees of Snooping on MSU

Figure 7.19 shows the analytical results for partial-snooping MSU systems when Nnode = 32. Because a fixed hit rate was used in all analytical experiments, the phenomenon of having the highest hit rate at a certain degree of partial snooping does not exist. Consequently, the curves are similar to the simulation results in Figures 7.13(a) and 7.14(a), where the cache size was increased in the simulation experiments. However, we can still observe that when the number of snooped busses is large enough to handle the working set of the local processor, using more snooped busses does not improve system performance.

Figure 7.19. Analytical mean time between two memory requests for various degrees of partial-snooping on MSU when Nnode = 32. The dashed line is for DirooNB with Nsnp busses for comparison.

One major advantage of using analysis instead of simulation is that the analysis can significantly reduce the computation time of the performance evaluation. Figure 7.20 shows an example of this by measuring a system with Nnode = 128. To generate the same figure from simulation, even if the host machine could afford that many active processes and other resources (such as semaphores), it might take weeks to run all the experiments on a workstation. In Figure 7.20, as before, the normalization was done based on the result of MSU_64^64. One can easily observe that the difference between the MSU and Dir is larger than when the system size is smaller.

Figure 7.20. Analytical mean time between two memory requests for various degrees of partial-snooping on MSU when Nnode = 128. The dashed line is for DirooNB with Nsnp busses for comparison.

Performance on Various System Sizes (Scalability)

Figure 7.21 shows the result of using the analytical models to verify the expectation about the MSU system made in Figure 6.5. The same assumptions used in the simulation experiments (Section 7.4.7) are adopted again. Because the results are from various numbers of processors, the mean time between requests cannot represent the execution time relations among systems with different numbers of processors. We need a better way to represent the execution times using the analytical results. The method used to transform the mean time between requests, T, into execution time is described as follows. Assume that on a uniprocessor system the execution time of a program is mT, where m is the number of memory requests made in the execution. If the program could be fully parallelized on an N-node multiprocessor, the execution time on the multiprocessor would be mT/N. However, most programs have serial parts
Similar to our expectation and to the simulation results, Figure 7.21 indicates that the execution time of the MSU system drops faster than that of the directory-based system as the number of processors increases.

[Figure 7.21. Performance of MSU and Dir under various numbers of processors (N_node). N_bus is set to ½ N_node and, for MSU, N_snp = ¼ N_node. The p is the ratio of the program that can be parallelized. Y axis: execution time ratio; X axis: number of processors (N_node); curves for p = 0.80, 0.90, and 0.95.]

7.6 Summary

In this chapter we first demonstrated the feasibility of the MSU system by simulation. Even though both the simulator parameters and the benchmark programs gave advantages to the directory scheme, the simulation results are promising. Not only can the proposed snoopy scheme with full snooping perform better than the directory scheme, but adding the partial-snooping concept can improve the performance in some instances. Several default parameters were examined to study the system further, and we derived the consistent result that when a high degree of parallelism can be achieved from either the software or the hardware, the MSU system can outperform the directory-based system.

In addition, an effective analysis was developed to calculate the mean delay times of both the proposed and the compared systems. The method we adopted can be used to evaluate large-scale multiprocessors with dramatically shorter computation time than simulation. The analytical results have further verified the expectations and the simulation results.

CHAPTER 8

Conclusions

The choices of memory architectures and coherence schemes have drawn attention in building new shared-memory multiprocessors. The research in this dissertation presents a new concept for building a distributed shared-memory multiprocessor.

A novel, scalable, snoopy cache-coherent multiprocessor architecture based on the address-separated multiple bus network has been proposed in this dissertation. To allow the proposed system to handle more busses without increasing the complexity of the snooping mechanism, the concept of partial-snooping was introduced: each processor can snoop on a dynamically changing subset of the busses. The proposed scheme can enjoy fast memory access and also benefit from greater scalability. We conclude that the proposed scheme successfully uses the snoopy mechanism to extend the number of processors in one multiprocessor based on the following observations:

• Memory modules are distributed into nodes to allow fast local memory access and to avoid network contention and bottlenecks.
• The network bandwidth can grow with the number of processors by adding more busses.

• The partial-snooping design can alleviate the cost and hardware complexity as the number of busses grows.

• The multiple busses of the proposed system can be constructed from an optical network with wavelength-divided channels and tunable devices. The high network bandwidth supported by state-of-the-art optical technology improves scalability.

• The proposed system performs better than directory-based systems, especially when a higher degree of parallelism is supported, such as balanced load, larger problem size, high reference locality, better memory mapping, and a larger number of processors.

• Current rapid progress in distributed operating systems and parallel compilers should lead to better load-balancing and data-placement techniques, which would make the MSU system even more attractive.

The contributions of this research can be described as follows:

• The claims that prove the equivalence among different multiple bus network models tremendously simplify the performance evaluation, because complicated networks can be evaluated using simple ones with identical performance.

• The concept of partial snooping is a novel idea. The study in this dissertation shows its feasibility for simplifying the hardware of large-scale multiprocessors.

• The MSU system indicates a possible way to scale shared-memory multiprocessors.

• The proposed system includes a design exploiting state-of-the-art, star-coupled optical networks with tunable devices.

• A low-level and accurate simulator with several benchmark programs for multiprocessors was constructed to evaluate the proposed system. The simulator has the potential to be utilized for other research.

• A simple and effective analytical method was introduced which can evaluate parallel systems with up to hundreds of processors in seconds.

Several issues have been addressed in this dissertation but were not studied thoroughly, and they may be interesting for future study. These include the impact of the read-only bus on the MSU system; the performance improvement from a weakly ordered MSU system; and, if the busses were implemented optically, the possible hardware complexity and delays of the conversion between lightwave and electronic signals. The study of a hierarchical structure with the proposed scheme, or a hybrid with other schemes, should be of interest to computer architects. Lastly, the concept of partial-snooping may even be applied to a directory-based system built on an interconnection network where each processor can only access a subset of the outside world at a time.

APPENDICES

APPENDIX A

An MVA Algorithm for a Network of Queues With a State-Dependent Server

The MVA solution for the queueing system in Figure 5.7 (Model 6) is shown in this appendix. According to the assumptions used in Section 5.4.1, the computation time in each processor is exponentially distributed with a mean of 1/μ_1, and the service time of a network server is exponentially distributed with a mean of 1/μ_2. To derive numerical results, we have to determine the mean delay experienced by a request in the queueing system when the total number of requests is N. The mean time spent in the processor servers, w̄_1(i), is fixed at 1/μ_1 for any i, where i is the number of requests in the queueing model. Thus the analysis only needs to derive the mean time that a request spends in the network servers, which is denoted w̄_2(N).

The mean service time of the network servers with i requests, denoted s_2(i), is

    s_2(i) = 1 / (i μ_2),    i = 1, ..., B - 1,
    s_2(i) = 1 / (B μ_2),    i = B, ..., N.                                              (A.1)

Let P_j(i) be the probability of j requests in the network servers when there are a total of i requests in the queueing model. According to the Arrival Theorem, the mean waiting time in the network servers that an entering request will experience can be represented by

    w̄_2(i) = Σ_{j=1..i} j · s_2(j) · P_{j-1}(i - 1).                                    (A.2)

Also, by using Little's Law, the throughput of the model, λ(i), can be derived as

    λ(i) = i / ( w̄_1 + w̄_2(i) ).                                                        (A.3)

If a newly arriving request had to wait in the queue of the network servers because all network servers were busy, the probability of j requests in the network servers could be determined by P_j(i) = P_{j-1}(i - 1). However, this probability has to be adjusted by a factor equal to the mean busy ratio of the network servers when j requests are present, which can be represented by λ(i) · s_2(j). Hence,

    P_j(i) = λ(i) · s_2(j) · P_{j-1}(i - 1),    j = 1, ..., i.                           (A.4)

Using Equations A.2, A.3, and A.4, the MVA solution can be stated in the algorithmic form shown in Figure A.1:

    1. Initially, P_0(0) = 1.0.
    2. For i = 1 to N do steps 3 to 6.
    3. w̄_1(i) = 1/μ_1;  w̄_2(i) = Σ_{j=1..i} j · s_2(j) · P_{j-1}(i - 1).
    4. λ(i) = i / ( w̄_1(i) + w̄_2(i) ).
    5. P_j(i) = λ(i) · s_2(j) · P_{j-1}(i - 1), for j = 1, ..., i.
    6. P_0(i) = 1 - Σ_{j=1..i} P_j(i).

Figure A.1. An MVA algorithm for the analytical solution of Model 6.
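The recursion of Figure A.1 can be implemented directly. The following Python sketch assumes the load-dependent service time of Equation A.1, s_2(i) = 1/(min(i, B)·μ_2); the μ_1, μ_2, B, and N values in the example call are placeholders, not the dissertation's experimental settings.

```python
def mva_model6(mu1, mu2, B, N):
    """MVA recursion of Figure A.1 for the closed model of Figure 5.7 (Model 6):
    processor servers with mean 1/mu1 and a B-bus network treated as one
    state-dependent server."""
    def s2(i):                      # Equation A.1
        return 1.0 / (min(i, B) * mu2)

    # P[j] holds P_j(i - 1) from the previous iteration; initially P_0(0) = 1.
    P = [1.0]
    for i in range(1, N + 1):
        w1 = 1.0 / mu1                                            # step 3, processor servers
        w2 = sum(j * s2(j) * P[j - 1] for j in range(1, i + 1))   # Equation A.2
        lam = i / (w1 + w2)                                       # Equation A.3, throughput
        newP = [0.0] * (i + 1)
        for j in range(1, i + 1):                                 # Equation A.4
            newP[j] = lam * s2(j) * P[j - 1]
        newP[0] = 1.0 - sum(newP[1:])                             # step 6
        P = newP
    return w2, lam            # mean network delay w̄_2(N) and throughput λ(N)

# Placeholder parameters, chosen only to show the call.
print(mva_model6(mu1=0.1, mu2=0.25, B=4, N=32))
```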
APPENDIX B

Workload Parameters Used in Analysis

Table B.1 lists the values of all probability parameters that were used in the analytical experiments in Section 7.5.

Table B.1. Probabilities used in the analytical experiments.

    Parameter     Value
    p_private     0.5
    p_shared      0.5
    p_local       l / (N_node + l)
    p_remote      1.0 - p_local
    p_hit         0.75
    p_miss        0.25
    p_read        0.75
    p_write       0.25
    p_ex          0.125
    p_n_ex        0.875
    p_rpm         0.8
    p_set_hit     max( N_snp / N_bus, 1 - e^(-0.5 N_bus) )
    p_set_miss    1 - p_set_hit
    p_dirty       0.125

The value of the function f(x) = 1 - e^(-0.5x) used in deciding p_set_hit is plotted in Figure B.1.

[Figure B.1. Function f(x) = 1 - e^(-0.5x), used in deciding p_set_hit.]
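Since p_set_hit is the only entry in Table B.1 that depends on the machine configuration, a small sketch of it may be useful. The (N_bus, N_snp) pairs below are arbitrary examples, not the configurations used in the experiments.

```python
import math

def p_set_hit(n_snp, n_bus, lam=0.5):
    """Set-hit probability from Table B.1: the larger of the snooped-bus fraction
    and the exponential CDF value 1 - exp(-lam * n_bus)."""
    return max(n_snp / n_bus, 1.0 - math.exp(-lam * n_bus))

for n_bus, n_snp in [(4, 2), (8, 2), (16, 4), (16, 16)]:   # example configurations only
    print(n_bus, n_snp, round(p_set_hit(n_snp, n_bus), 3))
```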
BIBLIOGRAPHY

[1] K. Gharachorloo, A. Gupta, and J. Hennessy, "Hiding memory latency using dynamic scheduling in shared-memory multiprocessors," in Proceedings of the 19th IEEE Annual International Symposium on Computer Architecture, pp. 22–33, 1992.

[2] S. Picano, E. D. B. III, and J. E. Hoag, "Programming costs of explicit memory localization on a large scale shared memory multiprocessor," in Proc. of Supercomputing, pp. 36–45, 1991.

[3] D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennessy, "The DASH prototype: Logic overhead and performance," IEEE Transactions on Parallel and Distributed Systems, vol. 4, Jan. 1993.

[4] A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz, "APRIL: A processor architecture for multiprocessing," in Proceedings of the 17th IEEE Annual International Symposium on Computer Architecture, pp. 104–114, 1990.

[5] J. R. Goodman, "Using cache memory to reduce processor-memory traffic," in Proceedings of the 10th IEEE Annual International Symposium on Computer Architecture, pp. 124–131, 1983.

[6] J. Archibald and J.-L. Baer, "Cache coherence protocols: Evaluation using a multiprocessor simulation model," ACM Transactions on Computers, pp. 273–298, Nov. 1986.

[7] M.-C. Chiang and G. S. Sohi, "Evaluating design choices for shared bus multiprocessors in a throughput-oriented environment," IEEE Transactions on Computers, vol. 41, pp. 297–317, Mar. 1992.

[8] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, "The directory-based cache coherence protocol for the DASH multiprocessor," in Proceedings of the 17th IEEE Annual International Symposium on Computer Architecture, pp. 148–159, 1990.

[9] M. Carlton and A. Despain, "Multiple-bus shared-memory system: Aquarius project," IEEE Computer, pp. 80–83, June 1990.

[10] J. R. Goodman and P. J. Woest, "The Wisconsin Multicube: A new large-scale cache-coherent multiprocessor," in Proceedings of the 15th IEEE Annual International Symposium on Computer Architecture, pp. 422–431, 1988.

[11] Q. Yang, L. N. Bhuyan, and R. Pavaskar, "Performance analysis of packet-switched multiple-bus multiprocessor systems," in Real-Time Systems Symposium, pp. 170–178, IEEE, 1987.

[12] M. Dubois, "Throughput analysis of cache-based multiprocessors with multiple busses," IEEE Transactions on Computers, vol. 37, pp. 58–70, Jan. 1988.

[13] J. P. Singh et al., "Splash: Stanford parallel applications for shared-memory," Tech. Rep. CSL-TR-92-526, Computer Systems Laboratory, Stanford University, June 1992.

[14] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. San Mateo, California: Morgan Kaufmann Publishers, Inc., 1990.

[15] T. S. Wailes and D. G. Meyer, "Multiple channel architecture: A new optical interconnection strategy for massively parallel computers," Journal of Lightwave Technology, vol. 9, pp. 1702–1716, Dec. 1991.

[16] H. Cheong and A. V. Veidenbaum, "Compiler-directed cache management in multiprocessors," IEEE Computer, pp. 39–47, June 1990.

[17] Z. G. Vranesic, M. Stumm, D. M. Lewis, and R. White, "Hector: A hierarchically structured shared-memory multiprocessor," IEEE Computer, pp. 71–79, Jan. 1991.

[18] A. L. Cox and R. J. Fowler, "The implementation of a coherent memory abstraction on a NUMA multiprocessor: Experiences with PLATINUM," in Proceedings of the 12th ACM Symposium on Operating Systems Principles, pp. 32–44, 1989.

[19] R. Bisiani and M. Ravishankar, "PLUS: A distributed shared-memory system," in Proceedings of the 17th IEEE Annual International Symposium on Computer Architecture, pp. 115–124, 1990.

[20] E. Hagersten, A. Landin, and S. Haridi, "DDM—a cache-only memory architecture," IEEE Computer, pp. 44–54, Sept. 1992.

[21] H. Burkhardt et al., "Overview of the KSR1 computer system," Tech. Rep. KSR-TR-9202001, Kendall Square Research, Boston, Feb. 1992.

[22] K. Li and R. Schaefer, "A hypercube shared virtual memory system," in Proceedings of International Conference on Parallel Processing, pp. I-125–I-131, 1989.

[23] N. Nitzberg and V. Lo, "Distributed shared memory: A survey of issues and algorithms," IEEE Computer, pp. 52–60, Aug. 1991.

[24] A. Mohindra and U. Ramachandran, "A survey of distributed shared memory in loosely-coupled systems," Tech. Rep. GIT-CC-91/01, College of Computing, Georgia Institute of Technology, Jan. 1991.

[25] A. S. Tanenbaum, M. F. Kaashoek, and H. E. Bal, "Parallel programming using shared objects and broadcasting," IEEE Computer, pp. 10–19, Aug. 1992.

[26] P. Stenström, "A survey of cache coherence schemes for multiprocessors," IEEE Computer, pp. 12–24, June 1990.

[27] A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz, "An evaluation of directory schemes for cache coherence," in Proceedings of the 15th IEEE Annual International Symposium on Computer Architecture, pp. 280–289, IEEE, 1988.

[28] D. Chaiken, C. Fields, K. Kurihara, and A. Agarwal, "Directory-based cache coherence in large-scale multiprocessors," IEEE Computer, pp. 49–58, June 1990.
[29] D. Gustavson, "The scalable coherent interface and related standards projects," IEEE Micro, pp. 10–22, Feb. 1992.

[30] V. P. Srini, "Crossbar-multi-processor architecture," in Cache and Interconnect Architectures in Multiprocessors (M. Dubois and S. S. Thakkar, eds.), pp. 221–243, Kluwer Academic, 1990.

[31] M. Thapar, "Cache coherence for scalable shared memory multiprocessors," Tech. Rep. CSL-TR-92-522, Computer Systems Laboratory, Stanford University, May 1992. Ph.D. dissertation.

[32] A. K. Nanda and L. N. Bhuyan, "Design and analysis of cache coherent multistage interconnection networks," IEEE Transactions on Computers, vol. 42, Apr. 1993.

[33] P. Bitar and A. Despain, "Multiprocessor cache synchronization issues, innovations, evolution," in the 13th International Symposium on Computer Architecture, pp. 424–433, Computer Society Press, 1986.

[34] J. Torrellas and J. Hennessy, "Estimating the performance advantages of relaxing consistency in a shared-memory multiprocessor," in Proceedings of International Conference on Parallel Processing, pp. I-26–I-34, 1990.

[35] P. Stenström, F. Dahlgren, and L. Lundberg, "A lockup-free multiprocessor cache design," in Proceedings of International Conference on Parallel Processing, pp. I-246–I-250, 1991.

[36] C. Scheurich and M. Dubois, "Correct memory operation of cache-based multiprocessors," in Proceedings of the 14th IEEE Annual International Symposium on Computer Architecture, pp. 234–243, 1987.

[37] C. Scheurich and M. Dubois, "Lockup-free caches in high-performance multiprocessors," Journal of Parallel and Distributed Computing, pp. 25–31, 1991.

[38] A. W. Wilson Jr. and R. P. LaRowe Jr., "Hiding shared memory reference latency on the Galactica Net distributed shared memory architecture," Journal of Parallel and Distributed Computing, vol. 15, pp. 351–367, 1992.

[39] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, "Memory consistency and event ordering in scalable shared-memory multiprocessors," in Proceedings of the 17th IEEE Annual International Symposium on Computer Architecture, pp. 15–26, 1990.

[40] J.-L. Baer and R. N. Zucker, "On synchronization patterns in parallel programs," in Proceedings of International Conference on Parallel Processing, pp. II-60–II-67, 1991.

[41] S. Eggers and R. Katz, "Evaluating the performance of four snooping cache coherency protocols," in Proceedings of the 16th IEEE Annual International Symposium on Computer Architecture, pp. 2–15, 1989.

[42] R. Katz, S. Eggers, D. A. Wood, C. Perkins, and R. G. Sheldon, "Implementing a cache consistency protocol," in Proceedings of the 12th IEEE Annual International Symposium on Computer Architecture, pp. 275–283, IEEE, 1985.

[43] J. K. Archibald, "A cache coherence approach for large multiprocessor systems," in Proceedings of unknown, pp. 336–345, 1988.

[44] M. K. Vernon, E. D. Lazowska, and J. Zahorjan, "An accurate and efficient performance analysis technique for multiprocessor snooping cache-consistency protocols," in Proceedings of the 15th IEEE Annual International Symposium on Computer Architecture, pp. 308–315, 1988.

[45] A. W. Wilson Jr., "Hierarchical cache/bus architecture for shared memory multiprocessors," in Proceedings of the 14th IEEE Annual International Symposium on Computer Architecture, pp. 244–252, 1987.
[46] D. R. Cheriton, H. A. Goosen, and P. D. Boyle, "Multi-level shared caching techniques for scalability in VMP-MC," in Proceedings of the 16th IEEE Annual International Symposium on Computer Architecture, pp. 16–24, 1989.

[47] C. Erickson, Design and Evaluation of a Hierarchical Bus Multiprocessor. PhD thesis, Dept. of Electrical Engineering, Michigan State Univ., 1991.

[48] Q. Yang, G. Thangadurai, and L. N. Bhuyan, "Design of an adaptive cache coherence protocol for large scale multiprocessors," IEEE Transactions on Parallel and Distributed Systems, May 1992.

[49] L. Lamport, "How to make a multiprocessor computer that correctly executes multiprocess programs," IEEE Transactions on Computers, pp. 690–691, Sept. 1979.

[50] J. R. Goodman, "Cache consistency and sequential consistency," Tech. Rep. 61, SCI Committee, Mar. 1989.

[51] M. Dubois, C. Scheurich, and F. Briggs, "Memory access buffering in multiprocessors," in Proceedings of the 13th IEEE Annual International Symposium on Computer Architecture, pp. 434–442, 1986.

[52] T.-S. Jou, "A comparison among different memory consistency ordering models," tech. rep., Computer Science Department, Michigan State University, Mar. 1992. Answer for one comprehensive exam question assigned by Lionel M. Ni.

[53] M. Dubois and C. Scheurich, "Memory access dependencies in shared-memory multiprocessors," IEEE Transactions on Software Engineering, pp. 660–673, June 1990.

[54] R. N. Zucker and J.-L. Baer, "A performance study of memory consistency models," in Proceedings of the 19th IEEE Annual International Symposium on Computer Architecture, pp. 2–12, 1992.

[55] T. Lang, M. Valero, and I. Alegre, "Bandwidth of crossbar and multiple-bus connections for multiprocessors," IEEE Transactions on Computers, pp. 1227–1234, Dec. 1982.

[56] M. A. Marsan and M. Gerla, "Markov models for multiple bus multiprocessor systems," IEEE Transactions on Computers, pp. 239–248, Mar. 1982.

[57] K. B. Irani and I. H. Onyüksel, "A closed-form solution for the performance analysis of multiple-bus multiprocessor systems," IEEE Transactions on Computers, pp. 1004–1012, Nov. 1984.

[58] O. Ganz and I. Chlamtac, "Analysis of multiple-bus synchronous and asynchronous communication systems," Performance Evaluation, vol. 5, pp. 1–13, 1985.

[59] T. N. Mudge and H. B. Al-Sadoun, "A semi-Markov model for the performance of multiple-bus systems," IEEE Transactions on Computers, pp. 934–942, Oct. 1985.

[60] D. Towsley, "Approximate models of multiple bus multiprocessor systems," IEEE Transactions on Computers, pp. 220–228, Mar. 1986.

[61] T. N. Mudge, J. P. Hayes, and D. Winsor, "Multiple bus architecture," IEEE Computer, vol. 20, pp. 42–48, June 1987.

[62] L. N. Bhuyan, Q. Yang, and D. P. Agrawal, "Performance of multiprocessor interconnection networks," IEEE Computer, pp. 25–37, Feb. 1989.

[63] A. Hopper, A. Jones, and D. Lioupis, "Multiple vs wide shared bus multiprocessors," in Proceedings of the 16th IEEE Annual International Symposium on Computer Architecture, pp. 300–306, 1989.

[64] Q. Yang and L. N. Bhuyan, "Performance of multiple-bus interconnections for multiprocessors," Journal of Parallel and Distributed Computing, vol. 8, pp. 267–273, 1990.

[65] W.-T. Chen and J.-P. Sheu, "Performance analysis of multiple bus interconnection networks with hierarchical requesting model," IEEE Transactions on Computers, vol. 40, pp. 834–842, July 1991.
[66] T. Lang, M. Valero, and M. A. Fiol, "Reduction of connections for multibus organization," IEEE Transactions on Computers, vol. C-32, pp. 707–716, Aug. 1983.

[67] T.-P. Lee and C.-E. Zah, "Wavelength-tunable and single-frequency semiconductor lasers for photonic communications networks," IEEE Communications Magazine, pp. 42–52, Oct. 1989.

[68] M. S. Goodman, "Multiwavelength networks and new approaches to packet switching," IEEE Communications Magazine, pp. 27–35, Oct. 1989.

[69] P. W. Dowd, "Wavelength division multiple access channel hypercube processor interconnection," IEEE Transactions on Computers, vol. 41, pp. 1223–1241, 1992.

[70] D. M. Chiarulli, S. P. Levitan, and R. G. Melhem, "Optical bus control for distributed multiprocessors," Journal of Parallel and Distributed Computing, vol. 10, pp. 45–54, 1990.

[71] S. Thakkar, M. Dubois, A. T. Laundrie, and G. S. Sohi, "Scalable shared-memory multiprocessor architectures," IEEE Computer, pp. 71–74, June 1990.

[72] Y. Birk, F. A. Tobagi, and M. E. Marhic, "Bus-oriented interconnection topologies for single-hop communication among multi-transceiver stations," in Proc. INFOCOM, pp. 558–567, 1988. New Orleans, Louisiana, Mar. 29–31.

[73] J. Kilian, S. Kipnis, and C. E. Leiserson, "The organization of permutation architectures with bused interconnections," IEEE Transactions on Computers, vol. 39, pp. 1346–1358, Nov. 1990.

[74] P. K. McKinley and J. W. S. Liu, "Group communication in multichannel networks with staircase interconnection topologies," in Proceedings of ACM SIGCOMM, (Austin, TX), 1989.

[75] M. A. Marsan and D. Roffinella, "Multichannel local area network protocols," IEEE Journal on Selected Areas in Communications, vol. SAC-1, no. 5, pp. 885–897, Nov. 1983.

[76] I. Chlamtac and A. Ganz, "Design and analysis of very high speed network architectures," in Proc. GLOBECOM, pp. 927–931, 1986. Houston, Texas, Dec. 1–4.

[77] B. P. Mohanty and T. D. Todd, "Connection-based media access for multichannel local and metropolitan area networks," in Proc. INFOCOM, pp. 226–233, 1988. New Orleans, Louisiana, Mar. 29–31.

[78] L. V. Kalé, "Lattice-mesh: A multi-bus topology," in Proc. International Conference on Parallel Processing, pp. 700–702, 1985. St. Charles, Illinois, Aug. 20–23.

[79] P. W. Dowd and K. Jabbour, "Spanning multiaccess channel hypercube interconnection," IEEE Transactions on Computers, vol. 37, no. 9, pp. 1137–1142, 1988.

[80] M. M. Nassehi, F. A. Tobagi, and M. E. Marhic, "Fiber optic configurations for local area networks," IEEE Transactions on Computers, pp. 941–949, Nov. 1985.

[81] Y.-K. M. Lin, D. R. Spears, and M. Yin, "Fiber-based local access network architectures," IEEE Communications Magazine, pp. 64–73, Oct. 1989.

[82] I. P. Kaminow, "Photonic multiple-access networks: Topologies," AT&T Technical Journal, pp. 61–71, March/April 1989.

[83] F. W. Scholl and M. H. Coden, "Passive optical star systems for fiber optic local area networks," IEEE Journal on Selected Areas in Communications, vol. 6, July 1988.

[84] J. W. Reedy and J. R. Jones, "Methods of collision detection in fiber optic CSMA/CD networks," IEEE Journal on Selected Areas in Communications, vol. SAC-3, pp. 890–896, Nov. 1985.

[85] C. A. Brackett, "Dense wavelength division multiplexing networks: Principles and applications," IEEE Journal on Selected Areas in Communications, vol. 8, pp. 948–964, Aug. 1990.

[86] H. Kobrinski and K.-W. Cheung, "Wavelength-tunable optical filters: Applications and technologies," IEEE Communications Magazine, pp. 53–63, Oct. 1989.
[87] P. E. Green and R. Ramaswami, "Direct detection lightwave systems: Why pay more?," IEEE LCS, pp. 36–49, Nov. 1990.

[88] H. Kobrinski, M. P. Vecchi, M. S. Goodman, E. L. Goldstein, T. E. Chapuran, J. M. Cooper, C. E. Zah, and S. G. Menocal, "Fast wavelength-switching of laser transmitters and amplifiers," IEEE Journal of Selected Areas in Communications, vol. 8, pp. 1190–1202, Aug. 1990.

[89] K. L. Walker, "Directions in optical fibers," AT&T Technical Journal, vol. 69, November/December 1990.

[90] E. Arthurs, M. S. Goodman, H. Kobrinski, and M. P. Vecchi, "HYPASS: An optoelectronic hybrid packet-switching system," IEEE Journal of Selected Areas in Communications, vol. 6, pp. 1500–1510, 1988.

[91] E. Arthurs et al., "Multiwavelength optical crossconnect for parallel processing computers," Electronics Letters, vol. 24, pp. 119–120, 1988.

[92] P. K. McKinley, "Lightwave multichannel networks with grid-based topologies," in Proc. International Phoenix Conference on Computers and Communications, (Scottsdale, Arizona, Mar. 27–30), pp. 506–512, 1991.

[93] A. S. Acampora, "A multichannel multihop local lightwave network," in Proc. GLOBECOM, pp. 1459–1467, 1987. Tokyo, Japan, Nov. 15–18.

[94] K. Hwang, P.-S. Tseng, and D. Kim, "An orthogonal multiprocessor for parallel scientific computations," IEEE Transactions on Computers, pp. 47–61, Jan. 1989.

[95] H. Davis, S. R. Goldschmidt, and J. Hennessy, "Multiprocessor simulation and tracing using Tango," in Proceedings of International Conference on Parallel Processing, pp. II-99–II-107, 1991.

[96] S. Goldschmidt and H. Davis, "Tango introduction and tutorial," Tech. Rep. CSL-410, Computer Systems Laboratory, Stanford University, Feb. 1991.

[97] M. Dubois and F. Briggs, "Effects of cache coherency in multiprocessors," IEEE Transactions on Computers, pp. 1083–1099, Nov. 1982.

[98] P. Bitar, "A critique of trace-driven simulation for shared-memory multiprocessors," in Cache and Interconnect Architectures in Multiprocessors (M. Dubois and S. S. Thakkar, eds.), pp. 37–52, Kluwer Academic, 1990.

[99] E. J. Koldinger, S. J. Eggers, and H. M. Levy, "On the validity of trace-driven simulation for multiprocessors," in Proceedings of the 18th IEEE Annual International Symposium on Computer Architecture, pp. 244–253, 1991.

[100] T.-S. Jou and R. Enbody, "Address-separated multibus architecture," in Proceedings of Int'l Conference on Parallel and Distributed Systems, (Hsinchu, Taiwan, R.O.C.), pp. 82–89, Dec. 1992.
Carter, and willy Zwaenepoel, “Adaptive software cache management for distributed shared memeory architecture,” in Proceedings of the 17th IEEE Annual International Symposium on Computer Architecture, pp. 125 — 134, 1990. S. J. Eggers and R. H. Katz, “Characterization of sharing in parallel programs and its applicability to coherency protocol evaluation,” in Proceedings of the 15th IEEE Annual International Symposium on Computer Architecture, pp. 373 — 382, 1988. T.-S. Jon and R. Enbody, “Scalable snoopy vs. directory cache coherence schemes,” submitted, 1993. C. Dubnicki and T. J. LeBlanc, “Adjustable block size coherent caches,” in Proceedings of the 19th IEEE Annual International Symposium on Computer Architecture, pp. 170 — 180, 1992. M. Annaratone and R. Ruhl, “Performance measurements on a commercial multiprocessor running parallel code,” in Proceedings of the 16th IEEE Annual International Symposium on Computer Architecture, pp. 307 - 314, 1989. J. Ramanathan and L. M. Ni, “Critical factors in NUMA memory manage- ment,” in Proc. of 11th Intl. Conf. on Distributed Computeing Systems, pp. 500 — 507, May 1991. 151 [112] A. Gupta, T. Joe, and P. Stenstr6m, “Comparative performance evaluation of cache-coherent NUMA and COMA architectures,” Tech. Rep. CSL-TR-92-524, Computer Systems Laboratory, Stanford University, May 1992. [113] S. T. Leutenegger and M. K. Vernon, “A mean-value performance analysis of a new multiprocessor architecture,” in Proc. Conference on Measurement and Modeling of Computer Systems, 1988. [114] R. P. LaRowe Jr. , C. S. Ellis, and M. A. Holliday, “Evaluation of N UMA mem- ory management through modeling and measurements,” IEEE Transactions on Parallel and Distributed Systems, vol. 3, pp. 686 —701, 1992. [115] G. M. Amdahl, “Validity of the single-processor approach to achieving large scale computing capabilities,” in AFIPS Conference Proceedings, vol. 30, pp. 483 — 485, AFIPS Press, 1967. MICHIGAN STATE UNIV. LIBRQRIES lllllllllllllllllllllllllllllllllllllllllllllll 31293009045810