PIPELINED DATA PARALLEL ALGORITHMS — CONCEPT, DESIGN, AND MODELING

By

Chung-Ta King

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science
Michigan State University
East Lansing, MI 48824
1988

ABSTRACT

PIPELINED DATA PARALLEL ALGORITHMS — CONCEPT, DESIGN, AND MODELING

By Chung-Ta King

As multiprocessors become more and more popular, the need for efficient parallel algorithms becomes more and more pressing. In this thesis, the concept, design, implementation, and modeling of a class of parallel algorithms, called pipelined data parallel algorithms, is discussed. When running on distributed-memory multiprocessors, pipelined data parallel algorithms exploit processor-level macro-pipelining to achieve a balance between computation and communication.

The concept of pipelined data parallel computations is first introduced by contrasting it with other styles of parallel computations. Various considerations in designing efficient parallel algorithms, especially data parallel algorithms, are discussed. A new paradigm, called large-grain pipelining, is introduced to facilitate the design of pipelined data parallel algorithms on distributed-memory multiprocessors.

To estimate the performance of pipelined data parallel algorithms, an analytic model is introduced. The accuracy of the analytic model is studied by comparing the analytic results with experimental results from a 64-node NCUBE multiprocessor. The close match between these two sets of results indicates that the analytic model can also be used to determine the optimal design parameters, such as the granularity, for a given problem instance.

A systematic approach for designing pipelined data parallel algorithms on distributed-memory multiprocessors is presented. Starting from a nested-loop program, the approach restructures the program through a series of transformations. Based on the data dependencies between the loops, data and operations in the program are grouped together. The grouping technique allows control over the granularity so that the generated pipelined data parallel program achieves a balance between computation and communication.

Finally, the major contributions of this thesis are summarized and future work is outlined.

To my parents and my wife

ACKNOWLEDGEMENTS

I wish to thank my advisor, Dr. Lionel M. Ni, for his guidance over the years. Without the research environment that he provided and his intellectual advice and inspiration, this dissertation would not have been possible. I would also like to thank Dr. Anthony S. Wojcik for his valuable suggestions and comments on this dissertation and for his careful reading of the manuscript. I am grateful to Dr. A. Esfahanian, who was always willing to answer my questions and provide me with valuable references. I also wish to express my appreciation to Dr. D. Feldman for his encouragement and support.

A person cannot accomplish anything without help from others. I would like to thank all the people who helped me during my stay at Michigan State. In particular, I wish to acknowledge my friends T. Gendreau, E. Wu, Y.H. Liu, S.W. Chen, J. Liao, W.H. Chou, T.H. Pan, B. McMillin, Youran Lan, D. Ra, X.H. Sun, F. Fotouhi, and T. Znati.
Finally, I would like to thank my parents and my wife for their constant support, patience, and love. TABLE OF CONTENTS List of Tables ............................................................................................................. viii List of Figures ............................................................................................................ ix Chapter 1. Introduction ........................................................................................... 1 1.1. Shared versus Distributed Memory Multiprocessors ............................. 2 1.2. A Model of Distributed-Memory Multiprocessors ................................ 4 1.3. Styles of Parallel Computations ............................................................. 8 1.4. Problem Statement ................................................................................. 15 1.5. Thesis Overview .................................................................................... 16 Chapter 2. Considerations for Designing Parallel Algorithms ............................ 18 2.1. A Representation of Parallelism — P-nets ............................................ 19 2.1.1. Basic Definition of P-nets ............................................................ 19 2.1.2. Variations and Analysis of P-nets ................................................ 25 2.2. Considerations for Partitioning and Mapping ........................................ 27 2.2.1. Number of Nodes and Partitions .................................................. 27 2.2.2. Mapping Considerations .............................................................. 28 2.2.3. Inter-Processor Scheduling .......................................................... 30 2.3. Considerations for Scheduling within Processors ................................. 31 Chapter 3. Designing and Analyzing Data Parallel Algorithms .......................... 35 3.1. NCUBE Multiprocessors ....................................................................... 36 3.1.1. The Hardware .............................................................................. 36 3.1.2. The Software ................................................................................ 38 3.2. Modeling Computation and Communication on DMMs ....................... 39 3.3. Case Study — Array Summation ........................................................... 45 vi 3.3.1. The Algorithms ............................................................................ 46 3.3.2. Analytic Modeling ....................................................................... 48 3.3.3. Performance Analysis .................................................................. 51 3.4. Case Study — Matrix Multiplication .................................................... 54 3.4.1. The Algorithms ............................................................................ 55 3.4.2. Analytic Modeling ....................................................................... 59 3.4.3. Performance Analysis .................................................................. 60 Chapter 4. Large-Grain Pipelining ........................................................................ 64 4.1. Parallel Programming Paradigms .......................................................... 65 4.2. Basic Concept of Large-Grain Pipelining .............................................. 68 4.3. Characteristics of Large-Grain Pipelining ............................................. 73 4.4. 
Performance of Pipelined Matrix Multiplication ................................... 76 Chapter 5. Modeling Pipelined Data Parallel Algorithms ................................... 80 5.1. A Model of Pipelined Computation ....................................................... 81 5.2. Throughput Analysis of a Computational Unit ..................................... 83 5.3. Throughput Analysis of Pipeline Arrays ............................................... 89 5.4. Analysis of Pipelined Matrix Multiplication ......................................... 95 5.5. A Comparative Study of the Analytic Model ........................................ 99 Chapter 6. Designing Pipelined Data Parallel Algorithms .................................. 103 6.1. Synthesizing Systolic Arrays ................................................................. 104 6.2. The Design Procedure ............................................................................ 110 6.3. The Grouping Problem .......................................................................... 111 6.4. Grouping with One or Two Dependence Vectors ................................. 117 6.5. Grouping with Three Dependence Vectors ........................................... 122 6.6. Grouping with Four or More Dependence Vectors ............................... 131 6.7. An Example of Grouping ....................................................................... 134 Chapter 7. Conclusion and Future Work .............................................................. 140 7.1. Summary ................................................................................................ 140 7.2. Future Work ........................................................................................... 143 Bibliography .............................................................................................................. 146 Appendix .................................................................................................................... 150 vii LIST OF TABLES Tabel 1.1. Performance figures for three DMMs ....................................................... 7 Table 1.2. Different styles of parallel computation ................................................... 10 Table 3.1. System parameters on the NCUBE .......................................................... 45 Table 5.1. Optimal partitions for pipelined algorithm ............................................... 100 viii LIST OF FIGURES Figure 1.1. The block diagram of a generic DMM ................................................. 5 Figure 1.2. A concurrent function parallel computation ......................................... 11 Figure 1.3. A concurrent data parallel computation ................................................ 12 Figure 1.4. A pipelined function parallel computation ........................................... 12 Figure 1.5. A pipelined data parallel computation .......................................... -. ....... 14 Figure 1.6. Another pipelined data parallel computation ........................................ 14 Figure 2.1. Matrix multiplication and its corresponding P-net ............................... 21 Figure 2.2. Execution of a P-net at various time instants ........................................ 23 Figure 2.3. Partitioning a P-net among three processors ......................................... 24 Figure 2.4. The D-net corresponds to the P-net in Figure 2.1 ................................. 26 Figure 2.5. 
A schedule for the D-net in Figure 2.4 ................................................. 32 Figure 3.1. Gathering/scattering operations in the NCUBE .................................... 42 Figure 3.2. Measured host overhead in sending a message ..................................... 43 Figure 3.3. Measured node to node communication time ....................................... 44 Figure 3.4. Array summation with centralized accumulation method .................... 46 Figure 3.5. Array summation with tree-structured accumulation ........................... 47 Figure 3.6. Timing diagram of array summation using centralized accumulation ........................................................................................ 48 Figure 3.7. Timing diagram of array summation using tree-structured accumulation ........................................................................................ 51 Figure 3.8. Comparison of ratioed and equal partition ........................................... 52 Figure 3.9. Comparison of tree-structured and centralized accumulation .............. 53 Figure 3.10. Predicted performance of various methods ........................................ 54 Figure 3.11. Matrix Multiplication with strip partition ........................................... 56 Figure 3.12. Matrix multiplication with block partition ......................................... 58 Figure 3.13. The algorithm for matrix multiplication ............................................. 59 Figure 3.14. Execution time of the matrix multiplication ....................................... 61 ix Figure 3.15. Predicted speedup of the matrix multiplication .................................. 62 Figure 4.1. Algorithm 4.1 for pipelined matrix multiplication ............................... 69 Figure 4.2. Data flows in Algorithm 4.1 .................................................................. 70 Figure 4.3. Algorithm 4.2 for pipelined matrix multiplication ............................... 72 Figure 4.4. Data flows in Algorithm 4.2 .................................................................. 73 Figure 4.5. Algorithm 4.3 for pipelined matrix multiplication ............................... 74 Figure 4.6. Data flows in Algorithm 4.3 .................................................................. 75 Figure 4.7. Comparison of pipelined matrix multiplication .................................... 77 Figure 4.8. Effects of partition sizes ........................................................................ 78 Figure 4.9. Comparison of speedups ....................................................................... 79 Figure 5.1. Model of a computational unit .............................................................. 82 Figure 5.2. Functional description of the computational unit ................................. 82 Figure 5.3. Timed Petri-net of a computational unit ............................................... 85 Figure 5.4. A linear array with single input stream ................................................. 87 Figure 5.5. A linear array with two input streams ................................................... 89 Figure 5.6. The Timed Petri-net of a linear array with two stages .......................... 90 Figure 5.7. Computational units connected in a 2-dimensional mesh .................... 93 Figure 5.8. The Timed Petri-net of a 2-dimensional mesh ...................................... 94 Figure 5.9. 
Measured and analyzed performance for various pipeline configurations ....................................................................................... 99 Figure 5.10. Measured and analyzed performance for various partition sizes ........................................................................................ 100 Figure 5.11. Predicted speedup for pipelined matrix multiplication ....................... 101 Figure 6.1. The computational structure of the matrix multiplication .................... 107 Figure 6.2. Projected computational structures of matrix multiplication ............... 109 Figure 6.3. Grouping of a computational structure along [1 0] .............................. 114 Figure 6.4. A computational structure with possible cycles ................................... 116 Figure 6.5. A computational structure with one dependence vector ....................... 118 Figure 6.6. A computational structure with two dependence vector ....................... 119 Figure 6.7 . Grouping along two directions .............................................................. 122 Figure 6.8. A computational structure with three dependence vectors ................... 123 Figure 6.9. A grouping which generates a non-dependence-preservin g structure ................................................................................................ 125 Figure 6.10. Dependence relationships between groups ......................................... 127 Figure 6.11. Relationship between groups in a universal planner array ................. 133 Figure 6.12. The computational structure of the example program ........................ 135 X Figure 6.13. The projected computational structure of the example ....................... 136 Figure 6.14. The grouping along [1 0] with a size of 2 .......................................... 138 xi CHAPTER 1 INTRODUCTION A multiprocessor is a computer system containing multiple processors which are capable of communicating and cooperating at different levels in order to solve a given problem [HwBr84]. A distributed-memory multiprocessor (DMM) is a multiprocessor in which each memory module (not cache memory) is physically associated with each processor and an interprocessor communication network provides a mechanism for communication between processors. Centered around the development of efficient parallel algorithms on DMMs, this thesis focuses on the design, implementation, and modeling of a special class of parallel algorithms, called pipelined data parallel algo- rithms, on DMMs. DMMs are different from shared-memory multiprocessors (SMMs), in which all memory modules are equally accessible to all processors through a processor-memory interconnection network. Both architectures have commercial implementations [Kwan87]. A brief review and comparison of SMMs and DMMs are presented in Sec- tion 1.1. This discussion serves to motivate our study of DMMs. A model of DMMs is given in Section 1.2. Characteristics of DMMs are discussed in more detail, which form the basis for the following discussion. As will be seen in Section 1.2, the major bottleneck in current DMMs is communication. The problems caused by the non- negligible communication overhead are discussed, and, from a user’s point of view, the best way to get around with the communication problem is to carefully design the 1 algorithm. The work presented in this thesis is devoted entirely to achieving this end. 
Pipelined data parallel algorithms are a class of parallel algorithms which uses pipelining and data level partitioning to achieve a very high degree of parallelism. Spe- cial features of pipelined data parallel algorithms make them very suitable for execution on DMMs. As a result, pipelined data parallel algorithms are the primary subject of this thesis. In Section 1.3, the basic concept of pipelined data parallel algorithms is intro- duced by contrasting them with other styles of parallel computations. Finally, a formal statement of the problems addressed in this thesis is given in Sec- tion 1.4, followed by an overview of the thesis in Section 1.5. 1.1. SHARED VERSUS DISTRIBUTED MEMORY MULTIPROCESSORS The multiprocessors discussed in this thesis are MIMD machines, in which multi- ple instruction streams simultaneously operate on multiple data streams. As previously mentioned, to provide such a concurrent processing environment, SMMs treat the memory as a shared resource and use a processor-memory interconnection network to allow the memory to be accessed equally from any processor. The simplest such inter- connection network is a single high speed bus to which multiple processors, memory modules, network interfaces, and device controllers are all tied. Most minisupercomput- ers, such as Alliant’s FX series, Encore’s Multimax, or Sequent’s Balance and Sym- metry series [Hwan87], choose this kind of connection. The advantage is, of course, the simple architecture which performs very well for up to a few tens of processors. How- ever, as the number of processors increases, the contention to access the bus increases, which, in turn, increases memory latency. This limits the scalability and expandability of bus-structured SMMs. At the other end of the spectrum is the SMMs with a crossbar interconnection net- work. The crossbar provides a one-to-one link from any processor to any memory module. The only contention that may occur is when two processors attempt to access the same memory module. The major disadvantage of the crossbar is the high cost in building the network, which, again, limits the scalability of the SMM. Another direct connect architecture [Hwan87] is the multistage interconnection network, which is a compromise between the bus and the crossbar. One of the problems with the multistage network is the difficulty in achieving cache coherence and in synchronizing parallel activities. Popular processor interconnection networks for DMMs include multistage and point-to—point networks. For example, BBN’s Butterfly uses a butterfly multistage net— work to allow processors to access each other’s local memories [BBN87]. Among point-to-point interprocessor connections, the hypercube is the most popular topology, as is found in NCUBE’s NCUBE/10, Intel’s iPSC, and AMTEK’s S-14 [Hwan87, HaMS86]. The reason for this is because the hypercube topology has a uniform struc- ture with logzN diameter, where N is the number of nodes in the hypercube, and a rich set of other topologies, such as ring and mesh can be embedded in it [SaSh85]. One of the most significant advantages of DMMs is their modularity and scalabil- ity. For example, in the hypercube, when the number of processors doubles (from N to 2N), the diameter of the network only increases by one, and the number of neighbors a processor has to connect to is also increased by one. Currently, DMMs with a few hun- dred processors are common. Thus, DMMs hold the promising potential for massive parallelism. 
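These structural properties follow directly from the binary encoding of the node numbers: two nodes are neighbors exactly when their numbers differ in one bit, so an N-node cube has log2 N neighbors per node and a diameter of log2 N. The short C sketch below is only an illustration of this arithmetic; the dimension and node numbers are arbitrary choices, not details taken from any particular machine.

    /* Illustrative sketch only: hypercube adjacency and distance follow from
     * the binary node numbers.  Nodes i and j are neighbors when i XOR j has
     * exactly one 1-bit; the hop count between them is the number of 1-bits
     * in i XOR j; the diameter of a 2^n-node cube is n. */
    #include <stdio.h>

    static int popcount(unsigned x)        /* number of 1-bits in x */
    {
        int c = 0;
        while (x) { c += x & 1u; x >>= 1u; }
        return c;
    }

    int main(void)
    {
        const int n = 6;                   /* a 64-node cube, chosen arbitrarily */
        unsigned node = 21;                /* an arbitrary node number           */

        printf("neighbors of node %u:", node);
        for (int d = 0; d < n; d++)        /* flipping each bit gives the n neighbors */
            printf(" %u", node ^ (1u << d));
        printf("\n");

        printf("hops from node 21 to node 42: %d\n", popcount(21u ^ 42u));
        printf("diameter of a %d-node cube: %d\n", 1 << n, n);
        return 0;
    }

Doubling the number of nodes adds one bit to every node number, which is why both the node degree and the diameter grow by exactly one.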
It is for this reason that DMMs are studied in this thesis. However, DMMs are not without problems. The main problem is the need for communication between processors to exchange data held in each other's local memories. In general, there are two possible synchronization and communication methods for DMMs: shared-variable and message-passing. In the shared-variable approach, processes communicate by sharing common variables. Though the shared-variable approach provides a unified and friendly environment for users, it takes a great deal of effort to support the logical view of a shared, global memory on top of distributed, local memories. Very high speed communication is necessary to close the performance gap between accessing local and remote memory modules.

In the message-passing approach, processes communicate by passing messages explicitly. Message-passing is much simpler to implement on DMMs, because it only uses simple primitives such as send and receive. More importantly, to achieve real massive parallelism, it is necessary to connect hundreds or even thousands of processors together. For such a complex system, message-passing is a natural choice for interprocess(or) communication. Nevertheless, message-passing requires a completely different style of algorithm design and programming than the conventional shared-variable approach. Therefore, in this thesis, we will concentrate on developing efficient parallel algorithms on message-passing DMMs.

1.2. A MODEL OF DISTRIBUTED-MEMORY MULTIPROCESSORS

Figure 1.1 is the block diagram of a typical DMM. A set of node processors are interconnected through a point-to-point network, and each can communicate with a frontend host. The words processors and nodes will be used interchangeably in the following discussion.

[Figure 1.1. The block diagram of a typical DMM: a host with I/O processors connected to nodes 0 through N-1, which are joined by a point-to-point interprocessor connection.]

(1) Node structure:

Nodes have homogeneous structures and run in an asynchronous fashion. They communicate with each other by passing messages. Each node consists of a computation unit, a communication unit, and a local memory. The computation is usually performed by the node CPU with some floating-point accelerators. Optional vector processors may be used, such as in Intel's iPSC-VX [Hwan87]. The communication unit is in charge of the communication with the host and the other nodes. The task of communication may be performed by the CPU or by dedicated processors through DMA (Direct Memory Access). For example, in many second generation DMMs, nodes contain intelligent routers dedicated to communication [ShFi87]. These routers not only reduce the time for initiating and receiving a message, but also reduce the time spent relaying messages. More about communication will be discussed shortly.

Local memory in each node contains space for the node operating system, user processes, and communication buffers. For many systems, the size of local memory is still a limiting factor. For example, each node in the NCUBE has only 512 Kbytes of local memory. In early DMMs, virtual memory is not available at the nodes, because there is no direct access from the nodes to any system peripherals other than through the host. The situation is changing in the second generation DMMs, in which special I/O subsystems are incorporated into the nodes to allow direct access to I/O devices [WiCM88].

(2) Host:

One or more frontend hosts are connected to the nodes to manage the whole system.
Popular host to node interconnection topology include one-to-one direct connec- tion, as in NCUBE’s NCUBE/10, or bus connection, as in Intel’s iPSC/l. In the first generation DMMs, all peripherals, such as disks, tape drivers, terminals, and network controllers, are all connected to the host. Therefore, hosts provide the primary user interface and perform such functions as program development, performance monitor- ing, program and data downloading, file system maintenance, and program debugging. With new concurrent I/O subsystems, the role of hosts in future DMMs may be changed or replaced by the nodes. (3) Node interconnection network: Most DDMs use a static point-to-point network for node interconnection, such as a hypercube, mesh, tree, or ring [Hwan87]. In earlier implementations, messages are routed through the network using the store-and-forward approach. That is, a message or a packet of a message is stored completely at each intermediate node before it is sent out to the next node. This not only introduces a very high communication delay, which is proportional to the distance of the communication path, but also increases the degree of interference with the intermediate nodes, e.g., the need to allocate communication buffers in these nodes. In the second generation DMMs, intelligent routers utilizing new techniques, such as wormhole routing [DaSe87], are used to reduce the overhead in routing messages. In worrnhole routing, as soon as a node examines the header of a message, it selects the next channel and begins forwarding the message down the channel. Thus, only a few control bits are buffered at each intermediate node. As a result, less interference is experienced in the intermediate nodes and message latency remains almost constant, regardless of the distance between the source and destination. The implication is that the network becomes virtually fully connected [ShFi87]. (4) System software: In a message-passing DMM, the only means of communication and synchroniza- tion at the system level, just as at the machine level, is by passing messages. There are no shared variables nor centralized sequence control. One node is dedicated to one job at a time. A user must gain exclusive access to a set of nodes before the user can use the nodes. Again, this situation will be changed in future generation DMMs. Conventional programming languages are supported with enhanced library routines to handle mes- sages and node management. A more friendly and advance programming environment is needed, which provides intelligent compilers and debuggers for DMMs. As mentioned earlier, new router designs reduce the latency in routing messages. However, there is still a non-negligible overhead in setting up and receiving a message. Most overhead comes from the system software when invoking send/receive routines and maintaining message buffers. Thus, even though DMA channels and routers can reduce the communication overhead, it is still up to the software — both operating sys- tem and algorithm — to minimize the overall communication delay. From the above discussion, one can see that DMMs achieve modularity and exten- dibility by minimizing the coupling between nodes. Nevertheless, this also introduces a non-negligible cost of communication. Table 1.1 lists performance figures for three representative first generation DMMs [Duni87]. Table 1.1. 
Performance figures for three DMMs

                                   AMETEK S-14    Intel iPSC     NCUBE
    8-byte transfer time (µs)            640.0        1120.0     470.0
    8-byte multiply time (µs)             33.9          43.0      14.7
    Communication/Computation               19            26        32

It can be seen in Table 1.1 that transferring a datum (a 64-bit floating-point number) takes a much longer time than performing a computation on it. This imbalance is expected to improve in the second generation DMMs through streamlined communication primitives and intelligent routers. However, as discussed, the distributed nature of DMMs introduces a non-uniform access time in referencing local and remote memories. Therefore, it is essential not only to minimize the number of messages exchanged in the algorithm, but also to choose an appropriate message size in order to balance the computation with the communication.

In general, there are two possible sources of communication bottlenecks in a DMM. One is the communication bandwidth between the nodes, which has been discussed above. The other is the I/O bandwidth between the nodes and the external world, e.g., disks. For example, in the current generation of DMMs, only the host has access to disks. Thus, at the initialization and summing-up stages of the computation, the host must download data to the nodes and the nodes must upload data back to the host, perhaps sequentially. It follows that the host and the host-to-node interconnection (or I/O bandwidth) also tend to be system bottlenecks.

Putting all these considerations together, it can be seen that, from the algorithm designer's point of view, the best way to minimize the communication effect is to carefully partition the algorithm and schedule the operations — a topic which will be further discussed.

1.3. STYLES OF PARALLEL COMPUTATIONS

In general, a parallel computation can be classified from two perspectives: one is according to the way the computation is partitioned and distributed, and the other is based on the way the computation is executed. From the first perspective, we can distinguish between computations that are data parallel or function parallel [Oste87]. A function parallel computation decomposes the program into segments. Different processors execute different segments in parallel. Function parallelism is suitable for applications that can be implemented using programs with many unique subroutines, e.g., flight simulation. A data parallel computation divides data among the processors. Processors may run the same program but work on different subsets of the data. Algorithms using this technique are called data parallel algorithms [HiSt86]. Data parallelism is appropriate for applications that perform the same operations repeatedly and independently on a large set of data. In terms of programming, programs with loops that handle static and regular data structures are suitable for data parallel decomposition. Typical applications of data parallel algorithms include circuit simulation, network simulation, matrix operations, signal processing, and ray tracing.

From the perspective of scheduling, a parallel computation can be characterized as either concurrent or pipelined [HwBr84, Squi86]. Concurrency exploits spatial parallelism by utilizing several processors to execute multiple independent tasks simultaneously. Tasks may be data parallel or function parallel. Pipelining exploits temporal parallelism, in which each processor (called a stage) behaves like a filter or transformer that operates on its input data and passes output data to the succeeding processor.
Through pipelining, communication occurs only between fixed, neighboring stages, and a datum, once it enters the system, is used or modified repeatedly along the pipeline. Thus, a very high ratio of computation rate to I/O rate results. One of the most important characteristics of pipelined computation is the notion of data flows — data flowing through the pipeline as they are processed stage by stage. To see whether a computation contains data flows, the following flow test suggested in [NeSn87] is very helpful:

Flow test: Select an arbitrary edge (not necessarily at the boundary) and "radioactively tag" a single transmission across that edge. As the computation progresses, let all values computed with one or more radioactive values become radioactive. Then, define the "contaminated region" as the set of processors and edges touched by some radioactive values.

An algorithm passes the flow test if the edge over which the initial value was transmitted is on the boundary of the contaminated region. Note that the flow test is not perfect, in the sense that there are algorithms generally called pipelined or systolic which fail the test, e.g., the systolic array for matrix multiplication proposed in [KuLe78]. However, it does capture the essence of the flow concept and serves well in most cases.

From the above discussion, we can see that there are four different styles of parallel computations, as shown in Table 1.2. In the remainder of this section, examples will be used to further explain the basic idea of each style.

Table 1.2. Different styles of parallel computation

                                       Partition
    Scheduling       Function                          Data
    Concurrent       Concurrent function parallel      Concurrent data parallel
    Pipelined        Pipelined function parallel       Pipelined data parallel

(1) Concurrent function parallel computation

In this style of computation, efforts are devoted to identifying concurrently executable operations or program modules. Consider the following program segment:

    l := 10;
    for i := 1 to k do                                          (1.1)
        d_i := l × a_{i-1} + b_i × c_{i+1};

Note that in the program segment above, l × a_{i-1} is independent of b_i × c_{i+1}. Therefore, these two operations can be executed in parallel, as shown in Figure 1.2. Here, two processors are used, and it is easy to see that the workloads are not always balanced. Note that concurrent function parallel computation can also be applied to other program constructs, such as a program statement, a function block, or a subroutine. Most scheduling and load balancing algorithms concentrate on scheduling these program modules to achieve concurrent function parallelism [Gonz77].

[Figure 1.2. A concurrent function parallel computation: processor 0 computes tmpa := l × a_{i-1}, processor 1 computes tmpb := b_i × c_{i+1}, and d_i := tmpa + tmpb.]

(2) Concurrent data parallel computation

If data parallelism is used in decomposing the program segment (1.1), we obtain an execution scheme similar to that in Figure 1.3. Since there is no data dependency between loop iterations, the iterations in (1.1) can be executed on multiple processors in parallel — the processors perform the same task but use different data. A DMM can be used to execute the computation shown in Figure 1.3, in which all array elements are downloaded into the corresponding processors beforehand, and then processor 0 (or the host) broadcasts the value of l (10 in (1.1)) to the other processors. Note that the number of processors is not necessarily the same as the loop count. We can use n (

Consider the multiplication of two matrices A and B, A × B = C.
Figure 2.1(a) describes the task, where circles represent data vertices (the matrices) and squares represent operation vertices (the multiplication). One way of performing the multiplication is to partition A and B into four submatrices as follows:

    A × B = [ A_00  A_01 ] × [ B_00  B_01 ] = [ C_00  C_01 ]
            [ A_10  A_11 ]   [ B_10  B_11 ]   [ C_10  C_11 ]

where C_ij = C_ij^(0) + C_ij^(1) = A_i0 B_0j + A_i1 B_1j, 0 ≤ i, j ≤ 1. The P-net shown in Figure 2.1(b) defines the operations necessary to calculate the matrix product in this case. The set of initial data vertices is U_i = { A_ij, B_ij | 0 ≤ i, j ≤ 1 }, and the set of resultant data vertices is U_r = { C_ij | 0 ≤ i, j ≤ 1 }. Suppose that both matrices A and B have size 64×64, and that scalar multiplications and additions both take 1 unit of time. Then each data vertex represents a 32×32 submatrix and has size f_u = 1024. Also, f_v = 65,536 for each multiplication vertex and f_v = 1024 for each addition vertex. □

[Figure 2.1. Matrix multiplication and its corresponding P-net: (a) the original task with data vertices A, B, C and a single multiplication vertex; (b) the P-net over the submatrices A_ij and B_ij, with multiplication and addition vertices producing C_00, C_01, C_10, and C_11.]

The dynamic behavior of a P-net is described by token markings. A token in a data vertex u_i indicates the existence of a data element, which occupies a memory space of size f_u(u_i). Data vertices can be viewed as FIFO queues for tokens. A token has two states: busy and ready. A token is busy when it is first created and its data elements are being generated by the preceding operation vertices. A token becomes ready when its data elements are ready for use by the succeeding operation vertices. Also associated with a token is a variable tag, which is initialized to the out-degree of the corresponding data vertex and is decremented by 1 whenever one succeeding operation finishes referencing the token. The basic firing rules for an operation vertex of a P-net are as follows:

(1) An operation is enabled if and only if each of its input data vertices holds at least one ready token.
(2) An operation can fire if it is enabled.
(3) When an operation v_i fires, one busy token is deposited in each of its output data vertices, with the tag initialized accordingly.
(4) The operation continues firing for f_v(v_i) units of time.
(5) At the end of the firing, the busy tokens in all output vertices become ready, and the tags of the ready tokens in all input vertices are decremented by 1. If a tag is reduced to 0, the corresponding ready token is removed.

Figure 2.2 illustrates the execution of the P-net shown in Figure 2.1 at various instants of time, assuming maximal parallelism. It can be seen from Figure 2.2 that the total job execution time is 66,560 and the maximum memory space required is 16,384. Note that for a pure sequential system (e.g., a uniprocessor system) the computation time becomes 528,384, which is simply the sum of all the f_v values.

The execution of a P-net on a DMM can be characterized from two perspectives: one is time (when should an operation vertex be fired?) and the other is space (where should a data vertex reside?). A P-net can be modified to incorporate the effect of partitioning and communication on DMMs. Consider the following example:

Suppose the matrix multiplication in Example 2.1 is to be scheduled on a DMM with three processors. A possible allocation of the P-net among these three processors is shown in Figure 2.3. The effect of communication delay is represented by introducing new operation vertices, called communication vertices and denoted by v_c.
The buffers for received data in the receiving processors are represented by buffer vertices, which are the outputs of communication vertices. The value of f_v(v_c) is the time to transmit the corresponding data elements from the source processor to the destination processor. Note that, given a P-net and a mapping of the vertices to processors, it is straightforward to incorporate communication delay and derive the P-net shown in Figure 2.3. □

In summary, data generation and consumption in a P-net are represented by operation firings and token movements. Thus, data vertices in a P-net serve as templates of data — a token in a data vertex denotes the existence of a data element. Communication and buffer vertices are introduced to represent communication and partitioning. Space and time parameters are incorporated into the model, which allows the system status and resource requirements to be traced. Thus, P-nets are more suitable for algorithm development and performance analysis on DMMs.

[Figure 2.2. Execution of a P-net at various time instants, plotted against space (peaking at 16,384) and time, with fired operations, busy tokens, and ready tokens marked.]

[Figure 2.3. Partitioning a P-net among three processors (processor 0, processor 1, and processor 2), with communication and buffer vertices inserted where data cross processor boundaries.]

Note that basic Petri nets [Pete77] are a special case of P-nets in which f_v = f_u = 0 for all vertices. This implies that P-nets have at least the same modeling and computing power as Petri nets.

2.1.2. Variations and Analysis of P-nets

Based on the characteristics of data parallel algorithms, we can simplify the discussion by considering a subset of P-nets with the following properties:

(1) acyclic, i.e., the P-net does not contain a loop,
(2) decision-free, i.e., every operation vertex has only one output data vertex,
(3) safe, i.e., any data vertex can contain at most one token at any time, and
(4) functional, i.e., every data vertex is generated by at most one operation vertex.

From these properties, a much simpler representation can be derived by noting that there is a one-to-one correspondence between a data vertex and an operation vertex. Thus, corresponding to each such P-net, G = (U, V, E, f_u, f_v), a simplified graph, called a D-net, can be defined. A D-net is a four-tuple, D_G = (U, E', f_u, f'_u), where

    (u_i, u_j) ∈ E'  if and only if there exists a v ∈ V such that (u_i, v) ∈ E and (v, u_j) ∈ E
    f'_u(u_j) = f_v(v)  if (v, u_j) ∈ E

That is, f'_u(u_j) is the time to generate the data vertex u_j. Figure 2.4 depicts the corresponding D-net for the P-net shown in Figure 2.1.

If there is only one set of input data to be processed, i.e., there is no pipelining, then each data vertex will be fired only once and will contain at most one token. This case may occur if, for example, we consider only one iteration of a loop program or unfold the whole loop. It follows that the execution of a D-net on a DMM can be characterized by two state functions.
The state function T_u(u_i) denotes the time at which data vertex u_i becomes ready, and the function S_u(u_i) denotes the processor on which u_i is to be generated. Given these two functions, other state information can be deduced.

[Figure 2.4. The D-net corresponding to the P-net in Figure 2.1.]

For example, the total job execution time, T, is given by

    T = max_{u_i ∈ U_r} T_u(u_i)                                        (2.1)

Similarly, the time at which a data vertex u_i becomes busy is given by

    T_u1(u_i) = T_u(u_i) - f'_u(u_i)

and the time at which u_i can release its token is given by

    T_u2(u_i) = max_{(u_i, u_j) ∈ E'} T_u(u_j)

It follows that the memory required for the execution is

    S = max_t Σ_{u_i ∈ U} K(u_i, t) f_u(u_i)                            (2.2)

where

    K(u_i, t) = 1 if T_u1(u_i) ≤ t ≤ T_u2(u_i), and 0 otherwise.

2.2. CONSIDERATIONS FOR PARTITIONING AND MAPPING

Scheduling in a multiprocessor involves the assignment of operations or data elements to particular processors for execution at particular instants. The multiprocessor scheduling problem has been studied extensively for over 20 years [Gonz77]. The list scheduling or critical path scheduling algorithm [AhCh74, Hu61, PoBa87] is the most well known. Unfortunately, this kind of scheduling algorithm does not consider communication delay, and is difficult to extend to incorporate it; thus, it is mainly applied to SMMs.

On the other hand, scheduling algorithms for DMMs emphasize minimizing inter-processor communication for a given problem. In general, the problem is solved by the min-cut algorithm [RaSH79] or is formulated as a variation of the quadratic assignment problem [MaLT82]. One drawback of these scheduling methods is that they ignore data dependence relationships. This may result in processor idleness due to synchronization.

To take into account both communication delay and data dependency, the scheduling problem should be studied at two levels — inter-processor and intra-processor. Inter-processor scheduling is concerned with the partitioning and mapping of a given algorithm onto a multiprocessor. Given a mapping, intra-processor scheduling is concerned with the execution sequencing within each processor. Scheduling decisions at one level will affect those made at the other level.

In this section, general considerations for partitioning and mapping algorithms are studied first, in Sections 2.2.1 and 2.2.2, respectively. Then we concentrate on partitioning and mapping methods for algorithms represented by P-nets in Section 2.2.3.

2.2.1. Number of Nodes and Partitions

Given an application and the associated D-net, the first problem facing an algorithm designer is the number of processors and the algorithm partitioning scheme that should be used. The primary goal of algorithm partitioning is to partition the algorithm (i.e., the corresponding D-net) so that the processors have an equal amount of load to work with. The "load" of a processor should include not only the computation time of the assigned task but also the communication time with other processors and the idle time due to synchronization (see Example 2.1). Therefore, if the communication takes more time than the computation, the communication delay between two communicating vertices residing in two different processors will negate the gain due to parallelism. In this case, we should use fewer processors and group several related vertices together to increase the computation-to-communication ratio. We refer to the vertices assigned to the same processor as a group or partition.
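To make the notion of load concrete, the following sketch (an illustration only, not an algorithm from this thesis) computes, for a candidate partition of a small D-net, each group's total computation time and the volume of data that must be communicated between groups. The example D-net, its f'_u and f_u values, and the partition are all invented for illustration, and idle time due to synchronization is not modeled.

    /* Illustrative sketch: per-group computation load and inter-group traffic
     * for a candidate partition of a D-net.  fprime[v] plays the role of
     * f'_u(u_v) (time to generate vertex v) and fsize[v] the role of f_u(u_v)
     * (size of the data element).  All values are invented for the example. */
    #include <stdio.h>

    #define NV 6                               /* data vertices           */
    #define NE 6                               /* edges of E'             */
    #define NG 2                               /* groups in the partition */

    static const double fprime[NV] = { 4, 4, 4, 4, 2, 2 };
    static const double fsize[NV]  = { 1024, 1024, 1024, 1024, 1024, 1024 };
    static const int edge[NE][2]   = { {0,4}, {1,4}, {2,5}, {3,5}, {4,5}, {1,5} };
    static const int group[NV]     = { 0, 0, 1, 1, 0, 1 };   /* vertex -> group */

    int main(void)
    {
        double load[NG] = { 0.0, 0.0 };
        double traffic = 0.0;

        for (int v = 0; v < NV; v++)           /* computation assigned to each group */
            load[group[v]] += fprime[v];

        for (int e = 0; e < NE; e++)           /* data whose producer and consumer   */
            if (group[edge[e][0]] != group[edge[e][1]])   /* sit in different groups */
                traffic += fsize[edge[e][0]];             /* must cross the network  */

        for (int g = 0; g < NG; g++)
            printf("group %d: computation load = %.1f\n", g, load[g]);
        printf("inter-group communication volume = %.1f bytes\n", traffic);
        return 0;
    }

Grouping more vertices into the same partition raises the computation term of that group while removing edges from the cut, which is exactly the trade-off described above.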
It follows that a good partitioning scheme, in addition to distributing the workload, should also minimize the communication traffic between groups and match the number of groups to the capacity of the host machine. At the end of the partitioning phase, an application contracted graph is obtained. A contracted graph, M = (V_M, E_M), has k = |V_M| vertices, where k ≤ |V_H|. Each vertex corresponds to one group, and there is an edge between two vertices in M if communication exists between their corresponding groups. The partitioning phase is sometimes referred to as contraction [Berm87].

2.2.2. Mapping Considerations

The multiprocessor on which the algorithm is to be executed can be represented by an undirected graph, called the host graph, which is denoted by H = (V_H, E_H). Processor i in the multiprocessor is represented by a vertex v_i ∈ V_H, and a communication link between processor i and processor j is represented as an edge (v_i, v_j) ∈ E_H.

Based on the contracted graph, M, of an algorithm, a suitable processor interconnection pattern is chosen and each group in M is assigned to a different processor in H. The mapping procedure is trivial if there is no inter-group communication in the contracted graph, or if the host graph is completely connected. However, if the communication cost is non-uniform across the system and is a function of the distance between two communicating processors, then one must ensure that the communication takes place along the shortest path. For machines such as hypercube multiprocessors, which embed a large set of topologies, the choice of a suitable interconnection pattern is relatively easy and flexible.

In general, mapping is a two-phase operation. One phase, the group mapping phase, may be defined as g(v_i) = v_j, for each v_i ∈ V_M and some v_j ∈ V_H, which is a one-to-one mapping between vertices in M and processors in the host graph H. Two types of paths in a mapping may be identified: a logical path, which is an edge in the contracted graph, and a physical link, which is an edge in the host graph. The second phase of the mapping, the path mapping phase, is to define a function, h, that maps the logical paths onto the physical links. The whole mapping function, f, is then defined as f = g ∘ h.

The number of physical links that a logical path has to traverse defines the dilation of that logical path. The number of logical paths which traverse a particular physical link is referred to as the sharing of that link. Since a physical link allows one message to pass at a time, the higher the sharing, the longer the communication delay will be. Also, the more links a message must traverse, the greater the communication traffic will be. Thus, a good mapping function, f = g ∘ h, should try to minimize the average dilation cost and sharing cost.

It can be shown that, for an arbitrary problem topology, both component partitioning and group mapping are NP-complete problems [GaJo79]. Therefore, a good heuristic approach should be used, such as simulated annealing [Berm87]. It is believed that the partitioning and mapping problem must take into account the characteristics of the underlying architecture as well as the behavior of the algorithms running on such a system.

2.2.3. Inter-Processor Scheduling

As mentioned earlier, the load of a processor includes the computation, communication, and idle time of the processor. Partitioning and mapping use this timing information to distribute the workload evenly among the processors.
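As an illustration of the dilation and sharing costs defined in Section 2.2.2, the sketch below evaluates a candidate group mapping onto a small hypercube host graph. The contracted graph, the mapping, and the use of dimension-order routing to decide which physical links a logical path occupies are all assumptions made for this example, not details taken from the thesis.

    /* Illustrative sketch: dilation and sharing of a candidate mapping.
     * The contracted graph M, the mapping g (group -> hypercube node), and
     * dimension-order routing are assumptions chosen for the example. */
    #include <stdio.h>

    #define DIM 3                                    /* 8-node hypercube host graph */
    #define NM  4                                    /* groups in M                 */
    #define EM  4                                    /* logical paths (edges of M)  */

    static const int medge[EM][2] = { {0,1}, {1,2}, {2,3}, {3,0} };
    static const int g[NM]        = { 0, 3, 5, 6 };
    static int sharing[1 << DIM][DIM];               /* traffic on link (v, v ^ (1<<d)) */

    static int hamming(int a, int b)                 /* links a shortest path must      */
    {                                                /* traverse = number of differing  */
        int c = 0;                                   /* bits between the two node ids   */
        for (int x = a ^ b; x; x >>= 1) c += x & 1;
        return c;
    }

    int main(void)
    {
        int total = 0;

        for (int e = 0; e < EM; e++) {
            int src = g[medge[e][0]], dst = g[medge[e][1]];
            int dil = hamming(src, dst);
            total += dil;
            printf("logical path %d-%d on nodes %d-%d: dilation %d\n",
                   medge[e][0], medge[e][1], src, dst, dil);

            for (int d = 0, cur = src; d < DIM; d++)       /* dimension-order route    */
                if (((cur ^ dst) >> d) & 1) {
                    int next = cur ^ (1 << d);
                    sharing[cur < next ? cur : next][d]++; /* charge the physical link */
                    cur = next;
                }
        }
        printf("average dilation = %.2f\n", (double)total / EM);

        for (int v = 0; v < (1 << DIM); v++)
            for (int d = 0; d < DIM; d++)
                if (sharing[v][d] > 1)                     /* links carrying several paths */
                    printf("link %d-%d shared by %d logical paths\n",
                           v, v ^ (1 << d), sharing[v][d]);
        return 0;
    }

A mapping heuristic such as simulated annealing would repeatedly perturb g and keep the changes that reduce these two costs.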
Processor idle time, however, cannot be determined until the intra-processor execution sequence is known. The intra-processor schedule, in turn, determines the inter-processor synchronization. On the other hand, an inter-processor partitioning and mapping scheme must be decided before the intra-processor schedule can be determined. One approach to breaking this dilemma is to perform partitioning and mapping without concern for the processor idle time. Then, given a D-net, (U, E', f_u, f'_u), the inter-processor scheduling problem can be formulated as a quadratic assignment problem as follows.

Let d_kl denote the distance between processors k and l in H, and let q_k f'_u(u_i) be the computation cost of vertex u_i on processor k, where q_k is the speed factor of processor k. Define the assignment matrix X to be such that x_ik = 1 if vertex u_i is mapped to processor k, and 0 otherwise. Then, the cost of an inter-processor schedule is

    Cost(X) = Σ_i Σ_k x_ik q_k f'_u(u_i) + Σ_{(u_i, u_j) ∈ E'} Σ_k Σ_l x_ik x_jl c f_u(u_i) d_kl

where c is the cost of transmitting one byte of data between two adjacent processors. The memory limitation requires that, for each processor k,

    Σ_i x_ik f_u(u_i) ≤ M_k

where M_k is the maximum size of the available memory on processor k.

The above formulation can be solved by iterative improvement, which, starting from an arbitrary allocation, improves the allocation iteratively through such techniques as the one proposed in [KeLi70], simulated annealing [KiGV83], or neural networks [HoTa85]. Using the iterative method, we may apply the inter- and intra-processor scheduling alternately in each iteration to obtain a more accurate cost estimate. Note that, through intra-processor scheduling, it is easy to identify the critical path(s) of the computation. New configurations which involve moving vertices on these critical paths have a better chance of reducing the total execution time of the algorithm. The problem then becomes that of identifying, on the critical paths, the vertex whose reallocation will reduce the cost the most, and determining where the candidate vertex should be moved.

2.3. CONSIDERATIONS FOR SCHEDULING WITHIN PROCESSORS

In this section, issues in intra-processor scheduling are discussed. Given a D-net D = (U, E', f_u, f'_u) and a partition (U^(1), U^(2), ..., U^(k)), where U^(i) ∩ U^(j) = ∅ for all i ≠ j and U^(1) ∪ U^(2) ∪ ... ∪ U^(k) = U, intra-processor scheduling attempts to develop an execution schedule, X, for all u ∈ U. The partition U^(i), 1 ≤ i ≤ k, is assigned to processor i, where k is the number of processors used. Figure 2.5 shows a possible schedule for the D-net shown in Figure 2.4. Dashed arcs are added by the schedule to enforce the total ordering. In the following discussion, we will not consider the scheduling of initial data vertices, U_i, or buffer vertices, U_b. Initial data vertices are assumed to have been loaded beforehand, while the execution time of a buffer vertex can follow that of its predecessor vertex, if remote data are assumed to be transferred to local nodes as soon as they are generated.

A lower bound for the total execution time T of an algorithm is given by

    T ≥ max_{1 ≤ i ≤ k} Σ_{u ∈ U^(i)} f'_u(u)

However, due to data dependence relationships, a processor may idle while waiting for data from other processors. Thus, the above lower bound may not be achievable. To obtain the optimal schedule, we must first determine the cost of a given schedule. Recall that the state of a D-net can be completely described by the state functions T_u and S_u (see Section 2.1.2).
[Figure 2.5. A schedule for the D-net in Figure 2.4, with the vertices distributed among processors 0, 1, and 2.]

S_u is given by the partition (i.e., by the inter-processor scheduling), and it is easy to see that

    T_u(u_i) = f'_u(u_i) + max_{u_j ∈ R(u_i, X)} T_u(u_j)   if u_i ∉ U_i
    T_u(u_i) = 0                                            if u_i ∈ U_i            (2.3)

where, as defined in Section 2.1.2, f'_u(u_i) is the time to compute u_i, and R(u_i, X) is the set of predecessor vertices of u_i in the schedule X. A vertex u_j is a predecessor of another vertex u_i if either (u_j, u_i) ∈ E' or u_i is generated immediately after u_j in the schedule. From T_u and Equations (2.1) and (2.2), one can obtain the total execution time of the schedule and the maximum memory needed for each partition. If a schedule's memory requirement does not exceed the maximum available, then the schedule is a feasible schedule.

One way to determine an optimal or near-optimal intra-processor schedule for a given D-net is by means of the branch-and-bound method. The cost of a particular search path is expressed as the sum of the cost of the partial solution found so far and the cost from the present partial solution to the final solution (the future cost). Possible candidates for expansion in a search path are those data vertices in the D-net that are enabled at that point. For intra-processor scheduling, a simple estimate of the future cost is

    max_{1 ≤ i ≤ k} Σ_{u ∈ Ū^(i)} f'_u(u)

where Ū^(i) is the set of data vertices in partition U^(i) that have not yet been scheduled. A particular search path can be pruned if it violates space or precedence constraints.

Though branch-and-bound is a powerful way to search for feasible solutions, its complexity still grows exponentially with the problem size. One heuristic, adapted from critical-path scheduling [AhCh74, PoBa87], works by associating two level values with each vertex in a D-net. The forward level of a vertex u, α_f(u), denotes the maximum "distance" from u to a resultant vertex and is defined as

    α_f(u) = max_{P_f} Σ_{u_i ∈ P_f} f'_u(u_i)

where P_f is a path from the vertex u to a resultant data vertex. The backward level of a vertex u, α_b(u), denotes the maximum "distance" from u to an initial vertex and is defined similarly. An intra-processor schedule can thus be found by the following steps:

(1) For each partition U^(i), 1 ≤ i ≤ k, arrange all vertices in that partition in nonincreasing order into a linear list L_i, according to α_f first and then α_b.
(2) For each partition U^(i), 1 ≤ i ≤ k, check the space requirement of L_i using Equations (2.1) and (2.2).
(3) For each partition U^(i), 1 ≤ i ≤ k, modify L_i if the space constraint is violated.
(4) Report the resultant L_i's as the schedule if step (3) succeeded. Otherwise, there is no feasible solution for the given set of constraints.

A modification in step (3) may involve swapping the order of generation of data vertices within a processor so that the time interval during which a piece of data occupies memory is minimized. Another way of increasing memory efficiency is to control the time at which a data vertex is transferred from one processor to another. If a schedule cannot be obtained at the end of step (4), then we might want to increase the number of processors or use another optimization method, such as the branch-and-bound method introduced earlier.

CHAPTER 3

DESIGNING AND ANALYZING DATA PARALLEL ALGORITHMS

In the previous chapter, we discussed general considerations in designing parallel algorithms. In this chapter, we turn our focus to designing data parallel algorithms for DMMs, with emphasis on implementation and analysis issues.
Designing parallel algorithms cannot be isolated completely from the architectural aspect of the underlying multiprocessor. Design and implementation of a parallel algo- rithm on a multiprocessor involves the knowledge of the specific language, operating system, and multiprocessor that are used. On DMMs, this may imply the knowledge of how send and receive routines are specified in the language and invoked in the operat- ing system, how messages are maintained and transmitted, and what are the rates of performing computation and communication. The more knowledge one has, the more efficient the resultant program will be. The same is true for analyzing algorithms. The purposes of analyzing algorithms include estimating algorithm performance to identify inefficiencies in the design and suggesting better scheduling schemes. Again, accurate models of the underlying archi- tecture and algorithm behaviors are needed. These models dictate what timing informa- tion is relevant and how timing information can be combined to provide an overall per- formance figure. 35 36 In this chapter, the design, programming, and analysis of data parallel algorithms for two typical examples, array summation and matrix multiplication, are discussed [NiKP87]. These algorithms were implemented on a 64-node NCUBE multiprocessor. Therefore, in Section 3.1, important characteristics of the NCUBE are reviewed first. Then, modeling computation and communication on DMMs is discussed in Section 3.2, including the measured timings on the NCUBE. Based on the model, data parallel algorithms for array summation and matrix multiplication are presented in Section 3.3 and 3.4, respectively. 3.1. NCUBE Multiprocessors The algorithms and experiments discussed in this thesis are implemented on the NCUBE multiprocessor. Therefore, in this section, the architecture of the N CUBE mul- tiprocessor is reviewed. Section 3.1.1 will concentrate on the hardware aspect of the NCUBE, followed by an introduction in Section 3.1.2 to the software aspect of the NCUBE. 3.1.1. The Hardware The NCUBE multiprocessor is a DMM with a hypercube interprocessor connec- tion which is capable of supporting up to 1024 nodes [HaMS86]. A hypercube of dimension 11 consists of N=2" nodes, each with a direct connection to n other nodes. There is an edge between nodes i and j if i and j have exactly one bit difference in their binary representation, where 0 S i, j SN -1. The most attractive feature of the hypercube tOpology is the rich set of topologies, including meshes and rings [SaSh85], that can be embedded inside the hypercube. Generally speaking, the hypercube offers a good bal- ance between node connectivity, communication diameter, and algorithm embeddabil- ity. That is the reason why most DMMs available today use the hypercube as the 37 underlying interprocessor connection network [Hwan87]. The processor in each node of the NCUBE is a 32-bit proprietary processor which includes a float-point arithmetic unit, memory management logic, and interprocessor communication mechanism, all on a single VLSI chip. Combined with six memory chips for a total of 512 Kbytes, a node consists of only 7 chips. The instruction set is very similar to that of the VAX, which include arithmetic and logic operations, shift, and jump. 
Using a 10 MHz clock, nonarithmetic instructions can be executed at about 2 MIPS (million instructions per second), single-precision floating-point operations at 0.5 MFLOPS (million floating-point operations per second), and double-precision operations at 0.3 MFLOPS. The NCUBE used in this thesis research has 64 nodes with a clock rate of 7 MHz. Each node has 22 bit-serial I/O lines which are paired into 11 bidirectional channels. In this way, a node can communicate with up to 10 other neighbors to form a 10-dimensional hypercube and has one more link to connect to the host. Communication with other nodes is handled through asynchronous DMA.

The host board in the NCUBE multiprocessor uses an Intel 80286 as the CPU and has support for various peripherals. Within each host board, there are 16 NCUBE node processors (called the I/O processors) that provide 128 channels to the hypercube; each I/O processor is connected to 8 nodes in the hypercube. The memory spaces of these I/O processors are part of the 4-Mbyte memory space of the host CPU. In this way, memory in the host board is shared between the 80286 CPU and the 16 I/O processors.

The hardware design of the NCUBE multiprocessor is excellent. A 1024-node NCUBE has a potential execution rate of 2000 MIPS or about 500 MFLOPS. However, it is up to the software, that is, the operating system and the algorithm design, to realize this massive computing power.

3.1.2. The Software

The operating system running on the host of the NCUBE multiprocessor is called AXIS, which is UNIX-like but lacks many of the features of a mature UNIX. To AXIS, the hypercube is nothing more than a device. As with ordinary files, the hypercube can be opened, written to, read from, and closed. Also, AXIS allows a user to allocate disjoint subcubes. Thus, more than one user or application can share the hypercube. Programming on the NCUBE can be done using conventional C or FORTRAN augmented with system primitives to handle messages and cubes.

The node operating system is called VERTEX, which is a small nucleus (less than 4 Kbytes) residing in each node. Its primary function is to provide communication and process management. System primitives available from VERTEX include:

    nread()    receive messages
    nwrite()   send messages
    ntest()    test for message arrivals
    ntime()    return the time since node initialization
    whoami()   return node parameters such as the node id and subcube dimension

From a node program, nwrite() can be called as follows:

    nwrite(msg, msglen, dest, type, status, flag)

where msg is a linear array holding the message, msglen is the length of the message in bytes, dest is the logical id of the receiving node, type indicates the type of the message, and status and flag are for error checking purposes. Therefore, if the data in a message are scattered in different places, the program must first gather the data into a linear array before nwrite() can be called.

A pool of system buffers is maintained by VERTEX for communication. These buffers are used on both the sending and receiving nodes to support non-blocking send and blocking receive operations [MuBA87]. Thus, the caller of nwrite() can resume after a communication buffer has been allocated and the message has been copied into it. Buffers also serve the purpose of queuing messages when the communication channel is busy. When the DMA-controlled message transfer has been completed, an interrupt is signaled. The corresponding interrupt service routine will release the communication buffers.
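To make the gathering requirement concrete, the following minimal C sketch shows a program copying a scattered submatrix into a linear buffer and passing it to nwrite(). Only the nwrite() argument order follows the description above; the matrix dimensions, the destination id, the message type, and the exact prototype (including how status and flag are passed) are illustrative assumptions.

    /* VERTEX supplies nwrite(); this prototype is an assumption for the sketch. */
    extern int nwrite(char *msg, int msglen, int dest, int type,
                      int *status, int *flag);

    #define M        64     /* full matrix is M x M (assumed)           */
    #define ROWS     16     /* submatrix is ROWS x COLS (assumed)       */
    #define COLS     16
    #define DEST     1      /* logical id of the receiving node         */
    #define MSG_TYPE 10     /* application-chosen message type          */

    float a[M][M];          /* matrix stored in row-major order         */
    float buf[ROWS*COLS];   /* linear array required by nwrite()        */

    void send_submatrix(int r0, int c0)
    {
        int i, j, k = 0;
        int status, flag;

        /* gather: copy the scattered submatrix elements into one linear array */
        for (i = 0; i < ROWS; i++)
            for (j = 0; j < COLS; j++)
                buf[k++] = a[r0 + i][c0 + j];

        /* send the gathered block; argument order as given in the text above */
        nwrite((char *) buf, (int) sizeof(buf), DEST, MSG_TYPE, &status, &flag);
    }

The reverse operation on the receiving side would scatter the linear buffer returned by nread() back into the corresponding matrix locations.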
The arguments to nread() are similar to those of nwrite(). nread() examines the unclaimed receiving buffers for a message specified in its arguments. When the requested message has arrived and is located, nread() will copy the message to the user process space and release the message buffer. Otherwise, it will wait until an appropriate message arrives. Thus, nread() supports a blocking receive.

Note that, even though the low-level hardware implementation of an NCUBE node supports broadcasting, there is no broadcast primitive provided in VERTEX. Note also that VERTEX uses a store-and-forward mechanism to relay messages. This mechanism has been shown to be inefficient and has been abandoned in second-generation DMMs [DaSe87, ShFi87].

Using the current AXIS and VERTEX, not only is the programming more difficult than using the shared-variable model, but a very high communication overhead is induced by the inefficiency of the software. It follows that appropriate parallel programming tools and algorithm development methodologies are critical for efficiently utilizing the NCUBE and other DMMs.

3.2. MODELING COMPUTATION AND COMMUNICATION ON DMMs

Modeling an algorithm on a computer system means identifying the most essential characteristics of the algorithm, with the goal of predicting the behavior of the algorithm when executed on the machine. To achieve this goal, we need to define design parameters, such as the size of the partitions, to characterize the behavior of the algorithm. Furthermore, we need to obtain system parameters, such as the message transmission time and the primitive computation time, of the underlying architecture to characterize the behavior of the machine. In this section we will show how to model communication and computation in a DMM, and what the measured system parameters on the NCUBE are.

From the viewpoint of programming, communication is nothing more than invoking system calls to send and receive messages. In general, when invoking these communication primitives, the overhead involved can be classified as overhead that can be overlapped with the computation processor and overhead that cannot. For each category, one can further distinguish between the overhead that is incurred per message and the overhead that is paid on a per-byte basis. Thus, for overhead that cannot be overlapped with the computation, we have the following estimates [GrRe86]:

$$\sigma_h + \tau_h s \quad \text{for communication initiated by the host} \tag{3.1}$$
$$\sigma_n + \tau_n s \quad \text{for communication initiated by the nodes} \tag{3.2}$$

where
$\sigma_h$ = message startup delay in the host,
$\tau_h$ = single-byte transmission time for messages initiated by the host,
$\sigma_n$ = message startup delay in the node,
$\tau_n$ = single-byte transmission time for messages initiated by the nodes, and
$s$ = size of the message (in bytes).

On the NCUBE, for example, $\sigma_h$ and $\sigma_n$ will include such overhead as invoking the send and receive primitives, nwrite() and nread(), and setting up the DMA channels [MuBA87]. On the other hand, $\tau_h$ and $\tau_n$ will include the overhead of copying data to and from the system buffers. Similarly, we can define $\sigma_h'$, $\tau_h'$, $\sigma_n'$, and $\tau_n'$ to characterize the overhead that can be overlapped with the computation processor. Specifically, $\tau_h'$ and $\tau_n'$ will include the time to transmit the message over the communication link.
It follows that, to estimate the time to transmit a message of $s$ bytes from the host to a node, we can use the following formula:

$$(\sigma_h + \tau_h s) + (\sigma_h' + \tau_h' s) + \tau_n s \tag{3.3}$$

The first term accounts for the time to invoke the nwrite() primitive and to copy the data to the system buffer, the second term takes care of the overhead of transmitting the data over the link, and the final term is the time for the destination node to copy the message to the receiving process. It is assumed here that the receiving process has already issued the nread() call and is waiting for the message. Thus, the time $\sigma_n$ is not included.

There is another important factor which will affect the performance of the resultant program, especially when matrices are involved. Note that matrices are usually stored in memory in a row-major or column-major fashion. However, when sending a submatrix, the host process must first gather all elements of the submatrix into a consecutive linear array before it passes the starting address of that array to the send primitive (see Figure 3.1(a)). Similarly, when receiving a submatrix, the system must copy the message into a linear array before the receive primitive returns. Then, the receiving process scatters the data to the corresponding locations in the matrix (see Figure 3.1(b)). Note that the gathering/scattering operation is performed on the host sequentially and cannot be parallelized. Therefore, it is a very expensive operation.

Figure 3.1. Gathering/scattering operations in the NCUBE ((a) gathering: matrix elements in the process space are copied into a linear message buffer and then into a communication buffer; (b) scattering: the reverse sequence on the receiving side).

The host communication overheads with and without data gathering/scattering are both measured. To measure the host overhead in sending a message, the host sends r rounds of s-byte messages to n nodes. Similarly, the host overhead in receiving a message is measured by using n nodes, each of which sends r messages of s bytes to the host. The time, T, to perform these operations is measured on the NCUBE using the VERTEX system call ntime(). To do this, the host designates one node, say node 0, as the time server. Immediately before the operation begins, the host sends one message to inform node 0 to start counting the time. Then, immediately after the operation finishes, the host sends another message to node 0 to stop the clock, and node 0 sends back the elapsed time. When the number of messages in the experiment is large, the overhead involved in sending these control messages can be ignored. The value returned by ntime() is measured in ticks, where 1 tick equals 1024/(processor clock rate in Hz). Thus, if the processor clock rate is 7 MHz, then 1 tick equals 146.2 usec.

From the above experiments, the host overhead in sending or receiving a message is given by T/(nr). The average time to send one message from the host to a node is depicted in Figure 3.2. Data gathering/scattering is simulated in the experiments by copying the message from one array to another in the host before it is sent or received. It is easy to see from Figure 3.2 that the measured data can be fitted with straight lines.
In fact, if data gathering/scattering is included, then the measured host communication overhead can be modeled by

$$\sigma_h + \tau_h s = 8.20 + 0.068\,s \quad \text{for sending a message} \tag{3.4}$$
$$\sigma_{hr} + \tau_h s = 4.55 + 0.068\,s \quad \text{for receiving a message} \tag{3.5}$$

Figure 3.2. Measured host overhead in sending a message (message transmission time in ticks versus message length in bytes, with and without gather/scatter).

On the other hand, if the computation does not involve data gathering/scattering, then the host communication time can be modeled by

$$\sigma_h + \tau_h s = 8.08 + 0.007\,s \quad \text{for sending a message} \tag{3.6}$$
$$\sigma_{hr} + \tau_h s = 4.21 + 0.007\,s \quad \text{for receiving a message} \tag{3.7}$$

Note that (3.4), (3.5), (3.6), and (3.7) all conform to (3.1). Also, from the experiments, we observe that, although there is some overlapping between computation and communication (e.g., the average message transmission time, T/(nr), decreases as the number of rounds, r, increases), the net effect is too small to matter. Therefore, in analyzing algorithms run on the NCUBE, we will ignore the parameters $\sigma_h'$ and $\tau_h'$, which represent the communication overhead that can be overlapped with the computation processor.

Node communication overhead on the NCUBE can be measured by arranging n nodes into a ring. Again, r rounds of message exchanges are performed. In each round, a node sends one message of s bytes to the next node in the ring and receives one message from the previous node. The measured average message transmission time, T/(2nr), is plotted in Figure 3.3 against the message size. Again, the message transmission time can be modeled linearly by

$$\sigma_n + \tau_n s = 3.52 + 0.017\,s \tag{3.8}$$

Figure 3.3. Measured node-to-node communication time (message transmission time in ticks versus message length in bytes).

Note that (3.8) applies to both send and receive operations. Also, the overlapping between computation and communication is small; e.g., (3.8) is independent of the number of receiving nodes. Thus, $\sigma_n'$ and $\tau_n'$ can be ignored in the modeling. Note also that Expression (3.8) does not include data gathering/scattering operations, because in the algorithms studied in this thesis the nodes do not perform any gathering/scattering operations.

Finally, the time ($\tau_c$) to execute the operation $a \times b + c$ and the associated loop manipulation overhead is measured, where a, b, and c are floating-point numbers. Similarly, the time ($\tau_b$) for $a + b$ and the associated overhead is measured. The following table summarizes the system parameters measured on the NCUBE:

Table 3.1. System parameters on the NCUBE (all quantities are in ticks)

              sigma_h   sigma_hr   sigma_n   tau_h    tau_n    tau_c   tau_b
    with G/S   8.20      4.55       3.52      0.068    0.017    0.24    0.15
    w/o  G/S   8.08      4.21       3.52      0.007    0.013    0.24    0.15
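As an illustration of how the measured parameters are used, the following C sketch plugs the values of Table 3.1 into Equations (3.1) through (3.3). The helper names are ours, and, following the observation above, the overlappable terms $\sigma_h'$ and $\tau_h'$ are dropped.

    /* Plugging the measured NCUBE parameters of Table 3.1 into the linear
     * cost model of Section 3.2.  Times are in ticks (1 tick = 146.2 usec
     * at 7 MHz).  The constants below are the with-gather/scatter values. */
    #include <stdio.h>

    #define SIGMA_H   8.20   /* host send startup        */
    #define SIGMA_HR  4.55   /* host receive startup     */
    #define SIGMA_N   3.52   /* node startup             */
    #define TAU_H     0.068  /* host per-byte time       */
    #define TAU_N     0.017  /* node per-byte time       */

    /* Equation (3.1): host-initiated transfer of an s-byte message. */
    double host_msg_time(int s)  { return SIGMA_H + TAU_H * s; }

    /* Equation (3.2): node-initiated transfer of an s-byte message. */
    double node_msg_time(int s)  { return SIGMA_N + TAU_N * s; }

    /* Equation (3.3) with the overlappable terms dropped: host-to-node time
     * for an s-byte message, assuming the receiver has already posted nread(). */
    double host_to_node_time(int s) { return host_msg_time(s) + TAU_N * s; }

    int main(void)
    {
        int s = 2048;   /* a 2-Kbyte message, e.g., 512 four-byte elements */
        printf("host send  : %.1f ticks\n", host_msg_time(s));
        printf("node send  : %.1f ticks\n", node_msg_time(s));
        printf("host->node : %.1f ticks\n", host_to_node_time(s));
        return 0;
    }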
3.3. CASE STUDY — ARRAY SUMMATION

In this and the next section, we present two examples, array summation and matrix multiplication, to see how different factors, such as computation, communication, time, and space, may affect the performance of the algorithms. The algorithms were implemented on a 64-node NCUBE, which allows the discussion to relate to a more realistic environment.

Given a large set of numbers, $a_1, a_2, \ldots, a_L$, the array summation problem is to find the sum $A = a_1 + a_2 + \cdots + a_L$. To parallelize this operation, the array is partitioned into k subarrays, which are allocated to k processors of a hypercube computer. Each processor will receive one subarray and calculate a partial sum. The partial sums are then accumulated to give the total. The primary considerations here are the optimal number of processors to be used and the size of each partition.

Based on the way the partial sums are accumulated, we consider two different array summation schemes in Section 3.3.1. In the centralized accumulation method, the host collects all the partial sums from the processors and evaluates A. Another approach, the tree-structured accumulation method, uses a binary-tree reduction among all the nodes to accumulate the partial sums. In Section 3.3.2, the performance of the algorithms will be modeled using the analytic model introduced in Section 3.2, and, in Section 3.3.3, the analytic results will be studied and compared with experimental results from the NCUBE.

3.3.1. The Algorithms

In the following discussion, let L denote the size of the array and $x_i$ denote the size of the subarray loaded to node i, where $L = x_0 + \cdots + x_{k-1}$. The execution time (including local computation and communication) at node i is denoted by $t(x_i)$ and the size of the resultant data is denoted by $s(x_i)$. Suppose further that each data element takes b bytes. The P-net describing array summation using the centralized accumulation method is shown in Figure 3.4, where $X_i$, $0 \le i \le k-1$, is the set of array elements that are sent to processor i.

Figure 3.4. Array summation with centralized accumulation

Since there is no communication between nodes, any one-to-one mapping from the subarrays to the processors will generate the same result. A straightforward way of partitioning the array is the equal partitioning scheme, in which all subarrays have the same size. However, this scheme does not take into consideration the balance between computation and communication. Thus, another partitioning scheme, ratioed partitioning, is proposed, in which the array is divided in such a way that, when node i has just finished its uploading, node i+1 is ready to upload. By doing so, we can prevent the host from being idle between two adjacent data uploadings. In Section 3.3.2, we will show how to use the analytic model to obtain the optimal partition sizes for ratioed partitioning.

The P-net which describes array summation using the tree-structured accumulation method is shown in Figure 3.5, where k = 4 nodes are used with node 0 as the root of the reduction tree. The root of the reduction tree is responsible for returning A to the host. Note that this tree-structured reduction operation can be executed on a hypercube in such a way that the data movements between nodes in each step are only one hop away [SaSh85], i.e., all nodes only have to communicate with their direct neighbors. Again, either the equal partitioning scheme or the ratioed partitioning scheme can be used.

Figure 3.5. Array summation with tree-structured accumulation

3.3.2. Analytic Modeling

In this subsection, we will illustrate how to derive the execution time (T) for array summation. Consider first the centralized accumulation method using the ratioed partitioning scheme. The timing diagram of this method is shown in Figure 3.6, where C is the host idle time.
From the timing relationship in Figure 3.6, any pair $x_i$ and $x_{i+1}$, $i = 0, \ldots, k-2$, can be related as follows:

$$(\tau_n b\,x_i + \tau_b\,x_i) + (\sigma_n + \tau_n b + \tau_h b) + \tau_b + \sigma_{hr} = (\sigma_h + \tau_h b\,x_{i+1} + \tau_n b\,x_{i+1}) + \tau_b\,x_{i+1} + (\sigma_n + \tau_n b)$$

Figure 3.6. Timing diagram of array summation using centralized accumulation (the host downloads $x_0, \ldots, x_{k-1}$ and then uploads $s(x_0), \ldots, s(x_{k-1})$; node i computes for $t(x_i)$; C is the host idle time between downloading and uploading).

On the left-hand side, $t(x_i) = \tau_n b\,x_i + \tau_b\,x_i$ comprises the local execution time in node i, including copying data from the message buffer and calculating the partial sum, $(\sigma_n + \tau_n b + \tau_h b)$ is the uploading time (see Equation (3.3)), and $\tau_b$ and $\sigma_{hr}$ are the times during which the host accumulates the sum and invokes the next receive primitive, respectively. The quantities on the right-hand side can be interpreted similarly. Define

$$\alpha = \frac{\tau_n b + \tau_b}{\tau_h b + \tau_n b + \tau_b} \qquad \text{and} \qquad \beta = \frac{\tau_h b + \tau_b + \sigma_{hr} - \sigma_h}{\tau_h b + \tau_n b + \tau_b}$$

Then we have

$$x_{i+1} = \alpha x_i + \beta, \qquad i = 0, \ldots, k-2 \tag{3.9}$$

It is easy to derive from (3.9) that

$$x_i = \alpha^i x_0 + \frac{1 - \alpha^i}{1 - \alpha}\beta, \qquad i = 0, \ldots, k-1$$

Combined with the fact that $L = x_0 + \cdots + x_{k-1}$, we have

$$x_0 = \frac{(1-\alpha)L - k\beta}{1 - \alpha^k} + \frac{\beta}{1-\alpha} \tag{3.10}$$

Let $T_d$ denote the downloading time of the last k-1 subarrays, i.e.,

$$T_d = \sum_{i=1}^{k-1}(\sigma_h + \tau_h b\,x_i) = (k-1)\sigma_h + \tau_h b\,(L - x_0)$$

Then, the host waiting time C can be obtained from the timing relationship between the host and node 0 in Figure 3.6 as

$$C = \rho(\tau_n b\,x_0 + \tau_b\,x_0 + \sigma_n + \tau_n b - T_d + \sigma_{hr})$$

where

$$\rho(y) = \begin{cases} y & \text{if } y > 0 \\ 0 & \text{otherwise} \end{cases}$$

Note that C = 0 implies that the host is a bottleneck, because, before the host finishes downloading to node k-1, node 0 has already generated the first partial sum and is waiting to upload. In other words, the host cannot keep up with the computing speed of the nodes. This also implies that we are using so many processors that the communication dominates the computation. From the above expressions, the whole job execution time can be found as:

$$T = \sum_{i=0}^{k-1}(\sigma_h + \tau_h b\,x_i) + C + \sum_{i=0}^{k-1}(\sigma_{hr} + \tau_h b + \tau_b) = k(\sigma_h + \sigma_{hr} + b\tau_h + \tau_b) + b\tau_h L + C \tag{3.11}$$

For the equal partitioning scheme, the total execution time can be approximated as follows. If the host is the bottleneck, then we need only consider the operations in the host, which take

$$T_1 = k\left(\sigma_h + \tau_h b\frac{L}{k}\right) + k(\sigma_{hr} + \tau_h b + \tau_b)$$

On the other hand, if node computation dominates the array summation, then we need only consider the operations in node k-1, which take

$$T_2 = k\left(\sigma_h + \tau_h b\frac{L}{k}\right) + (\tau_n b + \tau_b)\frac{L}{k} + (\sigma_n + \tau_n b + \tau_h b) + \tau_b$$

It follows that the total execution time can be approximated by

$$T = \max\{T_1, T_2\} \tag{3.12}$$
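The following C sketch shows one way the ratioed partition sizes and the execution time of the centralized accumulation method could be computed from Equations (3.9) through (3.11). The function names, the choice of the no-gather/scatter parameters from Table 3.1, the element size of 4 bytes, and the treatment of non-integer partition sizes are illustrative assumptions.

    #include <stdio.h>
    #include <math.h>

    #define SIGMA_H  8.08   /* host send startup (no gather/scatter) */
    #define SIGMA_HR 4.21   /* host receive startup                  */
    #define SIGMA_N  3.52   /* node startup                          */
    #define TAU_H    0.007  /* host per-byte time                    */
    #define TAU_N    0.013  /* node per-byte time                    */
    #define TAU_B    0.15   /* time for one addition a+b             */

    /* Ratioed partition sizes x[0..k-1] and total time T for an array of
     * L elements of b bytes, summed on k nodes (centralized accumulation). */
    double exec_time(double L, int k, double b, double x[])
    {
        double denom = TAU_H*b + TAU_N*b + TAU_B;
        double a     = (TAU_N*b + TAU_B) / denom;
        double beta  = (TAU_H*b + TAU_B + SIGMA_HR - SIGMA_H) / denom;
        double Td, C, T;
        int i;

        /* Equation (3.10): size of the first subarray. */
        x[0] = ((1.0 - a)*L - k*beta) / (1.0 - pow(a, k)) + beta/(1.0 - a);
        /* Equation (3.9): remaining subarray sizes. */
        for (i = 1; i < k; i++)
            x[i] = a*x[i-1] + beta;

        /* Host idle time C and Equation (3.11). */
        Td = (k - 1)*SIGMA_H + TAU_H*b*(L - x[0]);
        C  = TAU_N*b*x[0] + TAU_B*x[0] + SIGMA_N + TAU_N*b - Td + SIGMA_HR;
        if (C < 0.0) C = 0.0;
        T  = k*(SIGMA_H + SIGMA_HR + b*TAU_H + TAU_B) + b*TAU_H*L + C;
        return T;
    }

    int main(void)
    {
        double x[64];
        printf("T(k=12) = %.1f ticks\n", exec_time(8192.0, 12, 4.0, x));
        return 0;
    }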
The derivations for the tree-structured accumulation method are similar. The timing diagram of the tree-structured accumulation method with ratioed partitioning is shown in Figure 3.7. Using the ratioed partitioning scheme, we can identify the following relationship between adjacent nodes from Figure 3.7:

$$\tau_n b\,x_i + \tau_b\,x_i = (\sigma_h + \tau_h b\,x_{i+1} + \tau_n b\,x_{i+1}) + \tau_b\,x_{i+1}$$

Defining

$$\alpha = \frac{\tau_n b + \tau_b}{\tau_h b + \tau_n b + \tau_b} \qquad \text{and} \qquad \beta = \frac{\sigma_h}{\tau_h b + \tau_n b + \tau_b}$$

we have

$$x_{i+1} = \alpha x_i - \beta \tag{3.13}$$

Figure 3.7. Timing diagram of array summation using tree-structured accumulation

From the timing relationships we also obtain

$$C = (\tau_n b + \tau_b)x_{k-1} + \log_2 k\,(\sigma_n + 2\tau_n b + \tau_b) + (\sigma_n + \tau_n b)$$
$$T = k\sigma_h + (L+1)\tau_h b + C \tag{3.14}$$

Similarly, the execution time for the equal partitioning scheme is given by

$$T = (k+1)\sigma_h + (L+1)\tau_h b + (\tau_n b + \tau_b)\frac{L}{k} + \tau_n b + \log_2 k\,(\sigma_n + 2\tau_n b + \tau_b) \tag{3.15}$$

From (3.11), (3.12), (3.14), and (3.15), the performance of the algorithms running on the NCUBE can be estimated using the system parameters given in Table 3.1. The optimal design parameters, k and $x_i$, $0 \le i \le k-1$, can also be determined by optimizing these expressions.

3.3.3. Performance Analysis

In this subsection, the performance of the array summation algorithms presented in Section 3.3.1 is studied. First, the centralized accumulation method using the ratioed and equal partitioning schemes are compared (see Figure 3.8). The size of the array is 8192 elements. The partition sizes, $x_i$, $0 \le i \le k-1$, for the ratioed partitioning scheme are obtained from Equations (3.9) and (3.10), and the analyzed execution times are obtained from (3.11) and (3.12).

Figure 3.8. Comparison of ratioed and equal partition (execution time T versus the number of processors k; measured and analyzed curves for both partitioning schemes).

The accuracy of the analytic models is evident from Figure 3.8 by observing the close match between the curves representing the analytic and the experimental results. From Figure 3.8, the optimal number of processors for each scheme can be obtained at the "valley" of the corresponding curve: k = 12 for both schemes. Note that the optimal number of processors is the dividing point between host-bound and node-bound execution. Beyond this point, the host becomes the bottleneck. It can also be seen from Figure 3.8 that the ratioed partitioning scheme improves the performance, because a better balance between communication and computation can be achieved by adjusting the data partition sizes.

A comparison of centralized and tree-structured accumulation is presented in Figure 3.9 for both the equal and the ratioed partitioning schemes. Again, the partition sizes of the ratioed partitioning schemes are derived from the analysis.

Figure 3.9. Comparison of tree-structured and centralized accumulation (execution time T versus the number of processors k for the four combinations of accumulation method and partitioning scheme).

It can be seen from Figure 3.9 that the performance of the tree-structured approach is almost identical to that of the centralized approach, for all k up to the optimal value. This is because, for array summation, the accumulation phase involves only the transmission and addition of a single data element, which incurs only a small portion of the total execution time. However, when k becomes large, the difference between these two approaches becomes apparent. This is evident from the curve corresponding to the equal partitioning, tree-structured accumulation method in Figure 3.9, which extends below the curve of the ratioed partitioning, centralized accumulation method when k >= 27. For the tree-structured accumulation method, the optimal partition sizes obtained from (3.13) will become negative when k becomes large. This implies that the given number of processors is too large for efficient execution.
This is also the reason why, in Figure 3.9, there is no data beyond k = 16 for the tree-structured accumulation method using ratioed partitioning.

Figure 3.10. Predicted performance of various methods (execution time T versus the number of processors k for L = 8192 and sigma_h = sigma_hr = sigma_n = 1.0).

Next, let us consider the influence of the system parameters on algorithm design. Suppose the message startup delay could be improved from 8.08 ticks on the host and from 4.21 ticks on the nodes to 1 tick. Figure 3.10 illustrates the resultant total execution time, predicted by means of the analytic models. Again, tree-structured and centralized accumulation perform almost the same. Compared with Figure 3.9, the improvement due to the decreased message startup delay is about 25% for the optimal execution time. In addition, the host can support more processors to improve the performance.

3.4. CASE STUDY — MATRIX MULTIPLICATION

Matrix multiplication on a hypercube multiprocessor has been studied in different contexts [ChSm87, FoOH87]. In this section, the same problem is studied again. The purpose is not to propose any novel algorithm, but to show the tradeoffs between different partitioning and mapping schemes and to illustrate ways of arriving at good design decisions. It is assumed that the multiplication of two $m \times m$ matrices, $A \times B = C$, is to be performed on a hypercube with a fixed dimension, $n = \log_2 N$. Each element in the matrices has a size of b bytes. Matrices A and B are initially stored in the host and C will be stored back to the host. It follows that the overhead involved in data downloading and uploading must also be taken into account.

Various considerations in designing the matrix multiplication algorithm will be discussed in Section 3.4.1, followed by analytic modeling of the algorithm in Section 3.4.2. Using both experimental and analytic results of the matrix multiplication algorithm, we then discuss in Section 3.4.3 different tradeoffs in arriving at efficient parallel algorithm designs.

3.4.1. The Algorithms

A straightforward approach to performing the matrix multiplication on a hypercube machine is to partition matrix B into N $m \times (m/N)$ submatrices, $B_0, B_1, \ldots, B_{N-1}$. Each node is loaded with one submatrix $B_i$ together with the whole matrix A. The resultant products $C_i = A \times B_i$ are then transferred back to the host to form the complete matrix C. The mapping is trivial because there is no interprocessor communication. However, each node will require a large amount of local memory to hold the matrices A, $B_i$, and $C_i$. With matrices of size 512x512 running on a 5-cube and 4 bytes per element, each node will require more than 1 Mbyte of local memory.

To relax the memory requirement, we can partition both matrices A and B. Consider the strip partition shown in Figure 3.11(a), in which matrix A is partitioned along the rows and B along the columns. Each processor is allocated one submatrix of A and one submatrix of B. In each iteration, the processor performs a multiplication on the two local submatrices, and then shifts the B submatrix to the next processor in a ring fashion (as indicated by the arrows in Figure 3.11(b)).
Repeating this circulation and multiplication, we can evaluate all submatrices of C. Note that, by using the gray code sequence as shown in Figure 3.11(b), all B submatrix circulations are only one hop away. Also, all submatrices of C can be sent back to the host as soon as they are generated.

Another strip partition method is to divide matrix A along the columns and B along the rows, as shown in Figure 3.11(c). Again, each pair of submatrices $A_i$ and $B_i$ is loaded into one node, $0 \le i \le 3$. In this case, we obtain at each node a portion of the whole matrix C, $C^{(i)} = A_i \times B_i$. One way to accumulate C, where $C = C^{(0)} + C^{(1)} + C^{(2)} + C^{(3)}$, from all the nodes is the tree-structured accumulation method (see Figure 3.11(d)). The root of the reduction tree is responsible for uploading the resultant matrix C back to the host. Note that all data transfers in the reduction tree are only one hop away.

Figure 3.11. Matrix multiplication with strip partition ((a) A partitioned along rows and B along columns; (b) ring circulation of the B submatrices among the nodes; (c) A partitioned along columns and B along rows; (d) tree-structured accumulation of the partial results toward the host).

In fact, the above two methods are only special cases of the block partition scheme. In general, we can partition the matrices along both columns and rows, and use two parameters, $n_1$ and $n_2$, to denote the number of partitions along the rows and columns, respectively (see Figure 3.12 for $n_1 = 3$ and $n_2 = 4$). In each iteration, every node performs a submatrix multiplication and accumulates the resultant C submatrices with all nodes in the same row using a binary-tree reduction. The B submatrices are then transmitted down the columns in a ring fashion.

Assume that $n_1$ and $n_2$ are integer powers of two. Let node_ij denote the processor located at row i and column j of the mesh. Then the matrix multiplication algorithm can be taken as a combination of the ring and tree operations presented above. The algorithm is given in Figure 3.13. There are $n_2$ rings involved in the multiplication. Each ring contains $n_1$ nodes. Tree reduction is performed across the rings between corresponding nodes. Note that in Algorithm 3.1 the roots of the reduction trees are shifted at each iteration so that no processor is overloaded by having to upload every submatrix of C.

Figure 3.12. Matrix multiplication with block partition (A is partitioned into n1 x n2 blocks, B into n2 x n1 blocks, and C into n1 x n1 blocks; B blocks circulate down the columns while C blocks are accumulated across the rows and returned to the host).

    Host:
        load A_ij and B_ji to node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1;
        receive C_ij from node_ik, 0 <= i,j <= n1-1, k = j mod n2;
        C <- [C_ij], 0 <= i,j <= n1-1.

    Node: (node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1)
        receive A_ij and B_ji from the host;
        for l <- i to n1-1, then 0 to i-1 do
            C_il^(j) <- A_ij x B_jl;
            perform a tree reduction with node_ik, 0 <= k <= n2-1,
                using node_ir, r = l mod n2, as the root;
            if (j = l mod n2) then send C_il to the host;
            if (l != i-1) then
                send B_jl to node_kj, k = (i-1) mod n1;
                receive B_j,l+1 from node_kj, k = (i+1) mod n1;

Figure 3.13. The algorithm (Algorithm 3.1) for matrix multiplication

The program for performing matrix multiplication using Algorithm 3.1 is listed in the Appendix.
3.4.2. Analytic Modeling

We now proceed to analyze the performance of Algorithm 3.1. Before this can be done, it should be noted that in Figure 3.12 the host has to gather data before it downloads submatrices and scatter data after it receives submatrices. Therefore, we should use (3.4) and (3.5) to model the host communication overhead.

Now consider the matrix multiplication algorithm. Due to the operation of ring shifting, the nodes in the same column will be synchronized by the slowest one, namely the one that is loaded last by the host. Downloading the submatrices of A and B from the host to all the corresponding nodes will take time

$$T_d = 2N\left(\sigma_h + \tau_h b \frac{m^2}{N}\right)$$

At each iteration, the following operations will be performed in the nodes:

(1) Receive a submatrix of B, which takes time $\tau_n b\,m^2/N$;
(2) Perform a submatrix multiplication, which takes time $\tau_c m^3/(n_1^2 n_2)$;
(3) Perform a tree reduction with all nodes in the same row, which takes time $\log_2 n_2\,(\sigma_n + (2\tau_n b + \tau_b)\,m^2/n_1^2)$;
(4) The root of the reduction tree sends the resultant submatrix of C to the host, which takes time $\sigma_n + \tau_n b\,m^2/n_1^2$;
(5) Send the submatrix of B to the next node, which takes time $\sigma_n + \tau_n b\,m^2/N$;
(6) Prepare to receive the next submatrix of B, which takes time $\sigma_n$.

It follows that the iteration time at each node is equal to

$$T_i = 2\left(\sigma_n + \tau_n b\frac{m^2}{N}\right) + \tau_c\frac{m^3}{n_1^2 n_2} + \log_2 n_2\left(\sigma_n + (2\tau_n b + \tau_b)\frac{m^2}{n_1^2}\right) + \left(\sigma_n + \tau_n b\frac{m^2}{n_1^2}\right)$$

Now consider the host. During each iteration the host will receive $n_1$ submatrices of C, i.e.,

$$T_u = n_1\left(\sigma_{hr} + \tau_h b\frac{m^2}{n_1^2}\right)$$

The total execution time of the algorithm depends on which event takes longer. Let

$$T_m = 2\sigma_n + \tau_n b\frac{m^2}{N}$$

We have

$$T = T_d + (T_i - T_m) + T_u + (n_1 - 1)\max\{T_i, T_u\} \tag{3.16}$$

3.4.3. Performance Study

In this subsection, the analytic results for Algorithm 3.1 are compared with the experimental results obtained from the NCUBE to gain more insight into the design considerations. Note that the execution time given in (3.16) is a function of both $n_1$ and $n_2$. However, if the problem is run on a particular hypercube, the size of the hypercube, N, is known. Then we need only consider $n_1$ ($N = n_1 n_2$).

Figure 3.14 illustrates the relationship between $n_1$ and the execution time (T) for various sizes of the hypercube. The experiment uses 64x64 matrices. Again, the results from the analytic model (Equation (3.16)) match closely with those measured from the NCUBE. Therefore, the analytic model can be used to predict the optimal partition, $n_1$ and $n_2$.

Figure 3.14. Execution time of the matrix multiplication (execution time T versus the number of column partitions n1, measured and analyzed, for 64x64 matrices on hypercubes with N = 2, 4, 8, and 16 nodes).

From Figure 3.14 it can be seen that performance degrades when $n_1$ is small. Recall that there are $n_1$ tree reductions ($\log_2 n_2$ steps each) and $n_1 - 1$ ring circulations. The smaller $n_1$ is, the fewer data transfers there are. However, each step in a tree reduction requires the transmission of a partial C submatrix, which has $(m/n_1)^2$ elements. Thus, the total size of the data to be transferred in the tree reduction phase is

$$n_1\left(\frac{m}{n_1}\right)^2 \log_2 n_2 = \frac{m^2}{n_1}\log_2\frac{N}{n_1} \tag{3.17}$$

The smaller $n_1$ is, the larger the data size given in (3.17) will be, and the longer it will take to transmit the data. This results in a longer execution time, as shown in Figure 3.14. The analytic model presented in Section 3.4.2 is a very useful tool in choosing the optimal $n_1$ and arriving at a balanced design.
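A small program like the following sketch can evaluate Equation (3.16) for every feasible $n_1$ on an N-node hypercube and pick the partition with the smallest predicted execution time; this is the kind of use the model is put to in Figure 3.14. The parameter values are the with-gather/scatter entries of Table 3.1, and the function and variable names are our own.

    #include <stdio.h>
    #include <math.h>

    #define SIGMA_H  8.20
    #define SIGMA_HR 4.55
    #define SIGMA_N  3.52
    #define TAU_H    0.068
    #define TAU_N    0.017
    #define TAU_C    0.24
    #define TAU_B    0.15

    /* Equation (3.16): predicted execution time of Algorithm 3.1 for
     * m x m matrices of b-byte elements on N = n1*n2 nodes. */
    double model_T(double m, double b, int N, int n1)
    {
        int    n2 = N / n1;
        double Td = 2.0*N*(SIGMA_H + TAU_H*b*m*m/N);
        double Ti = 2.0*(SIGMA_N + TAU_N*b*m*m/N)
                  + TAU_C*m*m*m/((double)n1*n1*n2)
                  + log2((double)n2)*(SIGMA_N + (2.0*TAU_N*b + TAU_B)*m*m/((double)n1*n1))
                  + (SIGMA_N + TAU_N*b*m*m/((double)n1*n1));
        double Tu = n1*(SIGMA_HR + TAU_H*b*m*m/((double)n1*n1));
        double Tm = 2.0*SIGMA_N + TAU_N*b*m*m/N;
        double Tmax = (Ti > Tu) ? Ti : Tu;
        return Td + (Ti - Tm) + Tu + (n1 - 1)*Tmax;
    }

    int main(void)
    {
        int N = 16, n1;
        for (n1 = 1; n1 <= N; n1 *= 2)   /* n1 and n2 are powers of two */
            printf("n1=%2d n2=%2d  T=%.0f ticks\n",
                   n1, N/n1, model_T(64.0, 4.0, N, n1));
        return 0;
    }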
The speedup of an algorithm is defined as

$$\text{speedup} = \frac{\text{execution time using one node}}{\text{execution time using } N \text{ nodes}} \tag{3.18}$$

Figure 3.15. Predicted speedup of the matrix multiplication (speedup versus the number of processors N for 64x64 matrices: with gather/scatter, without gather/scatter, and with gather/scatter but no downloading/uploading).

The optimal speedups for different sizes of the hypercube are depicted by the solid line in Figure 3.15. It can be seen in Figure 3.15 that the speedup levels off very quickly. Now suppose the gathering/scattering operations could be eliminated by, say, allowing the send primitive to access the matrix directly; then the speedup would nearly double (see the dotted line in Figure 3.15). Furthermore, if the matrices reside in the nodes before and after the multiplication [ChSm87, FoOH87], then there is no data downloading and uploading at all. The dashed curve in Figure 3.15 shows that the performance in this case is much better than the original one (almost triple). Therefore, it is desirable to keep operations at the nodes in the hypercube as much as possible, avoiding unnecessary data downloading and uploading.

Finally, assuming $n_1 \le n_2$, the memory requirement at each node for Algorithm 3.1 is

$$\frac{2bm^2}{N} + b\left(\frac{m}{n_1}\right)^2$$

Again, considering the case where the matrices have a size of 512x512 running on a 5-cube, the local memory required is about 131 Kbytes using the partition $n_1 = 4$ and $n_2 = 8$, which improves over the 1 Mbyte of memory needed in the first approach described in Section 3.4.1.

CHAPTER 4
LARGE-GRAIN PIPELINING

One of the reasons why parallel algorithm design is considered difficult is the lack of design guidelines. Without general methodologies, it is very hard for a beginner to start with parallel algorithm development. Over the years, as experience with parallel programming has accumulated, several problem-solving strategies have been identified as effective means of structuring efficient parallel algorithms. These strategies are called programming paradigms. As pointed out in [NeSn87], programming paradigms not only provide a means for programmers to begin when designing a new algorithm, but they also relate solutions of other related problems to the problem at hand. Nevertheless, paradigms for parallel programming are different from those for conventional serial computations. Parallel programming paradigms must address the issues of partitioning, cooperation, sharing, and communication.

In Section 4.1, we will first review important parallel programming paradigms. Based on the characteristics of DMMs and pipelined data parallel computations, a new programming paradigm, called large-grain pipelining [KiCN88a, KiNi88], is introduced in Section 4.2. Large-grain pipelining supports the development of pipelined data parallel algorithms on DMMs. Using processor-level macro-pipelining in the DMM, the resultant algorithm can achieve a computation rate that balances the communication rate. As an example, a pipelined matrix multiplication is presented. Again, the purpose is not to introduce any new algorithm, but to illustrate the basic idea behind large-grain pipelining. In Section 4.3, important characteristics of large-grain pipelining will be summarized and compared with other programming paradigms. Finally, in Section 4.4, experimental results of the pipelined matrix multiplication algorithm on the NCUBE are presented. Various factors that might affect the performance are discussed.
4.1. PARALLEL PROGRAMMING PARADIGMS

In general, there is no precise definition of what a paradigm should or should not be. In fact, different researchers might use different names to describe similar ideas and fall short, in one way or another, in accounting for some other aspects of the ideas [NeSn87, Fink87, Quin87]. Therefore, the discussion in this section is subjective and is by no means exhaustive or complete. Note that different paradigms may sometimes overlap with each other. Note also that a parallel algorithm might use a combination of several paradigms.

(1) Divide-and-conquer

In a broad sense, all parallel problem-solving strategies use some sort of divide-and-conquer technique. However, the divide-and-conquer paradigm discussed here is more oriented toward a recursive and usually logarithmic style of computation. The problem is divided into smaller subproblems. The subproblems are further divided into even smaller subproblems until the sizes of the subproblems are manageable, at which point the subproblems are solved independently. The solutions are then combined recursively until the final solution is obtained.

Due to its recursive nature, the divide-and-conquer paradigm can easily generate parallel algorithms with a logarithmic time complexity. It is also obvious that divide-and-conquer is most suitable for developing concurrent data parallel algorithms, in which data are partitioned recursively and processed on multiple processors. Several algorithms belong to this category, including bitonic-merge sort and the fast Fourier transform (FFT).

(2) Master-slave

In this paradigm, one or more masters generate and deal out tasks to a set of slaves. The slaves carry out the tasks (usually independently), and the results are gathered at the masters. The master-slave paradigm is suitable for developing concurrent algorithms, either data or function parallel.

A variation of the master-slave paradigm is the compute-aggregate-broadcast (CAB) paradigm [NeSn87], in which the processors perform some computations independently, the resultant data are combined into one or a few global values, and these global values are broadcast back to each processor. The processors that aggregate and broadcast the global values play the role of masters. Algorithms using the CAB paradigm are usually organized as a loop, with each sequence of compute, aggregate, and broadcast constituting one iteration. The CAB paradigm is more suitable for concurrent data parallel algorithms because of the broadcast operation.

Another variation of the master-slave paradigm is the client-server style of computation found in many distributed systems. Clients (the masters) direct their requests to a number of servers (the slaves) which perform services on behalf of the clients. Many parallel program development tools for SMMs, such as Force [Jord87, Oste87], adopt this paradigm in scheduling the computation. One advantage of the client-server paradigm is dynamic load balancing among the servers. Also, the computation is independent of the number of servers available. This paradigm is suitable for developing concurrent algorithms.

(3) Election

Election is the operation in which one processor from among a group of processors is singled out to perform some special function. Election finds its application mostly in distributed systems, including replicated data updating, crash recovery, and mutual exclusion [KiGN88].
Parallel algorithms on multiprocessors also exhibit the behavior of election, e.g., finding the pivot column in a parallelized simplex method [Fink87]. Sometimes, election is the only way to perform reliable computation in an unreliable environment.

(4) Relaxation

In relaxation, each processor works with the most recently available data [Quin87]; no processor has to wait for another processor to provide it with data. The computation proceeds in an asynchronous and iterative fashion. Usually, during each iteration, a processor needs only exchange its most recent values with neighboring processors. A typical example of relaxation is solving partial differential equations (PDEs). However, the asynchronous nature of this paradigm causes difficulties in performance prediction and requires extra effort to detect the termination conditions. In general, relaxation is very suitable for data parallel algorithms.

(5) Pipelined and systolic computation

Pipelining, in its most common sense, refers to an operation involving an ordered set of stages in which the output of one stage is the input of the next stage. Each stage behaves like a filter, and data, when passed along the pipeline, are modified along the way. Traditionally, pipelining is used to develop pipelined function parallel algorithms (see Section 1.3). Examples include multi-pass transformers such as compilers, scene analyzers, and vector processors.

Systolic algorithms are a special kind of pipelined data parallel algorithm centered around VLSI computations [FoWa87, KuLe78]. Systolic algorithms are characterized by regular and rhythmic data flows and a set of identical and simple processors. Communication with the outside world occurs only at the boundary processors. Thus, whenever a datum enters the system, it is used repetitively in the pipelines.

In general, systolic algorithms are synchronous (except for wavefront arrays [KuLJ87]). Therefore, timing is very important in systolic arrays. Furthermore, since the processing cells in a systolic array are simple and primitive, systolic algorithms use a very fine granularity (at the level of single data elements). Systolic algorithms are suitable for developing data parallel algorithms and have applications in signal processing, matrix arithmetic, image processing, pattern matching, and dynamic programming.

From the above discussion, one can see that most of the parallel programming paradigms focus on concurrent computation. In view of the high communication overhead in DMMs, pipelining seems to be a more attractive programming paradigm for these machines. However, traditional pipelining techniques concentrate on function parallelism, while systolic algorithms aim at VLSI computation, which uses a very fine granularity. Neither is oriented toward developing efficient data parallel algorithms on DMMs. In the next section, a new paradigm, called large-grain pipelining (LGP), is introduced. LGP is specifically directed at developing pipelined data parallel algorithms on DMMs.

4.2. BASIC CONCEPT OF LARGE-GRAIN PIPELINING

The basic idea of LGP is to use node-level macro-pipelining in a DMM to regulate the information flows in the system so that data can be processed and transmitted in a pipelined fashion. The nodes in the multiprocessor are organized into pipelines, and the data are partitioned into blocks to form data streams. The sizes of the data blocks determine the partition sizes of the problem, and the directions of the data flows determine the execution sequence.
Note that, although each node may be equipped with pipelined arithmetic units, the major concern of LGP is the pipelining between nodes. Through LGP, the degree of overlapping between computation and communication can be maximized, which minimizes the effect of the communication overhead. The net result is a balance between the computation rate and the communication bandwidth.

As an example of LGP, consider again the multiplication of two matrices, $A \times B = C$, where A, B, and C are $m \times m$ matrices. As in Section 3.4.1, A and B are assumed to be stored in the host initially, and C will be stored back to the host. Thus, the input data must be downloaded from the host to the nodes and the results must be uploaded back to the host. Assume that the multiplication is to be performed on a hypercube DMM with N nodes. The following partitioning scheme will be used: the hypercube is configured into an $n_1 \times n_2 = N$ mesh, and matrices A and B are partitioned along columns and rows into $n_1 \times n_2$ and $n_2 \times n_3$ submatrices, respectively. One way to perform the multiplication is to load the submatrices of A into the corresponding processors in the mesh, and then pipe the submatrices of B into the hypercube from the host. Let node_ij denote the processor located at row i and column j of the mesh. The algorithm is described in Figure 4.1. The corresponding data flows in the algorithm are shown in Figure 4.2.

    Host:
    1.1. send A_ij to node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1;
    1.2. send B_jk to node_0j, 0 <= j <= n2-1, 0 <= k <= n3-1;
    1.3. receive C_i from node_{i,n2-1}, 0 <= i <= n1-1.

    Node: (node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1)
    2.1. receive A_ij from the host;
    2.2. for k := 0 to n3-1 do
    2.3.     if (i = 0) then receive B_jk from the host
    2.4.     else receive B_jk from node_{i-1,j};
    2.5.     if (i != n1-1) then send B_jk to node_{i+1,j};
    2.6.     C_ik^(j) := A_ij x B_jk;
    2.7. perform a binary-tree reduction with node_il, 0 <= l <= n2-1, to obtain
         C_i := C_i^(0) + ... + C_i^(n2-1) in node_{i,n2-1},
         where C_i^(j) := [C_i0^(j), ..., C_{i,n3-1}^(j)];
    2.8. if (j = n2-1) then send C_i to the host.

Figure 4.1. Algorithm 4.1 for pipelined matrix multiplication

Figure 4.2. Data flows in Algorithm 4.1 ((a) submatrix multiplication: streams of B submatrices (n3 = 4 blocks per stream) flow from the host down the n2 = 4 columns of the n1 x n2 mesh of A submatrices (n1 = 3); (b) submatrix addition: the partial results C_i^(0), ..., C_i^(n2-1) in each row are combined by a binary-tree reduction into C_i and sent to the host).

It can be seen from Figure 4.2(a) that each pipeline will receive a stream of B submatrices from the host, and each stream consists of $n_3$ data blocks, i.e., submatrices of B. After all submatrices of B have passed through the pipelines, each processor will have collected a portion of a submatrix of C of size $(m/n_1) \times m$. To obtain the final result, a binary-tree reduction is performed among the processors in the same row (Figure 4.2(b)).

Thus, there are two phases in each iteration: submatrix multiplication for evaluating $A_{ij} \times B_{jk}$, and submatrix addition for accumulating $C_{ik}$, where $0 \le i \le n_1-1$ and $0 \le k \le n_3-1$. If the addition phase is performed in every iteration, as in Algorithm 4.2 shown in Figure 4.3, only the current submatrices of C, rather than entire strips of partial results, need to be stored in the processors.

    Host:
    1.1. send A_ij to node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1;
    1.2. send B_jk to node_0j, 0 <= j <= n2-1, 0 <= k <= n3-1;
    1.3. receive C_ik from node_{i,n2-1}, 0 <= i <= n1-1, 0 <= k <= n3-1.

    Node: (node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1)
    2.1. receive A_ij from the host;
    2.2. for k := 0 to n3-1 do
    2.3.     if (i = 0) then receive B_jk from the host
    2.4.     else receive B_jk from node_{i-1,j};
    2.5.     if (i != n1-1) then send B_jk to node_{i+1,j};
    2.6.     C_ik^(j) := A_ij x B_jk;
    2.7.     perform a binary-tree reduction with node_il, 0 <= l <= n2-1,
              to obtain C_ik := C_ik^(0) + ... + C_ik^(n2-1) in node_{i,n2-1};
    2.8.     if (j = n2-1) then send C_ik to the host.
Figure 4.3. Algorithm 4.2 for pipelined matrix multiplication

The corresponding data flows are described in Figure 4.4.

Figure 4.4. Data flows in Algorithm 4.2 (streams of B submatrices flow from the host down the columns of the mesh of A submatrices; in every iteration the C submatrices are reduced across each row and the rightmost column sends the results to the host).

Note that, in Algorithm 4.2, the size of the data blocks involved in the tree reduction is smaller than that in Algorithm 4.1. Therefore, the balance between granularity and communication overhead becomes even more important. If the communication cost is too high, then we can combine two or more submatrices of C into a larger data block in the reduction. In other words, the tree reduction is performed every two or more iterations instead of every iteration. By increasing the granularity, we can reduce the effect of the communication overhead of setting up messages.

Binary-tree reduction is not the only way to accumulate the submatrices of C. An algorithm using a linear reduction scheme is described in Figure 4.5. After calculating a C submatrix, each processor sends that C submatrix to its right neighbor. The accumulation progresses from left to right in a linear fashion until the complete C submatrix is obtained in the rightmost processor, from which it is uploaded back to the host. The corresponding data flows are illustrated in Figure 4.6.

    Host:
    1.1. send A_ij to node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1;
    1.2. send B_jk to node_0j, 0 <= j <= n2-1, 0 <= k <= n3-1;
    1.3. receive C_ik from node_{i,n2-1}, 0 <= i <= n1-1, 0 <= k <= n3-1.

    Node: (node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1)
    2.1. receive A_ij from the host;
    2.2. for k := 0 to n3-1 do
    2.3.     if (i = 0) then receive B_jk from the host
    2.4.     else receive B_jk from node_{i-1,j};
    2.5.     if (i != n1-1) then send B_jk to node_{i+1,j};
    2.6.     C_ik^(j) := A_ij x B_jk;
    2.7.     if (j != 0) then
    2.8.         receive C_ik^(j-1) from node_{i,j-1};
    2.9.         C_ik^(j) := C_ik^(j) + C_ik^(j-1);
    2.10.    if (j = n2-1) then send C_ik^(j) to the host
    2.11.    else send C_ik^(j) to node_{i,j+1}.

Figure 4.5. Algorithm 4.3 for pipelined matrix multiplication

Note that in Algorithm 4.3 a processor performs its submatrix multiplication before it pauses to wait for the C submatrix from its left neighbor (Statements 2.6 through 2.9 in Figure 4.5). This is because submatrix multiplication is a computation-intensive operation ($O(n^3)$). If every processor waited until it received the C submatrix from its left neighbor before performing its multiplication, then the delay within each stage would be very long. Performance analyses of these algorithms are given in Section 4.4, where considerations for designing pipelined data parallel algorithms will be discussed further.

Figure 4.6. Data flows in Algorithm 4.3 (streams of B submatrices flow down the columns; the C submatrices are accumulated linearly from left to right along each row and the rightmost column sends the results to the host).

4.3. CHARACTERISTICS OF LARGE-GRAIN PIPELINING

The most important characteristic of pipelined data parallel computation on DMMs is the concept of large-grain data flows. Through this concept, LGP supports pipelined data parallel computation in the following two ways.

First, data flows: the data are partitioned into blocks. The flows of data blocks along the processors form data streams. The size of the data blocks determines the granularity of the algorithm and is usually large on DMMs to offset the communication overhead. The data blocks are the most basic unit of processing. In Figure 4.2, each submatrix of B or C forms a data block.

Second, the flows of data connect the nodes in the system into a network. The network consists of multiple pipelines operating on multiple data streams. Each node may participate in several pipelines during the course of the computation. The basic mode of operation in a processor is to receive one data block from each of the input streams and generate one data block for each of the output streams. This multiple pipelining scheme results in both spatial (multiple units) and temporal (pipelines) overlapping. In Figure 4.2, there are two kinds of data streams flowing in the system: one along the columns to disseminate the submatrices of B, and the other along the rows to accumulate the submatrices of C.
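The following C sketch outlines the receive-compute-send skeleton that one stage of such a pipeline might follow on the NCUBE, written against the VERTEX primitives of Section 3.1.2. The message types, block sizes, neighbor ids, the nread() prototype (the text only states that its arguments are similar to those of nwrite()), and the compute_block() routine are illustrative assumptions, not part of the algorithms above.

    #define NBLOCKS   8           /* data blocks per input stream (n3)  */
    #define BLKBYTES  2048        /* bytes per data block (assumed)     */
    #define IN_TYPE   1           /* message type for incoming blocks   */
    #define OUT_TYPE  2           /* message type for result blocks     */

    /* Prototypes assumed for this sketch; VERTEX supplies the routines. */
    extern int nread(char *msg, int msglen, int *src, int *type,
                     int *status, int *flag);
    extern int nwrite(char *msg, int msglen, int dest, int type,
                      int *status, int *flag);
    extern void compute_block(char *in, char *out);  /* node-local work */

    void pipeline_stage(int pred, int succ, int last_stage)
    {
        char in_blk[BLKBYTES], out_blk[BLKBYTES];
        int  k, src, type, status, flag;

        for (k = 0; k < NBLOCKS; k++) {
            /* blocking receive of the next data block from the predecessor */
            src = pred; type = IN_TYPE;
            nread(in_blk, BLKBYTES, &src, &type, &status, &flag);

            /* pass the block downstream before computing, so the
             * successor stage can start as early as possible */
            if (!last_stage)
                nwrite(in_blk, BLKBYTES, succ, IN_TYPE, &status, &flag);

            /* local computation on the block (e.g., a submatrix product) */
            compute_block(in_blk, out_blk);

            /* emit one block on the output stream */
            nwrite(out_blk, BLKBYTES, succ, OUT_TYPE, &status, &flag);
        }
    }

Since nwrite() is non-blocking and nread() is blocking, each stage is driven purely by the arrival of data blocks, which is exactly the data-driven behavior summarized in the next section.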
From the viewpoint of data flows and the characteristics of DMMs, we can summarize the following important features of pipelined data parallel computations:

(1) Optimized performance: The performance of pipelined data parallel algorithms can be fine-tuned by controlling such design parameters as the number of pipelines, the number of stages in the pipelines, and the size of the data blocks. In the next chapter, an analytic model will be introduced to serve this optimization purpose. Note that, due to the communication overhead in DMMs, the granularity in pipelined data parallel algorithms is usually large.

(2) Regular and local communication: Pipelined data parallel algorithms usually have regular data flows. Each node in the system only has to interact with its neighbors (i.e., its predecessors and successors in the pipeline).

(3) Asynchronous and data-driven operation: Due to the autonomous nature of DMMs, no global synchronization should be used in pipelined data parallel algorithms. Operations in the system are initiated by the arrival of data blocks. In this regard, pipelined data parallel algorithms on DMMs are very similar to wavefront arrays [KuLJ87]. Note that, to control the flow between two successive nodes (stages) in the pipeline, the local memory in the nodes can serve as the buffer between the producer and the consumer.

(4) Problem size independence: In DMMs, each node, being a general-purpose processor with local memory, is able to handle data sets of different sizes. Thus, given a fixed node configuration, one can change the size of the data blocks to accommodate variable problem sizes.

(5) Reduced I/O bandwidth: The number of active channels that the host has to maintain during data downloading and uploading is reduced by using pipelining. For example, in Figure 4.4, the host only has to communicate with the nodes in the top row when downloading and with the nodes in the rightmost column when uploading.

(6) Identical node programs: The nodes in a pipelined data parallel algorithm usually perform the same function, but on different data sets. This makes the development of pipelined programs easy.

Given the basic idea of LGP, we will study in the next section the performance of the pipelined matrix multiplication algorithms when they are executed on the NCUBE.

4.4. PERFORMANCE OF PIPELINED MATRIX MULTIPLICATION

The design parameters $n_1$ and $n_2$ determine the configuration of the pipelines, where $n_2$ gives the number of pipelines in the submatrix multiplication phase and $n_1$ gives the number of stages in each pipeline. If the number of nodes (N) is fixed, then we only have to consider either $n_1$ or $n_2$, because $N = n_1 n_2$.
Figure 4.7 illustrates the relationship between the column partition ($n_1$) and the execution time (T) for the three algorithms introduced in Section 4.2. Multiplications of 64x64 matrices are studied using 2 to 32 nodes. The optimal configuration for a given N can be found from the corresponding curve by choosing its minimum point. It can be seen that Algorithms 4.2 and 4.3 have the same performance around the optimal regions, and both are superior to Algorithm 4.1. This is because the latter has a reduction tree which involves large submatrices.

Note that Algorithm 4.3 is expected to perform worse when $n_2$ is large, due to the long linear path for accumulating the C submatrices. However, from Figure 4.7, we can see that Algorithm 4.3 performs as well as Algorithm 4.2 does. A reasonable explanation is that the operations in the processors can be fully overlapped in Algorithm 4.3.

Figure 4.7. Comparison of pipelined matrix multiplication (execution time T versus the number of column partitions n1 for Algorithms 4.1, 4.2, and 4.3; 64x64 matrices, n3 = 8, and N = 2 to 32 nodes).

It can also be seen in Figure 4.7 that a close-to-square partition always results in good performance [FoOH87]. However, due to the host downloading and uploading operations, the algorithm favors partitions with few pipelines, i.e., a large $n_1$. Note that, when $n_1$ is small, the C submatrices will be large, and so will be the reduction tree. In this case, the communication overhead is too high for efficient execution.

The parameter $n_3$ determines the size of the data blocks ($B_{jk}$). The effect of the block size on the performance of the algorithm is illustrated in Figure 4.8 using Algorithm 4.2. It is evident from Figure 4.8 that granularity does have a bearing on the algorithm performance. A large granularity forces the operations in the nodes to be executed nearly sequentially, which reduces the effect of pipelining between nodes. On the other hand, a small granularity can increase the degree of pipelining, but, due to the fixed overhead in transmitting messages, the communication cost will also rise. Though it is not shown, the best partition size in this case is around $n_3 = 16$.

Figure 4.8. Effects of partition sizes (execution time T versus the partition pattern n1 x n2, from 1x16 to 16x1, for n3 = 2, 8, and 32; 64x64 matrices on 16 nodes).

Next, we compare the pipelined matrix multiplication algorithm (Algorithm 4.2) with Algorithm 3.1 introduced in Section 3.4, which was developed without using the concept of pipelining. In Algorithm 3.1, the submatrices of both A and B are initially loaded into the nodes. Due to the operation of the ring shifting in Figure 3.12, the nodes in the same column are synchronized to the slowest one. The measured speedups are plotted in Figure 4.9, where the speedup of an algorithm is defined as in (3.18).
The major reason is the communication overhead in current-generation hypercube multiprocessors, e.g., the NCUBE. The overhead in setting up a message is still too high, during which time the host cannot perform any useful computation (see the discussion in Section 3.2).

Given the basic concept of pipelined data parallel computation, the most important issue, then, is to characterize its performance. An analytic model of pipelined data parallel algorithms will be introduced in Chapter 5, including a performance analysis of the pipelined matrix multiplication from the analytic perspective.

CHAPTER 5

MODELING PIPELINED DATA PARALLEL ALGORITHMS

The primary strength of large-grain pipelining lies in its ability, through pipelining, to balance computation with communication in the resultant pipelined data parallel algorithms. Nevertheless, this strength cannot be unleashed unless there is a way of deciding the optimal design parameters of the given algorithm, such as

• The number of pipelines used
• The number of stages in each pipeline
• The number of data blocks in each data stream
• The size of data blocks

In general, analytic modeling of an algorithm serves two purposes. First, through the model, the performance of the algorithm can be predicted and studied under various environments. Second, by optimizing the estimated performance, one can determine the optimal set of design parameters. In this way, no expensive trial-and-error is required in determining the parameters, and the design can cope with changes in the problem size or even in the underlying machine architecture.

We have presented in Chapter 3 a concise model for DMMs in general and the NCUBE multiprocessor in particular. The important system parameters of the NCUBE are given in Table 3.1. Therefore, in this chapter we will concentrate on modeling the logical behavior of a pipelined data parallel algorithm.

In Section 5.1, a model of pipelined computation is presented which focuses on a single stage in the pipeline [KiCN88b]. The throughput of a stage is analyzed in Section 5.2, and the model is extended in Section 5.3 to analyze the throughput of linear arrays and 2-dimensional meshes. As an illustration, in Section 5.4, the model is used to analyze the performance of the pipelined matrix multiplication algorithm (Algorithm 4.2). The accuracy of the analytic model is studied by comparing the predicted performance with that measured on the NCUBE.

5.1. A MODEL OF PIPELINED COMPUTATION

Multiple pipelining is a general pattern of computation in pipelined data parallel algorithms. Following the flow concept, a node in a pipelined data parallel computation can be modeled as a computational unit which has multiple input and output streams (see Figure 5.1). Each data stream consists of many data blocks. From this point of view, a pipelined data parallel algorithm can thus be modeled as a network of computational units linked together by data streams, as exemplified in Figure 4.6. To analyze such a network, the relationship between the input and output streams has to be studied first.

An m-input computational unit in a pipelined data parallel computation can be viewed as a process which takes inputs from m data streams, ids_i (0 ≤ i ≤ m-1), performs a function, ods = h(ids_0, ..., ids_{m-1}), and generates another data stream, ods, as an output. Here only a single output data stream is considered; the result can easily be applied to the case of multiple output data streams by adding constant offsets. Suppose each input stream contains n data blocks.
Then, the computational unit has to repeat a receive-compute-send sequence n times to complete the task. This iterative process is described in Figure 5.2.

Figure 5.1. Model of a computational unit (a unit with m input data streams, each characterized by (μ_i, δ_i), and one output data stream characterized by (μ, δ))

    function h( )
    {
                                        /* I: begin of computation */
        start_up( );                    /* initialization time: α */
        for j := 0 to n-1 do {
            recv (ids_0, data_0, .....);
            P_0,j:    data_handling (data_0);            /* handling time: f_0 */
            .....
            recv (ids_m-1, data_m-1, .....);
            P_m-1,j:  data_handling (data_m-1);          /* handling time: f_m-1 */
            G_j:      data := computation (data_0, ..., data_m-1);   /* computation time: c */
            send (ods, data, .....);
        }
                                        /* E: end of computation */
    }

Figure 5.2. Functional description of the computational unit

Given the functional description of a computational unit, we are interested in the evaluation of its throughput. Before we can proceed, the following notions should be clarified.

Definition 5.1. A data stream, ds, is characterized by a two-tuple (μ, δ), where
    μ = the arrival time of the first data block
    δ = the interarrival time between two consecutive data blocks.  □

μ is an absolute time value measured from a fixed reference time instant, usually the instant the whole computation begins. Note that, although one can identify other parameters of data streams, only μ and δ affect the modeling discussed here. Note also that δ is, in general, a function of the size of the data blocks and of the delay induced by the nodes upstream. Again, for the purpose of modeling, δ is taken to be a constant by, say, averaging over the interarrival times between each pair of consecutive data blocks.

Definition 5.2. A computational unit, cu, is characterized by
    α, the initialization time;
    f_i, 0 ≤ i ≤ m-1, the time required to handle a data block from input stream ids_i;
    c, the time spent in computation and in invoking the send() request in each iteration.  □

Note that α is an absolute time value, while f_i and c are time intervals. Given the set of parameters of the m input streams, (μ_i, δ_i), 0 ≤ i ≤ m-1, define φ_i = f_i + f_{i+1} + ··· + f_{m-1}, the handling time that remains in an iteration once the block from stream ids_i has been received, and let

    K(x) = x if x > 0, and K(x) = 0 otherwise.

Consequently, we have the following lemma.

Lemma 5.1. The time at which the place G_j, 0 ≤ j ≤ n-1, will be reached is equal to

    T_j = Max_{0≤i≤m-1} { α+φ_0, μ_i+φ_i + j·K(δ_i − (φ_0+c)) } + j(φ_0+c)        (5.1)  □

From Lemma 5.1, the parameters (μ, δ) of the output data stream can be determined. For μ, we have the following theorem: the time that the subsequent computational unit will receive the j-th output data block is T_j + c + ψ, where ψ is the communication time for a data block to reach that unit. From (5.1) we can see that, when j is 0,

    T_0 = Max_{0≤i≤m-1} { α+φ_0, μ_i+φ_i }

Example 5.1. Assume that there are N stages in the linear array. The single input stream to stage 0 has an arrival time of μ^(0) and a block interarrival time of δ^(0). There are n data blocks in the stream. Each stage of the array is viewed as a computational unit which has parameters α, f_0, and c, for the process initialization time, block handling time, and block computation time, respectively (see Definition 5.2). Consider stage 0. From (5.1), we have

    T_0^(0)   = Max{ α+φ_0, μ^(0)+φ_0 }
    T_n-1^(0) = Max{ α+φ_0, μ^(0)+φ_0+(n−1)·K(δ^(0)−(φ_0+c)) } + (n−1)(φ_0+c)

The throughput of this stage can be derived from (5.2) and (5.3), which give the following parameters for the output stream:

    μ^(1) = T_0^(0)+c+ψ = Max{α, μ^(0)} + φ_0+c+ψ
    δ^(1) = Max{ φ_0+c, δ^(0) − K(α−μ^(0))/(n−1) }

The analysis for the other stages is similar. In general, for stage i, 0 ≤ i ≤ N−1, we have

    T_0^(i)   = Max{ α+φ_0, μ^(i)+φ_0 } = Max{α, μ^(0)} + i(φ_0+c+ψ) + φ_0
    T_n-1^(i) = Max{ α+φ_0, μ^(i)+φ_0+(n−1)·K(δ^(i)−(φ_0+c)) } + (n−1)(φ_0+c)
              = T_0^(i) + (n−1)δ^(i)

because δ^(i) ≥ (φ_0+c).
Thus, the output stream from stage i has the following parameters 118'“) = Maxla. 1101 + 58'“) = 68') = 58" It follows that the first output data block will be generated by stage N —l (and, thus, by this linear array) at MaXIa, 110} + (N -1)(¢0+C +ll!) + (¢0+C) with an output interval of 881). The last output block will be generated by stage N —l at Maxie. 110} + (N-1)(¢0+C +111) + (¢0+C) + (rt—1183" which equals the total execution of this linear array. The above expression conforms to the general execution time of a pipeline [KwBr84], (N +n -1)xclock time where N is the number of stages in the pipeline and n is the number of data elements passed through the pipeline. El Note that the model presented in this section only describes the behavior of a sin- gle computational unit. However, as can be seen in Example 5.1, the behavior of com- putational units in regular arrays are very similar. Thus, it is possible to derive a 89 closed-form to characterize the performance of such regular arrays. The next section expands the analytic model presented here to study the throughput of linear arrays and 2-dimensional meshes. 5.3. THROUGHPUT ANALYSIS OF PIPELINE ARRAYS Suppose computational units are connected in a linear array or 2-dimensional mesh. In this section, we are interested in obtaining closed-form expressions for the per- formance of the whole array. Note that the throughput at each single stage can be derived using the techniques presented in the previous section. The throughput of a linear array with only one input stream has been studied in Example 5.1. Let us con- sider linear arrays with two input streams, as shown in Figure 5.5. 018‘”. 88‘”) (118”. at") 018”“. 88“”) (111°).810’) ___., 0 1 _. ......__.N-1__. Figure 5.5. A linear array with two input streams (1) Linear arrays with two input streams Again, there are N stages in the array. Each stage has two input streams: one from the outside world and one from the previous stage (except stage 0). Assume that the arrival times of data streams have the following relation: ugl=ug‘1)+8, and that 885880), for 0+(k —j)<¢o+c>1 The first two terms are obtained directly from (5.1). The remaining terms in (5.4) can be simplified by noting that T §02T§i21+¢0+a This is true because from stage 0 in Figure 5.6, it can be seen that T80)+¢0+c is just the length of one of the paths incident to the place labeled T10). However, T10) is equal to the maximum of the lengths of all paths incident to its corresponding place. It follows that TI,” can be rewritten as follows: T1": Max1a+¢o+k (¢0+C)i 1180):” 5+¢0+kMax{580).(¢0+6)}. rt‘”+} where 1.18) is replaced by 1150M: 8. To simplify the expressions, we introduce the fol- lowing symbols: 9 Z =01+¢0 +k(¢o+c) U = pho)+¢o+k MaXISIi’). ¢o+c1 X 2 0h + C + \II Thus, we have TS,” = Max{Z, U+i8, rf'1>+X} = Max{Z+X, U+i8, U+(i-1)8+X, rfjrzl+2m = Max {Z+(i—1)X, U+(i-j)8+jX, mount} OSlSi-l However, note that I can only take two values in order to maximize T5,”: 0 or i—l. Also, from (5.1), we have 92 It” = Maxia+¢o+k<¢o+c>. ut°l+¢o+kMaxist°>r¢o+c>h ul°l+¢i+kMax151°h<¢o+c>11 Combined together, we obtain an expression for 751' ): r51) = Max{Z+iX, U+iMax{8,X],1110)+¢1+kMax{810),(00+c)}} = Max{a+¢o, 113°>+¢0+k «ohm-(owner K(8—(¢1+c +11», (5.5) ul°l+¢i+k K(5l0)-(¢0+6))} + k (¢o+C) + i<¢i+c +110 Notice the similarity between (5.5) and (5.1). In fact, if i=0, then Equations (5.1) and (5.5) are identical. From (5.5), the tow] execution time of the whole array is equal to Tfiflqluc. 
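To make the closed forms above concrete, the recurrence of Lemma 5.1 and the stage-to-stage propagation of Example 5.1 can be evaluated numerically. The sketch below is only an illustration: the function names and the sample parameter values are invented for this purpose, and the output interval of a stage is taken to be the average (T_n-1 − T_0)/(n−1), consistent with the single-input analysis in Section 5.1.

    # A small numerical rendering of the single-input stage recurrence (5.1)
    # and of the stage-to-stage propagation used in Example 5.1.  The symbols
    # (alpha, f0, c, psi, mu, delta) follow Definitions 5.1 and 5.2; the
    # concrete values below are made up for illustration only.
    def K(x):
        return x if x > 0 else 0

    def stage(alpha, f0, c, psi, mu_in, delta_in, n):
        """Return (mu_out, delta_out, t_last) for one single-input stage."""
        T0  = max(alpha + f0, mu_in + f0)
        Tn1 = max(alpha + f0,
                  mu_in + f0 + (n - 1) * K(delta_in - (f0 + c))) + (n - 1) * (f0 + c)
        mu_out    = T0 + c + psi                 # first block reaches the next stage
        delta_out = (Tn1 - T0) / (n - 1)         # average output interarrival time
        return mu_out, delta_out, Tn1 + c        # last block is produced at Tn1 + c

    def linear_array(N, n, alpha, f0, c, psi, mu0, delta0):
        mu, delta, t_last = mu0, delta0, 0.0
        for _ in range(N):
            mu, delta, t_last = stage(alpha, f0, c, psi, mu, delta, n)
        return t_last                            # completion time of the whole array

    if __name__ == "__main__":
        # With delta0 = f0 + c the array behaves like a classic pipeline:
        # roughly (N + n - 1) iterations of length f0 + c, plus the per-stage sends.
        print(linear_array(N=4, n=8, alpha=0.0, f0=1.0, c=3.0, psi=0.5, mu0=0.0, delta0=4.0))

For the sample values the printed completion time is 45.5, which agrees with the closed form Max{α, μ^(0)} + (N−1)(φ_0+c+ψ) + (φ_0+c) + (n−1)δ^(1) derived above.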
(2) 2-dimensional mesh Figure 5.7 shows the computational units that are connected into a 2-dimensional mesh. Each stage has two inputs, and input streams have the following relations: 118°” = 118“” +1821 88"” = 580°) 111“” = 11100) + i 8'1 51“” = 510°) The corresponding Timed Petri-net is shown in Figure 5.8. From Figure 5.8, we can see that Tit" ’ = Maxia+¢o+k (¢o+C). Tl! ‘l'f’+(¢o+c +111). Till-”+01% +10} (5.6) The reason that T 5"”) and TIM-1), 0S1 +¢i+kMax18f>. ¢n+c1 93 01800). 880‘”) 018°”. 88°”) 01602). 6802)) l (1110”). 8500)) ——-~ 00 01 02 ___. (11110), 5110)) ___. 1o 11 12 ___. Figure 5.7. Computational units connected in a 2-dimensional mesh Thus, we have Tiff) = Max{Z, firmer, Tif'j'l)+l’} = Max{Z+Y, fli‘l'j)+X, Tlf‘l'j'l)+X+Y, TEJ‘Z)+2Y} = Max {Z+(i—1)Y, Tf‘l'j‘ll-hXHY, Tij°>+jn OSlSj-l However, T530) = Max{Z, T5,"1v°>+x, U1+i8'1} Thus, we have 7517) = Max{Z+jY, Tf‘l'j")+X+lY, U1+i81+jY} OSISj = 0M?{Z+X+1Y, rfj-Zf-‘uzxw, U1+i81+jY, U1+(i-1)81+X+1'Y } - J = Gigi-ax1zm—1)X+jr, TIO'j")+iX +lY, U1+81+(i—1)K(81-X )+(i-1)X +1Y} - S} However, T101): Max{Z, U0+j821, TIO'j‘1)+Y} 94 node 1' -1 ' j T a c nodeij f1 OOO s» Q c O. f0 OO f1 0 c Figure 5.8. The Timed Petri-net of the 2-dimensional mesh 95 - and T100) = Max[Z, U0, U1} It follows that T)? ) can be expressed as 1511' l = Maximo. 1100)+¢1+kK(5I)O)-(¢0+C))+jK(50-(C+\V+¢1)), (5.7) ui°’+¢i+k K(510)-(¢o+6))iK(5i-(C+\ll+¢o))} + k(¢0+C) + i<¢o+c +1) + j (¢1+C +10 The total execution time of the 2-dimensional array is then given by T9111” '1). For computational units with other interconnection structures, the technique developed in this section can also be applied to analyze their performance. 5.4. ANALYSIS OF PIPELINED MATRIX MULTIPLICATION The analytic model presented in the previous section can be directly applied to Algorithm 4.3, which has a 2-dimensional mesh structure. However, due to the tree reduction performed in Algorithm 4.2, the analysis derived in Section 5.3 cannot be used directly. Therefore, in this section, the execution time (T) of Algorithm 4.2 (see Figure 4.3) is derived using the analysis techniques for a single computational unit. Assume that matrices A, B, and C are all square matrices with size M xM. The analysis can be easily extended to non-square matrices. Matrices A and B are parti- tioned into n1xn2 and nzxn3 submatrices, respectively. Assume that n1, n2, and 113 can divide M evenly, and that N =n 1 11; nodes are used to perform the matrix multiplica- tion. Thus, we have (see Figure 4.4) 96 2 size ofAU is bM for0Si(o.+2v23) and (2) when it receives the A submatrix and is ready for 8,, 2.110 , i.e., or“) = (i+1)n2(0h+912) + W12 + 0n Then, T8), 1.1.“), and 50') can be Obtained from (5.1), (5.2), and (5.3), respectively. Note that the interarrival time (88)) of adjacent input data blocks to node1,,,2_1 is equal to the iteration time of node,-_1.,,2_1 (804)). Now consider the host. The host will finish downloading submatrices of A and B and be ready for the uploading at 01 = N(01t+912) + n2n3(01t+923) + 0111 When uploading, the host can also be viewed as a computational unit, which has n1 input streams of C submatrices with parameters (“(1“). 8(0), 0Si Sn1-l. After receiving one C submatrix, the host needs an interval of f,-=013 + 0),, to copy the submatrix from the system buffer and set up the next receive routine. 
From (5.1) we can see that the host will finish the uploading (which is also the time the job finishes) at

    T = Max_{0≤i≤n1−1} { α+φ_0, μ^(i)+φ_i+(n3−1)(δ^(i)−n1(θ_13+θ_hr)) } + (n3−1)·n1(θ_13+θ_hr)

where φ_i = (n1−1)(θ_13+θ_hr) + θ_13 and c = θ_hr.

5.5. A COMPARATIVE STUDY OF THE ANALYTIC MODEL

In this section, the execution time derived in Section 5.4 is validated against experimental results obtained from the NCUBE. The close match between the analytic and experimental results indicates the accuracy of the model. Note that the analytic results are obtained by using the system parameters listed in Table 3.1.

Figure 5.9. Measured and analyzed performance for various pipeline configurations (execution time T versus the number of column partitions n1 for N = 2 to 32; 64x64 matrices, n3 = 8)

In Figure 5.9, the total execution times of the pipelined matrix multiplication algorithm (Algorithm 4.2), both measured on the NCUBE and predicted by the analytic model, are plotted. Again, multiplications of 64x64 matrices are studied using 2 to 32 nodes. The block size of the submatrices of B is fixed (by setting n3 = 8), and the pipeline configuration is changed by choosing the parameter n1. It can be seen from Figure 5.9 that our analytic model predicts the execution of the pipelined matrix multiplication correctly and that the errors between the analyzed and measured data are within 5%. The same accuracy of the analytic model can also be observed if we change the partition size, as shown in Figure 5.10.

Figure 5.10. Measured and analyzed performance for various partition sizes (execution time T versus the partition pattern n1 x n2, measured and analyzed, for n3 = 2 and 32; 64x64 matrices, 16 nodes)

The optimal partitions predicted by the analytic model and those obtained from the NCUBE are listed in Table 5.1. Again, we can observe a perfect match between these two sets of data. This indicates that the analytic model can be used to choose the best partition for a given problem instance.

Table 5.1. Optimal partitions (n1, n2, n3) for the pipelined algorithm

    N          2           4           8           16          32
    measured   (2,1,16)    (4,1,16)    (8,1,16)    (8,2,16)    (8,4,8)
    analyzed   (2,1,16)    (4,1,16)    (8,1,16)    (8,2,16)    (8,4,8)

Having established the accuracy of the analytic model, we then use the model to predict the performance of Algorithm 4.2 using 1024x1024 matrices and up to 1024 nodes. Figure 5.11 shows the predicted speedups for various environments. From Figure 5.11, it can be seen that, if the underlying architecture remains unchanged, i.e., we use the set of system parameters in Table 3.1, then the speedup of the algorithm levels off when the number of processors increases to more than 600 (see the dashed line in Figure 5.11). As discussed, the communication overhead between nodes and between the host and the nodes all contribute to this inefficiency.

Figure 5.11. Predicted speedups of pipelined matrix multiplication (predicted speedup versus the number of processors N for 1024x1024 matrices; curves: gathering/scattering with no overlap, 50% overlap, without gathering/scattering, and the ideal speedup)

Now suppose that 50% of the communication overhead can be overlapped with the computation processor. Thus,
for example, the single-byte transmission time on the host, which is t_h = 0.068 and t_h' = 0 in Table 3.1, now becomes t_h = t_h' = 0.034, where t_h' is the portion of the transmission time that can be overlapped with the computation processor. The dotted line in Figure 5.11 shows the speedup corresponding to this case. It can be seen that, by increasing the overlapping between the computation and communication processors, the speedup of the algorithm can be doubled (at N = 1024 nodes). Using the new techniques described in Section 1.2, such a degree of overlapping is not difficult to achieve in future-generation DMMs.

In Section 3.2, we discussed the gathering/scattering operation in the host and noted that this operation is very expensive. The solid line in Figure 5.11 shows the speedup that would be obtained if the gathering/scattering operation took no time. We can see that the speedup becomes almost linear in this case. The major reason is that we eliminate the sequential gathering/scattering operation, which cannot be parallelized. Therefore, it is worth the effort to reduce the overhead involved in this kind of operation, either through system design (e.g., a better compiler or host hardware) or through programming techniques (e.g., using pointers instead of indices to reference matrix elements).

CHAPTER 6

DESIGNING PIPELINED DATA PARALLEL ALGORITHMS

For many, designing algorithms is an art. However, as the technology advances, systematic approaches for developing algorithms have been discovered for certain classes of problems. This is possible because, in essence, the process of program development is nothing more than a series of transformations: from the problem specification, through algorithm design, down to program coding and testing.

In the previous chapters, we presented the basic concept of large-grain pipelining and the modeling techniques to analyze pipelined data parallel algorithms. However, the power of large-grain pipelining cannot be fully exploited if the development of pipelined data parallel algorithms is very difficult. In this chapter, a systematic approach for designing pipelined data parallel algorithms on DMMs is described.

The key to such a design procedure lies in the realization that systolic algorithms are a special class of pipelined data parallel algorithms (see Section 1.3). Therefore, many techniques for synthesizing systolic arrays can be adopted here [FoFW85, FoWa87]. In general, these techniques start with a nested-loop program or a set of linear recurrence equations, go through a series of transformations, and end up with a systolic array design. A brief review of these techniques will be presented in Section 6.1.

The major difference between pipelined data parallel algorithms running on DMMs and systolic algorithms is the granularity. The processing cells of systolic arrays operate on and exchange one data element at a time. However, this mode of operation is inefficient on DMMs due to the communication overhead in invoking messages. It follows that several data items or operations should be grouped together to run on one processor of the DMM, and data exchange between processors should use one group as a unit. The grouping problem thus becomes the unique feature and primary consideration in the design procedure. In particular, one has to address the following issues:

• Which data and operations can be grouped together?
• What is the size of a group?
• How is the communication from one group to other groups minimized?
A general procedure for designing pipelined data parallel algorithms on DMMs is outlined in Section 6.2. Then, in the remainder of this chapter, the grouping problem is defined, assumptions are stated, and techniques for grouping data and operations are elaborated. An example is given at the end of the chapter (Section 6.7) to illustrate the application of the grouping techniques.

6.1. SYNTHESIZING SYSTOLIC ARRAYS

Over the past few years, there has been a significant amount of work on synthesizing systolic arrays [FaSh87, FoFW85, MoFo86]. These techniques target problems which can be expressed as shift-invariant nested-loop programs, i.e., programs whose loop data dependencies do not change with the loop indices [KuLJ87]. Without being stated explicitly, all programs referenced in the following discussion will be assumed to be shift-invariant. In this section, the basic ideas of these synthesizing techniques, which form the foundation of our design procedure, are reviewed.

Underlying all these synthesizing techniques is a representation which expresses the data dependencies between the loops of the given program. In general, a nested-loop program with n levels is usually expressed as follows:

    for q_0 := l_0 to u_0 do
      for q_1 := l_1 to u_1 do
        ...
          for q_n-1 := l_n-1 to u_n-1 do
            (loop body)

The above program can be transformed into a directed graph, Q, in an n-dimensional space. Each vertex in Q represents one loop instance and has the coordinate (q_0, ..., q_n-1) in the space if the corresponding loop instance has the loop index (q_0, ..., q_n-1). There is an arc from one vertex v_i to another vertex v_j if the loop corresponding to v_j references a variable which is generated in the loop corresponding to v_i. Also, there is an arc from an input variable to every loop that references that variable. A variable is an input variable if it is referenced before or without being assigned a value in the program. Such a graph will be called the computational structure of the loop program [MiWi84]. The arcs are expressed as the difference vectors of their two end vertices and will be called the dependence vectors of the structure.

However, the derivation of a computational structure from a given loop program is not straightforward. Consider the program for matrix multiplication:

    for i := 0 to M-1 do
      for j := 0 to M-1 do                          (6.1)
        for k := 0 to M-1 do
          c_ij := c_ij + a_ik x b_kj ;

There is an implicit data dependency between adjacent iterations of the inner-most loop via the variable c_ij, because c_ij is updated in each of the M iterations and its value depends on its own value in the previous iteration. To enforce explicit dependencies between the iterations, we should require that the single-assignment rule be followed. Note also that a_ik will be used in each of the M iterations with the same j index. If one iteration is executed on one processor (as in systolic arrays), then this is equivalent to a broadcast of a_ik to all the processors corresponding to these M vertices. However, broadcast is undesirable and is difficult to handle in systolic arrays [FoMo84].

Thus, to eliminate broadcast and to enforce the single-assignment rule, we must rename the variables in program (6.1) as follows (with the declarations and initializations omitted for clarity):

    for i := 0 to M-1 do
      for j := 0 to M-1 do
        for k := 0 to M-1 do
          a_ijk := a_i,j-1,k ;                      (6.2)
          b_ijk := b_i-1,j,k ;
          c_ijk := c_i,j,k-1 + a_ijk x b_ijk ;

The corresponding computational structure is shown in Figure 6.1.
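As a quick check on the renaming, the single-assignment program (6.2) can be executed directly and compared with the ordinary triple loop (6.1). The sketch below is only an illustration: it uses 0-based indexing, stores every a_ijk, b_ijk, and c_ijk explicitly, and supplies the boundary initializations that are omitted above (a_i,-1,k = a_ik, b_-1,j,k = b_kj, c_i,j,-1 = 0); the comments mark the dependence vector introduced by each renamed variable.

    # Check that the renamed, single-assignment program (6.2) computes the same
    # product as the plain triple loop (6.1).  M is kept small for illustration.
    import random

    def matmul_plain(A, B, M):
        C = [[0] * M for _ in range(M)]
        for i in range(M):
            for j in range(M):
                for k in range(M):
                    C[i][j] += A[i][k] * B[k][j]
        return C

    def matmul_single_assignment(A, B, M):
        a, b, c = {}, {}, {}
        for i in range(M):
            for j in range(M):
                for k in range(M):
                    a[i, j, k] = A[i][k] if j == 0 else a[i, j - 1, k]   # d = [0 1 0]
                    b[i, j, k] = B[k][j] if i == 0 else b[i - 1, j, k]   # d = [1 0 0]
                    prev = c[i, j, k - 1] if k > 0 else 0                # d = [0 0 1]
                    c[i, j, k] = prev + a[i, j, k] * b[i, j, k]
        return [[c[i, j, M - 1] for j in range(M)] for i in range(M)]

    if __name__ == "__main__":
        M = 4
        A = [[random.randint(-3, 3) for _ in range(M)] for _ in range(M)]
        B = [[random.randint(-3, 3) for _ in range(M)] for _ in range(M)]
        assert matmul_plain(A, B, M) == matmul_single_assignment(A, B, M)
        print("programs (6.1) and (6.2) agree for M =", M)

The vectors noted in the comments are exactly the dependence set of the structure discussed next.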
In Figure 6.1, the numbers in the circles indicate the loop indices (i, j, k) of the corresponding iterations. Note that, since the program is shift-invariant, the set of dependence vectors out of each vertex is the same for all vertices. In Figure 6.1, there are three dependence vectors for each vertex:

    D = {d_0, d_1, d_2} = {[0 1 0], [1 0 0], [0 0 1]}

where D is called the dependence set of the computational structure. The process of eliminating broadcast and enforcing the single-assignment rule will eventually transform each variable in the loop into a pipelined variable, which depends explicitly on all the loop indices [MiWi84].

Given the computational structure, Q, of a program, we can map Q onto an array of processors by assigning each vertex (a loop instance) in Q to one processor. The dependence vectors in Q define the necessary interprocessor communication. However, in this implementation, each processor only executes one iteration and idles otherwise. As a result, there are no continuous data flows in the system, and the advantages of systolic arrays are not fully exploited. It follows that a second-level transformation needs to be done.

Figure 6.1. The computational structure of the matrix multiplication (the grid of loop instances (i, j, k) for M = 3, with the input variables a and b feeding the boundary vertices)

The basic idea is to project all the vertices in Q along a particular direction, and thus transform Q from an n-dimensional structure into an (n-1)-dimensional structure Q'. The projection direction represents the time axis, which indicates the progress of time. The projected structure Q' represents the final spatial layout of the systolic array. Given a projection direction z, an index transformation matrix H can be derived [FaSh87]. Each vertex v in Q is projected onto a vertex Hv^T in Q', and each arc d in Q is projected onto an arc Hd^T in Q'. Thus, if the vertices of the computational structure in Figure 6.1 are projected along [i j k] = [1 1 1], we will have the index transformation matrix

    H = [ -1   1   0 ]
        [ -1  -1   2 ]

The corresponding projected systolic array is depicted in Figure 6.2(a), which is the same as that proposed in [KuLe78]. The numbers in the circles of Figure 6.2(a) denote the indices of the loops that are projected onto them. Similarly, a projection along [0 1 0] will generate the index transformation matrix

    H = [ -1   0   0 ]
        [  0   0   1 ]

The corresponding systolic array is shown in Figure 6.2(b), which has the same structure as that in Figure 4.6.

In summary, systolic array synthesizing techniques adopt an intelligent representation to express the data dependencies within a nested-loop program. From this representation, a systematic transformation of the program into a systolic array is possible. Nevertheless, since VLSI circuits have only 2-dimensional layouts, these techniques usually concentrate on computational structures in 2- or 3-dimensional spaces, i.e., n equal to 2 or 3. Although much of the work was developed with n as a general parameter, less is known about the situations in which n is greater than 3 or there is more than one time axis.

Figure 6.2. Projected computational structures of the matrix multiplication ((a) the hexagonal array obtained by projecting along [1 1 1]; (b) the array obtained by projecting along [0 1 0])

6.2. THE DESIGN PROCEDURE

Based on the discussion presented in the previous section,
a general procedure for designing pipelined data parallel algorithms on DMMs is outlined in this section. Again, we start with shift-invariant nested-loop programs. The initial stages of this design procedure are very similar to those of systolic array synthesizing techniques. However, due to the communication overhead on DMMs, the design procedure requires additional stages to group data and operations so as to control the granularity.

(1) Derive the computational structure of the program

Transformations are necessary to restructure the program, for instance, to eliminate broadcast and to enforce the single-assignment rule. Then, each iteration in the program is represented by one vertex in the computational structure, Q. Data dependencies between loops are identified and are represented as arcs in Q.

(2) Map the computational structure onto the space-time coordinate system

One direction is designated as the time axis, along which all vertices in Q are projected. The purpose of the projection is to increase processor utilization and to introduce systolic effects. However, the projection along a particular direction is equivalent to the grouping of all vertices along that direction. Thus, this step is optional and can be combined with the next step.

(3) Group vertices in the computational structure

In this step, adjacent vertices in Q' are merged together to form larger vertices. By controlling the size of the groups, the granularity of the algorithm can be adjusted to balance the computation and communication on a DMM. Nevertheless, the grouping problem is non-trivial. For many computational structures, certain grouping schemes will introduce extra dependence vectors between the groups. This implies extra communication in the resultant algorithm. Details of the grouping will be discussed in the following sections.

(4) Assign the groups to the processors of the DMM

In this step, we allocate all the vertices (i.e., loop instances) of a group to one processor of the DMM. Since the resultant computational structure has simple interconnections, the mapping process is straightforward. Besides, as discussed in Chapter 1, using next-generation DMMs [ShFi88], the mapping problem is no longer critical.

(5) Determine the optimal design parameters

Design parameters such as the exact size of the groups and the number of pipelines are determined. The analytic model introduced in Chapter 5 can be used here. The accuracy of the model and its associated parameters have a bearing on the final performance.

Note that, due to the asynchronous nature of large-grain pipelining, the design procedure presented here does not have to consider timing. On the contrary, since systolic arrays are synchronous, timing is very important in synthesizing systolic arrays in order to decide the time at which a datum should be sent to another processor. In this regard, the concept of large-grain pipelining is similar to that of wavefront arrays [KuAG82].

In the remaining sections, the core of the above design procedure, namely grouping, will be discussed more formally and in detail.

6.3. THE GROUPING PROBLEM

Let Z, I, and I+ denote the sets of integers, non-negative integers, and positive integers, respectively. The input to the grouping problem is a computational structure obtained from a shift-invariant nested-loop program, with or without projection. Therefore, each vertex in the structure represents one loop instance (in the unprojected case) or a set of instances (in the projected case).
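Before the formal definitions that follow, the grouping step can be pictured operationally: carve the (projected) vertex set into blocks and see which dependence vectors survive between the blocks. The sketch below does this for a 2-dimensional structure; the axis-aligned r0-by-r1 blocks and the exhaustive check are illustrative simplifications rather than the formal treatment of the next sections.

    # Group the vertices of a 2-D computational structure into r0-by-r1 blocks
    # and report the dependence vectors induced between the blocks.  If no new
    # vectors appear, the grouping does not increase the number of neighbors a
    # processor must talk to; opposite induced vectors (d and -d) would signal
    # a cycle between neighboring groups.
    def contracted_dependences(V, D, r0, r1):
        group = lambda v: (v[0] // r0, v[1] // r1)
        induced = set()
        for (x, y) in V:
            for (dx, dy) in D:
                w = (x + dx, y + dy)
                if w in V and group(w) != group((x, y)):
                    g0, g1 = group((x, y)), group(w)
                    induced.add((g1[0] - g0[0], g1[1] - g0[1]))
        return induced

    if __name__ == "__main__":
        M = 8
        V = {(i, j) for i in range(M) for j in range(M)}
        D = {(1, 0), (0, 1)}
        print(sorted(contracted_dependences(V, D, 2, 2)))    # [(0, 1), (1, 0)]
        D2 = {(1, 0), (0, 1), (1, 1)}
        print(sorted(contracted_dependences(V, D2, 2, 2)))   # [(0, 1), (1, 0), (1, 1)]

Both example groupings leave the dependence set unchanged; the sections below characterize exactly when this is guaranteed.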
The computational structure, Q, is a two-tuple. Q =(V,D). where V={(q0. ..., q,,_1) I l,-Sq,-Su,-, 0S1'Sn—1} is the set of vertices in Q. l.- and u,-. 0SiSn-1, 112 are the lower and upper bounds of the i-th coordinate. and D={d0. dm_1} is the set of dependence vectors in Q. El Suppose that V is a subset of Z ", the set of integer n-tuples. Thus. all coordinates referred to will be integers. Also, without loss of generality. the upper and lower bounds in V, i.e., u,- and l,-. 0Si Sn-l, are assumed to be constant, and l,-S0 and u,->0 so that the vertex with coordinate (0. ..., 0) is defined in Q. In this study. only acyclic computational structures are considered. In other words. if there exist do. ..., d1,_1eD and a0. ..., ak-1 e I, kSm, such that dodo +a1d1 + ‘ ‘ ' +ak_1dk_1 =0 then a0: ...=a,,-1=0. Cycles will result in a lock-step synchronization between adjacent nodes for each iteration. which is undesirable for DMMs. More details of acyclic com- putational structures will be given later. Let Q (V,D) be a computational structure. (1) A vertex v15 V is dependent on a vertex V16 V along a vector (1 if V] -v1=d (2) A vertex v15 V is reachable from a vertex v16 V via the set of vectors D’=[do. ..., d1,_1} if v1 = v1+ aodo + ...+ ak_1dk.1, where aieZ. 0SiSk—l. (3) The set of vertices. U (V,D’), which is spanned from a vertex vs V by a set of vectors, D’={d0. ..., dk_1}, is the set of vertices in Q that are reachable from v via D’, i.e., U(v.D’) = [w l we V, w=v+aod0+ ...+ ak_1dk_1, a,-eZ. 0SiSk-l} [:1 Note that any vertex in U (V,D’) is reachable from any other vertex in U (V,D’). Therefore. it is sometimes convenient to omit where the spanning starts and to speak only of the set of vertices that is spanned by D’, which is denoted U (D’). 113 The contracted structure, Q’, of a computation structure Q (V,D) with respect to the grouping Gd.,(Q) is a directed graph, where ' (1) Each vertex in Q’ corresponds to one group in G4,,(Q); (2) If P,- and PJ- are the groups in Gd,,(Q) which correspond to the vertices v1’ and v1’ in Q’, respectively, then there is an are from v1’ to v1 ’, if P,- depends on P,. Q’ is dependence preserving if Q’ is an acyclic computational structure. (V’,D’), with ID’ISID |. E] The graph shown in Figure 6.3(b) is a contracted structure of that shown in Figure 6.3(a). The idea of grouping is very similar to that of emulation in [FiFi82]. Note that, for a given grouping. the corresponding contracted structure may not be unique. depending 115 on how the base vertices are chosen. The significance of dependence preservance is that there is no extra dependence relationship introduced in the grouping. In other words. the number of processors with which a processor needs to communicate does not increase when loops are grouped. This is particularly important in DMMs with non-negligible communication overhead. The contracted structure shown in Figure 6.3(b) is depen- dence preserving. In fact. it has the same structure as the original computational struc- ture. As mentioned earlier, we only consider acyclic computational structures in this study. The implication of acyclic computational structures needs further explanation. In general, any computational structure derived directly from a nested-loop program is acyclic. Otherwise. there will be a deadlock in the execution of the program —— a loop needs a value from another loop which is waiting for the value generated in the first loop. Cycles will be introduced only after a computational structure is projected or grouped. 
For example. consider the following nested-loop program: f0ri:=0 to M—l do forj := 0 to M-l do bij I= bi.j-1 + bi-1.j+1 3 The corresponding computational structure for M =4 is shown in Figure 6.4(a). The pro- jection of the computational structure along the direction [1 0] will results in a con- tracted structure as shown in Figure 6.4(b). Cycles are introduced. Note that each ver- tex. k, 0Sk S3, in Figure 6.4(b) represents a program segment as follows: fori:=0t03do but I= biJc-l + bt-1,1t+1 3 Suppose each vertex in Figure 6.4(b) is executed on one processor. Then, each processor must exchange data with its two neighbors at every iteration. In other words. processor k cannot execute two consecutive loops and produce, say. both by‘ and b1), together without communicating with others. It follows that, even though several ver- tices are grouped together, the granularity of the algorithm does not increase at all! 116 along j W (C) I project Figure 6.4. A computational structure with possible cycles On the other hand, the grouping of the computational structure in Figure 6.4(a) along [0 1] will result in an acyclic contracted structure as shown in Figure 6.4(c). Each vertex, k, 0Sk S3, in Figure 6.4(c) represents a program segment as follows: for] := 0 to 3 do bkj == bk.j-1 + bk—1.j+1 ; Again. if each vertex is assigned to one processor, then processors do not have to exchange data at every iteration. The execution in processor k depends only on the out- puts from processor k—l. Thus. for example, processor k can execute two loops, j=0. 1. before it sends bko and b“ together to processor k+1. In this way. the granularity can be adjusted by changing the grouping size. Thus. cyclic computational structures may exist in systolic arrays (see Figure 6.2(a)), but they are not desirable in pipelined data parallel algorithms when executed on DMMs. 117 From what we have discussed so far. the grouping problem considered in this paper can thus be stated as follows: Given a computational structure Q. the grouping problem is to determine those groupings of Q which will result in dependence preserving contracted structures. [II In the following discussions, we will concentrate on computational structures in 2-dimensional spaces. For higher dimensional spaces. the results obtained here may be extended. We will consider computational structures with one, two. three, and more than three dependence vectors. respectively. For each case. we study the conditions under which the computational structure can be partitioned into independent sets of ver- tices. We also present the necessary conditions that a grouping will result in depen- dence preserving contracted structures. 6.4. GROUPING WITH ONE OR TWO DEPENDENCE VECTORS Suppose the computational structure Q (V,D) has only one dependence vector. i.e., D={do). Figure 6.5(a) depicts such a computational structure with do=[0 1]. Since there is only one dependence vector in a 2-dimensional space, vertices in Q are divided into a number of independent sets. Each independent set is spanned by the vector do, and vertices in different independent sets cannot reach each other. i.e., there is no data dependency between these set. Thus. these sets can be executed independently of each other. The grouping Gd”, rel+ of a computational structure Q (V,D) with D=[d0} is dependence preserving. 118 eoee oooe comes ’0 o e it: Figure 6.5. A computational structure with one dependence vector I'""""'-‘I 99 I'”""""""I Consider any group P in Gd“... 
Since there is only one dependence vector in the system. there exists a linear ordering of all vertices in P, say. (v0. ..., v,_1). where v1+1-v1=do. 0S j In a computational structure Q (V,D) with D={do, d1 }. the number of indepen- dent sets spanned by do and d1 is equal to IW I. Let v16 V be a vertex in Q such that the vertices v1=v1+d0+d1, vk=v1+do, and v1=v1+d1, are all in Q. Define the set W, = [ w I weZz, w=v1+xdo+yd1. 0Sx,y<1} Then, obviously, IW I: l W, l. Note that vertices in W, cannot reach each other via do and d1. To prove the theorem. we have to show first that Wgi. This implies that there are at least IW l independent sets. Next. we have to prove that all vertices in V are 120 reachable from a vertex in Wx. In other words, the number of independent sets in Q is at most I W I. It follows that vertices in Q are divided into exactly I W l independent sets. Now, recall the definition of V (Definition 6.1). On a 2-dimensional space, we have V = {(40.41) I 1054090. 115415141} Since v1, V1, V11, V16 V, all points with integer coordinates and inside the parallelogram enclosed by v1, V1, V1,, and v1 must also be in V. It follows that Wgi. On the Other hand, since Q is acyclic, do and d1 are independent. Any vertex vs V in Q can be expressed as v = v1+(ao+x)do + (a1+y)d1 = v1+ aodo + a 1d1 + w’ where do, 0162 and w’=xdo+yd1eZZ, OSx,y <1. By choosing appropriate do and an, we can make w’e W. This implies that v can be reached from a vertex, v1+w’ in W,. Thus, the theorem is proved. El From Figure 6.6, it can be seen that W=[(0,0)_, (0,1)}. Therefore, there are two independent sets in the computational structure. The significance of independent sets is not only that each set can be executed independently, but also that the grouping can be formed across the independent sets in any arbitrary manner. Let us now consider the grouping along either do or d1. The grouping Go,,(Q) of a computational structure Q (V,D) with D={do, d1} is dependence preserving if de D, r21, and the base vertices of the groups are chosen along do and d1. Since the grouping in one independent set will not affect that in other sets, we can consider each independent set separately. Thus, without loss of generality, we will 121 assume IW I=l, d=do, and d’=d1. Let P,- and Pi be two groups in Go”. Order vertices in P1- and Pj according to their dependencies in do as follows: For Pi: (no, ..., Ill-.1), WhCI‘C uk, 111-.16P1' and uk+1-llk=do, 03k The grouping of a computational structure Q (V,D), where D={do, d1}, along do with a size ro21 and along d1 with a size r121 is dependence preserving if the base vertices are chosen along do and d1. D Figure 6.7 illustrates a grouping along both directions, where groups are enclosed in the dashed lines. It is easy to check that the contracted structure Q’ is dependence preserving. In fact, Q’ has the same structure as the original computation structure. Again, the grouping along d2, where d2¢aodo and d2¢a 1d1, do, 0162, is equivalent to that on computational structures with three dependence vectors, D={do, d1, d2 }. The latter will be discussed in the next section. 122 Figure 6.7. Grouping along two directions 6.5. GROUPING WITH THREE DEPENDENCE VECTORS In this section, the grouping on computational structures with three dependence vectors is discussed. Let Q(V,D) be a computational structure with D={do, d1, d2}. Assume further that 02d2 =aodo +a1d1 (6.3) where do, a 1, and a; are the smallest positive integers to satisfy (6.3). 
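For small integer dependence vectors, the coefficients in (6.3) can be found mechanically by a bounded search over positive integers. The helper below is such a brute-force sketch (the search bound is an arbitrary choice made for illustration); it reproduces the relation used in the example that follows.

    # Find positive integers (a0, a1, a2), with a2 as small as possible, such
    # that a2*d2 = a0*d0 + a1*d1.  Returns None if no solution exists within
    # the bound.
    def smallest_coefficients(d0, d1, d2, bound=16):
        for a2 in range(1, bound):
            for a0 in range(1, bound):
                for a1 in range(1, bound):
                    if all(a2 * d2[t] == a0 * d0[t] + a1 * d1[t] for t in range(len(d2))):
                        return a0, a1, a2
        return None

    if __name__ == "__main__":
        # The structure of Figure 6.8: d0 = [1 0], d1 = [1 2], d2 = [1 1].
        print(smallest_coefficients((1, 0), (1, 2), (1, 1)))   # (1, 1, 2)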
A typical exam- ple is shown in Figure 6.8(a), where do=[1 O], d1=[1 2], d2=[l 1], and 2[1 1] = [1 O] + [1 2]. Again, we start with the study of independent sets in Q. It turns out that do plays a very important role in determining the independent sets. Let W be the base set with respect to do and d1 in Q. Then, I W l2a2. Since IW l21, the case for a2=1 is trivial. Now, suppose a221. Due to the fact that do, 01, and a2 are the smallest positive integers to satisfy (6.3), there does not exist any integer solution, x,-, y,-, 1.<.i 502-1, for each of the following az-l equations: 123 Figure 6.8. A computational structure with three dependence vectors idz =xido + Yidl In other words, for any vertex vs V the vertices v1=d2+v, ..., v33..1=(a 2—1)d2+v, are not reachable from v via do and d1. Furthermore, none of these vertices can reach each other via do and d1. It follows that each of v, v1 , ..., v1.2-1 lies in a different independent set spanned by do and d1. Thus, IW IZaz. Cl Let W’ denote the base set with respect to do, d1, and do. Then, W’ can be defined as W’= {w I weZz, w=xdo+yd1, 09c,y Given a computational structure Q(V,D) with D={do, d1, d2}, the number of independent sets spanned by do, d1, and d; is equal to IW’I and I W I: I W’ l xaz. 124 The proof of the first part is very similar to the proof of Theorem 6.2. Let v be a vertex in V such that all vertices in Wx’= {w 1 W622, w=v+xdo+yd1, 05x,y<1, and W¢v+zd2, 262} are defined in V. Then, lWx’ l =| W’ I. Note that Wx'=Wx-U (v. {(12 }) where W, is defined as in the proof of Theorem 6.2. Now, Wx contains vertices in V that cannot reach each other via do and d1. U (v,{d2}) is the set of vertices which can be reached from v via d2. It follows that the vertices in W,’ cannot reach each other via do, d1, as well as d2. In addition, any vertex ue V is reachable from a vertex in W,’, because ll = a ado-+0 1d1-1'0 2d2+W+V where aie Z, 05132, we W’. Thus, IW’I gives the total number of such independent sets. Now consider each independent set spanned by do, d1, and d2. Following the proof in Lemma 6.1, it can easily be seen that, within each set, there are a2 independent sets spanned only by do and d1. It follows that IW I: I W’ l xaz. D In Figure 6.8, since a2=2, there are two independent sets spanned by [l 0] and [1 2]. Furthermore, we have W={(0,0), (0,1)} and IW I=2. This implies that IW’|=1 and that all vertices can be reached from (0,0) via those three dependence vectors. As mentioned in Section 6.4, independent sets spanned by do and d1 can be con- sidered as forming independent "planes". Then, the third vector, d2, can be viewed as forming a thread from one plane to the other. From Theorem 6.4, we can see intuitively that the grouping along (1; with a size r2=c12 will be dependence preserving, because we are linking a2 independent sets together. However, any grouping along (12 with a 125 size larger than a2 will include vertices which belong to the same independent set into one group. As a result, cycles are introduced into the contracted structure, and the resul- tant structure is not dependence preserving. Consider the example shown in Figure 6.9(a) where where do=[1 0], d1=[0 l], d2=[l l], and [1 l] = [1 0]+[0 1]. (a) (b) Figure 6.9. A grouping which generates a non-dependence-preserving structure In Figure 6.9(a), the grouping along d2=[l l] is depicted in dashed boxes. The labels besides the boxes denote the group ids. Figure 6.9(b) shows the corresponding contracted structure with the group ids indicated in the circles. 
It can be seen that an extra dependence vector, [0 -1], is introduced. Thus, the contracted structure is not dependence preserving. To prove the above points in a more formal way, we need the following Lemma: Let Q(V,D) be a computational structure with D={do, d1, d2}, where azdz =aodo + a1d1. Let P,- and Pj- be two groups in the grouping Go”, rel+, such 126 that P} is dependent on P1 along de {do, d1}. If G111, is dependence preserving, then the base vertex of P 1- must depend on the base vertex of P,- along (I. Let no and vo be the base vertex of P,- and P j, respectively. If vo depends on no along (I, then there is a one-to—one dependence between the vertices in P,- and P ,- along d. As a result, no extra dependence vector will be generated along (I in the contracted structure. On the other hand, assume that PJ- depends on P,- along d, but vo—uo¢d. Sup- pose further that P j also depends on P1 along d2. The relationship between P1, P j, and P1 is illustrated in Figure 6.10. Let the vertices in P1, Pj, and P, be ordered according to the dependencies in d; as follows: For P1: (uo, ..., ur_1), where uk, u,_1eP,- and uk+1-uk=d2, 03k Let Q (V,D) be a computational structure with D ={do, d1, d2 }, where azdz = aodo + a1d1. Then, the grouping Gdz,,(Q) with a size r>a2 is not dependence preserving. Suppose the group is dependence preserving, and P is a group in the grouping. If we move from the base vertex of P along do for do steps, then along d1 for a1 steps, then in each step we should visit a base vertex of a group (Lemma 6.2). However, since r>a2 and the relation in (6.3) holds, in the last step, we will visit a vertex which belongs to P and is not a base vertex. Thus, from Lemma 6.2, we arrive at a contradic- tion and prove that the grouping is not dependence preserving. 128 El Following the same argument, we can arrive at the following result: Let Q (V,D) be a computational structure with D ={do, d1, d2 }, where (12d; = aodo + a1d1. Then, the grouping Go”, with r The proof is similar to that of Theorem 6.5. Let P be a group in Go”. Then, as we move from the base vertex vo of P along do for do steps and then along d1 for a1 steps, we should visit base vertices along the way until the last step where we arrive at a ver- tex vk. From (6.3), the following relation holds: vk-vo=aodo+a1d1=a2d2. However, Lemma 6.2 requires that vk be a base vertex in order for Go”, to be dependence preserving. This condition will be satisfied only when the group size r is a factor of a2. C! Now, let us consider the grouping along do and d1 and see how (12 may affect this grouping. We study the dependence relationships within a contracted structure first. Let Q (V,D) be a computational structure with D =(do, d1, d2 }, where 02d; =aodo + a1d1. Then, there exists a dependence preserving grouping along d. with size (11, where is {0,1,2}. Furthermore, the dependence vectors in the resultant contracted structure Q’(V’,D’) have a relationship of az’dz’ = ao’do’ + a 1 ’d1’ (6.4) where D’={do’, d1’, dz’}, and aj’=aj, ifd1¢d1; aj’=1, otherwise, 05]“ $2. Let d, and (1,, x,ye {0,1,2} denote the two vectors in D other than d1, and W’ be the base set with respect to do, d1, and d2. The grouping which is dependence preserv- ing along d1 can be formed as follows: 129 For each vertex in W’ perform the following operations: (1) Form a group along d1 with size 01-; (2) From the base vertex of the current group, choose the base vertex of the next group along a1d1, dx, or (1,. 
until all vertices in the current independent set are grouped. First, it is necessary to show that the above procedure does produce a grouping which is dependence preserving. Consider any independent set spanned by do, d1, and d2, which contains we W’. Let U (w, A) be the set of vertices which are spanned from w via A={a,~d1, dx, dy }. Note that vertices in U (W,A) can reach each other via A. From step (2) above, it can be seen that all base vertices in this independent set are contained in U (w, A). In addition, U (w, A) contains only base vertices. Suppose that this is not the case, and that uke U (w,A) is not a base vertex. Assume further that uke P,- and no is the base vertex of P1. Then, uk = w + za1d1+xdx +ydy = no + kd1 -> no = w + (za,--k)d1+xdx +ydy where x, y, 26 Z. However, k Let Q (V,D) be a computational structure with D ={do, d1, d2 }, where d; = do + d1. Any grouping along de {do, d1} with a size rel + is dependence preserv- ing. Without loss of generality, assume d=do. Consider any group P in the grouping. Order all vertices in P according to the dependencies in do as follows: (vo, v,_1), where v11, v,_1eP and vk+1-vk=do, 03k Let Q (V,D) be a computational structure with D ={do, d1, d2 }, where 02d; = aodo + a1d1. Then, the grouping along do with a size of iao, along d1 with a size of ja 1 , and along do with a size of a2, where i, j e I +, is dependence preserving. From Lemma 6.3, we can see that, if Q is grouped along do with a size ao, d1 with a size 01, and d; with a size a2, then the resultant contracted structure Q’ is depen- dence preserving and has a set of dependence vectors with the following relationship d2, = do, + d1, (6.5) Therefore, the grouping of vertices in Q’ along do’ with a size 1' and along d1’ with a size j, i, je I + is dependence preserving (Lemma 6.4). However, the grouping in Q’ along do’ with a size 1' is equivalent to the grouping along do with a size iao in Q. 131 Similarly, the grouping in Q’ along d1’ with a size j is equivalent to the grouping along d1 with a size ja1 in Q. Thus, the theorem follows. [:1 The strategy used in Theorem 6.6 is to transform the computational structure into the one with a hexagonal structure (described by (6.5)). This kind of structure is referred to as the universal planner array, which is the most general systolic array in 2-dimensional spaces [MiWi84]. The same structure will be used in the next section to construct grouping schemes for computational structures with three or more depen- dence vectors. 6.6. GROUPING WITH FOUR OR MORE DEPENDENCE VECTORS When computational structures have four or more dependence vectors, the rela- tionships between the dependence vectors become very complicated, which make it difficult to analyze individual vectors separately. Furthermore, the chance that there are independent sets spanned by all vectors is rare. As a result, we concentrate in this sec- tion only on grouping schemes but not on independent sets. 
Given an acyclic computational structure Q (V,D) on a 2-dimensional space with D: {do, ..., dm_1], where m>2, we can always find (from linear algebra theory) two vectors, say, do and d1, such that Cid} = aido + bid] (6.6) where a1, b1, (2161+, 251' 132 Let Q(V,D) be a computational structure with D={do, ..., dm_1}, where m>2, _ c1d1 = aido + b1d1, and 2Si For any vertex ue V, if u is dependent on v, then there exists a dependence vector d1, 09' Let Q(V,D) be a computational structure with D={do, ..., dm_1}, where m>2, c1d1 = agdo + b1d1, and 251' .amax and along d1 with a size r12bmx is dependence preserving. Consider any group P with base vertex vo. Let P;, Pj, and Pk be the three neigh- boring groups of P as shown in Figure 6.11. It suffices to show that, for any vertex v in P, X (v) is covered by P, P1, P1, and Po. The immediate consequence is that there will only be three dependence vectors in the resultant contracted structure, and these three dependence vectors satisfy the relation in (6.6). In other words, the contracted structure has a structure of the universal planner array. Any vertex w falls within P, P1, P j, and Pk, can be expressed as follows: w = vo +xamaxdo +ybmaxd1 0Sx,y <2 (6.7) Any vertex v in P can be expressed as 133 Now, from the definition of X (v), any vertex v’eX (v) can be expressed as v’ = v + xvtamudo +yv'bmaxd1 OSxo',y,/.<_l = V0 + (xv+xv’)d0 + (yv+yv’)dl where 05(xv+xv'), (yv+yv:)<2. Thus, v’ satisfies (6.7) and is in the region covered by P, P1, P j, and Pk. It follows that, for any v in P, X (v) falls within the above four groups (see the dashed box in Figure 6.11). Therefore, the grouping is dependence preserving. D do l vo l.— amax ——'I Figure 6.11. Relationship between groups in a universal planner array Theorem 6.7 presents a powerful grouping method for computational structures. Specifically, as long as the group size is large enough, then there always exist depen- dence preserving groupings along do and d1. In fact, do and d1 need not be dependence vectors of the computational structure, as long as the relations in (6.6) are satisfied. 134 By applying Theorem 6.7, the resultant contracted structure has only three depen- dence vectors. It follows that, in the final implementation, each processor only has to communicate with at most three other processors. The communication bandwidth is reduced, and the partition task is simplified, because any partition with a size greater than anm and hm, will be dependence preserving. The remaining task is to match the number of partitions (i.e., the number of groups) with the number of processors avail- able. Note that Theorem 6.7 can be extended to computational structures on higher dimensional spaces. In a 3-dimensional space, for example, a cube is adjacent to 7 other cubes in the first octant. Thus, any computational structure in 3-dimensional space can be grouped along three directions. As long as the group size is large enough, the group- ing is dependence preserving and each resultant group is adjacent to at most 7 other groups. 6.7. AN EXAMPLE OF GROUPING An example is given in this section to illustrate the application of grouping and to summarize what has been discussed so far. 
Consider the development of a pipelined data parallel algorithm from the following "artificial" nested-loop program:

    for i := 0 to M-1 do
      for j := 0 to M-1 do
        for k := 0 to M-1 do                        (6.8)
          c_ij := c_ij + b_ik x e_j,k-2 / g_i-1,k ;

(1) Derive the computational structure of the program

It can be seen from the above program that the b_ik's and g_i-1,k's, 0 ≤ i,k ≤ M-1, are broadcast along the loops indexed by j, and the e_j,k-2's, 0 ≤ j,k ≤ M-1, are broadcast along i. By eliminating broadcast and enforcing the single-assignment rule, the above program can be transformed into the following program:

    for i := 0 to M-1 do
      for j := 0 to M-1 do
        for k := 0 to M-1 do
          b_ijk := b_i,j-1,k ;                      (6.9)
          e_ijk := e_i-1,j,k-2 ;
          g_ijk := g_i-1,j-1,k ;
          c_ijk := c_i,j,k-1 + b_ijk x e_ijk / g_ijk ;

The necessary initializations of b_ijk, e_ijk, and g_ijk are omitted for clarity. From the transformed program, the dependence vectors can be identified as follows: D = {[0 1 0], [1 0 2], [1 1 0], [0 0 1]}. The variables which introduce these dependence vectors are listed below:

    b_ijk → [0 1 0]
    e_ijk → [1 0 2]
    g_ijk → [1 1 0]
    c_ijk → [0 0 1]

Figure 6.12 depicts the computational structure of the program in (6.9). Due to the complexity of the figure, we only show the projected vertices on each of the three coordinate planes. Thus, for example, the plane labeled (1) in Figure 6.12 contains the 16 vertices obtained when the computational structure is projected onto the ij-plane. Each vertex in Figure 6.12 represents one loop instance of the original program.

Figure 6.12. The computational structure of the example program

(2) Map the computational structure onto the space-time coordinate system

To take advantage of pipelining and to further increase the granularity, a projection is sought to transform the 3-dimensional computational structure into a 2-dimensional one. For example, if we project the vertices along the j-axis, we obtain the projected computational structure shown in Figure 6.13. Now, each vertex in the projected structure represents M loops. In fact, the vertex at position (i', k') represents the following program segment:

    for j := 0 to M-1 do
      c_i'j := c_i'j + b_i'k' x e_j,k'-2 / g_i'-1,k' ;        (6.10)

Figure 6.13. The projected computational structure of the example

Due to this projection, the variables b_i'jk' (for fixed i' and k'), 0 ≤ j ≤ M-1, are projected onto the same vertex in the projected computational structure. This means that these variables are kept stationary in the corresponding processor, and that they have to be downloaded to the corresponding processors before the execution can start.

(3) Group vertices in the computational structure

Note that the resultant dependence vectors in the projected structure are d_0 = [1 0], d_1 = [0 1], and d_2 = [1 2], where

    a_2 d_2 = [1 2] = [1 0] + 2 x [0 1] = a_0 d_0 + a_1 d_1        (6.11)

Thus, we have a_0 = a_2 = 1 and a_1 = 2. Since a_2 = 1, from Theorem 6.5, we can conclude that any grouping along d_2 = [1 2] is not dependence preserving. To derive groupings which are dependence preserving, either Theorem 6.6 or Theorem 6.7 can be used. For example, if Theorem 6.7 is used, then note that

    a_max = a_0 / a_2 = 1   and   b_max = a_1 / a_2 = 2

Thus, any grouping along [1 0] with a size r_0 ≥ 1 and along [0 1] with a size r_1 ≥ 2 is dependence preserving. The dashed boxes in Figure 6.13 enclose the groups of vertices corresponding to the grouping with r_0 = r_1 = 2. On the other hand, if Theorem 6.6 is used, then we can first group the vertices in Figure 6.13 along [0 1] with a size of 2. From Lemma 6.3, we can see that the resultant contracted structure will have a set of dependence vectors which satisfies

    d_2' = d_0' + d_1'

Figure 6.14(a) depicts the grouping.
On the other hand, if Theorem 6.6 is used, then we can first group vertices in Figure 6.13 along [0 1] with a size 2. From Lemma 6.3, we can see that the resultant contracted structure will have a set of dependence vectors which satisfies

    d2' = d0' + d1'

Figure 6.14(a) depicts the grouping. The corresponding contracted structure is shown in Figure 6.14(b). The contracted structure has d0' = [1 0], d1' = [0 1], and d2' = [1 1]. Now, from Lemma 6.4, any grouping along d0' and d1' is dependence preserving. Thus, for example, the grouping along [1 0] with a size 2 will generate a dependence preserving grouping which is the same as that shown in Figure 6.13.

Figure 6.14. The grouping along [0 1] with a size of 2

Finally, as noted in Section 4.3, we can exploit pipelining in pipelined data parallel algorithms not only from multiple pipelines but also from partitioning data to form data streams. Thus, we can further partition the M loops indexed by j in program (6.10) into a number of blocks, each having r2 ≤ M loops. Assume that r2 divides M evenly. Then, the program executed in the vertex (i',k') of the resultant contracted structure becomes the following:

    for l := 0 to M-1 by r2 do
      /* communicate with neighboring processors */
      for j := l to l+r2-1 do
        for i := i'×r0 to (i'+1)×r0-1 do                        (6.12)
          for k := k'×r1 to (k'+1)×r1-1 do
            c_ij := c_ij + b_ik × e_{j,k-2} / g_{i-1,k};

where r0 is the grouping size along [1 0] and r1 is the grouping size along [0 1]. The implicit assumption here is that the loops can be interchanged [PaWo86]. Thus, the loops indexed by j are moved to the outermost level. In general, loop interchanging is possible if there is no cycle in the computational structure. Thus, each processor can execute large chunks of data (e.g., the three inner-most loops) before it needs to communicate.

(4) Assign the groups to the processors of the DMM

Suppose the underlying DMM has a hypercube interprocessor connection topology. Then, there does not seem to be a perfect embedding of the hexagonal mesh onto a hypercube with dilation 1. A simple embedding with dilation 2 is to configure the hypercube into a 2-dimensional mesh, and map the hexagonal mesh onto it, with each diagonal link, [1 1], in the hexagonal mesh going through a pair of edges, [1 0] and [0 1], in the mesh. In new generation DMMs, this mapping should not induce too much communication delay.

(5) Determine the optimal design parameters

The exact values of r0, r1, and r2 can be determined by the analytic model presented in Chapter 5. Note that, from a grouping, the following parameters can be obtained:

  •  With which other processors does a processor need to communicate?
  •  Which data does the processor need to exchange?
  •  How large should each message be?
  •  How often does the processor need to communicate?

Referring back to the program in (6.12), we can see that the processor (i',k') needs to communicate every r2 iterations (indexed by j) with its neighbors. In each data exchange, there will be r1 elements of g_ijk sent to processor (i'+1,k'), r1 elements of e_ijk and r0 elements of c_ij sent to processor (i',k'+1), and r1 elements of e_ijk sent to processor (i'+1,k'+1). The same amount of data should also be received from processors (i'-1,k'), (i'-1,k'-1), and (i',k'-1). Finally, a total of r0 × r1 elements of b_ik should be loaded into the processor before program (6.12) can start.
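The design parameters listed above translate directly into the skeleton of a node program. The following C fragment is a minimal sketch under stated assumptions rather than the thesis implementation: the array layouts and the empty exchange_boundaries() stub are placeholders introduced here for the NCUBE nread()/nwrite() calls that would move the r1 elements of g and e and the r0 elements of c between neighboring groups every r2 iterations of j.

    /* Minimal sketch (not the thesis code) of the per-processor structure in
     * program (6.12).  A single group, taken to be (i',k') = (0,0), is
     * simulated; the halo exchange that a real node would perform with
     * nread()/nwrite() every r2 iterations of j is reduced to an empty stub
     * so that the sketch stays self-contained.                              */
    #include <stdio.h>

    #define R0 2                 /* grouping size along [1 0]              */
    #define R1 2                 /* grouping size along [0 1]              */
    #define R2 4                 /* j iterations between communications    */
    #define M  8                 /* problem size; R2 divides M evenly      */

    static float c[R0][M];       /* block of c(i,j) owned by this group           */
    static float b[R0][R1];      /* stationary block of b(i,k)                    */
    static float e[M][R1];       /* operands e(j,k-2), stored already shifted in k */
    static float g[R0][R1];      /* operands g(i-1,k), stored already shifted in i */

    /* Placeholder: on the NCUBE this would issue the nwrite()/nread() calls
     * that exchange r1 elements of g and e and r0 elements of c with the
     * neighboring groups (i'+1,k'), (i',k'+1), and (i'+1,k'+1).             */
    static void exchange_boundaries(void) { /* message passing omitted */ }

    int main(void)
    {
        int i, j, k, l;

        /* b is stationary; in the real program its r0*r1 elements are
         * downloaded from the host before execution starts.             */
        for (i = 0; i < R0; i++)
            for (k = 0; k < R1; k++)
                b[i][k] = 1.0f;
        for (j = 0; j < M; j++)
            for (k = 0; k < R1; k++)
                e[j][k] = 1.0f;
        for (i = 0; i < R0; i++)
            for (k = 0; k < R1; k++)
                g[i][k] = 1.0f;

        for (l = 0; l < M; l += R2) {        /* outer blocking loop on j     */
            exchange_boundaries();           /* communicate every R2 steps   */
            for (j = l; j < l + R2; j++)     /* the three inner-most loops   */
                for (i = 0; i < R0; i++)     /* run without communication    */
                    for (k = 0; k < R1; k++)
                        c[i][j] += b[i][k] * e[j][k] / g[i][k];
        }

        printf("c[0][%d] = %f\n", M - 1, c[0][M - 1]);
        return 0;
    }

On the NCUBE system described in Chapter 3, the stub would be replaced by actual message-passing calls, and the block sizes R0, R1, and R2 would be chosen with the analytic model of Chapter 5.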
CHAPTER 7

CONCLUSION AND FUTURE WORK

We have presented in this thesis the design, implementation, and modeling of pipelined data parallel algorithms on DMMs. In this chapter, we summarize what we have discussed and highlight the major contributions of the work reported here. Next, improvements to the current work will be suggested, impacts of new techniques and architectures will be assessed, and plans for future work will be outlined.

7.1. SUMMARY

In Chapter 1, features of current DMMs were reviewed, from which important system bottlenecks were identified. Special considerations are required when one attempts to design efficient parallel algorithms on DMMs. Different styles of parallel computations were discussed, and pipelined data parallel algorithms were singled out as the primary subject of this thesis. Then, in Chapter 2, starting with a representation of parallel computation, the P-nets, we discussed various considerations in designing parallel algorithms on DMMs. The implementation and modeling of two data parallel algorithms were given in Chapter 3 to illustrate the tradeoffs in designing parallel algorithms in terms of computation and communication as well as time and space.

With these tradeoffs in mind, a new parallel algorithm design paradigm called large-grain pipelining was introduced in Chapter 4. Large-grain pipelining exploits node level macro-pipelining to increase the degree of overlapping between computation and communication. Examples of pipelined matrix multiplication algorithms were given to illustrate the application of large-grain pipelining.

As discussed in Section 4.3, the unique feature of large-grain pipelining is the focus on balancing computation with communication by forming data blocks with appropriate sizes. These data blocks in turn form data streams which flow in the system to produce pipeline effects. Operations in the system are scheduled according to the flows of data, i.e., data-driven scheduling. Furthermore, the macro-pipelining in pipelined data parallel algorithms increases the degree of overlapping between nodes while reducing the I/O bandwidth in accessing data. The net effect is a highly parallel computation.

In Chapter 5, a model for analyzing pipelined data parallel computation on DMMs was presented. The model views a pipelined computation as consisting of elementary computational units, which are interconnected through data flows. The throughput of individual computational units as well as linear arrays and 2-dimensional meshes is analyzed by studying the time events represented as Timed Petri nets. The analytic results for the pipelined matrix multiplication algorithms were compared with experimental results obtained from a 64-node NCUBE multiprocessor. The close match between these two sets of data indicates the accuracy of the analytic model. The accuracy stems from two sources: (1) the capability of the analytic model in describing the behavior of a computational unit, and in modeling asynchronous and deterministic events; and (2) the accuracy in estimating system parameters, such as the message transmission time and primitive computation time. Note that the analytic model can be automated and applied to a more general class of computations.

In Chapter 6, a systematic approach for designing pipelined data parallel algorithms on DMMs was introduced. The design procedure takes a shift-invariant nested-loop program as its input and produces a pipelined data parallel algorithm through a series of transformations. From the relations between loop dependence vectors, we studied properties of grouping vertices in the computational structure. The major result obtained is that it is possible to group vertices in a 2-dimensional computational structure in such a way that the resultant contracted structure is a universal planar array (see Section 6.6). The extension to computational structures of higher dimension is possible.
The major contribution of the grouping concept and related theorems is, of course, to lead the way to a systematic approach for designing pipelined data parallel algorithms on DMMs. The grouping concept is unique in that it takes a totally different view of algorithm transformation from that in synthesizing systolic arrays, i.e., grouping versus projection. However, the results obtained in this thesis are also applicable to the latter, since projection is a special case of grouping. For example, the knowledge of grouping may assist in predicting the performance of systolic arrays derived from different projections.

Overall, it can be seen that we have presented a systematic and thorough study of pipelined data parallel algorithms, from the basic concept and performance modeling to design and synthesis. Through the techniques introduced in this thesis, the resultant algorithms can achieve a balance between computation and communication. Furthermore, the systematic approaches used in these techniques hold the key to automated algorithm and program development. Various possibilities that stem from the research developed in this thesis will be probed further in the next section.

7.2. FUTURE WORK

From the discussion throughout this thesis, we can identify at least the following areas that need to be studied further:

(1) Study other applications of large-grain pipelining

We have presented pipelined matrix multiplication algorithms in this thesis. Due to their generality, it is expected that many problems can be solved by directly applying the techniques developed here, especially those problems which have systolic array solutions. It is interesting to see how well the resultant algorithms perform compared to the existing algorithms. Another interesting topic would be the application of large-grain pipelining to problems which do not have a static and regular data structure, such as those found in artificial intelligence.

(2) Study pipelined data parallel algorithms under different environments

We have studied pipelined data parallel algorithms mainly under the environment of message-passing DMMs. It is interesting to see how and in what form pipelined data parallel algorithms can be run on a shared-memory multiprocessor. More importantly, since the shared-variable approach is a more natural way of programming, we should also study how large-grain pipelining can be incorporated into such an environment, perhaps in terms of reducing the amount of remote memory accesses.

(3) Assess the impact of future generation DMMs

Important trends in future generation DMMs have been discussed in Chapter 1, including larger memory, higher computing capability, improved computation to communication ratio, and concurrent I/O subsystems. It is expected that large-grain pipelining can benefit from these novel designs. For example, increasing the communication speed will decrease the block size and increase the number of blocks in a data stream. This is equivalent to increasing the vector length in a pipelined vector processor, which results in higher pipeline efficiency [HwBr84].

With reduced communication overhead, large-grain pipelining can fully exploit the overlapping between computation and communication, which is not possible in current implementations. In addition, more sophisticated communication primitives can be supported, which alleviate, say, the overhead in data gathering/scattering operations (see Section 3.2).
Examples include the modification of send and receive primitives to allow the user to specify the access patterns within a block of data, or the implementation of the broadcast primitive.

Pipelined data parallel algorithms can also take advantage of concurrent I/O subsystems, through which several streams of data can be directed into the system simultaneously. This can further reduce the bottleneck induced by the host. However, depending on how nodes access the disks, new issues may arise concerning how data is distributed on the disks and how access conflicts are resolved.

(4) Refine the grouping techniques

Most theorems presented in Chapter 6 are aimed at characterizing the properties of grouping, e.g., grouping along which direction will result in a dependence preserving grouping. We have not yet addressed the issue of choosing which data elements to pack together into one message, that of defining and determining the optimal grouping scheme, and the issue of implementing the ideas discussed. Mathematical formulations are also needed to determine the computation and communication requirements of a particular grouping. This information can be used to determine the granularity of the algorithm.

One important issue is to study the grouping of computational structures in higher dimensional spaces. Intuitively, the results presented in this thesis can be applied to higher dimensional spaces. However, a unified and rigorous mathematical treatment of all cases is desirable, which should allow the theorems to be extended more easily.

It is also interesting to study other program constructs which differ from shift-invariant nested-loop programs. Examples include the IF-THEN-ELSE construct and statements not enclosed within a loop. We should also consider the problem of how to incorporate pipelined data parallel computations into other styles of computation. This includes the interface with other nested loops in the program.

Finally, it is important to have a complete understanding of the relations between grouping and projection. The idea is to use grouping techniques to assist in synthesizing systolic arrays, and the goal is to predict the performance of the final systolic array from the knowledge of grouping.

(5) Implement the techniques

Most techniques presented in this thesis can be implemented on a computer. These tools should facilitate the development of parallel algorithms on DMMs.

(6) Develop an intelligent compiler for DMMs

The ultimate goal of the work presented in this thesis is to develop intelligent compilers for DMMs. The intelligent compiler takes a sequential program as its input and generates a parallel version suitable for DMMs. As a first step, a pre-processor to the compiler may be developed. The user provides information necessary for the parallelization as directives. The pre-processor attempts to parallelize the given program as much as possible and generates a transformed source program for compilation. Using the grouping techniques, many nested loops in the program can be parallelized and transformed into pipelined data parallel forms.

Intelligent compilers are complex and difficult to implement. However, the techniques presented in this thesis point to a new direction in achieving this goal. Additional research is needed to design more efficient parallel algorithms and to further advance our knowledge and capability in parallel processing.

BIBLIOGRAPHY
[AdCD74] T.L. Adam, K.M. Chandy, J.R. Dickson, "A Comparison of List Schedules for Parallel Processing Systems," Comm. ACM, Dec. 1974, pp. 685-690.

[BBN87] BBN Advanced Computers, Inc., Butterfly Products Overview, Oct. 1987.

[Berm87] F. Berman, "Experience with an Automatic Solution to the Mapping Problem," in The Characteristics of Parallel Algorithms, L.H. Jamieson, D.B. Gannon, R.J. Douglass, Eds., MIT Press, 1987.

[ChSm87] V. Cherkassky, R. Smith, "Efficient Mapping and Implementation of Matrix Algorithms on a Hypercube," Technical Report, Department of Electrical Engineering, Univ. of Minn., 1987.

[DaSe87] W.J. Dally, C.L. Seitz, "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks," IEEE Trans. on Computers, May 1987, pp. 547-553.

[Duni87] T.H. Dunigan, "Hypercube Performance," Hypercube Multiprocessors 1987, M.T. Heath, Ed., SIAM, 1987, pp. 178-191.

[FaSh87] N. Faroughi, M.A. Shanblatt, "Systematic Generation and Enumeration of Systolic Arrays from Algorithms," Proc. of the 1987 Int'l Conf. on Parallel Processing, Aug. 1987, pp. 844-847.

[Fink87] R.A. Finkel, "Large-Grain Parallelism - Three Case Studies," in The Characteristics of Parallel Algorithms, L.H. Jamieson, D.B. Gannon, R.J. Douglass, Eds., MIT Press, 1987.

[FiFi82] J.P. Fishburn, R.A. Finkel, "Quotient Networks," IEEE Trans. on Computers, Apr. 1982, pp. 288-295.

[FoFW85] J.A.B. Fortes, K.S. Fu, B.W. Wah, "Systematic Approaches to the Design of Algorithmically Specified Systolic Arrays," Proc. of 1985 Int'l Conf. on Acoustics, Speech, and Signal Processing, 1985, pp. 300-303.

[FoMo84] J.A.B. Fortes, D.I. Moldovan, "Data Broadcasting in Linearly Scheduled Array Processors," Proc. of Symp. on Computer Architecture, 1984, pp. 224-231.

[FoWa87] J. Fortes, B.W. Wah, "Systolic Arrays - From Concept to Implementation," IEEE Computer, July 1987, pp. 12-17.

[FoOH87] G.C. Fox, S.W. Otto, A.J. Hey, "Matrix Algorithms on a Hypercube I: Matrix Multiplication," Parallel Computing, Jan. 1987, pp. 17-31.

[GaJo79] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman & Company, 1979.

[Gonz77] M.J. Gonzalez, Jr., "Deterministic Processor Scheduling," Computing Surveys, Sep. 1977, pp. 173-204.

[GrRe86] D.C. Grunwald, D.A. Reed, "Benchmarking Hypercube Hardware and Software," Technical Report UIUCDCS-R-86-1303, Department of Computer Science, University of Illinois at Urbana-Champaign, 1986.

[HaMS86] J.P. Hayes, T.N. Mudge, Q.F. Stout, S. Colley, J. Palmer, "Architecture of a Hypercube Supercomputer," Proc. Int'l Conf. on Parallel Processing, Aug. 1986, pp. 653-660.

[HiSt86] W.D. Hillis, G.L. Steele, Jr., "Data Parallel Algorithms," Comm. ACM, Dec. 1986, pp. 1170-1183.

[HoTa85] J.J. Hopfield, D.W. Tank, "Neural Computation of Decisions in Optimization Problems," Biological Cybernetics, 1985, pp. 141-152.

[Hu61] T.C. Hu, "Parallel Sequencing and Assembly Line Problems," Management Science, Nov. 1961, pp. 841-848.

[HwBr84] K. Hwang, F.A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill Book Co., 1984.

[Hwan87] K. Hwang, "Advanced Parallel Processing with Supercomputer Architectures," Proc. of the IEEE, Oct. 1987, pp. 1348-1379.

[Jord87] H.F. Jordan, "The Force," in The Characteristics of Parallel Algorithms, L.H. Jamieson, D.B. Gannon, R.J. Douglass, Eds., MIT Press, 1987.

[KeLi70] B.W. Kernighan, S. Lin, "An Efficient Heuristic Procedure for Partitioning Graphs," Bell System Technical Journal, Feb. 1970, pp. 291-307.
[KiCN88a] C.T. King, W.H. Chou, L.M. Ni, "Large-Grain Pipelining on Distributed-Memory Multiprocessors," Proc. of the Third Int'l Conf. on Supercomputing, May 1988.

[KiCN88b] C.T. King, W.H. Chou, L.M. Ni, "Pipelined Data Parallel Algorithms - Concept and Modeling," Proc. of the 1988 ACM Int'l Conf. on Supercomputing, July 1988.

[KiGN88] C.T. King, T.B. Gendreau, L.M. Ni, "Reliable Elections in Broadcast Networks," to appear in Journal of Parallel and Distributed Computing.

[KiNi88] C.T. King, L.M. Ni, "Large-Grain Pipelining on Hypercube Computers," Proc. of the Third Conf. on Hypercube Concurrent Computers and Applications, 1988.

[KiGV83] S. Kirkpatrick, C.D. Gelatt, Jr., M.P. Vecchi, "Optimization by Simulated Annealing," Science, May 1983, pp. 671-680.

[KuLe78] H.T. Kung, C.E. Leiserson, "Systolic Arrays (for VLSI)," Sparse Matrix Proc., 1978, pp. 32-63.

[KuLJ87] S.Y. Kung, S.C. Lo, S.N. Jean, J.N. Hwang, "Wavefront Array Processors - Concept to Implementation," IEEE Computer, July 1987, pp. 18-33.

[MaLT82] P.R. Ma, E.Y. Lee, M. Tsuchiya, "A Task Allocation Model for Distributed Computing Systems," IEEE Trans. Computers, Jan. 1982, pp. 41-47.

[MiWi84] W.L. Miranker, A. Winkler, "Spacetime Representations of Computational Structures," Computing, 1984, pp. 93-114.

[MoFo86] D.I. Moldovan, J.A.B. Fortes, "Partitioning and Mapping Algorithms into Fixed-Size Systolic Arrays," IEEE Trans. on Computers, Jan. 1986, pp. 1-12.

[MuBA87] T.N. Mudge, G.D. Buzzard, T.S. Abdel-Rahman, "A High Performance Operating System for the NCUBE," Hypercube Multiprocessors 1987, M.T. Heath, Ed., SIAM, 1987.

[NeSn87] F.A. Nelson, L. Snyder, "Programming Paradigms for Nonshared Memory Parallel Computers," in The Characteristics of Parallel Algorithms, L.H. Jamieson, D.B. Gannon, R.J. Douglass, Eds., MIT Press, 1987.

[NiKP87] L.M. Ni, C.T. King, P. Prins, "Parallel Algorithm Design Considerations for Hypercube Multiprocessors," Proc. of 1987 Int'l Conf. on Parallel Processing, 1987, pp. 717-720.

[Oste87] A. Osterhaug, Guide to Parallel Programming on Sequent Computer Systems, Sequent Computer Systems, Inc., 1987.

[Pete77] J.L. Peterson, "Petri Nets," Computing Surveys, Sep. 1977, pp. 223-252.

[PaWo86] D.A. Padua, M.J. Wolfe, "Advanced Compiler Optimizations for Supercomputers," Comm. of ACM, Dec. 1986, pp. 1184-1201.

[PoBa87] C.D. Polychronopoulos, U. Banerjee, "Processor Allocation for Horizontal and Vertical Parallelism and Related Speedup Bounds," IEEE Trans. Computers, Apr. 1987, pp. 410-420.

[Quin87] M.J. Quinn, Designing Efficient Algorithms for Parallel Computers, McGraw-Hill Book Co., 1987.

[Ramc74] C. Ramchandani, "Analysis of Asynchronous Concurrent Systems by Timed Petri Nets," Ph.D. Dissertation, Project MAC, MAC-TR-120, MIT, 1974.

[RaSH79] G.S. Rao, H.S. Stone, T.C. Hu, "Assignment of Tasks in a Distributed Processor System with Limited Memory," IEEE Trans. Computers, Apr. 1979, pp. 291-298.

[SaSc85] Y. Saad, M.H. Schultz, "Topological Properties of Hypercubes," Technical Report YALEU/DCS/RR-389, Department of Computer Science, Yale University, June 1985.

[ShFi87] Y. Shih, J. Fier, "Hypercube Systems and Key Applications," in Parallel Processing for Supercomputing and AI, K. Hwang, D. DeGroot, Eds., 1987.

[Squi86] M.L. Squires, "Applying Parallel Processing," COMPCON Spring 86, 1986, pp. 394-396.

[WiCM88] A. Witkowski, K. Chandrakumar, G. Macchio, "Concurrent I/O System for the Hypercube Multiprocessor," The Third Conf. on Hypercube Concurrent Computers and Applications, Jan. 1988.
APPENDIX

/*
 * Array summation using tree-structured accumulation method with ratioed partitioning
 */
#include <stdio.h>

#define MAXNODE 64              /* Maximum number of processors used */
#define MAXARRAY 100000         /* Maximum size of the array */

log2(n)                         /* Calculate the integer logarithm of an integer */
int n;
{
    int x=1;
    n--;
    while(n = n>>1) x++;
    return(x);
}

main(argc,argv)                 /* The program running in the host */
int argc;
char *argv[];
{
    int i,j,k,*ipt;
    int m;                              /* Size of the array */
    int n,d;                            /* Number of nodes used and the dimension */
    float a[MAXARRAY];                  /* The array */
    float sum,time,*fpt;
    int x[MAXNODE];                     /* The size of the subarray in each processor */
    int ch,type,nd,status,nflags;       /* Communication parameters */
    FILE *fopen(),*fp1,*fp2;

    /* Initialization */
    m=atoi(argv[1]); n=atoi(argv[2]); d=log2(n);

    /* Input the partition sizes from the file 'part' */
    fp1=fopen("part","r");
    for(i=0,j=0,ipt=x; i

    ...

    while(ll>diff) {
        nd=(-1); type=30;
        nread(&tmp,sizeof(float),&nd,&type,&flag);
        sum += tmp;
        ll = ll>>1;
    }
    if(ll != 0) {
        nd=id^ll;
        nwrite(&sum,sizeof(float),nd,30,&flag);
    }
    else                                /* I am the root, send back to the host */
        nwrite(&sum,sizeof(float),host,40,&flag);

    /* If processor 0, then stop counting the time */
    if (id == 0) {
        type=95;
        nread(ai,sizeof(float),&host,&type,&flag);
        ai[0]=ntime() - time;
        nwrite(ai,sizeof(float),host,50,&flag);
    }
}

/*
 * Matrix multiplication using Algorithm 3.1
 */
#include <stdio.h>

#define MAXMTX 64               /* Maximum size of the matrices */

gray(n)                         /* Gray code of n */
int n;
{
    return((n>>1)^n);
}

ginv(n)                         /* Inverse of the Gray code n */
int n;
{
    int x=n;
    while(n = n>>1) x = x^n;
    return(x);
}

log2(n)                         /* Calculate the integer logarithm of an integer */
int n;
{
    int x=0;
    while(n = n>>1) x++;
    return(x);
}

main(argc,argv)
int argc;
char *argv[];
{
    int i,j,k,l,u,v,x,y;
    int m,n;                            /* Size of the matrices and number of processors */
    int n1,n2,n3;                       /* Number of partitions */
    int d1,d2;                          /* Dimension of the subcubes */
    int s12,s13,s1,s2,s3;               /* Size of submatrices */
    int ch,type,nd,status,nflags;       /* Communication parameters */
    float a[MAXMTX][MAXMTX],b[MAXMTX][MAXMTX];      /* Matrices A and B */
    float buf[4097];                    /* Buffer for loading */
    FILE *fopen(),*fp1;

    /* Initialization */
    m=atoi(argv[1]); n1=atoi(argv[2]); n2=atoi(argv[3]); n3=atoi(argv[4]);
    n=n1*n2; d1=log2(n1); d2=log2(n2);
    s1=m/n1;                            /* Number of rows in one partition of A */
    s2=m/n2;                            /* Number of columns in one partition of A */
    s3=m/n3;
    s12=s1*s2;                          /* Size of the submatrix in A: (n1,n2) partition */
    s13=s1*s1;                          /* Size of the submatrix in C: (n1,n1) partition */

    /* Read the input matrices */
    fp1=fopen("data","r");
    for(i=0; i

    ...

    x=((buf[s13]>>d2)+1)*s1;
    y=(buf[s13]+1)*s3;
    for(i=x-s1,l=0; j

    ...

gray(n)                         /* Gray code of n */
int n;
{
    return((n>>1)^n);
}

ginv(n)                         /* Inverse Gray code of n */
int n;
{
    int x=n;
    while(n = n>>1) x = x^n;
    return(x);
}

log2(n)                         /* Calculate the integer logarithm of an integer */
int n;
{
    int x=0;
    while(n = n>>1) x++;
    return(x);
}

main()
{
    int i,j,k,l,ll,idd,diff,root,max;
    int m,n1,n2,n3,d,d1,d2,s12,s13,s1,s2;
    int id,id1,id2,su,nd,pid,host,type,flag;
    float ai[2048],bi[2048],ci[2049],buf[2048];     /* Buffers */

    /* Receive the initialization parameters from the host */
    whoami(&id,&pid,&host,&d);                      /* Get environment info */
    type=10;
    nread(ai,4*sizeof(float),&host,&type,&flag);
    m=ai[0]; n1=ai[1]; n2=ai[2]; n3=ai[3];
    d1=log2(n1); d2=log2(n2);
    s1=m/n1; s2=m/n2; s12=s1*s2; s13=s1*s1;

    /* Find predecessor and successor nodes in the ring */
    id1=id>>d2; id2=id & (n2-1);
    k = id1 ? gray(ginv(id1)-1) : gray(n1-1);
    su=(k<

    ...

    >> 1;
    while(ll>diff) {
        nd=(-1); type=30+l;
        nread(buf,s13*sizeof(float),&nd,&type,&flag);
        for(i=0; i

    ...

        ll = ll>>1;
    }
    if(ll != 0) {
        nd=id^ll;
        nwrite(ci,s13*sizeof(float),nd,30+l,&flag);
    }
    else {                              /* I am the root, send back to the host */
        ci[s13] = pid = (pid+1) % n1;
        nwrite(ci,(s13+1)*sizeof(float),host,40,&flag);
    }

    /* Processor 0 sends the execution time back to the host */
    if (id == 0) {
        type=95;
        nread(ai,sizeof(float),&host,&type,&flag);
        buf[0]=ntime() - max;
        nwrite(buf,sizeof(float),host,50,&flag);
    }
}

/*
 * Matrix multiplication using Algorithm 4.2
 */
#include <stdio.h>

#define MAXMTX 64               /* Maximum size of the matrices */

log2(n)                         /* Calculate the integer logarithm of an integer */
int n;
{
    int x=0;
    while(n = n>>1) x++;
    return(x);
}

main(argc,argv)
int argc;
char *argv[];
{
    int i,j,k,l,u,v,x,y;
    int ch,type,nd,nflags;              /* Communication parameters */
    int m,n;                            /* Size of the matrices and number of processors */
    int n1,n2,n3;                       /* Number of partitions */
    int d1,d2;                          /* Dimension of the subcubes for n1, n2 */
    int s12,s13,s23,s1,s2,s3;           /* Size of submatrices */
    float buf[4096];                    /* Message buffer */
    float a[MAXMTX][MAXMTX],b[MAXMTX][MAXMTX];      /* Matrices A and B */
    FILE *fopen(),*fp1;

    /* Initialization */
    m=atoi(argv[1]); n1=atoi(argv[2]); n2=atoi(argv[3]); n3=atoi(argv[4]);
    n=n1*n2; d1=log2(n1); d2=log2(n2);
    s1=m/n1;                            /* Number of rows in one partition of A */
    s2=m/n2;                            /* Number of columns in one partition of A */
    s3=m/n3;
    s12=s1*s2;                          /* Size of the submatrix in A */
    s13=s1*s3;
    s23=s2*s3;                          /* Size of the submatrix in B */

    /* Read the input matrices */
    fp1=fopen("data","r");
    for(i=0; i

    ...

    x=((buf[s13]>>d2)+1)*s1;
    y=(buf[s13]+1)*s3;
    for(i=y-s1,l=0; j

    ...

gray(n)                         /* Gray code of n */
int n;
{
    return((n>>1)^n);
}

ginv(n)                         /* Inverse Gray code of n */
int n;
{
    int x=n;
    while(n = n>>1) x = x^n;
    return(x);
}

log2(n)                         /* Calculate the integer logarithm of an integer */
int n;
{
    int x=0;
    while(n = n>>1) x++;
    return(x);
}

main()
{
    int i,j,k,l,ll;
    int m,n1,n2,n3,s12,s13,s23,s1,s2,s3,d1,d2;
    int nd,pid,host,type,flag,id,id1,id2,idg,diff,su,pr,maxt;
    float ai[4096],bi[4096],ci[4097];               /* Buffers */

    /* Receive the initialization parameters from the host */
    whoami(&id,&pid,&host,&diff);                   /* Get environment info */
    type=10;
    nread(ai,4*sizeof(float),&host,&type,&flag);
    m=ai[0]; n1=ai[1]; n2=ai[2]; n3=ai[3];
    d1=log2(n1); d2=log2(n2);
    s1=m/n1; s2=m/n2; s3=m/n3;
    s12=s1*s2; s13=s1*s3; s23=s2*s3;

    id1=id>>d2; id2=id & (n2-1);
    su = (id2 == n2-1) ? host : id+1;
    idg=ginv(id1);
    k=gray(idg+1);
    pr=(k<

    ...

    >> 1;
    while(ll>diff) {
        nd=(-1); type=30+l;
        nread(bi,s13*sizeof(float),&nd,&type,&flag);
        for(i=0; i

    ...

        ll = ll>>1;
    }
    if(ll != 0) {
        nd=id^ll;
        nwrite(ci,s13*sizeof(float),nd,30+l,&flag);
    }
    else {                              /* I am the root, send back to the host */
        ci[s13] = 1;
        nwrite(ci,(s13+1)*sizeof(float),host,40,&flag);
    }

    /* Processor 0 stops counting the time */
    if (id == 0) {
        type=95;
        nread(ai,4,&host,&type,&flag);
        ai[0]=ntime()-maxt;
        nwrite(ai,sizeof(float),host,50,&flag);
    }
}