MIXED SYSTOLIC ARRAYS: A RECONFIGURABLE MULTIPROCESSOR ARCHITECTURE

By

Tung-Liang Chang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Electrical Engineering and Systems Science

1982

ABSTRACT

MIXED SYSTOLIC ARRAYS: A RECONFIGURABLE MULTIPROCESSOR ARCHITECTURE

By

Tung-Liang Chang

Systolic arrays are special-purpose, high-performance data-flow multiprocessor structures characterized by a simple, regular and short communication geometry, which is considered one of the most desired attributes in VLSI implementation. The principal drawback to these special-purpose arrays is that they are designed for specific algorithms which possess very simple and regular data flow patterns. These restrictions limit the types of algorithms or applications that can effectively be supported by such architectures.

The primary objective of this research is to develop, characterize, and evaluate a basic computing model for a class of architectures which broadens the scope of algorithms executable on systolic arrays while retaining much of the simplicity and regularity of the original systolic array architecture. Computing elements and control buffers are mixed in regular geometric patterns to form reconfigurable systolic arrays known as mixed systolic arrays (MSA's). For a particular implementation, the mixing ratio is chosen to match an algorithm's local vs. global data requirements. Classes of algorithms with similar data requirements may be executed on the same array by merely presetting the control buffers at load time. By decomposing an MSA into two basic regions, the dependence of computing power on I/O bandwidth as a function of array edge size can be relaxed.

Data-driven computations result in self-directed computational rings within the MSA, and composite cells of data-flow instructions are executed in a pipelined fashion. Vertical grouping of composite cells along critical-path data-flow instructions and horizontal grouping across concurrent data-flow instructions are employed for mapping data-flow directed graphs onto computational rings for effective execution. By viewing an MSA as a computing network of interlinked ring pipelines, data-flow programs can be uniformly distributed for efficient resource utilization.

Data flow on the mixed systolic array is demonstrated by solving a second-order recursive equation with nonconstant coefficients and by the implementation of a modified hourglass computing model. MSA space-time complexity is compared with that of the systolic array along with other key performance measures.

ACKNOWLEDGEMENTS

The author wishes to express his deepest appreciation to Dr. P.D. Fisher, the author's thesis advisor, for his assistance during the course of this research and for supervising this thesis and showing considerable patience. Thanks are also due to the other members of the graduate committee, Dr. J.B. Kreer, Dr. S.R. Crouch, Dr. D.K. Reinhard, Dr. R.G. Reynolds and Dr. M.A. Shanblatt, for their suggestions and encouragement throughout this research effort. I wish to thank Ronald Kraus for preparing all of the illustrations contained in this thesis. Very special thanks are due V.E.
Leichty for his encouragement during my stay with him in the past four years. Work reported here was supported in part by NSF under Grant No. MCS 79-09216.

TABLE OF CONTENTS

Chapter

I.   INTRODUCTION

II.  BACKGROUND
     2.1 Limitations of Single-Processor Machines
     2.2 Array Architectures
     2.3 Systolic Arrays
     2.4 Data-Driven Machines
     2.5 Local Computation and Communication

III. PROGRAMMABLE SYSTOLIC ARRAYS
     3.1 Functional and Structural Description
     3.2 Sample Algorithm Implementation
     3.3 Discussion

IV.  MIXED SYSTOLIC ARRAYS
     4.1 Structure of Mixed Systolic Arrays
     4.2 Uniform Mixing
     4.3 Partial Uniform Mixing

V.   DATA-FLOW COMPUTATIONS
     5.1 Store-and-Forward Control Buffers
     5.2 Computational Rings
     5.3 Loading and Scheduling of Composite Cells
     5.4 Data-Driven Computations

VI.  THE HOURGLASS MACHINE
     6.1 Modified Hourglass Computing Model
     6.2 Hourglass Tree Machine
     6.3 Two-Phase Data Transfer Mechanism
     6.4 Hourglass Machine Performance Measures
     6.5 Implementation Considerations

VII. CONCLUSIONS
     7.1 Summary
     7.2 Future Research

REFERENCES

LIST OF TABLES

2.1  TWO STANDARD VON NEUMANN MACHINE DYNAMIC INSTRUCTION MIXES FOR SCIENTIFIC AND TECHNICAL APPLICATIONS [11]
3.1  DATA FLOW TIMING EXAMPLE

LIST OF FIGURES

1.1  A modified systolic array made up of both control buffers and computing elements
2.1  The structure of the von Neumann machine
2.2  The array organization of the Illiac IV [12]
2.3  Two types of inner-product processors [9]
2.4  The hex-connected systolic array for the matrix multiplication problem [8]
2.5  A directed graph representation example
2.6  The block diagram of a basic data-driven machine [24]
2.7  A data-flow instruction cell example
2.8  A multiprocessor system for the fast execution of numerical programs [6]
3.1  A ring-structured processing module
3.2  A PM with N pairs of local and one pair of global I/O ports
3.3  (a) A computation series, y1 -> y2 -> y3 -> y4. (b) Sequence of intracomputations executed on a PM. (c) Sequence of intercomputations executed step-by-step on a three-PM array
3.4  A rotating wheel computing structure example
3.5  The triangular PSA for solving a second-order recursive problem
3.6  Computing steps for a second-order recursive equation on a triangular PSA
4.1  A mixed systolic array constructed from a diamond-like basis
4.2  A partially regular hexagonal mixed systolic array composed of a central region and an outer shell. The latter is constructed from a triangular basis
4.3  Hexagonal array bases with (a) ρ_b = 1/7, (b) ρ_b = 2/7, and (c) ρ_b = 3/7
4.4  (a) A regular hexagonal mixed systolic array constructed from a ρ_b = 1/7 basis. (b) A regular hexagonal mixed systolic array constructed from a ρ_b = 2/7 basis. (c) A regular hexagonal mixed systolic array constructed from a ρ_b = 3/7 basis
4.5  Four classes of mixed systolic arrays: (a) CICO, (b) CIDO, (c) DICO, and (d) DIDO. The arrows indicate the direction of data flow
5.1  The structure of the control buffer
5.2  A typical two-stage computational ring within a hexagonal MSA
5.3  (a) A data-flow directed graph example. (b) A two-stage computational ring. (c) The Gantt chart illustrating the execution of (a) on (b)
5.4  (a) A data-flow directed graph example. (b) The simplified time/space grid representation for (a)
5.5  An example of vertical grouping of composite cells on a grid representation graph
5.6  An example of horizontal grouping
6.1  The modified hourglass computing model
6.2  The structure of the data-flow hourglass computing machine
6.3  A five-level linking structure in the buffer tree
6.4  The data address syntax

CHAPTER I

INTRODUCTION

The demand for improved or advanced computer systems continues unabated, with special emphasis on increased overall throughput. One approach being considered exploits very-large-scale-integrated (VLSI) circuit technology to scale down IC devices to submicron features, thereby reducing logic gate speed-power products to 0.1 pJ and below [1]. Based on this figure, single chips may contain 10^7 to 10^8 transistors, which is equivalent to about 10^3 unconnected 32-bit microprocessors [2,3]. However, with the interprocessor communication path complexity growing at a much more rapid rate than that of the logic circuitry, interconnections may account for most of the chip area. What's more, these interconnection networks may account for the critical time delays [2]. Therefore, the full reward for using VLSI circuits lies with the ability of the computer architect to design practical machines with simple and regular communication paths. Moreover, they must be designed to exploit fully the concurrencies that exist in the user's applications.

Array-based architectures appear to be very attractive in this respect. It is argued that such architectures have several important advantages:

* Arrays require shorter communication paths--a shorter communication wire not only improves system operation, but also consumes less chip area [2].

* Arrays have regular communication wiring patterns--a regular layout leads to efficient, high-density designs and simplified debugging procedures [4].

* Arrays execute structured algorithms efficiently--through parameterization procedures, algorithms with particular computational demands, control, and I/O structures can be effectively executed on these architectures [5].

* Arrays are built from a small set of predesigned modules--modularity reduces design time because design automation techniques become useful. Modularity also reduces cost since general-purpose modules may be used in multiple applications once they are available to the system designer [4].
* Arrays support very high concurrencies in local computations and communications--locality of computation and communication leads to ease of algorithm decomposition and to an effectiveness of distributed computations [6].

* Arrays are easy to expand--expandability allows the system designer to enhance the computing power or applicability of the architectures with a minimum effort [7].

The purpose of this thesis is to investigate alternative strategies which fully exploit emerging IC technology to yield useful, high-performance, cost-effective array-based computer architectures.

One important class of VLSI architectures which has received a great deal of attention is the systolic array [8,9]. These arrays are characterized by having a simple, regular and short communication geometry, which is considered one of the most desired attributes in VLSI implementation. The principal drawback to these special-purpose arrays is that they are designed for specific algorithms which possess very simple and regular data flow patterns. These restrictions limit the types of algorithms or applications that can efficiently be supported by such architectures. Thus the investigation of a new class of array architectures, which can broaden the scope of algorithms executable on them while retaining much of the simplicity and regularity of the systolic array architectures, can be of practical use.

The primary objective is to develop, characterize, and evaluate a basic computing structure for a class of array architectures--a modified systolic array--which broadens the scope of algorithms executable on systolic arrays while retaining much of the simplicity and regularity of the original systolic array structure. One aspect of the approach taken here is to introduce control buffers (CB's) into the array [5]. In contrast with the systolic array structure, in which data are regularly moved on a limited pattern of communication paths, these control buffers provide a means to increase the flexibility of data movement. More specifically, these CB's are programmable, thereby increasing the array's potential usefulness.

As shown in Fig. 1.1, the modified array is partitioned into interconnected control buffers and computing elements. The control buffers (CB's) control the sequence of computations in the array, and the computing elements (CE's) implement the primitive mathematical operations. Within the modified array there are numerous mini-systolic-like subarrays, each of which executes the data-flow subgraphs scheduled to it. The control buffers, when initiated by a control signal, tag data in such a manner as to control their direction of movement and the computations performed on the computing elements. As proposed, the array-based architectures constructed from the modified systolic array will support asynchronous as well as synchronous control structures, local as well as global communication structures, distributed functional computation structures, and balanced I/O structures.

Figure 1.1. A modified systolic array made up of both control buffers and computing elements.

In order to help define and establish the significance of this particular class of array architectures, some background information is presented in Chapter II. As the groundwork for the development of this class of array architectures, the design of a triangular programmable systolic array for solving a second-order recursive problem is presented in Chapter III.
This is followed in Chapter IV by the discussion of mixing in the systolic array and of the functional and structural characteristics of the mixed array. Chapter V describes block-driven computations on the mixed systolic array, including the discussion of basic ring computing structures and their subsequent chain operations for the support of effective block-driven computations. Next, in Chapter VI, the design of an hourglass computing machine is presented, along with an analysis based on performance measures such as throughput, cost-effectiveness, hardware complexity, and VLSI implementation, among other things. And, finally, conclusions and thoughts which might lead to future research possibilities are included in the last chapter, Chapter VII.

CHAPTER II

BACKGROUND

2.1 Limitations of Single-Processor Machines

The single-processor sequential machine performs one elemental activity at a time, much like a single individual would solve an arithmetic problem by hand. But this computational model has serious limitations. To understand these limitations let us consider the structure of such a computer--the von Neumann machine (see Fig. 2.1). Here program instructions and data both reside in the random-access main memory. The program's task of producing output given some input is accomplished entirely by pumping single words across the data channel which connects the processor with its memory. A typical instruction execution cycle proceeds as follows: an instruction is fetched by the controller from main memory. Next, the controller decodes the instruction. Then necessary operands, or addresses of operands, are fetched from main memory. Once all of these operands are available, the instruction is executed. The instruction cycle ends with results being left in CPU registers or deposited in main memory.

Figure 2.1. The structure of the von Neumann machine.

The communication channel connecting the CPU and main memory is generally "saturated", leaving the arithmetic logic unit (ALU) idle a significant percentage of the time. Moreover, a large part of the information moving on this channel at any instant is not data at all; rather, it is addresses for program statements and data. Backus refers to this channel as the "von Neumann bottleneck" since it limits the machine's throughput [10]. The only way to retain this sequential architecture and increase throughput is by reducing main memory read/write access times. But the ramifications of this bottleneck do not become completely clear until it is recognized that in a typical scientific application, where one would expect a large percentage of the machine operations to be arithmetic, less than 20% of all machine instructions executed are actually fixed-point or floating-point arithmetic operations [11]. Data access operations and control unit executed branch operations account for approximately one half and one third of the total instructions executed, respectively (see Table 2.1).

Perhaps the greatest shortcoming of the single-processor sequential machine is that the structure of the problem being executed rarely corresponds to the structure of the machine.
Specifically, simple operations such as vector addition or matrix multiplication have high degrees of concurrency built into the very structure of the mathematical operation, yet the single-processor sequential machine does not take advantage of them to enhance throughput.

TABLE 2.1
TWO STANDARD VON NEUMANN MACHINE DYNAMIC INSTRUCTION MIXES FOR SCIENTIFIC AND TECHNICAL APPLICATIONS [11]

Instruction class    Gibson mix, %    Flynn mix, %
Load/store           31.2             45.1
Index                18.0             --
Branch               16.6             27.5
Compare              3.8              10.8
Fixed point          6.9              7.6
Floating point       12.2             3.2
Shift/logical        6.0              4.5
Other                5.3              1.3
Total                100.0            100.0

2.2 Array Architectures

Array architectures, such as the Illiac IV [12] and various distributed arrays [13-16], perform a single operation on a stream of operands available to each of the arrays' elements. As designed, these SIMD (Single Instruction Stream Multiple Data Stream) arrays are very effective for solving algorithms which either have very large data structures, as in vector and matrix problems, or a high degree of locality, as in image processing applications [15-18].

Usually these arrays have fixed networks. At each node in the arrays is a von Neumann-style machine known as a processing element. Each processing element is associated with its own memory, and paths are provided for the input of data to the array for processing and the output of results after processing. Depending on their structure, each node N(i,j) in the array can communicate with its neighboring nodes N(i ± n, j ± m), where n and m are positive integers. As with the Illiac IV machine, each of the 64 nodes N(i,j), i,j = 0,1,...,7, is connected to its four nearest nodes N((i ± 1) mod 8, j) and N(i, (j ± 1) mod 8) (see Fig. 2.2). Data transfers within this array are taken on a uniform shifting basis under a centralized control. Therefore, several successive shifting operations are required for transferring a data item from one node to a disjointed node.

Figure 2.2. The array organization of the Illiac IV [12].

One important aspect of these array architectures is that the number of interconnections for each processing element is constant and these connecting wires are very regular and simple. This makes them very attractive in VLSI implementation. The drawbacks of these array architectures stem chiefly from the use of a centralized control and from their inherent fixed structures. These drawbacks include:

1. limited applicability--since every processing element must work in the same phase, a lot of effort must be taken in the design of concurrent algorithms before they can be efficiently executed on these machines [19,20].

2. system clock required--the scaling down of IC devices will result in difficulty in moving information from point to point synchronously with a system-wide clocking discipline. This will become even worse as the devices are scaled down to submicron features or as chips get larger [21].

3. I/O constraints--due to the constraint on the maximum number of I/O pinouts in VLSI chips, getting data into or off the parallel arrays can be very costly to the arrays' overall performance [22].

2.3 Systolic Arrays

A systolic array rhythmically computes and passes data through a network of tightly-coupled processors [8-9].
Since data are regularly pumped into and out of each processor, this structure has an advantage in ease of implementation and cost-effectiveness over the traditional parallel processors, in which data flows are managed with the help of costly interconnection networks.

The basic component of the systolic array processor is the inner-product step processor (see Fig. 2.3), which consists of three registers, R_a, R_b, and R_c, each register having two ports, one for input and the other for output. During each unit-time interval, the processor loads data from its input lines into registers R_a, R_b, and R_c, respectively, computes R_c <-- R_a x R_b + R_c, and unloads R_a, R_b, and the new value of R_c as output. Using this inner-product processor as the building block, a number of systolic arrays for band matrix multiplication, LU-decomposition and the solution of triangular linear systems have been designed [8-9].

Figure 2.3. Two types of inner-product processors [9].

As an example, a hex-connected systolic array for the matrix multiplication of two n x n band matrices, C = A x B with A = (a_ij), B = (b_ij), and C = (c_ij), is shown in Fig. 2.4. In this example, the elements in the bands of A and B and the initialized values of C are pumped through the systolic network in three directions in a synchronous fashion. Entries in C are accumulated as they are shifted upward from the bottom of the array, where the c_ij enter with zero value. Each c_ij is able to accumulate all of its terms before it leaves through the upper boundaries.

Figure 2.4. The hex-connected systolic array for the matrix multiplication problem [9].

As described, the systolic array architectures provide the capability for realizing a number of important matrix operations. In addition to achieving a high computational rate by means of pipelining and concurrent computation, these arrays are characterized by having a simple, regular and short communication geometry, which is considered one of the most desired attributes in VLSI implementation. The principal drawback to these special-purpose arrays is that they lack flexibility in implementation: since they are designed for specific applications, a redesign and reconstruction for new applications are required. Another drawback is that algorithms to be efficiently executed on these arrays must possess very simple and regular data flow patterns. Besides these drawbacks, due to the I/O constraint of pinouts on a chip, systolic arrays are hard to apply in practical implementations.

2.4 Data-Driven Machines

A data-driven machine executes data-flow programs modeled by Karp-Miller computation graphs [23]. These data-flow programs are represented as two-dimensional directed graphs, where nodes represent operations and links represent the data movements from operation to operation. Fig. 2.5 shows a simple Fortran-like statement represented as a directed graph.

Figure 2.5. A directed graph representation example.

The power of the directed graph is that data and control dependencies are explicitly expressed, as are the exact serial-parallel relationships of the set of operations to be executed. Unlike the conventional architectures, where execution of a program is managed by a program counter in a sequential manner, execution of a data-flow program is data-driven; an instruction proceeds on its own when its operands are available.
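The firing rule can be made concrete with a small sketch. The following hypothetical Python fragment (the graph encoding, node names, and helper function are assumptions made for illustration, not part of any machine described here) evaluates a tiny directed graph by repeatedly firing every node whose operands have arrived, so that independent nodes may fire in the same sweep.

```python
# Minimal sketch of data-driven (data-flow) execution: a node fires as soon
# as all of its input operands are available, regardless of program order.
# Hypothetical example graph for f(a, b) = (a + b) * (a - b); the node names
# and dictionary layout are illustrative assumptions only.
import operator

nodes = {
    "n1": {"op": operator.add, "inputs": ["a", "b"], "output": "s"},
    "n2": {"op": operator.sub, "inputs": ["a", "b"], "output": "d"},
    "n3": {"op": operator.mul, "inputs": ["s", "d"], "output": "f"},
}

def run(graph, tokens):
    pending = dict(graph)
    while pending:
        # Fire every node whose operands are present; n1 and n2 may fire in
        # the same sweep, which is the source of the parallelism.
        ready = [k for k, n in pending.items()
                 if all(i in tokens for i in n["inputs"])]
        if not ready:
            raise RuntimeError("deadlock: no node has all of its operands")
        for k in ready:
            n = pending.pop(k)
            tokens[n["output"]] = n["op"](*(tokens[i] for i in n["inputs"]))
    return tokens

print(run(nodes, {"a": 5, "b": 3}))   # {'a': 5, 'b': 3, 's': 8, 'd': 2, 'f': 16}
```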
Therefore, the data-driven approach allows a large number of instructions to be executed simultaneously.

Numerous data-driven machines have been reported in the past few years [24-28]. These include a series of machines with upgrading capability described by Dennis and Misunas [24-25]. Fig. 2.6 shows a block diagram of a basic data-driven machine from Dennis's group. In this machine, the instructions of a data-flow program are stored in the instruction memory cells. A data-flow instruction cell consists of a number of operand fields and destination fields indicating which other instruction cells should receive the result from this instruction cell. An example of such an instruction cell is shown in Fig. 2.7. When an instruction cell fires, the data in the cell which will be needed in later phases of execution are transmitted in a packet to an arbitration network, which routes the packet to one of a number of processors. When a processor completes its execution, a result packet is formed and transmitted to a distribution network. The distribution network transmits the result value computed by the processor to each destination.

Figure 2.6. The block diagram of a basic data-driven machine [24] (operation and decision units, control network, distribution network, an instruction memory of cells 0 through n-1, and an arbitration network).

Figure 2.7. A data-flow instruction cell example (fields: opcode, flag, operand (1), operand (2), destination address (1), destination address (2)).

Advantages of the data-driven architecture include flexibility in the machine design, flexibility in program control, and exploitation of parallelism at all program levels. Disadvantages of the data-driven architecture include difficulties in implementing familiar data structures such as arrays and high overhead for communication between functional units of the machine [29-30].

2.5 Local Computation and Communication

The concept of defining local and global variables developed in structured programming languages suggests a similar approach for researchers who are developing VLSI computing systems. As predicted by Mead, future VLSI computing systems will have hundreds of processing elements clustered together on a single silicon chip [2]. The issue of how to coordinate these processing elements in order to improve the overall processing power has become more critical than before. An interprocessor communication network seems to be practical only if the amount of information flow among these processing elements is smaller than the number of tasks to be performed on them. To avoid a severe jam of information flow resulting from the use of a global communication network, one possible solution is to group a number of processing elements which interact most frequently into a block; thereby, data and control messages can move locally on much shorter communication paths [6,29].

Computer architectures based on the above approach have been reported [6,29]. One such machine, proposed by Kuck et al. [6], is shown in Fig. 2.8. In this machine, local control structures and the functional tasks invoked by these structures are fully distributed over blocks of processor clusters under a global controller. Compound functions or local computations are performed concurrently on each of these processor clusters following the execution of their local control in a dependency-driven fashion.
A two-level communication network is provided; the local interconnection network is used for both intracluster and adjacent-cluster communications, while a global interconnection network provides facilities for both synchronous and asynchronous transfers of data or code from the global memory to the clusters or vice versa.

Figure 2.8. A multiprocessor system for the fast execution of numerical programs [6].

As described in Sections 2.3 and 2.4, the architectures of the data-driven machines and of the systolic arrays represent innovations in computer design built to meet specific needs. The data-driven architecture supports asynchronous execution of both the control and the function structures, while execution of these two structures on the systolic array is most likely to be synchronous rather than asynchronous. By combining these two machines' characteristics, it is possible to design a prototype machine which can be more useful and cost-effective than either of the other two. As proposed, this new class of machines will support asynchronous execution of the global control structures in a data-driven fashion and execute the functions invoked by local control structures synchronously on a local homogeneous modified systolic array. As described above, information communication in these array machines is also organized on a global-level and local-level basis. Inside the arrays, directed paths or hierarchical paths provide for fast local data transfers. At the global level, a distribution network is used to transfer global data and microcoded control instructions.

CHAPTER III

PROGRAMMABLE SYSTOLIC ARRAYS

In this chapter, we provide the requisite groundwork for the development of a new class of array architectures. The array structure to be considered here is a modified systolic array, or programmable systolic array [5]. First, we outline the basic programmable systolic array structure and its properties in Section 3.1. Following that, we illustrate the array's ability to implement a second-order recursive equation with nonconstant coefficients. We conclude in Section 3.3 with a discussion of the space-time complexity of this array structure and compare its features and general usefulness with those of the systolic array.

3.1 Functional and Structural Description

In general, a programmable systolic array (PSA) has a structured configuration with a simple and interconnectable module as its building block. A functional and structural description of this PSA follows:

* Processing Module (PM)--The processing module is the basic building block of the PSA. It is a localized computational center and a data router. For our purposes here, each PM is constructed by connecting two pairs of interlinked computing elements (CE's) and control buffers (CB's) in a ring configuration (Fig. 3.1). But in general a PM may have more than two pairs. Also included in the processing module is a global I/O port which provides communication paths for the module.
* Computing Element (CE)--The computing element is a coprocessor on which elementary FLP operations are performed. It can be simply a numeric coprocessor such as the Intel 8087 [31] or a specially designed processor such as the inner-product step processor used in systolic arrays [8,9]. The choice is based on the computational demands of the class of algorithms to be implemented on the machine.

* Control Buffer (CB)--The control buffer is basically a data controller which directs the flow of incoming and outgoing data to and from the processing module according to their specific header and address destination tags. Each CB contains a number of buffer registers and a router. When the transfer of a tagged data item is made, the header is first decoded by the router to determine the destination ports to which this data item is to be sent, and the address is then applied to lead this data item to the proper location once transmitted on the port. Specifically, there are two types of tags used in this array architecture. The tags attached to a global data item are not permanent and are removed when the transfer is complete. But those attached to a local data item are preset and embedded in a particular buffer register and are used exclusively to convey the data stored in the register.

Figure 3.1. A ring-structured processing module.

To support the global and local hierarchy concept, each control buffer is multiply connected. A global I/O port, shared by the control buffer pair, provides communication for the processing module with the outside world, while local I/O ports are used for intermodule communications. The structure of a processing module varies depending on the number of local I/O ports in the module. It can be a hexagon, an octagon, a decagon, etc. A PM with N pairs of local I/O ports is shown in Fig. 3.2, where the subscripts ℓ and g represent the local port and global port, respectively. The index k is used to number the module, while the index j denotes the jth local port in the module. The number of local I/O ports associated with a processing module can be enlarged, but there is a limit on the increase of this number due to fan-out limitations.

Figure 3.2. A PM with N pairs of local and one pair of global I/O ports.

* Computation Series (CS)--A computation series is defined as a finite sequence of identical step computations with the requirement that each succeeding step computation in the series be driven by the result from the previous step. Conceptually, the forming of a computation series is derived by applying the data-driven principle to its execution [24-25]. The complexity of the step computations varies: it can be a simple elementary FLP operation such as addition or multiplication, or it can be a term of FLP operations such as (a + b x c), or it can even be as complex as an inner product Σ(a_i b_i). A computation series can represent either intercomputations or intracomputations, depending on how it is executed inside the array. An intracomputation series refers to computations that are performed within a processing module, while an intercomputation series is taken sequentially on a step-by-step processing module basis. In general, a PSA built with N processing modules will be able to compute N intracomputation series and one intercomputation series simultaneously.
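As a concrete, purely illustrative rendering of this definition, the short Python sketch below chains identical inner-product step computations of the form (a + b x c), with each step driven by the result of the previous one; the function names and data values are assumptions for the example, not part of the PSA specification.

```python
# Illustrative sketch of a computation series: a chain of identical
# inner-product step computations (a + b*c), where each step is driven by
# the result of the step before it.  Names are assumptions for the example.
def step(a, b, c):
    """One inner-product step computation, a + b*c."""
    return a + b * c

def computation_series(y0, operand_pairs):
    """Produce y1 -> y2 -> ...; step i fires only once y_{i-1} exists."""
    y = y0
    series = []
    for b, c in operand_pairs:      # each (b, c) pair belongs to one step
        y = step(y, b, c)           # the previous result drives this step
        series.append(y)
    return series

# y1..y4 for a starting value of 0 and four operand pairs:
print(computation_series(0.0, [(1, 2), (3, 4), (5, 6), (7, 8)]))
# [2.0, 14.0, 44.0, 100.0]
```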
An example of these intra- and intercomputations is illustrated graphically in Fig. 3.3.

* Global Series Control (GSC)--A global series control signal, which is under the control of the host processor, initiates a new intercomputation series.

* Local Series Control (LSC)--A local series control signal, which is under the control of the control buffer, initiates a new intracomputation series.

* Global Data (GD)--The global data are pumped either into or out of the PSA under the control of the host processor.

* Local Data (LD)--The local data circulate within the array under the control of the control buffer. They can be intra types or inter types depending on where they are located.

* Constant Data (CD)--The constant data are stationed in the registers of the control buffer. Usually, these data can be preloaded by the host processor before computing commences.

Figure 3.3. (a) A computation series, y1 -> y2 -> y3 -> y4. (b) Sequence of intracomputations executed on a PM. (c) Sequence of intercomputations executed step-by-step on a three-PM array.

In a programmable systolic array, the execution of concurrent functions is represented as the parallel execution of inter- and intracomputation series. Each intracomputation series corresponds to a particular function, while each intercomputation series indicates the amount of interaction among these functions. The execution of these series can be visualized as rotating wheels. The principal wheel, which represents the intercomputation series, will rotate against the inner wheels of intracomputation series. The force which is applied to initiate the moving of a wheel is in reality the series control (SC) signal. If all of the rotating wheels are set to move in the clockwise direction, conflict-free computations result. This is accomplished by feeding the address codes and the control headers into the wheels to steer the direction. In Fig. 3.4, we illustrate this rotating wheel computing structure concept.

Figure 3.4. A rotating wheel computing structure example.

3.2 Sample Algorithm Implementation

The utility of this programmable systolic array structure can be illustrated through implementation of a second-order recursive equation of the following general form:

Y_i = C_i + Σ_{j=i-2}^{i-1} (c_ij X_j + d_ij Y_j),   2 ≤ i ≤ n,   Y_0 = Y_1 = 0,

where the C_i are constants, the c_ij and d_ij are precomputed weights, the X_j are input data, and the Y_j are output results. Since we are interested in a modular design, the computing of the outputs Y_i is distributed over a number of interconnected processing modules. In this particular application, three processing modules PM(1), PM(2), and PM(3) connected in a triangular configuration are used, with each PM(k) assigned to compute the outputs Y_{k+3ℓ+1}, ℓ = 0, 1, 2, ..., etc.

The specific design of a PM is of major concern and consists of the design and interconnection of a CE and a CB. Based on the inner-product type of computation used in computing the outputs Y_i, the grain size of the CE is chosen to be an inner-product step computation of (a + b x c). Since a double recurrence occurs in each computation of Y_i, the CB possesses dual local I/O ports. The fully connected PSA with specifications for each of its building components is shown in Fig. 3.5. And to understand how a second-order recursive equation is evaluated on such an array, it is instructive to note the flow of the intercomputation series and those of the intracomputation series within the array. The intercomputation series is chosen here to be
Y_1 --> Y_2 --> Y_3 --> ... --> Y_n.

And the intracomputation series are those which advance the computing of Y_i, that is, series of the form

[C_i + c_{i,i-1} X_{i-1}] --> [(C_i + c_{i,i-1} X_{i-1}) + c_{i,i-2} X_{i-2}] --> [((C_i + c_{i,i-1} X_{i-1}) + c_{i,i-2} X_{i-2}) + d_{i,i-2} Y_{i-2}].

Figure 3.5. The triangular PSA for solving a second-order recursive problem.

Execution of the intercomputation series is taken along the perimeter of the array, with each step computation Y_{k+3ℓ+1} being computed by CE(k,up), while three concurrent intracomputation series are executed simultaneously within the ring-structured processing modules. Before the computation initiates, the constants and the precomputed weights, the address codes, and the control headers are loaded into the registers of the CB's. Following execution of the series controls, the input data X_{k+3ℓ} are alternately pumped into the I_g(k) ports. Accordingly, intra- and intercomputation series are computed. And, a few cycles later, the computational results Y_{k+3ℓ+1} are pumped out at the O_g(k) ports; simultaneously, they are broadcast to the addressed modules via the intermodule connection paths to stimulate new computations. We illustrate the first eight computational steps in Fig. 3.6, and a more detailed description of the data flow timing for this example is provided in Table 3.1.

Figure 3.6. Computing steps for a second-order recursive equation on a triangular PSA.

Table 3.1. Data flow timing example: for each of the first eight computation steps, the operations performed by CE(k,up) and CE(k,down) in each of PM(1), PM(2), and PM(3), together with the associated control-buffer transfers of the inputs X and the results Y.

3.3 Discussion

The triangular array executes an n-input second-order recursive computation in O(n) time units after the initial preload step. Basically, the array's performance bears some resemblance to that of a pipeline architecture [32-33]. A start-up time, t_x, is required to set the array for later pipelining operations. It includes the time, t_h, to preload the tags of header and address, and the initial computing time, t_i. As in a 3-stage pipeline, the second-order recursive computation requires an initial computing time of 3 x t_y based on the unit time latency t_y.
Since a multiply and an add operation are performed during each unit time, after adding the data transfer time, t_c, we can express the latency of the pipeline operation as

t_y = t_p + t_d + t_c,

where t_p is the time of a multiply operation and t_d is the time of an add operation. Thus, for an n-input second-order recursive problem the execution time, T_r, on a triangular PSA can be expressed as

T_r = t_h + 3(t_p + t_d + t_c) + n(t_p + t_d + t_c) = t_h + (3 + n)(t_p + t_d + t_c).

As for a von Neumann machine, it takes 6 multiply operations and 6 add operations for each second-order recurrence. Thus, by assuming the same data transfer time for each operation, the execution time, T_v, on such a machine is

T_v = n(6t_p + 6t_d + 12t_c) = 6n(t_p + t_d + 2t_c).

Therefore, the PSA represents a speed-up of

T_v / T_r = 6n(t_p + t_d + 2t_c) / [t_h + (3 + n)(t_p + t_d + t_c)].

The significance of this speed-up stems in part from the concurrent operation of three processing modules, in part from the 2-stage pipeline operation in each module, and also from applying data-driven computations which, as observed, have reduced the overhead of the so-called von Neumann bottleneck [10] by a factor of 3. Like the pipeline operation, the PSA will have a maximum speed-up of 6 provided the number of operations n is much greater than the order of recurrence.

The programmable systolic array described here excels over the systolic array in two respects. First, the PSA possesses a dynamic problem-solving capability. Through microprogramming different control headers into the control buffers, a PSA can be applied to solve a variety of problems, as compared to a static systolic array, which is specially designed to solve a particular problem. For example, the described triangular PSA can be used in solving a second-order recursive problem, a 3 x 3 matrix multiplication, an inner-product computation, or any computational task which can be decomposed into up to three concurrent operations; this is not possible with a systolic array. Second, the PSA contains a set of internal buffer registers, which provide a means to preload the constant data, thereby reducing the traffic density at the global I/O ports. For example, the precomputed weights and the constant data in the second-order recursive computation described in the previous section, or the elements in one of the two matrices in the matrix multiplication, can be preloaded. But there is an added cost in using the PSA. The host processor must interact with a more complex control structure. Moreover, the control buffer and the amount of intermodule connections cause the PSA to reduce the maximum number of computing elements per unit chip area.

CHAPTER IV

MIXED SYSTOLIC ARRAYS

Attempts to broaden the systolic array's potential usefulness have been made either through designing new structures of systolic systems, such as systolic priority queues [34], or through modifying the systolic array. Examples of the latter approach include programmable systolic arrays [5] and cellular systolic arrays for the dynamic programming computation and the transitive closure problem [35]. Designing new systolic structures for specific applications has its inherent limitation because many specialized VLSI devices would be required, i.e., one for each algorithm. However, the modifying approach may be more fruitful. Methodologically, this approach implements control structures or switching circuits into the systolic array, or simply includes control information in the data packets [5,35,36].
In this chapter, we present strategies for mixing control buffers and computing elements in the array structure. The word mixing is employed here to indicate the presence of both control buffers and computing elements within the arrays and to specify, as a result, how these arrays' characteristics change from their original ones. First, the basic structure of a mixed systolic array (MSA) is described. Next, in Section 4.2, we discuss the impact that different mixing ratios have on a given uniform network and how this relates to global vs. local data requirements for a particular algorithm. This is followed by the discussion of partial uniform mixing and various MSA structures.

4.1 Structure of Mixed Systolic Arrays

The first step toward the modification of a systolic array is through mixing, a strategy of incorporating both control buffers and computing elements into an array. A control buffer is a data router; it directs the flow of incoming and outgoing data to and from the computing element according to their specific data type headers and destination address tags, but it is also a data and control code buffer, providing a local storage queue for data-flow instruction cells. For our purposes, a computing element is a coprocessor on which elementary arithmetic operations (+,-,x,/) are performed.

A mixed systolic array (MSA) differs from Kung's systolic arrays in a number of ways [37]. Structurally, a mixed systolic array is formed by using two basic elements, as compared to the original array, which contains only one basic computing element. If the basic computing elements and the control buffers are not uniformly placed in the array, an MSA displays some degree of irregularity in its structure. Functionally, an MSA executes its computational steps according to the sequence of programmed control codes stored in the control buffers. Therefore, a given implementation may be applied to solve a broader spectrum of user design algorithms than a conventional systolic array. Geometrically, an MSA communicates with the outside world through its control buffers, which are considered the masters. This master-slave MSA architecture is more easily interfaced with other chips than a slave-based systolic array chip.

Basically, the structure of an MSA is determined by the mixing profile. By the term uniform mixing, we mean there is a subarray or basis from which the array can be constructed. We consider a uniformly mixed array a regular array. Fig. 4.1 shows a mixed systolic array constructed from a diamond-like basis. When mixing is placed with emphasis on some particular regions in the array, it yields an irregular array structure. And an irregular array is partially regular if it can be segmented into two or more regions which are each uniformly mixed (see Fig. 4.2).

Figure 4.1. A mixed systolic array constructed from a diamond-like basis.

Figure 4.2. A partially regular hexagonal mixed systolic array composed of a central region and an outer shell. The latter is constructed from a triangular basis.

The mixing profile determines the possible frames of data-flow patterns within the MSA. Unlike the original systolic array, on which execution of an algorithm is based on two-dimensional or two-way pipelined data flows moving along preformed paths from one side of the array to the other, an MSA, in contrast, has reconfigurable pipelined paths.
Therefore, algorithms with similar global vs. local data requirements can all be effectively executed on it, rather than on numerous systolic arrays.

The mixing profile contains information about the mixing density and the boundary conditions. The mixing density function, ρ, is defined as

ρ = N_CB / (N_CB + N_CE),     (4.1)

where N_CB and N_CE are the number of control buffers and the number of computing elements contained in the array, respectively. When the mixing density is high, the array most likely will be used for algorithms which require more complex data routing and less complex computing. However, an array with a low mixing density will be applied primarily to algorithms which have simple data routing but require a large amount of computing. There are two limiting cases: when ρ = 0, the MSA reduces to a systolic array; and when ρ = 1, the MSA reduces to a reconfigurable interconnection network with no computing power.

The boundary conditions in an MSA are represented by a boundary condition function, Ω:

Ω = N_CB' / N_CB,     (4.2)

where N_CB' is the number of control buffers placed on the boundary of the array. Since the control buffers on the boundary of the MSA are the communication centers, connections from these centers to the outside world provide the basic I/O operations. Therefore, the evaluation of Ω is a direct measure of the complexity of the I/O connections. Thus, the proper choice of ρ and Ω for a particular implementation relates directly to the ratio of the global and local data requirements of the algorithms for which the MSA is intended.

4.2 Uniform Mixing

Although the number of mixing profiles which yield a regular array structure is large, only a few of them appear to be practical for real algorithm applications. What's more, a balance must be achieved between the computing power and the data routing capability in order to maximize resource utilization and throughput. For these two reasons, we will focus here only on those profiles which yield a geometric balance between the control buffers and the computing elements in the structure of their corresponding bases. Also, for ease of implementation we assume that each control buffer has the same physical size as the computing element.

The hexagonal array is a very important array structure because of its utilization of planar communication and its high packing density. In a hexagonal array the total number of elements, N, contained in the array can be evaluated from the array's edge length, n, as

N = 2[n + (n + 1) + (n + 2) + ... + (2n - 2)] + (2n - 1) = 3n^2 - 3n + 1.     (4.3)

The smallest hexagonal array is the one which has size N = 7 when n = 2. Since most of the array structures of interest are subsets of a hexagonal topology [30] and a hexagonal array of any size can be grown from such a structure with sharing or overlapping, this elemental array structure conveniently serves as the mixing basis. When this smallest hexagonal array is mixed with control buffers, the basis-mixing density, ρ_b, in the basis can be one of the following:

ρ_b = i/7,     i = 1, 2, ..., 7.

But mixing profile constraints state that each control buffer to be placed into a basis (1) will have the same number of computing elements among its nearest neighbors, except the one which is to be placed in the center of the basis, and (2) will seek to have the maximum number of computing elements among its nearest neighbors.
The first constraint assures uniform mixing in the mixed array when it is grown from such a basis, while the second maximizes the computing power of the mixed array. Based on these two constraints, the set of computing elements neighboring each control buffer with respect to the above basis-mixing density set is determined. The bases corresponding to the first three mixing profiles are shown in Fig. 4.3.

Figure 4.3. Hexagonal array bases with (a) ρ_b = 1/7, (b) ρ_b = 2/7, and (c) ρ_b = 3/7.

MSA's grown from these particular bases are intended for computational use. With an increased fraction of control buffers being mixed into the arrays, we are interested in knowing to what extent the computing power has been sacrificed, as well as what constraints must be added due to the presence of the control buffers. The hexagonal MSA's grown from the bases in Fig. 4.3 are presented here for comparison. The hexagonal arrays, as shown in Fig. 4.4, are formed (or grown) by adding a new basis adjacent either to each of the growing bases' edges or corners.

Figure 4.4. (a) A regular hexagonal mixed systolic array constructed from a ρ_b = 1/7 basis. (b) A regular hexagonal mixed systolic array constructed from a ρ_b = 2/7 basis. (c) A regular hexagonal mixed systolic array constructed from a ρ_b = 3/7 basis.

The mixing density function, ρ, and the boundary condition function, Ω, in these MSA's are given as follows:

Case 1: (MSA's constructed from a ρ_b = 1/7 basis)

The number of control buffers in this type of MSA is given by

N_CB = 6[p + (p - 1) + ... + 0] + 1,   where p = ⌊(n - 1)/2⌋.

If n is odd, N_CB = (3n^2 + 1)/4. If n is even, N_CB = (3(n - 1)^2 + 1)/4. Thus,

ρ = (3n^2 + 1) / [4(3n^2 - 3n + 1)],   n odd;
ρ = (3(n - 1)^2 + 1) / [4(3n^2 - 3n + 1)],   n even.

The number of control buffers on the boundary of this type of MSA is given by

N_CB' = 3(n - 1),   n odd;
N_CB' = 0,   n even.

Thus,

Ω = 12(n - 1) / (3n^2 + 1),   n odd;
Ω = 0,   n even.

Case 2: (MSA's constructed from a ρ_b = 2/7 basis)

The number of control buffers in this type of MSA is given by

N_CB = 2[p + (p + 1) + ... + (n - 1 - n(mod 2))] + (n - 1 + (n - 1)(mod 2)),

where p = (n - 1)/2 for n odd and p = (n + 2)/2 for n even. If n is odd,

N_CB = 2[(n - 1)/2 + ((n - 1)/2 + 1) + ... + (n - 2)] + (n - 1) = (3n^2 - 4n + 1)/4.

If n is even,

N_CB = 2[(n + 2)/2 + (n + 4)/2 + ... + (n - 1)] + n = (3n^2 - 2n)/4.

Thus,

ρ = (3n^2 - 4n + 1) / [4(3n^2 - 3n + 1)],   n odd;
ρ = (3n^2 - 2n) / [4(3n^2 - 3n + 1)],   n even.

The number of control buffers on the boundary of this type of MSA is given by

N_CB' = n - 1,   if n is odd;
N_CB' = 2(n - 1),   if n is even.

Thus,

Ω = 4(n - 1) / (3n^2 - 4n + 1),   n odd;
Ω = 8(n - 1) / (3n^2 - 2n),   n even.

Case 3: (MSA's constructed from a ρ_b = 3/7 basis)

The number of control buffers in this type of MSA is given by

N_CB = n(n - 1) + 1,   if n = 3i - 1, i = 1, 2, ...;
N_CB = n(n - 1),   otherwise.

Thus,

ρ = (n^2 - n + 1) / (3n^2 - 3n + 1),   if n = 3i - 1;
ρ = (n^2 - n) / (3n^2 - 3n + 1),   otherwise.

The number of control buffers on the boundary of this type of MSA is given by

N_CB' = 2n - 1,   if n = 3i - 1 or n = 3i + 1, i = 1, 2, ...;
N_CB' = 2n - 3,   if n = 3i.

Thus,

Ω = (2n - 1) / (n^2 - n + 1),   if n = 3i - 1;
Ω = (2n - 1) / (n^2 - n),   if n = 3i + 1;
Ω = (2n - 3) / (n^2 - n),   if n = 3i.

The mixing density function measures reveal that the hexagonal MSA's (Fig. 4.4) contain roughly two thirds to three quarters of their constituents as computing elements. Since in hex-connected systolic arrays only a
Since in hex-connected systolic arrays only a 59 portion (one third) of the computing elements is active at a time, the presence of control buffers in these MSA's leads to no reduction in the real computing power. The boundary condition function measures show a similar linear decrease in n in these MSA's. So, I/O communication de- crease as n-1 relative to on-chip communication. Thus, for a proper balance between global data and local data in algorithms executed on such uniformly mixed arrays, global data requirements must also decrease as the inverse of the local data. But this constraint can be overcome by relaxing the uniform mixing requirement. 4.3. Partial Uniform Mixing The design parameters of the uniformly mixed systolic arrays described in the previous section emphasize balanc- ing the computing power and the local data routing capabili- ty inside a given array. However, since the number of I/O pinouts is a major constraint in VLSI implementation [22] and since balancing on-chip processing with I/O is a must on a VLSI chip [35], an MSA should be designed to include the built-in I/O structure as well. Unlike the systolic arrays, in which I/O can take place only at the boundary computing element, in an MSA, I/O is granted to the control buffer. By carefully mixing control buffers into MSA's, various I/O structures can be created. An I/O structure, in general, can be constructed as being either centralized or 60 decentralized depending on the global data transfer band- width. A centralized I/O structure has only one single port, while a decentralized I/O structure has multiple ports. One important option is mixing a given MSA into two uniform regions, one region for computing and local data routing purposes and the other for I/O or interconnection purposes. Thus a partially regular array is formed. Topo- logically, a partially regular array has a centralized input or output structure, or both, depending on the speci- fic applications for which it is designed. The region which is mixed for exclusively the I/O structure or interconnec- tion purposes contains control buffers only; that is, the mixing density has a value of one. Since within a hexagonal array structure, the most likely spot to be mixed for the centralized I/O structure is in the central region, we consider only the partially regular array which has a structure shown in Fig. 4.2. Based on the built-in I/O structure (see Fig. 4.5), an MSA can be classified as follows: (1) CICO (centralized input centralized output). A CICO type of MSA is characterized by having a relatively low I/O bandwidth for its global data. Such an MSA has a fast response time and is very suitable for on-line pipelined database systems [38]. (2) CIDO (centralized input decentralized output). A Figure 4.5. 6] (d) Four classes of mixed systolic arrays: (a) CICO, (b) CIDO, (c) DICO, and (d) 0100. The arrows indicate the direction of data flow. 62 CIDO type of MSA is characterized by having a relatively low input bandwidth but a relatively high output bandwidth for its global data. Such an MSA is very useful for centralized massive system control applications or for algorithms which can be decomposed into hierarchical computations [27]. (3) DICO (decentralized input centralized output). A DICO type of MSA is characterized by having a relatively high input bandwidth but a relatively low output bandwidth for its global data. Such an MSA is best used for distributed stream-oriented computing systems [39]. (4) DIDO (decentralized input decentralized output). 
A DIDO type of MSA is characterized by having a relatively high I/O bandwidth for its global data. Such an MSA will likely be applied to simple two-dimensional pipelined operations [8-9]. This DIDO structure is the classic systolic array architecture.

By decomposing an MSA into two basic regions, the constraint on uniformly mixed systolic arrays can be relaxed, since there is now less of a dependence of computing power on I/O bandwidth as a function of array edge size. This is so because the boundary condition function, θ, in a partially regular array can be reduced by a factor of 2 over the uniform array to meet pinout and/or global I/O data requirements.

CHAPTER V

DATA-FLOW COMPUTATIONS

Computations are carried out in a data-driven fashion on an MSA. The control buffer is the data-driven master, and the computing element is the slave. The driving mechanism takes place in the control buffer; when the required data are available, a data-flow instruction cell is executed on a neighboring computing element [24]. One important consequence of applying data-driven principles is that computations can be uniformly distributed over self-directed computational rings within the MSA. Thus, a computational ring is the basic data-flow computing structure in the MSA.

In this chapter, we focus on data-flow computations on an MSA. First, the store-and-forward control buffer is presented. Next, in Section 5.2, composite cell computations on computational rings are illustrated. This is followed in Section 5.3 by a description of forming composite cells on a data-flow graph. And finally, in Section 5.4, we discuss data flow within the MSA and compare the array's performance to that of the systolic array.

5.1. Store-and-Forward Control Buffers

Fundamentally, the control buffer is designed for store-and-forward purposes. It stores data-flow instruction cells at load time, and it arbitrates incoming data and forwards outgoing data at execution time. The basic structure of the control buffer is illustrated in Fig. 5.1. The components consist of an input FIFO, an output buffer, a queueing FIFO, a driven memory, a linking buffer, and a router. The I/O buffers are connected to the I/O ports for I/O operations. The queueing FIFO is connected to a neighboring computing element for disciplining executable instruction cells. The linking buffer is used for internetwork purposes. The driven memory is applied exclusively for the purpose of storing data-flow instruction cells. And the router is used to arbitrate the input data to one of the queueing buffer, the output buffer, the linking buffer or the driven memory.

Figure 5.1. The structure of the control buffer.

Each data item entering a control buffer is identified by an attached data type header, which specifies a certain operation such as a memory read/write, an output operation, or a by-pass operation, and is guided by an attached address tag which leads the data item to its destination. When a write operation occurs, the data item is written into the addressed instruction cell within the driven memory of the control buffer; simultaneously, a flag is raised to indicate that the content of that particular cell is ready to be driven.
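To make the router's arbitration concrete, the following Python sketch dispatches one tagged data item to the queueing FIFO, the output buffer, the linking buffer, or the driven memory according to its data type header. The header encodings, class name and buffer names are illustrative assumptions made for this sketch only; they are not the hardware specification of the control buffer.

    # Minimal sketch of the control buffer's router; names and encodings are
    # assumptions for illustration, not the dissertation's hardware design.
    class ControlBuffer:
        def __init__(self):
            self.driven_memory = {}    # address tag -> instruction cell content
            self.ready_flag = {}       # address tag -> 1 when ready to be driven
            self.queueing_fifo = []    # cells awaiting the neighboring computing element
            self.output_buffer = []    # results leaving through the I/O port
            self.linking_buffer = []   # items forwarded for internetwork purposes

        def route(self, header, addr=None, value=None):
            if header == 'write':            # store into the addressed cell
                self.driven_memory[addr] = value
                self.ready_flag[addr] = 1    # raise the ready-to-be-driven flag
            elif header == 'read':           # drive the cell out for execution
                if self.ready_flag.get(addr):
                    self.queueing_fifo.append(self.driven_memory[addr])
                else:                        # not ready: reversed into a write (see below)
                    self.route('write', addr, value)
            elif header == 'output':         # result of a terminated computation
                self.output_buffer.append(value)
            elif header == 'bypass':         # pass through toward the linking buffer
                self.linking_buffer.append(value)

    cb = ControlBuffer()
    cb.route('write', 'A1', ('add', 2, 3))
    cb.route('read', 'A1')        # the cell is ready, so it moves to the queueing FIFO
    print(cb.queueing_fifo)       # [('add', 2, 3)]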
A read operation, on the other hand, drives the content of the accessed instruction cell out into the queueing buffer, ready for execution on a neighboring computing element. A read operation resulting from driving a not-ready instruction cell is called "aborted." An aborted read operation is forced to reverse itself; that is, to become a write operation. An output operation is the result of a terminated computation. The by-pass operation is requested when the data item seeks no more than a pass through the control buffer and its neighboring computing element.

5.2. Computational Rings

A computational ring is the basic data-flow computing structure in the MSA. It is formed by connecting control buffers and computing elements alternately into a ring configuration. A computational ring formed with p pairs of control buffers and computing elements is called a p-stage computational ring. Fig. 5.2 shows a typical two-stage computational ring within a hexagonal MSA.

Figure 5.2. A typical two-stage computational ring within a hexagonal MSA.

In a computational ring, computations are carried out in a pipelined fashion with the constraint that each succeeding computation requires a data item from its previous computation. In other words, the result of a computation becomes the driving force for the execution of its succeeding computation. For this reason, the term "driving operation" is used to describe this computation mechanism. Mathematically, such computations may be represented by a composite expression of the form

    f_l(a_l, ..., f_i(a_i, ..., f_2(a_2, f_1(a_1, a_0)) ...) ...),   0 < i <= l,

in which the f_i's denote either the elementary arithmetic functions (+, -, x, /) or their results, and the a_i's are the data. In data-flow form, these composite computations can be compiled into a composite cell of data-flow instructions as

    ((f_1|a_1, a_0), (f_2|a_2), ..., (f_l|a_l)).

When executed, this composite cell proceeds as

    Begin (Composite Cell Computation)
        f_1 := f(a_1, a_0);
        For i := 2 to l Do
        Begin
            If Flag(A_{i-1}) = 1 Then
            Begin
                a_i <-- M(A_{i-1});
                f_i := f(f_{i-1}, a_i)
            End
            Else
            Begin
                M(A_{i-1}) <-- f_{i-1};
                Flag(A_{i-1}) := 1;
                f_{i-1} <-- M(A_{i-1});
                f_i := f(f_{i-1}, a_i)
            End
        End
    End (Composite Cell Computation)

where M denotes a driven memory operation and the A_i's represent the address tags associated with the a_i's or the results of the f_i's.

Obviously, the continuity of composite computations depends on the availability of the data a_i's. When the data are not available, the computing element may stand idle. In performance, a p-stage computational ring resembles a p-stage asynchronous circular pipeline. The delay in the control buffer is referred to as that of latching operations, while the time taken in the computing element is considered a stage delay in the pipeline. Execution of an example is illustrated in Fig. 5.3.

Figure 5.3. (a) A data-flow directed graph example. (b) A two-stage computational ring. (c) The Gantt chart illustrating the execution of (a) on (b).

As can be seen, four computational steps are taken, and the only transfer times incurred come mainly from the memory driving operations, each of which consumes a memory cycle time. The significance of composite cell execution on a computational ring is that each instruction is locally executed. As a result, communication overhead is reduced to a minimum.
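The driving loop above can also be read as ordinary sequential code. The following Python rendering is a minimal sketch of one composite cell executing against a driven memory; the dictionary-based memory, the ready flags and the wait_for hook are modeling assumptions standing in for the control buffer hardware.

    import operator

    OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul, '/': operator.truediv}

    def run_composite_cell(cell, a0, memory, flags, wait_for):
        """Execute one composite cell ((f_1|A_1), (f_2|A_2), ...) against a driven
        memory.  memory[addr] holds an operand once it has arrived; flags[addr] = 1
        marks the cell content as ready to be driven.  wait_for(addr) models the
        later arrival that re-drives the cell after an aborted read."""
        op1, addr1 = cell[0]
        result = OPS[op1](memory[addr1], a0)        # f_1 := f(a_1, a_0)
        for op_i, addr_i in cell[1:]:
            if flags.get(addr_i, 0) == 1:           # ready: drive the operand out
                a_i = memory[addr_i]
            else:                                   # aborted read becomes a write:
                memory[addr_i] = result             #   buffer the partial result,
                flags[addr_i] = 1                   #   raise the ready flag,
                a_i = wait_for(addr_i)              #   and resume when a_i arrives
            result = OPS[op_i](result, a_i)         # f_i := f(f_{i-1}, a_i)
        return result

    # Example: y = ((a_1 + a_0) * a_2) - a_3 compiled as ((+|A1,a0), (*|A2), (-|A3)),
    # with the operand at A3 arriving late.
    memory, flags, late = {'A1': 2.0, 'A2': 3.0}, {'A1': 1, 'A2': 1}, {'A3': 5.0}
    print(run_composite_cell([('+', 'A1'), ('*', 'A2'), ('-', 'A3')],
                             1.0, memory, flags, wait_for=late.__getitem__))   # 4.0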
Apparently, the purpose of forming computational rings within an MSA is to support locality of computation and data transfer. But this is not all. On a p-stage computational ring, control buffers are linked like a p-stage pipeline. This allows data and control codes to be pipelined into these buffers when the MSA is to be loaded. A savings of a factor of p in the number of I/O pinouts for the MSA is expected.

5.3. Loading and Scheduling of Composite Cells

As in any pipelined operation, keeping the computational ring full is essential to its performance. Returning to our previous example (Fig. 5.3), notice that each of CE(1) and CE(2) is idle once during the four computational steps. It would be more practical if a new composite cell could be initiated for execution immediately when a discontinuity of a driving operation occurs. This could be achieved by reasonably increasing the queueing buffer size. However, two issues arise: the first relates to how a data-flow graph can be transformed into a scheduled set of composite cells; the second concerns how these scheduled cells should be loaded on a particular computational ring. To deal with these two issues, a transformation of the data-flow graph to a simplified time/space grid representation is applied. In this grid representation, a data-flow operation is represented by a darkened circle at a cross point, and a data-flow link is represented by either a vertical arc or a sloped arc. In such a representation, the vertical axes are marked with the computational steps and the horizontal axes are measured by the degree of concurrency. Each computational step represents a stage delay on the computational ring and is assumed to be constant. Concurrent data-flow operations are drawn on different vertical axes but on a single horizontal axis. Fig. 5.4 shows a time/space grid representation for a data-flow graph example.

Figure 5.4. (a) A data-flow directed graph example. (b) The simplified time/space grid representation for (a).

Basically, a composite cell can be formed by grouping the circles along a particular vertical line or by grouping the circles on several different lines. Fig. 5.5 illustrates how particular composite cells can be formed by grouping the circles in the grid representation of Fig. 5.4. A composite cell which is formed by grouping the circles on a critical path in the grid representation is called a critical-path composite cell.

Figure 5.5. An example of vertical grouping of composite cells on a grid representation graph.

Def: Let C_i = ((f_1|a_1), (f_2|a_2), ...) be a critical-path composite cell in a grid representation, and let L(C_i) be the length, or the number of computational steps, in C_i; then L(C_i) >= L(C_j) for every composite cell C_j starting at (f_1|a_1).

A grid representation may consist of a number of critical-path composite cells. C_1 and C_2 in Fig. 5.5 are two such cells, each of which contains six computational steps. If we remove the critical-path composite cells from a grid representation, the remaining grid is called a subgrid. Critical-path composite cells formed in a subgrid contain fewer computational steps than those in the original grid. Again, if we remove those critical-path composite cells from a subgrid, another subgrid remains. Through recursively grouping critical-path composite cells and then removing them from a grid or a subgrid, one can obtain an ordered set of composite cells with decreasing computational steps. It is easy to see that this divide-and-conquer decomposition scheme results in a minimum number of composite cells for a given grid representation.
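The recursive grouping just described can be prototyped directly on a directed-graph representation. The Python sketch below extracts composite cells by repeatedly taking a longest chain of operations (a critical path) and removing it from the remaining grid; the graph encoding and function names are illustrative assumptions rather than notation from the text.

    def longest_path(succ, nodes):
        """Return one longest chain through the acyclic graph restricted to nodes.
        succ maps an operation to the operations that consume its result."""
        memo = {}
        def chain(v):
            if v not in memo:
                tails = [chain(w) for w in succ.get(v, []) if w in nodes]
                memo[v] = [v] + max(tails, key=len, default=[])
            return memo[v]
        return max((chain(v) for v in nodes), key=len)

    def vertical_grouping(succ, nodes):
        """Peel off critical-path composite cells until the grid is exhausted;
        the cells come out with non-increasing numbers of computational steps."""
        remaining, cells = set(nodes), []
        while remaining:
            cell = longest_path(succ, remaining)   # critical path of the (sub)grid
            cells.append(cell)
            remaining -= set(cell)                 # what is left forms the subgrid
        return cells

    # A small data-flow graph: op1 and op2 feed op3; op3 and op4 feed op5.
    succ = {'op1': ['op3'], 'op2': ['op3'], 'op3': ['op5'], 'op4': ['op5']}
    print(vertical_grouping(succ, ['op1', 'op2', 'op3', 'op4', 'op5']))
    # e.g. [['op1', 'op3', 'op5'], ['op2'], ['op4']]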
Scheduling composite cells for execution on a p-stage computational ring can be considered as a mapping problem. For a single composite cell, C_i = (C_{i,1}, C_{i,2}, ...) with C_{i,t} = (f_t|a_t), the scheduling is simply a one-dimensional, one-to-one mapping

    C_{i,t} --> CB((t + k) mod p),   k = 0, 1, ..., p-1,

where the offset index k refers to the kth stage in the computational ring on which the composite cell's first computation C_{i,1} is started. However, when more than one composite cell is to be scheduled, we may refer to a multiple-to-one mapping

    C_{i,t}, C_{j,s} --> CB(r mod p),   r = t + k = s + k',   k, k' = 0, 1, ..., p-1.

Computations which are scheduled through a multiple-to-one mapping are mainly of interaction types such as a fork or a join. A fork activates concurrent driving operations, while a join implements data-flow firing upon two interacting composite cells. By grouping circles which represent data-flow operations either from forks or to joins, the scheduling problem can be extended to a two-dimensional mapping. In Fig. 5.6, we illustrate the horizontal mapping on the time/space grid representation example of Fig. 5.4.

Figure 5.6. An example of horizontal grouping (join grouping and fork grouping).

5.4. Data-Driven Computations

By considering an MSA as a computing network of m tightly-linked p-stage computational rings (for an example, see Fig. 4.2, which has m = 6 and p = 3), an MSA is capable of executing p x m data-flow instructions in one computational step. The "block-driven principle" is applied to describe this phenomenon, which is the firing of a computational block consisting of an arbitrarily large number of composite cells [29].

In data-flow notation, a computational block, B, is represented by an acyclic directed graph of size m by d, where m is the number of computational rings within the mixed systolic array and d is the computation depth, or the critical path, in B. When the computation depth d is constant, the computational block has a fixed size; when it is nonconstant, the computational block has a varied size. Through a data-flow decomposition methodology, any applicative data-flow program can be structured as a sequence of computational blocks (B_1, B_2, ..., B_k). Each block transforms an input data stream into an output data stream [40-41], with B_i of size m x d_i, where d_i is the computation depth in B_i, 1 <= i <= k. Any acyclic directed graph of data-flow computations can be represented by a data record and a collection of composite cells, which are specified by a frame of microprogramming control codes. Therefore, a computational block B_i can be functionally specified by {F_i|D_i}, where F_i denotes the particular frame associated with B_i and D_i denotes the size of the data record.

The performance based on block-driven computations is measured by the average computational rate, R_bar,

    R_bar = k^{-1} Sum_{i=1}^{k} R_i,                            (5.1)

where R_i is the frame computational rate for B_i. Let

    N_i  -- the number of computations in B_i,
    alpha_i -- the frame set-up time for {F_i|D_i},
    d_i  -- the critical path in B_i,
    beta_i -- the communication overhead,
    T    -- the execution time for a single computation, or the delay time in a control buffer,
    h    -- the word-length to pinout-number ratio, and
    f(m) -- the average transfer overhead per computation, in units of T.

Then

    R_i = pN_i / (alpha_i + beta_i + d_i T)                      (5.2)

with

    m^{-1} N_i <= d_i <= N_i                                     (5.3a)
    d_i T <= beta_i <= d_i T + f(m) d_i T                        (5.3b)

    h D_i T / m <= alpha_i <= h D_i T.                           (5.3c)

Therefore, for a uniform computational block (d_i = N_i m^{-1}) being executed on an m-input DICO type of MSA, R_i becomes

    R_i = pN_i / ( h D_i T/m + [1 + f(m)] N_i T/m + N_i T/m )
        = pm / ( [2 + f(m) + h D_i/N_i] T ).                     (5.4)

With block-driven computations, the size of the data record to be loaded into the MSA (i.e., the input global data) is usually much less than the total number of computations (i.e., the amount of local data); therefore, we assume h D_i/N_i << 1. Equation (5.4) thus reduces to

    R_i = pm / ( [2 + f(m)] T ),                                 (5.5)

which is quite similar to the performance of the benchmark hexagonal systolic array, since the latter has a constant computational rate of (2/3) p m T^{-1} [8]. For the CIDO or CICO types of MSA's, the frame set-up factor is h D_i T N_i^{-1}, which may become the bottleneck of the system. As can be seen, the average transfer overhead coefficient, f(m), plays a significant role in the overall performance measure. The decreased computational rate of the MSA, when compared with the systolic array, represents the primary cost of achieving the flexibility of a reconfigurable multiprocessor structure that is capable of implementing more than a single algorithm.
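As a quick illustration of Equations (5.4) and (5.5), the short Python sketch below evaluates the frame computational rate for a uniform block on a DICO-type MSA. The parameter values are arbitrary placeholders chosen only to show the trend; they are not measurements from this work.

    def frame_rate(p, m, N_i, D_i, h, T, f_m):
        """Eq. (5.4): frame computational rate R_i for a uniform block
        (d_i = N_i/m) executed on an m-input DICO-type MSA."""
        return p * m / ((2.0 + f_m + h * D_i / N_i) * T)

    def frame_rate_limit(p, m, T, f_m):
        """Eq. (5.5): the limiting rate when the global data record is small,
        i.e. h*D_i/N_i << 1."""
        return p * m / ((2.0 + f_m) * T)

    # Placeholder parameters: six rings of three stages, unit step time T = 1.
    p, m, T, f_m = 3, 6, 1.0, 1.5
    print(frame_rate(p, m, N_i=600, D_i=20, h=2, T=T, f_m=f_m))   # close to the limit
    print(frame_rate_limit(p, m, T, f_m))                         # pm / (3.5 T)
    print((2.0 / 3.0) * p * m / T)    # benchmark hex-connected systolic array rate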
The hourglass computing model, as characterized by its structure, provides highly con- current activities at both ends which gradually decrease toward the narrow passage. 84 B I o Air-.040 . Distribution :ree Buffer Tree Instruction FIFO Computational Block °acket Global Data Local Sata Transmission Path Qeflection Path Figure 6.l. The modified hourglass computing model. 85 6.2. Hourglass Tree Machine As it was proposed, the novel hourglass computing model aimed at exploring as much as possible the compu- tation locality as well as communication locality in block- driven computations. Based on this model, we synthesize a data-flow tree machine, or an hourglass tree machine. As shown in Fig. 6.2, this particular tree machine contains primarily a pair of "mirrored trees", one is called the buffer tree and the other is the distribution tree. In this section, we describe the structure of these two trees and the functional units associated with the tree machine. Buffer Tree-~The buffer tree is a DICO type of MSA, containing m tightly-linked p-stage computational rings along with a tree linking structure (an example, see Fig. 6.3, a five-level linking structure). It is capable of computing (m x p) FLP operations concurrently and transfer- ring the results in logarithmic unit time. In the buffer tree, there are (m-2) nodes and one root node. Each one of these nodes is a control buffer, at which arriving data will either be buffered down or be forwarded out. Associat- ed with each leaf node is the p-stage computational ring on which composite data-flow computations are performed. Besides functioning as the computing power, the buffer tree plays two other major roles: first, it is an interconnec- tion network for the m computational rings; second, it is a buffering channel between the processing power and the instruction memory cell. 86 O o o —__... a} . o . 0 Global 3 . . Distribution O —» controller 0 ° 0 o . o a o .} o o O _I I Inumqm 1 ‘5 ‘w “s kW Figure 6.2. The structure of the data-flow hourglass computing machine. 87 00000000 000000000 0000000000 00000900000 IAV AV AV AV AV 000.07% a an a €®OOOO I \ I \ II AV AV AV AV COCO/nu a a "3.0000 oooQQQddooo 0000000000 000000000 00000000 Figure 6.3. A five-level linking structure in the buffer tree. 88 Distribution Tree-~The distribution tree is an n- level pipelined binary switching network of CIDO type. It provides the basic mechanism for routing or multicasting streams of global data to a set of data-flow computational blocks; and. at the same time, supervises the firing of computational blocks. There are 2n-l computational blocks of data-flow instruction cells tied. to the lowest-level leaf nodes of the tree. When a data item enters the root node, it is pipelined through the n-level distribution tree based on the appended control code. The address field in the control code contains two parts: a major address and a minor address. The major address specifies what computa- tional blocks the data item is headed for while the minor address selects the location of the destination instruction cell in the specified computational block. For providing proper firing, in each of the 2n-l computational blocks there is a decremented count. The decremented count is initially set tx: a number indicating the amount of dependent data required for block firing. When a data item is written into an instruction cell in that block the decremented count is reduced by one. And when the count reaches zero, the computational block is fired and execu- tion commences. 
Global Controller--The global controller plays two major roles: first, it distributes composite cells of data-flow instructions over the m computational rings at load time; second, it monitors the status of each of the m computational rings at execution time. Within the global controller, there are both a ring count and a decoder array. The ring count is initially set to m. Each time the controller receives an acknowledge signal from a released computational ring, the ring count is reduced by one. When the ring count reaches zero, the global controller starts fetching a new computational block of data-flow composite cells from the data-flow intelligent FIFO, if one is available. The computational block is then decoded on the decoder array, and the resulting composite cells are distributed over the m computational rings for execution.

Intelligent FIFO--The intelligent first-in-first-out unit is used for queueing the computational blocks which have been fired and are ready for execution. Because of the irregular number of data-flow instructions contained in the various computational blocks, each FIFO is designed to manage a fixed number of data-flow instructions.

6.3. Two-Phase Data Transfer Mechanism

Because there are two types of data to be transferred in the same buffer tree network, each node of the buffer tree, a control buffer, is designed to work on a two-phase basis. In the push mode, data are pushed forward from the lower-level nodes to the higher-level nodes; in the pull mode, data are pulled backward in the opposite direction. Each data item is tagged with a destination address field and a one-bit data type header. The width of the address is determined by the tree height--the higher the tree, the wider the field. Specifically, a locally dependent data item has a relative displacement address and a one-bit direction header, while a globally independent data item has an absolute address. The relative displacement address is determined by the distance between the two communicating leaf nodes and by their relative positions (see Fig. 6.4).

Figure 6.4. The data address syntax: a_0 is the data type header; for global data (a_0 = 1), a_m ... a_1 form an absolute address; for local data (a_0 = 0), a_m is the direction header and a_{m-1} ... a_1 form a relative displacement address.

Whether data are to be pushed forward or pulled backward depends on a one-bit mode control obtained by ORing the one-bit data type header with selected bits in the address field. With a tree height of m, the mode control at the ith level, M_i, is given by

    M_i = a_0 OR ( OR_{k=i+1}^{m-1} a_k ),   0 <= i <= m-1,      (6.1)

where OR stands for the logical OR operation and the a_k are the binary values in the address. If the mode control is "1", the data will be pushed forward; otherwise, they will be buffered at the node last visited and stand ready to be pulled backward. Global data, which carry a "1" in the data type header a_0, are thereby always pushed forward through the buffer network. Local data, however, which carry a "0" in the data type header, can never be pushed forward beyond the root level, because the mode control at the root level is always "0" for these data.
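Equation (6.1) amounts to a few bit operations per level. The sketch below, with invented helper names, computes the mode control for every level of the tree from one address word and shows that local data (a_0 = 0) are pushed forward only while some higher-order address bit is set.

    def mode_controls(addr_bits, m):
        """Mode control M_i of Eq. (6.1) for levels i = 0 .. m-1.
        addr_bits[0] is the data type header a_0 and addr_bits[k] is address
        bit a_k; the item is pushed forward at level i whenever M_i = 1."""
        return [addr_bits[0] | max(addr_bits[i + 1:m], default=0) for i in range(m)]

    # A local data item (a_0 = 0) with a small relative displacement is pushed
    # only while some higher-order address bit is set, then buffered for pulling;
    # a global item (a_0 = 1) is pushed forward at every level.
    local_item  = [0, 1, 1, 0, 0, 0]    # a_0 .. a_5 for a tree of height m = 6
    global_item = [1, 0, 1, 0, 1, 0]
    print(mode_controls(local_item, 6))    # [1, 1, 0, 0, 0, 0]
    print(mode_controls(global_item, 6))   # [1, 1, 1, 1, 1, 1]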
The decision is deter- mined by a left-right control, LRi’ at that level, with 1 g i 5 m-l (6.2) where an is the direction header which determines whether the data to be directed to their right or to their left. If the left-right control is "l", the data will be pulled backward by the right son node, and if it is '0", they will be pulled backward by the left son node. 6.4. Hourglass Machine Performance Measures The modified hourglass computing machine is character- ized by its distributed computing structure and its hier- archical communication geometry. The distributed computing structure, as described in Section 5.4, provides a poten- tial computing power of p x m. And utilization of this computing power provides a direct measure on the hourglass machine's computational rate. The hierarchical communi- cation geometry is integrated with a linear intra-ring data path structure and a logarithmic inter-ring data path structure with a combined data transfer bandwidth of (p+l)m-l of which ;>):In is the intra-ring bandwidth and m-l is the inter-ring bandwidth. Since there exists a consider- able difference in the bandwidth as well as the data trans- fer times between these two types of data transfers, the 93 ratio (us the data transfer rate (the number of data trans- fers in a given computational block) between these two types will have a great impact on the hourglass machine's overall performance. Let ui be the data transfer rate corresponding to a data path of i unit times and let logzm V = 2: ui i=1 then the average transfer overhead f(m) in Equation 5.4 can be evaluated as logzm 2: i x u. 1 f(m) 1:2 (6.3) v . Here vua assume each intra-ring transfer is merely a memory driving operation and is taken in a unit time. This can be accomplished with the construction of fast pipelined data paths within the computational rings. In the case that each computational ring has an equal number of inter-ring data transfers with every other ring, 111 can be expressed as u. = , i = 2, 3, ..., logzm (6.4) 94 Thus, f(m) becomes _ (v - ulw'1 logzm f(m) = 2: i (logzm -1) i=2 V - u1 = -l (2 + 3 + ... + logzm) V 1092(2 m) V - u1 = -—-——-—— (log2m + 2) 2V 5 ——— (logzm + 2) (6.5) 2 where E is the ratio of the inter—ring data transfer rate to the intra-ring transfer rate. Algorithms which support uniform inter-ring data transfers show a low degree of locality. However, the data flow of these algorithms usually has a very regular and simple pattern. In another case, each computational ring interacts in a weighted manner, more frequently with rings on shorter data paths. For example, the data transfer rate, ui, has an exponentially decreased funciton of V - u 1 21 1 For such an example, the average transfer overhead, f(m), becomes _ -1 1092m (v — ul)i f(m) = V :2 i=2 21 95 V - ul logzm i = —— z , l 2 3 4 logzm = E (_+- +_+ 000 + ) 4 8 16 m 3 = —£ ' for I“ >> 1. (6'6) 2 As evaluated, algorithms which support an exponential decay inter-ring data transfer rate show a high degree of locality and have a constant average transfer overhead. 6.5 Implementation Considerations The feasibility of constructing a large hourglass machine relies on the assumption that a. great deal of computing elements and control buffers can be integrated onto a single chip. Two issues which concern us most are the I/O connections issue [8-9] and the on-chip intercon- nections issue [22]. In its structure, the hourglass machine shows some promising aspects on these two issues. 
6.5. Implementation Considerations

The feasibility of constructing a large hourglass machine relies on the assumption that a great many computing elements and control buffers can be integrated onto a single chip. The two issues which concern us most are the I/O connection issue [8-9] and the on-chip interconnection issue [22]. In its structure, the hourglass machine shows some promising aspects on both issues. It is observed that (1) the I/O activities take place only at the two ends of the hourglass, so the number of I/O pinouts grows only in proportion to the number of computational rings, and (2) the regular, simple and short communication geometry inside the hourglass substantially reduces the amount of on-chip interconnection. In addition, the following premises make the hourglass computing machine attractive to us.

(1) The hourglass machine supports asynchronous computations based on data-driven principles. This allows the use of fast asynchronous logic or self-timed logic circuitry, which, as has been studied, represents a considerable speed-up over synchronous logic [42].

(2) The use of the boundary control buffer as the I/O communication center relaxes some constraints imposed on the systolic array architecture. The most significant one is that the I/O pipeline rate must be kept the same as that of an inner-product computation. In the hourglass, data can be fed into or pumped out of the machine at a rate independent of the computing element's computational speed.

(3) The DICO and CIDO combination inside the hourglass allows different IC technologies to be used. For example, the centralized region is grown from a basis with a mixing density of one, and this region can be constructed in a fast IC technology such as ECL logic, separate from the other, decentralized region. By this approach, the computational results from the DICO tree can be swept across the narrow passage at a much higher rate than the I/O data rate at the two ends of the hourglass.

(4) The high degree of locality in the structure of the hourglass machine promises the use of a local clocking discipline for data transfer. This avoids using a more complex system-wide clocking discipline on a VLSI chip.

CHAPTER VII

CONCLUSIONS

7.1. Summary

VLSI defines a technology which promises the future development of chips with a circuit density three orders of magnitude greater than that of today's state-of-the-art 32-bit microprocessors [2-3]. This promise, coupled with the increased ability to design multiprocessor systems, has spawned extensive research in computational paradigms that differ from the conventional von Neumann paradigm--the paradigm which, as pointed out by Meyer [43], has dominated the computer industry for the past two decades. In fact, except for a few commercial machines (e.g., some made by Burroughs Corporation), there have been no significant advances in computer architecture since the 1950's. Some so-called advances that might come to mind (e.g., cache memories, instruction pipelining, the microprogramming concept) are not really computer architecture advances; they are improvements in the implementation of a particular computer architecture.

It has become increasingly apparent that, even with this upgrading, the von Neumann paradigm does not provide the computational structure necessary to efficiently implement solutions to many problems [9,10,41]. Kung's systolic array paradigm [8-9] and Dennis's data-flow paradigm [24-28] are both examples of computational alternatives to the von Neumann model--alternatives that promise to be more cost-effective than the von Neumann paradigm when implemented using VLSI techniques. Data-flow machines are built to take advantage of the parallelism inherent in data-flow programs. Systolic arrays are special-purpose, high-performance data-flow structures which are characterized
by having a simple, regular and short communication geometry. But a principal drawback to these special-purpose arrays is that they lack flexibility in implementation; a given systolic array implements only one specific algorithm (e.g., matrix multiplication, where the sizes of the matrices are fixed for a particular array). Another drawback is that an algorithm, to be executed efficiently on a systolic array, must possess a very simple and regular data-flow pattern. What is more, owing to the I/O constraint of pinouts on the chip, it is questionable whether the systolic array can be applied broadly as a coprocessor for practical algorithm implementations.

The purpose of this research was to investigate a class of modified systolic array architectures known as mixed systolic arrays which broaden the scope of applications executable on systolic arrays while retaining much of the simplicity and regularity of the systolic array architecture. Specifically, we investigated a class of array architectures which can be reprogrammed, can be implemented with a reasonable number of I/O pinouts, can self-execute after the arrays have been loaded, and can implement both asynchronous and synchronous computations. To carry out this goal, our research efforts were based upon four tasks: (1) development of the mixed systolic array architectures, (2) modeling of the mixed systolic array architectures, (3) development of a classification scheme and evaluation model for these array architectures, and (4) development of a sample implementation for these array architectures.

In the development phase, we first focused on the design of a small-scale mixed systolic array, or programmable systolic array, in Chapter III. The structure, the components, and the functional behavior of this programmable systolic array were described, and the execution of a second-order recursive equation with nonconstant coefficients on this array was illustrated. It was observed that this programmable systolic array achieved a dynamic problem-solving capability without degrading its performance as compared to Kung's nonprogrammable systolic array.

In Chapter IV, our research next centered upon the development of the more generalized mixed systolic arrays. We applied the mixing profile concept to construct mixed systolic arrays from a building block used as a basis. A case study was made by developing mixed systolic arrays from a number of hexagonal bases with different basis-mixing densities. We also identified the mixing density function and the boundary condition function--the two parameters which characterize a mixed systolic array. The mixing density function (Eq. 4.1) suggested that a particular MSA be used for algorithms supporting a given complexity ratio between data routing and computing. The boundary condition function (Eq. 4.2) related a particular implementation to the ratio of the global and local data bandwidth requirements of the algorithms for which the MSA was intended. We further developed four different types of mixed systolic arrays--CICO, CIDO, DICO and DIDO--based on a two-region mixing strategy, and specified their potential uses.

In the modeling phase described in Chapter V, our research advanced to studying how mixed systolic arrays functioned. We began by forming a local computational ring within a mixed systolic array. A computational ring was formed to support both local computation and local communication, as well as to execute data-flow composite cells based on data-driven principles.
We then developed methods to map a two-dimensional directed graph onto a particular computational ring for effective execution. In vertical mapping, data-flow composite cells were formed along critical paths, while in horizontal mapping, the fork type of data-flow instructions and the join type of data-flow instructions were grouped together. By overlapping these two mappings, we were able to load and schedule data-flow instructions from a directed graph on computational rings for local execution.

In the evaluation phase, we developed a general evaluation model for measuring the performance of the mixed systolic array, based on an average load time and an average transfer overhead measure, in Equation (5.4). The average load time reflects the I/O structure of a particular MSA implementation, and the average transfer overhead measures the locality in the algorithms executed on the mixed systolic array. A case study was made with a DICO type of MSA based on block-driven computations. This study showed that, when most of the computations were uniformly distributed over the computational rings, a DICO type of MSA had a performance similar to that of a hex-connected systolic array. The necessary locality represents the primary cost of achieving the flexibility of a reconfigurable multiprocessor structure that is capable of implementing more than a single algorithm.

In the sample implementation phase discussed in Chapter VI, we developed a modified hourglass computing model by using a DICO type of MSA and a CIDO type of MSA. The hourglass computing model showed some promising aspects in both VLSI computation and VLSI implementation. Based on block-driven computations, the hourglass model provides a programmable communication path hierarchy for fast data transfer while at the same time achieving very high levels of concurrency. Performance measures showed that the hourglass had a low, constant average transfer overhead for algorithms which support a high degree of locality in data transfer.

7.2. Future Research

The mixed systolic array architecture under investigation here represents a new approach which applies data-flow concepts to the execution of parallel algorithms on a modified systolic array. It fits nicely into the mainstream of architecture studies reported elsewhere [24-28,34-36]. Moreover, the investigation of this architecture points toward several important areas for additional study.

An investigation could be made into the system design of matching parallel algorithms to a certain type of mixed systolic array, or to a combination of different types. With a hierarchical system design facility, one is able to characterize algorithms based on a fixed set of parameters (e.g., computational demands, control structures, and global data vs. local data bandwidth requirements). Then, through a one-to-one mapping, the matching process can be carried out in an iterative manner until a prespecified precision level is reached. As an example [44], a class of algorithms can be evaluated in terms of the size of its component modules, the ways in which they can communicate or interact, and how these interactions are controlled. Candidate MSA's are then each evaluated in terms of their relative abilities to support these algorithms' demands.

A second investigation that could be pursued is the development of an on-chip communication model for various MSA structures.
With an on-chip communication model, the communication cost can be evaluated in terms of computing power, queueing size and I/O bandwidth when the algorithms are executed on a supporting MSA architecture. As an example, consider Kung's systolic algorithms for VLSI computations. The argument that "computation is cheap while communication is costly" can be evaluated when such a model is available. And, finally, a further investigation that could be undertaken is the on-chip scheduling problem. Since communications are likely to be the dominant cost factor on future VLSI chips, on-chip scheduling methodologies must be developed to minimize these costs. As an example, the scheduling problem discussed in Section 5.3 can be extended to the algorithm or system level so as to minimize the total communication cost.

REFERENCES

1. Keyes, R.W., "The Evolution of Digital Electronics towards VLSI," IEEE Journal of Solid-State Circuits, Vol. SC-14, No. 2 (April 1979), pp. 193-201.

2. Mead, C.A., and Rem, M., "Cost and Performance of VLSI Computing Structures," Technical Report 1584, Calif. Institute of Technology Computer Science Dept. (1978).

3. Johnson, R.C., "32-Bit Microprocessors Inherit Mainframe Features," Electronics, Vol. 54, No. 4 (Feb. 1981), pp. 138-140.

4. Patterson, D.A., and Sequin, C.H., "Design Considerations for Single-Chip Computers of the Future," IEEE Trans. on Computers, Vol. C-29, No. 2 (Feb. 1980), pp. 108-115.

5. Chang, T.L., and Fisher, P.D., "Programmable Systolic Arrays," to be presented at the IEEE Compcon, Spring 1982.

6. Gajski, D.D., Kuck, D.J., and Padua, D.A., "Dependence Driven Computation," Digest of Papers, IEEE Compcon Spring 81 (Feb. 1981), pp. 168-172.

7. Kuck, D.J., The Structure of Computers and Computations, Vol. I, John Wiley & Sons, New York, 1978.

8. Kung, H.T., and Leiserson, C.E., "Algorithms for VLSI Processor Arrays," in Introduction to VLSI Systems, by C.A. Mead and L.A. Conway, Addison-Wesley, 1980, pp. 271-292.

9. Kung, H.T., "Let's Design Algorithms for VLSI Systems," Proc. Caltech Conf. on VLSI (Jan. 1979), pp. 65-90.

10. Backus, J., "Can Programming be Liberated from the von Neumann Style? A Functional Style and its Algebra of Programs," CACM, Vol. 21, No. 8 (Aug. 1978), pp. 613-641.

11. Ferrari, D., Computer Systems Performance Evaluation, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1978, pp. 245-255.

12. Barnes, G., et al., "The Illiac IV Computer," IEEE Trans. Comp., Vol. C-17, No. 8 (Aug. 1968), pp. 746-777.

13. Robinson, A.L., "Array Processors: Maxi Number Crunching for a Mini Price," Science, Vol. 203, No. 12 (Jan. 1979), pp. 156-160.

14. Gostick, R.W., "Software and Algorithms for the Distributed-Array Processors," ICL Technical Journal (May 1979), pp. 116-135.

15. Kruse, B., "A Parallel Picture-Processing Machine," IEEE Trans. Computers, Vol. C-14 (April 1975), pp. 424-433.

16. Narendra, P., "VLSI Architectures for Real-Time Image Processing," Digest of Papers, IEEE Compcon Spring 1981 (Feb. 1981), pp. 303-306.

17. Flanders, P.M., et al., "Efficient High Speed Computing with the Distributed Array Processor," in High Speed Computer and Algorithm Organization, edited by D.J. Kuck et al., Academic Press, 1977, pp. 113-128.

18. McCormick, B.H., "The Illinois Pattern Recognition Computer," IEEE Trans. on Computers (Dec. 1963).
19. Stone, H.S., "An Effective Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations."

20. Thurber, K.J., and Wald, L.D., "Associative and Parallel Processors," Computing Surveys, Vol. 7, No. 4 (Dec. 1975), pp. 215-255.

21. Seitz, C.L., "Self-timed VLSI Systems," Proc. Caltech Conf. on VLSI (Jan. 1979), pp. 345-355.

22. Franklin, M.A., and Wann, D.F., "Pin Limitations and VLSI Interconnection Networks," Proc. 1981 Intl. Conf. on Parallel Processing (Aug. 1981), pp. 253-258.

23. Karp, R.M., and Miller, R.E., "Properties of a Model for Parallel Computations: Determinacy, Termination, Queueing," SIAM J. Appl. Math., Vol. 14 (Nov. 1966), pp. 1390-1411.

24. Dennis, J.B., "Data Flow Supercomputers," Computer Magazine, Vol. 13, No. 11 (Nov. 1980), pp. 48-56.

25. Dennis, J.B., and Misunas, D.P., "A Computer Architecture for Highly Parallel Signal Processing," Proc. of the ACM 1974 National Conf. (Nov. 1974), pp. 402-409.

26. Watson, I., and Gurd, J., "A Prototype Data-Flow Computer with Token Labeling," AFIPS Conf. Proc., National Computer Conf. (June 1979), pp. 623-638.

27. Davis, A.L., "A Data-Driven Machine Architecture Suitable for VLSI Implementation," Proc. Caltech Conf. on VLSI (Jan. 1979), pp. 479-494.

28. Rumbaugh, J.E., "A Parallel Asynchronous Computer Architecture for Data Flow Programs," Project MAC TR-150, M.I.T., Cambridge, Mass. (May 1975).

29. Chang, T.L., and Fisher, P.D., "A Block-Driven Data-Flow Processor," Proc. 1981 Intl. Conf. on Parallel Processing (Aug. 1981), pp. 151-155.

30. Davis, A.L., Denny, W.M., and Sutherland, I., "A Characterization of Parallel Systems," Technical Report 108, Computer Science Dept., University of Utah (Aug. 1980).

31. Palmer, J., "The Intel 8087 Numeric Data Processor," Proc. Seventh Annual Symp. on Computer Architecture.

32. Chen, T.C., "Parallelism, Pipelining, and Computer Efficiency," Computer Design, Vol. 10 (Jan. 1971), pp. 69-74.

33. Davidson, E.S., and Larson, A.G., "Pipelining and Parallelism in Cost-Effective Processor Design," Res. Report, Digital System Lab, Stanford University, Stanford, Calif. (1973).

34. Leiserson, C.E., "Systolic Priority Queues," Proc. Caltech Conf. on VLSI (Jan. 1979), pp. 199-214.

35. Guibas, L.J., Kung, H.T., and Thompson, C.D., "Direct VLSI Implementation of Combinatorial Algorithms," Proc. Caltech Conf. on VLSI (Jan. 1979), pp. 509-525.

36. Gannon, D., and Snyder, L., "Linear Recurrence Systems for VLSI: The Configurable, Highly Parallel Approach," Proc. 1981 Intl. Conf. on Parallel Processing (Aug. 1981), pp. 259-260.
Meyer, 6., Advances in Computer Architecture, Wiley, New York, 1978. Reynolds, R. G. and Chang, T.L., "The PAL System-A Parallel Algorithm Design System for VLSI-Based Array Architectures," to be presented at The 10th IMACS World. Congress on SYSTEMS SIMULATION AND SCIENTIFIC COMPUTATION, (Aug. 1982), Montreal, Canada.