.io- - ”'0' u --qv , -uo-vur ; u: L . hm.- t w— . " a ~ “.. :11 ‘3 .v . aw . m- 1:" ‘ . . -: fl. _ .- - .1! .M J... n {Z- n ~ o. "433’ n '4‘! ’.‘.”'l , gut-..“ p.14”. ': vvla oI-él‘r. ,, .d...‘ “4-, W>Lfia,‘ u...“ ‘ _, ,_.....~ I ‘tL‘w m. » 54m ' ‘4‘»! $.13»? . 2‘7!) :‘73 d .. .n 9.2"“ ' OJ w- l.4. n:- .W or «- u.» I; ,4 u ... , 21W . “in" :331.‘ .. a"... .- T, ‘. M (12":th I; i. 1‘3".- ....- .5 v u, .w. .m .- , -p—np ,_ n a m- a. n 1 u 3"..‘3? . AI? —,-A m? _ _,, .. ,r- .~ ,. ' I w .. u («— “1‘75“”? . M :5!”1 “13.5“ M - .A (cut-'1 4 ‘ Ln v..~...g m . w . .a-v L- r4...» 4 u lune-4 mu" n ”- 2.4m 4..» -.,‘,~ 5 a». no . Ju '- '4 3:»;- n u- ‘1’: .» 5 “‘2‘ ,-,;;'}§'*$§i$ ,v,x“p'.t§:§. ‘ 'x’irfiiri. ~ Shaka»? {P‘é‘g , . ...... a". . «nag. .4“ r .n. m ,, _. 45mg!» “153.33%:ng 43.3%. awn}: s' ; pigggfl‘iflggs 5:353:14}??? ‘ ‘ 1:: 'r “53' z. m‘ 5.. 9'. , 2:. $3511.; mu 2.. amt" «"2..- 4533 .52.“ ".3... a ; H x ‘1! .. 141- I i r sq ' 1 ‘= Z’é‘z‘eirsaé‘i‘éa 3‘ . 2:19 $43“? l“ . 2m, 1: ‘ .45; E: .. ‘ 3g}? 1 . 3% :‘I‘ not! ‘1 , ‘3 . .- n L _ ,pw ., w. ..‘wnuov """" .4.- «we | a - m2 3a., - :aiiz3’ci} 5! - i‘ fiiifiigigii“3“§¥géé"§ A' . 2222 ”I. 13" 'f' 333% ‘ 33 » ; f» q B. 131‘; i! 5 is o {hf :" ‘s‘ik . c'dfi;‘19"¥:~_’4‘. "a ' ‘ ,émfiflihs . a .3132???- g ;: Uni-35.“? gig: ’ I Ms. 9 CV“ 19) MICHIGAN STATE I 31II IIZIIII3 II IIIIIIIIIIIIIII/IIIIIIIII 0405 7701 This is to certify that the thesis entitled A VERY LONG INSTRUCTION WORD ARCHITECTURE IMPLEMENTED ON THE SPLASH 2 FPGA ARRAY presented by Roy C. Wang has been accepted towards fulfillment of the requirements for Master's degree in ElectricaI Eng. Mic/m Major professor Date 3/17.] ?6 0-7639 MS U is an Affirmative Action/Equal Opportunity Institution LIBRARY Michigan State University PLACE IN RETURN BOX to romovo this chookout from your rooord. TO AVOID FINES rotum on or boron duo duo. DATE DUE DATE DUE DATE DUE MSU IoAn Milan-tin Wood Opportunity intuition W1 A Very Long Instruction Word Architecture Implemented on the Splash 2 FPGA Array By Ray C. Wang A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Department of Electrical Engineering 1995 ABSTRACT A Very Long Instruction Word Architecture Implemented on the Splash 2 FPGA Array By Ray C Wang The need for faster computing systems constantly pushes the limits of technology. While special purpose and custom computing systems can provide very high levels of performance, the cost associated with these systems limits their application. Recently, there has been significant improvements in the performance of general purpose processors which can be used in a wide variety of applications. The speed of these general purpose processors is limited by how many instructions they can perform per clock cycle. To increase that number, machines need to be able to execute instructions in parallel. Machines capable of such parallel execution are known as superscalar processors. Unlike other proposed superscalar models, the VLIW (Very Long Instruction Word) model extends the RISC paradigm of simplified hardware. This thesis examines the feasibility of using Splash 2, an FPGA-based processing array, configured as a VLIW processor. The VLIW architecture implemented in this study is capable of performing two operations concurrently. This study demonstrates the use of a new platform to prototype and test architectural and compilation theories and highlights its capabilities and limitations. Copyright © by Roy C. Wang 1995 Dedicated to my parents, Maria Wang and Tang S. Wang, whose sacrifice and perseverance has given me a better life. iv ACKNOWLEDGMENTS I would sincerely like to thank my advisor, Dr. Diane Thiede Rover, for all of her help and guidance in the course of this research. Her insight provided much needed direction and encouragement. I would also like to thank the other members of my committee, Dr. Elias Strangas and Dr. Michael Shanblatt for their efforts on my behalf. Dr. Strangas has provided support throughout my graduate education. His generosity and faith proved to be an ' invaluable source of strength and support. I wish to thank my family and friends for always being there for me. Last, but hardly least, I would also like to thank Lisa for her editing efforts and support. TABLE OF CONTENTS LIST OF FIGURES ix LIST OF TABLES xi 1 Introduction 1 2 Background Information 4 2.1 Scalar Processors .......................................................................................... 4 2.2 Parallel vs. Superscalar Processors .............................................................. 6 2.3 Instruction Level Parallelism ....................................................................... 8 2.4 Design Issues of Superscalar Processors ...................................................... 10 2.5 Code Motion ................................................................................................. 12 2.6 Reconfigurable Systems ............................................................................... 16 2.7 Summary ...................................................................................................... l9 3 Design Methodology & Architecture Specification ' 20 3.1 Architecture Driven Design Methodology ................................................... 20 3.1.1 Evaluation Criteria .................................................................................... 22 3.2 Architectural Specifications ......................................................................... 22 3.2.1 Splash 2 ..................................................................................................... 23 3.2.2 Dex JR. ...................................................................................................... 25 3.2.3 Dex-II ........................................................................................................ 26 3.2.4 The Instruction Set .................................................................................... 29 33 Design Constraints ....................................................................................... 31 3.3.1 Data Hazards ............................................................................................. 31 3.3.2 Memory Hazards ....................................................................................... 33 vi 3.3.3 Control Hazards ......................................................................................... 33 3.4 RTL Specifications of Dex-II ....................................................................... 34 3.4.1 PE Partitions .............................................................................................. 34 3.4.2 Communication Paths ............................................................................... 36 3.4.3 Fetch Stage ................................................................................................ 39 3.4.4 Decode Stage ............................................................................................. 39 3.4.5 Execution Stage ......................................................................................... 43 4 Simulation and Synthesis Environment & Results 46 4.1 Simulation & Synthesis Process ................................................................... 46 4.2 Synthesis Results .......................................................................................... 51 5 Test Programs 55 5.1 The Runtime Environment ........................................................................... 55 5.2 Fibonacci Sequence ...................................................................................... 56 5.2.1 Dex-II Verification .................................................................................... 56 5.2.2 Program Implementation and Performance ............................................... 57 5.3 Bubble Sort ................................................................................................... 59 6 Conclusions and Future Investigations 64 6.1 The Dex-II Evaluation .................................................................................. 64 62 VLIW vs. RISC ............................................................................................ 65 6.3 Splash 2 Evaluation ...................................................................................... 66 6.4 Future Investigations .................................................................................... 67 Appendix A 69 A1 Controll .vhd (PEI) ..................................................................................... 69 A2 Contr012.vhd (PE2) ..................................................................................... 70 A3 Decode] .vhd (PE3) ..................................................................................... 72 A4 Decode2.vhd (P134) ..................................................................................... 75 A5 Execute1.vhd (PES) ..................................................................................... 77 vii A6 Execute2.vhd (PE6) ..................................................................................... 80 A7 Execute3.vhd (PE7) ..................................................................................... 83 A8 Execute4.vhd (PE 8) .................................................................................... 84 A9 Xbarcontrol.vhd (PEO) ................................................................................ 88 A10 Xbarconfi g .................................................................................................. 89 Appendix B 92 BI Fibonacci Sequence Results ......................................................................... 92 82 VLIW Fibonacci Sequence Results .............................................................. 9S BIBLIOGRAPHY 99 viii LIST OF FIGURES Figure 2.1. Instruction execution models ....................................................................... 6 Figure 2.2. Instruction stages ......................................................................................... 10 Figure 2.3. Percolation scheduling ................................................................................. 13 Figure 2.4. Register renaming ........................................................................................ 14 Figure 2.5. Compensation code ...................................................................................... 15 Figure 2.6. Simplified block diagram of the XC4000—Family CLB .............................. 17 Figure 3.1. Architecture hierarchy ...................... I ........................................................... 21 Figure 3.2. Splash board ................................................................................................. 23 Figure 3.3. Design flow .................................................................................................. 24 Figure 3.4. The Dex JR. ................................................................................................. 27 Figure 3.5. The Dex-II .................................................................................................... 28 Figure 3.6. Instruction format ........................................................................................ 30 Figure 3 .7. Data dependency .......................................................................................... 32 Figure 3.8. Cascaded ALU ............................................................................................. 33 Figure 3 .9. Memory hazard ............................................................................................ 33 Figure 3.10. Control hazard ............................................................................................ 34 Figure 3.11. PE partitioning ........................................................................................... 35 Figure 3.12. Cycle 1 ....................................................................................................... 36 Figure 3.13. Cycle 2 ....................................................................................................... 37 Figure 3.14. Cycle 3 ....................................................................................................... 37 Figure 3.15. Cycle 4 ....................................................................................................... 38 Figure 3.16. Cycle 5 ....................................................................................................... 38 ix Figure 3.17. Cycle 6 ....................................................................................................... 38 Figure 3.18. Fetch RTL Description .............................................................................. 40 Figure 3.19. Decode RTL Description ........................................................................... 42 Figure 3.20. Execute RTL Description .......................................................................... 45 Figure 4.1. VHDL hierarchy .......................................................................................... 47 Figure 4.2. Code description .......................................................................................... 48 Figure 4.3. Simulator output .......................................................................................... 49 Figure 5.1. Fibonacci Execution Results ........................................................................ 58 Figure 5.2. Two Fibonacci programs ............................................................................. 59 Figure 5.3. RISC bubble sort program ........................................................................... 60 Figure 5.4. VLIW bubble sort program .......................................................................... 61 Figure 5.5. Optimized RISC bubble sort program ......................................................... 63 Figure 5.6. Optimized VLIW bubble sort program ........................................................ 63 LIST OF TABLES Table 1. Instruction set ................................................................................................... 29 Table 2. Summary of synthesis results ........................................................................... 54 xi CHAPTER 1 Introduction New technology invariably alters the landscape of design and implementation of computer systems. Current trends of processor design have moved towards a RISC (Reduced Instruction Set Computer) architecture. Lower cost memory, automated design, and the need for a modular approach due to VLSI (Very Large Scale Integration) and WSI (Wafer Scale Integration) have all played a significant role in this shift of ideology. As we push the upper limits of performance with pipelining, it becomes inevitable to exploit the fine grained parallelism of programs. This is known as instruction level parallelism (ILP). Superscalar processors, which can execute multiple instructions concurrently, have already been designed and implemented to take advantage of existing software. They promise even more speed-up in the future as compiler technologies and operating systems are written to take advantage of their added capabilities. Processor design is not only changing, but changing rapidly. The development of computer technology is one of the fastest moving industries in the world. Introducing a new product a month or two before competitors can capture a majority of the market share. With this motivation, it is not difficult to see why shrinking the design cycle is imperative. FPGA (Field Programmable Gate Array) technology is increasing in both array size and speed. These chips offer a short development cycle by implementing and testing designs without the need for expensive wafer fabrication. Individually, they are 1 2 used today for ASIC (Application Specific Integrated Circuit) and small prototyping applications, however, larger ensembles offer new capabilities in rapid prototyping and custom computing. Splash 2 is one of the first computing systems to explore these possibilities. Its design allows for SIMD (Single Instruction stream, Multiple Data stream) type architectures as well as highly pipelined and systolic applications. This thesis explores the issues associated with implementing an instruction set processor with the FPGA-based system, Splash 2. The Spyder architecture [17] has shown that it is possible to utilize FPGAs in a general purpose processor. The Spyder, however, dedicates the FPGAs as reprogrammable execution units to provide some flexibility in its functionality. Splash 2 and other FPGA arrays represent a totally new class of machines. The size and resources of the Splash 2 allows an entire architecture to be prototyped and implemented with FPGAs. This would allow the computer designer to test new features of an ISP (Instruction Set Processor) by reconfiguiing the Splash 2. For example, programs that may benefit from very specific hardware routines can reprogram the data path for better performance, and still retain the functionality of the remaining system. FPGA technology, however, is still in a relatively young stage. The limitations imposed by the Splash 2 constrain designs. The limited logic on a single FPGA forces us to partition our design across several FPGAs. The architecture of the Splash 2 also limits the off chip resources available to each FPGA, such as memory and busses. These restrictions present a significant obstacle to implementing a general purpose instruction set processor. Our approach to studying ILP and superscalar processors is to prototype a VLIW (Very Long Instruction Word) processor on the Splash 2. The simple hardware needed to implement a VLIW architecture is well suited for the Splash 2. A VLIW architecture will also exploit the fine grained parallelism of programs that conventional processors have not. The problems associated with parallel processing motivate the need for a platform 3 that can rapidly prototype and test new architectural and compilation theories. While conventional computing machines are built to process either general purpose code or highly customized instructions, FPGA-based arrays offer the true general purpose machine that can be configured as both. New architectures are developed from studies of past processors and applications. In chapter 2, we will discuss some of the issues and approaches of implementing instruction set processors. These issues lay the foundation to our new architecture, the Dex-II. Chapter 3 provides a detailed description of our design methodology and implementation strategies. It also provides an architectural specification for Dex-II, the VLIW architecture implemented on the Splash 2 array. Chapter 4 presents the design environment and provides the simulation results and synthesis statistics of our implementation. In chapter 5, an analysis of two test programs will help us evaluate how successful the VLIW architecture extracts ILP. Chapter 6 will summarize the findings and discuss how future implementation of FPGA arrays may be improved. CHAPTER 2 Background Information This chapter provides the background information and motivation for this thesis. It presents a brief history of architectures and the concepts applied to improve their performance. We discuss the challenges of parallel computing as well as introduce the concept of instruction level parallelism. Finally, we consider reconfigurable systems and how they can contribute to innovations in computer architectures. 2.1 Scalar Processors First generation computers were typically load/store machines. They were basically single accumulator processors not unlike todays programmable calculators. If multiple variables were needed in the computation, they had to be stored into memory and loaded at a later time. Memory address calculations and access were time consuming. Therefore, more complex methods of manipulating data gave birth to the first CISC (Complex Instruction Set Computer) machines. These machines are characterized by a larger set of registers and complex memory addressing modes [15]. For example, the Motorola 68040 microprocessor implements 113 instructions with 16 general purpose registers and supports 18 different addressing modes. The popular Intel 80486 microprocessor implements 157 instructions with 12 addressing modes. The concept of the CISC is to put commonly executed sequences of code into one single instruction and add special hardware to execute it as quickly as possible. Each instruction 4 5 is performed by executing subprograms written in microcode. Microcode represents the physical bits of signals that control the data path. The advantage to this design is the compatibility of code between machines and the reduced cost for program memory. If the hardware is changed, all that was needed to compensate is a rewrite of the microcode to support the same instructions. Applications are able to run without the need to recompile. These processors, however, have become very elaborate and complex to design. Transistor counts have rocketed into the millions, and the speed-up of each successive generation has dwindled. The search for more speed led to techniques used in mainframes and supercomputers. Pipelining [12] was introduced to exploit temporal parallelism. Pipelining allows instructions to be overlapped in stages. Each stage completes part of the instruction so several instructions can be executing at the same time. To make pipelining efficient, each stage should take about the same time to complete its tasks. CISC techniques are not very well adapted to handle pipelining. Each individual instruction has its own time frame and data paths are littered with special hardware to implement those special instructions. RISC processors, however, are designed to balance these stages. RISC instructions use very basic functions to simplify the hardware as much as possible. The functionality of the processor is preserved, but the number of instructions needed to complete the same task was increased. The increase in memory size required to run programs have become less of an issue as memory sizes were quadrupling about every three years [12]. The RISC machine is characterized by a smaller set of instructions and a larger supply of registers. Addressing is kept as simple as possible and memory operations are usually limited to simple load/store commands. A typical RISC machine is the Sun SPARC CY 7C601 with 69 instructions. There are 136 registers divided with a technique called register windowing. This is basically a paged register file to supply clean registers for subroutines. The RISC model tries to partition instructions into well-defined stages to 6 make pipelining as efficient as possible. The results are higher clock rates and faster design times. Chips such as the DEC Alpha and the PowerPC have already broken the 100MHz mark. Now the search for even more speed has begun to focus on spatial parallelism. Increasing the number of instructions that can execute at the same time brings us into the realm of parallel and superscalar computing. 2.2 Parallel vs. Superscalar Processors The idea of parallel and superscalar processors is to allow instructions to be executed concurrently [18]. This is not a new idea, and there have been many successful approaches to parallel computing. Parallel processors can execute several instruction streams, while a superscalar processor can execute several instructions of a single instruction stream. Parallel execution of instructions increases the throughput of the ‘ ‘7 T-ime |ii[i2|i3]i4| Serial Instruction Execution | I” | i ' l 12 I i I I3 I I 14 I Pipelined Instruction Execution I n l | 12 I l 13 I II4I Superscalar Pipelined Instruction Execution Figure 2.1. Instruction execution models 7 processor and decreases the time needed to finish a task as can be seen in Figure 2.1. This figure depicts the relative time needed to finish four tasks (ll-I4) and shows how pipelined and superscalar execution times compare to serial execution. Many of the traditional parallel processors need highly specialized hardware and special algorithms to achieve their goals. The Flynn classification [12] proposes four different models of computers. The SISD (Single Instruction stream, Single Data stream) architecture is the standard general purpose scalar computer. The SIMD architecture is represented by machines such as the MasPar MP—l and Thinking Machines CM-2. This type of architecture has many identical functional units and a single instruction unit. The instruction unit broadcasts a command, and all the functional units perform the same function. This is extremely efficient for many different matrix operations where the same task is performed on large amounts of data. Unfortunately, only very specific applications can take full advantage of this setup. The MISD (Multiple Instruction stream, Single Data stream) computers are a bit more difficult to classify. Certain systolic arrays may be considered as MISD computers. The data flows through stages of the array and is operated on by whatever instruction executes at each stage of the array [7]. The Multiple Instruction, Multiple Data stream MIMD processors are the most common parallel computers. Mainframes, supercomputers, and hi gh-end workstations all employ some form of multiprocessing capability. Employing the MIMD model, they can exploit data parallelism and specialized parallel code to expedite the completion of a single program. However, these computers are primarily used to serve many users and programs simultaneously. While they can complete a multitude of tasks, each individual program or task itself is not being executed any faster [12]. To complete a single task using multiple processors requires either a regular data structure, such as matrix calculations, or complex synchronization and message passing mechanisms. Several issues have been studied to overcome the barriers of parallel processing. Parallel programming requires a good understanding of how the hardware is organized in 8 order to take full advantage of the scalable hardware. Variations of high level programming languages have been used to overcome some of these issues. A language called dbC (Data-parallel Bit C) was developed with parallel data structures for SIMD type execution. By using the new constructs supplied by the language, the compiler can effectively distribute the computation among the available processing elements. dbC has been used for parallel machines such as the Cray and Terasys. Even more interesting is the possibility of using a dbC compiler to generate hardware on an FPGA-based array to implement each program with its own hardware [8][9]. Parallel machines also suffer from synchronization problems. There are basically two methods of sharing data and keeping data coherent. The message passing model uses a distributed memory system. Each processor keeps its own data and sends data via messages when needed. The MasPar MP—l and Ncube2 are two examples of a message passing computer. The other method is a shared memory model where each processor communicates through a common memory space. The shared memory space makes programming much easier since the programmer does not need to worry about the location of data. Although the shared memory model is theoretically sound, the scalability of such a system is limited by the bandwidth of the memory and the complexity of the caching scheme [30] used in the system. The DASH multiprocessor [19] developed at Stanford uses a distributed directory based cache coherence scheme [19][l3] to scale the single memory space. However, these schemes need both software and hardware support in order to be utilized. Still, others argue that a highly scalable memory system is not necessary due to the inherently low level of parallelism that can be extracted from programs [12]. 2.3 Instruction Level Parallelism It becomes quite obvious from Figure 2.1 that the number of instructions that can be executed concurrently will greatly affect the performance of superscalar architectures. 9 This number is limited by the ILP of the program. ILP is the amount of parallelism that can be extracted from a program written for a sequential processor [5]. This is limited by true data dependencies, procedural dependencies, and resource conflicts [18]. Data dependencies are instructions whose operands depend on the results of previous instructions. Instructions in general cannot execute until their operands are available. Several methods to resolve data dependencies have been used. The Intel i960CA, HP PA-RISC7100, HyperSPARC, and IBM's RSGOOO all use a runtime technique known as scoreboarding to resolve data dependency [15][12][28][33]. This technique requires keeping a table or ”scoreboard” of what stage of execution each register is in. Other dynamic techniques use reservation stations and a version of the Tomasulo algorithm [28] implemented on the IBM 360 floating point unit. The major disadvantages of these implementations is the hardware cost and the complexity associated with them. Dependencies can also be eliminated during compile time [5]. VLIW processors depend on the compiler to generate compact code in order to exploit ILP. The biggest disadvantage to this approach is the lack of good compilers and the incompatibility of code from machine to machine. Procedural dependencies are instructions that are affected by branching. At conditional branches, there are two sets of instructions that can be executed. The processor often has to await resolution of the branch to continue execution. Several different approaches have been used to decrease the impact of branching. Speculative execution [20][12] can be used to eliminate some of these branch latencies. Compiler techniques utilizing trace scheduling methods [4] can also move instructions across blocks of code to fill up unused cycles. Resource dependency involves the available number of functional units to execute instructions on. The decision of how many functional units to implement depends heavily on the type of programs being run and the level of parallelism that can be 10 extracted. Too many units would leave costly hardware idle, while too few would create a bottleneck for the performance of the processor. 2.4 Design Issues of Superscalar Processors The physical hardware where operations are performed are called functional units. In a scalar architecture, only one function can be executed at any given time. In a superscalar architecture, functions are differentiated in hardware to allow multiple instruction execution among all of them. The most common functional units are integer units, floating point units, memory units, and control units. The number and types of units vary from architecture to architecture. The Motorola 88110, for example, has ten functional units while the DEC Alpha has only four [33]. More units, however, do not necessarily indicate better performance. There are typically four stages in executing an instruction as shown in Figure 2.2: Fetch, Decode, Execute, and Writeback [12]. The Execution stage can be easily sealed with the addition of hardware. The Fetch, Decode and Writeback stages are the critical points where bottlenecks can impede performance gains. The Fetch stage is responsible for getting instructions out of the instruction cache. In a superscalar processor where more than one instruction can be issued, this becomes a challenging task. Different architectures adopt many different approaches to this problem. The number of functional Il Fetch Decode Execute Writeback Figure 2.2. Instruction stages units and the dependencies on the instruction dictate how many and which instructions can begin to execute. For instance, the DEC Alpha can issue up to two instructions per 1 l clock cycle as long as they are operating on different functional units [33]. The Intel i960CA can fetch four instructions and issue up to three of them if they are on separate functional units [33]. The sophistication of issuing instructions greatly affects the ability of keeping the hardware busy. Therefore, having a multitude of functional units does not automatically make a better system. It becomes apparent that the order of instructions plays a critical role in exploiting the maximum level of parallelism. Instructions can be issued in-order or out-of-order [18]. All of the architectures discussed issue instructions in-order of the program. Since there are a limited number of functional units, certain combinations of operations cannot be issued together. Two integer operations, for example, cannot be issued on the same clock cycle if there is only one integer unit. In order to execute comedy, a stall or NOP (No Operation) must be injected into the code. Out-of-order issues can eliminate this and use up the wasted stall time. This is achieved by using an instruction window to look ahead at several instructions to determine if there are any data dependencies. Unfortunately, the complexity involved in the hardware has deterred an implementation of this scheme. Instructions can also be completed in-order or out-of-order [18]. The latencies of floating point units tend to be much longer than those of integer units. With in-order completion, instructions may be needlessly stalled while one very long instruction is executing. With out-of-order completion, this can be avoided as long as data dependencies are observed. Some of the problems with out-of-order completion are the added hardware for dependency checks and the difficulties dealing with precise interrupts and exceptions. Since instructions may be completed out of order, the restart point may cause the instruction to be incorrectly executed twice. Again, extra hardware is required to support precise interrupts. The dynamic techniques of achieving parallel instruction execution all require extra hardware. A VLIW processor tries to adhere to the RISC philosophy. Instead of adding complex hardware schemes to resolve the problems of multiple instruction issues 12 during runtime, VLIW processors resolve dependencies during compile time. The instruction word contains all the information for each functional unit. This makes for very long instructions as more and more functional units are added. By scheduling and analyzing programs off-line, we can get rid of the hardware overhead and ultimately achieve better performance than a dynamic solution. This approach also allows for very simple instruction issuing and decoding. The Multiflow TRACE series were the first commercially available processors implementing the VLIW architecture [29]. Unfortunately, the performance that was achieved was severely limited by the compiler technology. It was unable to extract enough ILP at the time [29]. There are basically two methods to extract ILP for a VLIW architecture: code movement and guarded execution. Code movement is much easier and cheaper to implement than guarded execution. Guarded execution requires the addition of hardware registers and additional logic along with semaphore semantics in software. The basic idea is to be able to execute speculatively along both branches when a conditional branch is encountered and to invalidate the path not taken. This method usually requires many shadow registers and memories to be effectively implemented. 2.5 Code Motion Code motion, otherwise known as list scheduling, is an effective method to expose ILP. A human compiler can write very efficient assembly level code, but today's compilers are still not as smart as we would like them to be. The two most popular forms of list scheduling are known as percolation scheduling [22], and trace scheduling [4]. Early efforts such as the Bulldog [3] compiler and the Multiflow TRACE [28] implement these techniques with varying degrees of success. Percolation gets its name from trying to "percolate" instructions up as far as possible to fill in wasted NOP space. There are several key points that must be observed when moving code around. The primary challenge is keeping the semantics the same. 13 Examples of code will be in the format as follows 11: 0P1 * Comments and explanations 0P2 12: 0P1 where II is a single instruction consisting of two concurrent operations, and 12 is a single instruction with only one operation. Before After 1 1 A = B+C A = B+C H = F+H D = A+3 D = A+3 F = G+B E = D+1 E = D+1 1= (D) (G)= A Figure 2.3. Percolation scheduling From the example in Figure 2.3, we assume that there are only three basic blocks of code (1-3). If there were successors following blocks 2 and 3, we would also have to take those into account. The key to preserving the semantic meaning is to avoid overwriting registers that are "live" and memory locations. This restriction on memory basically dictates that store instructions cannot be moved. Ensuring that executing instructions do not affect live registers can be achieved in several ways. We can use register renaming to replace variables or to compute speculated variables. In Figure 2.4, we could not move 14 the instructions in block 3 without affecting the code in block 2. This is due to the two variables, I and F, in block 3 that are also used in block 2. If we rename the I and F variables in block 3 to H and J, we see that it is still possible to move the code from block 3 into block 1. This is an inexpensive and powerful performance booster. Before After 1 1 A = B+C A = B+C H = (D) D = A+3 D = A+3 E = D+l J = G+B E = D+1 F = I+D 2 F = I+D 3 I = (D) F=G+B 2 (G) = F+E (G) = F+E Rename I to H Rename F to J Figure 2.4. Register renaming Another semantic preserving method is adding compensation code. If we restrict our movements to instructions that can be undone, we can shorten one path possibly at the expense of another. This is the idea behind trace scheduling [4]. If we have a good profile of how programs normally operate, then by reducing the path taken most often, the added cost of compensation code is negligible. In Figure 2.5, we reduced the left path from four to three instructions at the expense of the right path which went from four to five instructions. If we assume this is a loop with 100 iterations, then the total number of cycles needed to execute this program would be 90*3 + 10*5 = 320 cycles. The original code would have taken 90*4 + 10*4 = 400 cycles. This is a substantial improvement for 15 our very small example. The difficulty in such a scheme, however, is to have an accurate runtime profile of programs. If for some reason, the program switched over to executing the right path ninety percent of the time, we would experience a penalty rather than the intended performance boost. Before After 1 l A = B+C A = B+C I = (D) D = A+3 D = A+3 E = D+1 G = G+1 E = D+l 90% F = F+D yx 2 F = F+D 3 I = (D) 3 G=G+l H=G+F G=G-1 F = F-D H = G+F I____ Figure 2.5. Compensation code The other major drawback of rescheduling is exception handling. Note that in block 3 of Figure 2.5, there is a memory read into variable I. On a real system, this may cause a page fault and require the system to stop and load up a new page of memory. The increased memory reference can cause the system to start thrashing if it is scheduled up to block 1. This further limits our ability to move code around. A technique called sentinel scheduling [20] combines the ideas of guarded execution with list scheduling to avoid these problems. 16 2.6 Reconfigurable Systems The parameters and problems associated with parallel and superscalar computing present many possible solutions. Reconfigurable systems allow the designer to implement and test various designs quickly and efficiently. A VLIW architecture called the VIPER [10] claims a 100 MIPS peak throughput. At 25 MHz, it is capable of performing a branch, two load/store operations, and up to four ALU operations per clock cycle. A study done with the VIPER on pipelining and bypassing [1] also showed that the added cost of a fully interconnected network for operand forwarding is significant. The additional bus lines increased cycle time and silicon area. This analysis shows that the dynamics of large systems can vary widely with different implementations and applications. By using a reconfigurable system, we can study a greater number of applications to find the best solution. A reconfigurable system gives us the ability to tailor an architecture through software. This allows the designer to optimize the amount of hardware for each application. Architectures that would benefit from additional forwarding paths, for example, could be implemented on the same system as an architecture whose main purpose is to get the fastest cycle time. Reconfigurable systems can provide computer designers with important data on the impact of architectural changes and features. The Splash 2 is a reconfigurable system composed of 17 FPGAs. The Field Programmable Gate Array is a matrix of programmable cells called CLBs (Configurable Logic Blocks) [34]. The Xilinx 4000 series uses two function generators per CLB that can accept up to four inputs and perform any Boolean function on them. The functions are implemented with look-up tables, so the delay associated with each generator is 17 independent of the function being implemented. A simplified block diagram [34] of the CLB is shown in Figure 2.6. The multiplexer controls and logic functions are determined by the configuration. 2332 Figure 2.6. Simplified block diagram of the XC4000-Family CLB The design can be synchronous by using the two flip-flops to latch values, or combinational using the unlatched outputs. Each CLB is surrounded by an interconnection network that allows the CLBs to communicate with each other. Delay times governed by routing are a key issue in developing fast designs with FPGAs. Signals are brought off chip through IOBs (Input Output Blocks). These pins can be configured as latched, tristated, or combinatorial signals. The third generation Xilinx 4025 has 1024 CLBs in a 32x32 array with 256 IOBs. This is large enough to implement simple algorithms and state machines. We will examine a few of these efforts, each with a unique strategy on how to utilize the FPGA for custom computing. 18 FPGAs have the ability to be reprogrammed an unlimited number of times. This allows for fast design and development of ASICs and prototypes at relatively low cost. Applications such as image processing benefit from the reconfigurable hardware as well. In general, these applications have parameters that may change from execution to execution, but possess potential for improved speed from hardware implemented algorithms. Image classification [25] and convolution [24] are just a few examples of the successful applications of this technology. Other applications utilize the FPGAs as a reprogrammable co-processor or as execution units [16]. A self—timed floating point co- processor [23] can be added without affecting the system clock of an existing system. Several reconfigurable processors have also begun to appear [17][6][32][14][2]. The ability to generate hardware specifically for the instructions to be executed is an exciting prospect. Self-reconfiguring processors [6][32] utilize the feature of some FPGAs to be partially reconfigured. This allows for a "virtual hardware" setup where a processor may support several operations, but only have room for one or two actual configurations to be waning on the FPGA. This also provides the ability to create as many functional units as necessary to extract the maximum level of parallelism. However, work done with the RRANN2 [l 1] shows that the overhead for reconfiguring systems is quite high. In a neural net application, 80% of the total time was spent in reconfiguration. This overhead still limits these systems to highly parallel applications. Systems such as the SPYDER [17] and the DISC architecture [32] were built with the idea of executing instruction level programs. They have hardwired data paths and a small array of FPGAs for reconfigurable execution units. The larger more general purpose boards such as the Teramac [2], the Enable++ [14] and the Splash 2 [7] all have more FPGA chips and programmable crossbars to implement different communication topologies. These boards have been shown to perform well for programs with large amounts of data parallelism, however, they need to be programmed specifically to take advantage of their hardware. 19 With the advent of good hardware descriptor compilers and tools, the concept of using an implementation independent model [27] to explore new instruction set architectures was proposed. With these large arrays and VHDL as a tool, it becomes possible to create an instruction set architecture on top of these general purpose custom computing arrays. This gives us the ability to create a custom computing system flexible enough to implement the four models of computers that Flynn proposed. 2.7 Summary Exploring the different ways we can enhance single processor performance, we note that there are basically two methods of increasing throughput: (l)decreasing cycle time, and (2)executing multiple instructions. Cycle time ultimately depends on the technology available. Multiple instruction execution however deals with the organization of the architecture and the programs executing on the processor. The key for successfully executing multiple instructions lies in the ability to schedule them to avoid hazards. Instruction scheduling techniques can be static or dynamic. Although initially more attractive, dynamic scheduling techniques cost more in hardware and add complexity to the machine. This will set an upper limit of performance when compared to a statically scheduled machine. Statically scheduled machines do not need to examine instructions on the fly before execution. While it is too early to say which approach will ultimately perform best, both approaches need to be implemented and studied. Splash 2 offers a cost-effective environment to realize and test these architectures quickly. CHAPTER 3 Design Methodology & Architecture Specification With an understanding of some of the problems and issues of superscalar processors, we now consider implementing a superscalar processor on an FPGA-based array. The design of our superscalar processor is driven by a set of architectural specifications. This chapter first introduces the RISC architecture as the foundation of our new VLIW machine. The details and limitations imposed by the Splash 2 are discussed followed by the final architectural specification of the Dex-II that was implemented. Finally, we examine the various hazards that exist and how they are resolved. 3.1 Architecture Driven Design Methodology Two distinct architectures form the new VLIW architecture, Dex-II. The Dex JR. provided the framework for the RISC instruction set, and the Splash 2 architecture drove the implementation strategies. This architectural driven approach is distinct from application driven designs. Application driven architectures are targeted towards very specific tasks. The Splash 2 was developed as a method to implement application driven architectures. Application driven architectures become obsolete when the application is no longer needed or outdated. Using the Splash 2 allows the user to replace the architecture when 20 21 Inputs Inputs Application Specific Program Architecture VLIW Architecture Splash 2 Architecture Splash 2 Architecture All Architectures Outputs Outputs Figure 3.1. Architecture hierarchy this occurs. Although less costly than building a new system each time a new task arises, using the Splash 2 does incur some overhead. Architectures are now limited by the resources and capabilities of the Splash 2 system. Figure 3.1 shows how an application specific architecture is constrained by the Splash 2 architecture. Application driven architectures are determined by the specific task and the resources of the Splash 2 architecture. Figure 3.1 also shows how the VLIW architecture is implemented within the constraints of the Splash 2 architecture. It is similar to an application specific architecture, however the purpose of the VLIW is not to perform a specific task. The VLIW architecture can be programmed to solve several different tasks by replacing the program block in Figure 3.1. Instruction set architectures change over time as new hardware is developed and new instructions are added. Splash 2 offers the ability to 22 prototype and alter the architecture as needed. The architecture of the VLIW machine is driven by the RISC instruction set, the issues of pipelining, and the resources of the Splash 2 architecture. 3.1.1 Evaluation Criteria The architecture of the Dex-II is based upon the RISC instruction set. This instruction set provides all the primitives necessary to perform more complex instructions. The Dex-II, therefore, must successfully implement these instructions. In doing so, we can compare the performance of our implementation to the base RISC architecture. We will also evaluate how the VLIW instruction set affects the extraction of ILP. A comparison of the VLIW and RISC code in Chapter 5 will establish a quantitative measure of the performance of the new architecture. This measure of performance allows computer designers to better understand and improve architectures. A successful implementation of the Dex-II will also allow us to measure the utilization of the FPGAs. This allows us to evaluate the use of the Splash 2 resources in prototyping instruction set architectures. This data, presented in Chapter 4, is important to improving future generations of FPGA-based arrays. 3.2 Architectural Specifications The architectural specifications set forth by the Splash 2 and Dex JR. sets the expectations and implementation of the Dex-II. Our goal is to implement a processor that has the same functionality of the Dex JR. on a platform as flexible as the Splash 2. We will then extend the original RISC architecture to a VLIW implementation called the Dex-II. The architectural specifications provide the level of detail necessary to achieve that goal . 23 3.2.1 Splash 2 Implementing any design on the Splash 2 requires a good understanding of [the underlying hardware. The Splash 2 system is based on the Xilinx 4010 FPGA. Each chip represents a processing element that consists of 400 CLBs. Each CLB can perform two independent Boolean functions of up to four variables, a function with five variables and a four variable function in certain instances, or a single function up to nine variables. The resulting signal can be combinatorial or registered by the two flip-flops. A careful study of the simplified block diagram in Figure 2.6 should clarify the functionality of the CLBs. The resources of the CLBs will determine the final size needed to implement a design. The XC4010 also has 160 1088 to control and condition signals entering and leaving each chip. These [083 are used by arranging the chips in a linear array using two edges of the chip to communicate with each other and a third edge to a central crossbar as can be seen in Figure 3.2. The left and right edges can talk only to the adjacent chips through a 36-bit data path while the third edge can communicate with up to five different chips through the crossbar. X1 X2 X3 X4 X5 X6 X7 X8 X0 Crossbar X16 X15 X14 X13 X12 X11 X10 X9 Figure 3.2. Splash board 24 The crossbar is a 36-bit data path arranged in 8-bit segments for a total of four 8- bit slices and one 4-bit slice. Each slice can be configured to transmit data in one direction. It can store and switch among eight different configurations connecting any of the bit slices from one chip to another. Control of the crossbar is handled by the seventeenth FPGA, X0. Each PE (Processing Element) also has a 256K by 16-bit memory associated with it. This is accessed through the fourth edge of the chip with 18 address lines and 16 data lines. Timing of the memory is handled separately by the Splash 2 system. A read operation requires two global clock cycles, one to latch the address and the following cycle to read the data. Although it is possible to pipeline back to back reads, a read followed by a write must be separated by at least one clock cycle. Since a write is initiated by placing an address and the data to be written, a read followed by a write would require the used data lines. Control of the Splash 2 system is handled by a SPARC host attached to the board through a VME backplane. The Splash 2 is scalable by adding up to 255 more Splash VHDL Source I Simulation l.— Logic Synthesis Timing l Extraction Place and Route I I Splash 2 Figure 3.3. Design flow boards to this backplane. Each board will have its X1 chip attached to the previous boards X16 chip and so forth. The host can be interfaced with C libraries or a symbolic debugger known as T2. These software interfaces allows the user to map bitstreams to specific FPGAs, control the clock, and provides many other debugging and I/O functions. 25 The design flow for programming the Splash 2 system is depicted in Figure 3.3. Programming of each PE is done with a hardware description language, VHDL. Physical pins of each PE are mapped into logical signals using a combination of Splash 2 and vendor libraries. Simulation can be done to ensure that the algorithm is correct using Synopsys design tools. After simulation, the user synthesizes the VHDL code into a gate level description using Synopsys compilers. At this point, the Xilinx software is invoked to automatically place and route the desired hardware. The final result is a timing analysis of the design and a bitstream that can be downloaded to the Splash 2 system. 3.2.2 Dex JR. The framework of our new processor, the Dex-II, comes from a simple, clean RISC architecture called the Dex JR [31]. The instruction set consists of load/store instructions, three mathematical operations, and conditional branching. The design of the Dex JR. called for a four stage pipeline: Fetch, Decode, Execute, and Writeback. This balanced our pipeline to provide the best possible cycle time with the components that were available. A four port, 32x32-bit register file provides operands for mathematical operations as well as absolute addressing for memory. Four ports were necessary in order to write a single result and supply two operands per cycle. The fourth port was exploited by implementing a double load from memory. This special instruction proved to be a significant factor in performance by increasing the bandwidth to memory. Since all the mathematical operations take two operands, this reduced the time to load the operands from two cycles to one. Programs which require intensive memory cycles such as array and matrix operations also benefited from the ability to load multiple values. Data dependencies of consecutive instructions were eliminated by forwarding paths. Scheduling of the correct forwarding path for execution is done statically during compile time. Only one delay slot is necessary for the branch commands as the execution 26 of the branch was done in the Fetch stage. The delay slot comes from waiting for condition codes to be resolved in the Execute stage. The global clock is broken down into a four phase clock to perform memory operations. This ensures that the setup and hold time requirements are met. The instruction length is 32-bits non-encoded and the data and instruction caches are independent of each other. The data path, designed as a 32-bit architecture, is shown in Figure 3.4. 3.2.3 Dex-II Since FPGA technology is more complex than dedicated chips, the top speed and performance will always lag the most current ASIC technologies. In order for an FPGA- based architecture to compete, it must be able to prOVide unique features through its programmability. This feature allows Splash 2 to prototype an ISP with the added performance of the latest architectural advances. The goal, therefore, was to increase processor throughput with a redesign of the Dex JR. by doubling the processor power. This is accomplished by designing the architecture to execute two instructions per clock cycle instead of one. The Dex-II is a VLIW version of the Dex JR. A VLIW architecture was chosen for its relative simplicity of hardware. This was a major concern as the size of designs for each PE was limited by the physical FPGA itself. The architecture implemented on the Splash 2 is also sealed from the original 32—bit architecture down to a 16-bit data path due to the limited communication resources between PBS. The functionality of the instruction set was retained. 27 Merril Mom I MDR 2 Wtiic AddressA “ 3 Write Address B AM29C334 Register Reed Address A File Reed Address B A B Immediate Memz M ernl | l I Forwarding Path l Adder puapiieTl . r1:7 BecktoCoutrolPafli Figure 3.4. The Dex JR. The pipeline depth was also altered to three stages due to the increase from a four phase clock to a six phase clock. This was to accommodate the need for more communication between stages and implementation across multiple PEs of the Splash 2 board. This reduction in pipeline stages also eliminates several forwarding paths. With communication resources so highly limited, this tradcoff was necessary to retain data coherency. 28 The instruction length is doubled to 64 bits to execute two instructions. Extra bits are used to pass register destinations of both processors determined at compile time to retain data coherency. The redesigned data path is shown in Figure 3.5. The box enclosing the functional units allow any combination of two instructions to be executed with its own operands. Both memory and registers are kept coherent with a combination of hardware and compile time methods. Dexoilfl Register Address , , Register File I I I I IOPIAITIOPAITI @Code 7 Add/Sub Mult Memory 93nd Codes Figure 3.5. The Dex-II 29 3.2.4 The Instruction Set The instruction set consists of 14 different operations, as listed in Table 1. This reduced instruction set allows more complex functions and memory addressing by the combination of these simpler instructions. The Dex-II allows any two of these operations to be executed at the same time except for branching. Table 1. Instruction set Instruction Description Register Transfer Level Rx, Ry, Rz LD Load from memory Rx <- (Ry) LDI Load immediate Rx <- immediate ST Store to memory (Rx) <- Ry ST Copy Rggister Rx <- Ry BRA Branch PC <- Address BZ Branch if Zero PC <- Address if Z BN Branch if Neg. or Zero PC <- Address if N BNZ Branch ifijg. or Zero PC <- Address if NZ BNV Branch if Neg. or oVfl. PC <- Address if NV ADD Add Rx <- Ry + Rz SUB Subtract Rx <- Ry - Rz SFTL Shift Left. Rx <- Ry shifted left 1 SFI‘R Shift Right Rx <- Ry shifted right 1 ‘ MULT 8-bit Multiply Rx <- Ry * Rz NOP No Operation 30 ALU Operations, Load, Store 31 27 19 15 14 10 9 S 4 O lOpcodel I OpA l 0138 lDestToplDestBotl Load Immediate [31 27 25 10,9 5 4 o [ OpcodeJ J Immediate Value lDest Top I Dest Bot I Branch [31 27 26 10, l Opcode I I Destination Address I I Operation Opcode LD 00100 LDI 00101 Registers are addressed from ST 00110 R1-R31. R0 will always return a ST 00111 zero and writing to it will not ADD 10000 affect it. SUB 10001 SFTL 10010 SFI‘R 1001 l BRA 01000 32 01001 BN 01010 BNZ 01011 BNV 01 100 NOP 00000 Figure 3.6. Instruction format 31 Figure 3.6 depicts the three different instruction formats. The OpA and OpB fields identify the source registers for the execution units. The Dest Top and Dest Bot fields identify the destination register for the top half of the processor and the bottom half of the processor. Both halves will have identical destination registers, however, their Opcode and source registers will differ. Registers are addressed from 0 through 31 represented in standard binary form. By making the destination registers the same on the top half and bottom half of the processor, we encode information in the instruction rather than passing that information during runtime. This ensures that all instructions affecting registers will update all four PBS in the decode stages concurrently with a minimal exchange of information. There are also extra bits and room left in the opcode for future expansion and addition of instructions. Two possible additions are register windowing instructions and paged memory to exploit unused memory. 3.3 Design Constraints The Dex-II was designed to keep wasted NOPs at a minimum. However, spatial and temporal parallelism introduces certain restrictions that must be observed. These constraints are classified as data, memory, and control hazards. 3.3.1 Data Hazards Data hazards are constraints caused by the data path. Two operands that have data dependency between them cannot be issued on the same clock cycle. Figure 3.7 shows an example of data dependent instructions. The first example shows two instructions that cannot be executed together since the second instruction is dependent upon the first instruction. This is executed correctly by the insertion of NOPs in the second instruction slot to delay execution until the result has been computed and saved to register 3. True data dependencies can not be avoided, however, certain architectures implement special hardware to reduce their impact. 32 Example. The following instructions will produce erroneous results ADD R3,R1,R2 l"R3<-R2-i-R2 . ADD R5,R3,R4 ‘R5<-R3+R4 This is due to the fact that R3 will not be computed in time for the second instruction. In order to execute correctly, either another instruction needs to be scheduled in its place, or a NOP must be injected ADD R3.RI.R2 NOP ADD R5. R3. R4 NOP Figure 3.7. Data dependency The ALU of the Dex-II, unlike the SuperSPARC, are completely separate from one another. The cascaded ALU scheme shown in Figure 3.8 is used in the SuperSPARC to eliminate this restriction. If the second instruction depends on the first, than they are executed sequentially with ALUC, otherwise they are executed independently on ALU2 and ALUO. This scheme was implemented at the expense of another pipeline stage. For the highly piped design, this solution is feasible, however, for the Dex-II this would effectively give the same performance as the example with two instructions. Even worse, the addition of another pipeline stage would affect the performance of other instructions. v I ALU2 ALUO ALUC 33 Figure 3.8. Cascaded ALU Read after Write (RAW) and Write after Read (WAR) data dependencies are eliminated by the decode/writeback scheme employed. This scheme ensures that registers are updated prior to broadcasting the operands for the next instruction. Details of the implementation can be found in the Synthesis chapter. 3.3.2 Memory Hazards Care must be taken to avoid assigning the same register or memory location with different values on the same cycle. Figure 3.9 shows code which tries to write two different results into the same register location. Example. ADD R3,R1,R2 *R3<-R1+R2 ADD R3,R4, R4 * R3 <- R4+R4 This will corrupt the register file. ADD R3, R1, R2 NOP *Or reschedule another instruction ADD R3, R4, R4 NOP Figure 3.9. Memory hazard Writing different values to the same memory address or register in the Dex-II will cause the top and bottom processor to have conflicting memory and registers. This may force programs that dynamically address memory to insert NOPs to ensure data coherency. 3.3.3 Control Hazards Keeping both processors synchronized is a challenging task. Since the top and bottom processors have independent program counters, they need the same information in 34 order to branch correctly. Since most conditional branching occurs on comparisons, condition codes are only generated by the Add/Sub unit. This means that both instructions setting the condition codes must be the same and that the conditional branch must follow with a one instruction delay. Figure 3.10 illustrates how control hazards are resolved in the Dex-II by this strategy. Example. SUB R3,RI,R2 *Sets uptopCC SUB R3,RI,R2 l"SetsupbottomCC NOP " Or filled with some useful instruction NOP * Or filled with some useful instruction BBQ 1000 *Branch to $1000 if zero flag set BEQ 1000 *Branch to $1000 if zero flag set Figure 3.10. Control hazard 3.4 RTL Specifications of Dex-H To implement the Dex-II on Splash 2, we begin by developing a simple RTL (Register Transfer Level) description of the Dex-II. This allows us to partition our design according to resources and communication paths available on the Splash 2. It became very apparent that the communication paths offered by the crossbar and linear array would be the determining factor of how our design would be implemented. The number of clock cycles needed to implement an instruction cycle was determined by the data paths between PBS. After this was established, a detailed RTL that is easily translated into VHDL code was simulated and synthesized. 3.4.1 PE Partitions Mapping a design over to the Splash system is a challenging task. Each processing element is limited by the number of CLBs, number of I/O pins, and the amount of memory available per PE. Even with larger FPGA chips, the amount of -—. . ,l' 35 hardware that can be assembled onto them is still somewhat limited. With this in mind, we begin by splitting the resources stage by stage. The partitioning of functions is shown in Figure 3.11. Processing elements PEI - PE8 represent half of the complete VLIW processor and can stand alone to execute scalar RISC code. This half of the processor will be referred to as the top processor hereafter. Elements PE9 - PE16 are the mirror hardware to implement the second processor. This half will be referred to as the bottom processor. The control path is implemented across the first two His and supplies a 32-bit instruction word to its half of the processor. Since the memory is only l6-bits wide, two elements are needed to generate the 32-bit word. Elements PE] and PE2 also house the program counter, the instruction memory, and the conditional branching logic. The decode stage contains the register file and circuitry to forward data. We chose to use two elements in this case to provide extra bandwidth to allow dual reads from memory. This allows us to shorten the number of clock cycles needed to complete a single instruction cycle. The execute stage consists of four functional units: Two adder/subtraetor units, a multiplier, and the memory manager. This setup is cloned on the bottom to provide equal I I I I P51 PE2 l 1253 P54 l P55 P56 P57 PE8 I I | I I | | I Fetch I Decode I & I & l Execute Branch I Writeback I l | l I l | | | PE16 P515 : PE14 P513 : P512 P511 P510 P59 I I Figure 3.11. PE partitioning 36 functionality on both the top and bottom halves of the processor. The memory units of PE8 and PE9 utilize the extra left-right communication lines between them to keep the memory coherent. 3.4.2 Communication Paths The amount of communication between PBS was critical in this partitioning scheme. As noted before, the major problem encountered here is the limited number of communication paths to implement bus structures. While the crossbar is capable of handling multiple connections, we cannot, for example, selectively broadcast a 16-bit operand from the decode stage to each of the four execute stages. The left/right communications of each chip are dedicated to the control path. Instructions issued from PEI and PE2 flow to the right and supply each pipeline stage with the next instruction. The data path, therefore, is limited to using the crossbar for passing operands and results around. Using the programmable crossbar, each cycle could send and receive data from different PEs. i l l I PEI PE2 PE3 PE4 PES P56 PE7 PE8 l l I I Figure 3.12. Cycle 1 Figure 3.12 shows the first of six cycles. PEI and PE2 fetch the next instruction from their respective memories while PE3 and PE4, the Decode stage, fetch the operands from their register file memory. The 16-bit results from the Execute stage from PBS and PE6 are sent back to the Decode PEs through the crossbar. If the instruction in the Execute stage is a memory store, then PE8 performs the store operation. 37 PE++PE2 PE3 PE4>PE5 P56 PE7 PBS 4___4.____5r__L__l I l Figure 3.13. Cycle 2 Figure 3.13 depicts the communication paths for the second cycle. The Fetch stage swaps information about the new instruction and forms the full 32-bit instruction word. The Decode stage begins to pass the next instruction along to the Execute PBS and receives the final two results from PE7 and PE8. PBS and PE6 now send the results of the condition codes to the Fetch stage while PE7 and PE8 complete their memory reads. PEI Pm PE3 PE4 Figure 3.14. Cycle 3 The third cycle is where synchronization with the second processor occurs. Information about the destination register of the second RISC processor is encoded in the instruction word, therefore, only the data needs to be communicated. Figure 3.14 shows the Decode stage sending and receiving the 16—bit results from the second processor. Meanwhile, the execute stage continues to pass along the next instruction and PE8, the memory manager, exchanges information on what kind of memory Operation is performed. The Fetch stage at this point also updates its PC (program counter). If a branch is indicated, the PC loads the absolute address from the instruction word, otherwise, the PC is incremented by one. 38 PE] PE2 PE3 PE4 PES PBGHPPN PE8 Figure 3.15. Cycle 4 Cycle four shown in Figure 3.15 begins the writeback sequence. PE3 and PE4 writes the result from the Execute stage into the register file. Since only one operand can be written at a time, the result from the second processor must wait. The memory manager swaps address and operand information. If the second processor performed a store operation, then we must also perform the same store operation to keep the two memories coherent. i l PEI PE2 +> PE3 PE4 PBS PEG PE7}> FEB I 1 Figure 3.16. Cycle 5 Figure 3.16 shows the two decode stage PEs writing the results from the second processor and sending the operands for the next instruction to PBS and P136. The next instruction for the Decode stage is passed from the Fetch stage and the final execution PE gets the next instruction as well. Figure 3.17. Cycle 6 39 The final cycle shown in Figure 3.17 illustrates the Decode stage passing operands to the last two PBS. The second Decode PE receives the next instruction. Each PE updates its respective instruction buffer to begin the next instruction. 3.4.3 Fetch Stage The Fetch stage, whose operations are described with RTL statements in Figure 3.18 keeps track of the current program counter (PC). Using the PC as an address, it reads instructions from PE] and PE2 in phases 1 and 2 to form a complete 32-bit instruction. Since each PE only has a 16-bit memory chip, to form the complete instruction, PEI must send the most significant word to the right and PE2 concatenates the two to send to the Decode stage. Notice that the branch instructions are also executed in this stage. In order to get the correct branch address, PEl must have portions of the lowest significant word. Therefore, PE2 must also send its values to the left. This occurs from phase 2 to phase 3 through the XP_I..eft and XP_Right data path. Cycle 4 is where the actual branch takes place, otherwise PC merely increments by one. 3.4.4 Decode Stage The Decode stage is the most complex of all the stages. It needs to supply all four execution units with two operands each and receive the results from all four execution units. A multiplexer selects the correct result and the information is swapped between the top and bottom processor to ensure the register files are identical. This is the stage that dictated the six-phase clocking scheme. 83.58: .5. as»... 2.... 55... lllllllllllllllllllllllllllllllllllllllllllllllll W I Ioluw meammuwm m Emma maloh .w .omI I I I I wwulnwwm @memm mmwflcmmflvlumI I I I I I I I m I fine; a .8 a $2 .v on figs a so: a :8 .v on— IIIIIIII a Edmlxwvaeflflmflamomml I I I I I I I I I I I meEHdevaodmmmownm I I I I I I I I w I 333 55% .V 00 Emma .amx -v 00 IIIIIIIIMMXHW—olflawufilumw IIIIIIIIIIII m .mszWWWWflEIEMWmWIIIIIIIINI MG: IV 5. m: :30": MD: .V 37 SH :30": Um .V 3 Oh .V g .H Nm—m Mmm mafia 63m 654 BEE“; Hoammwom 41 The Decode stage begins by assuming that the operands are in memory and starts the read process. The S-bit address is taken directly from the instruction word Id. The only difference between the PB3 and P54 is which operand they read. In the second phase, the value of the register is read from memory. The results from the first execution unit (PBS) is also read into both decode stages. Since the Xbar can be configured in four 8-bit segments, each decode stage can accept up to two 16-bit results each phase. The third phase latches the results from execution units 3 and 4 (PE7 and PB8). The results from each execution unit are funneled into a 4—1 mux and selected by the opcode of the executing instruction, Ix. The output of the mux will contain the correct result from the instruction in the execute stage of the pipeline. Note that the mux is not latched and thus provides the correct value by the end of the third phase, even though the inputs were latched at the beginning. The execution result is latched from the output of the mux at the beginning of phase 4 on the top and bottom halves of the processor. Likewise, the top latches the bottom result in phase 4. With both results available, the decision to forward operands can be made. If the source registers in the decode stage match the destination register in the execute stage, than the operand is replaced with the new result. The correct operand begins to broadcast its values to execution units 1 and 2 (PBS and PE6). Cycle 4 also physically writes the results back into the register file if the destination is not R0. Cycle 5 writes the bottom half of the processor result if the destination is not R0. Both operands are sent to execution units 3 and 4 (PB7 and PBS). The last phase latches the next instruction and moves the current Id to Ix. 2 4 .wafifiaflovg mo 83 .8 :32? 93 >05. 60:82-5: 3.558 0.3 839 82E. * Santos: a: 883 .2...” 8E... M03835 98 £595 364 commas; eoemwwom I ._. 3 u wnmadx 13 “an": @3573: o IIIIIIIIII II 3.3: I IIEWMIIIIIIIII ImdovafimmmwlllIIIIllIII Id34v4o4mfiamx m machv an: meoev ~52 “cm .309 IV Mg «cm .309 .V g IIIIIIIIII deciflqqaameIIIIIIIIIIIIIfiaMaIadflNIIIIIIIwI Samoan—.2 .v ~52 gnomes—2 .v ~52 m8. .65 .v ~22 mob amon— .v ~22 8 - m: smx .v gave 8 .. m: amx -v mac... mg u 89on a 8 - m: amx .v m8 30m u 89on a 8 - m: amx .v <8 mmom unease: Seamus: .v m8 «amalgam: 233%: .v <8 IIIIIII Hamelmflsflm aunwfiwmwl I I I I I I I I I A... maamxwaflfl .Iflqamm. I I I I I I I w. I § - Eamx -v as: 8 - Eamx .v as: 8 - Eamx .v 32 .2 - 55x -v as: IIIIIIIIII .awmmmxuhwélllilInuawhmaumwwmsuaulliuml 32 .v m8 «a: .v <3 mwom .v «<2 $.m3¢..¢<§ A9 . Oékml AddedRB4attlck24 BoardOPE30flsetO 00000030000000100010002000A Loop ADD R3,RI,R2 NOP BoardOPE 130MOW0rd832 00000020000000100010002000A T2:6>l AddedRBSaitlckm BoardOPanfisetO 00000010000000100020002000A ST R1,R2 ST R2,R3 BoardOPE 13 OffsetOWords 32 000000:0000000100020002000A T2:7>l AddedRBBattlckas BoardOPESOflsetO 000000: 0000 0001 0002 0002 000A BRA Loop BRA Loop BoardOPE taoflsetOWords 32 00000020000000100020002000A T2£>I AddedRB7attick42 BoardOPanflsetO 000000:0000000100020003000A Loop ADD R3,RI,R2 NOP BoardOPE 13 OffsetOWords 32 000000:0000000100020003000A Figure 5.1. Fibonacci Execution Results The VLIW still outperforms the RISC processor by reducing the number of cycles needed in the calculation loop. Note in Figure 5.2 that while the RISC machine uses four instructions, the VLIW is capable of producing the same results using only three. This represents a 33% increase in the performance of our new machine compared to the RISC by increasing the IPC to 1.3, or 2.47 MIPS. While lower then the overall comparison, this represents a more realistic gain in performance. To demonstrate the flexibility of the architecture, we also attempted to tailor the architecture to our problem and get a faster design. Since the Fibonacci program only required the add function to be implemented as an execution unit, we removed the excess 59 RISC LDI R1,l LDI R2,1 Loop ADD 113,111,122 ST R1,R2 ST R2,R3 BRA Loop ELEM LDI Rl,1 LDI R2,] Loop ADD R3,RI,R2 NOP ST R1,R2 ST R2,R3 BRA Loop BRA Loop Figure 5.2. Two Fibonacci programs functionality which improves the performance of the most critical PE. As a result, our cycle time was reduced to let our system clock run at 12.7 MHz with a peak RISC MIPS 'of 2.11 and a peak VLIW MIPS of 4.23. This represents a 14% increase in performance of both processors. 5.3 Bubble Sort The bubble sort algorithm represents code that performs many comparisons, memory accesses, and several conditional branches. This kind of code introduces many NOP cycles that decrease the performance of the processor. The initial program is first introduced followed by an optimized program utilizing some of the techniques introduced previously. The results will give a better understanding of the necessity of compiler optimizations. Figure 5.3 shows a non-optimized version of the bubble sort program for our RISC machine and Figure 5.4 is the same program on our VLIW version of the machine. As we can see, we initially reduce the total number of instructions from 23 to 19. Out of 60 I LDI R1, 0 2 LDI R2, 0 3 LDI R3, Length 4 LDI R8, 1 5 Loopl SUB R4, R3, R1 6 NOP 7 82 End 8 LD R2, R1 9 LD R5, (R1) 10 Loop2 LD R6, R2 11 SUB R4, R5, R6 12 NOP - 13 BN Swap 14 Check ADD R2, R2, R8 15 SUB R4, R2, R3 16 NOP 17 BN Loop2 18 ADD R1, R1, R8 19 BRA Loopl 20 Swap ST (R1), R6 21 ST (R2), R5 22 LD R5, R6 23 BRA Check 24 End Figure 5.3. RISC bubble sort program these four, only two represent a significant gain in performance. The initialization routine was shortened from 4 to 2 instructions. While this is significant, the program spends a minimal amount of time in this routine. The loops, however, are where the bulk of the program execution takes place. The results here were marginal. The only improvements in performance is observed in Loopl and the Swap routine, both reducing the number of instructions by one. This represents only an 11% increase in performance. To optimize the VLIW machine, we begin by trying to move instructions into the NOP cycles. Instruction #6 can be moved up to instruction #4 since we either perform the instruction or end the program. Instruction #15 now can moved up to the NOP of instruction #13, however, we must take into account what happens if the conditional branch is true. In order to maintain the functionality, we must add compensation code at instruction #9 to undo our speculative execution. We can also move instruction #11 up to instruction #90, however, we would need to add two instructions to compensate in the 61 l LDI R1, 0 LDI R2, 0 2 LDI R3, Length LDI R8, 1 3 Loopl SUB R4, R3, R1 SUB R4, R3, R1 4 NOP NOP 5 B2 End 32 End 6 LD R2, R1 LD R5, (R1) 7 Loop2 LD R6, R2 NOP 8 SUB R4, R5, R6 SUB R4, R5, R6 9 NOP NOP 10 BN Swap BN Swap 11 Check ADD R2, R2, R8 NOP l2 SUB R4, R2, R3 SUB R4, R2, R3 13 NOP NOP 14 BN Loop2 BN Loop2 15 ADD R1, R1, R8 NOP 16 BRA Loopl BRA Loopl 17 Swap ST (R1), R6 ST (R2), R5 18 LB R5, R6 NOP l9 BRA Check BRA Check 20 End Figure 5.4. VLIW bubble sort program Swap routine. This would shorten one path and lengthen another. Since the number of times the Swap routine is called is dependent on the data set, making this substitution would not be beneficial without prior knowledge of the data. The optimized version of the VLIW code shown in Figure 5.6 manages to shorten Loopl to 3 instructions and Loop2 from 10 to 9 instructions. To be fair, we perform the same optimizations on our RISC processor. This time, we can only move instruction #8 to instruction #6. The RISC code is shown in Figure 5.5. The final comparison shows a total of 22 instructions for the RISC and 17 for the VLIW. Code size of the VLIW was reduced to 74% of the RISC code. If we consider the performance of the loops, we managed to reduce the total number of instructions from 18 down to 15. The optimized VLIW code executes 21 useful instructions in 17 clock cycles for a IPC of 1.23. The RISC performs 20 useful instructions in 22 clock cycles for a IPC of .90. In this case, the high number of conditional branches with the lack of computation ' between sections would not benefit from the ability to schedule operations into the redundant control instructions. This penalty for conditional branches occurs whenever a 62 program executes several branches in succession. The lack of useful instructions between each section makes moving code across branches difficult to impossible. It becomes evident that smart compilers are necessary if we want to see an effective VLIW processor utilized in the future. While smart compilers are needed for scheduling instructions in any superscalar processor, methods of scanning larger blocks of instructions during run time can be used to reduce the impact of bad compilers. However, run-time methods will require extra cycle time that may not be practical in a super-pipelined, superscalar architecture. Static scheduling will still be faster ultimately. The programs used to characterize our architecture are simple examples. These programs allow us to visualize the effectiveness of our instruction set. They also allow us to see the shortcomings in how we deal with the various hazards. The bubble sort program, for instance, clearly shows that we need a better way to handle control instructions. The Splash 2 also does not support functions such as floating point arithmetic. To execute real applications and benchmarks, these functions would need to be emulated on the Sun host, or additional hardware must be added to the Splash 2. “5 ‘ ‘N~1“":."l!m- 63 l LDI 2 LDI 3 LDI 4 LDI 5 Loopl SUB 6 LD 7 82 8 LD 9 Loop2 LD 10 SUB I 1 NOP 12 EN 13 Check ADD 14 SUB 15 NOP 16 EN 17 ADD 18 BRA 19 Swap ST 20 ST 21 LD 22 BRA 23 Fad 5.555.553.1933 aa-s°° '2': "3 .5335 neg Figure 5.5. Optimized RISC bubble sort program 1 LDI R1, 0 LDI R2, 0 2 LDI R3, Length LDI R8, 1 3 Loopl SUB R4, R3, R1 SUB R4, R3, R1 4 LD R2, R1 LD R5, (R ) 5 82 End BZ End 6 Loop2 LD R6, R2 NOP 7 SUB R4, R5, R6 SUB R4, R5, R6 8 SUB R1, R1, R8 NOP 9 EN Swap BN Swap 10 Check ADD R2, R2, R8 NOP ll SUB R4, R2, R3 SUB R4, R2, R3 12 ADD R1, R1, R8 NOP 13 BN Loop2 BN Loop2 l4 BRA Loopl BRA Loopl 15 Swap ST (R1), R6 ST (R2). R5 16 LD R5, R6 NOP 17 BRA Check BRA Check End Figure 5.6. Optimized VLIW bubble sort program CHAPTER 6 Conclusions and Future Investigations This chapter summarizes our findings and evaluates the success of our implementation of an ISP on the Splash 2. The possibilities of using Splash 2 as a tool in studying computer architectures and compilers is also presented. Lastly, proposals for future investigations and projects are offered as possibilities to extend the study of computer architectures on Splash 2. 6.1 The Dex-II Evaluation The successful implementation of our design did achieve the goal of implementing a RISC and a VLIW architecture. The clock speed of 1.85 MHz allows the Dex-II to execute large programs in a reasonably short period of time. This allows for characterizing architectures and architectural features of real applications which would otherwise be prohibitive in simulation. Even so, the performance is too slow to be useful in practical applications. The primary factor for poor cycle time stems from the fact the architecture suffers from an unbalanced pipeline. Empty slots evident in phases of each unit except the Decode stage signify a poorly balanced pipeline. This was the result of waiting for operands to be passed around on an inadequate interconnect array. The pipeline may be better balanced by adding functionality in the Fetch and Execute stages decreasing the 65 amount of instructions needed to complete tasks. Additional functionality, however, starts to drift away from the RISC paradigm, and other problems may arise. Since the design is based on VHDL and automated synthesis for FPGAs, changes in the architecture can be quickly implemented as seen with the Fibonacci program. Additional hardware features can be added with relative ease. Support for register windowing can be added with the addition of some opcodes. Subroutines and jump commands can also be implemented onto the Fetch stages with relative ease. Specialized execution units can be programmed in an attempt to increase the performance of the processor for specific applications. Additional hardware to measure performance could also be added to monitor processor usage. This would provide invaluable data to support trace scheduling techniques. 6.2 VLIW vs. RISC The Dex-II implements two RISC processors side by side on the Splash 2. This design permits two RISC instructions to execute in a single clock cycle. Although there are some data and control hazards that must be avoided, the VLIW architecture is successful in increasing the IPC. For the VLIW architecture, we achieve an IPC of 1.50 for the Fibonacci program and 123 for the Bubble Sort program. The RISC versions of the same code achieved an IPC of 1.00 and .90, respectively. The prototype processor indicated problems with our control scheme. The results from our program analysis focuses attention at the double branch instruction which was required to keep both processors in synchronization. This problem did not exist for the single RISC processor model. Future designs with this architecture must successfully implement a more unified control path to control the multiple data paths. 6.3 Splash 2 Evaluation Computers were originally designed as machines that could solve many problems by using different programs. With FPGAs, programmers and designers now have the ability to alter the machine to fit the problem. The success of these arrays will be dependent upon the resources available to each FPGA. While logic emulators built with arrays of FPGAs can be used to prototype circuits, it will be the resources such as the memory and the crossbar of the Splash 2 that will offer a new level of flexibility to prototype more complex architectures. The bottleneck for successful design on the Splash 2 was the lack of communication between chips to implement bus structures. While we were able to achieve a respectable clock rate of 11 MHz from the FPGA, the six-cycle division forced by communication restrictions effectively reduced our cycle time to a little under 2 MHz. In order for faster speeds to be obtained, the level of interconnects needs to be enhanced through a crossbar with greater functionality as well as FPGAs with greater I/O capabilities. The lack of floating point processing is also a factor that will limit the usefulness of FPGA-based arrays. Floating point processing is required in many applications in both science and engineering. These types of computations are usually highly repetitive and can be structured in such a way to take advantage of the reconfigurable architecture. Although the use of synthesis tools considerably shortened the design cycle, these tools were not perfect. As the efficiency of the synthesis tools and number of FPGA libraries increase, these synthesis tools will improve the overall performance of reconfigurable computing systems. The size of our design generated by the synthesis tools utilized less than 50% of the CLB resources. This suggests that our design is close to the optimal speed as the routing tools had plenty of space to route signals within the lmauurmn... g—nw . .v. .m... 67 FPGA. By adding more functionality to our designs, these tools become even more significant as they directly impact the speed of the design. 6.4 Future Investigations Although the practical uses are limited at this time, the Dex-II does offer substantial research opportunities. FPGA-based systems, for example, can provide an invaluable tool for compiler testing. Different architectures can be simulated on the Splash 2 to execute compiled codes for various scheduling techniques. As reconfigurable processors become a reality, new compiler technology can be rapidly adapted. Synthesis tools will also play a large role in the success of reconfigurable processors. Compilers no longer need to be written and optimized for a machine, but the machine can now be optimized to support the compiler. Speculative execution is another method to decrease execution times. Guarded execution and complicated shadow registers can be used to let speculative instructions run their course. This method is not very feasible in our design due to the restrictions of the board itself, however each board as a whole may be configured as a shadow system. A system can be implemented where the host is used to synchronize several Splash boards executing a program. When a program branch is encountered, both routes are sent off to two boards configured as a processor. The host needs to maintain the actual state of the machine but can orchestrate hundreds of these processors. Threaded programming is also an emerging software methodology that would support a co-processor approach. Separate threads can be run on independent boards without the communication overhead to track the actions of other threads. Memory would be managed by the host and treated as a cached memory system. The Dex-II is an aggressive utilization of the resources available from the Splash 2. The architecture is far from perfect, however. The design is limited to a 16-bit architecture until even larger chips make it possible to compact larger designs and 68 provide more communications. However, the ability to change designs so quickly does make the Splash 2 and general FPGA-based arrays very attractive. A key factor in RISC development was the shortened turnaround time for each new architecture. FPGA-based systems would allow new architectures to be prototyped quickly. This provides a powerful tool for computer and compiler designers to test and implement new designs and ideas. APPENDICES Appendix A Synthesis Code A1 Control1.vhd (PEI) architecture FetchJeft of Xilinx_Processing_Part is signal PC: Bit_Vector (17 downto 0); signal loadaddress: Bit_Vector (15 downto 0); signal Ix: Bit_Vector (31 downto 0); signal Id: Bit_Vector (31 downto 0); signal Ifetch: Bit_Vector (31 downto 0); signal increment_pc: Bit; signal branchbit: bit := '0'; signal resetzbit := '0'; signal local_clk: integer range 0 to 6; signal one: Bit_Vector (17 downto 0); signal CC: Bit_Vector (3 downto 0); signal Right: Bit_Vector (31 downto 0); signal mdr: Bit_Vector (15 downto 0); signal Xbarin, Xbarout: Bit_Vector (35 downto 0); signal Xbarenable: Bit_Vector (4 downto 0); begin RIGHTIN: for i in o to 15 GENERATE Pad_lnput(XP_right(i), right(i)); END GENERATE RIGHTIN; RIGHTOUT: fori in 16 to 31 GENERATE Pad_0utput(XP_right(i), right(i)); END GENERATE RIGHTOUT; XP_Mem_RD_L <= '0'; increment_pc <= '1'; one <= ”000000000000000001"; process begin 69 70 wait until XP_Clk'Event and XP_Clk = '1'; local_clk <= Iocal_clk + 1; if Iocal_clk = 5 then Iocal_clk <= 0; end if; Pad__lnput (XP_Mem_D, mdr); Pad_0utput (XP_Mem_A, PC); if Iocal_clk = 1 then lfetch(31 downto 16) <= mdr; right(31 downto 16) <= mdr, Xbarenable <= '01 111"; end if; if Iocal_clk = 2 then [fetch ( 15 downto 0) <= right (15 downto 0); end if; if Iocal_clk = 3 then PC <= PC + one; if (Ifetch(31 downto 30) = 01) then PC(15 downto 0) <= Ifetch (25 downto 10); end if; end if ; end process; XP_Mern_WR__L <= '1'; XP_HSO <= '2'; XP_GOR_Result <= '0'; XP_GOR__VaIid <= '0'; XP_Int <= '0'; end Fetch_left; A2 ControlZ.vhd (PE2) architecture Fetch_right of Xilinx_Pnocessing_Part is signal PC: Bit_Vector (17 downto 0); signal one: Bit_Vector (17 downto 0); signal Ifetch: Bit_Vector (31 downto 0); signal resetzbit := '0'; signal Iocal_clk: integer range 0 to 6; signal CC: Bit_Vector (3 downto 0); signal Left: Bit_Vector (31 downto 0); signal mdr: Bit_Vector (15 downto 0); signal Xbarout, Xbarin: Bit-Vector (35 downto 0); signal Xbarenable: Bit_Vector (4 downto 0); 71 begin LEFTIN: fori in 16 to 31 GENERATE Pad_Input(XP_left(i), Left(i)); END GENERATE LEFTIN; marrow“: for i in o to 15 GENERATE Pad_Output(XP_left(i), 1211(1)); END GENERATE LEFl‘OUT; XP_Mem_RD_L <= '0'; one <= "000000000000000001"; XP_Xbar_EN_L <= Xbarenable; process begin wait until XP_CIk'Event and XP_Clk = 'l'; Pad_Output(XP_Right(31 downto 0), [fetch); local_clk <= local_clk + 1; if local_clk = 5 then local_clk a 0; end if; Pad_Input (XP_Mem_D, mdr); Pad_Output (XP_Mem_A, PC); Xbarenable <== '11111"; if local_clk = 0 then XP_LED c '1'; end if; if local_clk = 1 then Ifetch(15 downto 0) <= mdr; left(15 downto 0) <= mdr; Xbarenable <= "01111"; XP_LED <= '0'; end if; if local_clk = 2 then [fetch (31 downto 16) c left (31 downto 16); CC <= Xbarin (35 downto 32); XP_LED <= '1'; end if; if local_clk = 3 then PC G PC + one; if [fetch(31 downto 30) = 01 then if ((Ifetch(29 downto 27)=000) or (CC(3 downto 1)) = [fetch (29 downto 27)) then PC( 15 downto 0) <= [fetch (25 downto 10); end if; end if; XP_LED <= '0'; end if; end process; 72 XP_Mem_WILL <= '1'; XP_HSO ¢ '2'; XP_GOILResult <= '0'; XP_GOILValid <= '0'; XP_Int <= '0'; end Fetch_right; A3 Decodel.vhd (PE3) architecture Decode_left of Xilinx_Processing_Part is signal Itemp: Bit_Vector (35 downto 0); signal Ix: Bit_Vector (35 downto 0); signal Id: Bit_Vector (35 downto 0); signal Ifetch: Bit_Vector (35 downto 0); signal resetzbit; signal local_clk: integer range 0 to 6; signal Left: Bit_Vector (35 downto 0); signal Right: Bit_Vector (35 downto 0); signal Xbar_en: Bit_Vector (4 downto 0); -- enable bit for Xbar (low) signal zero: Bit_Vector (4 downto 0); signal Rer: Bit_Vector (15 downto 0); signal Mdr: Bit_Vector (15 downto 0); signal RarL: Bit_Vector (17 downto 0); signal Xbarin, Xbarout: Bit_Vector (35 downto 0); signal OpA: Bit_Vector (15 downto 0); signal temp: Bit_Vector (15 downto 0); signal result: Bit_Vector (15 downto 0); signal MuxinO, Muxinl, Muxin2, Muxin3, Muxresult: Bit_Vector (15 downto 0); begin SelectO : mux4_l[-[ port map (Muxin0(0), Muxinl(0), Muxin2(0), Muxin3(0), Ix(30), Ix (31), Muxresult (0)); Selectl : mux4_lH port map (Muxin0(1), Muxinl(l), Muxin2(l), Muxin3(l), Ix(30), Ix (31), Muxresult (1)); Select2 : mux4_1H port map (Muxin0(2), Muxinl(2), Muxin2(2), Muxin3(2), Ix(30), [x (31), Muxresult (2)); Select3 : mux4_lH port map (Muxin0(3), Muxinl(3), Muxin2(3), Muxin3(3), Ix(30), Ix (31), Muxresult (3)); Select4 : mux4_1H port map (Muxin0(4), Muxinl(4), Muxin2(4), Muxin3(4), Ix(30), Ix (31), Muxresult (4)); Select5 : mux4_1H port map (Muxin0(5), Muxinl(S), Muxin2(5), Muxin3(5), *E Select6 : Select7 : Select8 : Select9 : Selecth Selectl 1 Select12 Select13 : Selectl4 SelectlS 73 Ix(30), Ix (31), Muxresult (5)); mux4_1H port map (Muxin0(6), Muxinl(6), Muxin2(6), Muxin3(6), Ix(30), [x (31), Muxresult (6)); mux4_1H port map (Muxin0(7), Muxinl(7), Muxin2(7), Muxin3(7), Ix(30), [x (31), Muxresult (7)); mux4_1H port map (Muxin0(8), Muxinl(8), Muxin2(8), Muxin3(8), Ix(30), Ix (31), Muxresult (8)); mux4_lH port map (Muxin0(9), Muxinl(9), Muxin2(9), Muxin3(9), Ix(30), [x (31), Muxresult (9)); : mux4_1H port map (Muxin0(10), Muxinl(lO), Muxin2(10), Muxin3(10), Ix(30), Ix (31), Muxresult (10)); : mux4_1H port map (Muxin0(11), Muxinl(l 1), Muxin2(l l), Muxin3(11), Ix(30), Ix (31), Muxresult (11)); : mux4_lH port map (Muxin0(12), Muxinl(12), Muxin2(12), Muxin3(l2), _ Ix(30), [x (31), Muxresult (12)); mux4_lH port map (Muxin0(l3), Muxin1( 13), Muxin2(13), Muxin3(l3), Ix(30), [x (31), Muxresult (13)); : mux4_lH port map (Muxin0(l4), Muxinl(l4), Muxin2(14), Muxin3(l4), Ix(30), [x (31), Muxresult (14)); : mux4__1H port map (Muxin0(15), Muxin1( 15), Muxin2(15), Muxin3(15), Ix(30), Ix (31), Muxresult (15)); --Xbarin is from the xbar where xbarout is the value being sent to the xbar... Pad_Xbar (XP_Xbar, XbaroutXbarin, Xbar_en); XP_Xbar_EN_L <= Xbar_en; Xbarout (31 downto l6) <= Muxresult; Xbarout (15 downto 0) <= Muxresult; Muxinl <= OpA; zero <= ”00000”; Pad_Output (XP_right, IFetch); process begin wait until XP_Clk'Event and XP_Clk = '1'; Pad_Input (XP_left, Ifetch); Pad_Output (XP_Mem_A, RarL); Pad_InOut (XP_Mem_D, Rdrl, Mdr, '0'); XP_Mem_RD_L <= '1'; XP_Mem_WR_L <= '1'; temp <= Xbarin (15 downto 0); --Syncing bottom local_clk G local_clk + I; if local_clk = 5 then local_clk G 0; end if; if local_clk = 0 then Pad_Output (XP_Mem_A (4 downto 0), Id ( l9 downto 15)); XP_Mem_RD_L <= '0'; Xbar_en <= ”10000“; end if; 74 if local_clk = 1 then XP_Mem_RD_L <= '0'; Pad_InOut (XP_Mem_D, Rdrl, OpA, '0'); Muxin2 c: Xbarin(3l downto 16); Xbar_en <= ”10000”; end if; if local_clk = 2 then -- get data from Xbar, save the bottom half to write next cycle. Muxin3 <=-—- Xbarin(3l downto 16); MuxinO <= Xbarin(15 downto 0); Xbar_en <= "11100"; end if; if local_clk = 3 then if Ix(9 downto 5) = Id(l9 downto 15) then OpA <= Muxresult; end if; if Ix(4 downto 0) = [d(19 downto 15) then OpA <= Xbarin(15 downto 0); end if; Ix (31 downto 30) <= ”01"; temp <= Xbarin (15 downto 0); «Syncing bottom Pad_Output (XP_Mem_A (4 downto 0), Ix(9 downto 5)); if (Ix (9 downto 5) /= zero) then Pad_InOut (XP_Mem_D, Muxresult, Mdr, '1'); XP_Mem_WR_L <= '0'; end if; Xbar_en <-—- "lllll'; end if; if local_clk = 4 then - need to write bottom Muxresult Xbar_en <= "11111”; Pad_Input (XP_left, Id); Id G Ifetch: [temp <= Id; if (Ix (4 downto 0) /= zero) then Pad_Output (XP_Mem_A (4 downto 0), Ix(4 downto 0)); Pad_InOut (XP_Mem_D, temp, Mdr, '1'); XP_Mem_WR_L <= '0'; end if; end if; if local_clk = 5 then Ix <= ltemp; end if; end process; XP_HSO <= '2'; XP_GOR_result <= '0'; XP_GOR_Valid <= '0'; XP_Int <= '0'; XP_LED <= '1'; end Decode_left; 75 A4 Decode2.vhd (PE4) architecture Decode_right of Xilinx_Processing_Part is signal Ix: Bit_Vector (35 downto 0); signal Id: Bit_Vector (35 downto 0); signal Ifetch: Bit_Vector (35 downto 0); signal reset:bit; signal local_clk: integer range 0 to 6; sigml Left: Bit_Vector (35 downto 0); signal Right: Bit_Vector (35 downto 0); signal WriteEnable: Bit; signal Xbar_en: Bit_Vector (4 downto 0) ; -- enable bit for Xbar (low) signal zero : Bit_Vector (4 downto 0) ; signal Mdr. Bit_Vector (15 downto 0); signal Rer: Bit_Vector (15 downto 0); signal RarL: Bit_Vector (17 downto 0); signal Xbarin, Xbarout: Bit_Vector (35 downto 0); signal OpB: Bit_Vector (15 downto 0); signal temp: Bit_Vector (15 downto 0); signal Muxin0, Muxinl, Muxin2, Muxin3, Muxresult: Bit_Vector (15 downto 0); begin SelectO : mux4_1H port map (Muxin0(0), Muxinl(0), Muxin2(0), Muxin3(0), Ix(30), Ix (31), Muxresult (0)); Selectl : mux4_1H port map (Muxin0(1), Muxinl(l), Muxin2( 1), Muxin3(l), Ix(30), Ix (31), Muxresult (1)); Select2 : mux4_lH port map (Muxin0(2), Muxinl(2), Muxin2(2), Muxin3(2), Ix(30), Ix (31), Muxresult (2)); Select3 : mux4_1H port map (Muxin0(3), Muxinl(3), Muxin2(3), Muxin3(3), Ix(30), Ix (31), Muxresult (3)); Select4 : mux4_1H port map (Muxin0(4), Muxinl(4), Muxin2(4), Muxin3(4), Ix(30), Ix (31), Muxresult (4)); Select5 : mux4_1H port map (Muxin0(5), Muxinl(S), Muxin2(5), Muxin3(5), Ix(30), Ix (31), Muxresult (5)); Select6 : mux4_1H port map (Muxin0(6), Muxinl(6), Muxin2(6), Muxin3(6), Ix(30), Ix (31), Muxresult (6)); Select7 : mux4_1H port map (Muxin0(7), Muxinl(7), Muxin2(7), Muxin3(7), Ix(30), Ix (31), Muxresult (7)); Select8 : mux4_lH port map (Muxin0(8), Muxinl(8), Muxin2(8), Muxin3(8), Ix(30), Ix (31), Muxresult (8)); Select9 : mux4_1H port map (Muxin0(9), Muxinl(9), Muxin2(9), Muxin3(9), Ix(30), Ix (31), Muxresult (9)); Select10 : mux4_1H port map (Muxin0(10), Muxinl(lO), Muxin2(10), Muxin3(10), 76 Ix(30), Ix (31), Muxresult (10)); Selectll : mux4_1H port map (Muxin0(11), Muxinl(l 1), Muxin2(11), Muxin3(11), Ix(30), Ix (31), Muxresult (11)); Select12 : mux4_lH port map (Muxin0(12), Muxin1( 12), Muxin2(12), Muxin3(12), Ix(30), 1x (31), Muxresult (12)); Select13 : mux4_lH port map (Muxin0(13), Muxinl(l3), Muxin2(13), Muxin3(l3), Ix(30), Ix (31), Muxresult (13)); Selectl4: mux4_1H port map (Muxin0(l4), Muxinl(14), Muxin2(14), Muxin3(l4), Ix(30), [x (31), Muxresult (14)); Select15 : mux4_1H port map (Muxin0(15), Muxinl(15), Muxin2(15), Muxin3(15), Ix(30), Ix (31), Muxresult (15)); Pad_Xbar (XP_Xbar, Xbarout, Xbarin, Xbar_en); XP_Xbar_EN_L <= Xbar_en; Xbarout(31 downto [6) <= Muxresult; Xbarout(15 downto 0) <= Muxresult; Muxinl <= OpB; zero<= "00000"; Rarl <= ”000000000000000000"; Rdrl <= ”0000000000000000"; Pad_Output (XP_right, Id); process begin wait until XP_Clk'Event and XP_Clk = '1'; Pad_Input (XP_left, Ifetch); Pad_Output (XP_Mem_A, RarL); Pad_InOut (XP_Mem_D, Rdrl, Mdr, '0'); XP_Mem_RD_L <= '1'; XP_Mem_WR_L <= '1'; temp <= Xbarin (15 downto 0); Ix <= Ix; local_clk <= local_clk + 1; if local_clk = 5 then local_clk <= 0; end if; if Iocal_clk = 0 then Pad_Output (XP_Mem_A (4 downto 0), Id (14 downto 10)); XP_Mem_RD_L <= '0'; Xbar_en <= ”10000"; end if; if local_clk = 1 then XP_Mem_RD_L <= '0'; Pad_InOut (XP_Mem_D, Rdrl, OpB, '0'); Muxin2 <= Xbarin(31 downto l6); Xbar_en <= "10000”; end if; if local_clk = 2 then 77 -- get data from Xbar, save the bottom half to write next cycle. Muxin3 <= Xbarin(15 downto 0); MuxinO <= Xbarin(31 downto 16); Xbar_en <= ”11100”; end if; if local_clk = 3 then -- need to write secondary Muxresult if Ix(9 downto 5) = Id(14 downto 10) then OpB <= Muxresult; end if; if Ix(4 downto 0) = Id(l4 downto 10) then OpB <= Xbarin(15 downto 0); end if; Ix(31 downto 30) <= "01"; temp <= Xbarin (15 downto 0); XP_Mem_WR_L <= '1'; if (Ix (9 downto 5) I: zero) then Pad_Output (XP_Mem_A (4 downto 0), Ix(9 downto 5)); Pad_InOut (XP_Mem_D, Muxresult, Mdr, '1'); XP_Mem_WR_L <= '0'; end if; Xbar_en <= "[1111”; end if; if local_clk = 4 then -- need to write secondary Muxresult Xbar_en <= 'lllll"; XP_Mem_WR_L <= '1'; if (Ix (4 downto 0) /= zero) then Pad_Output (XP_Mem_A (4 downto 0), Ix(4 downto 0)); Pad_InOut (XP_Mem_D, temp, Mdr, '1'); , XP_Mem_WR_L <= '0'; end if; end if; if local_clk = 5 then Pad_Input (XP_left, Id); Ix <= Id; end if; end process; XP_HSO ¢= 'Z'; XP_GORJesult <= ‘0'; XP_GOR_Valid <= '0'; XP_Int c '0'; XP_LED <= '1'; end Decode_right; A5 Executel.vhd (PES) 78 architecture Executel of Xilinx_Processing_Part is signal Id: Bit_Vector (35 downto 0); signal Ix: Bit_Vector (35 downto 0); signal reset:bit := '0'; signal local_clk: integer range 0 to 6; signal Left: Bit_Vector (35 downto 0); signal Right: Bit_Vector (31 downto 0); signal OpA: Bit_Vector (15 downto 0); signal OpB: Bit_Vector (15 downto 0); signal Muxinl ,Muxin2,Muxin3 Muxin4: Bit_Vector (15 downto 0); signal Xbarin, Xbarout: Bit_Vector(35 downto 0); signal Xbar_en: Bit_Vector (4 downto 0); -- Muxin2 is not used in this pe and should be removed later... signal Result: Bit_Vector (15 downto 0); signal zero: Bit_Vector (15 downto 0); signal CondCode : Bit_Vector (3 downto 0); begin -- Adder unit addsub : adsu16h port map (OpA, OpB, Ix(27), Muxinl , CondCode(3)); -- Channel one of the results from the adder or shifter onto the xbar -- depending on opcode selected and desired function. SelectO : mux4_1H port map (Muxinl(0), Muxinl(0), Muxin3(0), Muxin4(0), Ix(27), Ix (28), Result (0)); Selectl : mux4_lH port map (Muxinl(l), Muxinl(l), Muxin3(l), Muxin4(1), Ix(27), Ix (28), Result (1)); Select2 : mux4_1H port map (Muxinl(2), Muxinl(2), Muxin3(2), Muxin4(2), Ix(27), Ix (28), Result (2)); Select3 : mux4_lH port map (Muxinl(3), Muxinl(3), Muxin3(3), Muxin4(3), Ix(27), Ix (28), Result (3)); Select4 : mux4_1H port map (Muxinl(4), Muxinl(4), Muxin3(4), Muxin4(4), Ix(27), Ix (28), Result (4)); Select5 : mux4_1H port map (Muxinl(S), Muxinl(S), Muxin3(5), Muxin4(5), Ix(27), Ix (28), Result (5)); Select6 : mux4_1H port map (Muxinl(6), Muxinl(6), Muxin3(6), Muxin4(6), Ix(27), Ix (28), Result (6)); Select7 : mux4_1H port map (Muxinl(7), Muxinl(7), Muxin3(7), Muxin4(7), Ix(27), Ix (28), Result (7)); Select8 : mux4_lH port map (Muxinl(8), Muxinl(8), Muxin3(8), Muxin4(8), Ix(27), Ix (28), Result (8)); Select9 : mux4_lH port map (Muxinl(9), Muxinl(9), Muxin3(9), Muxin4(9), Ix(27), Ix (28), Result (9)); Select10 : mux4_1H port map (Muxinl(lO), Muxinl(lO), Muxin3(10), Muxin4(10), Ix(27), Ix (28), Result (10)); Selectll :mux4_1H port map(Muxin1(11), Muxinl(l 1), Muxin3(l 1), Muxin4(l l), Ix(27), Ix (28), Result (11)); Select12 : mux4__lH port map (Muxinl(12), Muxinl(12), Muxin3(12), Muxin4(12), 79 IX(27). 1x (28). Result (12)); Select13 : mux4_1H port map(Muxin1(13), Muxinl(l3), Muxin3(l3), Muxin4(l3), Ix(27), [x (28), Result (13)); Select14 : mux4_1H port map (Muxinl(14), Muxinl(14), Muxin3(l4), Muxin4(14), Ix(27), Ix (28), Result (14)); Select15 : mux4_1H port map (Muxinl(15), Muxinl(15), Muxin3(15), Muxin4(15), Ix(27), Ix (28), Result (15)); zero <= '0000000000000000"; Pad_Xbar (XP_Xbar, Xbarout, Xbarin, Xbar_en); XP_Xbar_EN_L <= Xbar_en; Xbarout (31 downto 16) <== result; Xbarout (15 downto 0) <= result; Xbarout (35 downto 32) <= CondCode; Pad_Output (XP_right, Id); process begin wait until XP_Clk'Event and XP_Clk = '1'; local_clk <:= Iocal_clk + 1; if local_clk = 5 then local_clk <= 0; end if; Pad_Input (XP_left, Left); if local_clk = 0 then -- Compute and broadcast results. -- >>>>>>> Shift left <<<<<<<< Muxin3 (15 downto 1) <= OpA (l4 downto 0); Muxin3 (0) <= '0'; -- >>>>>>>> Shift right sign extension <<<<<<<<<<< Muxin4 (14 downto 0) <= OpA (15 downto 1); Muxin4 (15) <= OpA(15); ‘cnd if ; if local_clk = 1 then Pad_Input (XP_left, Id); if Result = zero then CondCode(l) c '1'; end if; if Result (15) = '1' then CondCode(2) <= '1'; end if; end if; if local_clk = 2 then end if; if local_clk = 3 then Xbar_en <= "10000"; end if; if local_clk = 4 then -- Latch new ops from the decode stagell 80 Opa G Xbarin (31 downto 16); Opb G Xbarin (15 downto 0); Xbar_en G "11111”; end if; if local_clk = 5 then Ix G Id; end if; end process; -- XP_Left G TriState (XP_Left); - XP_Right G TriState (XP_Right); XP_Mem_A G TriState (XP_Mem_A); XP_Mem_D G TriState (XP_Mem_D); XP_Mem_RD_L G '1'; XP_Mem_WILL G '1‘; XP_HSO G '2'; XP_GOR_Result G '0'; XP_GOILValid G '0'; XP_Int G ‘0'; - XP_Xbar_EN_L G'lllll"; XP_LED G '1'; end Executel; A6 Execute2.vhd (PE6) architecture Execute2 of Xilinx_Processing_Part is signal Id: Bit_Vector (35 downto 0); signal Ix: Bit_Vector (35 downto 0); signal reset:bit; signal local_clk: integer range 0 to 6; signal Left: Bit_Vector (35 downto 0); signal Right: Bit_Vector (31 downto 0); signal OpA: Bit_Vector (15 downto 0); signal OpB: Bit_Vector ( 15 downto 0); signal Muxinl Muxin2,Muxin3 ,Muxin4: Bit_Vector (15 downto 0); signal Xbarin, Xbarout: Bit_Vector(35 downto 0); signal Xbar_en: Bit_Vector (4 downto 0); -- Muxin2 is not used in this pe and should be removed later... signal Result: Bit_Vector ( 15 downto 0); signal CondCode : Bit_Vector (3 downto 0); signal zero: Bit_Vector (15 downto 0); begin 81 -- Adder unit addsub : adsu16h port map (OpA, OpB, Ix(27), Muxinl , CondCode(3)); -- Channel one of the results from the adder or shifter onto the xbar -- depending on opcode selected and desired function. SelectO : mux4_1H port map (Muxinl(0), Muxinl(0), Muxin3(0), Muxin4(0), Selectl Select2 : Select3 : Select4 : Select5 : Select6 : Select7 : Select8 : Select9 : Select10 : Selectl 1 Select12 : Select13 : Select14 : Select15 : Ix(27), Ix (28), Result (0)); : mux4_1H port map(Muxin1(1), Muxinl(l), Muxin3(l), Muxin4(l), Ix(27), Ix (28), Result (1)); mux4_1H port map (Muxinl(2), Muxinl(2), Muxin3(2), Muxin4(2), Ix(27), Ix (28), Result (2)); mux4_1H port map (Muxinl(3), Muxinl(3), Muxin3(3), Muxin4(3), N27). Ix (28). Result (3)); mux4_1H port map (Muxinl(4), Muxinl(4), Muxin3(4), Muxin4(4), Ix(27), Ix (28), Result (4)); mux4_1H port map (Muxinl(S), Muxinl(S), Muxin3(5), Muxin4(5), Ix(27), Ix (28), Result (5)); mux4_1H port map (Muxinl(6), Muxinl(6), Muxin3(6), Muxin4(6), Ix(27), [x (28), Result (6)); mux4_1H port map (Muxinl(7), Muxinl(7), Muxin3(7), Muxin4(7), Ix(27), Ix (28), Result (7)); mux4_1H port map (Muxinl(8), Muxinl(8), Muxin3(8), Muxin4(8), Ix(27), Ix (28), Result (8)); mux4_1H port map (Muxinl(9), Muxinl(9), Muxin3(9), Muxin4(9), Ix(27), [x (28), Result (9)); mux4_1H port map(Muxinl(10), Muxinl(lO), Muxin3(10), Muxin4(10), Ix(27), Ix (28), Result (10)); : mux4_1H port map (Muxinl(l l), Muxinl(l 1), Muxin3(11), Muxin4(11), Ix(27), Ix (28), Result (11)); mux4_1H port map (Muxinl(12), Muxinl(12), Muxin3(12), Muxin4(12), Ix(27), Ix (28), Result (12)); mux4_1H port map (Muxinl(l3), Muxin1( l3), Muxin3( 13), Muxin4(l3) Ix(27), Ix (28), Result (13)); mux4_1H port map (Muxinl(14), Muxinl(14), Muxin3( l4), Muxin4(14), Ix(27), Ix (28), Result (14)); mux4_1H port map (Muxinl(15), Muxinl(15), Muxin3(15), Muxin4(15), Ix(27), Ix (28), Result (15)); 9 Pad_Output (XP_ri ght, Id); Pad_Xbar (XP_Xbar, Xbarout, Xbarin, Xbar_en); XP_Xbar_EN_L G Xbar_en; Xbarout (31 downto 16) G result; Xbarout (15 downto 0) G result; Xbarout (35 downto 32) G CondCode; zero <= '0000000000000000”; process begin wait until XP_Clk'Event and XP_Clk = '1'; 82 local_clk G Iocal_clk + 1; if local_clk = 5 then local_clk G 0; end if; Pad_Input (XP_left, Left); if local_clk = 0 then -- Compute results. -- >>>>>>> Shift left <<<<<<<< Muxin3 (15 downto 1) G OpA (l4 downto 0); Muxin3 (0) G '0'; - >>>>>>>> Shift right sign extension <<<<<<<<<<< Muxin4 (14 downto 0) G OpA (15 downto l); Muxin4 (15) G OpA(15); end if; if local_clk = 1 then if Result = zero then CondCode(1)G'1'; end if; if Result (15) = '1' then CondCode(2) G '1'; end if; end if; if local_clk = 2 then Pad_Input (XP_left, Id); end if; if local_clk = 3 then Xbar_en G ”[0000"; end if; if Iocal_clk = 4 then -- Latch new ops from the decode stagell OpB G Xbarin (31 downto 16); OpA G Xbarin (15 downto 0); Xbar_en G "11111"; end if; if local_clk = 5 then Ix G Id; end if; end process; XP_Left G TriState (XP_Left); XP_Right G TriState (XP_Right); XP_Mem_A G TriState (XP_Mem_A); XP_Mem_D G TriState (XP_Mem_D); XP_Mem_RD_L <= '1'; XP_Mem_WR_L G '1'; XP_HSO G '2'; XP_GOILResult G '0'; 83 XP_GOILValid G '0'; XP_Int G '0'; -- XP_Xbar_EN_L G'lllll"; XP_LED G '1'; end Execute2; A7 Execute3.vhd (PE7) architecture Execute3 of Xilinx_Processing_Part is signal Id: Bit_Vector (35 downto 0); signal Ix: Bit_Vector (35 downto 0); signal reset:bit; signal Iocal_clk: integer range 0 to 6; signal Left; Bit_Vector (35 downto 0); signal OpA: Bit_Vector (15 downto 0); signal OpB: Bit_Vector (15 downto 0); signal Mar: Bit_Vector (17 downto 0); signal Mdr: Bit_Vector (15 downto 0); signal Xbarin, Xbarout: Bit_Vector(35 downto 0); signal Xbar_en: Bit_Vector (4 downto 0); signal Result: Bit_Vector (15 downto 0); begin Pad_Output (XP_right, Id); Pad_Xbar (XP_Xbar, Xbarout, Xbarin, Xbar_en); XP_Xbar_EN_L G Xbar_en; Xbarout (31 downto 16) G Mdr; Xbarout (15 downto 0) G Mdr; --Mar(l7 downto 9) G OpA(8 downto 0); --Mar(8 downto 0) G OpB(8 downto 0); process begin wait until XP_Clk'Event and XP_Clk = '1'; local_clk G local_clk + 1; if local_clk = 5 then local_clk G 0; end if; Pad_Input (XP_Jeft, Left); 34 if local_clk = 0 then Pad_Output (XP_Mem_A, Mar); XP_Mem_RD_L <= ‘0'; Xbar_en G "1 1111"; end if; if local_clk = 1 then Pad_Input (XP_Mem_D, Mdr); XP_Mem_RD_L G '1'; end if; if local_clk = 2 then end if; if local_clk = 3 then Pad_Input (XP_Jeft, Id); end if; if local_clk = 4 then Xbar_en G "10000"; end if; if local_clk = 5 then -- Latch new ops from the decode stage. OpB G Xbarin (31 downto 16); OpA G Xbarin (15 downto 0); Mar(17 downto 9) G Xbarin (8 downto 0); Mar(8 downto 0) G Xbarin (24 downto 16); Xbar_enG "11111”; Ix G ld; end if; end process; XP_Mem_WR_L G '1'; XP_HSO G '2'; XP_GOR_Result G '0'; XP_GORNalid G '0'; XP_Int G '0'; XP_LED G '1'; end Execute3; A8 Execute4.vhd (PE 8) architecture Execute4 of Xilinx_Processing_Part is signal Id: Bit_Vector (35 downto 0); signal Ix: Bit_Vector (35 downto 0); signal reset:bit; 85 signal local_clk: integer range 0 to 6; signal Left: Bit_Vector (35 downto 0); signal BotIx: Bit_Vector (15 downto 0); signal OpA: Bit_Vector (15 downto 0); signal OpB: Bit_Vector (15 downto 0); signal Mar: Bit_Vector (17 downto 0); signal Mdr: Bit_Vector (15 downto 0); signal Rdr: Bit_Vector (15 downto 0); signal WriteEnable: Bit; signal Xbarin, Xbarout: Bit_Vector(35 downto 0); signal Xbar_en: Bit_Vector (4 downto 0); signal Result: Bit_Vector (35 downto 0); signal Resultl: Bit_Vector (15 downto 0); signal Result2: Bit_Vector (15 downto 0); sigml OpcodeWrite, SrcRegister, SrcMemory, Srclmmediate: Bit_Vector (4 downto 0); begin RIGHT IN: for i in 0 to 15 GENERATE Pad_Input(XP_right(i), BotIx(i)); END GENERATE RIGHTIN; RIGHTOUT: fori in 16 to 31 GENERATE Pad_OutpuKXPJighui). Ix(i»; END GENERATE RIGHTOUT; -- Pad_Output (XP_right(31 downto 16), Ix(31 downto 16)); -- pad_lnput (XP_right(15 downto 0), BotIx); OpcodeWrite G ”00110”; SrcRegister G ”001 1 1"; SrcMemory G "00100"; Srclmmediate G ”00101”; Pad_Xbar (XP_Xbar, Xbarout, Xbarin, Xbar_en); XP_Xbar_EN_L G Xbar_en; -- Xbarout (31 downto 16) G resultl; -- Xbarout (15 downto 0) G resu112; process begin wait until XP_Clk'Event and XP_Clk = '1'; local_clk G local_clk + 1; if local_clk = 5 then local_clk G 0; end if; 86 Pad_Input (XP_left, Left); Pad_InOut (XP_Mem_D, Rdr, Mdr, WriteEnable); Pad_Output (XP_Mem_A, Mar); XP_Mem_WR_L G '1'; XP_Mem_RD_L G '1'; Xbarout (31 downto 16) G OpA; Xbarout (15 downto 0) G OpA; if local_clk = 0 then Pad_Output (XP_Mem_A (15 downto 0), OpB); if (Ix (31 downto 27) = OpcodeWrite) then Pad_InOut (XP_Mem_D, OpA, Mdr, '1'); XP_Mem_WR_L G '0'; else XP_Mem_RD_L G '0'; end if; Xbar_en G "11111"; end if; if local_clk = 1 then if Ix (31 downto 27)= SrcMemory then Pad_InOut (XP_Mem_D, Rdr,Xbarout (31 downto 16), '0'); Pad_InOut (XP_Mem_D, Rdr,Xbarout (15 downto 0), '0'); end if; if Ix (31 downto 27)= Srclmmediate then Xbarout (31 downto 16) G Ix(25 downto 10); Xbarout (15 downto 0) G Ix(25 downto 10); end if; end if; if local_clk = 2 then Xbar_en G ”10000"; end if; if local_clk = 3 then Pad_Output (XP_Mem_A (15 downto 0), Xbarin (31 downto 16)); Rdr G Xbarin(15 downto 0); if BotIx(lS downto 11) = OpcodeWrite then Pad_InOut (XP_Mem_D, Xbarin (15 downto 0), Mdr, '1'); XP_Mem_WR_L G '0'; end if; resultl G OpB; result2 G OpA; Xbar_en G "11111”; end if; if local_clk = 4 then Pad_Input (XP_left, Id); Xbar_en G "10000"; local_clk G 5; end if; if local_clk = 5 then -- Latch new ops from the decode stage. 87 Opa G Xbarin (31 downto l6); Opb G Xbarin (15 downto 0); Xbar_en G "11111”; Ix G Id; end if; end process; XP_HSO G '2'; XP_GOILResult G '0'; XP_GOR_Valid G '0'; XP_Int G '0'; XP_LED G '1'; end Execute4; 88 A9 Xbarcontrol.vhd (PEO) architecture Xbarcontrol of Xilinx_Control_Part is signal count : integer range 0 to 6; begin process begin wait until X0_Clk'Event and X0_Clk = '1'; count G count + 1; if (count = 5) then count G 0; end if; X0_Xbar_set G itobv (count,3); end process; X0_SIMD G TriState(X0_SIMD); X0_,XB_Data G TriState(X0_XB_Data); X0_Mem_A G TriState(X0_Mem_A); X0_Mem_D G TriState(X0_Mem_D); X0_Mem_RD_L G '1‘; X0_Mem_WR_L G '1". X0_GOR_Result_In G "W"; X0_GOR_Valid_In G "W”; X0_GOR_Result G '0'; X0_GOR_Valid G '0'; - X0_XBar_Set G "000”; X0_XBar_Send G '0'; X0_X l6_Disable G '0'; X0_Int G '0'; X0_Broadcast_0ut G '0'; X0_HSO G '2'; X0 _XBar_EN_L G '1'; end Xbarcontrol; w A10 Xbarconfig 1 3 34 65 In 00000034M1000110 3 3 65 HR 00000034M10001M0 21 34 43 56 [1 0000004311000110 56 ll 0000004311000110 0 0000000000000000 0000000000000000 wmwmmm l n o m. m 123456789NHHBMUM w123456789WUnBMUm 3 OOMIOOOOOOOOOOOO 3 OOMIOOOOOOOOOOOO 0000000000004300 0000000000004300 2 .WOOOOOOOOOOOOOOOO m 0087 M9 0087 N9 0 0078 91 0 0078 91 4 @5600000000000000 mmw 0 23 56 1234567891 llMll 91 3 3 43 34 0000003411000110 34 43 0000004311000110 3 3 000000431M000M10 0000000000000000 mmymmw 0 23 56 1234567891 llMll 3 0000430000M10000 3 0000430000M10000 3 00003400001M0000 3 00003400001M0000 6 .m M0000000000000000 ..m. w 23 56 123456789 11M11 Appendix B Runtime Results Bl Fibonacci Sequence Results T2 Version 1.88 Created Wed Oct 5 09:31 :37 EDT 1994 NEW INTERFACE BOARD (rev2) T2:1> source pe5.init 2 boards available on Splash 2 unit 0 T2:2> source fib.step Added 881 at tick 6 Board 0 PE 3 Offset 0 Words 32 000000: 0000 000A 000A 000A 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000: 0000 000A 000A 000A 000A 000A 000A 000A T2:3> source fib.step Added BB 2 at tick 12 Board 0 PE 3 Offset 0 Words 32 000000: 0000 000A 000A 000A 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000: 0000 000A 000A 000A 000A 000A 000A 000A T2:4> source fib.step Added PB 3 at tick 18 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0002 000A 000A 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000:00000002000A000A000A000A000A000A T2:5> source fib.step Added R8 4 at tick 24 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0002 0003 000A 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000:000000020003000A000A000A000A000A T2:6> source fib.step Added PB 5 at tick 30 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0002 0003 0005 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000:0000000200030005000A000A000A000A 92 93 T2z7> source fib.step Added BB 6 at tick 36 Board 0 PE 3 Offset 0 Words 32 000000:0000000300030005000A000A000A000A Board 0 PE 4 Offset 0 Words 32 000000:0000000300030005000A000A000A000A T2:8> source fib.step Added R8 7 at tick 42 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0003 0005 0005 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000:0000000300050005000A000A000A000A T2.9> source fib.step Added BB 8 at tick 48 Board 0 PE 3 Offset 0 Words 32 00000020000000300050005000A000A000A000A Board 0 PE 4 Offset 0 Words 32 00000020000000300050005000A000A000A000A T2:10> source fib.step Added R8 9 at tick 54 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0003 0005 0008 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 00000020000000300050008000A000A000A000A T2:11> source fib.step Added RB 10 at tick 60 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0005 0005 0008 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000:0000000500050008000A000A000A000A T2:12> source fib.step Added RB 11 at tick 66 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0005 0008 0008 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 00000020000000500080008000A000A000A000A T2:13> source fib.step Added RB 12 at tick 72 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0005 0008 0008 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000: 0000 0005 0008 0008 000A 000A 000A 000A T2:14> source fib.step Added RB 13 at tick 78 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0005 0008 0000 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000:0000000500080000000A000A000A000A T2:15> source fib.step Added RB 14 at tick 84 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0008 0008 0000 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000:0000000800080000000A000A000A000A T2:16> source fib.step Added RB 15 at tick 90 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0008 0000 0000 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 00000020000000800000000000A000A000A000A T2:17> source fib.step Added RB 16 at tick 96 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0008 0000 0000 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000:0000000800000000000A000A000A000A T2:18> source fib.step Added RB 17 at tick 102 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0008 0000 0015 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000: 0000 00080000 0015 000A 000A 000A 000A 95 BZ VLIW Fibonacci Sequence Results T2 Version 1.88 Created Wed Oct 5 09:31 :37 EDT 1994 NEW INTERFACE BOARD (rev2) T2:1> source all.init 2 boards available on Splash 2 unit 0 T2:2> source ai|1.step Added BB 1 at tick 6 Board 0 PE 3 Offset 0 Words 32 000000: 0000 000A 000A 000A 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 00000020000000A000A000A000A000A000A000A Board 0 PE 14 Offset 0 Words 32 000000: 0000 000A 000A 000A 000A 000A 000A 000A Board 0 PE 13 Offset 0 Words 32 000000: 0000 000A 000A 000A 000A 000A 000A 000A T2:3> 1 Added R8 2 at tick 12 Board 0 PE 3 Offset 0 Words 32 000000: 0000 000A 000A 000A 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000: 0000 000A 000A 000A 000A 000A 000A 000A Board 0 PE 14 Offset 0 Words 32 000000: 0000 000A 000A 000A 000A 000A 000A 000A Board 0 PE 13 Offset 0 Words 32 000000: 0000 000A 000A 000A 000A 000A 000A 000A T2:4> l Added R8 3 at tick 18 Board 0 PE 3 Offset 0 Words 32 000000: 0000 00010001000A 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000: 0000 00010001000A 000A 000A 000A 000A Board 0 PE 14 Offset 0 Words 32 . 000000: 0000 0001 0001 000A 000A 000A 000A 000A Board 0 PE 13 Offset 0 Words 32 000000: 0000 00010001000A 000A 000A 000A 000A T2:5> l Added BB 4 at tick 24 Board 0 PE 3 Offset 0 Words 32 000000: 0000 000100010002 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000:0000000100010002000A000A000A000A Board 0 PE 14 Offset 0 Words 32 000000: 0000 000100010002 000A 000A 000A 000A Board 0 PE 13 Offset 0 Words 32 000000: 0000 000100010002 000A 000A 000A 000A T2:6> l Added BB 5 at tick 30 Board 0 PE 3 Offset 0 Words 32 000000: 0000 00010002 0002 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000: 0000 00010002 0002 000A 000A 000A 000A Board 0 PE 14 Offset 0 Words 32 000000: 0000 0001 0002 0002 000A 000A 000A 000A Board 0 PE 13 Offset 0 Words 32 000000:0000000100020002000A000A000A000A T2:7> 1 Added BB 6 at tick 36 Board 0 PE 3 Offset 0 Words 32 000000:0000000100020002000A000A000A000A Board 0 PE 4 Offset 0 Words 32 000000: 0000 0001 0002 0002 000A 000A 000A 000A Board 0 PE 14 Offset 0 Words 32 000000:0000000100020002000A000A000A000A Board 0 PE 13 Offset 0 Words 32 000000:0000000100020002000A000A000A000A T2:8> 1 Added R8 7 at tick 42 Board 0 PE 3 Offset 0 Words 32 000000: 0000 00010002 0003 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000: 0000 0001 0002 0003 000A 000A 000A 000A Board 0 PE 14 Offset 0 Words 32 000000: 0000 0001 0002 0003 000A 000A 000A 000A Board 0 PE 13 Offset 0 Words 32 000000: 0000 00010002 0003 000A 000A 000A 000A T2z9> 1 Added R8 8 at tick 48 Board 0 PE 3 Offset 0 Words 32 000000:0000000200030003000A000A000A000A Board 0 PE 4 Offset 0 Words 32 000000: 0000 0002 0003 0003 000A 000A 000A 000A Board 0 PE 14 Offset 0 Words 32 000000: 0000 0002 0003 0003 000A 000A 000A 000A Board 0 PE 13 Offset 0 Words 32 000000:0000000200030003000A000A000A000A T2:10>l AddedFiBeattick54 Board 0 PE3 OffsetOWords 32 000000:0000000200030003000A000A000A000A Board 0 PE40ffset0 Words 32 000000:0000000200030003000A000A000A000A Board 0 PE 14 Offset 0 Words 32 000000:0000000200030003000A000A000A000A Board 0 PE 13 OffsetOWordse2 000000:0000000200030003000A000A000A000A T2:11>l Added RB 10attlck60 Board 0 PE 3 Offseto Word332 000000:0000000200030005000A000A000A000A Board 0 PE 4 Offseto Words 32 000000:0000000200030005000A000A000A000A Board 0 PE 140ffset0 Words 32 000000:0000000200030005000A000A000A000A Board 0 PE 13 OffsetOWords 32 000000:0000000200030005000A000A000A000A T2:12> i Added R811 attick 66 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0003 0005 0005 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000:0000000300050005000A000A000A000A Board 0 PE 14 Offset 0 Words 32 000000: 0000 0003 0005 0005 000A 000A 000A 000A Board 0 PE 13 Offset 0 Words 32 000000:0000000300050005000A000A000A000A T2:13> 1 Added RB 12 at tick 72 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0003 0005 0005 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000: 0000 0003 0005 0005 000A 000A 000A 000A Board 0 PE 14 Offset 0 Words 32 00000020000000300050005000A000A000A000A Board 0 PE 13 Offset 0 Words 32 000000:0000000300050005000A000A000A000A T2:14> 1 Added RB 13 at tick 78 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0003 0005 0008 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000: 0000 0003 0005 0008 000A 000A 000A 000A Board 0 PE 14 Offset 0 Words 32 000000: 0000 0003 0005 0008 000A 000A 000A 000A Board 0 PE 13 Offset 0 Words 32 000000:0000000300050008000A000A000A000A 98 T2:15> 1 Added RB 14 at tick 84 Board 0 PE 3 Offset 0 Words 32 000000:0000000500080008000A000A000A000A Board 0 PE 4 Offset 0 Words 32 000000:0000000500080008000A000A000A000A Board 0 PE 14 Offset 0 Words 32 000000:0000000500080008000A000A000A000A Board 0 PE 13 Offset 0 Words 32 000000:0000000500080008000A000A000A000A T2:16> I Added RB 15 at tick 90 Board 0 PE 3 Offset 0 Words 32 00000020000000500080008000A000A000A000A Board 0 PE 4 Offset 0 Words 32 000000: 0000000500080008000A000A000A000A Board 0 PE 14 Offset 0 Words 32 00000020000000500080008000A000A000A000A Board 0 PE 13 Offset 0 Words 32 000000:0000000500080008000A000A000A000A T2:17> 1 Added RB 16 at tick 96 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0005 0008 0000 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000: 0000 0005 0008 0000 000A 000A 000A 000A Board 0 PE 14 Offset 0 Words 32 000000: 0000 0005 0008 0000 000A 000A 000A 000A Board 0 PE 13 Offset 0 Words 32 000000:0000000500080000000A000A000A000A T2:18> l Added F1817 at tick 102 Board 0 PE 3 Offset 0 Words 32 000000: 0000 0008 0000 0000 000A 000A 000A 000A Board 0 PE 4 Offset 0 Words 32 000000: 0000 0008 0000 0000 000A 000A 000A 000A Board 0 PE 14 Offset 0 Words 32 000000: 0000 0008 0000 0000 000A 000A 000A 000A Board 0 PE 13 Offset 0 Words 32 000000: 0000 0008 0000 0000 000A 000A 000A 000A BIBLIOGRAPHY [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] BIBLIOGRAPHY Abnous, A. Bagherzadeh, N. "Pipelining and Bypassing in a VLIW Processor" IEEE Transactions on Parallel and Distributed Systems. Vol. 5. No.6. June 1994. pp. 658 - 664 Amerson,R. Carter, RJ. Culbertson,W.B. Kuekes,P. Snider,G. "Teramac- Configurable Custom Computing" Proceedings of IEEE Workshop on F PGAs for Custom Computing Machines, 1995. Ellis, J.R. "Bulldog: A Compiler for VLIW Architectures" Cambridge, Mass. MIT Press, 1985. Fisher, J. A. "Trace Scheduling: A Technique for Global Microcode Compaction" IEEE TOC, 030. July 1981. pp. 478-490 Fisher,J.A. Rau,B. R. "Instruction-Level Parallel Processing" Science, September 1991. French, P.C. Taylor R.W. "A Self-Reconfiguring Processor" Proceedings of IEEE Symposium on F PGAs for Custom Computing Machines , 1994. Gokhale, M.B. Holmes, W. Kopser, A. Lucas, S. Minnich, R. Sweely, D. Lopresti, D. "Building and Using a Highly Parallel Programmable Logic Array" Computer, January 1991. pp. 81-87 Gokhale, M.B. Minnich, R. "FPGA Computing in a Data Parallel C" Proceedings of IEEE Workshop on F PGAs for Custom Computing Machines, April 1993. pp. 94-101 Gokhale, M.B. Schott, B. "Data Parallel C on a Reconfigurable Logic Array" Draft Gray, J. Naylor, A. Abnous, A. Bagherzadeh, N. "Viper: A 25-MHz, 100- MIPS Peak VLIW Microprocessor" Proceedings of the IEEE 1 993 Custom Integrated Circuits Conference. May 9-12, 1993. pp. 4.1.1 - 4.1.5 99 [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] 100 Hadley, J .D. Hutchings, B ..L "Design Methodologies for Partially Reconfigured Systems" Proceedings of IEEE Workshop on F PGAs for Custom Computing Machines, 1995. Henncssy, J. Patterson, D. "Computer Architecture A Quantitative Approach" 1990. Hill, MD. Larus, J .R. Reinhardt, S.K. Wood, D.A. "Cooperative Shared Memory: Software and Hardware for Scalable Multiprocessors" ACM Transactions on Computer Systems, Vol. I I , No.4. November 1993. pp. 300-318 Hog], H. Kugel, A. Ludvig, J. Manner, R. Noffz, K.H., 202, R. "Enable++: A second generation FPGA processor" IEEE Workshop on FPGAs for Custom Computing Machines, 1995. Hwang, K. "Advanced Computer Architecture: Parallelism, Scalability, Programmability" 1993. Iseli, C. Sanches, E. "A C++ Compiler for FPGA custom execution units synthesis" Proceedings of IEEE Workshop on F PGAs for Custom Computing Machines, 1995. Iseli, C. Sanches, E. "Spyder: A Reconfigurable VLIW Processor using FPGA's" IEEE Workshop on F PGAs for Custom Computing Machines, 1993. Johnson, M. "Superscalar Microprocessor Design " 1991. Lenoski, D. Laudon, J. Gharachorloo, K. Weber, W. Gupta, A. Hennessy, J. Horowitz, M. Lam, M. "The Stanford Dash Multiprocessor" Computer, March 1992. pp. 63-79 Malke S. Chen, W.Y. Bringmann, R.A. Hank, R. E. ku, W.W. Ran, B.R. Schlansker, M.S. "Sentinel Scheduling: A Model for Compiler-Controlled Speculative Execution" ACM Transactions on Computer Systems, Vol. I I , No. 4. November 1993. pp. 376-408 Meier, R.D. "Rapid Prototyping of a RISC Architecture for Implementation in FPGAs" IEEE Workshop on F PGAs for Custom Computing Machines, 1995. [22] [23] [24] [25] [26] [27] [23] [29] [30] [31] [32] 101 Nicolau, A. "Percolation Scheduling: A Parallel Compilation Technique" Dept. of Computer Science Cornell Tech. Report TR 85-678, May 1985. Novak, J .H. Brunvand, E. "Using FPGAs to Prototype a Self-Timed Floating Point Co-Processor" Proceedings of the I 994 IEEE Custom Integrated Circuits Conference. pp. 85-88 Ratha, N.K. Jain, A.K. Rover, D.T. "Convolution on Splash 2" IEEE Symposium on F PGAs for Custom Computing Machines 1995. Robert, M. Gorria, P. Miteran, J. Turgis, S. "Architectures for a Real Time Classification Processor" Proceedings of the I 994 IEEE Custom Integrated Circuits Conference. pp. 197-200 Rover, D. Tsai,V. Chow,Y. Gustafson,J. "Signal-Processing Algorithms on Parallel Architectures: A Performance Update" Journal of Parallel and Distributed Computing, Vol. 13. 1991. pp. 237-245 Salinas, M.H. Johnson, B.W. Aylor, J .H. "Implementation-Independent Model of an Instruction Set Architecture in VHDL" IEEE Design & Test of Computers 1993. Schuette, M.A. Shen, J .P. "An Instruction-Level Performance Analysis of the Multiflow TRACE 14/300" Association for Computing Machinery 1991. Shen,J.P. Class notes, 1992-1993. Takayanagi,T. Sawada, K. Sakurai,T. Parameswar,Y. Tanaka,S. Ikumi,N. Nagamatsu,M. Kondo,Y. Minagawa, K. Brennan,J. Hsu, P. Rodman, P. Bratt,J. Scanlon,J. Tang,M. Joshi,C. Nofal,M. "Embedded Memory Design for a Four Issue Superscalar RISC Microprocessor" Proceedings of the 1994 IEEE Custom Integrated Circuits Conference. pp. 585-590 Wang, R. Milletary, J. Kobayashi, T. "The Dex JR." Senior project report 1993. Wirthlin, MJ. Hutchings, B.L. "A Dynamic Instruction Set Computer" Proceedings of IEEE Symposium on F PGAs for Custom Computing Machines 1995. 102 [33] Wolfe, A. Shen, J .P. "Superscalar Processor Design" Proceeding of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems (Asplos V) 1992. [34] Xilinx "The Programmable Logic Data Book" Xilinx, Inc. 1994. MICHIGAN STATE UNIV. LIBRnRIEs 1|HIWIW‘IIWI"11111111111”1111111111qu 31293014057701