
This is to certify that the

thesis entitled

A VERY LONG INSTRUCTION WORD ARCHITECTURE
IMPLEMENTED ON THE SPLASH 2 FPGA ARRAY

presented by
Roy C. Wang

has been accepted towards fulﬁllment
of the requirements for

Master's degree in Electrical Eng.

 

 


Major professor

 

Date 3/17.] ?6

0-7639 MS U is an Affirmative Action/Equal Opportunity Institution

 


A Very Long Instruction Word Architecture
Implemented on the Splash 2
FPGA Array

By

Roy C. Wang

A THESIS

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

MASTER OF SCIENCE
Department of Electrical Engineering

1995

ABSTRACT

A Very Long Instruction Word Architecture
Implemented on the Splash 2
FPGA Array

By

Roy C. Wang

The need for faster computing systems constantly pushes the limits of technology.
While special purpose and custom computing systems can provide very high levels of
performance, the cost associated with these systems limits their application. Recently,
there have been significant improvements in the performance of general purpose
processors which can be used in a wide variety of applications. The speed of these
general purpose processors is limited by how many instructions they can perform per
clock cycle. To increase that number, machines need to be able to execute instructions in
parallel. Machines capable of such parallel execution are known as superscalar
processors. Unlike other proposed superscalar models, the VLIW (Very Long Instruction
Word) model extends the RISC paradigm of simpliﬁed hardware.

This thesis examines the feasibility of using Splash 2, an FPGA-based processing
array, conﬁgured as a VLIW processor. The VLIW architecture implemented in this
study is capable of performing two operations concurrently. This study demonstrates the
use of a new platform to prototype and test architectural and compilation theories and
highlights its capabilities and limitations.

Copyright © by
Roy C. Wang
1995

Dedicated to my parents,
Maria Wang and Tang S. Wang,

whose sacrifice and perseverance have given me a better life.


ACKNOWLEDGMENTS

I would sincerely like to thank my advisor, Dr. Diane Thiede Rover, for all of her
help and guidance in the course of this research. Her insight provided much needed
direction and encouragement.

I would also like to thank the other members of my committee, Dr. Elias Strangas
and Dr. Michael Shanblatt, for their efforts on my behalf. Dr. Strangas has provided
support throughout my graduate education. His generosity and faith proved to be an
invaluable source of strength and support.

I wish to thank my family and friends for always being there for me. Last, but
hardly least, I would also like to thank Lisa for her editing efforts and support.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
1 Introduction
2 Background Information
2.1 Scalar Processors
2.2 Parallel vs. Superscalar Processors
2.3 Instruction Level Parallelism
2.4 Design Issues of Superscalar Processors
2.5 Code Motion
2.6 Reconfigurable Systems
2.7 Summary
3 Design Methodology & Architecture Specification
3.1 Architecture Driven Design Methodology
3.1.1 Evaluation Criteria
3.2 Architectural Specifications
3.2.1 Splash 2
3.2.2 Dex JR.
3.2.3 Dex-II
3.2.4 The Instruction Set
3.3 Design Constraints
3.3.1 Data Hazards
3.3.2 Memory Hazards
3.3.3 Control Hazards
3.4 RTL Specifications of Dex-II
3.4.1 PE Partitions
3.4.2 Communication Paths
3.4.3 Fetch Stage
3.4.4 Decode Stage
3.4.5 Execution Stage
4 Simulation and Synthesis Environment & Results
4.1 Simulation & Synthesis Process
4.2 Synthesis Results
5 Test Programs
5.1 The Runtime Environment
5.2 Fibonacci Sequence
5.2.1 Dex-II Verification
5.2.2 Program Implementation and Performance
5.3 Bubble Sort
6 Conclusions and Future Investigations
6.1 The Dex-II Evaluation
6.2 VLIW vs. RISC
6.3 Splash 2 Evaluation
6.4 Future Investigations
Appendix A
A1 Control1.vhd (PE1)
A2 Control2.vhd (PE2)
A3 Decode1.vhd (PE3)
A4 Decode2.vhd (PE4)
A5 Execute1.vhd (PE5)
A6 Execute2.vhd (PE6)
A7 Execute3.vhd (PE7)
A8 Execute4.vhd (PE8)
A9 Xbarcontrol.vhd (PE0)
A10 Xbarconfig
Appendix B
B1 Fibonacci Sequence Results
B2 VLIW Fibonacci Sequence Results
BIBLIOGRAPHY

LIST OF FIGURES

Figure 2.1. Instruction execution models
Figure 2.2. Instruction stages
Figure 2.3. Percolation scheduling
Figure 2.4. Register renaming
Figure 2.5. Compensation code
Figure 2.6. Simplified block diagram of the XC4000-Family CLB
Figure 3.1. Architecture hierarchy
Figure 3.2. Splash board
Figure 3.3. Design flow
Figure 3.4. The Dex JR.
Figure 3.5. The Dex-II
Figure 3.6. Instruction format
Figure 3.7. Data dependency
Figure 3.8. Cascaded ALU
Figure 3.9. Memory hazard
Figure 3.10. Control hazard
Figure 3.11. PE partitioning
Figure 3.12. Cycle 1
Figure 3.13. Cycle 2
Figure 3.14. Cycle 3
Figure 3.15. Cycle 4
Figure 3.16. Cycle 5
Figure 3.17. Cycle 6
Figure 3.18. Fetch RTL Description
Figure 3.19. Decode RTL Description
Figure 3.20. Execute RTL Description
Figure 4.1. VHDL hierarchy
Figure 4.2. Code description
Figure 4.3. Simulator output
Figure 5.1. Fibonacci Execution Results
Figure 5.2. Two Fibonacci programs
Figure 5.3. RISC bubble sort program
Figure 5.4. VLIW bubble sort program
Figure 5.5. Optimized RISC bubble sort program
Figure 5.6. Optimized VLIW bubble sort program

LIST OF TABLES

Table 1. Instruction set
Table 2. Summary of synthesis results

CHAPTER 1

Introduction

New technology invariably alters the landscape of design and implementation of
computer systems. Current trends of processor design have moved towards a RISC
(Reduced Instruction Set Computer) architecture. Lower cost memory, automated
design, and the need for a modular approach due to VLSI (Very Large Scale Integration)
and WSI (Wafer Scale Integration) have all played a significant role in this shift of
ideology. As we push the upper limits of performance with pipelining, it becomes
necessary to exploit the fine-grained parallelism of programs, known as
instruction level parallelism (ILP). Superscalar processors, which can execute multiple
instructions concurrently, have already been designed and implemented to take advantage
of existing software. They promise even more speed-up in the future as compiler
technologies and operating systems are written to take advantage of their added
capabilities.

Processor design is not only changing, but changing rapidly. The development of
computer technology is one of the fastest moving industries in the world. Introducing a
new product a month or two before competitors can capture a majority of the market
share. With this motivation, it is not difﬁcult to see why shrinking the design cycle is
imperative. FPGA (Field Programmable Gate Array) technology is increasing in both
array size and speed. These chips offer a short development cycle by allowing designs to be
implemented and tested without the need for expensive wafer fabrication. Individually, they are
used today for ASIC (Application Specific Integrated Circuit) and small prototyping
applications; however, larger ensembles offer new capabilities in rapid prototyping and
custom computing. Splash 2 is one of the first computing systems to explore these
possibilities. Its design allows for SIMD (Single Instruction stream, Multiple Data
stream) type architectures as well as highly pipelined and systolic applications.

This thesis explores the issues associated with implementing an instruction set
processor with the FPGA-based system, Splash 2. The Spyder architecture [17] has
shown that it is possible to utilize FPGAs in a general purpose processor. The Spyder,
however, dedicates the FPGAs as reprogrammable execution units to provide some
ﬂexibility in its functionality. Splash 2 and other FPGA arrays represent a totally new
class of machines. The size and resources of the Splash 2 allow an entire architecture to
be prototyped and implemented with FPGAs. This would allow the computer designer to
test new features of an ISP (Instruction Set Processor) by reconfiguring the Splash 2. For
example, programs that may beneﬁt from very speciﬁc hardware routines can reprogram
the data path for better performance, and still retain the functionality of the remaining
system.

FPGA technology, however, is still in a relatively young stage. The limitations
imposed by the Splash 2 constrain designs. The limited logic on a single FPGA forces us
to partition our design across several FPGAs. The architecture of the Splash 2 also limits
the off chip resources available to each FPGA, such as memory and busses. These
restrictions present a signiﬁcant obstacle to implementing a general purpose instruction
set processor.

Our approach to studying ILP and superscalar processors is to prototype a VLIW
(Very Long Instruction Word) processor on the Splash 2. The simple hardware needed to
implement a VLIW architecture is well suited for the Splash 2. A VLIW architecture will
also exploit the fine-grained parallelism of programs that conventional processors have
not. The problems associated with parallel processing motivate the need for a platform
that can rapidly prototype and test new architectural and compilation theories. While
conventional computing machines are built to process either general purpose code or
highly customized instructions, FPGA-based arrays offer the true general purpose
machine that can be conﬁgured as both.

New architectures are developed from studies of past processors and applications.
In Chapter 2, we discuss some of the issues and approaches of implementing
instruction set processors. These issues lay the foundation for our new architecture, the
Dex-II. Chapter 3 provides a detailed description of our design methodology and
implementation strategies. It also provides an architectural specification for Dex-II, the
VLIW architecture implemented on the Splash 2 array. Chapter 4 presents the design
environment and provides the simulation results and synthesis statistics of our
implementation. In Chapter 5, an analysis of two test programs helps us evaluate how
successfully the VLIW architecture extracts ILP. Chapter 6 summarizes the findings
and discusses how future implementations of FPGA arrays may be improved.

CHAPTER 2

Background Information

This chapter provides the background information and motivation for this thesis.
It presents a brief history of architectures and the concepts applied to improve their
performance. We discuss the challenges of parallel computing and introduce the
concept of instruction level parallelism. Finally, we consider reconfigurable systems and
how they can contribute to innovations in computer architectures.

2.1 Scalar Processors

First generation computers were typically load/store machines. They were
basically single accumulator processors, not unlike today's programmable calculators. If
multiple variables were needed in the computation, they had to be stored into memory
and loaded at a later time. Memory address calculations and access were time
consuming. Therefore, more complex methods of manipulating data gave birth to the
ﬁrst CISC (Complex Instruction Set Computer) machines. These machines are
characterized by a larger set of registers and complex memory addressing modes [15].
For example, the Motorola 68040 microprocessor implements 113 instructions with 16
general purpose registers and supports 18 different addressing modes. The popular Intel
80486 microprocessor implements 157 instructions with 12 addressing modes. The
concept of the CISC is to put commonly executed sequences of code into one single
instruction and add special hardware to execute it as quickly as possible. Each instruction
is performed by executing subprograms written in microcode. Microcode represents the
physical bits of signals that control the data path. The advantage of this design is the
compatibility of code between machines and the reduced cost for program memory. If
the hardware is changed, all that is needed to compensate is a rewrite of the microcode
to support the same instructions. Applications are able to run without the need to
recompile. These processors, however, have become very elaborate and complex to
design. Transistor counts have rocketed into the millions, and the speed-up of each
successive generation has dwindled.

The search for more speed led to techniques used in mainframes and
supercomputers. Pipelining [12] was introduced to exploit temporal parallelism.
Pipelining allows instructions to be overlapped in stages. Each stage completes part of
the instruction so several instructions can be executing at the same time. To make
pipelining efﬁcient, each stage should take about the same time to complete its tasks.
CISC techniques are not very well adapted to handle pipelining. Each individual
instruction has its own time frame and data paths are littered with special hardware to
implement those special instructions. RISC processors, however, are designed to balance
these stages. RISC instructions use very basic functions to simplify the hardware as
much as possible. The functionality of the processor is preserved, but the number of
instructions needed to complete the same task is increased. The increase in memory
size required to run programs has become less of an issue, as memory sizes have been
quadrupling about every three years [12].

The RISC machine is characterized by a smaller set of instructions and a larger
supply of registers. Addressing is kept as simple as possible and memory operations are
usually limited to simple load/store commands. A typical RISC machine is the Sun
SPARC CY 7C601 with 69 instructions. There are 136 registers divided with a technique
called register windowing. This is basically a paged register file to supply clean registers
for subroutines. The RISC model tries to partition instructions into well-defined stages to
make pipelining as efficient as possible. The results are higher clock rates and shorter
design times. Chips such as the DEC Alpha and the PowerPC have already broken the
100 MHz mark. Now the search for even more speed has begun to focus on spatial
parallelism. Increasing the number of instructions that can execute at the same time
brings us into the realm of parallel and superscalar computing.

2.2 Parallel vs. Superscalar Processors

The idea of parallel and superscalar processors is to allow instructions to be
executed concurrently [18]. This is not a new idea, and there have been many successful
approaches to parallel computing. Parallel processors can execute several instruction
streams, while a superscalar processor can execute several instructions of a single
instruction stream. Parallel execution of instructions increases the throughput of the
processor and decreases the time needed to finish a task, as can be seen in Figure 2.1.
This figure depicts the relative time needed to finish four tasks (I1-I4) and shows how
pipelined and superscalar execution times compare to serial execution. Many of the
traditional parallel processors need highly specialized hardware and special algorithms to
achieve their goals. The Flynn classification [12] proposes four different models of
computers. The SISD (Single Instruction stream, Single Data stream) architecture is the
standard general purpose scalar computer. The SIMD architecture is represented by
machines such as the MasPar MP-1 and Thinking Machines CM-2. This type of
architecture has many identical functional units and a single instruction unit. The
instruction unit broadcasts a command, and all the functional units perform the same
function. This is extremely efficient for many different matrix operations where the same
task is performed on large amounts of data. Unfortunately, only very specific
applications can take full advantage of this setup. The MISD (Multiple Instruction
stream, Single Data stream) computers are a bit more difficult to classify. Certain
systolic arrays may be considered as MISD computers. The data flows through stages of
the array and is operated on by whatever instruction executes at each stage of the array
[7]. The MIMD (Multiple Instruction stream, Multiple Data stream) processors are the
most common parallel computers. Mainframes, supercomputers, and high-end workstations all
employ some form of multiprocessing capability. Employing the MIMD model, they can
exploit data parallelism and specialized parallel code to expedite the completion of a
single program. However, these computers are primarily used to serve many users and
programs simultaneously. While they can complete a multitude of tasks, each individual
program or task itself is not being executed any faster [12]. To complete a single task
using multiple processors requires either a regular data structure, such as matrix
calculations, or complex synchronization and message passing mechanisms.

Figure 2.1. Instruction execution models (serial, pipelined, and superscalar pipelined instruction execution of I1-I4 over time)
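The relative timings that Figure 2.1 illustrates can be approximated with a simple cycle-count model. The sketch below is only an idealized illustration and is not taken from the thesis: it assumes every instruction passes through the same number of single-cycle stages, that there are no hazards or stalls, and that a superscalar machine of a given issue width can always fill every issue slot. The function names, the four-stage depth, and the issue width of two are assumptions chosen for the example.

```python
from math import ceil


def serial_cycles(n_instr, n_stages):
    """Each instruction runs all of its stages before the next one starts."""
    return n_instr * n_stages


def pipelined_cycles(n_instr, n_stages):
    """One instruction enters the pipeline per cycle (ideal, no stalls)."""
    return n_stages + (n_instr - 1)


def superscalar_cycles(n_instr, n_stages, width):
    """Up to `width` instructions enter the pipeline per cycle (ideal)."""
    return n_stages + ceil(n_instr / width) - 1


# Four instructions (I1-I4), assuming a hypothetical four-stage pipeline.
print(serial_cycles(4, 4))          # 16
print(pipelined_cycles(4, 4))       # 7
print(superscalar_cycles(4, 4, 2))  # 5
```

Under these assumptions the gain from widening issue is bounded by how many independent instructions are actually available each cycle, which is the subject of the next section.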

Several issues have been studied to overcome the barriers of parallel processing.
Parallel programming requires a good understanding of how the hardware is organized in
order to take full advantage of the scalable hardware. Variations of high level
programming languages have been used to overcome some of these issues. A language
called dbC (Data-parallel Bit C) was developed with parallel data structures for SIMD
type execution. By using the new constructs supplied by the language, the compiler can
effectively distribute the computation among the available processing elements. dbC has
been used for parallel machines such as the Cray and Terasys. Even more interesting is
the possibility of using a dbC compiler to generate hardware on an FPGA-based array to
implement each program with its own hardware [8][9].

Parallel machines also suffer from synchronization problems. There are basically
two methods of sharing data and keeping data coherent. The message passing model uses
a distributed memory system. Each processor keeps its own data and sends data via
messages when needed. The MasPar MP-1 and Ncube2 are two examples of message
passing computers. The other method is a shared memory model where each processor
communicates through a common memory space. The shared memory space makes
programming much easier since the programmer does not need to worry about the
location of data. Although the shared memory model is theoretically sound, the
scalability of such a system is limited by the bandwidth of the memory and the
complexity of the caching scheme [30] used in the system. The DASH multiprocessor
[19] developed at Stanford uses a distributed directory based cache coherence scheme
[19][13] to scale the single memory space. However, these schemes need both software
and hardware support in order to be utilized. Still, others argue that a highly scalable
memory system is not necessary due to the inherently low level of parallelism that can be
extracted from programs [12].

2.3 Instruction Level Parallelism

It becomes quite obvious from Figure 2.1 that the number of instructions that can
be executed concurrently will greatly affect the performance of superscalar architectures.
This number is limited by the ILP of the program. ILP is the amount of parallelism that
can be extracted from a program written for a sequential processor [5]. This is limited by
true data dependencies, procedural dependencies, and resource conflicts [18]. A data
dependency exists when an instruction's operands depend on the results of previous
instructions. Instructions in general cannot execute until their operands are available.
Several methods to resolve data dependencies have been used. The Intel i960CA, HP
PA-RISC 7100, HyperSPARC, and IBM's RS6000 all use a runtime technique known as
scoreboarding to resolve data dependencies [15][12][28][33]. This technique requires
keeping a table or "scoreboard" of what stage of execution each register is in. Other
dynamic techniques use reservation stations and a version of the Tomasulo algorithm [28]
implemented on the IBM 360 floating point unit. The major disadvantages of these
implementations are the hardware cost and the complexity associated with them.
Dependencies can also be eliminated during compile time [5]. VLIW processors depend
on the compiler to generate compact code in order to exploit ILP. The biggest
disadvantages of this approach are the lack of good compilers and the incompatibility of
code from machine to machine.
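To make the effect of true data dependencies concrete, the following sketch estimates the ILP of a short straight-line instruction sequence. It is an illustrative model under stated assumptions, not a technique described in the thesis: every instruction is assumed to take one cycle, functional units are assumed unlimited, only true (read-after-write) dependencies are honored (as if registers were renamed), and there are no branches. The tuple encoding of instructions and the function names are hypothetical.

```python
def schedule_levels(program):
    """Return the earliest cycle in which each instruction can execute.

    `program` is a list of (dest_register, source_registers) tuples in
    program order. Only true (read-after-write) dependencies are honored.
    """
    ready = {}   # register -> cycle in which its newest value is available
    levels = []
    for dest, sources in program:
        # An instruction must wait until all of its operands are available.
        level = max((ready.get(reg, 0) for reg in sources), default=0)
        levels.append(level)
        ready[dest] = level + 1  # assumed single-cycle latency
    return levels


def ilp(program):
    """Average number of instructions per cycle under the ideal model."""
    if not program:
        return 0.0
    return len(program) / (max(schedule_levels(program)) + 1)


example = [
    ("r1", ("r2", "r3")),  # r1 = r2 op r3
    ("r4", ("r1", "r5")),  # needs r1 from the first instruction
    ("r6", ("r7", "r8")),  # independent of the others
    ("r9", ("r4", "r6")),  # needs r4 and r6
]
print(schedule_levels(example))  # [0, 1, 0, 2]
print(ilp(example))
```

In this example the chain through r1 and r4 forces three cycles for four instructions, so no amount of additional hardware would finish the sequence faster under the model.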

Procedural dependencies arise from instructions that are affected by branching. At
conditional branches, there are two sets of instructions that can be executed. The
processor often has to await resolution of the branch to continue execution. Several
different approaches have been used to decrease the impact of branching. Speculative
execution [20][12] can be used to eliminate some of these branch latencies. Compiler
techniques utilizing trace scheduling methods [4] can also move instructions across
blocks of code to ﬁll up unused cycles.

Resource dependency involves the number of functional units available to execute
instructions. The decision of how many functional units to implement depends
heavily on the type of programs being run and the level of parallelism that can be
extracted. Too many units would leave costly hardware idle, while too few would create
a bottleneck for the performance of the processor.

2.4 Design Issues of Superscalar Processors

The physical hardware units where operations are performed are called functional units.
In a scalar architecture, only one function can be executed at any given time. In a
superscalar architecture, functions are differentiated in hardware to allow multiple
instruction execution among all of them. The most common functional units are integer
units, ﬂoating point units, memory units, and control units. The number and types of
units vary from architecture to architecture. The Motorola 88110, for example, has ten
functional units while the DEC Alpha has only four [33]. More units, however, do not
necessarily indicate better performance.

There are typically four stages in executing an instruction as shown in Figure 2.2:
Fetch, Decode, Execute, and Writeback [12]. The Execute stage can be easily scaled
with the addition of hardware. The Fetch, Decode and Writeback stages are the critical
points where bottlenecks can impede performance gains. The Fetch stage is responsible
for getting instructions out of the instruction cache. In a superscalar processor where
more than one instruction can be issued, this becomes a challenging task. Different

architectures adopt many different approaches to this problem. The number of functional

 

Fetch    Decode    Execute    Writeback


Figure 2.2. Instruction stages

units and the dependencies on the instruction dictate how many and which instructions

can begin to execute. For instance, the DEC Alpha can issue up to two instructions per

clock cycle as long as they are operating on different functional units [33]. The Intel
i960CA can fetch four instructions and issue up to three of them if they are on separate
functional units [33]. The sophistication of issuing instructions greatly affects the ability
of keeping the hardware busy. Therefore, having a multitude of functional units does not
automatically make a better system. It becomes apparent that the order of instructions
plays a critical role in exploiting the maximum level of parallelism.

Instructions can be issued in-order or out-of-order [18]. All of the architectures
discussed issue instructions in-order of the program. Since there are a limited number of
functional units, certain combinations of operations cannot be issued together. Two
integer operations, for example, cannot be issued on the same clock cycle if there is only
one integer unit. In order to execute correctly, a stall or NOP (No Operation) must be
injected into the code. Out-of-order issue can eliminate this and use up the wasted stall
time. This is achieved by using an instruction window to look ahead at several
instructions to determine if there are any data dependencies. Unfortunately, the
complexity involved in the hardware has deterred an implementation of this scheme.
Instructions can also be completed in-order or out-of-order [18]. The latencies of ﬂoating
point units tend to be much longer than those of integer units. With in-order completion,
instructions may be needlessly stalled while one very long instruction is executing. With
out-of-order completion, this can be avoided as long as data dependencies are observed.
Some of the problems with out-of-order completion are the added hardware for
dependency checks and the difﬁculties dealing with precise interrupts and exceptions.
Since instructions may be completed out of order, the restart point may cause the
instruction to be incorrectly executed twice. Again, extra hardware is required to support
precise interrupts.

The dynamic techniques of achieving parallel instruction execution all require
extra hardware. A VLIW processor tries to adhere to the RISC philosophy. Instead of

adding complex hardware schemes to resolve the problems of multiple instruction issues

during runtime, VLIW processors resolve dependencies during compile time. The
instruction word contains all the information for each functional unit. This makes for
very long instructions as more and more functional units are added. By scheduling and
analyzing programs off-line, we can get rid of the hardware overhead and ultimately
achieve better performance than a dynamic solution. This approach also allows for very
simple instruction issuing and decoding. The Multiﬂow TRACE series were the ﬁrst
commercially available processors implementing the VLIW architecture [29].
Unfortunately, the performance that was achieved was severely limited by the compiler
technology. It was unable to extract enough ILP at the time [29]. There are basically two
methods to extract ILP for a VLIW architecture: code movement and guarded execution.
Code movement is much easier and cheaper to implement than guarded execution.
Guarded execution requires the addition of hardware registers and additional logic along
with semaphore semantics in software. The basic idea is to be able to execute
speculatively along both branches when a conditional branch is encountered and to
invalidate the path not taken. This method usually requires many shadow registers and

memories to be effectively implemented.

2.5 Code Motion

Code motion, otherwise known as list scheduling, is an effective method to
expose ILP. A human compiler can write very efﬁcient assembly level code, but today's
compilers are still not as smart as we would like them to be. The two most popular forms
of list scheduling are known as percolation scheduling [22], and trace scheduling [4].
Early efforts such as the Bulldog [3] compiler and the Multiﬂow TRACE [28] implement
these techniques with varying degrees of success.

Percolation gets its name from trying to "percolate" instructions up as far as
possible to ﬁll in wasted NOP space. There are several key points that must be observed

when moving code around. The primary challenge is keeping the semantics the same.


 

Examples of code will be in the following format:

I1: OP1    * Comments and explanations
    OP2
I2: OP1

where I1 is a single instruction consisting of two concurrent operations, and I2 is a single
instruction with only one operation.

 

 

 

 

 

[Figure 2.3 diagram: "Before" and "After" layouts of basic blocks 1-3. Operations shown:
A = B+C, D = A+3, E = D+1, H = F+H, F = G+B, I = (D), (G) = A.]

 

 

 

 

 

 

Figure 2.3. Percolation scheduling

From the example in Figure 2.3, we assume that there are only three basic blocks of code
(1-3). If there were successors following blocks 2 and 3, we would also have to take
those into account. The key to preserving the semantic meaning is to avoid overwriting
registers that are "live" and memory locations. This restriction on memory basically
dictates that store instructions cannot be moved. Ensuring that executing instructions do
not affect live registers can be achieved in several ways. We can use register renaming to

replace variables or to compute speculated variables. In Figure 2.4, we could not move

the instructions in block 3 without affecting the code in block 2. This is due to the two
variables, I and F, in block 3 that are also used in block 2. If we rename the I and F
variables in block 3 to H and J, we see that it is still possible to move the code from block

3 into block 1. This is an inexpensive and powerful performance booster.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Before:
    Block 1:  A = B+C,  D = A+3,  E = D+1
    Block 2:  F = I+D,  (G) = F+E
    Block 3:  I = (D),  F = G+B

After (rename I to H and F to J in block 3, then move block 3 into block 1):
    Block 1:  A = B+C,  H = (D),  D = A+3,  J = G+B,  E = D+1
    Block 2:  F = I+D,  (G) = F+E


Figure 2.4. Register renaming
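
The renaming step described for Figure 2.4 can be sketched in a few lines of Python. This is only an illustration: the tuple representation of an operation and the helper name `rename_block` are assumptions made here, not part of the original design.

```python
def rename_block(ops, renames):
    """Rename destination variables in a block (hypothetical helper).

    ops     -- list of (dest, sources) tuples in program order
    renames -- mapping from old destination name to new name
    Later reads inside the same block follow the renamed variable.
    """
    renamed = []
    active = {}  # old name -> new name, once the renamed write has happened
    for dest, sources in ops:
        new_sources = tuple(active.get(s, s) for s in sources)
        if dest in renames:
            active[dest] = renames[dest]
            dest = renames[dest]
        renamed.append((dest, new_sources))
    return renamed


# Block 3 of Figure 2.4: I = (D) and F = G+B, renamed to H and J.
block3 = [("I", ("D",)), ("F", ("G", "B"))]
print(rename_block(block3, {"I": "H", "F": "J"}))
```

After renaming, block 3 no longer writes I or F, so it no longer conflicts with the uses of those variables in block 2.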

Another semantic preserving method is adding compensation code. If we restrict
our movements to instructions that can be undone, we can shorten one path possibly at
the expense of another. This is the idea behind trace scheduling [4]. If we have a good
proﬁle of how programs normally operate, then by reducing the path taken most often,
the added cost of compensation code is negligible. In Figure 2.5, we reduced the left path
from four to three instructions at the expense of the right path which went from four to
five instructions. If we assume this is a loop with 100 iterations in which the left path is
taken 90% of the time, then the total number of cycles needed to execute this program
would be 90*3 + 10*5 = 320 cycles. The original code would have taken
90*4 + 10*4 = 400 cycles. This is a substantial improvement for our very small example.
The difficulty in such a scheme, however, is obtaining an accurate runtime profile of
programs. If, for some reason, the program switched over to executing the right path
ninety percent of the time, we would experience a penalty rather than the intended
performance boost.

Before:
    Block 1:        A = B+C,  D = A+3,  E = D+1
    Block 2 (90%):  F = F+D,  G = G+1
    Block 3:        I = (D),  H = G+F

After:
    Block 1:        A = B+C,  I = (D),  D = A+3,  G = G+1,  E = D+1,  F = F+D
    Block 3:        G = G-1,  F = F-D,  H = G+F

 

 

 

 

Figure 2.5. Compensation code
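
The cycle counts quoted for Figure 2.5 can be checked with a short Python sketch. It assumes one instruction per cycle and uses the path lengths given in the text. The function name `total_cycles` and the "flipped" profile are illustrative assumptions.

```python
def total_cycles(left_iters, right_iters, left_len, right_len):
    """Cycles spent when every iteration follows one of two paths."""
    return left_iters * left_len + right_iters * right_len


before = total_cycles(90, 10, 4, 4)   # original code: both paths take 4 instructions
after = total_cycles(90, 10, 3, 5)    # left path shortened, right path lengthened
flipped = total_cycles(10, 90, 3, 5)  # same schedule, but the profile was wrong

print(before, after, flipped)
```

With the assumed profile the rescheduled code needs fewer cycles than the original. If the right path dominates instead, the same schedule costs more than the original code.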

The other major drawback of rescheduling is exception handling. Note that in
block 3 of Figure 2.5, there is a memory read into variable I. On a real system, this may
cause a page fault and require the system to stop and load up a new page of memory. The
increased memory reference can cause the system to start thrashing if it is scheduled up to
block 1. This further limits our ability to move code around. A technique called sentinel
scheduling [20] combines the ideas of guarded execution with list scheduling to avoid

these problems.


2.6 Reconﬁgurable Systems

The parameters and problems associated with parallel and superscalar computing
present many possible solutions. Reconﬁgurable systems allow the designer to
implement and test various designs quickly and efﬁciently.

A VLIW architecture called the VIPER [10] claims a 100 MIPS peak throughput.
At 25 MHz, it is capable of performing a branch, two load/store operations, and up to
four ALU operations per clock cycle. A study done with the VIPER on pipelining and
bypassing [1] also showed that the added cost of a fully interconnected network for
operand forwarding is signiﬁcant. The additional bus lines increased cycle time and
silicon area. This analysis shows that the dynamics of large systems can vary widely
with different implementations and applications.

By using a reconﬁgurable system, we can study a greater number of applications
to ﬁnd the best solution. A reconﬁgurable system gives us the ability to tailor an
architecture through software. This allows the designer to optimize the amount of
hardware for each application. Architectures that would beneﬁt from additional
forwarding paths, for example, could be implemented on the same system as an
architecture whose main purpose is to get the fastest cycle time. Reconﬁgurable systems
can provide computer designers with important data on the impact of architectural
changes and features.

The Splash 2 is a reconﬁgurable system composed of 17 FPGAs. The Field
Programmable Gate Array is a matrix of programmable cells called CLBs (Conﬁgurable
Logic Blocks) [34]. The Xilinx 4000 series uses two function generators per CLB that
can accept up to four inputs and perform any Boolean function on them. The functions

are implemented with look-up tables, so the delay associated with each generator is

independent of the function being implemented. A simpliﬁed block diagram [34] of the
CLB is shown in Figure 2.6. The multiplexer controls and logic functions are determined
by the conﬁguration.


 

Figure 2.6. Simpliﬁed block diagram of the XC4000-Family CLB
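
The look-up table behavior of a four-input function generator can be modeled in Python as a rough sketch. The helper names `make_lut` and `lut_eval` and the choice of the first input as the most significant address bit are assumptions for illustration only, and do not describe the actual Xilinx configuration format.

```python
from itertools import product


def make_lut(func, n_inputs=4):
    """Precompute the truth table of func as a list of 2**n_inputs bits."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=n_inputs)]


def lut_eval(table, *bits):
    """Evaluate the function by indexing the table (first input is the MSB)."""
    addr = 0
    for b in bits:
        addr = (addr << 1) | (b & 1)
    return table[addr]


# Two unrelated functions of four inputs, each stored as a 16-entry table.
xor4 = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
and_or = make_lut(lambda a, b, c, d: (a and b) or (c and d))

print(lut_eval(xor4, 1, 0, 1, 1), lut_eval(and_or, 1, 1, 0, 0))
```

Every evaluation is a single table lookup regardless of which function was stored, which is why the generator delay does not depend on the function being implemented.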

The design can be synchronous by using the two ﬂip-ﬂops to latch values, or
combinational using the unlatched outputs. Each CLB is surrounded by an
interconnection network that allows the CLBs to communicate with each other. Delay
times governed by routing are a key issue in developing fast designs with FPGAs.
Signals are brought off chip through IOBs (Input Output Blocks). These pins can be
conﬁgured as latched, tristated, or combinatorial signals. The third generation Xilinx
4025 has 1024 CLBs in a 32x32 array with 256 IOBs. This is large enough to implement
simple algorithms and state machines. We will examine a few of these efforts, each with

a unique strategy on how to utilize the FPGA for custom computing.


FPGAs have the ability to be reprogrammed an unlimited number of times. This
allows for fast design and development of ASICs and prototypes at relatively low cost.
Applications such as image processing beneﬁt from the reconﬁgurable hardware as well.
In general, these applications have parameters that may change from execution to
execution, but possess potential for improved speed from hardware implemented
algorithms. Image classiﬁcation [25] and convolution [24] are just a few examples of the
successful applications of this technology. Other applications utilize the FPGAs as a
reprogrammable co-processor or as execution units [16]. A self-timed floating point
co-processor [23] can be added without affecting the system clock of an existing system.

Several reconﬁgurable processors have also begun to appear [17][6][32][14][2].
The ability to generate hardware speciﬁcally for the instructions to be executed is an
exciting prospect. Self-reconﬁguring processors [6][32] utilize the feature of some
FPGAs to be partially reconﬁgured. This allows for a "virtual hardware" setup where a
processor may support several operations, but only have room for one or two actual
configurations to be running on the FPGA. This also provides the ability to create as
many functional units as necessary to extract the maximum level of parallelism.
However, work done with the RRANN2 [11] shows that the overhead for reconfiguring
systems is quite high. In a neural net application, 80% of the total time was spent in
reconﬁguration. This overhead still limits these systems to highly parallel applications.

Systems such as the SPYDER [17] and the DISC architecture [32] were built with
the idea of executing instruction level programs. They have hardwired data paths and a
small array of FPGAs for reconﬁgurable execution units. The larger more general
purpose boards such as the Teramac [2], the Enable++ [14] and the Splash 2 [7] all have
more FPGA chips and programmable crossbars to implement different communication
topologies. These boards have been shown to perform well for programs with large
amounts of data parallelism; however, they need to be programmed specifically to take
advantage of their hardware.

With the advent of good hardware descriptor compilers and tools, the concept of
using an implementation independent model [27] to explore new instruction set
architectures was proposed. With these large arrays and VHDL as a tool, it becomes
possible to create an instruction set architecture on top of these general purpose custom
computing arrays. This gives us the ability to create a custom computing system flexible

enough to implement the four models of computers that Flynn proposed.

2.7 Summary

Exploring the different ways we can enhance single processor performance, we
note that there are basically two methods of increasing throughput: (1) decreasing cycle
time, and (2) executing multiple instructions. Cycle time ultimately depends on the
technology available. Multiple instruction execution however deals with the organization
of the architecture and the programs executing on the processor.

The key for successfully executing multiple instructions lies in the ability to
schedule them to avoid hazards. Instruction scheduling techniques can be static or
dynamic. Although initially more attractive, dynamic scheduling techniques cost more in
hardware and add complexity to the machine. This will set an upper limit of performance
when compared to a statically scheduled machine. Statically scheduled machines do not
need to examine instructions on the ﬂy before execution. While it is too early to say
which approach will ultimately perform best, both approaches need to be implemented
and studied. Splash 2 offers a cost-effective environment to realize and test these

architectures quickly.

CHAPTER 3

Design Methodology & Architecture
Speciﬁcation

With an understanding of some of the problems and issues of superscalar
processors, we now consider implementing a superscalar processor on an FPGA-based
array. The design of our superscalar processor is driven by a set of architectural
speciﬁcations. This chapter ﬁrst introduces the RISC architecture as the foundation of
our new VLIW machine. The details and limitations imposed by the Splash 2 are
discussed followed by the ﬁnal architectural speciﬁcation of the Dex-II that was
implemented. Finally, we examine the various hazards that exist and how they are

resolved.

3.1 Architecture Driven Design Methodology

Two distinct architectures form the new VLIW architecture, Dex-II. The Dex JR.
provided the framework for the RISC instruction set, and the Splash 2 architecture drove
the implementation strategies. This architectural driven approach is distinct from
application driven designs.

Application driven architectures are targeted towards very speciﬁc tasks. The
Splash 2 was developed as a method to implement application driven architectures.
Application driven architectures become obsolete when the application is outdated or no longer

needed or outdated. Using the Splash 2 allows the user to replace the architecture when


[Figure 3.1 labels. Left stack: Inputs, Application Specific Architecture, Splash 2
Architecture, Outputs. Right stack: Inputs, Program, VLIW Architecture, Splash 2
Architecture, Outputs. Shared label: All Architectures.]

Figure 3.1. Architecture hierarchy

this occurs. Although less costly than building a new system each time a new task arises,
using the Splash 2 does incur some overhead. Architectures are now limited by the
resources and capabilities of the Splash 2 system. Figure 3.1 shows how an application
speciﬁc architecture is constrained by the Splash 2 architecture. Application driven
architectures are determined by the speciﬁc task and the resources of the Splash 2
architecture.

Figure 3.1 also shows how the VLIW architecture is implemented within the
constraints of the Splash 2 architecture. It is similar to an application speciﬁc
architecture, however the purpose of the VLIW is not to perform a speciﬁc task. The
VLIW architecture can be programmed to solve several different tasks by replacing the
program block in Figure 3.1. Instruction set architectures change over time as new

hardware is developed and new instructions are added. Splash 2 offers the ability to

prototype and alter the architecture as needed. The architecture of the VLIW machine is
driven by the RISC instruction set, the issues of pipelining, and the resources of the

Splash 2 architecture.

3.1.1 Evaluation Criteria

The architecture of the Dex-II is based upon the RISC instruction set. This
instruction set provides all the primitives necessary to perform more complex
instructions. The Dex-II, therefore, must successfully implement these instructions. In
doing so, we can compare the performance of our implementation to the base RISC
architecture.

We will also evaluate how the VLIW instruction set affects the extraction of ILP.
A comparison of the VLIW and RISC code in Chapter 5 will establish a quantitative
measure of the performance of the new architecture. This measure of performance allows
computer designers to better understand and improve architectures.

A successful implementation of the Dex-II will also allow us to measure the
utilization of the FPGAs. This allows us to evaluate the use of the Splash 2 resources in
prototyping instruction set architectures. This data, presented in Chapter 4, is important

to improving future generations of FPGA-based arrays.

3.2 Architectural Speciﬁcations

The architectural specifications set forth by the Splash 2 and Dex JR. set the
expectations and implementation of the Dex-II. Our goal is to implement a processor that
has the same functionality as the Dex JR. on a platform as flexible as the Splash 2. We
will then extend the original RISC architecture to a VLIW implementation called the
Dex-II. The architectural specifications provide the level of detail necessary to achieve
that goal.

3.2.1 Splash 2

Implementing any design on the Splash 2 requires a good understanding of the
underlying hardware. The Splash 2 system is based on the Xilinx 4010 FPGA. Each
chip represents a processing element that consists of 400 CLBs. Each CLB can perform
two independent Boolean functions of up to four variables, a function with ﬁve variables
and a four variable function in certain instances, or a single function up to nine variables.
The resulting signal can be combinatorial or registered by the two ﬂip-ﬂops. A careful
study of the simpliﬁed block diagram in Figure 2.6 should clarify the functionality of the
CLBs. The resources of the CLBs will determine the ﬁnal size needed to implement a
design.

The XC4010 also has 160 IOBs to control and condition signals entering and
leaving each chip. These IOBs are used by arranging the chips in a linear array, with two
edges of each chip communicating with its neighbors and a third edge connected to a
central crossbar, as can be seen in Figure 3.2. The left and right edges can talk only to the
adjacent chips through a 36-bit data path, while the third edge can communicate with up
to five different chips through the crossbar.

X1 X2 X3 X4 X5 X6 X7 X8
X0 Crossbar
X16 X15 X14 X13 X12 X11 X10 X9

 

Figure 3.2. Splash board

 

 

 

 


The crossbar is a 36-bit data path arranged in 8-bit segments for a total of four 8-bit
slices and one 4-bit slice. Each slice can be configured to transmit data in one
direction. It can store and switch among eight different configurations connecting any of
the bit slices from one chip to another. Control of the crossbar is handled by the
seventeenth FPGA, X0.

Each PE (Processing Element) also has
a 256K by 16-bit memory associated with it.
This is accessed through the fourth edge of the
chip with 18 address lines and 16 data lines.
Timing of the memory is handled separately by
the Splash 2 system. A read operation requires
two global clock cycles, one to latch the
address and the following cycle to read the
data. Although it is possible to pipeline back to
back reads, a read followed by a write must be
separated by at least one clock cycle. Since a
write is initiated by placing an address and the
data to be written, a write issued directly after
a read would need data lines that are still in use.
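
The read/write separation rule can be illustrated with a small Python sketch. It models only the rule that a write may not directly follow a read, not the two-cycle read latency. The function name `insert_turnaround` and the "R"/"W"/"idle" labels are hypothetical.

```python
def insert_turnaround(ops):
    """Insert an 'idle' slot wherever a write would directly follow a read."""
    scheduled = []
    for op in ops:
        if op == "W" and scheduled and scheduled[-1] == "R":
            scheduled.append("idle")  # keep the write off the busy data lines
        scheduled.append(op)
    return scheduled


print(insert_turnaround(["R", "R", "W", "W", "R", "W"]))
```

Back-to-back reads and back-to-back writes pass through unchanged. Only a read followed by a write gains an idle slot.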

Control of the Splash 2 system is
handled by a SPARC host attached to the board
through a VME backplane. The Splash 2 is
scalable by adding up to 255 more Splash

 

 

 

 

 

 

[Figure 3.3 boxes: VHDL Source, Simulation, Logic Synthesis, Timing Extraction,
Place and Route, Splash 2]

 

 

Figure 3.3. Design ﬂow

boards to this backplane. Each board will have its X1 chip attached to the previous
board's X16 chip and so forth. The host can be interfaced with C libraries or a symbolic
debugger known as T2. These software interfaces allow the user to map bitstreams to
specific FPGAs, control the clock, and perform many other debugging and I/O functions.


The design ﬂow for programming the Splash 2 system is depicted in Figure 3.3.
Programming of each PE is done with a hardware description language, VHDL. Physical
pins of each PE are mapped into logical signals using a combination of Splash 2 and
vendor libraries. Simulation can be done to ensure that the algorithm is correct using
Synopsys design tools. After simulation, the user synthesizes the VHDL code into a gate
level description using Synopsys compilers. At this point, the Xilinx software is invoked
to automatically place and route the desired hardware. The ﬁnal result is a timing

analysis of the design and a bitstream that can be downloaded to the Splash 2 system.

3.2.2 Dex JR.

The framework of our new processor, the Dex-II, comes from a simple, clean
RISC architecture called the Dex JR [31]. The instruction set consists of load/store
instructions, three mathematical operations, and conditional branching. The design of the
Dex JR. called for a four stage pipeline: Fetch, Decode, Execute, and Writeback. This
balanced our pipeline to provide the best possible cycle time with the components that
were available.

A four port, 32x32-bit register ﬁle provides operands for mathematical operations
as well as absolute addressing for memory. Four ports were necessary in order to write a
single result and supply two operands per cycle. The fourth port was exploited by
implementing a double load from memory. This special instruction proved to be a
signiﬁcant factor in performance by increasing the bandwidth to memory. Since all the
mathematical operations take two operands, this reduced the time to load the operands
from two cycles to one. Programs which require intensive memory cycles such as array
and matrix operations also beneﬁted from the ability to load multiple values.

Data dependencies of consecutive instructions were eliminated by forwarding
paths. Scheduling of the correct forwarding path for execution is done statically during

compile time. Only one delay slot is necessary for the branch commands as the execution

of the branch was done in the Fetch stage. The delay slot comes from waiting for
condition codes to be resolved in the Execute stage.
The global clock is broken down into a four phase clock to perform memory
operations. This ensures that the setup and hold time requirements are met. The
instruction length is 32-bits non-encoded and the data and instruction caches are

independent of each other. The data path, designed as a 32-bit architecture, is shown in
Figure 3.4.

3.2.3 Dex-II

Since FPGA technology is more complex than dedicated chips, the top speed and
performance will always lag the most current ASIC technologies. In order for an
FPGA-based architecture to compete, it must be able to provide unique features through its
programmability. This feature allows Splash 2 to prototype an ISP with the added
performance of the latest architectural advances. The goal, therefore, was to increase
processor throughput with a redesign of the Dex JR. by doubling the processor power.
This is accomplished by designing the architecture to execute two instructions per clock
cycle instead of one.

The Dex-II is a VLIW version of the Dex JR. A VLIW architecture was chosen
for its relative simplicity of hardware. This was a major concern as the size of designs for
each PE was limited by the physical FPGA itself. The architecture implemented on the
Splash 2 is also scaled from the original 32-bit architecture down to a 16-bit data path due
to the limited communication resources between PEs. The functionality of the instruction
set was retained.


 

 

 

 

 

 

 

 

 

 

 

 

[Figure 3.4 labels: Mem1, Mem2, MDR, Write Address A, Write Address B, AM29C334
Register File, Read Address A, Read Address B, A, B, Immediate, Forwarding Path,
Adder, Back to Control Path]

 

 

Figure 3.4. The Dex JR.

The pipeline depth was also altered to three stages due to the increase from a four
phase clock to a six phase clock. This was to accommodate the need for more
communication between stages and implementation across multiple PEs of the Splash 2
board. This reduction in pipeline stages also eliminates several forwarding paths. With
communication resources so highly limited, this tradeoff was necessary to retain data

coherency.

The instruction length is doubled to 64 bits to execute two instructions. Extra bits
are used to pass register destinations of both processors determined at compile time to
retain data coherency. The redesigned data path is shown in Figure 3.5. The box
enclosing the functional units allows any combination of two instructions to be executed,
each with its own operands. Both memory and registers are kept coherent with a combination

of hardware and compile time methods.

[Figure 3.5 labels: Register Address, Register File, Opcode, Add/Sub, Mult, Memory,
Cond. Codes]

 

 

 

 

Figure 3.5. The Dex-II


3.2.4 The Instruction Set

The instruction set consists of 14 different operations, as listed in Table 1. This
reduced instruction set allows more complex functions and memory addressing by the

combination of these simpler instructions. The Dex-II allows any two of these operations

to be executed at the same time except for branching.

Table 1. Instruction set

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Instruction     Description               Register Transfer Level
(Rx, Ry, Rz)

LD              Load from memory          Rx <- (Ry)
LDI             Load immediate            Rx <- immediate
ST              Store to memory           (Rx) <- Ry
ST              Copy Register             Rx <- Ry
BRA             Branch                    PC <- Address
BZ              Branch if Zero            PC <- Address if Z
BN              Branch if Neg.            PC <- Address if N
BNZ             Branch if Neg. or Zero    PC <- Address if NZ
BNV             Branch if Neg. or Ovfl.   PC <- Address if NV
ADD             Add                       Rx <- Ry + Rz
SUB             Subtract                  Rx <- Ry - Rz
SFTL            Shift Left                Rx <- Ry shifted left 1
SFTR            Shift Right               Rx <- Ry shifted right 1
MULT            8-bit Multiply            Rx <- Ry * Rz
NOP             No Operation

 


 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

ALU Operations, Load, Store
    | 31..27 Opcode | (blank) | 19..15 OpA | 14..10 OpB | 9..5 Dest Top | 4..0 Dest Bot |

Load Immediate
    | 31..27 Opcode | (blank) | 25..10 Immediate Value | 9..5 Dest Top | 4..0 Dest Bot |

Branch (bit labels as printed: 31, 27, 26, 10)
    | Opcode | (blank) | Destination Address | (blank) |

Operation   Opcode
LD          00100
LDI         00101
ST          00110
ST          00111
ADD         10000
SUB         10001
SFTL        10010
SFTR        10011
BRA         01000
BZ          01001
BN          01010
BNZ         01011
BNV         01100
NOP         00000

Registers are addressed from R1-R31. R0 will always return a zero and writing to it
will not affect it.

 

 

 

 

Figure 3.6. Instruction format


Figure 3.6 depicts the three different instruction formats. The OpA and OpB
fields identify the source registers for the execution units. The Dest Top and Dest Bot
fields identify the destination registers for the top half and the bottom half of the
processor, respectively. Both halves carry identical destination fields; only their
Opcode and source registers differ. Registers are addressed from 0 through 31,
represented in standard binary form. By making the destination fields the same on the
top half and bottom half of the processor, we encode information in the instruction rather
than passing that information during runtime. This ensures that all instructions affecting
registers will update all four PEs in the decode stages concurrently with a minimal
exchange of information. There are also extra bits and room left in the opcode for future
expansion and addition of instructions. Two possible additions are register windowing
instructions and paged memory to exploit unused memory.
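
As a rough illustration of the ALU format in Figure 3.6, the following Python sketch packs one 32-bit slot for each half of the processor. Several details are assumptions made here rather than facts from the design: the helper names, the mapping of OpA/OpB to the first and second source register, zero-filled unused bits, and returning the two slots as a pair instead of a single 64-bit word.

```python
# Subset of the opcodes listed in Figure 3.6.
OPCODES = {"ADD": 0b10000, "SUB": 0b10001, "SFTL": 0b10010, "SFTR": 0b10011}


def encode_slot(op, src_a, src_b, dest_top, dest_bot):
    """Pack one 32-bit ALU-format slot (unused bits are assumed to be zero)."""
    for reg in (src_a, src_b, dest_top, dest_bot):
        if not 0 <= reg <= 31:
            raise ValueError("register index must be in 0..31")
    return ((OPCODES[op] << 27) | (src_a << 15) | (src_b << 10)
            | (dest_top << 5) | dest_bot)


def encode_pair(top, bottom):
    """Encode two operations given as (op, dest, src_a, src_b).

    Both slots receive the same Dest Top / Dest Bot fields, as in the text.
    """
    op_t, dest_t, a_t, b_t = top
    op_b, dest_b, a_b, b_b = bottom
    return (encode_slot(op_t, a_t, b_t, dest_t, dest_b),
            encode_slot(op_b, a_b, b_b, dest_t, dest_b))


top_word, bottom_word = encode_pair(("ADD", 3, 1, 2), ("SUB", 5, 4, 6))
print(hex(top_word), hex(bottom_word))
```

The low ten bits of both slots are identical, so either half can see both destination registers without exchanging information at runtime.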

3.3 Design Constraints

The Dex-II was designed to keep wasted NOPs at a minimum. However, spatial
and temporal parallelism introduce certain restrictions that must be observed. These

constraints are classiﬁed as data, memory, and control hazards.

3.3.1 Data Hazards

Data hazards are constraints caused by the data path. Two operations that have a
data dependency between them cannot be issued on the same clock cycle. Figure 3.7
shows an example of data dependent instructions. The first example shows two
operations that cannot be executed together since the second depends upon the
result of the first. The code is executed correctly by inserting NOPs in the
second instruction slot to delay execution until the result has been computed and saved to
register 3. True data dependencies cannot be avoided; however, certain architectures
implement special hardware to reduce their impact.

 


 

Example.
The following instructions will produce erroneous results

ADD R3,R1,R2    * R3 <- R1+R2
ADD R5,R3,R4    * R5 <- R3+R4

This is due to the fact that R3 will not be computed in time for the second instruction. In order to
execute correctly, either another instruction needs to be scheduled in its place, or a NOP must be
injected

ADD R3,R1,R2
NOP

ADD R5,R3,R4
NOP

 

Figure 3.7. Data dependency

 

The ALUs of the Dex-II, unlike those of the SuperSPARC, are completely separate from
one another. The cascaded ALU scheme shown in Figure 3.8 is used in the SuperSPARC
to eliminate this restriction. If the second instruction depends on the first, then they are
executed sequentially with ALUC; otherwise they are executed independently on ALU2
and ALU0. This scheme was implemented at the expense of another pipeline stage. For
a deeply pipelined design, this solution is feasible; however, for the Dex-II this would
effectively give the same performance as the example with two instructions. Even worse,
the addition of another pipeline stage would affect the performance of other instructions.

v I

ALU2 ALUO

 

 

 

 

 

ALUC

 

 

33

Figure 3.8. Cascaded ALU

Read after Write (RAW) and Write after Read (WAR) data dependencies are
eliminated by the decode/writeback scheme employed. This scheme ensures that
registers are updated prior to broadcasting the operands for the next instruction. Details

of the implementation can be found in the Synthesis chapter.

3.3.2 Memory Hazards

Care must be taken to avoid assigning different values to the same register or memory
location on the same cycle. Figure 3.9 shows code which tries to write two
different results into the same register location.

Example.

    ADD R3,R1,R2    *R3<-R1+R2
    ADD R3,R4,R4    *R3<-R4+R4

This will corrupt the register file.

    ADD R3,R1,R2
    NOP             *Or reschedule another instruction

    ADD R3,R4,R4
    NOP

Figure 3.9. Memory hazard

Writing different values to the same memory address or register in the Dex-II will cause
the top and bottom processors to have conflicting memory and registers. This may force
programs that dynamically address memory to insert NOPs to ensure data coherency.
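The data and memory hazards described above can be checked mechanically before two instructions are paired in one instruction word. The following is a minimal sketch, not the thesis's scheduler: the `Instr` tuple, the `bundle_hazards` name, and the choice to flag a dependency in either direction (a conservative assumption) are all illustrative. Identical instructions in both slots are allowed, since they write the same value.

```python
from typing import List, NamedTuple, Optional


class Instr(NamedTuple):
    """Hypothetical three-operand instruction: op dest, src1, src2."""
    op: str
    dest: Optional[str] = None
    src1: Optional[str] = None
    src2: Optional[str] = None


NOP = Instr("NOP")


def bundle_hazards(top: Instr, bottom: Instr) -> List[str]:
    """Return the hazards that forbid issuing `top` and `bottom` together."""
    hazards: List[str] = []
    if top.op == "NOP" or bottom.op == "NOP":
        return hazards  # a NOP never conflicts with anything

    # Data hazard: one slot reads the register the other slot writes.
    # Flagging both directions is a conservative assumption of this sketch.
    top_feeds_bottom = top.dest is not None and top.dest in (bottom.src1, bottom.src2)
    bottom_feeds_top = bottom.dest is not None and bottom.dest in (top.src1, top.src2)
    if top_feeds_bottom or bottom_feeds_top:
        hazards.append("data")

    # Memory hazard: both slots write the same register with (potentially)
    # different values. Identical instructions produce identical values.
    if top.dest is not None and top.dest == bottom.dest and top != bottom:
        hazards.append("memory")
    return hazards
```

Applied to the examples in Figures 3.7 and 3.9, the first pair is reported as a data hazard and the second as a memory hazard; pairing either instruction with a NOP reports nothing.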

3.3.3 Control Hazards

Keeping both processors synchronized is a challenging task. Since the top and
bottom processors have independent program counters, they need the same information in
order to branch correctly. Since most conditional branching occurs on comparisons,
condition codes are only generated by the Add/Sub unit. This means that both
instructions setting the condition codes must be the same and that the conditional branch
must follow with a one instruction delay. Figure 3.10 illustrates how control hazards are
resolved in the Dex-II by this strategy.

Example.

    SUB R3,R1,R2    *Sets up top CC
    SUB R3,R1,R2    *Sets up bottom CC

    NOP             *Or filled with some useful instruction
    NOP             *Or filled with some useful instruction

    BEQ 1000        *Branch to $1000 if zero flag set
    BEQ 1000        *Branch to $1000 if zero flag set

Figure 3.10. Control hazard

3.4 RTL Specifications of Dex-II

To implement the Dex-II on Splash 2, we began by developing a simple RTL
(Register Transfer Level) description of the Dex-II. This allowed us to partition our design
according to the resources and communication paths available on the Splash 2. It became
very apparent that the communication paths offered by the crossbar and linear array
would be the determining factor of how our design would be implemented. The number
of clock cycles needed to implement an instruction cycle was determined by the data
paths between PEs. After this was established, a detailed RTL that is easily translated
into VHDL code was simulated and synthesized.

3.4.1 PE Partitions

Mapping a design over to the Splash system is a challenging task. Each
processing element is limited by the number of CLBs, number of I/O pins, and the
amount of memory available per PE. Even with larger FPGA chips, the amount of
hardware that can be assembled onto them is still somewhat limited. With this in mind,
we begin by splitting the resources stage by stage.

The partitioning of functions is shown in Figure 3.11. Processing elements PE1 -
PE8 represent half of the complete VLIW processor and can stand alone to execute scalar
RISC code. This half of the processor will be referred to as the top processor hereafter.
Elements PE9 - PE16 are the mirror hardware to implement the second processor. This
half will be referred to as the bottom processor.

The control path is implemented across the first two PEs and supplies a 32-bit
instruction word to its half of the processor. Since the memory is only 16 bits wide, two
elements are needed to generate the 32-bit word. Elements PE1 and PE2 also house the
program counter, the instruction memory, and the conditional branching logic. The
decode stage contains the register file and circuitry to forward data. We chose to use two
elements in this case to provide extra bandwidth to allow dual reads from memory. This
allows us to shorten the number of clock cycles needed to complete a single instruction
cycle. The execute stage consists of four functional units: two adder/subtractor units, a
multiplier, and the memory manager. This setup is cloned on the bottom to provide equal
functionality on both the top and bottom halves of the processor. The memory units of
PE8 and PE9 utilize the extra left-right communication lines between them to keep the
memory coherent.

    PE1   PE2   |  PE3   PE4   |  PE5   PE6   PE7   PE8
                |              |
       Fetch    |    Decode    |
         &      |      &       |        Execute
       Branch   |   Writeback  |
                |              |
    PE16  PE15  |  PE14  PE13  |  PE12  PE11  PE10  PE9

Figure 3.11. PE partitioning

3.4.2 Communication Paths

The amount of communication between PEs was critical in this partitioning
scheme. As noted before, the major problem encountered here is the limited number of
communication paths to implement bus structures. While the crossbar is capable of
handling multiple connections, we cannot, for example, selectively broadcast a 16-bit
operand from the decode stage to each of the four execute stages.

The left/right communications of each chip are dedicated to the control path.
Instructions issued from PE1 and PE2 flow to the right and supply each pipeline stage
with the next instruction. The data path, therefore, is limited to using the crossbar for
passing operands and results around. Using the programmable crossbar, each cycle could
send and receive data from different PEs.

Figure 3.12. Cycle 1

Figure 3.12 shows the first of six cycles. PE1 and PE2 fetch the next instruction
from their respective memories while PE3 and PE4, the Decode stage, fetch the operands
from their register file memory. The 16-bit results from the Execute stage from PE5 and
PE6 are sent back to the Decode PEs through the crossbar. If the instruction in the
Execute stage is a memory store, then PE8 performs the store operation.

Figure 3.13. Cycle 2

Figure 3.13 depicts the communication paths for the second cycle. The Fetch
stage swaps information about the new instruction and forms the full 32-bit instruction
word. The Decode stage begins to pass the next instruction along to the Execute PEs and
receives the final two results from PE7 and PE8. PE5 and PE6 now send the results of
the condition codes to the Fetch stage while PE7 and PE8 complete their memory reads.

Figure 3.14. Cycle 3

The third cycle is where synchronization with the second processor occurs.
Information about the destination register of the second RISC processor is encoded in the
instruction word; therefore, only the data needs to be communicated. Figure 3.14 shows
the Decode stage sending and receiving the 16-bit results from the second processor.
Meanwhile, the Execute stage continues to pass along the next instruction and PE8, the
memory manager, exchanges information on what kind of memory operation is
performed. The Fetch stage at this point also updates its PC (program counter). If a
branch is indicated, the PC loads the absolute address from the instruction word;
otherwise, the PC is incremented by one.

Figure 3.15. Cycle 4

Cycle four, shown in Figure 3.15, begins the writeback sequence. PE3 and PE4
write the result from the Execute stage into the register file. Since only one operand can
be written at a time, the result from the second processor must wait. The memory
manager swaps address and operand information. If the second processor performed a
store operation, then we must also perform the same store operation to keep the two
memories coherent.

Figure 3.16. Cycle 5

Figure 3.16 shows the two Decode stage PEs writing the results from the second
processor and sending the operands for the next instruction to PE5 and PE6. The next
instruction for the Decode stage is passed from the Fetch stage and the final execution PE
gets the next instruction as well.

Figure 3.17. Cycle 6

The final cycle, shown in Figure 3.17, illustrates the Decode stage passing
operands to the last two PEs. The second Decode PE receives the next instruction. Each
PE updates its respective instruction buffer to begin the next instruction.

3.4.3 Fetch Stage

The Fetch stage, whose operations are described with RTL statements in Figure
3.18, keeps track of the current program counter (PC). Using the PC as an address, it
reads instructions from PE1 and PE2 in phases 1 and 2 to form a complete 32-bit
instruction. Since each PE only has a 16-bit memory chip, to form the complete
instruction, PE1 must send the most significant word to the right and PE2 concatenates
the two to send to the Decode stage. Notice that the branch instructions are also executed
in this stage. In order to get the correct branch address, PE1 must have portions of the
least significant word. Therefore, PE2 must also send its values to the left. This occurs
from phase 2 to phase 3 through the XP_Left and XP_Right data path. Cycle 4 is where
the actual branch takes place; otherwise the PC merely increments by one.
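The two operations of the Fetch stage, assembling a 32-bit instruction from two 16-bit memory words and updating the PC, can be modelled in a few lines. This is a behavioral sketch only; the function names are illustrative, and the 18-bit PC width is an assumption taken from the `PC` signal declaration in the sample code of Figure 4.2.

```python
def form_instruction(msw: int, lsw: int) -> int:
    """Concatenate the 16-bit word read by PE1 (most significant) with the
    16-bit word read by PE2 (least significant) into one 32-bit instruction."""
    if not (0 <= msw <= 0xFFFF and 0 <= lsw <= 0xFFFF):
        raise ValueError("each memory word must fit in 16 bits")
    return (msw << 16) | lsw


def next_pc(pc: int, branch_taken: bool, target: int, pc_bits: int = 18) -> int:
    """Load the absolute branch address when a branch is taken; otherwise
    increment the PC by one. The PC wraps at `pc_bits` bits (assumed width)."""
    mask = (1 << pc_bits) - 1
    return (target if branch_taken else pc + 1) & mask
```

For instance, `form_instruction(0x1234, 0xABCD)` yields `0x1234ABCD`, and `next_pc(5, True, 0x1000)` yields `0x1000`, matching the absolute branch of Figure 3.10.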

3.4.4 Decode Stage

The Decode stage is the most complex of all the stages. It needs to supply all four
execution units with two operands each and receive the results from all four execution
units. A multiplexer selects the correct result and the information is swapped between the
top and bottom processor to ensure the register ﬁles are identical. This is the stage that

dictated the six-phase clocking scheme.

Figure 3.18. RTL description of the Fetch stage

The Decode stage begins by assuming that the operands are in memory and starts
the read process. The 5-bit address is taken directly from the instruction word Id. The
only difference between PE3 and PE4 is which operand they read. In the second
phase, the value of the register is read from memory. The result from the first execution
unit (PE5) is also read into both decode stages. Since the Xbar can be configured in four
8-bit segments, each decode stage can accept up to two 16-bit results each phase. The
third phase latches the results from execution units 3 and 4 (PE7 and PE8).

The results from each execution unit are funneled into a 4-1 mux and selected by
the opcode of the executing instruction, Ix. The output of the mux will contain the
correct result from the instruction in the execute stage of the pipeline. Note that the mux
is not latched and thus provides the correct value by the end of the third phase, even
though the inputs were latched at the beginning. The execution result is latched from the
output of the mux at the beginning of phase 4 on the top and bottom halves of the
processor. Likewise, the top latches the bottom result in phase 4. With both results
available, the decision to forward operands can be made.

If the source registers in the decode stage match the destination register in the
execute stage, then the operand is replaced with the new result. The correct operand
begins to broadcast its values to execution units 1 and 2 (PE5 and PE6). Cycle 4 also
physically writes the results back into the register file if the destination is not R0. Cycle
5 writes the bottom half of the processor result if the destination is not R0. Both
operands are sent to execution units 3 and 4 (PE7 and PE8). The last phase latches the
next instruction and moves the current Id to Ix.
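The forwarding decision and the R0-guarded writeback described above can be sketched as follows. This is an illustrative model, not the synthesized logic: the function names are hypothetical, register numbers are plain integers, and skipping forwarding for R0 is an assumption made here to stay consistent with the rule that R0 is never written.

```python
from typing import Dict, List, Tuple

# Each result is (destination register number, 16-bit value); the list holds
# the top-processor result followed by the bottom-processor result.
Results = List[Tuple[int, int]]


def forward_operand(src_reg: int, reg_value: int, results: Results) -> int:
    """Return the operand for `src_reg`, replacing the value read from the
    register file when an execute-stage result targets the same register."""
    for dest, value in results:
        if dest == src_reg and dest != 0:  # assumption: R0 is never forwarded
            return value & 0xFFFF
    return reg_value


def write_back(reg_file: Dict[int, int], results: Results) -> None:
    """Write each result to the register file unless its destination is R0.
    The top result is written first (cycle 4), then the bottom one (cycle 5)."""
    for dest, value in results:
        if dest != 0:
            reg_file[dest] = value & 0xFFFF
```

With `results = [(3, 10), (0, 99)]`, an operand read from R3 is replaced by 10, an operand read from R1 is unchanged, and `write_back` updates only R3.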

[Figure: RTL description of the Decode stage]
3.4.5 Execution Stage

The Execution stage of the pipeline consists of four PEs. Two PEs are configured
to act as integer Adder/Subtractor units, one as a Multiplication unit, and the last PE as a
Memory Management unit. Each unit can be reconfigured with relative ease to add
special instructions. In addition, there is no need to keep the top half and the bottom half
symmetric either. Therefore, it is possible to have up to eight different functional units.
Instructions, however, would be limited to executing on the half which has the hardware
to support them.

The Adder/Subtractor unit is found on PE5 and PE6. This configuration also has
the right and left shift functions programmed onto it. Operands for this unit are actually
latched at the end of the previous instruction. After the first phase is completed, the result
is broadcast on the crossbar to the decode stage. Condition codes are calculated during
the second phase. Codes supported by the Dex-II are Negative, oVerflow, and Zero. The
negative flag is set if the most significant bit of the result is a 1. An overflow occurs if
the add/sub unit indicates a carry out condition, and the zero flag is set if the result is
zero. The function of the second Adder/Subtractor unit is merely to generate the condition
codes.
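As an illustration of the three condition codes, the sketch below derives N, V, and Z from a 16-bit addition exactly as worded above: N from the most significant result bit, V from the carry out, and Z from an all-zero result. It is a hedged model limited to addition; the text does not state the carry convention used for subtraction, so that case is left out. The function names are illustrative.

```python
from typing import Dict, Tuple


def condition_codes(raw_sum: int) -> Tuple[int, Dict[str, bool]]:
    """Split a raw (up to 17-bit) adder output into the 16-bit result and
    the Negative, oVerflow, and Zero flags."""
    result = raw_sum & 0xFFFF
    flags = {
        "N": bool(result & 0x8000),  # most significant bit of the result
        "V": bool(raw_sum >> 16),    # carry out of the 16-bit adder
        "Z": result == 0,            # result is zero
    }
    return result, flags


def add(a: int, b: int) -> Tuple[int, Dict[str, bool]]:
    """16-bit addition with condition codes."""
    return condition_codes((a & 0xFFFF) + (b & 0xFFFF))
```

For example, `add(0xFFFF, 1)` returns a zero result with both V and Z set, while `add(0x7FFF, 1)` returns `0x8000` with only N set.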

The Multiplication unit utilizes the PE memory as a look-up table to achieve its
purpose. The reason for an 8-bit multiply is straightforward. The 16-bit data path can
only support products of two 8-bit words. Rather than design an elaborate plan for
multiple register destinations, we restrict the operands to eight bits. Higher bits are
ignored. The memory table is generated externally and loaded into the PE memory prior
to execution of the program. By taking the two operands and using them as an address,
the Multiplication unit can generate its result by the second phase. This works out well as
we can only send two results back during each phase and phase 1 is being used by both
add/sub units. The last phase is used to latch operands for the next instruction.
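The look-up-table multiply can be sketched as below. The table is built ahead of time, as the text describes, and a multiply becomes a single memory read. The address layout (first operand in the upper eight address bits, second in the lower eight) is an assumption made for illustration; the text only says the two operands are used as an address.

```python
from typing import List


def build_multiply_table() -> List[int]:
    """Generate the 64K-entry product table that would be loaded into the PE
    memory before the program runs. Entry (a << 8) | b holds a * b."""
    return [(addr >> 8) * (addr & 0xFF) for addr in range(1 << 16)]


def multiply(table: List[int], a: int, b: int) -> int:
    """Multiply by table look-up. Bits above the low eight of each operand
    are ignored, so every product fits in the 16-bit data path."""
    address = ((a & 0xFF) << 8) | (b & 0xFF)
    return table[address]
```

The largest entry is 255 x 255 = 65025, which fits in 16 bits; this is why restricting the operands to eight bits avoids the need for a second destination register.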


The Memory Management unit is located at the ends of the processor halves to
utilize the extra communication lines. The right side of PE8 is connected to the left side
of PE9 in the linear chain. This allows us to synchronize the efforts of PE8 and PE9 to
keep a uniform main memory. The first phase latches the memory addresses and
performs a write if the instruction is a store command. Otherwise, it assumes a read from
the address. The second phase will return a result to the decode stage. This result is
either the value of another register, the value of a memory location, or the 16-bit
immediate value in the instruction word Ix. The third phase begins the synchronization
of the two memories by exchanging results and instruction words. If the bottom
instruction was a store, then the top writes the value into memory and vice versa. Since
only stores can affect memory, only stores need to be considered. As before, the last
phase is used to latch the new operands for the next instruction.
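The store-mirroring scheme can be modelled with two memory arrays that each perform their own operation and then replay the other half's store. This is a simplified sketch under stated assumptions: the class and method names are hypothetical, the six phases are collapsed into one `cycle` call, and operations are encoded as small tuples.

```python
from typing import List, Optional, Tuple

# An operation is None, ("load", address), or ("store", address, value).
Op = Optional[Tuple]


class MirroredMemory:
    """Two copies of main memory, one per processor half (PE8 and PE9)."""

    def __init__(self, size: int = 16) -> None:
        self.top: List[int] = [0] * size
        self.bottom: List[int] = [0] * size

    @staticmethod
    def _local(mem: List[int], op: Op) -> Optional[int]:
        """Early phases: perform this half's own store, or return the value read."""
        if op is None:
            return None
        if op[0] == "store":
            mem[op[1]] = op[2] & 0xFFFF
            return None
        return mem[op[1]]

    @staticmethod
    def _mirror(mem: List[int], other_op: Op) -> None:
        """Later phases: replay the other half's store; loads need no action."""
        if other_op is not None and other_op[0] == "store":
            mem[other_op[1]] = other_op[2] & 0xFFFF

    def cycle(self, top_op: Op = None, bottom_op: Op = None):
        """Run one instruction cycle and return (top result, bottom result)."""
        top_result = self._local(self.top, top_op)
        bottom_result = self._local(self.bottom, bottom_op)
        self._mirror(self.top, bottom_op)
        self._mirror(self.bottom, top_op)
        return top_result, bottom_result
```

A single store from either half leaves both copies identical. If both halves store different values to the same address in one cycle, each copy ends up holding the other half's value, which is the memory hazard of Section 3.3.2.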

[Figure: RTL description of the Execute stage]

CHAPTER 4

Simulation and Synthesis Environment &
Results

With the RTL and architectural speciﬁcations completed, the design can be coded
with VHDL and functionally simulated using Synopsys tools. This chapter presents the
simulation tools used to implement the Dex-II. When the simulation correctly
implements the architectural speciﬁcation, the VHDL code can be synthesized and tested

on the Splash 2. The results of the synthesis are also presented in this chapter.

4.1 Simulation & Synthesis Process

The register transfer level description was coded in VHDL for both simulation
and synthesis. Figure 4.1 shows how the VHDL hierarchy is arranged. To program each
PE, we must write VHDL code for each XPARTS module. A sample description of the
VHDL code is shown in Figure 4.2. The VHDL code is very similar to other high level
languages. A header deﬁnes the name of the PE that is being programmed, followed by a
list of signal declarations. These are followed by constant declarations and logic
assignments. A process statement marks the start of the description of the architecture's
logic. The wait statement deﬁnes the signal sensitivity of the process. The process will
be evaluated only when a signal on the sensitivity list changes. In this case, all of the
logic is synchronous on the rising edge of the clock. The code following the wait

statement describes the functionality of our architecture taken from the RTL descriptions.

Figure 4.1. VHDL hierarchy

    architecture Fetch_left of Xilinx_Processing_Part is     -- Header: architecture name

      signal PC:          Bit_Vector (17 downto 0);          -- Signal declarations
      signal loadaddress: Bit_Vector (15 downto 0);
      signal Ix:          Bit_Vector (31 downto 0);
      ...

    begin

      RIGHTIN: for i in 0 to 15 GENERATE                     -- Special signal generation
        Pad_Input (XP_right(i), right(i));
      END GENERATE RIGHTIN;
      ...

      XP_Mem_RD_L  <= '0';                                   -- Constant declarations
      increment_pc <= '1';
      one          <= "000000000000000001";

      process                                                -- Start of process
      begin
        wait until XP_Clk'Event and XP_Clk = '1';            -- All transitions take place on rising clock edge

        local_clk <= local_clk + 1;                          -- Six phase clock state machine
        if local_clk = 5 then
          local_clk <= 0;
        end if;
        ...

        Pad_Input  (XP_Mem_D, mdr);                          -- Memory interface
        Pad_Output (XP_Mem_A, pc);

        if local_clk = 1 then                                -- Logic for each phase taken from RTL
          Ifetch(31 downto 16) <= mdr;
          right(31 downto 16)  <= mar;
          Xbarenable           <= "01111";
        end if;
        ...
      end process;                                           -- End of process

      XP_Mem_WR_L   <= '1';                                  -- Assign unused Splash 2 control signals
      XP_HS1        <= 'Z';
      XP_GOR_Result <= '0';
      XP_GOR_Valid  <= '0';
      XP_Int        <= '0';

    end Fetch_left;

Figure 4.2. Code description

Figure 4.3. Simulation of the Fibonacci program

Lastly, the control signals for the unused resources of the Splash 2 must be defined to
avoid signal conflicts and illegal conditions of the Splash 2 system. A complete copy of
the implemented code can be found in Appendix A.

Simulating the design with Synopsys tools provided valuable insight into timing
problems. More importantly, the simulation allows us to simulate each PE separately.
This is an important factor in keeping the system design manageable. By using the
simulator, we can perform system test and integration by forcing inputs and observing
internal nodes and outputs of the PE. After verification of a PE design, it is integrated
with the Splash array code, and the whole system is simulated. Unimplemented PEs in
the system could be replaced with simulation vectors. This incremental, PE by PE build
was crucial in eliminating system bugs.

The first half of the Dex-II that was implemented was a modified version of the
Dex JR. This is a single instruction RISC version of the Dex-II that will be used to
evaluate the overall increase in performance of the VLIW architecture. Since the VLIW
architecture is almost an exact mirror of the RISC architecture, the simulation and
verification of the Dex JR. RISC design made it much easier to implement the final
version of the Dex-II. This incremental architecture build also played a crucial role in the
final implementation of the Dex-II.

Simulation also allowed debugging of programs that were hand-compiled. Figure
4.3 shows the simulated architecture executing the Fibonacci program, which can be
found in Figure 5.1. This waveform displays the global clock as well as the local clock
on the first two lines. The next three lines show the state of the instruction pipeline. We
can clearly see which instruction is in each pipeline stage. The program counter is
displayed on the next line followed by the condition codes and the branch bit. This
allows us to monitor the state of the control path. The next six lines display information
about the Decode stage. Note that when the first instruction is actually executed, we can
see the WriteEnable signal go high as the two immediate loads are executed. The next
instruction is the addition function and the result of the Execute stage is shown on the last
line of the simulation output.

The objective of our simulation is to write VHDL code that will generate a correct
implementation of the Dex-II in hardware on the first pass. Unfortunately, the compilers,
synthesis tools, and debugger are still relatively young and full of idiosyncrasies. For
example, the interactive command tool, T2, interfaces the host to the Splash 2. T2 has
the ability to scan the flip-flop states of the IOBs. While the design may utilize these pins
going off-chip, the compiler will use a flip-flop from an internal CLB and configure the
IOB without a registered value. This gives misleading feedback to the user as the pin
appears to never change its value.

While simulation did provide information regarding timing, the actual
implementation of our design code did not simulate correctly. As a result, the time taken
to synthesize and debug our implementation was dominated by repeated compilation and
synthesis. Using VHDL as a synthesis tool requires the designer to follow a certain style

and rules of coding to achieve the desired implementation.

4.2 Synthesis Results

Upon veriﬁcation of our design, the automated design tools were invoked to
produce a design speciﬁcally for the XC4010 FPGA. The UNIX script, VHDL2XNF,
invokes the Synopsys VHDL design compiler to generate an XNF (Xilinx Netlist Format)
wirelist. This XNF ﬁle is used by the XNF2BIT script which invokes the PPR (Partition,
Place and Route) software. These tools automatically place, partition, and route the XNF
description of each PE to the physical chip architecture. The placement and routing of
CLBs are iteratively improved upon automatically. The design ﬂow is shown in Figure

4.4 along with the tools used in each step.


At the highest degree of optimization, our design yielded the synthesis results
summarized in Table 2. This table shows us the percentage of total CLBs utilized in each
PE of our design and the timing associated with the synthesis results. The percentage of
CLBs utilized gives us a metric to estimate how much more logic can be implemented in
each PE. This is important for future design considerations.

    Design step          Tool
    VHDL Source          Text Editor
    Simulation           Synopsys VHDL System Simulator
    Logic Synthesis      Synopsys Design Compiler
    Timing Extraction    Xilinx Xdelay
    Place and Route      Xilinx PPR
    Splash 2             T2

Figure 4.4. Design flow and tools

As we can see from Table 2, our critical time is dictated by PE5. The 90.0 ns
result gives us an effective global clock speed of 11 MHz. With the division of the six-
phase clock, this would give us a peak 1.85 MIPS (Millions of Instructions per Second)
rating. Assuming that our VLIW machine can execute two independent instructions per
cycle, that would effectively double our peak to a 3.70 MIPS rating. Current
workstations have a typical peak MIPS rating in the hundreds. Thus, while slow
compared to workstations, this does represent a speed capable of running applications.
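The peak ratings quoted for the 90.0 ns critical path follow from simple arithmetic, reproduced here as a small sketch. The function name and parameter names are illustrative; the six phases per instruction and the issue width of two come from the text.

```python
def peak_mips(critical_path_ns: float,
              phases_per_instruction: int = 6,
              issue_width: int = 1) -> float:
    """Peak instruction rate in MIPS for a given critical path.

    The global clock in MHz is 1000 / critical_path_ns; one instruction
    cycle takes `phases_per_instruction` global clocks, and up to
    `issue_width` instructions complete per instruction cycle.
    """
    clock_mhz = 1000.0 / critical_path_ns
    return clock_mhz / phases_per_instruction * issue_width
```

With `critical_path_ns=90.0` the clock is about 11.1 MHz, so `peak_mips(90.0)` is about 1.85 and `peak_mips(90.0, issue_width=2)` is about 3.70, in agreement with the ratings above.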

Another limitation of the Splash 2 involves incorporating memory subsystems and
additional peripherals, features found in any workstation. These functions must be
emulated on the Splash 2 using the Sun host. While slow, this would allow different
schemes of virtual memory and caching techniques to be explored with the prototyped
architecture. This is important as different architectures will utilize resources uniquely.

In order for the Dex-II to execute real applications, the instruction set would need
to be enhanced to support additional instructions. The indicator for expandability in our
synthesized design is the percentage of CLBs used. The percentages used in each PE are
all under 50%. Xilinx predicts that the usage of CLBs can approach 75% before routing
becomes impossible to achieve using automated design software. Thus, enhancements on
this initial design are possible.

VHDL offers a powerful tool to realize architectures. Its high level language
features allow designers to quickly design and integrate new systems. This high level
language also introduces certain ambiguities. Just as high level language compilers may
not generate efﬁcient machine code, VHDL synthesizers have difﬁculties generating
efﬁcient logic. VHDL favors certain coding styles for different synthesis targets. In
order to efﬁciently support systems that are dependent on synthesis results, such as
Splash 2, these tools will need to be improved considerably.

This chapter has presented the results of mapping and implementing the Dex-II
architectural speciﬁcations on the Splash 2 system using a series of simulation and
synthesis tools. We now have an operational version of a modiﬁed Dex JR. and Dex-II
on Splash 2. Various aspects of the process pose limitations, including support provided
by the tools. Beyond the process, the implementation has provided timing and utilization
results that help characterize this prototyping platform. Real applications are possible;
however, limitations due to FPGA logic and Splash resources remain.

Table 2. Summary of synthesis results

    PE   Function            %CLB Used   Critical Time (ns)
    0    Crossbar Control     1          14.9
    1    Fetch               16          41.1
    2    Fetch               19          44.4
    3    Decode              42          54.8
    4    Decode              46          55.7
    5    Add/Sub Shifter     32          90.0
    6    Add/Sub Shifter     31          80.8
    7    Multiply            33          34.9
    8    Memory              43          53.2
    9    Memory              49          55.9
    10   Multiply            33          33.9
    11   Add/Sub Shifter     30          81.7
    12   Add/Sub Shifter     31          84.8
    13   Decode              45          44.9
    14   Decode              43          53.8
    15   Fetch               20          44
    16   Fetch               16          41.3

 

 

CHAPTER 5

Test Programs

The ultimate goal of a processor is to execute programs. We can test different
processors by running programs and comparing their execution times. In this study, we
will compare the VLIW version, Dex-II, versus the original RISC design, Dex JR. A
Fibonacci test program was implemented to verify operation of the Dex-II on the Splash
2 system. This program is hand compiled and loaded into memory using T2. Using
memory dumps of the register ﬁle and memory, we can ensure that the processor and
programs are executing correctly. A bubble sort program is also analyzed to better
characterize the performance of our processor with a larger program. This program
provides a better basis to compare the Dex-II versus the Dex JR. than the simple

Fibonacci program.

5.1 The Runtime Environment

The runtime environment was controlled through the T2 debugger. T2 provides
the interface to control all of the Splash 2 hardware. The interface allows us to download
the bitstream pattern generated by the synthesis tools for the Dex-II design, program and
read the memory for each PE, and control the system clock. This control allows us to
step through the execution of a program clock by clock and examine the results.

The comparison of the Dex-II, our VLIW machine, to the original Dex JR., our
RISC machine, will show the merits and shortcomings of the VLIW architecture. This
kind of comparison is necessary to evaluate the merits of architectural improvements. To
evaluate the performance of the Dex-II, we count the number of useful instructions it can
execute per clock cycle. Useful instructions exclude NOPs. The repeated branch and
condition code generation instruction, typically a SUB instruction, is counted as a single
instruction. This metric is called IPC (Instructions Per Clock cycle). The inverse of the
IPC gives us CPI (Clock cycles Per Instruction). Using the IPC or CPI, we can calculate
how many MIPS (Millions of Instructions Per Second) the machine can execute if we
know what the clock cycle time is. Although the MIPS rating can misrepresent the
performance of the machine, it is still a useful tool in evaluating architectures with similar
instruction sets. In order to improve the performance of the architecture, it is essential to
increase the MIPS the machine can execute. This can be done by decreasing the cycle
time, or increasing the IPC.
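The counting rule for useful instructions can be written down directly. In this sketch a program is a list of two-slot instruction words given as strings; NOPs are skipped, and a word whose two slots are identical (the duplicated SUB or branch) counts once. The representation, the function names, and treating every identical pair as a duplicate are assumptions made for illustration.

```python
from typing import List, Tuple

Word = Tuple[str, str]  # (top slot, bottom slot)


def count_useful(words: List[Word]) -> int:
    """Count useful instructions: NOPs are excluded and an instruction that
    is repeated in both slots of a word is counted as a single instruction."""
    total = 0
    for top, bottom in words:
        slots = [top] if top == bottom else [top, bottom]
        total += sum(1 for slot in slots if slot != "NOP")
    return total


def ipc(words: List[Word]) -> float:
    """Useful instructions per instruction cycle (one word per cycle)."""
    return count_useful(words) / len(words)


def cpi(words: List[Word]) -> float:
    """Cycles per useful instruction, the inverse of the IPC."""
    return len(words) / count_useful(words)


def mips(ipc_value: float, cycle_time_ns: float) -> float:
    """Millions of instructions per second for a given cycle time."""
    return ipc_value * 1000.0 / cycle_time_ns
```

Applied to the three words of Figure 3.10 (`SUB`/`SUB`, `NOP`/`NOP`, `BEQ`/`BEQ`), `count_useful` returns 2, giving an IPC of 2/3 and a CPI of 1.5.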

5 .2 Fibonacci Sequence

The Fibonacci sequence is computed by adding the two previous members in the
sequence to generate the current member of the sequence. The first two members of the
sequence are both one. The Fibonacci program, which computes the Fibonacci
sequence, is used to test the Dex-II. This program will initialize the register file by
loading initial values into R1 and R2. R3 is then computed by adding R1 and R2
together. The registers are then shifted by loading the value of R2 into R1 and R3 into
R2. The process is then repeated to compute the infinite series.
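
The register traffic described above can be modelled with a short Python sketch. It is not the Dex-II program itself: the function name is hypothetical, the mnemonics in the comments are taken from Figure 5.2, and the 16-bit wrap-around is an assumption based on the 16-bit width of the architecture.

```python
def fibonacci_registers(iterations: int) -> list[int]:
    """Return the value left in R3 after each pass through the loop."""
    r1, r2 = 1, 1                      # LDI R1,1 / LDI R2,1
    history = []
    for _ in range(iterations):
        r3 = (r1 + r2) & 0xFFFF        # ADD R3,R1,R2 (assumed 16-bit wrap)
        r1, r2 = r2, r3                # ST R1,R2 / ST R2,R3
        history.append(r3)
    return history


if __name__ == "__main__":
    print(fibonacci_registers(5))      # prints [2, 3, 5, 8, 13]
```

The simultaneous assignment `r1, r2 = r2, r3` mirrors the fact that both register moves can be issued in the same VLIW instruction word.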

5.2.1 Dex-II Veriﬁcation

The Fibonacci program exercises all of the hardware necessary for the Dex-II to
be implemented on Splash 2. It shows correct utilization of the PE memory, both read
and write, by successfully updating the register file and fetching instructions.
Synchronization of the four separate register files is achieved through the crossbar, as is
operand passing. The results of the Fibonacci calculation also verify the correct
mathematical execution in the adder unit.

Figure 5.1 shows an instruction-by-instruction register dump of the Fibonacci
program used to test the Dex-II. After every six ticks of the global clock, the first
five memory locations of PE3 and PE13 are dumped. These locations represent the register
files of the two processors. Note that we start examining the register file at clock tick 18, or the 3rd
instruction cycle. This is due to the first instruction filling the pipeline in the Fetch and
Decode stages. Figure 5.1 shows the results from the initial loading of the first two
registers, and through a single loop of the Fibonacci program. We can see how each
instruction affects the register file. This program execution clearly demonstrates the
functional correctness of the Dex-II.

5.2.2 Program Implementation and Performance

The Fibonacci sequence is a relatively easy algorithm to implement. The code for
the single RISC processor was used first to verify the synthesis steps. The VLIW version
was then executed to verify the additional features of the Dex-II. Both of these codes are
shown in Figure 5.2. The complete runtime results from both versions of the code are
located in Appendix B.

There are six useful instructions being performed in the RISC model in six clock
cycles, giving us an IPC (Instructions Per Clock) of 1. The VLIW, however, performs the
same six instructions in four clock cycles for an IPC rating of 1.5, or 2.775 MIPS. In the
VLIW case, if we could eliminate the need for either processor to perform the BRA
command, we could compact the code into three cycles for the theoretical IPC of 2. The
first two instructions, however, are only performed once each time the program is
executed. To get a better comparison, we must analyze the calculation loop.

Results read back from Splash 2                 Instruction being executed

T2:4> l
Added RB 3 at tick 18
Board 0 PE 3 Offset 0
000000: 0000 0001 0001 000A 000A                        LDI R1,1        LDI R2,1
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0001 0001 000A 000A

T2:5> l
Added RB 4 at tick 24
Board 0 PE 3 Offset 0
000000: 0000 0001 0001 0002 000A                Loop    ADD R3,R1,R2    NOP
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0001 0001 0002 000A

T2:6> l
Added RB 5 at tick 30
Board 0 PE 3 Offset 0
000000: 0000 0001 0002 0002 000A                        ST R1,R2        ST R2,R3
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0001 0002 0002 000A

T2:7> l
Added RB 6 at tick 36
Board 0 PE 3 Offset 0
000000: 0000 0001 0002 0002 000A                        BRA Loop        BRA Loop
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0001 0002 0002 000A

T2:8> l
Added RB 7 at tick 42
Board 0 PE 3 Offset 0
000000: 0000 0001 0002 0003 000A                Loop    ADD R3,R1,R2    NOP
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0001 0002 0003 000A


Figure 5.1. Fibonacci Execution Results

The VLIW still outperforms the RISC processor by reducing the number of cycles
needed in the calculation loop. Note in Figure 5.2 that while the RISC machine uses four
instructions, the VLIW is capable of producing the same results using only three. This
represents a 33% increase in the performance of our new machine compared to the RISC
by increasing the IPC to 1.33, or 2.47 MIPS. While lower than the overall comparison,
this represents a more realistic gain in performance.
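
The loop-only figures can be checked with a small Python sketch. The helper name is hypothetical; the cycle counts come from Figure 5.2 and the 1.85 MHz instruction rate is the Dex-II clock speed quoted in Section 6.1.

```python
def percent_gain(cycles_before: int, cycles_after: int) -> float:
    """Percentage performance gain when the same work takes fewer cycles."""
    return (cycles_before / cycles_after - 1.0) * 100.0


if __name__ == "__main__":
    useful_instructions = 4            # ADD, ST, ST, BRA (the paired BRA counts once)
    risc_cycles, vliw_cycles = 4, 3    # loop lengths read from Figure 5.2
    loop_ipc = useful_instructions / vliw_cycles
    print(f"gain = {percent_gain(risc_cycles, vliw_cycles):.0f}%")
    print(f"IPC  = {loop_ipc:.2f}")
    print(f"MIPS = {loop_ipc * 1.85:.2f}")
```

Run as a script, this prints a gain of 33%, an IPC of 1.33, and 2.47 MIPS, matching the loop figures quoted above.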

        RISC
                LDI R1,1
                LDI R2,1
        Loop    ADD R3,R1,R2
                ST R1,R2
                ST R2,R3
                BRA Loop

        VLIW
                LDI R1,1        LDI R2,1
        Loop    ADD R3,R1,R2    NOP
                ST R1,R2        ST R2,R3
                BRA Loop        BRA Loop

Figure 5.2. Two Fibonacci programs

To demonstrate the flexibility of the architecture, we also attempted to tailor the
architecture to our problem and get a faster design. Since the Fibonacci program only
required the add function to be implemented as an execution unit, we removed the excess
functionality, which improves the performance of the most critical PE. As a result, our
cycle time was reduced to let our system clock run at 12.7 MHz with a peak RISC MIPS
of 2.11 and a peak VLIW MIPS of 4.23. This represents a 14% increase in performance
of both processors.
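
The peak figures above follow from the six-tick instruction cycle. The sketch below assumes, as our reading of Sections 5.2.1 and 6.3, that one instruction cycle always takes six ticks of the system clock; the constant and function names are hypothetical.

```python
TICKS_PER_INSTRUCTION = 6   # assumption: six system-clock ticks per instruction cycle


def instruction_rate_mhz(system_clock_mhz: float,
                         ticks: int = TICKS_PER_INSTRUCTION) -> float:
    """Instruction cycles per microsecond for a given system clock in MHz."""
    return system_clock_mhz / ticks


if __name__ == "__main__":
    tailored = instruction_rate_mhz(12.7)
    baseline = 1.85                      # Dex-II instruction rate from Section 6.1
    print(f"peak RISC MIPS = {tailored * 1:.3f}")   # one instruction per cycle
    print(f"peak VLIW MIPS = {tailored * 2:.3f}")   # two instructions per cycle
    print(f"gain over 1.85 MHz = {(tailored / baseline - 1) * 100:.1f}%")
```

The computed peaks are about 2.117 and 4.233 MIPS, which the text quotes as 2.11 and 4.23; the gain of roughly 14% applies equally to both machines because only the clock changed.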

5.3 Bubble Sort

The bubble sort algorithm represents code that performs many comparisons,
memory accesses, and several conditional branches. This kind of code introduces many
NOP cycles that decrease the performance of the processor. The initial program is ﬁrst
introduced followed by an optimized program utilizing some of the techniques introduced
previously. The results will give a better understanding of the necessity of compiler
optimizations.

Figure 5.3 shows a non-optimized version of the bubble sort program for our
RISC machine and Figure 5.4 is the same program on our VLIW version of the machine.
As we can see, we initially reduce the total number of instructions from 23 to 19. Out of
these four, only two represent a significant gain in performance. The initialization
routine was shortened from 4 to 2 instructions. While this is significant, the program
spends a minimal amount of time in this routine. The loops, however, are where the bulk
of the program execution takes place. The results here were marginal. The only
improvements in performance are observed in Loop1 and the Swap routine, each reducing
the number of instructions by one. This represents only an 11% increase in performance.

         1              LDI R1, 0
         2              LDI R2, 0
         3              LDI R3, Length
         4              LDI R8, 1
         5      Loop1   SUB R4, R3, R1
         6              NOP
         7              BZ End
         8              LD R2, R1
         9              LD R5, (R1)
        10      Loop2   LD R6, R2
        11              SUB R4, R5, R6
        12              NOP
        13              BN Swap
        14      Check   ADD R2, R2, R8
        15              SUB R4, R2, R3
        16              NOP
        17              BN Loop2
        18              ADD R1, R1, R8
        19              BRA Loop1
        20      Swap    ST (R1), R6
        21              ST (R2), R5
        22              LD R5, R6
        23              BRA Check
        24      End

Figure 5.3. RISC bubble sort program

         1              LDI R1, 0               LDI R2, 0
         2              LDI R3, Length          LDI R8, 1
         3      Loop1   SUB R4, R3, R1          SUB R4, R3, R1
         4              NOP                     NOP
         5              BZ End                  BZ End
         6              LD R2, R1               LD R5, (R1)
         7      Loop2   LD R6, R2               NOP
         8              SUB R4, R5, R6          SUB R4, R5, R6
         9              NOP                     NOP
        10              BN Swap                 BN Swap
        11      Check   ADD R2, R2, R8          NOP
        12              SUB R4, R2, R3          SUB R4, R2, R3
        13              NOP                     NOP
        14              BN Loop2                BN Loop2
        15              ADD R1, R1, R8          NOP
        16              BRA Loop1               BRA Loop1
        17      Swap    ST (R1), R6             ST (R2), R5
        18              LD R5, R6               NOP
        19              BRA Check               BRA Check
        20      End

Figure 5.4. VLIW bubble sort program

To optimize the VLIW machine, we begin by trying to move instructions into the
NOP cycles. Instruction #6 can be moved up to instruction #4 since we either perform
the instruction or end the program. Instruction #15 can now be moved up to the NOP of
instruction #13; however, we must take into account what happens if the conditional
branch is true. In order to maintain the functionality, we must add compensation code at
instruction #9 to undo our speculative execution. We can also move instruction #11 up to
instruction #9; however, we would need to add two instructions to compensate in the
Swap routine. This would shorten one path and lengthen another. Since the number of
times the Swap routine is called is dependent on the data set, making this substitution
would not be beneficial without prior knowledge of the data.
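
For readers who want to trace the control flow, the following Python sketch mirrors the Loop1, Loop2, Check, and Swap structure of the non-optimized RISC program in Figure 5.3. It is an interpretation, not the author's code: it assumes that `LD R6, R2` reads memory at the address held in R2 (as `LD R5, (R1)` does), that BZ and BN branch on a zero and a negative SUB result respectively, and it uses unbounded Python integers rather than 16-bit registers. The function and variable names are hypothetical.

```python
def exchange_sort(mem: list[int]) -> list[int]:
    """Sort mem in place following the branch structure of Figure 5.3."""
    length = len(mem)                  # LDI R3, Length
    i = 0                              # LDI R1, 0
    while length - i != 0:             # Loop1: SUB R4,R3,R1 / BZ End
        j = i                          # LD R2, R1
        held = mem[i]                  # LD R5, (R1)
        while True:
            other = mem[j]             # Loop2: LD R6, R2 (assumed memory read)
            if held - other < 0:       # SUB R4,R5,R6 / BN Swap
                mem[i] = other         # Swap: ST (R1), R6
                mem[j] = held          #       ST (R2), R5
                held = other           #       LD R5, R6
            j += 1                     # Check: ADD R2,R2,R8
            if j - length >= 0:        # SUB R4,R2,R3 / BN Loop2 not taken
                break
        i += 1                         # ADD R1,R1,R8 / BRA Loop1
    return mem


if __name__ == "__main__":
    print(exchange_sort([3, 1, 2]))    # prints [3, 2, 1]
```

Under these assumptions the program leaves the data in descending order, and the number of passes through the Swap branch depends entirely on the input, which is why moving code into that path cannot be judged without knowledge of the data.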

The optimized version of the VLIW code shown in Figure 5.6 manages to shorten
Loop1 to 3 instructions and Loop2 from 10 to 9 instructions. To be fair, we perform the
same optimizations on our RISC processor. This time, we can only move instruction #8
to instruction #6. The RISC code is shown in Figure 5.5. The final comparison shows a
total of 22 instructions for the RISC and 17 for the VLIW. Code size of the VLIW was
reduced to 74% of the RISC code. If we consider the performance of the loops, we
managed to reduce the total number of instructions from 18 down to 15.

The optimized VLIW code executes 21 useful instructions in 17 clock cycles for an
IPC of 1.23. The RISC performs 20 useful instructions in 22 clock cycles for an IPC of
0.90. In this case, the high number of conditional branches with the lack of computation
between sections would not benefit from the ability to schedule operations into the
redundant control instructions. This penalty for conditional branches occurs whenever a
program executes several branches in succession. The lack of useful instructions between
each section makes moving code across branches difficult to impossible.

It becomes evident that smart compilers are necessary if we want to see an
effective VLIW processor utilized in the future. While smart compilers are needed for
scheduling instructions in any superscalar processor, methods of scanning larger blocks
of instructions during run time can be used to reduce the impact of bad compilers.
However, run-time methods will require extra cycle time that may not be practical in a
super-pipelined, superscalar architecture. Static scheduling will still be faster ultimately.

The programs used to characterize our architecture are simple examples. These
programs allow us to visualize the effectiveness of our instruction set. They also allow us
to see the shortcomings in how we deal with the various hazards. The bubble sort
program, for instance, clearly shows that we need a better way to handle control
instructions. The Splash 2 also does not support functions such as ﬂoating point
arithmetic. To execute real applications and benchmarks, these functions would need to
be emulated on the Sun host, or additional hardware must be added to the Splash 2.


 

 

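         1              LDI R1, 0
         2              LDI R2, 0
         3              LDI R3, Length
         4              LDI R8, 1
         5      Loop1   SUB R4, R3, R1
         6              LD R2, R1
         7              BZ End
         8              LD R5, (R1)
         9      Loop2   LD R6, R2
        10              SUB R4, R5, R6
        11              NOP
        12              BN Swap
        13      Check   ADD R2, R2, R8
        14              SUB R4, R2, R3
        15              NOP
        16              BN Loop2
        17              ADD R1, R1, R8
        18              BRA Loop1
        19      Swap    ST (R1), R6
        20              ST (R2), R5
        21              LD R5, R6
        22              BRA Check
        23      End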

 

Figure 5.5. Optimized RISC bubble sort program

 

 

         1              LDI R1, 0               LDI R2, 0
         2              LDI R3, Length          LDI R8, 1
         3      Loop1   SUB R4, R3, R1          SUB R4, R3, R1
         4              LD R2, R1               LD R5, (R1)
         5              BZ End                  BZ End
         6      Loop2   LD R6, R2               NOP
         7              SUB R4, R5, R6          SUB R4, R5, R6
         8              SUB R1, R1, R8          NOP
         9              BN Swap                 BN Swap
        10      Check   ADD R2, R2, R8          NOP
        11              SUB R4, R2, R3          SUB R4, R2, R3
        12              ADD R1, R1, R8          NOP
        13              BN Loop2                BN Loop2
        14              BRA Loop1               BRA Loop1
        15      Swap    ST (R1), R6             ST (R2), R5
        16              LD R5, R6               NOP
        17              BRA Check               BRA Check
                End

 

Figure 5.6. Optimized VLIW bubble sort program

 

 

CHAPTER 6

Conclusions and Future Investigations

This chapter summarizes our findings and evaluates the success of our
implementation of an ISP on the Splash 2. The possibilities of using Splash 2 as a tool in
studying computer architectures and compilers are also presented. Lastly, proposals for
future investigations and projects are offered as possibilities to extend the study of
computer architectures on Splash 2.

6.1 The Dex-II Evaluation

The successful implementation of our design did achieve the goal of
implementing a RISC and a VLIW architecture. The clock speed of 1.85 MHz allows the
Dex-II to execute large programs in a reasonably short period of time. This allows for
characterizing architectures and architectural features of real applications which would
otherwise be prohibitive in simulation. Even so, the performance is too slow to be useful
in practical applications.

The primary cause of the poor cycle time is that the architecture suffers
from an unbalanced pipeline. Empty slots evident in phases of each unit except the
Decode stage signify a poorly balanced pipeline. This was the result of waiting for
operands to be passed around on an inadequate interconnect array. The pipeline may be
better balanced by adding functionality in the Fetch and Execute stages, decreasing the
number of instructions needed to complete tasks. Additional functionality, however,
starts to drift away from the RISC paradigm, and other problems may arise.

Since the design is based on VHDL and automated synthesis for FPGAs, changes
in the architecture can be quickly implemented as seen with the Fibonacci program.
Additional hardware features can be added with relative ease. Support for register
windowing can be added with the addition of some opcodes. Subroutines and jump
commands can also be implemented onto the Fetch stages with relative ease. Specialized
execution units can be programmed in an attempt to increase the performance of the
processor for speciﬁc applications. Additional hardware to measure performance could
also be added to monitor processor usage. This would provide invaluable data to support
trace scheduling techniques.

6.2 VLIW vs. RISC

The Dex-II implements two RISC processors side by side on the Splash 2. This
design permits two RISC instructions to execute in a single clock cycle. Although there
are some data and control hazards that must be avoided, the VLIW architecture is
successful in increasing the IPC. For the VLIW architecture, we achieve an IPC of 1.50
for the Fibonacci program and 1.23 for the Bubble Sort program. The RISC versions of
the same code achieved an IPC of 1.00 and 0.90, respectively.

The prototype processor indicated problems with our control scheme. The
results from our program analysis focus attention on the double branch instruction
which was required to keep both processors in synchronization. This problem did not
exist for the single RISC processor model. Future designs with this architecture must
successfully implement a more unified control path to control the multiple data paths.

6.3 Splash 2 Evaluation

Computers were originally designed as machines that could solve many problems
by using different programs. With FPGAs, programmers and designers now have the
ability to alter the machine to ﬁt the problem. The success of these arrays will be
dependent upon the resources available to each FPGA. While logic emulators built with
arrays of FPGAs can be used to prototype circuits, it will be the resources such as the
memory and the crossbar of the Splash 2 that will offer a new level of ﬂexibility to
prototype more complex architectures.

The bottleneck for successful design on the Splash 2 was the lack of
communication between chips to implement bus structures. While we were able to
achieve a respectable clock rate of 11 MHz from the FPGA, the six-cycle division forced
by communication restrictions effectively reduced our instruction rate to a little under 2 MHz.
In order for faster speeds to be obtained, the level of interconnects needs to be enhanced
through a crossbar with greater functionality as well as FPGAs with greater I/O
capabilities.

The lack of ﬂoating point processing is also a factor that will limit the usefulness
of FPGA-based arrays. Floating point processing is required in many applications in both
science and engineering. These types of computations are usually highly repetitive and
can be structured in such a way to take advantage of the reconﬁgurable architecture.

Although the use of synthesis tools considerably shortened the design cycle, these
tools were not perfect. As the efficiency of the synthesis tools and number of FPGA
libraries increase, these synthesis tools will improve the overall performance of
reconfigurable computing systems. The design generated by the synthesis
tools utilized less than 50% of the CLB resources. This suggests that our design is close
to the optimal speed, as the routing tools had plenty of space to route signals within the
FPGA. As more functionality is added to our designs, these tools become even more
significant because they directly impact the speed of the design.

6.4 Future Investigations

Although the practical uses are limited at this time, the Dex-II does offer
substantial research opportunities. FPGA-based systems, for example, can provide an
invaluable tool for compiler testing. Different architectures can be simulated on the
Splash 2 to execute compiled codes for various scheduling techniques. As reconﬁgurable
processors become a reality, new compiler technology can be rapidly adapted. Synthesis
tools will also play a large role in the success of reconﬁgurable processors. Compilers no
longer need to be written and optimized for a machine, but the machine can now be
optimized to support the compiler.

Speculative execution is another method to decrease execution times. Guarded
execution and complicated shadow registers can be used to let speculative instructions
run their course. This method is not very feasible in our design due to the restrictions of
the board itself, however each board as a whole may be conﬁgured as a shadow system.
A system can be implemented where the host is used to synchronize several Splash
boards executing a program. When a program branch is encountered, both routes are sent
off to two boards conﬁgured as a processor. The host needs to maintain the actual state
of the machine but can orchestrate hundreds of these processors.

Threaded programming is also an emerging software methodology that would
support a co-processor approach. Separate threads can be run on independent boards
without the communication overhead to track the actions of other threads. Memory
would be managed by the host and treated as a cached memory system.

The Dex-II is an aggressive utilization of the resources available from the Splash
2. The architecture is far from perfect, however. The design is limited to a 16-bit
architecture until even larger chips make it possible to compact larger designs and
provide more communications. However, the ability to change designs so quickly does
make the Splash 2 and general FPGA-based arrays very attractive. A key factor in RISC
development was the shortened turnaround time for each new architecture. FPGA-based
systems would allow new architectures to be prototyped quickly. This provides a
powerful tool for computer and compiler designers to test and implement new designs
and ideas.

APPENDICES

Appendix A

Synthesis Code

A1 Control1.vhd (PE1)

architecture Fetch_left of Xilinx_Processing_Part is
    signal PC: Bit_Vector (17 downto 0);
    signal loadaddress: Bit_Vector (15 downto 0);
    signal Ix: Bit_Vector (31 downto 0);
    signal Id: Bit_Vector (31 downto 0);
    signal Ifetch: Bit_Vector (31 downto 0);
    signal increment_pc: Bit;
    signal branchbit: bit := '0';
    signal reset: bit := '0';
    signal local_clk: integer range 0 to 6;
    signal one: Bit_Vector (17 downto 0);
    signal CC: Bit_Vector (3 downto 0);
    signal Right: Bit_Vector (31 downto 0);
    signal mdr: Bit_Vector (15 downto 0);
    signal Xbarin, Xbarout: Bit_Vector (35 downto 0);
    signal Xbarenable: Bit_Vector (4 downto 0);

begin

    RIGHTIN: for i in 0 to 15 GENERATE
        Pad_Input(XP_right(i), right(i));
    END GENERATE RIGHTIN;

    RIGHTOUT: for i in 16 to 31 GENERATE
        Pad_Output(XP_right(i), right(i));
    END GENERATE RIGHTOUT;

    XP_Mem_RD_L <= '0';
    increment_pc <= '1';
    one <= "000000000000000001";

    process
    begin
        wait until XP_Clk'Event and XP_Clk = '1';
        -- local_clk counts the clock phases of one instruction cycle and wraps after phase 5
        local_clk <= local_clk + 1;
        if local_clk = 5 then
            local_clk <= 0;
        end if;

        Pad_Input (XP_Mem_D, mdr);
        Pad_Output (XP_Mem_A, PC);

        -- phase 1: latch the upper half of the instruction word from memory
        if local_clk = 1 then
            Ifetch(31 downto 16) <= mdr;
            right(31 downto 16) <= mdr;
            Xbarenable <= "01111";
        end if;

        -- phase 2: take the lower half of the instruction word from the right neighbor
        if local_clk = 2 then
            Ifetch(15 downto 0) <= right(15 downto 0);
        end if;

        -- phase 3: advance the PC; opcode bits "01" overwrite it with bits 25..10
        if local_clk = 3 then
            PC <= PC + one;
            if (Ifetch(31 downto 30) = "01") then
                PC(15 downto 0) <= Ifetch(25 downto 10);
            end if;
        end if;
    end process;
    XP_Mem_WR_L <= '1';
    XP_HSO <= 'Z';
    XP_GOR_Result <= '0';
    XP_GOR_Valid <= '0';
    XP_Int <= '0';
end Fetch_left;

A2 Control2.vhd (PE2)

architecture Fetch_right of Xilinx_Processing_Part is
    signal PC: Bit_Vector (17 downto 0);
    signal one: Bit_Vector (17 downto 0);
    signal Ifetch: Bit_Vector (31 downto 0);
    signal reset: bit := '0';
    signal local_clk: integer range 0 to 6;
    signal CC: Bit_Vector (3 downto 0);
    signal Left: Bit_Vector (31 downto 0);
    signal mdr: Bit_Vector (15 downto 0);
    signal Xbarout, Xbarin: Bit_Vector (35 downto 0);
    signal Xbarenable: Bit_Vector (4 downto 0);

begin

    LEFTIN: for i in 16 to 31 GENERATE
        Pad_Input(XP_left(i), Left(i));
    END GENERATE LEFTIN;

    LEFTOUT: for i in 0 to 15 GENERATE
        Pad_Output(XP_left(i), left(i));
    END GENERATE LEFTOUT;

    XP_Mem_RD_L <= '0';
    one <= "000000000000000001";
    XP_Xbar_EN_L <= Xbarenable;

    process
    begin
        wait until XP_Clk'Event and XP_Clk = '1';

        Pad_Output(XP_Right(31 downto 0), Ifetch);
        local_clk <= local_clk + 1;
        if local_clk = 5 then
            local_clk <= 0;
        end if;

        Pad_Input (XP_Mem_D, mdr);
        Pad_Output (XP_Mem_A, PC);
        Xbarenable <= "11111";

        if local_clk = 0 then
            XP_LED <= '1';
        end if;

        if local_clk = 1 then
            Ifetch(15 downto 0) <= mdr;
            left(15 downto 0) <= mdr;
            Xbarenable <= "01111";
            XP_LED <= '0';
        end if;

        if local_clk = 2 then
            Ifetch(31 downto 16) <= left(31 downto 16);
            CC <= Xbarin(35 downto 32);
            XP_LED <= '1';
        end if;

        if local_clk = 3 then
            PC <= PC + one;
            if Ifetch(31 downto 30) = "01" then
                -- branch when the condition field is "000" or matches the condition codes
                if ((Ifetch(29 downto 27) = "000") or (CC(3 downto 1) = Ifetch(29 downto 27))) then
                    PC(15 downto 0) <= Ifetch(25 downto 10);
                end if;
            end if;
            XP_LED <= '0';
        end if;

    end process;

    XP_Mem_WR_L <= '1';
    XP_HSO <= 'Z';
    XP_GOR_Result <= '0';
    XP_GOR_Valid <= '0';
    XP_Int <= '0';

end Fetch_right;

A3 Decode1.vhd (PE3)

architecture Decode_left of Xilinx_Processing_Part is

    signal Itemp: Bit_Vector (35 downto 0);
    signal Ix: Bit_Vector (35 downto 0);
    signal Id: Bit_Vector (35 downto 0);
    signal Ifetch: Bit_Vector (35 downto 0);
    signal reset: bit;
    signal local_clk: integer range 0 to 6;
    signal Left: Bit_Vector (35 downto 0);
    signal Right: Bit_Vector (35 downto 0);
    signal Xbar_en: Bit_Vector (4 downto 0); -- enable bit for Xbar (low)
    signal zero: Bit_Vector (4 downto 0);
    signal Rdrl: Bit_Vector (15 downto 0);
    signal Mdr: Bit_Vector (15 downto 0);
    signal RarL: Bit_Vector (17 downto 0);
    signal Xbarin, Xbarout: Bit_Vector (35 downto 0);
    signal OpA: Bit_Vector (15 downto 0);
    signal temp: Bit_Vector (15 downto 0);
    signal result: Bit_Vector (15 downto 0);
    signal Muxin0, Muxin1, Muxin2, Muxin3, Muxresult: Bit_Vector (15 downto 0);

begin

    Select0 : mux4_1H port map (Muxin0(0), Muxin1(0), Muxin2(0), Muxin3(0),
        Ix(30), Ix(31), Muxresult(0));
    Select1 : mux4_1H port map (Muxin0(1), Muxin1(1), Muxin2(1), Muxin3(1),
        Ix(30), Ix(31), Muxresult(1));
    Select2 : mux4_1H port map (Muxin0(2), Muxin1(2), Muxin2(2), Muxin3(2),
        Ix(30), Ix(31), Muxresult(2));
    Select3 : mux4_1H port map (Muxin0(3), Muxin1(3), Muxin2(3), Muxin3(3),
        Ix(30), Ix(31), Muxresult(3));
    Select4 : mux4_1H port map (Muxin0(4), Muxin1(4), Muxin2(4), Muxin3(4),
        Ix(30), Ix(31), Muxresult(4));
    Select5 : mux4_1H port map (Muxin0(5), Muxin1(5), Muxin2(5), Muxin3(5),
        Ix(30), Ix(31), Muxresult(5));
    Select6 : mux4_1H port map (Muxin0(6), Muxin1(6), Muxin2(6), Muxin3(6),
        Ix(30), Ix(31), Muxresult(6));
    Select7 : mux4_1H port map (Muxin0(7), Muxin1(7), Muxin2(7), Muxin3(7),
        Ix(30), Ix(31), Muxresult(7));
    Select8 : mux4_1H port map (Muxin0(8), Muxin1(8), Muxin2(8), Muxin3(8),
        Ix(30), Ix(31), Muxresult(8));
    Select9 : mux4_1H port map (Muxin0(9), Muxin1(9), Muxin2(9), Muxin3(9),
        Ix(30), Ix(31), Muxresult(9));
    Select10 : mux4_1H port map (Muxin0(10), Muxin1(10), Muxin2(10), Muxin3(10),
        Ix(30), Ix(31), Muxresult(10));
    Select11 : mux4_1H port map (Muxin0(11), Muxin1(11), Muxin2(11), Muxin3(11),
        Ix(30), Ix(31), Muxresult(11));
    Select12 : mux4_1H port map (Muxin0(12), Muxin1(12), Muxin2(12), Muxin3(12),
        Ix(30), Ix(31), Muxresult(12));
    Select13 : mux4_1H port map (Muxin0(13), Muxin1(13), Muxin2(13), Muxin3(13),
        Ix(30), Ix(31), Muxresult(13));
    Select14 : mux4_1H port map (Muxin0(14), Muxin1(14), Muxin2(14), Muxin3(14),
        Ix(30), Ix(31), Muxresult(14));
    Select15 : mux4_1H port map (Muxin0(15), Muxin1(15), Muxin2(15), Muxin3(15),
        Ix(30), Ix(31), Muxresult(15));

    --Xbarin is from the xbar where xbarout is the value being sent to the xbar...
    Pad_Xbar (XP_Xbar, Xbarout, Xbarin, Xbar_en);
    XP_Xbar_EN_L <= Xbar_en;
    Xbarout (31 downto 16) <= Muxresult;
    Xbarout (15 downto 0) <= Muxresult;

    Muxin1 <= OpA;

    zero <= "00000";

    Pad_Output (XP_right, Ifetch);

    process
    begin
        wait until XP_Clk'Event and XP_Clk = '1';

        Pad_Input (XP_left, Ifetch);
        Pad_Output (XP_Mem_A, RarL);
        Pad_InOut (XP_Mem_D, Rdrl, Mdr, '0');
        XP_Mem_RD_L <= '1';
        XP_Mem_WR_L <= '1';
        temp <= Xbarin (15 downto 0); --Syncing bottom

        local_clk <= local_clk + 1;
        if local_clk = 5 then
            local_clk <= 0;
        end if;

        if local_clk = 0 then
            Pad_Output (XP_Mem_A (4 downto 0), Id (19 downto 15));
            XP_Mem_RD_L <= '0';
            Xbar_en <= "10000";
        end if;

        if local_clk = 1 then
            XP_Mem_RD_L <= '0';
            Pad_InOut (XP_Mem_D, Rdrl, OpA, '0');
            Muxin2 <= Xbarin(31 downto 16);
            Xbar_en <= "10000";
        end if;

        if local_clk = 2 then
            -- get data from Xbar, save the bottom half to write next cycle.
            Muxin3 <= Xbarin(31 downto 16);
            Muxin0 <= Xbarin(15 downto 0);
            Xbar_en <= "11100";
        end if;

        if local_clk = 3 then
            if Ix(9 downto 5) = Id(19 downto 15) then OpA <= Muxresult;
            end if;
            if Ix(4 downto 0) = Id(19 downto 15) then OpA <= Xbarin(15 downto 0);
            end if;
            Ix (31 downto 30) <= "01";
            temp <= Xbarin (15 downto 0); --Syncing bottom
            Pad_Output (XP_Mem_A (4 downto 0), Ix(9 downto 5));
            if (Ix (9 downto 5) /= zero) then
                Pad_InOut (XP_Mem_D, Muxresult, Mdr, '1');
                XP_Mem_WR_L <= '0';
            end if;
            Xbar_en <= "11111";
        end if;

        if local_clk = 4 then
            -- need to write bottom Muxresult
            Xbar_en <= "11111";
            Pad_Input (XP_left, Id);
            Id <= Ifetch;
            Itemp <= Id;
            if (Ix (4 downto 0) /= zero) then
                Pad_Output (XP_Mem_A (4 downto 0), Ix(4 downto 0));
                Pad_InOut (XP_Mem_D, temp, Mdr, '1');
                XP_Mem_WR_L <= '0';
            end if;
        end if;

        if local_clk = 5 then
            Ix <= Itemp;
        end if;
    end process;
    XP_HSO <= 'Z';
    XP_GOR_result <= '0';
    XP_GOR_Valid <= '0';
    XP_Int <= '0';
    XP_LED <= '1';

end Decode_left;

A4 Decode2.vhd (PE4)

architecture Decode_right of Xilinx_Processing_Part is

    signal Ix: Bit_Vector (35 downto 0);
    signal Id: Bit_Vector (35 downto 0);
    signal Ifetch: Bit_Vector (35 downto 0);
    signal reset: bit;
    signal local_clk: integer range 0 to 6;
    signal Left: Bit_Vector (35 downto 0);
    signal Right: Bit_Vector (35 downto 0);
    signal WriteEnable: Bit;
    signal Xbar_en: Bit_Vector (4 downto 0); -- enable bit for Xbar (low)
    signal zero: Bit_Vector (4 downto 0);
    signal Mdr: Bit_Vector (15 downto 0);
    signal Rdrl: Bit_Vector (15 downto 0);
    signal RarL: Bit_Vector (17 downto 0);
    signal Xbarin, Xbarout: Bit_Vector (35 downto 0);
    signal OpB: Bit_Vector (15 downto 0);
    signal temp: Bit_Vector (15 downto 0);
    signal Muxin0, Muxin1, Muxin2, Muxin3, Muxresult: Bit_Vector (15 downto 0);

begin

    Select0 : mux4_1H port map (Muxin0(0), Muxin1(0), Muxin2(0), Muxin3(0),
        Ix(30), Ix(31), Muxresult(0));
    Select1 : mux4_1H port map (Muxin0(1), Muxin1(1), Muxin2(1), Muxin3(1),
        Ix(30), Ix(31), Muxresult(1));
    Select2 : mux4_1H port map (Muxin0(2), Muxin1(2), Muxin2(2), Muxin3(2),
        Ix(30), Ix(31), Muxresult(2));
    Select3 : mux4_1H port map (Muxin0(3), Muxin1(3), Muxin2(3), Muxin3(3),
        Ix(30), Ix(31), Muxresult(3));
    Select4 : mux4_1H port map (Muxin0(4), Muxin1(4), Muxin2(4), Muxin3(4),
        Ix(30), Ix(31), Muxresult(4));
    Select5 : mux4_1H port map (Muxin0(5), Muxin1(5), Muxin2(5), Muxin3(5),
        Ix(30), Ix(31), Muxresult(5));
    Select6 : mux4_1H port map (Muxin0(6), Muxin1(6), Muxin2(6), Muxin3(6),
        Ix(30), Ix(31), Muxresult(6));
    Select7 : mux4_1H port map (Muxin0(7), Muxin1(7), Muxin2(7), Muxin3(7),
        Ix(30), Ix(31), Muxresult(7));
    Select8 : mux4_1H port map (Muxin0(8), Muxin1(8), Muxin2(8), Muxin3(8),
        Ix(30), Ix(31), Muxresult(8));
    Select9 : mux4_1H port map (Muxin0(9), Muxin1(9), Muxin2(9), Muxin3(9),
        Ix(30), Ix(31), Muxresult(9));
    Select10 : mux4_1H port map (Muxin0(10), Muxin1(10), Muxin2(10), Muxin3(10),
        Ix(30), Ix(31), Muxresult(10));
    Select11 : mux4_1H port map (Muxin0(11), Muxin1(11), Muxin2(11), Muxin3(11),
        Ix(30), Ix(31), Muxresult(11));
    Select12 : mux4_1H port map (Muxin0(12), Muxin1(12), Muxin2(12), Muxin3(12),
        Ix(30), Ix(31), Muxresult(12));
    Select13 : mux4_1H port map (Muxin0(13), Muxin1(13), Muxin2(13), Muxin3(13),
        Ix(30), Ix(31), Muxresult(13));
    Select14 : mux4_1H port map (Muxin0(14), Muxin1(14), Muxin2(14), Muxin3(14),
        Ix(30), Ix(31), Muxresult(14));
    Select15 : mux4_1H port map (Muxin0(15), Muxin1(15), Muxin2(15), Muxin3(15),
        Ix(30), Ix(31), Muxresult(15));

    Pad_Xbar (XP_Xbar, Xbarout, Xbarin, Xbar_en);
    XP_Xbar_EN_L <= Xbar_en;
    Xbarout(31 downto 16) <= Muxresult;
    Xbarout(15 downto 0) <= Muxresult;

    Muxin1 <= OpB;

    zero <= "00000";
    RarL <= "000000000000000000";
    Rdrl <= "0000000000000000";

    Pad_Output (XP_right, Id);

    process
    begin
        wait until XP_Clk'Event and XP_Clk = '1';

        Pad_Input (XP_left, Ifetch);
        Pad_Output (XP_Mem_A, RarL);
        Pad_InOut (XP_Mem_D, Rdrl, Mdr, '0');
        XP_Mem_RD_L <= '1';
        XP_Mem_WR_L <= '1';
        temp <= Xbarin (15 downto 0);
        Ix <= Ix;

        local_clk <= local_clk + 1;
        if local_clk = 5 then
            local_clk <= 0;
        end if;

        if local_clk = 0 then
            Pad_Output (XP_Mem_A (4 downto 0), Id (14 downto 10));
            XP_Mem_RD_L <= '0';
            Xbar_en <= "10000";
        end if;

        if local_clk = 1 then
            XP_Mem_RD_L <= '0';
            Pad_InOut (XP_Mem_D, Rdrl, OpB, '0');
            Muxin2 <= Xbarin(31 downto 16);
            Xbar_en <= "10000";
        end if;

        if local_clk = 2 then
            -- get data from Xbar, save the bottom half to write next cycle.
            Muxin3 <= Xbarin(15 downto 0);
            Muxin0 <= Xbarin(31 downto 16);
            Xbar_en <= "11100";
        end if;

        if local_clk = 3 then
            -- need to write secondary Muxresult
            if Ix(9 downto 5) = Id(14 downto 10) then OpB <= Muxresult;
            end if;
            if Ix(4 downto 0) = Id(14 downto 10) then OpB <= Xbarin(15 downto 0);
            end if;
            Ix(31 downto 30) <= "01";
            temp <= Xbarin (15 downto 0);
            XP_Mem_WR_L <= '1';
            if (Ix (9 downto 5) /= zero) then
                Pad_Output (XP_Mem_A (4 downto 0), Ix(9 downto 5));
                Pad_InOut (XP_Mem_D, Muxresult, Mdr, '1');
                XP_Mem_WR_L <= '0';
            end if;
            Xbar_en <= "11111";
        end if;

        if local_clk = 4 then
            -- need to write secondary Muxresult
            Xbar_en <= "11111";
            XP_Mem_WR_L <= '1';
            if (Ix (4 downto 0) /= zero) then
                Pad_Output (XP_Mem_A (4 downto 0), Ix(4 downto 0));
                Pad_InOut (XP_Mem_D, temp, Mdr, '1');
                XP_Mem_WR_L <= '0';
            end if;
        end if;

        if local_clk = 5 then
            Pad_Input (XP_left, Id);
            Ix <= Id;
        end if;
    end process;
    XP_HSO <= 'Z';
    XP_GOR_result <= '0';
    XP_GOR_Valid <= '0';
    XP_Int <= '0';
    XP_LED <= '1';

end Decode_right;

A5 Executel.vhd (PES)

78

architecture Executel of Xilinx_Processing_Part is

signal Id: Bit_Vector (35 downto 0);
signal Ix: Bit_Vector (35 downto 0);
signal reset:bit := '0';

signal local_clk: integer range 0 to 6;

signal Left: Bit_Vector (35 downto 0);
signal Right: Bit_Vector (31 downto 0);

signal OpA: Bit_Vector (15 downto 0);

signal OpB: Bit_Vector (15 downto 0);

signal Muxin1, Muxin2, Muxin3, Muxin4: Bit_Vector (15 downto 0);
signal Xbarin, Xbarout: Bit_Vector(35 downto 0);

signal Xbar_en: Bit_Vector (4 downto 0);

-- Muxin2 is not used in this pe and should be removed later...

signal Result: Bit_Vector (15 downto 0);
signal zero: Bit_Vector (15 downto 0);
signal CondCode : Bit_Vector (3 downto 0);

begin

-- Adder unit
addsub : adsu16h port map (OpA, OpB, Ix(27), Muxin1, CondCode(3));

-- Channel one of the results from the adder or shifter onto the xbar
-- depending on opcode selected and desired function.

Select0 : mux4_1H port map (Muxin1(0), Muxin1(0), Muxin3(0), Muxin4(0),
    Ix(27), Ix(28), Result(0));

Select1 : mux4_1H port map (Muxin1(1), Muxin1(1), Muxin3(1), Muxin4(1),
    Ix(27), Ix(28), Result(1));

Select2 : mux4_1H port map (Muxin1(2), Muxin1(2), Muxin3(2), Muxin4(2),
    Ix(27), Ix(28), Result(2));

Select3 : mux4_1H port map (Muxin1(3), Muxin1(3), Muxin3(3), Muxin4(3),
    Ix(27), Ix(28), Result(3));

Select4 : mux4_1H port map (Muxin1(4), Muxin1(4), Muxin3(4), Muxin4(4),
    Ix(27), Ix(28), Result(4));

Select5 : mux4_1H port map (Muxin1(5), Muxin1(5), Muxin3(5), Muxin4(5),
    Ix(27), Ix(28), Result(5));

Select6 : mux4_1H port map (Muxin1(6), Muxin1(6), Muxin3(6), Muxin4(6),
    Ix(27), Ix(28), Result(6));

Select7 : mux4_1H port map (Muxin1(7), Muxin1(7), Muxin3(7), Muxin4(7),
    Ix(27), Ix(28), Result(7));

Select8 : mux4_1H port map (Muxin1(8), Muxin1(8), Muxin3(8), Muxin4(8),
    Ix(27), Ix(28), Result(8));

Select9 : mux4_1H port map (Muxin1(9), Muxin1(9), Muxin3(9), Muxin4(9),
    Ix(27), Ix(28), Result(9));

Select10 : mux4_1H port map (Muxin1(10), Muxin1(10), Muxin3(10), Muxin4(10),
    Ix(27), Ix(28), Result(10));

Select11 : mux4_1H port map (Muxin1(11), Muxin1(11), Muxin3(11), Muxin4(11),
    Ix(27), Ix(28), Result(11));

Select12 : mux4_1H port map (Muxin1(12), Muxin1(12), Muxin3(12), Muxin4(12),
    Ix(27), Ix(28), Result(12));

Select13 : mux4_1H port map (Muxin1(13), Muxin1(13), Muxin3(13), Muxin4(13),
    Ix(27), Ix(28), Result(13));

Select14 : mux4_1H port map (Muxin1(14), Muxin1(14), Muxin3(14), Muxin4(14),
    Ix(27), Ix(28), Result(14));

Select15 : mux4_1H port map (Muxin1(15), Muxin1(15), Muxin3(15), Muxin4(15),
    Ix(27), Ix(28), Result(15));

zero <= "0000000000000000";

Pad_Xbar (XP_Xbar, Xbarout, Xbarin, Xbar_en);
XP_Xbar_EN_L <= Xbar_en;

Xbarout (31 downto 16) <= Result;
Xbarout (15 downto 0) <= Result;
Xbarout (35 downto 32) <= CondCode;
Pad_Output (XP_right, Id);

process
begin

wait until XP_Clk'Event and XP_Clk = '1';
local_clk <= local_clk + 1;
if local_clk = 5 then
local_clk <= 0;
end if;

Pad_Input (XP_left, Left);

if local_clk = 0 then
-- Compute and broadcast results.
-- >>>>>>> Shift left <<<<<<<<
Muxin3 (15 downto 1) <= OpA (14 downto 0);
Muxin3 (0) <= '0';

-- >>>>>>>> Shift right sign extension <<<<<<<<<<<
Muxin4 (14 downto 0) <= OpA (15 downto 1);
Muxin4 (15) <= OpA(15);

end if;

if local_clk = 1 then
Pad_Input (XP_left, Id);
if Result = zero then CondCode(1) <= '1';
end if;
if Result (15) = '1' then CondCode(2) <= '1';
end if;

end if;

if local_clk = 2 then
end if;

if local_clk = 3 then
Xbar_en <= "10000";
end if;

if local_clk = 4 then
-- Latch new ops from the decode stage.
OpA <= Xbarin (31 downto 16);
OpB <= Xbarin (15 downto 0);
Xbar_en <= "11111";
end if;
if local_clk = 5 then
Ix <= Id;
end if;
end process;
-- XP_Left <= TriState (XP_Left);
-- XP_Right <= TriState (XP_Right);
XP_Mem_A <= TriState (XP_Mem_A);
XP_Mem_D <= TriState (XP_Mem_D);
XP_Mem_RD_L <= '1';
XP_Mem_WR_L <= '1';
XP_HSO <= 'Z';
XP_GOR_Result <= '0';
XP_GOR_Valid <= '0';
XP_Int <= '0';
-- XP_Xbar_EN_L <= "11111";
XP_LED <= '1';
end Execute1;
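
The datapath of this execute stage (adder/subtractor, one-bit shift left, one-bit shift right with sign extension, and a 4-to-1 result multiplexer driven by Ix(27) and Ix(28)) can be paraphrased as a small behavioral model. The Python sketch below is not part of the original design and makes two assumptions that the listing does not confirm: that Ix(27) = 1 selects addition in adsu16h, and that Ix(27) is the low select bit and Ix(28) the high select bit of mux4_1H. The function names are illustrative only. The model also recomputes the zero and sign flags on every call, whereas the VHDL process only ever sets CondCode(1) and CondCode(2).

```python
MASK16 = 0xFFFF  # operands and results are 16 bits wide


def add_sub(op_a, op_b, add):
    # Muxin1. ASSUMPTION: add=1 means A + B and add=0 means A - B.
    if add:
        return (op_a + op_b) & MASK16
    return (op_a - op_b) & MASK16


def shift_left(op_a):
    # Muxin3: bits 14..0 move up one place, bit 0 is cleared.
    return (op_a << 1) & MASK16


def shift_right_sign_extend(op_a):
    # Muxin4: bits 15..1 move down one place, bit 15 is copied.
    return (op_a >> 1) | (op_a & 0x8000)


def execute(op_a, op_b, ix27, ix28):
    adder = add_sub(op_a, op_b, ix27)
    # Mux inputs in port-map order: Muxin1, Muxin1, Muxin3, Muxin4.
    # ASSUMPTION: Ix(27) is select bit 0 and Ix(28) is select bit 1.
    mux_inputs = [adder, adder, shift_left(op_a), shift_right_sign_extend(op_a)]
    result = mux_inputs[(ix28 << 1) | ix27]
    zero_flag = 1 if result == 0 else 0  # corresponds to CondCode(1)
    sign_flag = (result >> 15) & 1       # corresponds to CondCode(2)
    return result, zero_flag, sign_flag
```

Under these assumptions, `execute(3, 4, 1, 0)` models an add, while `execute(0x8001, 0, 1, 1)` selects the sign-extending right shift.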

A6 Execute2.vhd (PE6)

architecture Execute2 of Xilinx_Processing_Part is

signal Id: Bit_Vector (35 downto 0);
signal Ix: Bit_Vector (35 downto 0);
signal reset:bit;

signal local_clk: integer range 0 to 6;

signal Left: Bit_Vector (35 downto 0);
signal Right: Bit_Vector (31 downto 0);

signal OpA: Bit_Vector (15 downto 0);

signal OpB: Bit_Vector ( 15 downto 0);

signal Muxin1, Muxin2, Muxin3, Muxin4: Bit_Vector (15 downto 0);
signal Xbarin, Xbarout: Bit_Vector(35 downto 0);

signal Xbar_en: Bit_Vector (4 downto 0);

-- Muxin2 is not used in this pe and should be removed later...
signal Result: Bit_Vector (15 downto 0);

signal CondCode : Bit_Vector (3 downto 0);
signal zero: Bit_Vector (15 downto 0);

begin

-- Adder unit
addsub : adsu16h port map (OpA, OpB, Ix(27), Muxin1, CondCode(3));

-- Channel one of the results from the adder or shifter onto the xbar
-- depending on opcode selected and desired function.

Select0 : mux4_1H port map (Muxin1(0), Muxin1(0), Muxin3(0), Muxin4(0),
    Ix(27), Ix(28), Result(0));

Select1 : mux4_1H port map (Muxin1(1), Muxin1(1), Muxin3(1), Muxin4(1),
    Ix(27), Ix(28), Result(1));

Select2 : mux4_1H port map (Muxin1(2), Muxin1(2), Muxin3(2), Muxin4(2),
    Ix(27), Ix(28), Result(2));

Select3 : mux4_1H port map (Muxin1(3), Muxin1(3), Muxin3(3), Muxin4(3),
    Ix(27), Ix(28), Result(3));

Select4 : mux4_1H port map (Muxin1(4), Muxin1(4), Muxin3(4), Muxin4(4),
    Ix(27), Ix(28), Result(4));

Select5 : mux4_1H port map (Muxin1(5), Muxin1(5), Muxin3(5), Muxin4(5),
    Ix(27), Ix(28), Result(5));

Select6 : mux4_1H port map (Muxin1(6), Muxin1(6), Muxin3(6), Muxin4(6),
    Ix(27), Ix(28), Result(6));

Select7 : mux4_1H port map (Muxin1(7), Muxin1(7), Muxin3(7), Muxin4(7),
    Ix(27), Ix(28), Result(7));

Select8 : mux4_1H port map (Muxin1(8), Muxin1(8), Muxin3(8), Muxin4(8),
    Ix(27), Ix(28), Result(8));

Select9 : mux4_1H port map (Muxin1(9), Muxin1(9), Muxin3(9), Muxin4(9),
    Ix(27), Ix(28), Result(9));

Select10 : mux4_1H port map (Muxin1(10), Muxin1(10), Muxin3(10), Muxin4(10),
    Ix(27), Ix(28), Result(10));

Select11 : mux4_1H port map (Muxin1(11), Muxin1(11), Muxin3(11), Muxin4(11),
    Ix(27), Ix(28), Result(11));

Select12 : mux4_1H port map (Muxin1(12), Muxin1(12), Muxin3(12), Muxin4(12),
    Ix(27), Ix(28), Result(12));

Select13 : mux4_1H port map (Muxin1(13), Muxin1(13), Muxin3(13), Muxin4(13),
    Ix(27), Ix(28), Result(13));

Select14 : mux4_1H port map (Muxin1(14), Muxin1(14), Muxin3(14), Muxin4(14),
    Ix(27), Ix(28), Result(14));

Select15 : mux4_1H port map (Muxin1(15), Muxin1(15), Muxin3(15), Muxin4(15),
    Ix(27), Ix(28), Result(15));

Pad_Output (XP_right, Id);
Pad_Xbar (XP_Xbar, Xbarout, Xbarin, Xbar_en);
XP_Xbar_EN_L <= Xbar_en;

Xbarout (31 downto 16) <= Result;
Xbarout (15 downto 0) <= Result;
Xbarout (35 downto 32) <= CondCode;

zero <= "0000000000000000";

process
begin

wait until XP_Clk'Event and XP_Clk = '1';

local_clk <= local_clk + 1;
if local_clk = 5 then
local_clk <= 0;
end if;

Pad_Input (XP_left, Left);
if local_clk = 0 then
-- Compute results.
-- >>>>>>> Shift left <<<<<<<<
Muxin3 (15 downto 1) <= OpA (14 downto 0);
Muxin3 (0) <= '0';

-- >>>>>>>> Shift right sign extension <<<<<<<<<<<
Muxin4 (14 downto 0) <= OpA (15 downto 1);
Muxin4 (15) <= OpA(15);
end if;

if local_clk = 1 then
if Result = zero then CondCode(1) <= '1';
end if;
if Result (15) = '1' then CondCode(2) <= '1';
end if;
end if;

if local_clk = 2 then
Pad_Input (XP_left, Id);
end if;

if local_clk = 3 then
Xbar_en <= "10000";
end if;

if local_clk = 4 then
-- Latch new ops from the decode stage.
OpB <= Xbarin (31 downto 16);
OpA <= Xbarin (15 downto 0);
Xbar_en <= "11111";
end if;
if local_clk = 5 then
Ix <= Id;
end if;
end process;
XP_Left <= TriState (XP_Left);
XP_Right <= TriState (XP_Right);
XP_Mem_A <= TriState (XP_Mem_A);
XP_Mem_D <= TriState (XP_Mem_D);
XP_Mem_RD_L <= '1';
XP_Mem_WR_L <= '1';
XP_HSO <= 'Z';
XP_GOR_Result <= '0';
XP_GOR_Valid <= '0';
XP_Int <= '0';

-- XP_Xbar_EN_L <= "11111";
XP_LED <= '1';

end Execute2;

A7 Execute3.vhd (PE7)

architecture Execute3 of Xilinx_Processing_Part is

signal Id: Bit_Vector (35 downto 0);
signal Ix: Bit_Vector (35 downto 0);
signal reset:bit;

signal local_clk: integer range 0 to 6;

signal Left: Bit_Vector (35 downto 0);

signal OpA: Bit_Vector (15 downto 0);

signal OpB: Bit_Vector (15 downto 0);

signal Mar: Bit_Vector (17 downto 0);

signal Mdr: Bit_Vector (15 downto 0);

signal Xbarin, Xbarout: Bit_Vector(35 downto 0);
signal Xbar_en: Bit_Vector (4 downto 0);

signal Result: Bit_Vector (15 downto 0);
begin
Pad_Output (XP_right, Id);
Pad_Xbar (XP_Xbar, Xbarout, Xbarin, Xbar_en);
XP_Xbar_EN_L <= Xbar_en;

Xbarout (31 downto 16) <= Mdr;
Xbarout (15 downto 0) <= Mdr;

-- Mar(17 downto 9) <= OpA(8 downto 0);
-- Mar(8 downto 0) <= OpB(8 downto 0);

process
begin
wait until XP_Clk'Event and XP_Clk = '1';
local_clk <= local_clk + 1;
if local_clk = 5 then
local_clk <= 0;
end if;

Pad_Input (XP_left, Left);

if local_clk = 0 then
Pad_Output (XP_Mem_A, Mar);
XP_Mem_RD_L <= '0';
Xbar_en <= "11111";
end if;

if local_clk = 1 then
Pad_Input (XP_Mem_D, Mdr);
XP_Mem_RD_L <= '1';
end if;

if local_clk = 2 then
end if;

if local_clk = 3 then
Pad_Input (XP_left, Id);
end if;

if local_clk = 4 then
Xbar_en <= "10000";
end if;

if local_clk = 5 then
-- Latch new ops from the decode stage.
OpB <= Xbarin (31 downto 16);
OpA <= Xbarin (15 downto 0);
Mar(17 downto 9) <= Xbarin (8 downto 0);
Mar(8 downto 0) <= Xbarin (24 downto 16);
Xbar_en <= "11111";

Ix <= Id;
end if;
end process;
XP_Mem_WR_L <= '1';
XP_HSO <= 'Z';
XP_GOR_Result <= '0';
XP_GOR_Valid <= '0';
XP_Int <= '0';
XP_LED <= '1';

end Execute3;

A8 Execute4.vhd (PE8)

architecture Execute4 of Xilinx_Processing_Part is

signal Id: Bit_Vector (35 downto 0);
signal Ix: Bit_Vector (35 downto 0);
signal reset: bit;

signal local_clk: integer range 0 to 6;
signal Left: Bit_Vector (35 downto 0);
signal BotIx: Bit_Vector (15 downto 0);

signal OpA: Bit_Vector (15 downto 0);
signal OpB: Bit_Vector (15 downto 0);

signal Mar: Bit_Vector (17 downto 0);
signal Mdr: Bit_Vector (15 downto 0);
signal Rdr: Bit_Vector (15 downto 0);
signal WriteEnable: Bit;

signal Xbarin, Xbarout: Bit_Vector(35 downto 0);
signal Xbar_en: Bit_Vector (4 downto 0);

signal Result: Bit_Vector (35 downto 0);
signal Result1: Bit_Vector (15 downto 0);
signal Result2: Bit_Vector (15 downto 0);

signal OpcodeWrite, SrcRegister, SrcMemory, SrcImmediate: Bit_Vector (4 downto 0);

begin

RIGHTIN: for i in 0 to 15 GENERATE
Pad_Input(XP_right(i), BotIx(i));
END GENERATE RIGHTIN;

RIGHTOUT: for i in 16 to 31 GENERATE
Pad_Output(XP_right(i), Ix(i));
END GENERATE RIGHTOUT;

-- Pad_Output (XP_right(31 downto 16), Ix(31 downto 16));
-- Pad_Input (XP_right(15 downto 0), BotIx);

OpcodeWrite <= "00110";
SrcRegister <= "00111";
SrcMemory <= "00100";
SrcImmediate <= "00101";

Pad_Xbar (XP_Xbar, Xbarout, Xbarin, Xbar_en);
XP_Xbar_EN_L <= Xbar_en;

-- Xbarout (31 downto 16) <= Result1;
-- Xbarout (15 downto 0) <= Result2;

process
begin
wait until XP_Clk'Event and XP_Clk = '1';
local_clk <= local_clk + 1;
if local_clk = 5 then
local_clk <= 0;
end if;

Pad_Input (XP_left, Left);

Pad_InOut (XP_Mem_D, Rdr, Mdr, WriteEnable);
Pad_Output (XP_Mem_A, Mar);

XP_Mem_WR_L <= '1';
XP_Mem_RD_L <= '1';

Xbarout (31 downto 16) <= OpA;
Xbarout (15 downto 0) <= OpA;

if local_clk = 0 then
Pad_Output (XP_Mem_A (15 downto 0), OpB);
if (Ix (31 downto 27) = OpcodeWrite) then
Pad_InOut (XP_Mem_D, OpA, Mdr, '1');
XP_Mem_WR_L <= '0';
else
XP_Mem_RD_L <= '0';
end if;
Xbar_en <= "11111";
end if;

if local_clk = 1 then
if Ix (31 downto 27) = SrcMemory then
Pad_InOut (XP_Mem_D, Rdr, Xbarout (31 downto 16), '0');
Pad_InOut (XP_Mem_D, Rdr, Xbarout (15 downto 0), '0');
end if;
if Ix (31 downto 27) = SrcImmediate then
Xbarout (31 downto 16) <= Ix(25 downto 10);
Xbarout (15 downto 0) <= Ix(25 downto 10);
end if;
end if;

if local_clk = 2 then
Xbar_en <= "10000";
end if;

if local_clk = 3 then
Pad_Output (XP_Mem_A (15 downto 0), Xbarin (31 downto 16));
Rdr <= Xbarin(15 downto 0);
if BotIx(15 downto 11) = OpcodeWrite then
Pad_InOut (XP_Mem_D, Xbarin (15 downto 0), Mdr, '1');
XP_Mem_WR_L <= '0';
end if;
Result1 <= OpB;
Result2 <= OpA;
Xbar_en <= "11111";
end if;

if local_clk = 4 then
Pad_Input (XP_left, Id);
Xbar_en <= "10000";
local_clk <= 5;
end if;

if local_clk = 5 then
-- Latch new ops from the decode stage.
OpA <= Xbarin (31 downto 16);
OpB <= Xbarin (15 downto 0);
Xbar_en <= "11111";
Ix <= Id;
end if;

end process;

XP_HSO <= 'Z';
XP_GOR_Result <= '0';
XP_GOR_Valid <= '0';
XP_Int <= '0';
XP_LED <= '1';

end Execute4;

A9 Xbarcontrol.vhd (PE0)

architecture Xbarcontrol of Xilinx_Control_Part is
signal count : integer range 0 to 6;
begin

process
begin

wait until X0_Clk'Event and X0_Clk = '1';

count <= count + 1;
if (count = 5) then count <= 0;
end if;

X0_Xbar_set <= itobv (count,3);

end process;

X0_SIMD <= TriState(X0_SIMD);
X0_XB_Data <= TriState(X0_XB_Data);
X0_Mem_A <= TriState(X0_Mem_A);
X0_Mem_D <= TriState(X0_Mem_D);
X0_Mem_RD_L <= '1';
X0_Mem_WR_L <= '1';
X0_GOR_Result_In <= "W";
X0_GOR_Valid_In <= "W";
X0_GOR_Result <= '0';
X0_GOR_Valid <= '0';

-- X0_XBar_Set <= "000";
X0_XBar_Send <= '0';
X0_X16_Disable <= '0';
X0_Int <= '0';
X0_Broadcast_Out <= '0';
X0_HSO <= 'Z';
X0_XBar_EN_L <= '1';

end Xbarcontrol;
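
The crossbar controller above is a modulo-6 counter whose value is driven onto X0_Xbar_set each clock. Because VHDL signal assignments inside the process all read the value count held before the clock edge, the configuration select simply walks 0, 1, 2, 3, 4, 5 and wraps. The Python sketch below illustrates that schedule. It is not part of the original design: `xbar_schedule` is a hypothetical name, the local `itobv` is only a stand-in that assumes the library helper converts an integer to a bit vector of the given width, and count is assumed to start at 0.

```python
def itobv(value, width):
    # Stand-in for the library helper used in the listing.
    # ASSUMPTION: integer to fixed-width bit string, most significant bit first.
    return format(value, "0{}b".format(width))


def xbar_schedule(cycles):
    """Return the X0_Xbar_set value driven after each rising clock edge."""
    count = 0  # ASSUMPTION: the counter starts at 0
    driven = []
    for _ in range(cycles):
        # Both assignments see the pre-edge value of count,
        # mirroring VHDL signal-assignment semantics.
        xbar_set = itobv(count, 3)
        count = 0 if count == 5 else count + 1
        driven.append(xbar_set)
    return driven
```

For example, `xbar_schedule(8)` yields the six configurations in order followed by the first two again, which matches the six-phase local_clk used by the processing elements.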

A10 Xbarconfig

[Crossbar configuration listing; contents not recoverable from the scanned source.]

Appendix B

Runtime Results

B1 Fibonacci Sequence Results

T2 Version 1.88 Created Wed Oct 5 09:31:37 EDT 1994
NEW INTERFACE BOARD (rev2)

T2:1> source pe5.init
2 boards available on Splash 2 unit 0

T2:2> source fib.step

Added RB 1 at tick 6

Board 0 PE 3 Offset 0 Words 32
000000: 0000 000A 000A 000A 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 000A 000A 000A 000A 000A 000A 000A

T2:3> source fib.step

Added RB 2 at tick 12

Board 0 PE 3 Offset 0 Words 32
000000: 0000 000A 000A 000A 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 000A 000A 000A 000A 000A 000A 000A

T2:4> source fib.step

Added RB 3 at tick 18

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0002 000A 000A 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0002 000A 000A 000A 000A 000A 000A

T2:5> source fib.step

Added RB 4 at tick 24

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0002 0003 000A 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0002 0003 000A 000A 000A 000A 000A

T2:6> source fib.step

Added RB 5 at tick 30

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0002 0003 0005 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0002 0003 0005 000A 000A 000A 000A

T2:7> source fib.step

Added RB 6 at tick 36

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0003 0003 0005 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0003 0003 0005 000A 000A 000A 000A

T2:8> source fib.step

Added RB 7 at tick 42

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0003 0005 0005 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0003 0005 0005 000A 000A 000A 000A

T2:9> source fib.step

Added RB 8 at tick 48

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0003 0005 0005 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0003 0005 0005 000A 000A 000A 000A

T2:10> source fib.step

Added RB 9 at tick 54

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0003 0005 0008 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0003 0005 0008 000A 000A 000A 000A

T2:11> source fib.step

Added RB 10 at tick 60

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0005 0005 0008 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0005 0005 0008 000A 000A 000A 000A

T2:12> source fib.step

Added RB 11 at tick 66

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0005 0008 0008 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0005 0008 0008 000A 000A 000A 000A

T2:13> source fib.step

Added RB 12 at tick 72

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0005 0008 0008 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0005 0008 0008 000A 000A 000A 000A

T2:14> source fib.step

Added RB 13 at tick 78

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0005 0008 000D 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0005 0008 000D 000A 000A 000A 000A

T2:15> source fib.step

Added RB 14 at tick 84

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0008 0008 000D 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0008 0008 000D 000A 000A 000A 000A

T2:16> source fib.step

Added RB 15 at tick 90

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0008 000D 000D 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0008 000D 000D 000A 000A 000A 000A

T2:17> source fib.step

Added RB 16 at tick 96

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0008 000D 000D 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0008 000D 000D 000A 000A 000A 000A

T2:18> source fib.step

Added RB 17 at tick 102

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0008 000D 0015 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0008 000D 0015 000A 000A 000A 000A

B2 VLIW Fibonacci Sequence Results

T2 Version 1.88 Created Wed Oct 5 09:31:37 EDT 1994
NEW INTERFACE BOARD (rev2)

T2:1> source all.init
2 boards available on Splash 2 unit 0

T2:2> source ai|1.step

Added RB 1 at tick 6

Board 0 PE 3 Offset 0 Words 32
000000: 0000 000A 000A 000A 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 000A 000A 000A 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 000A 000A 000A 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 000A 000A 000A 000A 000A 000A 000A

T2:3> 1

Added RB 2 at tick 12

Board 0 PE 3 Offset 0 Words 32
000000: 0000 000A 000A 000A 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 000A 000A 000A 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 000A 000A 000A 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 000A 000A 000A 000A 000A 000A 000A

T2:4> l

Added RB 3 at tick 18

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0001 0001 000A 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0001 0001 000A 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 0001 0001 000A 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0001 0001 000A 000A 000A 000A 000A

T2:5> l

Added RB 4 at tick 24

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0001 0001 0002 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0001 0001 0002 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 0001 0001 0002 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0001 0001 0002 000A 000A 000A 000A

T2:6> l

Added RB 5 at tick 30

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0001 0002 0002 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0001 0002 0002 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 0001 0002 0002 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0001 0002 0002 000A 000A 000A 000A

T2:7> 1

Added RB 6 at tick 36

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0001 0002 0002 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0001 0002 0002 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 0001 0002 0002 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0001 0002 0002 000A 000A 000A 000A

T2:8> 1

Added RB 7 at tick 42

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0001 0002 0003 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0001 0002 0003 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 0001 0002 0003 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0001 0002 0003 000A 000A 000A 000A

T2:9> 1

Added RB 8 at tick 48

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0002 0003 0003 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0002 0003 0003 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 0002 0003 0003 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0002 0003 0003 000A 000A 000A 000A

T2:10> l

Added RB 9 at tick 54

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0002 0003 0003 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0002 0003 0003 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 0002 0003 0003 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0002 0003 0003 000A 000A 000A 000A

T2:11> l

Added RB 10 at tick 60

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0002 0003 0005 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0002 0003 0005 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 0002 0003 0005 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0002 0003 0005 000A 000A 000A 000A

T2:12> i

Added RB 11 at tick 66

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0003 0005 0005 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0003 0005 0005 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 0003 0005 0005 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0003 0005 0005 000A 000A 000A 000A

T2:13> 1

Added RB 12 at tick 72

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0003 0005 0005 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0003 0005 0005 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 0003 0005 0005 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0003 0005 0005 000A 000A 000A 000A

T2:14> 1

Added RB 13 at tick 78

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0003 0005 0008 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0003 0005 0008 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 0003 0005 0008 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0003 0005 0008 000A 000A 000A 000A

T2:15> 1

Added RB 14 at tick 84

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0005 0008 0008 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0005 0008 0008 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 0005 0008 0008 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0005 0008 0008 000A 000A 000A 000A

T2:16> I

Added RB 15 at tick 90

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0005 0008 0008 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0005 0008 0008 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 0005 0008 0008 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0005 0008 0008 000A 000A 000A 000A

T2:17> 1

Added RB 16 at tick 96

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0005 0008 000D 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0005 0008 000D 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 0005 0008 000D 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0005 0008 000D 000A 000A 000A 000A

T2:18> l

Added RB 17 at tick 102

Board 0 PE 3 Offset 0 Words 32
000000: 0000 0008 000D 000D 000A 000A 000A 000A
Board 0 PE 4 Offset 0 Words 32
000000: 0000 0008 000D 000D 000A 000A 000A 000A
Board 0 PE 14 Offset 0 Words 32
000000: 0000 0008 000D 000D 000A 000A 000A 000A
Board 0 PE 13 Offset 0 Words 32
000000: 0000 0008 000D 000D 000A 000A 000A 000A
