
 

 

A FLOATING-POINT INNER PRODUCT STEP

PROCESSOR FOR USE IN A VLSI SYSTOLIC ARRAY

By

Yeong-Jeng Tseng

A THESIS
Submitted to
Michigan State University

in partial fulfillment of the requirements
for the degree of

MASTER OF SCIENCE

Department of Electrical Engineering and
Systems Science

1983

ABSTRACT

A FLOATING-POINT INNER PRODUCT STEP
PROCESSOR FOR USE IN A VLSI SYSTOLIC ARRAY

By

Yeong-Jeng Tseng

This thesis presents the design of a floating-point inner product
step processor suitable for VLSI implementation. The operands and the
resultant are expressed using the IEEE-754 standard. A simulation of
the floating-point inner product step processor is developed to obtain
information on chip area and propagation delay time as
functions of the minimum lithographic linewidth. An additional simulation
is developed for band matrix triangulation, utilizing concurrent
Gaussian elimination, in which the processor is used as a basic building
block in a VLSI systolic array. The systolic array simulation uses
parameters of matrix bandwidth and lithographic linewidth to provide
results indicating total area geometry and delay time requirements.

To my parents

ACKNOWLEDGEMENTS

The author wishes to express his sincere appreciation to his major
adviser, Dr. Michael A. Shanblatt, for his guidance and encouragement in
the course of this research.

He also wishes to thank Dr. D. K. Reihard and the committee
members, Dr. P. D. Fisher and Dr. C. L. Hey, for giving valuable
suggestions and comments in this work.

Finally, the author owes special thanks to Miss Echo Chang for
her emotional encouragement and support.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

I. INTRODUCTION
   1.1 Problem Statement
   1.2 Approach

II. BACKGROUND
   2.1 VLSI Systolic Array for Matrix Triangulation
   2.2 IEEE-754 Floating-Point Standard
   2.3 Floating-Point Arithmetic Operation

III. DESCRIPTION OF A FLOATING-POINT INNER PRODUCT STEP PROCESSOR
   3.1 Introduction
   3.2 Multiplication and Alignment
   3.3 First Shifter
   3.4 Addition
   3.5 Second Shifter
   3.6 Roundoff

IV. DESCRIPTION OF A FLOATING-POINT DIVIDER
   4.1 Introduction
   4.2 Divider
   4.3 Shifter
   4.4 Roundoff

V. SIMULATION DEVELOPMENT
   5.1 Chip Area Computation
   5.2 Propagation Delay Computation
   5.3 Significant Module Data

VI. RESULTS AND CONCLUSIONS
   6.1 Circuit Simulation Results
   6.2 Conclusions

REFERENCES

LIST OF TABLES

Table
2.1 Port-coefficient time table.
5.1 Significant module data.
6.1 Simulation area and time result for FLPMAC design with different options.
6.2 Comparison of area and time results of different arrays.

LIST OF FIGURES

Figure
2.1 The hex-connected processor array for pipelining the L-U decomposition of a band matrix.
2.2 The L-U decomposition of a band matrix.
2.3a Augmented matrix {A|b}.
2.3b Augmented upper triangular matrix {U|d}.
2.4 Computing array structure for matrix of arbitrary dimension with B=3.
2.5 Basic-single format.
2.6 Preliminary register.
2.7 Flow chart of the execution sequence for normalized FLP addition/subtraction.
2.8 Flow chart of the execution sequence for normalized FLP multiplication.
2.9 Flow chart of the execution sequence for normalized FLP division.
3.1 Block diagram of the FLPMAC.
3.2 Block diagram of the MA.
3.3 A 5-by-5 sign-magnitude Braun array multiplier.
3.4 A 48-bit serial shifter.
3.5 A 4-by-4 parallel shifter.
3.6a A NOR form 2-to-4 decoder.
3.6b A NAND form 2-to-4 decoder.
3.7 Block diagram of the FS.
3.8 An 8-bit leading zero detector.
3.9 A 48-bit leading zero detector.
3.10 Block diagram of the AD.
3.11 A right shifter.
3.12 Block diagram of the SS.
3.13 Block diagram of the rounding circuit.
3.14 An 8-bit increment by one circuit (IBO).
3.15 Block diagram of the overflow circuit.
4.1 Block diagram of the FLP divider.
4.2 An n-by-n convergence division algorithm based divider for mantissa division.
4.3 Block diagram of the Divider.
4.4 Block diagram of the Shifter.
6.1 Entire chip edge size versus matrix bandwidth for 0.8, 0.5 and 0.2 micron linewidth.
6.2 Entire chip propagation delay versus matrix bandwidth for 0.8, 0.5 and 0.2 micron linewidth.

CHAPTER I

INTRODUCTION

VLSI (Very Large Scale Integration) systolic computing array
processors have provided a new frontier of research for improving the
performance of systems requiring rapid matrix manipulations.
Specifically, in the triangulation of the linear equations Ax = b,
systolic arrays with their modularity and regularity allow for
straightforward circuit design, testing and implementation. Systolic array
processors utilize both pipeline and parallel processing concepts and
can execute the many inner product step operations, the kernel of matrix
triangulation, much faster than conventional serial architectures. The
benefit of VLSI technology is that these computing structures can be
fabricated on a single chip, or perhaps in a modular fashion on a few
chips, and then be attached to a host computer as a peripheral device
capable of rapidly producing solution vectors of the linear equation
system.

Previous design simulations of inner product step processing
elements (PE's) have been constrained to a fixed-point (FXP) number
system [1,2,3]. But, in order for these designs to have more universal
applicability, floating-point (FLP) PE's must be considered. The use of
FLP PE's has been delayed due to their complexity, in both time and space,
when compared to FXP designs. Additionally, circuit designers have been
reluctant to adopt a nonstandardized FLP format. In 1979, the IEEE
proposed a FLP standard [4,5] for microcomputer and minicomputer
architectures which is gaining wide industry acceptance [6,7].

1.1 Problem Statement

The purpose of this thesis is to design and simulate a FLP inner
product step PE using the IEEE standard as a basis. This will lead to
an assessment of the time and space complexity of such an element.

A simulation of the design will be developed to quantify the PE's
required chip area and its modular delay time. Then, the developed FLP
inner product algorithm, with additional required circuit elements, will
be used in a simulation of an overall systolic structure for band matrix
triangulation. The overall structure will then be quantified with
respect to maximum matrix bandwidth per chip and problem throughput.

These results will provide further understanding of the potential,
promise and applicability of the VLSI systolic array for matrix
triangulation.

1.2 Approach

To begin the design of a FLP PE using the IEEE standard, some
initial design problems must be considered.

1. The precision problem: A tradeoff between hardware cost and
computational precision must be examined with respect to the
required precision of the application.

2. Overflow and underflow: In the IEEE standard, both overflow and
underflow have unique forms. Our design must sense these
singularities and correctly modify the result so that
computation can continue with a minimum loss of accuracy.

There are several different approaches to the design of a FLP PE,
but only a few are suitable for VLSI implementation. These must
incorporate the necessary ingredients for good VLSI design, namely,
maximum parallelism and pipelining, design regularity and the use of
only local communications.

The research approach here will include three steps. The first
step is to design the optimal processor by examining specific design
tradeoffs. This includes comparing such items as shift register
candidates (serial versus parallel) and methods for rounding of the
computed results (rounding to nearest versus truncation). The second
step is to quantify, via simulation, possible choices so as to compare
delay times and geometric areas. The third step is to incorporate the
optimal inner product step PE design in a previously developed systolic
algorithm and simulate the array with various parameters of matrix
bandwidth and lithographic linewidth. The results of this final circuit
simulation will provide information on the maximum matrix bandwidth that
can be solved on a single chip and its associated solution time.

CHAPTER II

BACKGROUND

2.1 VLSI Systolic Arrays for Matrix Triangulation

A VLSI algorithm implementing Gaussian elimination for L-U
decomposition was proposed by Mead and Conway [8]. The algorithm,
illustrated in Figure 2.1, can triangulate an NxN band matrix A, with
bandwidth B, into upper triangular form U and lower triangular form L
(Figure 2.2). The bandwidth of a matrix is defined as B = (p+q-2)/2,
where p is the column number of the first row's rightmost nonzero
element, and q is the row number of the first column's bottommost
nonzero element.

Triangulating the matrix takes 3N + min(p,q) time units and
pxq processors. The time or the number of processors can be reduced either by
minimizing the bandwidth of the matrix using the methods described in
[9,10,11], or by revising the algorithm.
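
The cost figures quoted above can be tabulated with a few lines of Python. This is only a sketch of the stated formulas; the function name `band_array_cost` and the example values of N, p and q are assumptions chosen for illustration, not data from the thesis.

```python
def band_array_cost(n: int, p: int, q: int) -> dict:
    """Evaluate the formulas quoted in the text for an N x N band matrix.

    p: column number of the first row's rightmost nonzero element
    q: row number of the first column's bottommost nonzero element
    """
    return {
        "bandwidth": (p + q - 2) / 2,      # B = (p + q - 2) / 2
        "time_units": 3 * n + min(p, q),   # 3N + min(p, q)
        "processors": p * q,               # p x q inner product step cells
    }


# Hypothetical example: N = 100 with p = q = 4, that is, B = 3.
cost = band_array_cost(n=100, p=4, q=4)
print(cost)
```

Because the processor count grows as pxq, reducing the bandwidth before triangulation lowers both terms at once.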

An early version of a revised algorithm was proposed by Kung and
Leiserson [12]. Their algorithm, composed of simple inner-product step
and division function processors, is used to carry out L-U decomposition
on a full matrix {A}. A further revised version of that algorithm was
proposed by Hwang and Cheng [13]. This version differs from the early
one in interconnection structure, latch operation and I/O requirements.

Figure 2.1 The hex-connected processor array for pipelining the L-U decomposition of a band matrix [8].

 

 

 

 

 

 

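$$
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} &        & 0 \\
a_{21} & a_{22} & a_{23} & a_{24} & a_{25} &   \\
a_{31} & a_{32} & a_{33} & a_{34} & a_{35} &   \\
a_{41} & a_{42} & a_{43} & \cdots &        &   \\
       & a_{52} & a_{53} & \cdots &        &   \\
0      &        & \ddots &        &        &
\end{bmatrix}
=
\begin{bmatrix}
1      &        &        &        & 0 \\
l_{21} & 1      &        &        &   \\
l_{31} & l_{32} & 1      &        &   \\
l_{41} & l_{42} & l_{43} & 1      &   \\
       & l_{52} & l_{53} & \cdots & \ddots
\end{bmatrix}
\begin{bmatrix}
u_{11} & u_{12} & u_{13} & u_{14} &        & 0 \\
       & u_{22} & u_{23} & u_{24} & u_{25} &   \\
       &        & u_{33} & u_{34} & u_{35} &   \\
0      &        &        & \ddots &        & \ddots
\end{bmatrix}
$$

Figure 2.2 The L-U decomposition of a band matrix (A = L U).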

Moreover, it is designed to perform L-U decomposition on an entire
(augmented) linear system of equations, {A|b}. Neither of these two
versions considered any design specifics such as I/O interface details
or actual processor layout details.

A more recent improvement to these original versions was developed
in [3]. By using an isolated row of PE's, the improved algorithm can
triangulate an arbitrarily large augmented band matrix {A|b} (Figure
2.3a) to an upper triangular form {U|d} (Figure 2.3b). Also, in [3],
circuits for I/O interface and processor layout were presented and
analyzed with respect to minimum delay time and chip area requirements.
In this version, Figure 2.4, inner-product cells, called MAC's (Multiply
and Add Cells), perform the functions w = xy + z, x = x and y = y, that
is, x and y are passed through unchanged. The
complementation circle shown on the topmost row of MAC's signifies a
two's complement operation. The division function is implemented by the
DC (Division Cell), where g = e/f and f = f. The thick black lines between
each row of cells represent latch arrays. These latches provide the
synchronization for operands being pumped between rows of cells.

Figure 2.3a Augmented matrix {A|b}.

Figure 2.3b Augmented upper triangular matrix {U|d}.

Figure 2.4 Computing array structure for matrix of arbitrary dimension with B=3 [3].

Matrix elements of A enter the processor via input ports
I1-I7 shown in Figure 2.4. The number of input ports is given by the
full breadth of the matrix, 2B+1. Matrix elements of U are pumped out
of the processor via output ports O1-O4 shown in Figure 2.4. The number
of output ports is given by the bandwidth plus one for the diagonal
element, B+1.

The D section, shown on the bottom of Figure 2.4, resolves the
right hand side vector b into vector d. Elements of b enter into the D
section via Id and are pumped out via Od.

An I/O port-coefficient timing table for triangulating the matrix
{A|b}, for B=3, is given in Table 2.1. In Table 2.1 it is seen that the
time required to obtain {U|d} is 2N+3, or in general 2N+B. Thus, this
algorithm is classified as an O(n) algorithm. Previous versions, also
O(n) algorithms, have required more PE's and thus more chip area.
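
As a quick check of the port counts and timing stated above, the following sketch evaluates them for a given bandwidth. The function name `improved_array_io` is an assumption made for this illustration.

```python
def improved_array_io(n: int, b: int) -> dict:
    """Port counts and solution time for the improved array, as stated in the text."""
    return {
        "input_ports": 2 * b + 1,   # full breadth of the band matrix
        "output_ports": b + 1,      # bandwidth plus the diagonal element
        "time_units": 2 * n + b,    # time to obtain {U|d}
    }


# For B = 3 this gives seven input ports and four output ports.
io_b3 = improved_array_io(n=50, b=3)
print(io_b3)
```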

In all previously published versions of this array, PE's were
designed to handle only integers, or, at best, the mantissas of FLP
numbers with identically adjusted exponents. This limited the function
and applicability of the algorithm. To alleviate this limitation, we
need a PE capable of handling general FLP operands.

Table 2.1 Port-coefficient time table [3].

2.2 IEEE-754 Floating-Point Standard

The IEEE-754 floating-point standard has been proposed for mini and
microcomputer use [4,5]. This standard defines four FLP formats in two
groups, basic and extended, each having two possible operand widths,
single and double. In this thesis, only the basic-single format is
chosen and used throughout the entire design.

The basic-single format for a binary FLP number X is shown in
Figure 2.5.

  | S |    E    |            M            |
    0   1     8   9                     31

Figure 2.5 Basic-single format [5].

Using an 8-bit biased exponent e, a 23-bit mantissa fraction f with an
implicit leading bit and a 1-bit
sign digit s, the format can express values v of X as follows:

1. If e=255 and f≠0 then v=NaN (Not a Number).

2. If e=255 and f=0 then $v=(-1)^s \infty$.

3. If 0<e<255 then $v=(-1)^s 2^{e-127}(1.f)$.

4. If e=0 and f≠0 then $v=(-1)^s 2^{-126}(0.f)$.

5. If e=0 and f=0 then $v=(-1)^s 0$ (Zero).
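
The five cases can be illustrated with a small decoder. This is a sketch, not part of the thesis design: the function names are hypothetical, the sign is taken as the most significant bit of a 32-bit integer (bit 0 in the left-to-right numbering of Figure 2.5), and the denormalized scale factor of 2^-126 follows the IEEE standard as finally adopted.

```python
import math
import struct


def decode_basic_single(bits: int) -> float:
    """Decode a 32-bit basic-single pattern using the five cases listed above."""
    s = (bits >> 31) & 0x1      # sign digit
    e = (bits >> 23) & 0xFF     # 8-bit biased exponent
    f = bits & 0x7FFFFF         # 23-bit fraction
    sign = -1.0 if s else 1.0
    if e == 255:
        # Case 1 (NaN) when f != 0, case 2 (infinity) when f == 0.
        return math.nan if f else sign * math.inf
    if e == 0:
        # Cases 4 and 5: no implicit leading one (denormalized number or zero).
        return sign * (f / 2**23) * 2.0**-126
    # Case 3: normalized number with an implicit leading one.
    return sign * (1 + f / 2**23) * 2.0 ** (e - 127)


def float_to_bits(x: float) -> int:
    """Return the 32-bit single precision pattern of x (helper for checking)."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


print(decode_basic_single(float_to_bits(1.5)))
```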

In the implementation of FLP arithmetic, rounding of the resultant
is often inevitable. There are four rounding modes described by the
standard:

RN - Round to Nearest
RZ - Round toward Zero
RP - Round toward Positive Infinity
RM - Round toward Negative Infinity

An implementation of the standard may support either RN only, with RZ
for Round to Integer, or all four rounding modes.

In this research, all four rounding modes and another mode,
truncation, are utilized. The rounding operation is implemented through
a preliminary register. This register has the following format:

  | S | V | N | ... | L | G | R | S |

Figure 2.6 Preliminary register [4].

Here V is the overflow bit for the significant digit field, N and L are
the most and least significant bits, and G and R are two bits beyond L. S,
the sticky bit, is the logical OR of all bits thereafter. In the design
of the FLP inner product PE, the use of different rounding combinations
yields different chip area and delay time results in addition to
differences in resultant accuracy. These differences will be examined
in Chapters 3 and 6.
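
The role of the G, R and S bits can be sketched in software as follows. This is an illustrative model only: the function names are hypothetical, the mantissa is held as a plain integer, and ties are assumed to round to the even neighbor, as RN is defined in the IEEE standard.

```python
def round_to_nearest(mantissa: int, extra_bits: int) -> int:
    """Drop `extra_bits` low-order bits of an integer mantissa using RN.

    The dropped bits are interpreted as G (guard), R (round) and S (sticky,
    the OR of everything below R). A tie is rounded to the even neighbor.
    """
    if extra_bits < 2:
        raise ValueError("need at least a guard bit and a round bit")
    kept = mantissa >> extra_bits                            # bits N ... L
    g = (mantissa >> (extra_bits - 1)) & 1                   # guard bit G
    r = (mantissa >> (extra_bits - 2)) & 1                   # round bit R
    s = int(mantissa & ((1 << (extra_bits - 2)) - 1) != 0)   # sticky bit S
    l = kept & 1                                             # least significant kept bit L
    if g and (r or s or l):
        kept += 1    # a carry out of N would set the overflow bit V
    return kept


def truncate(mantissa: int, extra_bits: int) -> int:
    """Truncation simply discards the low-order bits."""
    return mantissa >> extra_bits


print(round_to_nearest(0b1011100, 3), truncate(0b1011100, 3))
```

Truncation needs no adder at all, which is why the two modes lead to different area and delay figures.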

2.3 Floating-Point Arithmetic Operation

In the IEEE-754 standard, a nonzero number can be expressed as a
sign digit (s) concatenated with the mantissa (m) and then combined with
the exponent (e) as (s.m, e). The mantissa range is

$1 \le |m| \le 2 - 2^{-p}$    (2-1)

where p is the number of significant digits of the mantissa. The
exponent is a biased integer in the range

2"(q'b) _<_ e g zq-b-l (2-2)
where q is the number of significant digits in the exponent and b is the
bias constant.

FLP arithmetic operations of addition, subtraction, multiplication
and division are defined as follows:

1. Addition/Subtraction

$(s_1.m_1, e_1) \pm (s_2.m_2, e_2) = (s_3.m_3, e_3)$    (2-3)
where

$s_3 = (s_1 \oplus M)\, C_{MSB} + s_2\, \overline{C}_{MSB}$.

Here, M is the operator designating addition (M=0) or subtraction
(M=1). $C_{MSB}$ is the carry out of the most significant bits
operation.

 

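$$
m_3 = \begin{cases}
m_1 \pm m_2 \times 2^{-(e_1 - e_2)} & \text{for } e_1 > e_2 \\
m_1 \times 2^{-(e_2 - e_1)} \pm m_2 & \text{for } e_1 < e_2 \\
m_1 \pm m_2 & \text{for } e_1 = e_2
\end{cases}
\qquad
e_3 = \begin{cases}
e_1 & \text{for } e_1 > e_2 \\
e_2 & \text{for } e_1 < e_2 \\
e_1 & \text{for } e_1 = e_2
\end{cases}
\qquad (2\text{-}4)
$$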

Since we add or subtract only two numbers at a time, the
resultant's mantissa is always bounded in the range

$0 \le |m| < 4$    (2-5)

which violates the mantissa definition in the IEEE-754 standard and
has to be normalized. When $2 \le |m| < 4$, the mantissa must be shifted
one bit to the right and the exponent must be increased by one.
When $0 \le |m| < 1$, the mantissa must be shifted one bit to the left and
the exponent must be decreased by one. Flow charts for FLP addition
and subtraction are shown in Figure 2.7.
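
The alignment and postnormalization steps can be sketched with exact rational arithmetic. This is a software illustration under stated assumptions, not the hardware algorithm: operands are hypothetical (sign, mantissa, exponent) triples, the left shift is repeated until the mantissa is normalized (the hardware described later uses a leading zero detector for this), and a zero result is returned with exponent 0 purely by convention.

```python
from fractions import Fraction


def flp_add(a, b):
    """Add two normalized operands given as (sign, mantissa, exponent)."""
    (s1, m1, e1), (s2, m2, e2) = a, b
    e3 = max(e1, e2)
    # Exponent alignment: scale down the operand with the smaller exponent.
    v1 = (-1) ** s1 * Fraction(m1) / 2 ** (e3 - e1)
    v2 = (-1) ** s2 * Fraction(m2) / 2 ** (e3 - e2)
    total = v1 + v2
    s3, m3 = int(total < 0), abs(total)
    if m3 == 0:
        return (0, Fraction(0), 0)   # convention chosen for this sketch
    # Postnormalization.
    if m3 >= 2:                      # 2 <= |m| < 4: one right shift
        m3, e3 = m3 / 2, e3 + 1
    while m3 < 1:                    # 0 < |m| < 1: left shift until normalized
        m3, e3 = m3 * 2, e3 - 1
    return (s3, m3, e3)


def flp_sub(a, b):
    """Subtraction (M = 1) is modeled here by flipping the sign of b."""
    s2, m2, e2 = b
    return flp_add(a, (s2 ^ 1, m2, e2))


# 1.5 x 2^1 + 1.5 x 2^0 = 4.5 = 1.125 x 2^2
print(flp_add((0, Fraction(3, 2), 1), (0, Fraction(3, 2), 0)))
```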

2. Multiplication

$(s_1.m_1, e_1) \times (s_2.m_2, e_2) = (s_3.m_3, e_3)$    (2-6)
where

$s_3 = s_1 \oplus s_2$,

$m_3 = m_1 \times m_2$ and $e_3 = e_1 + e_2$.

When multiplying two numbers, the resulting mantissa is within the
range

$1 \le |m| < 4$.    (2-7)

If the mantissa is in the range $2 \le |m| < 4$ it must be shifted one bit
to the right and the exponent increased by one. After that the
mantissa and exponent become

$m_3 = 2^{-1} \times m_1 \times m_2$ and $e_3 = e_1 + e_2 + 1$.    (2-8)

A flow chart for multiplication is shown in Figure 2.8.

3. Division

$(s_1.m_1, e_1) / (s_2.m_2, e_2) = (s_3.m_3, e_3)$    (2-9)
where

$s_3 = s_1 \oplus s_2$,

$m_3 = m_1 / m_2$ and $e_3 = e_1 - e_2$.

After division, the mantissa of the resultant is within the range

$1/2 < |m| < 2$.    (2-10)

When the mantissa is smaller than one, it must be shifted one bit to
the left and the exponent decreased by one. After that the mantissa
and exponent become

$m_3 = 2 \times m_1 / m_2$ and $e_3 = e_1 - e_2 - 1$.    (2-11)

The flow chart of division is shown in Figure 2.9.

Figure 2.7 Flow chart of the execution sequence
for normalized FLP addition/subtraction.

Figure 2.8 Flow chart of the execution sequence
for normalized FLP multiplication.

Figure 2.9 Flow chart of the execution sequence
for normalized FLP division.
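
Multiplication and division with their single-step normalization can be sketched in the same style. As before, this is an illustration with hypothetical function names and (sign, mantissa, exponent) triples holding exact rational mantissas; rounding, overflow and underflow are deliberately left out.

```python
from fractions import Fraction


def flp_mul(a, b):
    """Multiply two normalized operands given as (sign, mantissa, exponent)."""
    (s1, m1, e1), (s2, m2, e2) = a, b
    s3, m3, e3 = s1 ^ s2, Fraction(m1) * m2, e1 + e2
    if m3 >= 2:                  # 2 <= |m| < 4: shift right one bit
        m3, e3 = m3 / 2, e3 + 1
    return (s3, m3, e3)


def flp_div(a, b):
    """Divide two normalized operands given as (sign, mantissa, exponent)."""
    (s1, m1, e1), (s2, m2, e2) = a, b
    s3, m3, e3 = s1 ^ s2, Fraction(m1) / m2, e1 - e2
    if m3 < 1:                   # |m| < 1: shift left one bit
        m3, e3 = m3 * 2, e3 - 1
    return (s3, m3, e3)


# (1.5 x 2^1) x (-1.5 x 2^2) = -18 = -1.125 x 2^4
print(flp_mul((0, Fraction(3, 2), 1), (1, Fraction(3, 2), 2)))
# (1.0 x 2^3) / (1.5 x 2^1) = 8/3 = (4/3) x 2^1
print(flp_div((0, Fraction(1), 3), (0, Fraction(3, 2), 1)))
```

For normalized inputs a single shift is always sufficient in both cases, because the product of two mantissas in [1, 2) is below 4 and their quotient is above 1/2.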

 

CHAPTER III

DESCRIPTION OF A FLOATING-POINT
INNER PRODUCT STEP PROCESSOR

3.1 Introduction

The FLP inner product processor performs the function D = AxB + C,
which requires a FLP multiplication followed by a FLP addition.
Therefore, the basic structure of the FLP inner product processor is a
multiplier followed by an adder. In its most basic form, the concept of
parallel processing need not be adopted. If parallelism were exploited,
the mantissa multiplication and part of the exponent alignment could be
performed at the same time. Modifying the basic structure to use
concepts of parallel processing and pipelining enables the design of a
FLPMAC (Floating Point Multiply and Add Cell). This structure is
illustrated in Figure 3.1.

The FLPMAC structure is divided into five parts. Each part is a
segment of the arithmetic pipeline.

The first section of the FLPMAC is called MA (Multiplication and
Alignment). This part not only performs the function of mantissa
multiplication and exponent addition, but also calculates the exponent
difference between the product AxB and the addend C.

Figure 3.1 Block diagram of the FLPMAC (Input, Multiplication and Alignment, First Shifter, Addition, Second Shifter, Roundoff, Output).

The second part is called FS (First Shifter) and uses the
difference coming from MA to complete the operation of exponent
alignment.

The third part is called AD (Addition) and implements mantissa
addition and leading zero detection.

The fourth part is called SS (Second Shifter) and uses the result
of the leading zero detection to complete the operation of
postnormalization.

The fifth part is called RF (Round Off) and handles the roundoff
and overflow operations. There are two ways to handle roundoff, either
rounding to nearest or truncation.

3.2 Multiplication and Alignment

The MA consists of two parts, one for multiply and one for exponent
addition, which are independent and thus may be performed concurrently.
The exponent addition part also calculates the exponent difference
between the product AxB and addend C for alignment purposes.

The multiply part can be subdivided into two sections, a
sign-magnitude multiplier and two groups of 2-to-1 multiplexers. Each
group is formed by 49 multiplexers and controlled by the S-signal. The
S-signal is the carry-out of the 9-bit ripple carry adder (RCA) shown in
Figure 3.2.

Figure 3.2 Block diagram of the MA.

The sign-magnitude multiplier can be designed using several
candidate algorithms. These include the modified Booth's algorithm,
used in the IBM 360/91 [14], algorithms that use parallel counters to
combine more bits of the partial products in order to form the result
more quickly [15,16,17], or algorithms that use CSA trees to combine the
partial products [18]. Among all the algorithms, the Braun array
multiplier is chosen. The reason for choosing the Braun array
multiplier is that its interconnection is quite simple and regular, and
it uses only one type of component, a full adder, which simplifies the
layout and testing process. A 5-by-5 sign-magnitude Braun array
multiplier is shown in Figure 3.3. The multiplier used in the FLPMAC is
similar to Figure 3.3. The FLPMAC multiplier accepts two 24-bit
numbers, 23 bits coming from the mantissa and 1 implicit bit, and
generates a 48-bit product. Here, no roundoff operation is performed in
order to extract maximum precision. After the product is formed, a
decision is made as to which one is larger, the product or the addend.
Then, the smaller one is put into FS for exponent alignment and the
larger one is passed to AD.
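
To make the regular structure of the array concrete, the following is a bit-level software sketch of an n-by-n Braun-style array: n-1 rows of carry-save full adders followed by one ripple-carry row. It models only the magnitude product (the sign of a sign-magnitude product is the exclusive OR of the operand signs); the function names and the list-based wiring are assumptions of this sketch rather than the layout of Figure 3.3.

```python
def full_adder(x: int, y: int, cin: int):
    """One-bit full adder returning (sum, carry_out)."""
    return x ^ y ^ cin, (x & y) | (x & cin) | (y & cin)


def braun_multiply(a: int, b: int, n: int) -> int:
    """Multiply two n-bit magnitudes with an array of full adders."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]
    # Row 0 consists of the partial products a_i * b_0 only.
    sums = [a_bits[i] & b_bits[0] for i in range(n)]
    carries = [0] * (n - 1)
    product = [sums[0]]
    # Rows 1 .. n-1: carry-save rows, each with n-1 full adders.
    for j in range(1, n):
        new_sums, new_carries = [], []
        for i in range(n - 1):
            s, c = full_adder(a_bits[i] & b_bits[j], sums[i + 1], carries[i])
            new_sums.append(s)
            new_carries.append(c)
        new_sums.append(a_bits[n - 1] & b_bits[j])   # leftmost column has no adder
        sums, carries = new_sums, new_carries
        product.append(sums[0])
    # Final row: ripple-carry adder merging the remaining sums and carries.
    ripple = 0
    for i in range(n - 1):
        s, ripple = full_adder(sums[i + 1], carries[i], ripple)
        product.append(s)
    product.append(ripple)
    return sum(bit << k for k, bit in enumerate(product))


print(braun_multiply(0b10110, 0b01101, 5))
```

The sketch uses n(n-1) full adders in total, all of the same type, which reflects the regularity argument made in the text.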

Figure 3.3 A 5-by-5 sign-magnitude Braun array multiplier [18].

The exponent addition design is based on the idea of parallel
processing and is similar to the designs in [16,19]. An 8-bit RCA,
shown in the upper left corner of Figure 3.2, is used for exponent
addition in the FLP multiplication. An inverter, indicated by INV1, is
used to invert the sign digit. This is required as two biased numbers
added together may cause the sign digit to change. A second inverter,
indicated by INV2 on the right side of the 8-bit RCA, is to implement
the q-b operation that was described in Section 2.3.

After the results of the 8-bit RCA and the inverters are produced,
8 inverters and a 9-bit RCA are used to generate the exponent difference
of the product AxB and the addend C. The carry-out of the 9-bit RCA,
labeled S-signal, is used to indicate which number has the larger
exponent, the product AxB or the addend C. It is also used to select
the mantissa to be put into FS and the exponent to be put into SS. The
sum of the 9-bit RCA is used to control the shifter which is used to
align the mantissa. But the mantissa has only 48 bits, so any sum that
is greater than or equal to 48 would cause the shifter to output zero.

In order to reduce the number of shift operations and shifter
elements, a circuit called O148 is designed. This circuit is described
by the logic equation
0 - s s +58§

9 7
+s6§

9+57s8+5956SA+S9S6SA+S9S6SA+S9S6SA
55u+56555u° (3-1)
As described by this logic equation, if the sum is greater than 47 or
less than -48, the O148 circuit will notify the shifter to output zero
by setting O=1.
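
The behavior of this zero-forcing circuit, as opposed to its gate-level equation, can be sketched as a simple range check. The function names `force_zero` and `align_mantissa` are hypothetical, and the exponent difference is modeled as an ordinary signed integer rather than a 9-bit adder output.

```python
WIDTH = 48  # mantissa width handled by the first shifter


def force_zero(diff: int) -> int:
    """Return O = 1 when the exponent difference lies outside -48 .. +47."""
    return int(diff > 47 or diff < -48)


def align_mantissa(mantissa: int, diff: int) -> int:
    """Right-shift the smaller operand's mantissa by |diff|, or output zero."""
    if force_zero(diff):
        return 0
    return (mantissa >> abs(diff)) & ((1 << WIDTH) - 1)


print(force_zero(47), force_zero(48), force_zero(-48), force_zero(-49))
```

Forcing the output to zero avoids stepping the shifter through distances that would discard every bit of the operand anyway.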

3.3 First Shifter

The shift operation can be performed in serial or in parallel. The
serial shifter requires a clock to synchronize the shift operation, thus
it may work slower but will usually take less chip area. The serial
shifter is shown in Figure 3.4, where SR is a 48-bit shift register as
described in [8]. The 6-input OR gate is used to detect the completion
of the shift operation. A circuit, marked by Z in Figure 3.4, is used
to make the shifter output zero when O=1. This circuit consists of two
OR gates, one inverter and four AND gates. The worst case delay time is
T_INV + 2xT_AND + T_6OR + 48xT_C, where T_INV is the delay time of the inverter, T_C
is the delay time of the 6-bit up/down counter, T_6OR is the delay time
of the 6-input OR gate and T_AND is the delay time of a 2-input AND gate.
The chip area required by the elements of the serial shifter is denoted
by A_INV for an inverter, A_OR for a 2-input OR gate, A_AND for a 2-input
AND gate, A_C for a 6-bit up/down counter, A_6OR for a 6-input OR gate and
A_SR for a 1-bit shift register. The total chip area taken by the serial
shifter is then approximated by A_INV + 2xA_OR + 6xA_AND + A_C + A_6OR + 48xA_SR.

Figure 3.4 A 48-bit serial shifter.

The parallel shifter is similar to the barrel shifter used in the
OM-2 project [8], conducted by Caltech. In the OM-2 project, the barrel
shifter is designed to do rotation, that is, the number shifted out from
the most significant bit comes back to the least significant bit. But
in this design, the parallel shifter will not return the number which
has been shifted out. The parallel shifter is illustrated in Figure
3.5.

To operate a parallel shifter requires the use of a 6-to-48 decoder
to control the number of shifts. There are two types of decoders [8],
using either NOR gates (Figure 3.6a), or NAND gates (Figure 3.6b). The
NOR gate type will generate a positive logic output and the NAND gate
type will generate a negative logic output. Comparing the two indicates
that both have the same delay time but the NAND gate type will require
less chip area even though it needs inverters to convert negative logic
to positive logic at the output.

The 6-bit output from the 9-bit RCA in MA is the exponent difference of
product AxB and addend C. Due to the operation of the O148 circuit, the
difference ranges from -48 to +47. It has 96 different values yet it
has only 64 different binary forms. Therefore, the 6-to-48 decoder
consists of 64 6-input NAND gates, 15 2-input NOR gates and 33
inverters. The OR gate shown in Figure 3.7 between the 6-to-48 decoder and
the parallel shifter is the counterpart of circuit Z which is shown in
Figure 3.4.

Figure 3.5 A 4-by-4 parallel shifter.

Figure 3.6a A NOR form 2-to-4 decoder.

Figure 3.6b A NAND form 2-to-4 decoder.
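
The interplay between the decoder and the pass-transistor array can be modeled in software as a one-hot select vector driving a non-rotating shift. This is a behavioral sketch only; the function names and the LSB-first bit lists are assumptions, and the mapping of negative differences onto shift distances is not modeled.

```python
WIDTH = 48


def decode_one_hot(shift: int) -> list:
    """Model of the decoder: exactly one of the 48 select lines is high."""
    if not 0 <= shift < WIDTH:
        raise ValueError("shift distance must be in 0 .. 47")
    return [int(k == shift) for k in range(WIDTH)]


def parallel_shift_right(bits: list, select: list) -> list:
    """Non-rotating right shift of an LSB-first bit list.

    When select line k is high, output bit i is connected to input bit i + k;
    bits shifted out are dropped and zeros enter at the top.
    """
    out = [0] * WIDTH
    for k, line in enumerate(select):
        if line:
            for i in range(WIDTH - k):
                out[i] = bits[i + k]
    return out


value = 0xABCDEF123456
bits = [(value >> i) & 1 for i in range(WIDTH)]
shifted = parallel_shift_right(bits, decode_one_hot(5))
print(hex(sum(b << i for i, b in enumerate(shifted))))
```

Unlike a rotating barrel shifter, nothing wraps around, so the result equals an ordinary logical right shift.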

The total delay time for the parallel shifter is
T_6-to-48DECODER + T_OR + T_PA, where T_6-to-48DECODER is the delay time of the
6-to-48 decoder, T_OR is the delay time of a 2-input OR gate and T_PA is
the delay time of a pass transistor. This is much faster than the
serial version. The chip area required by the parallel shifter is
A_6-to-48DECODER + A_OR + 48x49xA_PA, where the chip area required by the
elements is denoted by A_6-to-48DECODER for the 6-to-48 decoder and A_PA
for a pass transistor. After detailed calculation, using the data from
Table 5.1, it is found that the parallel shifter, in fact, requires
almost the same amount of chip area as the serial shifter. Having the
benefit of faster operation time, the parallel shifter is the optimal
choice.

Figure 3.7 Block diagram of the FS.
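
The delay and area expressions for the two shifter candidates can be collected into a small cost model. The function names are hypothetical, and the numbers in the usage example are placeholders chosen only to exercise the formulas; they are not the module data of Table 5.1.

```python
def serial_shifter_delay(t_inv, t_and, t_6or, t_c, bits=48):
    """Worst case delay: T_INV + 2 T_AND + T_6OR + 48 T_C."""
    return t_inv + 2 * t_and + t_6or + bits * t_c


def serial_shifter_area(a_inv, a_or, a_and, a_c, a_6or, a_sr, bits=48):
    """Area: A_INV + 2 A_OR + 6 A_AND + A_C + A_6OR + 48 A_SR."""
    return a_inv + 2 * a_or + 6 * a_and + a_c + a_6or + bits * a_sr


def parallel_shifter_delay(t_decoder, t_or, t_pa):
    """Delay: T_decoder + T_OR + T_PA."""
    return t_decoder + t_or + t_pa


def parallel_shifter_area(a_decoder, a_or, a_pa, rows=48, cols=49):
    """Area: A_decoder + A_OR + 48 x 49 x A_PA."""
    return a_decoder + a_or + rows * cols * a_pa


# Placeholder unit values (NOT thesis data), used only to show the calls.
print(serial_shifter_delay(t_inv=1, t_and=1, t_6or=1, t_c=1))
print(parallel_shifter_delay(t_decoder=1, t_or=1, t_pa=1))
```

The 48 counter steps dominate the serial delay, while the 48x49 pass-transistor array dominates the parallel area, which is the tradeoff weighed in the text.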

3.A Addition

The addition is implemented by a sign-magnitude adder [18]. After
the addition, the sum may be a denormalized number with some leading
zeros. For normalization, a leading zero detector (LZD) [20] is used to
detect the number of zeros proceeding the first "1". A LZD for an 8-bit
input and A-bit output is shown in Figure 3.8. The logic equations

decribing the four outputs are

w0"0'1 I0'1'2l 3'0' 1 '2 3 Wh'5+'o 1 2' 3 'A' 5 '6' 7

w1"0'1'2 'o'1' 2' 3+ '0' 1 '2 3 'h' 5 '6' 7

w2"0'1772 3 'h 'o' 1 2' 3 'h'5+'o'1'2'3'u'5'6

"3"0'1'2'3'h'5'6'7' (3-2)
An LZD for a 48-bit input is shown in Figure 3.9. The block diagram
of the AD is illustrated in Figure 3.10. The addition overflow in Figure
3.10 is denoted by OV. When OV=1, the LZD will output zero and enable
the right shifter in SS.

Figure 3.8 An 8-bit leading zero detector [20].

Figure 3.9 A 48-bit leading zero detector [20].

Figure 3.10 Block diagram of the AD.
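
A sum-of-products LZD of this kind can be checked in software. The sketch below is a minimal model under stated assumptions: input 0 is taken to be the leading (most significant) bit, the four outputs form the binary count of leading zeros with w3 as the most significant output, and the helper name `first_one_at` is hypothetical. Each product term is "all earlier inputs are 0 and input k is 1".

```python
def lzd8(bits):
    """Return (w3, w2, w1, w0), the binary count of leading zeros.

    bits[0] is assumed to be the leading (most significant) input.
    """
    if len(bits) != 8:
        raise ValueError("expected 8 input bits")

    def first_one_at(k):
        # Product term: inputs 0..k-1 are all 0 and input k is 1.
        return int(all(b == 0 for b in bits[:k]) and bits[k] == 1)

    w0 = first_one_at(1) | first_one_at(3) | first_one_at(5) | first_one_at(7)
    w1 = first_one_at(2) | first_one_at(3) | first_one_at(6) | first_one_at(7)
    w2 = first_one_at(4) | first_one_at(5) | first_one_at(6) | first_one_at(7)
    w3 = int(all(b == 0 for b in bits))  # all eight inputs are zero
    return w3, w2, w1, w0


def leading_zero_count(bits):
    """Convert the four LZD outputs to an integer count."""
    w3, w2, w1, w0 = lzd8(bits)
    return 8 * w3 + 4 * w2 + 2 * w1 + w0
```

Comparing `leading_zero_count` against a direct count over all 256 input patterns is a quick way to confirm that the product terms are complete.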

3.5 Second Shifter

The shifter used here is very similar to the one used in FS except
that the FS is a right shifter and here it shifts left. From the
definition of FLP addition, the mantissa of the result is in the range
1 ≤ |m| < 4. When the mantissa is in the range 2 ≤ |m| < 4 an overflow
has occurred. A right shifter is added to normalize the mantissa when
addition overflow occurs. This right shifter is shown in Figure 3.11.
The output of the right shifter is the calculated mantissa, which is too
long for the next processor to operate upon, so some roundoff must be
done.

Two kinds of roundoff modes can be chosen, rounding to nearest or
truncation. To do rounding to nearest, we logically OR the 21 least
significant bits of the calculated mantissa, OUT (the bit that shifts
out of the 1-bit right shifter) and the output of the 01h8 circuit to
form the sticky bit. The sticky bit and the rest of the mantissa are
stored in the preliminary register for future use in RN. To do
truncation, we just truncate the 24 least significant bits of the
calculated mantissa.
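
The difference between the two roundoff modes can be illustrated with a small sketch. It is not the thesis circuit: it assumes a 48-bit calculated mantissa reduced to 24 bits, uses a single guard bit plus a sticky bit, and resolves ties to the even value, which is the usual IEEE 754 default but is an assumption here. The function name `reduce_mantissa` is hypothetical.

```python
def reduce_mantissa(mantissa, total_bits=48, keep_bits=24, mode="nearest"):
    """Reduce a `total_bits` mantissa to `keep_bits`.

    Returns (kept, carry); carry is 1 when rounding up overflows the kept
    field, which would call for a further one-bit right shift.
    """
    drop = total_bits - keep_bits
    kept = mantissa >> drop
    if mode == "truncate":
        return kept, 0
    if mode != "nearest":
        raise ValueError("unknown mode")

    discarded = mantissa & ((1 << drop) - 1)
    guard = (discarded >> (drop - 1)) & 1                    # first dropped bit
    sticky = int(discarded & ((1 << (drop - 1)) - 1) != 0)   # OR of the rest
    # Round up above the halfway point, or on a tie when the kept LSB is 1.
    if guard and (sticky or (kept & 1)):
        kept += 1
    carry = kept >> keep_bits
    return kept & ((1 << keep_bits) - 1), carry
```

The sticky bit is what lets the rounding logic tell an exact tie from a value slightly above the halfway point without keeping all of the discarded bits.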

The diagram shown in Figure 3.12 helps to explain this. The right
side of Figure 3.12 is used to manipulate the exponent. A 9-bit 2's
complement RCA is used to decrease the exponent by the number of leading
zeros. An 8-bit RCA biases the exponent and adds one when a mantissa
overflow has been detected in AD. The rest of the circuit on this side
is used to check whether an exponent overflow occurs and indicates its
sign.

Figure 3.11 A right shifter.

3.6 Roundoff

This part consists of two stages, rounding and overflow. The first
stage is rounding and, since there are two roundoff modes, RN and
truncation as mentioned before, this stage should be capable of handling
both. To do RN, the input for this stage is driven from the preliminary
register. A circuit called rounding is needed to round the input before
it proceeds to the next stage. To do truncation, the input for this
stage is the truncated output of SS and it can pass directly to the next
stage. The design of the rounding circuit is shown in Figure 3.13. The
EOD (Ending One Detector) is an LZD without input stage inverters. A new
circuit called IBO (Increment By One) is used to perform an increment by
one operation. This circuit is illustrated in Figure 3.14.

Figure 3.12 Block diagram of the SS.

The second stage is called overflow. The overflow circuit is needed
for both rounding modes. Several options are possible in the design of
this circuit. The most popular one is the saturation circuit described
in [21]. But, for the sake of design regularity, we use eight 2-to-1
multiplexers and a group of AND gates to discriminate between positive
infinity, negative infinity and a nonoverflow number. The design
of the overflow circuit is shown in Figure 3.15. After the output of
the overflow circuit is produced, the operation of inner product
calculation is completed.
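
The selection performed by the overflow stage can be sketched as follows. This is a behavioral illustration only: it assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 fraction bits), in which infinity is encoded with an all-ones exponent and a zero fraction. The function names are hypothetical, and exponent underflow is not modeled.

```python
import struct


def overflow_select(sign, exponent, fraction, overflow):
    """Pick the infinity encoding on overflow, else pass the fields through."""
    if overflow:
        # Multiplexers force the exponent to all ones; gating clears the fraction.
        return sign, 0xFF, 0
    return sign, exponent & 0xFF, fraction & 0x7FFFFF


def pack_single(sign, exponent, fraction):
    """Assemble the three fields into a Python float via the 32-bit pattern."""
    word = (sign << 31) | (exponent << 23) | fraction
    return struct.unpack(">f", struct.pack(">I", word))[0]
```

For example, `pack_single(*overflow_select(1, 200, 123, True))` evaluates to negative infinity, while a non-overflow input is passed through unchanged.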

Figure 3.13 Block diagram of the rounding circuit.

Figure 3.14 An 8-bit increment by one circuit (IBO).

Figure 3.15 Block diagram of the overflow circuit.

CHAPTER IV

DESCRIPTION OF A FLOATING-POINT DIVIDER

4.1 Introduction

The block diagram of the FLP divider, illustrated in Figure 4.1, is
composed of three subblocks. Block A, the Divider block, performs the
mantissa division, exponent subtraction and exponent overflow detection.
Block B, the Shifter block, is responsible for correcting a mantissa
underflow. Block C, the Roundoff portion, selects the correct answer
according to the previous block's data. For example, if the output of
the Shifter indicates that an overflow occurred and the sign of the
output is negative, then the Roundoff will output zero.

4.2 Divider

The convergence division algorithm [16] is chosen for mantissa
division because the number of iteration steps can be determined a priori
in terms of word size. Additionally, since this algorithm requires
mostly iterative multiplying procedures, the sign-magnitude multiplier
can again be used and thus maintain structural regularity.
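
The idea behind convergence division (often called Goldschmidt division) can be sketched numerically. This is a floating-point illustration under stated assumptions, not the thesis hardware: the divisor mantissa is assumed normalized to [1, 2), no table lookup is used to seed the iteration, and the name `convergence_divide` is hypothetical. It shows why the step count depends only on the word size.

```python
import math


def convergence_divide(n, d, bits=24):
    """Approximate n / d for a normalized divisor 1 <= d < 2."""
    if not 1.0 <= d < 2.0:
        raise ValueError("divisor mantissa must be in [1, 2)")
    # Scale so that d = 1 - e with 0 < e <= 1/2.
    n, d = n / 2.0, d / 2.0
    # After k steps d = 1 - e**(2**k), so 2**k >= bits gives `bits` of accuracy.
    steps = math.ceil(math.log2(bits))
    for _ in range(steps):
        f = 2.0 - d       # multiplying factor
        n *= f            # numerator converges to the quotient
        d *= f            # denominator converges to 1
    return n, steps
```

With `bits=24` the loop always runs five times, regardless of the operands, and each step uses only multiplications and a complement, which is why a multiplier array can be reused.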

Figure 4.1 Block diagram of the FLP divider.


A divider implementing the convergence division algorithm is
illustrated in Figure 4.2 [22]. Using only one row of multipliers, this
divider drastically reduces hardware complexity while having
approximately the same speed as previous designs [3]. The output of the
divider is only 24 bits, since an iterative multiplying procedure
can tolerate calculation error [18]. The exponent subtraction is
implemented by 8 inverters, a 2-input exclusive OR, and an 8-bit RCA
together with an inverter. The inverter is used to invert the sign
digit because subtraction of two biased numbers may change the sign
digit. The 2-input exclusive OR is used to detect the exponent
overflow.

The block diagram of the Divider is illustrated in Figure 4.3.

4.3 Shifter

When the mantissa of the divisor is greater than that of the
dividend, the quotient becomes a denormalized number. To normalize the
quotient, its mantissa must be shifted one bit to the left and the
exponent must be decreased by one.
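
This one-bit normalization can be sketched as below. The mantissa is modeled as an unsigned integer whose top bit has weight 1; the 24-bit default width and the name `normalize_quotient` are assumptions for illustration. With both operand mantissas in [1, 2) the quotient lies in (0.5, 2), so a single left shift is enough.

```python
def normalize_quotient(mantissa, exponent, width=24):
    """Left-shift the mantissa by one bit if its leading bit is 0.

    `mantissa` is an unsigned `width`-bit integer whose top bit has weight 1.
    """
    msb = (mantissa >> (width - 1)) & 1
    if msb == 0:
        mantissa = (mantissa << 1) & ((1 << width) - 1)
        exponent -= 1
    return mantissa, exponent
```

For example, a quotient mantissa of 0.75 (`0x600000` in this encoding) with exponent 10 becomes 1.5 (`0xC00000`) with exponent 9.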

The normalization is implemented by a left shifter and an 8-bit RCA
together with accessory circuits. The left shifter is the same as the
right shifter shown in Figure 3.11, except that the I/O sequence is
inverted. The left shifter is controlled by the most significant bit of
the quotient's mantissa. When the quotient's mantissa is less than one,
the most significant bit is zero, which enables the left shifter. The
8-bit RCA and the accessory circuits are used to bias and decrease the
exponent and also detect a possible exponent overflow.

The block diagram of the Shifter is illustrated in Figure 4.4.

Figure 4.2 An n-by-n convergence division algorithm based DC [22].

Figure 4.3 Block diagram of the Divider.

4.4 Roundoff

The circuit of this block is the same as the Overflow circuit used
in the FLPMAC. Thus structural regularity is maintained.

Figure 4.4 Block diagram of the Shifter.

CHAPTER V

SIMULATION DEVELOPMENT

The goal of the circuit simulation is to quantify the delay time
and chip area of the FLPMAC, divider and accessory circuits. It is also
used to quantify the problem throughput and maximum bandwidth per chip
for the systolic array structure.

The fundamental parameters manipulated in the simulation are the
lithographic linewidth λ and the matrix bandwidth. Of particular
interest is the manipulation of λ, which can track current trends in I.C.
fabrication technology. Thus, the feasibility of implementing the
proposed systolic array structure can be projected. The "λ-model", a
geometry design rule introduced by Mead and Conway, specifies the
minimum allowable values (in terms of λ) for the widths, separations,
extensions and overlaps of the diffusion, polysilicon and metal lines,
respectively. The advantage of this model is that it is dimensionless.
Therefore, as the I.C. fabrication technology advances, λ decreases and
the device geometries decrease proportionally.

The τ-model, also introduced by Mead and Conway [8], is a basic
tool for determining delay time. In the τ-model, τ is defined as the
delay time required for an electron to pass through a channel of length
L, where for small $V_{ds}$

$$\tau = \frac{L^2}{\mu V_{ds}}. \tag{5-1}$$

The proportionality constant μ is the mobility of the electron. The
inverter with a pull-up to pull-down ratio k and a gate capacitance load
$C_g$ has "down-going" and "up-going" delay times of τ and kτ.
Therefore, for an inverter with load capacitance $C_{total}$,
"down-going" and "up-going" delay times of $(C_{total}/C_g)\cdot\tau$
and $(C_{total}/C_g)\cdot k\tau$ are required, respectively.
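
The scaling of the two inverter delays with load can be written as a short helper. This is a sketch of the relation stated above; the function name `inverter_delays` and the example values are illustrative assumptions, and any consistent units may be used for the capacitances.

```python
def inverter_delays(tau, k, c_total, c_gate):
    """Return (down_going, up_going) delay for an inverter driving c_total."""
    if c_gate <= 0:
        raise ValueError("gate capacitance must be positive")
    load_factor = c_total / c_gate
    return load_factor * tau, load_factor * k * tau
```

For instance, with a pull-up to pull-down ratio of 4 and a load of three gate capacitances, the up-going delay is four times the down-going delay, and both are three times their single-load values.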

However, in a real circuit, the speed of a MOS device operation is
determined by the speed at which it is able to charge or discharge a
capacitive load [21]. Therefore, a revised τ-model, the T-model, is
used in the simulation. In the T-model, T is defined as the discharge
time for a basic inverter coupled with only one identical inverter.

A Fortran-coded simulation program was developed based on the
above models. The purpose of this program is to calculate the total
chip area and propagation delay time at the transistor level of the
proposed circuit structure. The propagation delay time includes both
active gate delays and communication path delays for the entire chip.

5.1 Chip Area Computation

The hardware design can be partitioned into three distinct parts:
input circuit, computing structure and output circuit. The design of
the I/O circuit for the array is based on the choice from several I/O
circuit candidates [3]. The input circuit, called a "data controlled"
circuit, pushes the data operands into the processing elements without
using any channel-selecting control signal. It utilizes the FIFO stack
concept and employs N sets (N is the number of bits per word) of
$N_{in}$-stage shift registers ($N_{in}$ is the number of the first
level FLPMAC's). Therefore, the total number of the 1-bit shift
registers in this circuit is $N \times N_{in}$.

The output circuit, called an "SCS" (Shift-Register Control
Sequence) circuit, is controlled by an SCS signal. The SCS signal is
stored in a one-bit shift register chain and is used to select the
proper operand output channel. The SCS circuit is composed of
$N_{out} \times N + N$ 1-bit shift registers and N buffers, where
$N_{out}$ is the number of output ports given by 8+1.
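
The component counts for the two I/O circuits follow directly from these expressions and can be tabulated with a small helper. The function names and the example word size and port counts are assumptions made for illustration.

```python
def input_shift_registers(n_bits, n_in):
    """1-bit shift registers in the data-controlled input circuit: N x N_in."""
    return n_bits * n_in


def output_circuit_parts(n_bits, n_out):
    """Return (shift_registers, buffers) for the SCS output circuit."""
    return n_out * n_bits + n_bits, n_bits
```

For a 32-bit word with six first-level FLPMAC's, the input circuit would need 192 one-bit shift registers; with nine output ports, the SCS circuit would need 320 one-bit shift registers and 32 buffers.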

The processing elements of the computing structure are regularly
connected and support only local communication. Thus, it is possible to
model each element as a functional module and tessellate these modules
into larger building blocks. Also, the fundamental building blocks
within the PE's themselves, such as the full adder, 2-to-1 multiplexers
and exclusive-OR gates, can be modeled and likewise tessellated onto the
chip.

As a result, there are only a few specific modules required to be
designed by hand and "seeded" into the simulation. Since strict
adherence to the Mead and Conway design rules often results in
sub-optimal use of the available chip area [23], other area
estimations were made by practical rules of thumb [24].

5.2 Propagation Delay Computation

Due to the localized connection of the computing and I/O structure,
the delay parameters can be simply computed. Quantification of module
delay parameters involves calculation of the active gates' propagation
time with consideration of loading and internal communication path
delay.
Even though the λ-model is basically a rule for geometry design, it
can also be used as a rule for calculating propagation delay time. The
relation between the λ-model and the T-model is that T is linearly
dependent on λ, and when λ=3 microns, T=0.6 nsec [8]. The internal
communication path problem has been minimized by using three conducting
path types, namely:

1. N+ diffusion paths for short path lengths;

2. polysilicon paths for somewhat longer runs;

3. metalization for long signal paths.

However, polysilicon lines and metal lines have unit length delays much
smaller than that of a diffusion line. The polysilicon line and metal
line delays are considered negligible [8]. By using the fact that the
transit time is 100 ns for a 10 millimeter length line [8] and taking
the length of the longest sides of a given module, the communication path
delays can be approximated.
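
Both estimates in this passage are simple proportionalities and can be sketched as below. The sketch assumes that T scales linearly through the origin from the quoted reference point, and that the delay per unit length is constant at the quoted 100 ns per 10 mm; the constant and function names are hypothetical.

```python
T_REF_NS = 0.6             # T at the reference linewidth [8]
LAMBDA_REF_UM = 3.0        # reference linewidth in microns
NS_PER_MM = 100.0 / 10.0   # 100 ns transit time over a 10 mm line


def unit_delay_ns(lambda_um):
    """Unit delay T for a given linewidth, assuming T is proportional to lambda."""
    return T_REF_NS * lambda_um / LAMBDA_REF_UM


def path_delay_ns(longest_side_mm):
    """Communication path delay estimated from a module's longest side."""
    return NS_PER_MM * longest_side_mm
```

Under these assumptions a linewidth of 0.8 microns gives a unit delay of 0.16 ns, and a module whose longest side is 0.5 mm contributes about 5 ns of path delay.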

Finally, the total module delay is the sum of active gate delay and
communication path delays. The total chip delay is the maximum segment
delay times the number of operands plus the set-up time [3]. This is
due to the pipeline nature of the systolic structure.
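
The two delay totals can be expressed as follows. This is a minimal sketch of the stated rules; the function names and the sample numbers are illustrative assumptions, and all arguments are taken to be in the same time unit.

```python
def module_delay(gate_delay, path_delay):
    """Total module delay: active gate delay plus communication path delay."""
    return gate_delay + path_delay


def chip_delay(segment_delays, n_operands, setup_time):
    """Total chip delay of a pipeline limited by its slowest segment."""
    return max(segment_delays) * n_operands + setup_time
```

In a pipeline, every operand advances at the pace of the slowest segment, so only the maximum segment delay enters the product.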


5.3 Significant Module Data

The methods for computing the chip area and propagation delay have
been discussed in the previous sections. The result for each
fundamental building block is shown in Table 5.1.


Table 5.1 Significant module data

Module Class

Inverter
2-input NAND gate
2-input NOR gate
2-input AND gate
2-input OR gate
3-input AND gate
3-input OR gate
Full-Adder
2:1 Multiplexer
Exclusive OR
Buffer
Latch
1-bit Shift Register
Pass Transistor

Area (x λ²)

11
14
20
40
40
40
26
23
23
23
13
19

x
x
x 16
x 19
x 19
x 69
x 13
x 19
x 17
x 17
x 17

Active Gate Delay

12
12
16
16
32
20
13
18

CHAPTER VI

RESULTS AND CONCLUSIONS

Using the basic modular data from Chapter 5, it is quite easy to
calculate the chip area and propagation delay times. The
interconnection of the basic modules for the FLP inner product processor
is simple and regular supporting only local communications. Therefore,
any intermodular delay and null area is considered to be negligible.
Based on this fact, a circuit simulation for the FLPMAC is developed.
The result is discussed in Section 6.1. A new VLSI systolic array
structure for band matrix triangulation, using the FLPMAC as a basic
building block, is also simulated in order to find the maximum bandwidth
per chip and structure delay time in terms of various lithographic
linewidths.

6.1 Circuit Simulation Results

The pattern resolution of optical lithography was predicted in 1977 to
be about 0.5 microns [25]. Moreover, many I.C. designers are now
looking at new and promising techniques including electron-beam and
x-ray lithography. With these new techniques, the pattern resolution
limitation is predicted to be less than 0.5 microns. For example,
linewidths of 0.125 microns have been predicted when using the
electron-beam method [26]. For these reasons, 0.8, 0.5 and 0.2 micron
linewidths have been chosen as realistic estimates for future VLSI
capabilities. These linewidths are used as quantification parameters in
the simulation.

The circuit simulation results are obtained by computing the FLPMAC
parameters with different candidates of shifter and rounding modes.
These results are illustrated in Table 6.1. Currently, the standard
I.C. chip size is commonly limited to about 1 cm², for reasons
of production yield [3]. For a FLPMAC PE using the serial shifter and
the RN mode, the chip area is about 0.01567 cm² for λ=0.8 microns. For a
FLPMAC PE using the parallel shifter and the RN mode, the chip area is
about 0.01600 cm² for λ=0.8 microns. The area of both shifters is
almost the same, but the delay time for the PE using the parallel
shifter is about one half of its counterpart. For this reason the
parallel shifter is chosen for the FLPMAC structure.
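
The area figures quoted here and in Table 6.1 appear to be consistent with chip area scaling as the square of the linewidth, which is what a dimensionless λ-based layout would imply. The sketch below checks that relation and the number of PEs that fit on a chip; the function names are hypothetical, and the calculation ignores I/O circuitry and any unused area.

```python
import math


def scale_area(area_mm2, lambda_from_um, lambda_to_um):
    """Scale an area assuming it is proportional to lambda squared."""
    return area_mm2 * (lambda_to_um / lambda_from_um) ** 2


def pes_per_chip(pe_area_cm2, chip_area_cm2=1.0):
    """Whole number of PEs that fit in the given chip area."""
    return math.floor(chip_area_cm2 / pe_area_cm2)
```

Scaling the 1.600 mm² parallel-shifter RN design from 0.8 to 0.5 microns gives 0.625 mm², matching the tabulated value, and a PE area of 0.01600 cm² allows 62 PEs on a 1 cm² chip.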

Different rounding modes also generate different results. By using
the truncation mode, one can save about 4.5 % of the chip area, and an
8.5 % propagation delay time savings is realized.

The circuit simulation results of the new array are graphically
illustrated in Figures 6.1 and 6.2. These results can be compared with
the results obtained from previous array designs which have handled only
integer operands and included additional circuitry for pre- and
post-adjustment of the operands. This is illustrated in Table 6.2. It
shows that the bandwidth reduction in the new array is about 14 % of that
of the previous array and the propagation delay increases by only
0.20 %. Moreover, it should be mentioned that the new array is capable
of handling both FLP and integer numbers.

Table 6.1 Simulation area and time result for
FLPMAC design with different options.

Design Option                 λ       Chip area    Delay time
(Shifter + Rounding mode)     (μm)    (mm²)        (ns)

Serial + RN                   0.8     1.567        1457
                              0.5     0.612         912
                              0.2     0.098         368
Serial + Truncation           0.8     1.488        1398
                              0.5     0.581         876
                              0.2     0.093         353
Parallel + RN                 0.8     1.600         656
                              0.5     0.625         411
                              0.2     0.100         165
Parallel + Truncation         0.8     1.521         598
                              0.5     0.594         374
                              0.2     0.095         151

Figure 6.1 Entire chip edge size versus matrix bandwidth
for 0.8, 0.5, and 0.2 micron linewidths (32 bits per word).

Figure 6.2 Entire chip propagation delay versus matrix bandwidth
for 0.8, 0.5, and 0.2 micron linewidths (32 bits per word).

Table 6.2 Comparison of area and time
results of different arrays.

                    Array for FXP operands    Array using truncation
                                              for FLP operands
Bandwidth   λ       time        area          time        area
            (μm)    (ns)        (cm²)         (ns)        (cm²)

12          0.8     338046.7    2.053         338718.6    2.683
            0.5     211279.2    0.801         211699.1    1.048
            0.2      84511.7    0.128          84679.6    0.168
10          0.8     281705.6    1.483         282268.6    1.928
            0.5     176066.0    0.579         176417.9    0.753
            0.2      70426.4    0.093          70567.2    0.120
8           0.8     225364.5    1.005         225818.7    1.296
            0.5     140852.8    0.392         141136.7    0.506
            0.2      56341.1    0.063          56454.7    0.081
6           0.8     169023.4    0.617         169368.8    0.787
            0.5     105639.6    0.241         105855.5    0.307
            0.2      42255.8    0.039          42342.2    0.049

6.2 Conclusion

In this thesis, a FLP inner product PE is designed and evaluated.
The operands and resultants for this PE are expressed by using the
IEEE-754 standard. The chip area, propagation delay and maximum
bandwidth per chip are calculated through a circuit simulation. The
output of the simulation shows that up to 62 FLPMAC PE's can be
fabricated on a 1 cm² chip for λ=0.8 microns and the delay time for a PE
is about 0.6 μs at this linewidth. The maximum bandwidth per 1 cm² chip
for the new VLSI systolic array is 6 for λ=0.8 microns and 10 for
λ=0.5 microns.
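
The maximum bandwidth per chip can be read off the simulated areas by keeping the largest bandwidth whose array still fits within the chip limit. The sketch below assumes that the smaller figure in each pair of Table 6.2 columns for the FLP array is the chip area in cm²; the dictionary `AREA_CM2` and the function `max_bandwidth` are illustrative names, and only the tabulated bandwidths (6, 8, 10, 12) are considered.

```python
# Chip area in cm^2 by bandwidth, as read from Table 6.2 (FLP array, truncation).
AREA_CM2 = {
    0.8: {6: 0.787, 8: 1.296, 10: 1.928, 12: 2.683},
    0.5: {6: 0.307, 8: 0.506, 10: 0.753, 12: 1.048},
}


def max_bandwidth(linewidth_um, chip_limit_cm2=1.0):
    """Largest tabulated bandwidth whose array area fits on one chip."""
    fitting = [b for b, a in AREA_CM2[linewidth_um].items() if a <= chip_limit_cm2]
    return max(fitting) if fitting else None
```

With a 1 cm² limit this selection gives a bandwidth of 6 at 0.8 microns and 10 at 0.5 microns, in line with the figures stated above.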

The design of the floating-point inner product step processor has
been completed. However, there are still some remaining topics worthy
of future research.

1. In this thesis, the total chip area and propagation delay for
   the FLPMAC are all estimated values. The exact values can be
   obtained by using a CAD system.

2. Since most of this thesis concentrates on the design of the
   FLPMAC, a further study of the FLP divider is needed for further
   improvement of the VLSI systolic array.

3. The maximum bandwidth for the VLSI systolic array per 1 cm² chip
   is finite. When the required bandwidth is larger than this value,
   expansion of the array must be realized in a modular fashion on
   more than one chip.

As I.C. technology advances, new design rules and design
methodologies will emerge. With these advances will come the ability to
fabricate the actual circuits developed in this thesis. At that time,
these new methodologies will also help to further the concepts of
VLSI dedicated computing structures by enabling rapid design and
verification in an automated CAD environment.

REFERENCES

1.  Ciminiera, L. and Serra, A., "Arithmetic Array for Fast Inner
    Product Evaluation," Proc. 5th Symposium on Computer Arithmetic
    (May 1981), pp. 207-214.

2.  Rutenbar, R. A. and Park, Y. E., "Case Study of A VLSI Design
    Project: A Simple Inner Product Machine," Proc. 5th Symposium
    on Computer Arithmetic (May 1981), pp. 184-189.

3.  Hsu, W. C. and Shanblatt, M. A., Evaluation of A Single VLSI
    Chip Algorithm for Triangulating Large Band Form Matrices,
    Tech. Report No. MSU-ENGR 82-015, Michigan State University,
    East Lansing, Michigan (August 1982).

4.  Coonen, J. T., "An Implementation Guide to a Proposed Standard
    for Floating-Point Arithmetic," Computer (January 1980), pp.
    68-79.

5.  Stevenson, D., "A Proposed Standard for Binary Floating-Point
    Arithmetic, Draft 8.0 of IEEE Task P754," Computer (March
    1981), pp. 51-62.

6.  Waser, S., "Hardware Alternative for Floating-Point
    Processors," Proc. IEEE Int'l Micro and Minicomputer Conf.
    (1979), pp. 144-151.

7.  Intel Corporation, "iAPX 86,88 User's Manual," Intel
    Corporation, Santa Clara, CA (July 1981), pp. S-89-S-113.

8.  Mead, C. and Conway, L., Introduction to VLSI Systems,
    Addison-Wesley Pub. Co., Reading, Massachusetts (1980), pp.
    1-288.

9.  Alway, G. G. and Martin, D. W., "An Algorithm for Reducing the
    Bandwidth of a Matrix of Symmetric Configuration," Computer
    Journal (August 1965), pp. 264-272.

10. Cuthill, E. and McKee, J., "Reducing the Bandwidth of Sparse
    Symmetric Matrices," Proc. 24th National Conf. ACM, Brandon
    System Press, New Jersey (1969), pp. 157-172.

11. Gibbs, N. E., Poole, W. G. Jr. and Stockmeyer, P. K., "An
    Algorithm for Reducing the Bandwidth and Profile of a Sparse
    Matrix," SIAM J. Numer. Anal., Vol. 13 (1976), pp. 236-250.

12. Kung, H. T. and Leiserson, C. E., "Algorithms for VLSI
    Processor Array," Symposium on Sparse Matrix Computations,
    Knoxville (1978).

13. Hwang, K. and Cheng, Y-H, "VLSI Computing Structures for Solving
    Large-Scale Linear Systems of Equation," Proc. 1980 Int'l Conf.
    on Parallel Processing (August 1980), pp. 217-230.

14. Anderson, S. F. et al., "The IBM System 360/Model 91:
    Floating-Point Execution Unit," IBM Journal (January 1967), pp.
    35-53.

15. Wallace, C. S., "A Suggestion for a Fast Multiplier," IEEE
    Trans. on Electronic Computers (February 1964), pp. 14-17.

16. Dadda, L., "Some Schemes for Parallel Multipliers," Alta
    Frequenza, Vol. 34 (1965), pp. 349-356.

17. Reusens, P., Ku, W. H. and Mao, Y-H, "Fixed-Point High-Speed
    Parallel Multipliers in VLSI," CMU Conf. on VLSI Systems and
    Computations (October 1981), pp. 301-310.

18. Hwang, Kai, Computer Arithmetic, John Wiley and Sons Inc., New
    York (1979), pp. 84-254.

19. Corinthios, M., Fortier, M., Geadah, Y. and Prussel, M., "A
    Floating-Point Computer for Generalized Spectral Analysis,"
    Proc. of the Int'l Symp. on Mini and Micro Computer (1976),
    pp. 31-36.

20. Chang, T. L. and Fisher, P. D., "High-Speed Normalization and
    Rounding Circuits for VLSI Floating-Point Processors," Proc.
    IEEE Int'l Conf. on Circuits and Computers (1980), pp. 512-516.

21. Taub, H. and Schilling, D., Digital Integrated Electronics,
    McGraw-Hill Inc. (1977), pp. 35-53 and pp. 381-383.

22. Leung, Y.-Y. J. and Shanblatt, M. A., A VLSI Systolic Array for
    Matrix Triangulation in Load Flow Analysis, Tech. Report No.
    MSU-ENGR 83-003, Michigan State University, East Lansing,
    Michigan (January 1983).

23. LaBrecque, M., "Fast Switches, Small Wires, Larger Chips," NSF
    MOSAIC (January/February 1982).

24. Personal Communication with Dr. Donnie K. Reinhard, Department
    of Electrical Engineering and Systems Science, Michigan State
    University (October 1983).

25. Keyes, R. W., "Physical Limits in Semiconductor Electronics,"
    Science, Vol. 195 (March 1977), pp. 1230-1235.

26. Eidson, J. C., "Fast Electron-Beam Lithography," IEEE Spectrum
    (July 1981), pp. 24-28.
