This is to certify that the thesis entitled

BINARY TREE STRUCTURES FOR MATRIX MULTIPLICATION

presented by Abdullah Celik Erdal has been accepted towards fulfillment of the requirements for the M.S. degree in Electrical Engineering.

Major professor

Date: February 25, 1983

Michigan State University

BINARY TREE STRUCTURES FOR MATRIX MULTIPLICATION

By

Abdullah Celik Erdal

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Electrical Engineering and Systems Science

1983

ABSTRACT

BINARY TREE STRUCTURES FOR MATRIX MULTIPLICATION

By Abdullah Celik Erdal

Multiple arithmetic processors are interconnected to form a binary tree structure and then utilized as a high-performance special-purpose co-processor. Such co-processors may be incorporated into a digital signal or image processing system to perform such tasks as matrix multiplication, which is routinely required to compute convolutions or discrete Fourier transforms. Specifically, this research shows that any nxn dense matrix multiplication can be performed in (2n + log(n)) time steps by using n(2n - 1) processing elements, where each processing element is designed to be either a multiplier or an adder with some additional registers.

To demonstrate the utility of this binary tree structure, implementations of the convolution and the discrete Fourier transform operations are presented. This binary tree structure may also be applied to other digital image and signal processing algorithms, thus establishing the flexibility, generality, and cost effectiveness of this structure.
To My Parents

ACKNOWLEDGMENTS

I wish to express my sincere thanks and appreciation to Dr. P.D. Fisher for his constant guidance, continuous encouragement, and valuable suggestions throughout this study. Gratitude is also expressed to the members of my thesis committee, Dr. M.A. Shanblatt and Dr. R.G. Reynolds, for their comments and suggestions.

Above all, I especially wish to thank my wife, Nil, for her endless encouragement, patience, and understanding throughout the course of this study.

This research was supported in part by NSF under Grant Number MCS79-09216.

TABLE OF CONTENTS

Chapter
I. INTRODUCTION
II. BACKGROUND
   2.1 Systolic Arrays
   2.2 VLSI Cellular Arithmetic Arrays
   2.3 Discussion
III. AN ALGORITHM FOR MATRIX MULTIPLICATION
   3.1 Functional and Structural Description
   3.2 An Algorithm for Dense Matrix Multiplication on Trees
   3.3 Discussion and Performance Analysis
IV. APPLICATIONS OF THE TREE STRUCTURES
   4.1 Convolution
   4.2 Discrete Fourier Transform (DFT)
   4.3 The Design of a Special-Purpose Device for 1-D Convolution and DFT
V. CONCLUSION
   5.1 Summary
   5.2 Further Research
REFERENCES

LIST OF FIGURES

2.1 Geometries for the inner product step processor [2]
2.2 Mesh-connected systolic arrays [2]
2.3 Multiplication of a vector by a band matrix with p = 2 and q = 3 [2]
2.4 The linearly connected systolic array for the vector multiplication problem in Figure 2.3 [2]
2.5 Band matrix multiplication [2]
2.6 The hex-connected systolic array for the matrix multiplication problem in Figure 2.5 [2]
2.7 The additive multiplication cell [5]
2.8 3x3 square matrix multiplication
2.9 VLSI array for pipelined multiplication of two 3x3 dense matrices [5]
3.1 A processing element (PE)
3.2 The tree structure
3.3 The tree structure with PE's and data flow shown
3.4 Matrices A, B, and C where AxB = C
3.5a Individual elements of the matrix C
3.5b Redefined version of the first column of matrix C
3.6a Time step 4
3.6b Time step 5
3.6c Time step 6
3.6d Time step 7
3.6e Time step 8
3.6f Time step 9
3.6g Time step 10
3.7 Computation table
3.8 Number of required processing elements vs. N
3.9 Total computation time of an NxN dense matrix
3.10a Number of required processing elements vs. N
3.10b Number of required processing elements vs. N
3.11a Total computation time of an NxN dense matrix
3.11b Total computation time of an NxN dense matrix
3.11c Total computation time of an NxN dense matrix
3.12a Speedup vs. N
3.12b Speedup vs. N
3.13a Percent efficiency vs. N
3.13b Percent efficiency vs. N
4.1 Noncyclic convolution for N=4
4.2a 1-D convolution tree
4.2b Realization of 1-D convolution
4.3 DFT for N=4
4.4 One-dimensional DFT trees (real or imaginary part of the overall 1-D DFT design)
4.5 Two-dimensional DFT tree for N=2
4.6 Block diagram of MPE and APE
4.7 Block diagram of a possible single-chip implementation of the tree structure ("Tree-IC")
4.8 Tree-structured 1-D convolution for N=16, and 1-D DFT system for N=8

CHAPTER I
INTRODUCTION

In many areas of computer application, such as image and signal processing, the quality of the answer the computer returns is proportional to the amount of computation performed [1]. General-purpose computers have commonly been used for these applications; however, the high computational throughput and data rates demanded by these applications make conventional computers inefficient and thus impractical for many contemporary applications.

Advances in the design and fabrication of Very Large Scale Integrated (VLSI) circuits will soon make it feasible to implement computers consisting of tens or even hundreds of thousands of computing elements. For this reason, the tendency is to design cost-effective, high-performance, special-purpose devices that meet specific application requirements by employing many processing elements. Many special-purpose devices for applications that require highly parallel computing have been proposed and built. Systolic arrays, introduced by Kung [2-4], and the VLSI computing structures, introduced by Hwang [5], are such devices. Both consist of a set of interconnected cells, each capable of performing some simple operation, where each cell rhythmically computes and passes data through the system. They have simple and regular communication paths and are highly concurrent modular systems. Both authors emphasize procedures for solving linear systems of equations and some matrix computations, such as L-U decomposition and matrix multiplication. These computations play a vital role in numerous computer applications which require high-speed computer performance.
For example, matrix multiplication is an important image and signal processing operation, and it is the subject of this research. This research develops and investigates a high-performance special-purpose device for matrix multiplication, which works as a co-processor attached to a host computer. It capitalizes on the properties of VLSI to achieve high throughput rates and high efficiency. Although the algorithm and the structure conform to the restrictions of VLSI technology, they can be easily implemented with pre-VLSI technology without significant performance degradation. A binary tree structure is used to obtain a simple, regular, and short communication geometry, which is considered one of the most desired attributes of a VLSI implementation. One of the main advantages of using the binary tree structure over the hexagonally connected and mesh-connected structures proposed by Kung [2-4] and Hwang [5], respectively, is that a tree structure uses far fewer connections between the processing elements; also, the longest distance any data must travel is considerably less than in any of the other schemes, leading to higher speeds. Furthermore, only two different types of processing elements, namely a multiplier and an adder, are used, so that a simple, regular, and modular design yields cost-effective special-purpose systems.

There are many special- and general-purpose machines that employ binary tree-structured interconnections between the processing elements. For example, the "Inner Product Computer" discussed in a report by Swartzlander [6] is a special-purpose computational unit intended to be used as a co-processor to compute the inner product of complex vectors by using binary tree structures, which are very similar to the tree-structured arrays investigated in this research. The "X-Tree," a University of California, Berkeley, project [7], is a tree-structured general-purpose multi-processor computer architecture.
It has the hierarchical structure of a binary tree, but extra links are employed between the nodes to enhance fault tolerance and distribute message traffic uniformly [7]. The California Institute of Technology "Tree Machine," based on Browning's doctoral dissertation [8], is also a binary tree-structured multiprocessor system for general-purpose applications. However, this machine does not have the extra links between its processors as in the case of "X-Tree." Both of these machines are general-purpose computers.

Background information is presented in Chapter II in order to review some alternative highly parallel computing structures for matrix multiplication. Some examples are also given. The third chapter of this research introduces the proposed matrix multiplication algorithm and the structure of the binary tree, where each node of the tree represents a processing element. The chapter closes with an example matrix multiplication and compares the proposed algorithm with the two algorithms introduced in Chapter II on the basis of speedup and efficiency. The fourth chapter gives application examples for the proposed architecture and describes the processing elements, chips, and the overall system architecture. Lastly, in the concluding chapter, some of the advantages and drawbacks of this architecture are delineated. Possible extensions of this work for future research are also included.

CHAPTER II
BACKGROUND

The purpose of this chapter is to review the matrix multiplication algorithms of the two multiple special-purpose functional units introduced by Kung [2-4] and Hwang [5]. These units employ many tightly interconnected elements and are designed to be used in a highly parallel computing environment for applications such as signal and image processing. Systolic arrays and cellular arithmetic arrays are introduced in Section 2.1 and Section 2.2, respectively. Sample matrix multiplication algorithms, which may be implemented on these arrays, are also included.
The chapter concludes with a discussion of some of the negative aspects of these algorithms.

2.1 Systolic Arrays

In a systolic system, data flows from the computer memory in a rhythmic fashion, passing through many tightly connected processing elements in a co-processor before returning to memory. Each processing element, called an inner product step processor, performs a single operation, namely the inner product step, C ← C + A × B. Figure 2.1 shows two types of geometries for this processing element (PE).

Each PE has three registers R_A, R_B, and R_C, and each register has two connections, one for input and one for output. In each time step, where one step is defined as the interval of time for any one of the PE's to complete its computation, the PE shifts the data on its input lines, denoted by A, B, and C, into R_A, R_B, and R_C, respectively, computes

R_C ← R_C + R_A × R_B,

and makes the input values of R_A and R_B, together with the new value of R_C, available as outputs on the output lines A, B, and C, respectively [2]. Each processing element is connected to its neighboring PE's as shown in Figure 2.2. Using these different systolic arrays, it is possible to design systolic systems or systolic co-processors for such computations as band matrix multiplication, L-U decomposition, and the solution of triangular linear systems [2-4].

For example, consider the problem of multiplying a matrix A = (a_ij) by a vector x = (x_1, ..., x_n) (see Figure 2.3). The product y = (y_1, ..., y_n) can be computed by the following recurrences [2]:

y_i^(1) = 0,
y_i^(k+1) = y_i^(k) + a_ik x_k,
y_i = y_i^(n+1).

If A is an nxn band matrix with bandwidth w = p + q - 1 (see Figure 2.4 for the case p = 2 and q = 3), then the above recurrences can be evaluated by pipelining the x_i and y_i through a linearly connected systolic array consisting of w inner product step PE's in (2n + w) time steps [2].
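The recurrences above can be checked with a direct sequential simulation. This is a sketch of the arithmetic only, not of the systolic pipelining itself, and the function name is illustrative:

```python
# Direct evaluation of the recurrences y_i^(1) = 0,
# y_i^(k+1) = y_i^(k) + a_ik * x_k, y_i = y_i^(n+1); each addition of an
# a_ik * x_k term corresponds to one inner product step in the array.
def matvec_by_recurrence(A, x):
    n = len(x)
    y = [0] * n
    for i in range(n):
        for k in range(n):
            y[i] = y[i] + A[i][k] * x[k]   # one inner product step
    return y

print(matvec_by_recurrence([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```

In the systolic version, the w PE's evaluate these same steps concurrently as the x_i and y_i stream through the array.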
Figure 2-2: Mesh-connected systolic arrays [2].

Figure 2-3: Multiplication of a vector by a band matrix with p = 2 and q = 3 [2].

Figure 2-4: The linearly connected systolic array for the vector multiplication problem in Figure 2-3 [2].

Another example is the multiplication of two nxn band matrices with bandwidths w_1 and w_2. The matrix product C = (c_ij) of A = (a_ij) and B = (b_ij) can be computed by the following recurrences [2]:

c_ij^(1) = 0,
c_ij^(k+1) = c_ij^(k) + a_ik b_kj,
c_ij = c_ij^(n+1).

Band matrices A, B, and C, shown in Figure 2.5, are pipelined through a systolic array of (w_1 × w_2) hex-connected PE's. The interconnection network and data flow are shown in Figure 2.6. Each c_ij is able to accumulate all of its terms before it passes through the upper boundaries. If A and B are nxn band matrices with bandwidths w_1 and w_2, respectively, then a systolic array of (w_1 × w_2) hex-connected PE's can compute the result C in [3n + min(w_1, w_2)] time steps. If A and B are nxn dense matrices, then (3n^2 - 3n + 1) hex-connected PE's can compute the result C in 5(n - 1) time steps [4].

As described, the systolic array architectures provide the capability for realizing a number of important matrix operations. In addition to achieving a high computational rate by means of pipelining and concurrent computation, these arrays are characterized by a simple, regular, and short communication geometry.

2.2 VLSI Cellular Arithmetic Arrays

Processing elements, as in the case of systolic arrays, compute the inner product step, C ← C + A × B.

Figure 2-5: Band matrix multiplication [2].
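The PE and time-step counts quoted above for the hex-connected arrays can be collected into two small helpers; the function names are mine, not from [2-4]:

```python
# Resource counts for Kung's hex-connected systolic arrays [2-4]:
# band matrices of bandwidths w1, w2 use w1 * w2 PE's in 3n + min(w1, w2)
# time steps; dense nxn matrices use 3n^2 - 3n + 1 PE's in 5(n - 1) steps.
def hex_band_resources(n, w1, w2):
    return w1 * w2, 3 * n + min(w1, w2)

def hex_dense_resources(n):
    return 3 * n * n - 3 * n + 1, 5 * (n - 1)

print(hex_dense_resources(4))  # (37, 15)
```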
Figure 2-6: The hex-connected systolic array for the matrix multiplication problem in Figure 2-5 [2].

Figure 2-7: The additive multiplication cell [5].

Each cell has three inputs and three outputs, as shown in Figure 2.7. Although it is not specified in [5], presumably each processing element has a set of three registers associated with the input and output lines A, B, and C. Each processing element is connected to its neighboring PE's to form a short communication path.

Consider the problem of multiplying two nxn dense matrices A = (a_ij) and B = (b_ij) (see Figure 2.8 for n = 3), where the product coefficients are

c_ij = Σ_{k=1}^{n} a_ik b_kj, for all i and j [5].

The elements of matrices A and B are fed from the lower and upper input lines in a pipelined fashion, one skewed row or one skewed column at a time, as in Figure 2.9 [5]. Some dummy zero inputs are interspersed with the matrix elements to ensure correct results. Multiplying two nxn dense matrices requires n(2n - 1) processing elements and takes (4n - 2) time steps to produce the product matrix C = (c_ij).

Figure 2-8: 3x3 square matrix multiplication.

2.3 Discussion

All the PE's for both algorithms are kept busy all the time, either by performing the inner product computation or by simply passing data to their neighbors. Other computations, such as matrix inversion and L-U decomposition, can also be implemented by using one or two different PE's and different array architectures. Both of these VLSI arrays are expandable to allow modular growth. However, they both have some drawbacks. For example, they need data skewing, and control of every input and output datum, in order to accomplish their correct timing.
Figure 2-9: VLSI array for pipelined multiplication of two 3x3 dense matrices [5].

This increases the circuit complexity and introduces additional delays. One other negative aspect of these arrays is that, to accomplish the multiple use of every element of matrices A and B, the result of each processing element, as well as the elements of A and B, must be pipelined through the array. This requires more interconnection lines between the processing elements, and hence more chip area; the number of required registers also increases for the same reason.

CHAPTER III
AN ALGORITHM FOR MATRIX MULTIPLICATION

This chapter provides the requisite groundwork and the algorithm for the development of a highly parallel computing machine. Basically, this machine is a tree-structured array and may be classified as a multiple special-purpose functional unit. First, the processing elements are defined and the basic tree structure is outlined in Section 3.1. In the next section, a new matrix multiplication algorithm is introduced, and a sample matrix multiplication is provided. Section 3.3 concludes with a discussion of the space-time complexity of this tree structure and compares its speedup and efficiency with those of the structures reviewed in Chapter II.

3.1 Functional and Structural Description

* Processing Elements (PE)--As mentioned earlier, there are two types of processing elements, namely, a multiplier and an adder. Each of these PE's has three connection lines A, B, and C, as shown in Figure 3.1, where lines A and B are the input lines and line C is the output line. Each PE performs only a single operation, either C ← A × B or C ← A + B.

Figure 3-1: A processing element (PE).
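The two PE types can be modeled behaviorally in a few lines; the class and attribute names are illustrative, and only the latched-output behavior matters:

```python
# Behavioral model of a tree-structure PE: a multiplier computes
# C <- A x B, an adder computes C <- A + B, and the result is latched
# into the output register R_C at the end of the time step.
class PE:
    def __init__(self, op):
        self.op = op      # 'mul' for a multiplier PE, 'add' for an adder PE
        self.rc = 0       # output register R_C, driving output line C

    def step(self, a, b):
        self.rc = a * b if self.op == 'mul' else a + b
        return self.rc

leaf = PE('mul')
root = PE('add')
print(root.step(leaf.step(2, 3), 4))  # 10
```

Because each result is latched into R_C, chaining PE's as in the figure forms a clocked pipeline rather than a combinational path.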
In each time step, a processing element takes the data on its input lines A and B, performs its operation, and latches its result into register R_C, where R_C is directly connected to the output line C. All outputs are latched and the logic is clocked, so that when one PE is connected to another, the changing output of one during a time step will not interfere with the input of the other during that time step.

* Tree Structure--The tree structure, shown in Figure 3.2, has seven nodes and three levels. The first node is called the root node, and it constitutes level one of the binary tree. Nodes 4, 5, 6, and 7 are called leaf nodes or input nodes. This tree structure is redrawn in Figure 3.3 to show the actual data flow. As shown in this figure, data is pipelined downward from the leaf nodes to the root node, where the output of the root node is the final result of the computation. Only the input nodes (leaf nodes) are multipliers; the rest of the nodes are all adders.

3.2 An Algorithm for Dense Matrix Multiplication on Trees

Let A = (a_ij), B = (b_ij), and C = (c_ij) be nxn dense matrices, where C is the product, which can be computed in (1 + log n) time steps by using n^2(2n - 1) PE's and by employing the type of tree structures shown in Figure 3.3. (Note: the logarithmic function will be used in base two throughout this paper.) All the rows of matrix A and all the columns of matrix B are multiplied together in one time step to produce all the partial products of the resultant matrix C. Then the partial products are added in (log n) time steps to obtain the final result. Consequently, a very high speedup is obtained over the other algorithms. However, this scheme requires too many PE's and a very large I/O bandwidth even for very small matrix multiplications.

Figure 3-2: The tree structure.

Figure 3-3: The tree structure with PE's and data flow shown.
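The fully parallel scheme can be sketched as follows: each c_ij has its own (2n - 1)-PE tree, so one multiply step at the leaves plus log n addition levels yields the (1 + log n)-step figure. The function names are illustrative:

```python
import math

# One c_ij on its own (2n - 1)-PE tree: the n leaf multipliers fire in a
# single time step, then the n - 1 adders merge the partial products in
# log2(n) levels, one level per time step.
def c_ij_by_tree(row, col):
    partial = [a * b for a, b in zip(row, col)]        # step 1: multiply
    while len(partial) > 1:                            # one step per level
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
    return partial[0]

def fully_parallel_steps(n):
    return 1 + int(math.log2(n))

print(c_ij_by_tree([1, 2, 3, 4], [5, 6, 7, 8]))  # 70
print(fully_parallel_steps(4))                   # 3
```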
Since this would be an unrealistic assumption, a different approach will be taken to obtain high speedup with many fewer PE's. In the remainder of this paper, it is assumed that A, B, and C are all nxn dense matrices, where n = 2^k for k = 1, 2, 3, ..., although this is not an essential assumption. With this assumption, we can define some terms that will be used throughout this paper as follows:

p = total number of PE's that will be used for a given nxn matrix multiplication.
t = total amount of time it takes for a given nxn matrix multiplication to be computed.

It will be shown in this section that

p = n(2n - 1); t = (2n + log n); where n = 2^k for k = 1, 2, 3, ...

for nxn dense matrix multiplication.

The first modification needed is to reduce the number of input lines for the input nodes (multipliers) from two to one. This will not only reduce the I/O bandwidth requirements, but also reduce the number of pins for a chip implementation. However, the price we pay is a reduction in overall speed. The second modification is to include one extra register, R_B, in each multiplier to use as a buffer holding an element of the matrix B. This register will be implemented inside the multiplier PE. Finally, it will be assumed that n-way memory interleaving is incorporated in the host computer. With this memory model, one word can be fetched from each of the n memory units at once in one clock period, so the effective memory bandwidth is n [9].

With all the given modifications and assumptions, it is now possible to load all the elements of the matrix B into the R_B registers of the n^2 multiplier PE's in n time steps, one column at a time, where each column has n elements. Note that if the I/O bandwidth is larger, more data can be accessed in one time step.
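The claimed resource figures follow directly from this construction and can be tabulated in a few lines; the function name is mine:

```python
import math

# For n = 2^k, the modified scheme uses p = n(2n - 1) PE's and finishes in
# t = 2n + log2(n) time steps: n steps to load B, n steps to pipeline the
# rows of A, and log2(n) steps to drain the adder trees.
def tree_resources(n):
    assert n >= 2 and n & (n - 1) == 0, "n must be a power of two"
    return n * (2 * n - 1), 2 * n + int(math.log2(n))

print(tree_resources(4))   # (28, 10)
print(tree_resources(32))  # (2016, 69)
```

For the 4x4 example developed below, this gives 28 PE's and 10 time steps, matching the last snapshot in Figure 3.6 (time step 10).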
Also, it is assumed that memory access time is equal to one time step, although it may be possible to access the memory more than once per time step, which would speed up the overall computation. Furthermore, for simplicity, it will be assumed that all PE's, multiplier or adder, have equal computation times.

Consider a 4x4 matrix multiplication example to illustrate this algorithm, where matrices A, B, and C are shown in Figure 3.4. Individual elements of the matrix C are shown in Figure 3.5a. Each column vector will be computed separately. For this reason, n Column Vector Computation Units (abbreviated CVCU) will be used, where each CVCU contains a total of (2n - 1) PE's, bringing the total number of PE's to n(2n - 1). These (2n - 1) PE's in each CVCU consist of n multiplier PE's and (n - 1) adder PE's. Specifically, for this example, 4 CVCU's have been used. Each CVCU contains 7 PE's, of which 4 are multiplier PE's and 3 are adder PE's. First consider CVCU-I, which produces column vector I of the resultant matrix. As shown in Figure 3.5a, the equations for c_11, c_21, c_31, and c_41 have four elements in common, namely b_11, b_21, b_31, and b_41, so these common elements are first loaded into the R_B registers of CVCU-I, as shown in Figure 3.6a.

Figure 3-4: Matrices A, B, and C where A × B = C.

Figure 3-5a: Individual elements of the matrix C.

Column I:
c_11 = a_11 b_11 + a_12 b_21 + a_13 b_31 + a_14 b_41
c_21 = a_21 b_11 + a_22 b_21 + a_23 b_31 + a_24 b_41
c_31 = a_31 b_11 + a_32 b_21 + a_33 b_31 + a_34 b_41
c_41 = a_41 b_11 + a_42 b_21 + a_43 b_31 + a_44 b_41

Column II:
c_12 = a_11 b_12 + a_12 b_22 + a_13 b_32 + a_14 b_42
c_22 = a_21 b_12 + a_22 b_22 + a_23 b_32 + a_24 b_42
c_32 = a_31 b_12 + a_32 b_22 + a_33 b_32 + a_34 b_42
c_42 = a_41 b_12 + a_42 b_22 + a_43 b_32 + a_44 b_42

Column III:
c_13 = a_11 b_13 + a_12 b_23 + a_13 b_33 + a_14 b_43
c_23 = a_21 b_13 + a_22 b_23 + a_23 b_33 + a_24 b_43
c_33 = a_31 b_13 + a_32 b_23 + a_33 b_33 + a_34 b_43
c_43 = a_41 b_13 + a_42 b_23 + a_43 b_33 + a_44 b_43

Column IV:
c_14 = a_11 b_14 + a_12 b_24 + a_13 b_34 + a_14 b_44
c_24 = a_21 b_14 + a_22 b_24 + a_23 b_34 + a_24 b_44
c_34 = a_31 b_14 + a_32 b_24 + a_33 b_34 + a_34 b_44
c_44 = a_41 b_14 + a_42 b_24 + a_43 b_34 + a_44 b_44
Figure 3-5b: Redefined version of the first column of matrix C, where x_ik denotes the partial product a_ik b_k1:

c_11 = x_11 + x_12 + x_13 + x_14
c_21 = x_21 + x_22 + x_23 + x_24
c_31 = x_31 + x_32 + x_33 + x_34
c_41 = x_41 + x_42 + x_43 + x_44

Also, for simplicity, c_11, c_21, c_31, and c_41 are redefined as shown in Figure 3.5b. At time steps 5, 6, 7, and 8, matrix A is pipelined in one row at a time, where each row has n elements, and for an nxn dense matrix there are n rows. So it takes n time steps to pipeline all the rows of an nxn dense matrix. After the pipelining of the last row, it takes (log n) time steps to merge the partial results of each PE. At each time step, each PE takes its operands, operates on them, and stores the result into R_C. Like a systolic array, each PE takes data in and pumps data out at every time step. Only CVCU-I is shown in Figure 3.6; CVCU's II, III, and IV perform the same computations at time steps 1 through 10. Note that the same row elements are needed for the computations in all the CVCU's. For example, at time step 5, the first row of matrix A is input to CVCU-I, -II, -III, and -IV simultaneously. At time steps 6, 7, and 8, rows II, III, and IV of matrix A are pipelined through every CVCU, respectively. From this it is concluded that only n elements are needed at each time step to carry out all the computations, as shown in Figure 3.7.

Note that band matrix multiplications and matrix-vector multiplications can easily be implemented on these tree structures with the same algorithm. For example, an nxn matrix and n-element vector multiplication can be computed using n(2n - 1) PE's in (n + 1 + log n) time steps.

Figure 3-6a: Time step 4.
Figure 3-6b: Time step 5.
Figure 3-6c: Time step 6.
Figure 3-6d: Time step 7.
Figure 3-6e: Time step 8.
Figure 3-6f: Time step 9.
Figure 3-6g: Time step 10.
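The behavior of one CVCU over these time steps can be simulated sequentially. This is a behavioral sketch only; the function name is mine, and each pass of the inner `while` loop stands for one level of the adder tree:

```python
# One CVCU computing column j of C = A x B: column j of B is preloaded
# into the R_B buffers of the n leaf multipliers, then the rows of A are
# pipelined in, one row per time step, and the adder tree merges the
# partial products x_ik = a_ik * b_kj.
def cvcu_column(A, B, j):
    n = len(A)
    rb = [B[k][j] for k in range(n)]          # preloaded R_B buffers
    column = []
    for row in A:                             # one row per time step
        partial = [row[k] * rb[k] for k in range(n)]
        while len(partial) > 1:               # adder-tree merge levels
            partial = [partial[m] + partial[m + 1]
                       for m in range(0, len(partial), 2)]
        column.append(partial[0])
    return column

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print([cvcu_column(A, B, j) for j in range(2)])  # [[19, 43], [22, 50]]
```

Running the n CVCU's side by side, all on the same stream of rows of A, reproduces the full product, which is why only n matrix elements need to be fetched per time step.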
Figure 3-7: Computation table.

3.3 Discussion and Performance Analysis

As shown in Section 3.2, an nxn dense matrix multiplication can be computed using n(2n - 1) PE's in (2n + log n) time steps. Note that, for the algorithm proposed in this paper, there is no need for data skewing or a data alignment network, because the algorithm requires only row access to the matrices A and B for the matrix multiplication AxB. The number of processing elements and the total time steps required for a given nxn dense matrix are shown in Figures 3.8 and 3.9, respectively. Figures 3.10a and 3.10b compare this algorithm with Kung's [2-4] and Hwang's [5] algorithms in terms of the number of required PE's for the range n = 2 to n = 2048. Figures 3.11a, b, and c depict the comparison of these algorithms in terms of the time steps necessary for any given nxn dense matrix over the same range. As shown in Figures 3.10 and 3.11, the algorithm proposed in this paper obtains higher speeds with fewer processing elements than the other two algorithms. Note also that it is possible to use fewer PE's to compute a given nxn dense matrix; however, this causes an increase in total time steps. For example, 1 CVCU can be employed instead of n to compute a given nxn matrix using (2n - 1) PE's in (n^2 + n + log n) time steps, which is still faster than a sequential processor. Combinations of the number of PE's and the total time steps can easily be arranged so that the computation rate of the overall system matches the I/O bandwidth of the host computer.

Speedup and efficiency provide more insight into the performance of these algorithms, where the speedup S is defined as the ratio of the serial computation time to the parallel computation time.
Figure 3-8: Number of required processing elements vs. N.

Figure 3-9: Total computation time of an NxN dense matrix.

Figure 3-10a: Number of required processing elements vs. N.

Figure 3-10b: Number of required processing elements vs. N.

Figure 3-11a: Total computation time of an NxN dense matrix.

Figure 3-11b: Total computation time of an NxN dense matrix.

Figure 3-11c: Total computation time of an NxN dense matrix.

The efficiency E is regarded as the ratio of the actual speedup to the maximum possible speedup using p processing elements, as shown below [10,11]:

S = t_s / t_p ≥ 1
E = S / p ≤ 1

where t_s and t_p are the serial and parallel computation times, respectively. Figures 3.12a and 3.12b illustrate the speedup of each algorithm for the range n = 2 to n = 2048. The percent efficiency of these algorithms for the range n = 2 to n = 1024 is depicted in Figures 3.13a and 3.13b. As shown in Figures 3.12 and 3.13, higher speedup and higher efficiency are obtained by the algorithm introduced in this paper.
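Under the assumption that a sequential processor needs t_s = n^3 multiply-add steps for an nxn dense product (an assumption of this sketch, not a figure quoted from the thesis), S and E for the three schemes follow directly from the step counts given above. The function and variable names are mine:

```python
import math

# S = t_s / t_p and E = S / p, with t_s = n^3 assumed for the serial
# machine and (t_p, p) taken from the text: tree (2n + log2 n, n(2n - 1)),
# Kung (5(n - 1), 3n^2 - 3n + 1), Hwang (4n - 2, n(2n - 1)).
def perf(n, p, t_par):
    s = n ** 3 / t_par
    return s, s / p

n = 32
s_tree, e_tree = perf(n, n * (2 * n - 1), 2 * n + int(math.log2(n)))
s_kung, e_kung = perf(n, 3 * n * n - 3 * n + 1, 5 * (n - 1))
s_hwang, e_hwang = perf(n, n * (2 * n - 1), 4 * n - 2)
print(round(s_tree / s_kung, 2), round(s_tree / s_hwang, 2))   # 2.25 1.83
print(round(e_tree / e_kung, 2), round(e_tree / e_hwang, 2))   # 3.32 1.83
```

These ratios reproduce the factors of roughly 9/4 and 9/5 in speedup, and 10/3 and 9/5 in efficiency, cited for n = 32.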
For example, for n = 32, approximately (9/4) and (9/5) times more speedup, and (10/3) and (9/5) times more efficiency, are obtained over Kung's and Hwang's algorithms, respectively. Note that the efficiency stays almost constant for n ≥ 512 for all algorithms; however, much higher speedups can still be obtained for larger values of n over the other two algorithms. One of the reasons the tree structure performs better than the other structures is that the longest path any data element has to travel is O(log n) instead of O(n). Another important reason for the higher performance is that the tree structure makes use of all of its PE's all the time, whereas the other structures use only half or less of their total PE's for the actual calculation, with the rest of the PE's, most of the time, simply passing data. More detailed performance analysis of these algorithms can be accomplished by including other performance criteria, such as utilization and redundancy as defined in [10,11].

Figure 3-12a: Speedup vs. N.

Figure 3-12b: Speedup vs. N.

Figure 3-13a: Percent efficiency vs. N.

Figure 3-13b: Percent efficiency vs. N.

CHAPTER V
CONCLUSION

5.1 Summary

The purpose of this research was to develop and investigate a high-performance special-purpose device for matrix multiplication which works as a co-processor attached to a host computer.
This device may be employed to perform discrete Fourier transform and convolution operations for digital signal and image processing applications. The design was constrained by the requirement that basic VLSI rules be followed; i.e., the overall co-processor has a simple, regular structure with short, nearest-neighbor communication paths. Systolic arrays, introduced by Kung [2-4], as well as other VLSI computing structures, introduced by Hwang [5], were reviewed in Chapter II. Examples of matrix multiplication algorithms were shown, and the corresponding space-time complexity of each of these algorithms was indicated. Specifically, it was indicated that an nxn dense matrix multiplication can be computed in 5(n - 1) and (4n - 2) time steps by using (3n^2 - 3n + 1) and n(2n - 1) processing elements by Kung's [2-4] and Hwang's [5] algorithms, respectively. A new binary tree structure matrix multiplication algorithm was introduced in Chapter III. The algorithm capitalizes on the properties of VLSI to obtain high throughput rates and high efficiency. The algorithm first forms the products of the matrix elements in the leaf nodes, and then sums the partial results by merging them as they are pipelined toward the root node of the binary tree. In this way, massive parallelism can be achieved to introduce high degrees of pipelining and multiprocessing. Processing elements, which are the nodes of the binary tree, and the basic tree structures were also defined in this chapter. It was shown that the binary tree structures can be implemented with only a few different types of simple processing elements, and that they have a simple, regular, and short communication geometry, which is considered a necessary attribute of an efficient VLSI implementation. This in turn yields cost-effective special-purpose devices.
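The leaf-multiply, tree-add scheme summarized above can be modeled in a few lines. The following Python sketch is an illustrative software model of one column processing unit, not the thesis hardware: the n leaf multipliers form the products, and a binary adder tree merges the partial sums level by level toward the root.

```python
def tree_inner_product(row, col):
    """Model of one column unit: leaf multipliers feed a binary adder
    tree that merges partial sums toward the root node."""
    level = [a * b for a, b in zip(row, col)]       # leaf multipliers
    while len(level) > 1:                           # one adder level per pass
        if len(level) % 2:                          # odd count: pad with identity
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]                                 # root holds the result

def tree_matmul(A, B):
    n = len(A)
    cols = [[B[k][j] for k in range(n)] for j in range(n)]
    return [[tree_inner_product(A[i], cols[j]) for j in range(n)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(tree_matmul(A, B))   # [[19, 22], [43, 50]]
```

Each pass through the while loop corresponds to one adder level of the tree, so there are log2(n) addition levels for n a power of two; the model runs sequentially, whereas the hardware overlaps successive rows of A in a pipeline.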
Chapter III concluded with a discussion of the space-time complexity of this binary tree structure and a comparison of its speedup and efficiency with those of the structures reviewed in Chapter II. It was shown that, for any nxn dense matrix multiplication, the binary tree structures algorithm provides higher speedup and efficiency than the other algorithms. Specifically, it was shown that any nxn dense matrix multiplication can be performed in [(2n + log(n))] time steps by using n(2n - 1) processing elements. From this, we can conclude that the binary tree structures use fewer processing elements than Kung's [2-4] algorithm and the same number of processing elements as Hwang's [5] algorithm, and that the binary tree structure provides a faster solution for any nxn dense matrix multiplication than either of the algorithms mentioned above. Note that a processing element used in the binary tree structures is either a multiplier or an adder with some additional registers, whereas a processing element for the other structures contains both a multiplier and an adder, with some additional registers. Furthermore, each PE in the tree structure has only one register, as compared to three registers for the others. The reason for this is that each PE requires only three input/output ports in the binary tree structure, as compared to six input/output ports. All of these advantages yield a smaller chip area and fewer input/output connection lines per PE for the binary tree structures. Some other important advantages of tree structures were also indicated. For example, in binary tree structures there is no need for data skewing for matrix multiplication, whereas for the other structures it is necessary to skew the data before input, which introduces additional computation time.
For matrix multiplication, once the matrix B is loaded into the RB registers, matrix A is broadcast to all column vector processing units at the same time, one row at a time, which simplifies the data accessing problem. As the matrix A is pipelined through the binary tree, all the processing elements are kept busy and utilized to perform the actual calculations, instead of simply passing data from one PE to another as in the case of Kung's [2-4] and Hwang's [5] algorithms. The algorithm introduced in this thesis utilizes the inherent characteristics of matrix multiplication, and optimally matches this algorithm to the binary tree structure. In Chapter IV, special-purpose devices that can be used in digital signal and image processing to perform convolution and discrete Fourier transform were proposed and discussed. Specifically, the one-dimensional convolution operation and its binary tree structure chip-level and system-level architectures were illustrated. The applicability of the binary tree structure to many other digital signal and image processing applications, which can be represented as matrix-matrix multiplication, is evident. The use of these binary tree structures can substantially reduce the amount of computation required for the solution of many problems in digital signal and image processing. This binary tree structure is intended to be used in conjunction with a conventional computer. It can be used to implement a complete digital signal or image processing device, or can be used as part of another device. And it can easily be extended to any size because of its modularity. Advances in the design and fabrication of VLSI circuits will soon make it feasible to construct special-purpose devices, such as the binary tree structures, which are excellent candidates for VLSI technology, with their simple, regular, and short communication geometry.
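The convolution devices of Chapter IV rest on expressing 1-D convolution as a matrix-vector product, which the tree multiplier then evaluates. As a small illustration (my own sketch; the thesis' exact matrix layout may differ), the Python below builds the Toeplitz matrix H from a kernel h so that H applied to x equals the full linear convolution of h and x.

```python
def convolution_matrix(h, n):
    """Toeplitz matrix H of shape (len(h)+n-1, n) with H[i][j] = h[i-j],
    so that H @ x equals the full 1-D linear convolution of h and x."""
    m = len(h) + n - 1
    return [[h[i - j] if 0 <= i - j < len(h) else 0 for j in range(n)]
            for i in range(m)]

def matvec(H, x):
    """Each row is one inner product -- the job of one tree column unit."""
    return [sum(r * v for r, v in zip(row, x)) for row in H]

h = [1, 2, 3]          # filter kernel
x = [4, 5, 6]          # input signal
H = convolution_matrix(h, len(x))
print(matvec(H, x))    # [4, 13, 28, 27, 18]
```

Each output sample is one inner product, so the same leaf-multiply, tree-add structure used for matrix multiplication computes it directly.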
They can be used to obtain very high throughput rates for problems which are not amenable to solution even with state-of-the-art conventional computers. Their geometry, flexibility, and cost effectiveness will make them very valuable tools for many important tasks.

5.2 Further Research

One area that should be investigated in the future concerns the application of the binary tree structures introduced here to other signal and image processing algorithms. Using these structures, it may be possible to construct complete digital signal and image processing devices with very high speeds. A second investigation that could be pursued is the development of single chips with multiple PE's, which could result in a substantial reduction in the time-space complexity of the overall system. And, finally, a further investigation that could be undertaken is the development of fast and large interleaved memories for the tree structures, and of the scheduling between them, because the communication line between the host computer and the attached special-purpose device can present a bottleneck.

REFERENCES

1. Haynes, L.S., "Highly Parallel Computing," IEEE Computer, Vol. 15, No. 1, Jan. 1982, pp. 7-80.

2. Kung, H.T. and Leiserson, C.E., Systolic Arrays (for VLSI), Technical Report, Carnegie-Mellon University, Department of Computer Science, Dec. 1978.

3. Kung, H.T., "Let's Design Algorithms for VLSI Systems," Proc. of Caltech Conf. on VLSI, Jan. 1979, pp. 65-90.

4. Kung, H.T., "The Structure of Parallel Algorithms," in Advances in Computers (Yovits, M.C., Editor), Academic Press, New York, 1979.

5. Hwang, K. and Cheng, Y.H., "VLSI Computing Structures for Solving Large Scale Linear Systems of Equations," Proc. of Int'l Conf. on Parallel Processing, Aug. 1980, pp. 217-230.

6. Swartzlander, E.E. et al., "Inner Product Computers," IEEE Trans. on Computers, Vol. 27, No. 1, Jan. 1978, pp. 21-31.

7. Despain, A.M.
and Patterson, D.A., "X-Tree: A Tree Structured Multiprocessor Computer Architecture," Proc. of Fifth Int'l Symp. on Computer Architecture, Apr. 1978, pp. 144-151.

8. Browning, S.A., The Tree Machine: A Highly Concurrent Computing Environment, Technical Report (Ph.D. Thesis), Computer Science, California Institute of Technology, Jan. 1980.

9. Budnik, P. and Kuck, D.J., "The Organization and Use of Parallel Memories," IEEE Trans. on Computers, Dec. 1971, pp. 1566-1569.

10. Kuck, D.J., The Structure of Computers and Computations, John Wiley & Sons, 1978.

11. Lee, R.B-L., "Empirical Results on the Speed, Efficiency, Redundancy, and Quality of Parallel Computations," Proc. of Int'l Conf. on Parallel Processing, 1980, pp. 91-100.

12. Agarwal, R.C. and Cooley, J.W., "New Algorithms for Digital Convolution," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 25, No. 5, Oct. 1977, pp. 392-410.

13. Cochran, W.T. et al., "What is the Fast Fourier Transform?" IEEE Trans. on Audio and Electroacoustics, Vol. 15, 1967, pp. 45-55.

14. Booth, A.D., "A Signed Binary Multiplication Algorithm," Quart. J. Mech. Appl. Math., Vol. 4, 1951, pp. 236-240.

15. Baugh, C.R. and Wooley, B.A., "A Two's Complement Parallel Array Multiplication Algorithm," IEEE Trans. on Computers, Vol. 22, Dec. 1973, pp. 1045-1047.

16. Hwang, K., Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.

17. Erdal, A.C. and Fisher, P.D., "Relaxed Pinout Constraints Yield Improved Parallel Adders for VLSI-Based Systems," IEEE Int'l Conf. on Circuits and Computers, Sept. 1982, pp. 142-146.