A NEW METHODOLOGY FOR SYSTEMATIC CONSTRUCTION OF SYSTOLIC ARRAYS

By

Nikrouz Faroughi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering

1987

ABSTRACT

A NEW METHODOLOGY FOR SYSTEMATIC CONSTRUCTION OF SYSTOLIC ARRAYS

By Nikrouz Faroughi

This work involves the development of a systematic method for transforming an algorithm, represented by mathematical expressions with uniform and bounded index spaces, into a systolic array architecture. Systolic arrays are highly structured architectures tailored to a specific application. They have specific architectural properties such as simple processing elements (cells), simple and regular data and control communication, and local cell interconnections. The new design method is based on an understanding of the relationship between two highly structured representations of the algorithms: the mathematical expressions and their systolic solutions. The method consists of three major steps: algorithm representation, algorithm model, and architecture specification. The algorithm representation involves the translation of mathematical expressions into a set of equivalent simple computations, which are grouped into subsets based on the required set of operations and operands of the same type. A set of information, referred to as features, is extracted from the simple computations. The features include the operands of each simple computation, the lexicographical index differences of each pair of operands of a simple computation, and the lexicographical index differences of the respective operands of each pair of simple computations.
In the algorithm model, the properties of systolic arrays are represented in terms of feature interrelationships. A sub-systolic array is designed separately for each subset of the simple computations. The final array is constructed by joining the sub-systolic arrays. This process provides a method by which non-homogeneous systolic arrays are designed by joining several homogeneous systolic arrays (i.e., sub-systolic arrays). A set of information, referred to as valid projection vectors, is systematically determined from the feature interrelationships. Each subset of the simple computations is represented geometrically, where the valid geometric transformations are determined from the valid projection vectors. The projection of the new geometric representations produces the individual cell locations and the simple computations of each cell. Other architecture specifications, such as data movement and cell count ratio, are determined early in a design process and thus can be used to select systolic solutions which require the fewest cells and the lowest I/O bandwidth. This method produces all the systolic solutions for a given set of simple computations.

DEDICATION

This work is dedicated to my parents, Majid and Tahereh. They have always supported and encouraged my pursuit of learning.

ACKNOWLEDGMENTS

I wish to express my gratitude and appreciation to my advisor and chairman of my dissertation committee, Dr. Michael Shanblatt, for his positive and continued encouragement. I am also grateful to the other members of my committee, Dr. Stanley Crouch, Dr. Erick Goodman, Dr. Lionel Ni, Dr. Harriett Rigas, and Dr. Chin-Long Wey. Through this research I learned how to decompose algorithms into simple computations; Dr. Wey provided me with valuable ideas as to how to decompose my dissertation into papers. I appreciated Dr. Goodman's sincere interest in the subject. His valuable comments helped me to tie up some loose ends. Finally, to the many good friends, such as Dr.
Charles Given, my supervisor and my friend, for his understanding and total support, and Touran Cheraghi, for her understanding during my moments of frustration, I offer my heartfelt thanks.

Table of Contents

List of Tables .......... viii
List of Figures .......... ix
CHAPTER 1: INTRODUCTION .......... 1
1.1 Problem Statement .......... 3
1.2 Approach .......... 4
1.3 Organization of The Dissertation .......... 5
1.4 Chapter Summary .......... 7
CHAPTER 2: BACKGROUND .......... 8
2.1 Systolic Arrays and their Properties .......... 8
2.2 Automatic Systolic Design .......... 12
2.2.1 Z-Operator Approach .......... 12
2.2.2 Index Transformations .......... 15
2.2.3 Data Flow Transformation .......... 22
2.2.4 Dependence Graph Transformation .......... 24
2.2.5 High Level Description Transformation .......... 25
2.2.6 Retiming Transformation .......... 27
2.3 Summary of the Literature Review .......... 31
CHAPTER 3: ALGORITHM DESCRIPTION AND REPRESENTATION .......... 33
3.1 Algorithm Description .......... 34
3.2 Algorithm Representation .......... 37
3.2.1 Initial Partial Results .......... 40
3.2.2 Algorithm Initial and Final Representations .......... 43
3.3 Chapter Summary .......... 49
CHAPTER 4: ALGORITHM MODEL .......... 50
4.1 Mathematical Development .......... 50
4.1.1 Feature Interrelationships .......... 63
4.2 Valid Projection Vectors .......... 90
4.3 Chapter Summary .......... 98
CHAPTER 5: ARCHITECTURE SPECIFICATION .......... 100
5.1 Individual Cell Locations .......... 101
5.1.1 1D and 2D Systolic Arrays .......... 103
5.1.2 m-Dimensional Systolic Arrays .......... 111
5.1.3 Design Dimension .......... 114
5.1.4 Cell Count Ratio .......... 115
5.2 Data Movement .......... 118
5.3 Cell Interconnections .......... 120
5.4 Data Flow Timing .......... 124
5.5 Input/Output Data Patterns .......... 130
5.6 Cell Functions .......... 132
5.7 Chapter Summary .......... 132
CHAPTER 6: DESIGN TOOLS AND RESULTS .......... 134
6.1 Design Facility .......... 134
6.2 Results .......... 138
6.2.1 Matrix Multiplication .......... 139
6.2.2 LU Decomposition .......... 174
6.2.3 Solution of Lower Triangular Systems of Equations .......... 182
6.2.4 Finite Impulse Response (FIR) Filter .......... 187
6.2.5 2D Convolution .......... 192
6.3 Chapter Summary .......... 206
CHAPTER 7: DISCUSSION AND CONCLUSIONS .......... 207
LIST OF REFERENCES .......... 216

List of Tables

3.1. Final representation of algorithm described by Eq. 3.28 .......... 48
4.1. Interactive feature values of Eq. 4.1-1 .......... 62
4.2. Valid local projection vectors for C1 and C2 of Eq. 4.2-2 .......... 95
5.1. Lookup Table for Transformation Matrix Construction .......... 112
6.1. The valid global projection vectors for matrix multiplication .......... 140
6.2. CCR for each Vx of matrix multiplication .......... 142
6.3. Data input directions for matrix multiplication systolic solutions .......... 143
6.4. Comparison of Number of Cells, DFT, and Data Movement for matrix multiplication systolic solutions .......... 144
6.5. Simulation results for square (5x5) matrix multiplication .......... 145
6.6. The global valid projection vectors of LU decomposition .......... 179
6.7. Data input directions for LU decomposition .......... 179
6.8. Global valid projection vectors for the FIR problem .......... 188
6.9. Data input directions for the FIR systolic solutions .......... 189
6.10. Valid global projection vectors for 2D convolution .......... 194

List of Figures

2.1. A systolic array for 3 x 3 matrix multiplication [6]
2.2. The Z-graph representation of the FIR filter y_i = w_1 * x_i + w_2 * x_{i+1}
2.3. A cell function node [21]
3.1. Possible structures of simple cells
4.1. The geometric representation of the output operands of Eq. 4.1-1
4.2. The geometric representation of the simple computations of Eq. 4.1-1
4.3. A non-homogeneous systolic array
4.4. Three sub-systolic arrays of a non-homogeneous systolic array
4.5. A homogeneous systolic array
4.6. The violated condition 1 of Theorem 4.5
4.7. The violated condition 1 of Theorem 4.6
4.8. The violated condition 1 of Theorem 4.7
4.9. The violated condition of Lemma 4.8
4.10. A test tree
4.11. The test tree of (c4, c7) of Eq. 4.1-33
5.1. A 2-dimensional geometric representation with two simple computations
5.2. An original geometric representation
5.3. A transformation of the geometric representation of Figure 5.2
5.4. The cell locations and simple computations of sub-systolic arrays of subsets C1 (Eq. 5.1-25) and C2 (Eq. 5.1-26)
5.5. The primary inputs and outputs of sub-systolic solutions of C1 and C2
5.6. A systolic array for matrix multiplication
5.7. The systolic solution for solving an upper triangular system of linear equations
6.1. The system diagram .......... 136
6.2. A projection plane for matrix multiplication; Vx = (0, 0, 1)^t .......... 148
6.3. A systolic solution for matrix multiplication; Vx = (0, 0, 1)^t .......... 149
6.4. A projection plane for matrix multiplication; Vx = (0, 1, 0)^t .......... 150
6.5. A systolic solution for matrix multiplication; Vx = (0, 1, 0)^t .......... 151
6.6. A projection plane for matrix multiplication; Vx = (1, 0, 0)^t .......... 152
6.7. A systolic solution for matrix multiplication; Vx = (1, 0, 0)^t .......... 153
6.8. A projection plane for matrix multiplication; Vx = (0, 1, 1)^t .......... 154
6.9. A systolic solution for matrix multiplication; Vx = (0, 1, 1)^t .......... 155
6.10. A projection plane for matrix multiplication; Vx = (1, 1, 0)^t .......... 156
6.11. A systolic solution for matrix multiplication; Vx = (1, 1, 0)^t .......... 157
6.12. A projection plane for matrix multiplication; Vx = (1, 0, 1)^t .......... 158
6.13. A systolic solution for matrix multiplication; Vx = (1, 0, 1)^t .......... 159
6.14. A projection plane for matrix multiplication; Vx = (0, 1, -1)^t .......... 160
6.15. A systolic solution for matrix multiplication; Vx = (0, 1, -1)^t .......... 161
6.16. A projection plane for matrix multiplication; Vx = (1, -1, 0)^t .......... 162
6.17. A systolic solution for matrix multiplication; Vx = (1, -1, 0)^t .......... 163
6.18. A projection plane for matrix multiplication; Vx = (1, 0, -1)^t .......... 164
6.19. A systolic solution for matrix multiplication; Vx = (1, 0, -1)^t .......... 165
6.20. A projection plane for matrix multiplication; Vx = (1, 1, 1)^t .......... 166
6.21. A systolic solution for matrix multiplication; Vx = (1, 1, 1)^t .......... 167
6.22. A projection plane for matrix multiplication; Vx = (1, 1, -1)^t .......... 168
6.23. A systolic solution for matrix multiplication; Vx = (1, 1, -1)^t .......... 169
6.24. A projection plane for matrix multiplication; Vx = (1, -1, 1)^t .......... 170
6.25. A systolic solution for matrix multiplication; Vx = (1, -1, 1)^t .......... 171
6.26. A projection plane for matrix multiplication; Vx = (1, -1, -1)^t .......... 172
6.27. A systolic solution for matrix multiplication; Vx = (1, -1, -1)^t .......... 173
6.28. The projection planes of LU decomposition .......... 180
6.29. The systolic solution for LU decomposition .......... 181
6.30. The geometric representation of Lx = b
6.31. The systolic solution of Lx = b
6.32. The projection line of the FIR filter for Vx = (0, 1)^t
6.33. The systolic solution of the FIR filter for Vx = (0, 1)^t
6.34. The projection line of the FIR filter for Vx = (1, 0)^t
6.35. The systolic solution of the FIR filter for Vx = (1, 0)^t
6.36. An equivalent systolic solution of 6.33
6.37. The projection plane of 2D convolution (Eq. 6.2-26) for Vx = (0, 1, 0)^t
6.38. The systolic solution for 2D convolution (Eq. 6.2-26) for Vx = (0, 1, 0)^t
6.39. The systolic solution of 2D convolution (Eq. 6.2-26) constructed by overlapping the cells in the solution of 6.38
6.40. The systolic solution of 6.39 with nine weights
6.41. The projection plane of 2D convolution (Eq. 6.2-28) for Vx = (0, 1, 0)^t
6.42. The systolic solution for 2D convolution (Eq. 6.2-28) for Vx = (0, 1, 0)^t
6.43. The systolic solution of 2D convolution (Eq. 6.2-28) constructed by overlapping the cells in the solution of 6.42
6.44. The systolic solution of 6.43 with nine weights
7.1. Y-chart representation of the new systematic systolic array construction method

CHAPTER 1

INTRODUCTION

The revolution in VLSI technology allows the implementation of powerful computational systems on a VLSI chip using multiple and regularly connected processing elements. These VLSI systems that combine pipelining and multiprocessing have been referred to as systolic arrays [1]. Many designs using the systolic array concept exist for computational algorithms in pattern recognition, image and signal processing, and matrix manipulation [1-10, 43-44]. The systematic transformation of an algorithm into a systolic array architecture is a design automation problem of current interest. The development of a systematic methodology for transforming algorithms into systolic arrays has been addressed [11-21, 23, 25-34].
The power as well as the limitations of these methodologies have been recorded [22]. One general limitation that exists in all methodologies is the lack of prior knowledge about the resulting systolic designs. In many of the existing methodologies the final systolic array is the result of several ad hoc modifications to the original design. The others use linear transformations and often choose the transformation in an ad hoc manner. In general, systematic construction of systolic arrays can be grouped into three major steps [22].

1. Algorithm Representation: An algorithm expressed by some high-level construct, e.g. mathematical expressions, is represented in a suitable format.
2. Algorithm Model: The algorithm representation from step 1 is modeled for systematic systolic design.
3. Architectural Specification: From the algorithm model the systolic array specifications, such as the number of cells, design dimension, cell interconnections, operations performed in each cell, data movement, and data timing, are extracted.

The above three steps may or may not all be contained in a given methodology. Furthermore, each major step may consist of several sub-steps. This research describes a new methodology for systematic construction of systolic arrays and includes all three major steps. For step 1, the algorithm is represented by primitive computations. These computations are generated by applying an index expansion rule to the algorithm index spaces. In step 2, the algorithm model is formed as a "knowledge representation". The inherent algorithm information that affects systolic array architectural characteristics is represented by index difference vectors. These characteristics include the use of a few different types of processing elements (cells), simple and regular data and control communications, and local cell interconnections. From this model, the simple computations are grouped among a set of cells such that the systolic array properties are preserved.
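The index expansion idea in step 1 can be made concrete with matrix multiplication. The sketch below is an illustration only, not the dissertation's notation; names such as `simple_computations` and `evaluate` are invented here. It expands c[i][j] = sum_k a[i][k] * b[k][j] into one simple computation (a single multiply-add) per index point (i, j, k):

```python
# Illustrative sketch (assumed names, not the dissertation's notation):
# expand n x n matrix multiplication into one "simple computation" per
# index point (i, j, k) of the expanded index space.

def simple_computations(n):
    """Enumerate the index points of the expanded algorithm."""
    return [(i, j, k) for i in range(n) for j in range(n) for k in range(n)]

def evaluate(a, b):
    """Run every simple computation; each one is a single multiply-add."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for (i, j, k) in simple_computations(n):
        c[i][j] += a[i][k] * b[k][j]   # one simple computation
    return c
```

Each such index point later becomes a candidate for assignment to a cell, which is what makes the mapping problem tractable.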
This is done by representing the simple computations geometrically and mapping them onto a set of cells via geometric projections. Not all geometric projections will produce systolic arrays. Hence, geometric transformations are performed. The transformation matrices are systematically produced from the algorithm's inherent information. Step 3, the architecture specification, consists of two parts. Part 1 is incorporated with the algorithm model so that designs with different "cell count ratios", design dimensions, and data movement will be known earlier in the design process. Part 2 of step 3 identifies the individual cell locations, cell interconnections, data flow timing, and data input patterns. In general, algorithms suitable for systolic implementation are highly structured. It is assumed that the number of different systolic solutions for any given problem is independent of the problem size. Therefore, this systematic systolic array design approach is first applied to smaller problem sizes. The design information, such as the index transformation matrices and the cell interconnections, is produced. This design information is then easily extrapolated to larger problem sizes to produce correspondingly larger systolic arrays. In the discussion that follows, the terms algorithm and problem are used interchangeably to describe a set of computations which follow a set of specific rules.

1.1 PROBLEM STATEMENT

Many approaches exist for constructing systolic arrays from algorithms or from initial non-systolic designs. For these methods, in general, a systolic solution is the result of both systematic and ad hoc procedures. There are approaches which provide a systematic methodology for constructing systolic arrays [20,36]. However, these approaches may not satisfy the simple cell property of systolic arrays. The existing methods do not directly incorporate the properties of systolic arrays in the design process.
It seems appropriate that knowledge of legal communication topologies (such as local connectivity), modularity (such as regular communication), and operation complexities (such as simple arithmetic operations) should be used in designing systems with such properties. A "good" approach for constructing systolic arrays should use certain algorithm characteristics to control cell complexity, data communication, and cell connectivity. Furthermore, for a given algorithm there may exist many different systolic solutions, each with different characteristic parameters such as the number of cells, throughput, total execution time, cell interconnection patterns, and data movement. All the systolic solutions for a given algorithm need to be explored in order to select the best solution for the specific application. This work presents a new approach for constructing systolic arrays that incorporates the properties of systolic arrays in the design process and produces all the existing solutions for a given algorithm description.

1.2 APPROACH

An algorithm suitable for a systolic array implementation is first transformed into its equivalent set of simple (primitive) computations. Each simple computation consists of one resultant variable, a few input variables, and a set of simple operations applied to the input variables. The input variables are either constants or partial results (i.e., the resultant variables of other simple computations). Each simple computation therefore has the desired form for computation in a cell. These computations are then assigned to a set of simple cells; those assigned to the same cell are computed sequentially in that cell. Therefore, the mapping of an algorithm into a systolic array becomes the problem of mapping simple computations to a set of cells.
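As a purely illustrative data structure, a simple computation of the form just described can be recorded as one resultant variable, its input operands, and a simple operation. The field names below are assumptions for this sketch, not the dissertation's notation:

```python
# Hedged sketch: a record for one simple computation. The operand encoding
# (variable name plus index tuple) is an assumption for illustration.

from dataclasses import dataclass

@dataclass(frozen=True)
class SimpleComputation:
    result: tuple   # the resultant variable, e.g. ('c', (0, 1, 2))
    inputs: tuple   # operands: constants or partial results of other computations
    op: str         # a simple operation suitable for one cell

# one accumulation step of matrix multiplication at index point (0, 1, 2)
sc = SimpleComputation(result=('c', (0, 1, 2)),
                       inputs=(('c', (0, 1, 1)), ('a', (0, 2)), ('b', (2, 1))),
                       op='multiply-add')
```

A cell then simply evaluates, in sequence, the simple computations assigned to it.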
The algorithm is viewed as a set of input data, a set of output data, and a finite set of rules that define how each output datum is computed from the input and possibly from the previous output data. A set of simple computations is said to be equivalent to the algorithm if it preserves the algorithm's correctness. The mapping of the simple computations into a systolic array is the same as assigning the computations to a set of cells such that the properties of systolic arrays are preserved. That is, these computations are assigned to different cells such that data communications among the cells are simple and regular, cell interconnections are local, and cell operations are simple. A set of information, referred to as features, is systematically extracted from the simple computations. The array properties are then represented in the form of feature interrelationships. A subset of simple computations can be assigned to a cell if and only if all the feature interrelationships are satisfied. Moreover, a systolic solution exists if and only if all the simple computations can be assigned to a set of cells without violating the array properties. As the size of the problem increases, so does the number of equivalent simple computations. Hence, operating directly on these computations is no longer realistic. A methodology that uses geometric transformation and projection is therefore used. The indices of the simple computations form an index space such that the projection of this space onto a lower-dimensional space maps several simple computations onto the same point. Each point is then considered to be a cell. Not all such projections, however, will preserve the properties of systolic arrays. The valid projection directions, referred to as projection vectors, are systematically derived from the feature interrelationships. These valid projection vectors are generated from small problem sizes and later used to design larger arrays for larger problem sizes.
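A minimal sketch of these two ideas, using matrix multiplication as the running example (all names are illustrative assumptions): the index-difference "features" between two simple computations are constant across the index space, and projecting the index space along a projection vector collapses several simple computations onto one cell. For simplicity the projection here handles only axis-aligned unit vectors; a general projection vector would need a full linear transformation.

```python
def index_difference(p, q):
    """Feature: lexicographical index difference of two index points."""
    return tuple(a - b for a, b in zip(p, q))

# operand indices of the simple computation c[i][j] += a[i][k] * b[k][j]
def operand_indices(i, j, k):
    return {'c': (i, j, k), 'a': (i, k), 'b': (k, j)}

u, v = operand_indices(0, 1, 1), operand_indices(0, 1, 2)
diffs = {name: index_difference(v[name], u[name]) for name in u}
# diffs is the same for any (i, j): a regularity the algorithm model exploits

def project(points, vx=(0, 0, 1)):
    """Collapse index points onto cells along an axis-aligned projection vector."""
    axis = vx.index(1)                     # simplification: unit vectors only
    cells = {}
    for p in points:
        cells.setdefault(p[:axis] + p[axis + 1:], []).append(p)
    return cells

points = [(i, j, k) for i in range(3) for j in range(3) for k in range(3)]
cells = project(points)   # 27 simple computations collapse onto 9 cells
```

Each key of `cells` is a candidate cell location, and its value lists the simple computations that cell would execute sequentially.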
1.3 ORGANIZATION OF THE DISSERTATION

Chapter 2 presents the background to this research. In particular, Section 2.1 describes the systolic array architecture and its advantages. In Section 2.2 the literature on the systematic construction of systolic arrays is reviewed. Section 2.3 presents a summary of the existing methods for constructing systolic arrays. Chapter 3 presents the algorithm representation. The class of algorithms considered in this study is discussed in Section 3.1. The algorithm representation, which consists of the algorithm initial representation and the algorithm final representation, is presented in Section 3.2. The algorithm initial representation transforms the given algorithm into a standard form (a set of simple computations). The algorithm final representation contains a subset of the algorithm's inherent information; this information, together with the algorithm initial representation, forms the algorithm final representation. Chapter 4 focuses on the algorithm model. Section 4.1 presents the mathematical development of the algorithm model, where additional inherent information is extracted. The simple computations of the algorithm are represented geometrically. The projection of the geometric representation maps several primitive computations onto the same point on the projection plane. The systolic array architectural characteristics are then represented by certain relationships derived from the inherent information. These relationships, presented in Section 4.2, are used to determine those geometric projections that preserve the systolic array architectural properties. Section 4.3 presents the methodology developed from these relationships to extract the valid projection directions. Chapter 5 presents the architecture specification, according to which systolic arrays are constructed using the design information extracted from the algorithm model.
This information is translated into linear integer index transformation matrices and is discussed in Section 5.1. Each different transformation matrix corresponds to a different geometric transformation and geometric projection. The projections provide the individual cell locations and the list of primitive computations that are computed in each cell. The cell interconnections are derived using the nearest-cell information.

1.4 CHAPTER SUMMARY

Chapter 1 presented the three major tasks of systematic construction of systolic arrays from algorithms. The tasks are: Algorithm Representation, Algorithm Model, and Architectural Specification. There exist many approaches for constructing systolic arrays either from an algorithm description or from an initial non-systolic design. Only a few of the existing methods use the architectural properties of systolic arrays directly in their design process. This research develops a new approach for systematic construction of systolic arrays in which the properties of systolic arrays are incorporated into the design process. An algorithm described by mathematical expressions is decomposed into a set of simple computations. A set of features is extracted from the simple computations and used to control the cell complexity, data and control communication, and cell connectivity. The new approach generates all the systolic solutions for the given set of simple computations.

CHAPTER 2

BACKGROUND

Section 2.1 of this chapter describes the properties and advantages of the systolic array architecture. Section 2.2 presents a brief review of the existing methods for systematic construction of systolic arrays, either from an algorithm description or from an initial non-systolic design. Section 2.3 summarizes the literature review and points out certain drawbacks of the existing methods.
2.1 SYSTOLIC ARRAYS AND THEIR PROPERTIES

Special purpose computer systems are typically used to meet specific performance requirements or to off-load computations that are too time-consuming for a general-purpose computer. Since these systems have limited applicability, their cost-effectiveness has always been the deciding factor in guiding their development. With evolving VLSI circuit technology, the design cost of a special-purpose system can be held down only if appropriate architectures are used. Kung and Leiserson have characterized one such architecture as the systolic design [1]. Systolic designs consist of:

1. a few different types of simple processing elements (cells);
2. simple and regular control and data communications;
3. local cell interconnections;
4. extensive pipelining and multiprocessing [5,8].

The use of only a few different types of cells implies the design and testing of only a few different macros. These cells in general perform simple operations involving a small number of operands, hence providing simple communication among the neighboring cells. The complete array is constructed by repeating the macros over the chip area. Regular communication implies that the design can be made modular; larger systems can be designed by combining proven smaller ones. The local cell interconnections provide simple and regular wire routing, making the design of these special-purpose devices cost effective. The use of pipelining and multiprocessing enables one to reach the performance requirement of a special-purpose system. In a systolic design, data flows in a pipelined fashion between the cells, and communication with the outside world is performed only at the boundary cells. All or portions of the boundary cells have input/output ports as shown in Figure 2.1. Kung has defined systolic designs as semi-systolic if global data communication is used, and pure-systolic if no global communication is used [5].
In global communication, data is broadcast to or collected from some or all of the cells. With no global communication, data is passed from cell to cell, and each datum is used in only one cell at any given time. When the number of cells increases, using global communication results in long connection wires; expanding these non-local communication paths becomes difficult without slowing down the system clock. In general, pure-systolic is preferred, but the correct choice of semi- or pure-systolic design depends on the chip I/O limitations, the specific application, and the environment in which the chip is used. The cells in a systolic design operate synchronously. The need to distribute the clock signal to every cell in the design without clock skew is one problem of the systolic architecture. Another architecture similar to the systolic architecture is the Wavefront Array Processor [24,42]. Unlike the systolic array, a wavefront array does not operate synchronously. Here the operations of each cell are controlled locally and depend on the availability of the input data and on the delivery of the previous output data to the neighboring cells. Due to this local control of operations, skewing of the input data, generally needed in systolic designs, may be avoided in wavefront designs. Although a systolic array is usually a special purpose device customized to a specific algorithm, it can be designed with programmable cell functions and programmable cell interconnection patterns to be used for a small subset of different algorithms [37,38]. In this case, different functions may be computed in the same array by programming (reconfiguring) the array. The number of different systolic arrays that can be programmed into a reconfigurable array is limited by the number of different cell functions and the number of different cell interconnections. Reconfigurable systolic arrays are usually pure-systolic.
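The contrast between global broadcast (semi-systolic) and cell-to-cell pipelining (pure-systolic) can be illustrated with a small sketch. This is not from the dissertation; it is a minimal simulation of a hypothetical 3-tap FIR-style linear array, where the broadcast version reads any input directly and the pipelined version lets each input datum shift one register per clock so that it lives in exactly one cell at a time.

```python
# Sketch (illustrative): broadcast (semi-systolic) vs. cell-to-cell
# pipelining (pure-systolic) for a 3-tap FIR filter y_n = sum_j w_j x_{n-j}.

def fir_broadcast(x, w):
    # Semi-systolic style: every cell can read any input directly,
    # a globally communicated access pattern.
    n, k = len(x), len(w)
    return [sum(w[j] * x[i - j] for j in range(k) if 0 <= i - j < n)
            for i in range(n + k - 1)]

def fir_pipelined(x, w):
    # Pure-systolic style: inputs enter at the boundary cell and shift
    # one register per clock; each datum is in one cell at any time.
    k = len(w)
    regs = [0] * k                 # one input register per cell
    out = []
    for t in range(len(x) + k - 1):
        xin = x[t] if t < len(x) else 0
        regs = [xin] + regs[:-1]   # shift the pipeline by one cell
        out.append(sum(w[j] * regs[j] for j in range(k)))
    return out

x = [1, 2, 3, 4]
w = [1, 0, -1]
assert fir_broadcast(x, w) == fir_pipelined(x, w)
```

Both versions compute the same outputs; the difference is purely in the communication pattern, which is exactly the distinction drawn above between semi- and pure-systolic designs.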
In general, a systolic architecture is used to implement a regular and compute-bound problem. A problem is compute-bound if the total number of operations is larger than the total number of inputs and outputs. A problem is regular when repetitive computations are performed on a large set of data [5]. Since the introduction of the systolic architectures, numerous systolic designs for various known problems have been proposed [1-10, 43-44]. Automatic construction of systolic designs has only recently been addressed [11-21, 23, 25-34, 36, 39]. The quality of a systolic design is evaluated by its correctness, delay, data flow timing, number of cells (chip area), modularity, and many other factors such as accuracy, flexibility, power requirement, fault tolerance, and cost.

Figure 2.1. A systolic array for 3 x 3 matrix multiplication [6]. (The figure shows the skewed a, b, and c data streams entering the array through the boundary cells; each cell computes c = c + a * b; t stands for time.)

2.2 AUTOMATIC SYSTOLIC DESIGN

In this section existing methodologies for designing systolic arrays are reviewed. These methodologies are grouped under separate titles based on related techniques.

2.2.1 Z-OPERATOR APPROACH

The Z-operator, which is also referred to as the delay operator, is defined as

Z^j x_i = x_{i-j}    (2.2-1)

or

Z^j x_i = x_{i+j},    (2.2-2)

where the indices indicate displacement in time or space. When the Z-operator is applied to a mathematical expression with indexed variables, the index differences are indicated by the powers of Z. For example, applying the Z-operator notation to the Finite Impulse Response (FIR) filter defined by

y_n = Σ_j a_j x_{n-j}    (2.2-3)

yields

y_n = Σ_j a_j Z^j x_n.    (2.2-4)

The new expression is then manipulated symbolically using the mathematical properties of the Z-operator. Implementations with different properties can be derived [7,23,25]. By direct hardware interpretation of the expression, correctness of the implementation is assured.
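The delay-operator notation above can be made concrete with a short sketch. This is an illustrative reading of Eqs. 2.2-1 through 2.2-4, not code from [7,23,25]: Z^j delays a finite sequence by j positions, and the FIR filter is then literally a weighted sum of delayed copies of the input.

```python
# Sketch (illustrative): the delay operator Z acting on a finite sequence,
# and the FIR filter written as y = sum_j a_j (Z^j x).

def Z(j, x):
    # (Z^j x)_i = x_{i-j}: delay the sequence by j positions (Eq. 2.2-1).
    return [x[i - j] if i - j >= 0 else 0 for i in range(len(x))]

def fir(a, x):
    # y_n = sum_j a_j x_{n-j}  ==  y = sum_j a_j (Z^j x)   (Eqs. 2.2-3, 2.2-4)
    y = [0] * len(x)
    for j, aj in enumerate(a):
        xd = Z(j, x)
        y = [yi + aj * xi for yi, xi in zip(y, xd)]
    return y

assert Z(1, [5, 6, 7]) == [0, 5, 6]
assert fir([1, 2], [1, 1, 1]) == [1, 3, 3]
```

Manipulating the powers of Z symbolically, as the surveyed methods do, corresponds here to rearranging which delayed copies are summed and where.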
However, in general, the direct hardware interpretation does not have the important characteristics of the systolic architecture. Further ad hoc modifications to the designs are needed.

Kung and Lin use two representations to specify a design [27]. The two representations are equivalent and either representation can be derived from the other. The Z-graph representation is used to show the design specifications such as the cell positions, the interconnections, and the delays. The algebraic representation is used to perform algebraic transformations on a design to produce better equivalent designs. With this algebra, a designer is able to eliminate global communication. For example, consider Figure 2.2,

Figure 2.2. The Z-graph representation of the FIR filter y_i = w1 * x_i + w2 * x_{i+1}. (The figure shows two cell nodes connected by Z^0 and Z^-1 delay edges, with input x and output y.)

in which Z^-k implies a delay of k units. The algebraic representation of Figure 2.2 can be written as

v ← [0  Z^-1; 0  0] v + [Z^0; Z^0] x    (2.2-5)

and

y = [1  0] v,    (2.2-6)

or

v ← A v + b x,    (2.2-7)
y = c^t v,

where x is the input, y is the output, and v represents a vector corresponding to the cells. Furthermore, the NxN matrix A, the Nx1 vector b, and the Nx1 vector c represent the delay cycles between the cells and the availability of the data. Two types of transformations are applied: the retiming transformation and the "k-slow" transformation. Let D be a full rank, diagonal, NxN matrix whose diagonal elements represent delays in the Z notation. Let

u = D v;    (2.2-8)

then

v = D^-1 u    (2.2-9)

and

u ← (D A D^-1) u + (D b) x,    (2.2-10)
y = (c^t D^-1) u,

where the vector u represents the new cells, and (D A D^-1), (D b) and (c^t D^-1) represent the delay cycles in the equivalent new design. For the "k-slow" transformation, the elements of A, b and c are indicated in terms of k; for example, Z^-2 is replaced by Z^-2k. Then the Z exponents of the matrix (D A D^-1) will be of the form d_i + k a_ij - d_j, where D = (Z^{d_i}) and A = (Z^{a_ij}). For no global communication, d_i + k a_ij - d_j ≥ 1 (d_i + k a_ij - d_j = 0 implies zero delay).
Solving this inequality for the minimum k will result in a design of maximum throughput. For regular communication, all nonzero elements of any column of (D A D^-1) and (D b) must be distinct. This assures that the value of a cell at any time is never sent to more than one cell at a time, thus eliminating broadcasting. The initial Z-graph representation of the algorithm is found by methods similar to those discussed in [7,23,25].

2.2.2 INDEX TRANSFORMATIONS

Index transformation has been used in the past to map algorithms to parallel processors which are connected via limited interconnection networks. It was shown that there is a certain amount of flexibility when an algorithm is mapped to an inflexible interconnection network [17]. This flexibility was shown by the existence of several index transformations which restructure the algorithm to use an inflexible interconnection network. Index transformations have been used to allow an algorithm to construct its own interconnections to form systolic designs [17]. This method is best suited for algorithms with arbitrary nest depth and uniform loop body execution time. The indices of all the variables in the algorithm are expanded in an ad hoc fashion to eliminate the broadcasting; hence, each datum is uniquely defined. A linear transformation is then applied to the indices. Suppose the index set consists of the indices i, j and k; the projection of the new indices onto the i-j plane then gives the size and the dimension of the systolic array. The k coordinates indicate the timing; hence, they must all be non-negative integers. The direction of the data through the network is also determined by analyzing the data flow of the algorithm. The selection of the proper transformations is again done in an ad hoc manner. The optimality of the resulting systolic design depends on the chosen linear transformations.

Cappello and Steiglitz use linear transformations on recurrence equations [14].
Starting with recurrence equations describing the algorithm, a canonical representation is obtained by adjoining a time index to the definition of the recurrence. For example,

y_{i,j} ← y_{i,j-1} + w_j x_{i-j}    (2.2-11)

is the canonical representation of the FIR filter. The coordinates of the left side of the canonical form indicate the location where the computation on the right takes place. This gives a geometric interpretation to the recurrence relations. A geometric representation, together with the ordering rule, indicates what data moves, where it moves to, and when it moves. With various linear transformations of the geometric representation the original computational locations change, thus producing a different geometric shape. Then, by factoring out the time dimension (geometric projection), different designs are produced. The transformations are selected in an ad hoc fashion, and in [14] it was also shown that certain transformations lead to an AP^2 optimal design, where A is the area and P represents the period.

Another method that uses the index transformation is reported in [19,28-31,36]. (The basic concepts are reported in [30,31] and further studies of the method are reported in [19,28,29].) The organization and the operations of a systolic design are described by a set of 3-tuples (G,F,T). G represents the network geometry, the position of each cell identified by its Cartesian coordinates. The function F associated with each cell represents the operations performed by that cell. T represents the network timing; it indicates for each cell the time when the operations F occur and when the data communication takes place. Given an algorithm in the form of a high-level language, the loop bodies are ordered in a lexicographical ordering to eliminate the broadcasts. Each computation is assigned an index rather than associating the body of loops with the corresponding loop indices as reported in [17]. One or more data-dependent vectors correspond to each variable.
Let x(I1) and x(I2) be two variables generated at a statement S using indices I1 and I2. Variable x(I2) is said to be dependent on variable x(I1) if I1 < I2, in the lexicographical sense, and x(I1) is an input variable in the statement S. Then the vector d = I2 - I1 is called a data-dependent vector for variable x. Let d1, d2, . . . , dm be the set of data-dependent vectors calculated for the algorithm. Let D = [d1, d2, . . . , dm] be the data-dependent matrix of the algorithm. A linear transformation T, when applied to D, will generate the data-dependent matrix of the new structure, that is, TD = D'. Matrix D' = [d'1, d'2, . . . , d'm] represents the modified data dependences in the new index space. Entries of D' can be selected a priori, provided that a valid transformation T exists. T is said to be a valid transformation if it is a bijection, monotonic and nonsingular. Sufficient and necessary conditions for such a T to exist are reported in [30,31]. These transformation matrices can also be determined systematically from a set of predetermined cell interconnections. Each transformation matrix, satisfying certain specified constraints, maps the algorithm into an array whose cell connectivity is a subset of those predetermined interconnections [36]. This approach, which is the most complete to date, has the potential of not satisfying a property of systolic arrays, namely simple cell structures. The data dependency vectors, used as a set of algorithm characteristics, do not contain any information about the cell complexity. The operations required in the body of the inner loop determine the cell complexity. Those algorithms with complex loop bodies will produce complex cell structures. Matrix T is divided into two parts as

T = [π; s],

where π is related to time such that π d_i > 0 for all i, and s is related to the geometric properties of the algorithm. In general π is chosen with rank 1 and s with rank n-1, where n is the rank of T.
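The validity conditions on T can be sketched in a few lines. The numbers below are hypothetical (a plausible dependence matrix for three-index matrix multiplication and one candidate T), and the check covers only the two conditions stated above that are mechanical: the time row π must map every dependence vector forward in time, and T must be nonsingular.

```python
# Sketch (hypothetical numbers): checking a Moldovan-style transformation
# T = [pi; s] against a dependence matrix D. pi is the time row
# (pi . d_i > 0 for every d_i) and T must be nonsingular.

def det3(m):
    # Determinant of a 3x3 matrix given as three row tuples.
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def is_valid(T, D):
    pi = T[0]                                   # time row of T
    causal = all(sum(p * d for p, d in zip(pi, col)) > 0 for col in D)
    return causal and det3(T) != 0              # causality + nonsingularity

# Assumed dependence vectors (columns of D) for a 3-index matrix product.
D = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
T = [(1, 1, 1),      # pi: sends every dependence forward in time
     (0, 1, 0),      # s: two space rows define the projection
     (0, 0, 1)]
assert is_valid(T, D)
assert not is_valid([(1, 1, -1), (0, 1, 0), (0, 0, 1)], D)
```

Enumerating candidate T matrices and filtering with a check of this kind is one mechanical reading of "entries of D' can be selected a priori, provided that a valid transformation T exists."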
Fortes and Moldovan have used this method to model the design as a broadcast-free or reduced-broadcast systolic design [28]. Although this method is introduced here as a systematic approach to systolic designs, Moldovan, Wu, and Fortes have applied this method to map arbitrarily large algorithms into a fixed-size systolic array [29]. In particular, the approach was applied to an arbitrarily large QR algorithm. Further investigation has indicated how to model the design problem as a linear programming problem that minimizes the execution time [19].

Li and Wah's approach is to model an algorithm, represented as recurrence processes, into a non-linear integer programming problem [12,33]. A systolic design is characterized by three parameters: the velocity of the data flow, the spatial distribution of data, and the period of computations. In general, linear recurrence equations are expressed as

z^k(i,j) = f[z^{k+γ}(i,j), x(i,k), y(k,j)],    (2.2-13)

where γ = -1 if the recurrence is forward and γ = 1 if the recurrence is backward. These two recurrence types are identical in that each type can be easily converted to the other. For systolic designs only the evaluation order of the terms is important. In the above recurrence equation the superscript k represents time, and k in x(i,k) and y(k,j) represents space. Also, the above recurrence equation is of the simplest form, since x and y may themselves be functions of i, j and k. Other forms of recurrence equations are presented in [12]. Suppose the recurrence equation

z^k(i,j) = f[z^{k-1}(i,j), x(i,k), y(k,j)]    (2.2-14)

is implemented in a systolic design. It was shown that the following equations must hold:

t_kx * x_d + x_ks = t_kx * z_d,    (2.2-15)
t_ky * y_d + y_ks = t_ky * z_d,
t_i * x_d + x_is = t_i * y_d,
t_i * z_d + z_is = t_i * y_d,
t_j * y_d + y_js = t_j * x_d,
t_j * z_d + z_js = t_j * x_d,
|t_kx| = |t_ky| = |t_k|,
|t_i| ≥ 1, |t_j| ≥ 1, |t_k| ≥ 1,

where the x's, y's, and z's are vectors and the t's are scalars [12,33].
x_d represents the velocity of datum x, measured in terms of the directional distance passed during a clock cycle. It is assumed that the distance between any two neighboring cells is one. If there are m-1 delay registers between two neighboring cells in the pipeline direction of x, then |x_d| = 1/m ≤ 1. y_d and z_d are measured similarly to x_d. x_is, x_js, x_ks, y_is, and so on, represent data distribution vectors and they are non-zero integers. For example, x_is is the directional distance between x(i,j) and x(i+1,j); x_js is the directional distance between x(i,j) and x(i,j+1). The t's are computed as follows:

t_i = τ[z^k(i+1,j)] - τ[z^k(i,j)],    (2.2-16)
t_j = τ[z^k(i,j+1)] - τ[z^k(i,j)],
t_k = τ[z^{k+1}(i,j)] - τ[z^k(i,j)],

where τ(.) is the computation time. t_kx and t_ky are the time differences between the accesses of two consecutive items of variables x and y, respectively, in increasing order of the indices. For the backward recurrences, t_kx and t_ky are positive; they are negative for the forward recurrences. It was shown that these equations represent the fundamental space-time relationship in systolic designs [12,33]. Therefore, in order to produce an optimal systolic array, the design problem can be formulated as an optimization problem using these equations as constraints. The objective function is formulated based on the design criteria, depending on the above parameters and the problem size. For example, taking the minimum computation time criterion, the objective function may take the form g(t_i, t_j, t_k, n1, n2), where n1 and n2 represent the problem size. For example, the objective function for multiplying two N x N matrices is

N * |t_k| + (N - 1) * |t_i| + (N - 1) * |t_j|.    (2.2-17a)

The optimal systolic design then can be formulated as:

minimize g(t_i, t_j, t_k, n1, n2)    (2.2-17b)

subject to the constraints of Eq.
2.2-15 and

|t_k| * |z_d| = k1,    (2.2-17c)
|t_i| * |y_d| = k2,
|t_j| * |x_d| = k3,

where k1, k2, and k3 are integers which represent the distance traversed for the next computation. The solution to the non-linear programming problem involves finding integer or rational numbers for all the parameters that describe the space-time relationships of the systolic design.

Miranker and Winkler's method [18] is similar to that described in [30,31]. Given an algorithm in a high-level language with nested do-loops, loop index expansion is performed in an ad hoc fashion to eliminate the broadcasts. The normal order of the operations corresponds to a set of indices which are referred to as the nodes of the computations. The dependence structure Γ is represented by a graph. For example,

B(2,4) = B(1,4)    (2.2-18)

and

C(2,4) = C(2,3) + A(2,4) * B(2,4)    (2.2-19)

indicate that these statements require computations at nodes (1,4), (2,3), and (2,4). These nodes are called the domain of dependence for node (2,4). In the structure representing Γ, dependence vectors (edges) are formed based at node (2,4) which point back to the nodes in its domain of dependence. For this example the dependence vectors are (-1,0), (0,-1), and (0,0), and are denoted by Γ. For systolic design, Γ is homogeneous; that is, the dependence structure at one node is the same as that at any other node. Furthermore, two structures are isomorphic as graphs if there is a 1-to-1 correspondence f between them such that, if p-q is a dependence vector, then f(p)-f(q) must also be a dependence vector, where p and q are nodes. Different structures result by mapping p-q onto f(p)-f(q). These mappings are the affine mappings and they are given by the linear maps. An mxn integer matrix A is selected as a linear mapping. Let d1, d2, ..., dm be the dependence vectors of a structure; then A d_i = d'_i for i = 1, ..., m gives the dependence vectors of the new structure. Conditions for the existence of A are given in [18].
The first element in every d'_i corresponds to time and the n-1 other elements indicate the space. For spatial representation of the systolic design, a universal planar systolic array (in 2-D) is constructed where each connection corresponds to a dependence vector in space. A universal planar systolic array contains all of the possible dependence vectors. The last n-1 elements of every d'_i represent the dependence vector in space and correspond to a portion of the universal systolic array.

2.2.3 DATA FLOW TRANSFORMATION

Gannon has defined a set of vector operators to derive a function specification from an algorithm where parallelism is explicitly represented [21]. The representation is described from one domain to another, for instance f: D1 → D2. For example, f := y - (x + z) is represented as f: R^3 → R, where (x,y,z) ∈ R^3 and f ∈ R. Graphically, f may be represented as in Figure 2.3.

Figure 2.3. A cell function node [21].

The product operator (Π) represents the vectored application of a function, which is the simplest form of parallelism. For example, Π_{i=1}^{n} f_i represents the concurrent operation of the n functions f_i. Gannon has defined other operators such as the permutation and data movement operators, the chain operator, and the systolic iteration operator [21]. The permutation and data movement operators are performed on the data in order to define complex data permutation and movement operations. For example, the rotation operator performs a right circular shift data movement. The chain operator is used to chain a sequence of copies of f where the output of one is connected to the input of the next one. The systolic iteration operator describes operations where basic functions are used multiple times. With these operators the data flow graph of an algorithm is determined. Based on the properties of the algorithm and the operators used, the data flow graph can be mapped into a systolic design.
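The flavor of these operators can be conveyed with a small sketch. This is an illustrative rendering, not Gannon's notation: the product operator becomes a vectored application, the rotation operator a right circular shift, and the chain operator a composition of copies of f.

```python
# Sketch (illustrative of Gannon-style operators): vectored application,
# a rotation data-movement operator, and a chain of copies of f.

def product(f, xs):
    # Product operator: f applied concurrently to each element.
    return [f(x) for x in xs]

def rotate(xs):
    # Rotation operator: right circular shift data movement.
    return [xs[-1]] + xs[:-1]

def chain(f, n, x):
    # Chain operator: the output of one copy of f feeds the next.
    for _ in range(n):
        x = f(x)
    return x

inc = lambda v: v + 1
assert product(inc, [1, 2, 3]) == [2, 3, 4]
assert rotate([1, 2, 3]) == [3, 1, 2]
assert chain(inc, 4, 0) == 4
```

Composing such operators yields the data flow graph from which, as noted above, a systolic design can be derived.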
Ramakrishnan’s approach is to map a homogeneous program graph onto a linear array of processors [16]. A program graph is a directed acyclic graph which represents computations. The nodes represent the computation of a function, and the function arguments at each node are the values represented by the incoming edges at that node. A program graph is called homogeneous if it is suitable for execution on a linear array. A homogeneous program graph is a labeled graph G = (V, E, L). V = V1 ∪ S1 ∪ S2, where S1 represents the sink nodes, S2 represents the source nodes, and V1 represents the remaining nodes. E is the edge set, where each edge is labeled based on its direction. L is a finite set and consists of k labels. Each node in V1 has k incoming edges and k outgoing edges, where each incoming and outgoing edge is assigned a unique label from L. The homogeneous graphs are limited subsets of general data flow graphs, and several problems such as sorting, convolution, and vector multiplication can be represented as homogeneous graphs. The mapping of a homogeneous program graph G is a process that partitions G into sets of nodes where each set is mapped onto the same processor (cell). A precise formulation of a syntactically correct mapping and correct execution of G on a linear array are given in [16]. Furthermore, it was shown that if there exists a syntactically correct mapping of G, then G belongs to a class of homogeneous graphs where there exists a connected subgraph SG in G such that V_G ⊆ V_SG and SG is a mesh graph. The partitioning of G is performed via a technique called “diagonalization of the mesh graph SG” [16]. The representation of the algorithms in homogeneous program graphs is performed in an ad hoc fashion.
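The labeling condition that makes a program graph homogeneous is purely structural and easy to check mechanically. The sketch below uses a hypothetical edge encoding (triples of source, destination, label) and verifies only the condition stated above: every interior node has k incoming and k outgoing edges, each carrying a distinct label from L.

```python
# Sketch (hypothetical encoding): structural check of the homogeneity
# condition on a labeled program graph G = (V, E, L).

def is_homogeneous(edges, interior, L):
    # edges: list of (src, dst, label); interior: the nodes of V1.
    for v in interior:
        ins = [lab for (s, d, lab) in edges if d == v]
        outs = [lab for (s, d, lab) in edges if s == v]
        if sorted(ins) != sorted(L) or sorted(outs) != sorted(L):
            return False
    return True

# Tiny two-label example: sources s1, s2; interior nodes a, b; sinks t1, t2.
L = ["left", "down"]
edges = [("s1", "a", "left"), ("s2", "a", "down"),
         ("a", "b", "left"), ("a", "t1", "down"),
         ("s1", "b", "down"), ("b", "t1", "left"), ("b", "t2", "down")]
assert is_homogeneous(edges, ["a", "b"], L)
```

A graph failing this check cannot be partitioned onto a linear array by the diagonalization technique, since the per-node label structure is what the mapping exploits.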
2.2.4 DEPENDENCE GRAPH TRANSFORMATION

Quinton's method consists of three steps: 1) defining the Uniform Recurrent Equations (URE) of the algorithm; 2) finding a timing function that is compatible with the dependencies of the URE; and 3) finding an allocation function that transfers the URE onto a systolic design [20]. Consider the following URE equations:

U1(z) = f[U1(z - v1), U2(z - v2), ..., Up(z - vp)],    (2.2-21)
Ui(z) = Ui(z - vi) for i = 2, ..., p,

where z ∈ D ⊂ Z^n, a convex set of Cartesian coordinates, f is a given p-variable function, and the v_i are the dependence vectors, with v_i ∈ Z^n for all i. D is the domain of the URE, and together with the dependence vectors it defines a dependency graph of the URE. The dependency graph is denoted as the pair (D, V), where v_i ∈ V for all i. Before finding a timing function, a convex space of feasible timing functions is determined. A timing function t for a dependence graph (D, V) is feasible if t maps Z^n to Z such that t is nonnegative over D and, for all x, y ∈ D, t(x) > t(y) if x depends on y. The feasible timing functions are restricted to functions called Quasi-Affine Timing Functions (QATF) defined as

t(x) = ⌊a^t x - b⌋,  a ∈ Q^n and b ∈ Q,    (2.2-22)

where Q denotes the set of rational numbers, and a and b are calculated under certain constraints [20]. Lastly, the URE equations are mapped onto a finite set of processors (cells) using an allocation function g. Function g maps Z^n to Z^{n-1} such that g(D) is finite and, for all x, y ∈ D, g(x) = g(y) implies t(x) ≠ t(y). The feasible allocation functions are restricted to the Quasi-Linear Allocation Functions (QLAF), which have good geometric properties. Let g(x) = (g1(x), g2(x), ..., g_{n-1}(x))^t. g is said to be a QLAF if the g_i, for all i, are linear functions from Z^n to Z, and g_i(x) is of the form c(x) mod q, where c is a linear function from Z^n to Z and q is a positive integer. One way of finding a QLAF is to project D along a conveniently chosen direction.
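The two feasibility conditions, on the timing function and on the allocation function, can be checked directly on a small domain. The numbers below are hypothetical: a 4 x 4 domain, dependence vectors (1,0) and (0,1), the quasi-affine timing function t(i,j) = ⌊i + j⌋, and an allocation obtained by projecting along (1,0) so that g(i,j) = j.

```python
# Sketch (hypothetical numbers): checking a quasi-affine timing function
# t(z) = floor(a.z - b) against the dependence vectors of a URE, then
# allocating cells by projecting the domain along a chosen direction.

from math import floor
from fractions import Fraction as F

def timing(a, b, z):
    return floor(sum(ai * zi for ai, zi in zip(a, z)) - b)

def is_compatible(a, b, deps, domain):
    # t must strictly increase along every dependence inside the domain.
    return all(timing(a, b, z) >
               timing(a, b, tuple(zi - vi for zi, vi in zip(z, v)))
               for z in domain for v in deps
               if tuple(zi - vi for zi, vi in zip(z, v)) in domain)

domain = {(i, j) for i in range(4) for j in range(4)}
deps = [(1, 0), (0, 1)]                  # v1, v2 of the URE
a, b = (F(1), F(1)), F(0)                # t(i, j) = floor(i + j)
assert is_compatible(a, b, deps, domain)

# Allocation by projection along (1, 0): g(i, j) = j. Points mapped to
# the same cell must all carry distinct time steps.
g = lambda z: z[1]
cells = {}
for z in domain:
    cells.setdefault(g(z), set()).add(timing(a, b, z))
assert all(len(ts) == 4 for ts in cells.values())
```

The final assertion is exactly the condition that g(x) = g(y) implies t(x) ≠ t(y) on this domain: each of the four cells receives four computations at four distinct times.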
Within the feasible ranges, the timing functions and the allocation functions are selected arbitrarily.

2.2.5 HIGH LEVEL DESCRIPTION TRANSFORMATION

Lam and Mostow apply a series of transformations to the high level description of the algorithm [26]. The method is implemented in a program called SYS that accepts the algorithm described by for-loops, begin-end blocks, simple unnested function calls, and scalar and array variables. A series of transformations are applied to the abstract algorithm until a function level design, expressed in a predefined language, clearly describes the systolic design. In the high level description of the algorithm, the user also indicates explicitly whether to implement each sequential statement, such as a for-loop, in parallel or in place. A loop can be implemented either by performing each iteration on the same cell or by performing each iteration on a different cell. For example,

FOR i FROM 1 TO n IN PARALLEL DO
  FOR j FROM 1 TO m IN PLACE DO
    p_i := p_i * x_i + a_j

will result in a systolic design where the x's and p's stay in place and the a's are broadcast. Additionally,

FOR i FROM 1 TO n IN PLACE DO
  FOR j FROM 1 TO m IN PARALLEL DO
    p_i := p_i * x_i + a_j

produces a systolic design where the a's stay in place and the x's and p's move in the same direction. A systolic design is described as consisting of two parts, the structure and the driver. The structure part describes the hardware cells and their interconnections. The driver part describes the format and the timing of the data streams. In order to correctly perform the computations, two scheduling constraints are applied: 1) the physical constraint, implying no data collision, and 2) the logical constraint, implying data dependency. The effect of these two constraints appears in the structure of the design. These constraints serve as a guide in choosing to implement, for example, a loop in a sequential scheme, a parallel scheme, or a pipelined scheme.
Finally, a series of optimization steps are applied to the resulting systolic design. For example, if the resulting design requires repeated values as input, then an optimization step would be to change the design so that repeated values are preloaded. Or, if the cells use a common input datum, then an optimization step would be to change the design so that the common input datum propagates through the cells. SYS produces a simple-minded representation of the systolic design. Systematic and user-determined transformations are then applied to generate better designs.

Ibarra, Kim, and Palis' method involves: 1) the representation of the algorithm computations by a sequential machine, such as a Turing machine, and 2) the transformation of the sequential machine to a systolic design [39]. They have shown that a 1-dimensional systolic array with time complexity T(n), where n is the number of inputs, can be represented by an equivalent 1-dimensional sequential machine with "sweep" complexity T(n). The computations of a 1-dimensional systolic array can be "unrolled" in space and time to produce a 2-dimensional array exhibiting combinational circuits. (This array indicates the flow of data and the communication topology.) It was shown that the computations of the equivalent sequential machine can be simulated by the combinational circuits [39]. Thus, the problem of designing systolic arrays is translated to that of designing sequential machines. Since in the design of a sequential machine one does not deal with the problems of concurrency and synchronization, the design of systolic arrays is simplified. Once a sequential machine is designed for an algorithm, the computations of the sequential machine are mapped onto an array exhibiting combinational circuits (similar to those derived from unrolling systolic arrays in space and time). "Rolling up" the computation of this array produces the systolic array.
The construction of the sequential machine is done in an ad hoc manner.

2.2.6 RETIMING TRANSFORMATION

Leiserson's approach is not the transformation of an algorithm to a systolic design but rather the application of a retiming transformation to an existing synchronous system in order to improve the performance [13,32,34]. A synchronous system is viewed abstractly as a network consisting of functional elements (cells) and globally clocked registers (delays). The network is then modeled as a finite, rooted, vertex-weighted, edge-weighted, directed multigraph G = (V, E, v_h, d, w). V is the set of vertices (cells); E is the set of interconnection edges between the cells; v_h represents the host computer or the outside world; d is the propagation delay function, where d(v) ≥ 0 for v ∈ V represents the propagation delay of cell v; and w is the delay function indicated by the number of registers on each edge, where w(e) ≥ 0 for e ∈ E represents the delay on edge e. Furthermore, it is assumed that G has no directed cycles of zero weight. A retiming of a system represented as G = (V, E, v_h, d, w) is a transformation of G in which registers are added and deleted in order to change G into a new graph G_r = (V, E, v_h, d, w_r). This transformation is a mapping r: V → Z with r(v_h) = 0, where r is an integer-valued vertex-labeling function and w_r(e) = w(e) + r(v) - r(u), where e ∈ E and e is the directed edge from u to v. Similarly, w_r(p) = w(p) + r(v) - r(u), where p is any path from u to v. Let Φ(G) represent the clock period of G; then

Φ(G) = max{d(p) : w(p) = 0},    (2.2-23)
W(u, v) = min{w(p)},    (2.2-24)
D(u, v) = max{d(p) : w(p) = W(u, v)},

where p is a path from u to v. It was shown in [13] that Φ(G) ≤ c is equivalent to: if D(u, v) > c then W(u, v) ≥ 1, where c is an integer constant. The retiming transformation is modeled as a linear programming problem to find a retiming function r such that Φ(G_r) < Φ(G). r is said to be a legal retiming function of G such that Φ(G_r) ≤ c < Φ(G) if and only if:

1. r(v_h) = 0,
2. r(u) - r(v) ≤ w(e),
3. r(u) - r(v) ≤ W(u, v) - 1 such that D(u, v) > c,

for all e ∈ E and u, v ∈ V, where edge e is from u to v. The problem can also be modeled as a search problem to find the minimum c [13]. The work presented in [13,32] applies to all synchronous systems. Here only those synchronous systems that can be transformed to a systolic design, where rippling broadcasts are eliminated, are presented. Systems suitable for systolic design are assumed to have d(v) = 1 for all v ∈ V. Let G = (V, E, v_h, 1, w) and let G - 1 = (V, E, v_h, 1, w'), where w'(e) = w(e) - 1 for all e ∈ E. G can be transformed to a systolic design if and only if G - 1 has no cycles of negative weight [13,32]. In other words, there exists a retiming function r of G such that Φ(G_r) ≤ c if and only if G - 1 contains no cycles having negative edge weight. In addition to improving the system clock Φ by the retiming transformation, the system clock may be further improved by the k-slow transformation of G. The k-slow transformation of G implies multiplying all the edge weights of G by an integer k > 0. That is, G_k = (V, E, v_h, d, w_k) is the k-slow transformation of G = (V, E, v_h, d, w) if w_k(e) = k * w(e) for all e ∈ E. If G - 1 has cycles of negative weight, then G may be k-slow transformed to G_k such that G_k - 1 has no cycles of negative weight. G_k is then transformed to a systolic design by applying the retiming transformation.

Kung's retiming transformation is applied to the time-invariant Signal Flow Graph (SFG) representation of algorithms [15]. In the SFG, in general, a node represents an arithmetic or logic operation performed with zero delay, and an edge represents a function or a delay. Let D and D' represent delay elements and lower case letters represent constants. An SFG will have on its edges either a delay element or a constant indicating a multiplication by that constant. There are two classes of SFGs, those with local interconnections and those with global interconnections.
Since SFGs with global interconnections, such as the FFT, require spatially global interconnections, they are not suitable for systolic designs. Here only the SFGs with local interconnections, such as the FIR, are considered. A general systematic procedure for transforming SFGs into systolic designs is based on two simple rules.

Rule 1 (Time-Scaling): all delays D may be scaled, that is, D → nD'. Similarly, the corresponding input and output rates have to be scaled with respect to the new time unit D' by a factor n.

Let a cut-set of an SFG be defined as a minimal set of edges which partitions the SFG into two parts. The edges of the cut-set are grouped as inbound and outbound edges depending on the directions assigned to the edges.

Rule 2 (Delay-Transfer): advancing k time units on all the outbound edges and delaying k time units on all the inbound edges, and vice versa, will not affect the general system operation. The effects of lags and advances cancel each other in the overall timing.

Based on these two rules it was shown that any zero-delay edge that is non-temporally local can be transformed into a nonzero-delay edge. This is done by selecting a "good" cut-set that includes the target edge (the zero-delay edge) but does not include zero-delay edges in the opposite direction to the target edge. The cut-set could include zero-delay edges going in the same direction as the target edge. According to Rule 2, the nonzero-delay edges in the opposite direction can give one or more spare delays to the target edge(s). If there are no spare delays, Rule 1 is applied to increase the number of delays in all the edges. It was shown that there always exist "good" cut-sets for SFGs with local interconnections [13,32].

2.3 SUMMARY OF THE LITERATURE REVIEW

In the previous sections several existing methodologies for systematically constructing systolic designs were discussed. In general these methods can be grouped into two types.
Type 1 methodologies start with a problem represented in a high-level language, e.g. a data flow graph or a mathematical expression, and via systematic and/or ad hoc procedures construct systolic designs. Alternatively, type 2 methodologies start with a signal flow graph or a simply ("poorly") constructed systolic design and via systematic and/or ad hoc procedures construct systolic designs exhibiting improved performance. Although the existing methods differ in approach, their common goal is the formal construction of systolic arrays. They take into account important design properties such as functional correctness, area minimization, propagation time, communication topology, and modularity. These methods are also similar in not directly incorporating the systolic array characteristics in the design process. That is, they do not explicitly use array characteristics (i.e. simple cells, simple and regular communications, and local connectivity) as a guide in determining the systolic solutions. The approaches presented by Moldovan [36] and Li and Wah [12,33] are the only approaches which use certain systolic array architectural properties in the design processes. Moldovan's approach, which applies to algorithms described by nested do-loops, uses predetermined cell interconnection patterns in the design process. The approach uses algorithm data dependency vectors which are determined from the inner loop bodies. New dependency vectors, which favor systolic designs, are determined from a set of predetermined cell interconnections. The approach, however, does not use any information to control the cell complexities. Those algorithms with complex loop bodies will produce complex cell structures. Li and Wah's approach represents the systolic array's communication topology by certain constraint equations in terms of data dependency information derived from the algorithm.
Systolic solutions are produced by solving an integer linear programming problem satisfying a certain objective function subject to the constraint equations. The objective function represents certain optimization criteria, such as minimum computation time. The drawback of this method is that it requires solving an integer linear programming problem. Furthermore, the design information is in terms of the velocity of data flow, relative operand distances, and period of computations. These constitute operational information rather than design information. The model does not produce the individual cell locations and the input/output operands of each boundary cell. The transformation of the velocity of data flow, relative operand distances, and period of computations to a systolic array involves ad hoc steps. Each of the existing approaches for systematic construction of systolic arrays was briefly presented in Section 2.2. Fortes, Fu and Wah have compared many of these methodologies using the Y-chart [22]. Among the type 1 methodologies discussed, the index transformation, presented by Moldovan, has the fewest ad hoc steps.

CHAPTER 3

ALGORITHM DESCRIPTION AND REPRESENTATION

The new systolic design approach presented herein differs from existing approaches in that the array characteristics (i.e. simple cells, simple and regular control and communications, and local cell interconnections) are incorporated in the design process. In this chapter the properties of simple cells are used to decompose the algorithms. That is, the complexity of the processing elements themselves contributes to the decomposition of the algorithm's computations into a set of equivalent recurrence equations.

Definition 3.1 — A recurrence equation is called a simple computation if it consists of one, two, or three variables used in the resultant's calculation. It consists of one output operand and one, two, or three input operands and zero, one, or two operators.
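As a rough illustration of Definition 3.1, a simple computation can be modeled as a record with one output operand, one to three input operands, and zero to two operators. The class and field names below are my own invention, not the dissertation's notation.

```python
# Illustrative only: a minimal container enforcing Definition 3.1 --
# one output operand, 1..3 input operands, 0..2 operators.

class SimpleComputation:
    def __init__(self, output, inputs, operators):
        if not (1 <= len(inputs) <= 3 and 0 <= len(operators) <= 2):
            raise ValueError("not a simple computation under Definition 3.1")
        self.output, self.inputs, self.operators = output, inputs, operators

# e.g. the back-substitution computation x22 = x23 - u23 * x31
c4 = SimpleComputation("x22", ["x23", "u23", "x31"], ["-", "*"])
print(len(c4.inputs), len(c4.operators))   # 3 2

# A four-input, three-operator equation is rejected and must first be
# decomposed into simpler expressions.
try:
    SimpleComputation("y", ["a", "b", "c", "d"], ["+", "+", "+"])
except ValueError:
    print("rejected")
```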
(Some of the input operands may be made available at the output to be used by another cell). For systolic array implementation, a simple computation defines a cell structure with constant area and constant latency. Only the simple computations with one, two, or three input operands and zero, one, or two operations are considered. These correspond to the simple cells shown in Figure 3.1. In Figure 3.1b the partial results remain stationary in the cell.

Figure 3.1. Possible structures of simple cells.

A primary input may be preloaded into the cell's internal input register, which also remains stationary during the computations. In this case, the primary input is a part of every simple computation performed in that cell. These cell models correspond to most of the existing systolic solutions. The method described here can also be extended to more complex cell models. However, complex cell models may require complex cell communications, which are not recommended for systolic designs.

3.1 ALGORITHM DESCRIPTION

In order to satisfy the simple cell characteristics of systolic arrays it is important that the mathematical expressions and the body of the nested loops have simple structures. That is, if an expression is complex and involves many variables and many operations, it should be decomposed into two or more simpler expressions with fewer variables and fewer different operations. For example, an Infinite Impulse Response (IIR) filter with seven weights may be described by

    yi = Σ(k=1 to 3) wk * y(i-k) + Σ(l=1 to 4) vl * x(i-l+1)  for i > 0.    3.1-1

Eq. 3.1-1 can be expressed by two simple and equivalent expressions as follows:

    si = Σ(l=1 to 4) vl * x(i-l+1)  for i > 0;    3.1-2
    yj = sj + Σ(k=1 to 3) wk * y(j-k)  for j > 0;

where si is introduced as an intermediate variable. It will be noted later that Eq. 3.1-2 corresponds to cells consisting of three input operands and one output operand.
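The decomposition of Eq. 3.1-1 into Eq. 3.1-2 can be checked numerically. The sketch below is mine; since the exact index conventions in the scanned equations are ambiguous, it assumes a conventional causal IIR with three feedback weights w and four feedforward weights v, and the sample weights and input signal are invented.

```python
# Sketch (assumed index conventions): y[i] = sum_k w[k]*y[i-k] + sum_l v[l]*x[i-l+1].
# The intermediate s isolates the feedforward part so each step needs at most
# three inputs and two operations, matching the simple-cell constraint.

w = [0.5, -0.25, 0.125]          # three feedback weights (hypothetical)
v = [1.0, 2.0, 1.0, 0.5]         # four feedforward weights (hypothetical)

def iir_direct(x, n):
    """Single-expression form (Eq. 3.1-1)."""
    y = []
    for i in range(n):
        acc = sum(w[k] * y[i - 1 - k] for k in range(3) if i - 1 - k >= 0)
        acc += sum(v[l] * x[i - l] for l in range(4) if i - l >= 0)
        y.append(acc)
    return y

def iir_decomposed(x, n):
    """Two-expression form (Eq. 3.1-2): feedforward sums s, then feedback."""
    s = [sum(v[l] * x[i - l] for l in range(4) if i - l >= 0) for i in range(n)]
    y = []
    for j in range(n):
        y.append(s[j] + sum(w[k] * y[j - 1 - k] for k in range(3) if j - 1 - k >= 0))
    return y

x = [1.0, 0.0, 2.0, -1.0, 3.0, 0.5]
print(iir_direct(x, 6) == iir_decomposed(x, 6))   # True
```

The two forms compute identical values; the decomposition only reassociates the feedforward partial sum into the intermediate variable s.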
Each cell performs multiplication and addition operations. If an expression cannot be decomposed into simpler expressions then the cell structure must be changed to accommodate the more complex computations. But, as mentioned, simple cell structure is a fundamental criterion of systolic array design and thus non-decomposable expressions may indicate that the algorithm is not suitable for systolic array implementation. The algorithms under consideration are grouped into space-indexed and time-indexed algorithms.

Space-Indexed Algorithms — These algorithms have uniform and bounded index spaces. The indices of these algorithms can be represented in geometrical space. The total number of values to be computed is finite and is computed from the size of the index space. For example, matrix manipulations are space-indexed algorithms.

Time-Indexed Algorithms — These algorithms are described by mathematical expressions with at least one unbounded index space. They may require an infinite number of values to be computed. However, there often exists a Computation Window Index Space (CWIS) for each of the variables. The sizes of these windows are determined from the bounded index space. (If an expression does not have a bounded index space, then it is a series. Therefore, the problem of computation no longer exists; the problem becomes whether the series converges or diverges). For each expression, the window size is equal to the number of elements in the bounded index space. These windows, by definition, are bounded. For example, in Eq. 3.1-2 the CWIS for si is 4 and for yj it is 3. That is, Eq. 3.1-2 can be modified as follows:

    si = Σ(l=1 to 4) vl * x(i-l+1)  for 1 ≤ i ≤ 4;    3.1-3
    yj = sj + Σ(k=1 to 3) wk * y(j-k)  for 1 ≤ j ≤ 3.

Eq. 3.1-3 is then used to construct systolic solutions for the IIR filter. Let A be the set of all algorithms considered in this study. The following assumptions are made for an algorithm A ∈ A: 1.
A ∈ A is highly structured and can be represented in a closed form such as a mathematical expression. 2. A ∈ A can be represented by simple computations. Each simple computation consists of one output operand and one, two, or three input operands and zero, one, or two operations. An output operand is either a primary output or a partial result. An input operand is either a primary input or a previously computed output. 3. A ∈ A has a uniform and bounded operand index space. Or, A can be represented with a uniform and bounded operand index space, such as in the case of some time-indexed problems.

3.2 ALGORITHM REPRESENTATION

This section describes the method by which an algorithm is transformed into its equivalent set of simple computations. The method is illustrated with examples.

Example 3.1 — Consider the square matrix-matrix multiplication algorithm in A. Let AB = C where A, B, and C are all N×N matrices. This algorithm can be described by the following mathematical expression:

    cij = Σ(k=1 to N) aik * bkj  for j = 1, 2,..., N and i = 1, 2,..., N.    3.2-1

Let Z be the set of integers and let Z^m = {(z1, z2, ..., zm) | zi ∈ Z}. The matrix-matrix multiplication algorithm has uniform and bounded index spaces which are defined as follows:

    Da = {I = (i,k)^t | 1 ≤ i ≤ N, 1 ≤ k ≤ N};    3.2-2
    Db = {I = (j,k)^t | 1 ≤ j ≤ N, 1 ≤ k ≤ N};
    Dc = {I = (i,j)^t | 1 ≤ i ≤ N, 1 ≤ j ≤ N};

where Da, Db, Dc ⊆ Z^2 are the index spaces of the matrices A, B, and C, respectively. To illustrate the calculations more explicitly, for N = 3, i = 1, and j = 1, 2, Eq. 3.2-1 can be expressed as follows:

    c11 = a11 * b11 + a12 * b21 + a13 * b31;    3.2-3
    c12 = a11 * b12 + a12 * b22 + a13 * b32.

Eq. 3.2-3 indicates that the resultants c11 and c12 are determined from the summation of certain specified products. The order in which the products are computed and summed is not important. This provides considerable freedom as to how these calculations can be performed. The natural order of performing these products is in the order of data availability.
For the matrix multiplication algorithm, the elements of matrices A and B are primary inputs and hence available to form the products in almost any desired order. In the following representation the products are computed in the lexicographical order of indices.

    ci,j,k ≡ 0 for k = 0;    3.2-4
    ci,j,k = ci,j,k-1 + aik * bkj for k = 1,..., N;
    cij ≡ ci,j,N.

The initial partial results are set to zero and each cij is computed through a series of recursive calculations. From Eq. 3.2-4, the index space Dc is redefined as:

    Dc = {I = (i,j,k)^t | 1 ≤ i ≤ 3, 1 ≤ j ≤ 3, 0 ≤ k ≤ 3},    3.2-5

where Dc ⊆ Z^3. It will be shown that the index differences in the lexicographical sense can be used as design information and will be automatically extracted from the simple computations. The lexicographical (lex) difference of two indices I and J in Z^m is calculated as:

    lex(I, J) = |I - J|.    3.2-6

For consistency, the dimensions of the index sets Da and Db are artificially extended to match the dimension of Dc. That is,

    Da = {I = (i,j,k)^t | 1 ≤ i ≤ 3, 1 ≤ j ≤ 3, k = 0};    3.2-7
    Db = {I = (i,j,k)^t | 1 ≤ i ≤ 3, 1 ≤ j ≤ 3, k = 0}.

Each of the index spaces Da, Db, and Dc is now in Z^3. Eq. 3.2-4 is rewritten as follows:

    ci,j,k ≡ 0 for k = 0;    3.2-8
    ci,j,k = ci,j,k-1 + ai,k,0 * bk,j,0 for k = 1,..., N;
    cij ≡ ci,j,N.

For example, if Ic = (2,2,1) ∈ Dc and Ia = (1,1,0) ∈ Da, then lex(Ic, Ia) = (1,1,1). (End of Example 3.1.)

In Eq. 3.2-8, index k is introduced to assign a precedence order to the basic computations of the algorithm. The expression ci,j,k = ci,j,k-1 + ai,k,0 * bk,j,0 for k = 1,..., N is a forward recursive function. However, such index expansion may result in forward or backward recursive functions depending on the algorithm type. The algorithms in A are therefore classified into two types.

Type 1 — The primary outputs are computed in increasing order of indices. For example, the solution of the linear equations Lx = y, where L is a lower triangular matrix and x = (x1, x2, ..., xn-1, xn)^t and y = (y1, y2, ..., yn-1, yn)^t
are vectors, is a type 1 algorithm. Note that x1 is computed before x2.

Type 2 — The primary outputs are computed in decreasing order of indices. For example, the solution of the linear equations Ux = y, where U is an upper triangular matrix and x = (x1, x2, ..., xn-1, xn)^t and y = (y1, y2, ..., yn-1, yn)^t are vectors, is a type 2 algorithm; xn is computed before xn-1. Note that hybrid algorithms are also possible.

Index-Expansion Rule — If A ∈ A is a type 1 algorithm, then the recursive functions formed by the simple computations must be forward recursive functions. The functions must be backward recursive functions if A is of type 2. For example, the recursive function ci,j = ci,j-1 + ... is a forward recursive function and ci,j = ci,j+1 + ... is a backward recursive function. If A is both a type 1 and type 2 algorithm (hybrid), then the simple computations will form both forward and backward recursive functions. The order of computations in both cases corresponds to the order of the computed primary outputs. In Example 3.1, the matrix-matrix multiplication algorithm is a type 1 algorithm. An example of a type 2 algorithm is given in the next section.

3.2.1 INITIAL PARTIAL RESULTS

Each simple computation involves a set of operations performed on the partial result and other input operands. Naturally, at the start of the operation the initial partial results must be defined. These initial partial results are determined either from the algorithm, from the type of operations performed, or both. In Example 3.1, the initial partial results, as indicated by Eq. 3.2-4, are determined from the operations. Each primary output cij of Eq. 3.2-4 is the sum of several products. Therefore, the initial partial results are assumed to be zero. If the partial results are primary inputs, then for each such primary input, a simple computation (with only one input operand) is used. The purpose of this is to easily identify the primary inputs in the final design process.
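The forward-recursive formulation of Eq. 3.2-4 can be executed directly. The code below is my own rendering, not the dissertation's: each cij starts from a zero initial partial result and is accumulated in increasing order of k; the sample matrices are invented.

```python
# Direct rendering of Eq. 3.2-4: c[i,j,0] = 0;
# c[i,j,k] = c[i,j,k-1] + a_ik * b_kj;  c_ij = c[i,j,N].

N = 3
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # hypothetical 3x3 inputs
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

c = {}
for i in range(1, N + 1):
    for j in range(1, N + 1):
        c[i, j, 0] = 0                      # initial partial result
        for k in range(1, N + 1):           # forward recursive function
            c[i, j, k] = c[i, j, k - 1] + a[i - 1][k - 1] * b[k - 1][j - 1]

result = [[c[i, j, N] for j in range(1, N + 1)] for i in range(1, N + 1)]
print(result)   # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]
```

Each entry of `result` agrees with the ordinary matrix product AB, confirming that the recursive simple computations are equivalent to Eq. 3.2-1.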
The following example consists of primary input operands and a different set of operations.

Example 3.2 — Consider the algorithm for the solution of the upper triangular linear system Ux = b in A, where U is an N×N upper triangular matrix and x and b are N×1 vectors. This is a type 2 algorithm and is described as follows:

    xi = (bi + Σ(j>i) -uij * xj) / uii  for i = N,..., 2, 1.    3.2-9

Eq. 3.2-9 is not the only way this algorithm can be described. For instance, the result inside the parentheses can be multiplied by uii^-1 instead of being divided by uii. In the former description, the algorithm consists of multiplication, subtraction, and division operations whereas in the latter, it consists of only multiplication and subtraction operations (provided that the uii^-1's are predetermined). For this algorithm, the calculations for N = 3 are explicitly expressed as follows:

    x3 = (b3) / u33;    3.2-10
    x2 = (b2 - u23 * x3) / u22;
    x1 = (b1 - u12 * x2 - u13 * x3) / u11.

Eq. 3.2-10 indicates that the resultant xi's are determined from certain specified products which are subtracted from the bi's and then divided by the uii's. The order of calculating the products is not specified and provides some design freedom. The natural order for performing the basic computations (e.g. the products) is in the order of data availability. In this algorithm, the elements of vector x are calculated by back substitution; that is, x3 is calculated before x2 and so on. The natural order of performing these calculations is in the order of availability of the elements of vector x. The following backward recurrence equations assume this natural order:

    xi,k ≡ bi for k = N-i+3,    3.2-11
    xi,k = xi,k+1 for k = N-i+2,
    xi,k = xi,k+1 - ui,i+k-1 * xi+k-1,1 for k = 2,..., N-i+1,
    xi,k = xi,k+1 / uii for k = 1,
    xi ≡ xi,1, for i = N,..., 2, 1.

For N = 3, the simple computations of Eq.
3.2-11 are expanded as follows:

    x3 ≡ x31;    3.2-12
    x31 = x32 / u33;
    x32 = x33;
    x33 ≡ b3;
    x2 ≡ x21;
    x21 = x22 / u22;
    x22 = x23 - u23 * x31;
    x23 = x24;
    x24 ≡ b2;
    x1 ≡ x11;
    x11 = x12 / u11;
    x12 = x13 - u12 * x21;
    x13 = x14 - u13 * x31;
    x14 = x15;
    x15 ≡ b1.

b3, b2, and b1 are primary inputs and correspond to the initial partial results x33, x24, and x15, respectively. At the final stage of the systolic array design, x33, x24, and x15 will take the values b3, b2, and b1, respectively. (End of Example 3.2.)

3.2.2 ALGORITHM INITIAL AND FINAL REPRESENTATIONS

In this new approach for constructing systolic arrays from algorithms, certain inherent information that already exists in the algorithm is utilized. This information, referred to as features, is extracted from the algorithm's equivalent set of simple computations. In order to systematically extract these features, it is necessary that the simple computations have a unique form. This unique form is referred to as the algorithm initial representation. This initial representation allows the systematic extraction of the features. There exist two types of features: algorithm features and interaction features. The algorithm features are extracted from each simple computation and the interaction features (which are defined in Chapter 4) are extracted from each pair of simple computations. The set of algorithm features that are needed in the new systolic design approach is as follows:

1. the input and output operands of each simple computation;

2. the set of operations used to compute each operand;

3. the lexicographical differences between the indices of the input operands of each simple computation;

4. the lexicographical differences between the indices of the input operands and the index of the output operand of each simple computation.

Associated with each feature there exists a set of different feature values. Each simple computation provides a different value for each of the features.
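The backward recurrence of Eq. 3.2-11 can be executed directly, and the lexicographical differences of Eq. 3.2-6 computed mechanically from the operand indices. The code below is mine, the sample upper triangular system is invented, and the orientation of the lex difference (first nonzero component positive) is inferred from the worked examples in the text.

```python
# Direct rendering of the backward recurrence Eq. 3.2-11 (type 2:
# x[i] is computed in decreasing order of i).

N = 3
u = {(1, 1): 2.0, (1, 2): 1.0, (1, 3): -1.0,   # hypothetical upper
     (2, 2): 4.0, (2, 3): 2.0,                 # triangular system
     (3, 3): 5.0}
b = {1: 1.0, 2: 8.0, 3: 10.0}

xs = {}                                   # xs[i, k]: partial results x_ik
for i in range(N, 0, -1):
    xs[i, N - i + 3] = b[i]               # initial partial result
    xs[i, N - i + 2] = xs[i, N - i + 3]
    for k in range(N - i + 1, 1, -1):     # x_ik = x_i,k+1 - u_i,i+k-1 * x_i+k-1,1
        xs[i, k] = xs[i, k + 1] - u[i, i + k - 1] * xs[i + k - 1, 1]
    xs[i, 1] = xs[i, 2] / u[i, i]         # the single division step (k = 1)

x = [xs[i, 1] for i in range(1, N + 1)]
print(x)   # [1.0, 1.0, 2.0] -- solves Ux = b by back substitution

def lex(i, j):
    """Lexicographical index difference (Eq. 3.2-6), oriented so the
    first nonzero component is positive (inferred convention)."""
    d = [a - b for a, b in zip(i, j)]
    for c in d:
        if c != 0:
            return tuple(d) if c > 0 else tuple(-v for v in d)
    return tuple(d)

# relative distances between operands of x22 = x23 - u23 * x31
print(lex((2, 3), (3, 1)))   # (1, -2): input operand pair (x23, x31)
print(lex((2, 2), (2, 3)))   # (0, 1):  output/input pair (x22, x23)
```

The two printed differences match the values extracted in the feature discussion that follows.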
To illustrate, consider the following simple computation from Eq. 3.2-12:

    x22 = x23 - u23 * x31.    3.2-13

The following features and feature values can be extracted from Eq. 3.2-13:

1. The input and output operands of Eq. 3.2-13 form a subset of features where x23, u23, x31, and x22 are the feature values.

2. The set of operations used to compute an operand constitutes a feature. Eq. 3.2-13 has four operands, hence four different features. Multiplication and subtraction operations (*-) are used to compute x22 and x23; no operation ( ) is used for primary input u23; and division (/) is used to calculate x31. The values of these features are (*-), (*-), ( ), and (/).

3. The lexicographical differences between the indices of the input operands of Eq. 3.2-13 form a subset of features. The lexicographical difference between the indices of x23 and x31 is (3,1) - (2,3) = (1,-2). Index differences (0,0), (1,-2), and (1,-2), as calculated from the indices of operand pairs (x23, u23), (x23, x31), and (u23, x31), are the feature values. Later it will be shown that only the index differences between the same type of operands are needed. For example, x23 and u23 are not the same type of operand; one is a primary input with variable name u and the other is a partial result with variable name x. (Regular communication requires that variables with two distinct names enter or leave the array from two distinct lines). Similarly, x23 and x31 are not the same type of operand since each requires a different set of operations; x23 requires multiplication and subtraction operations and x31 requires a division operation.

4. Index differences (0,1), (0,1), and (1,-1), which are calculated respectively from the indices of operand pairs (x22, x23), (x22, u23), and (x22, x31), also form feature values.

The extraction of feature values in item 2 can be simplified by including the set of operations required by each operand in the simple computation. That is, Eq.
3.2-13 can be expressed as follows:

    x22(*-) = x23(*-) - u23( ) * x31(/).    3.2-14

Similarly, Eq. 3.2-11 is expressed as follows:

    xi,k ≡ bi for k = N-i+3,    3.2-15
    xi,k(*-) = xi,k+1(*-) for k = N-i+2,
    xi,k(*-) = xi,k+1(*-) - ui,i+k-1( ) * xi+k-1,1(/) for k = 2,..., N-i+1,
    xi,k(/) = xi,k+1(*-) / uii( ) for k = 1,
    xi ≡ xi,1, for i = N,..., 2, 1.

For example, in Eq. 3.2-15, (*-) is used to compute xi,k for i = N,..., 2, 1 and k = 2,..., N-i+1. The lexicographical index differences represent the relative distances between the operands of each simple computation. These index differences contain information related to the simple and regular communication and local connectivity properties of systolic arrays. To extract the algorithm features formally, a simple computation is represented generically as follows:

    x(Ix)(Ox) = y(Iy)(Oy) o1 z(Iz)(Oz) o2 w(Iw)(Ow)    3.2-16

where the different operands are indicated by the letters x, y, z, and w. The operand x(Ix) represents the output operand with generic variable name x and index Ix. The operands y(Iy), z(Iz), and w(Iw) represent up to three input operands, and Ox, Oy, Oz, and Ow are each a set of operations used in the computation of the respective operands. Ox, the set of operations used in the computation of x(Ix), is equal to the set {o1, o2}. For example, substituting Eq. 3.2-14 in the generic form defines the following:

    x(Ix) ≡ x22 implies Ix = (2,2) and x = x;    3.2-17
    y(Iy) ≡ x23 implies Iy = (2,3) and y = x;
    z(Iz) ≡ u23 implies Iz = (2,3) and z = u;
    w(Iw) ≡ x31 implies Iw = (3,1) and w = x;
    Ox ≡ *-; Oy ≡ *-; Oz ≡ null; Ow ≡ /;
    o1 ≡ -; o2 ≡ *.

Definition 3.2 — For an operand, such as x(Ix)(Ox), the variable name, x, and the set of operations, Ox, indicate the operand type. For example, operands x22(*-) and x23(*-) have the same operand type; both have the same variable name, x, and set of operations, (*-). Operands x22(*-) and x31(/) do not have the same operand type. The generic representation of Eq.
3.2-16 defines the structure of the simple computations that can be computed in the cell structures of Figure 3.1. Different generic representations are needed if cell structures other than those shown in Figure 3.1 are used. The method described here can easily be extended to different cell structures. Using the generic form of the simple computations as defined by Eq. 3.2-16, the algorithm features can now be formally defined.

Algorithm Features

1. The operands x(Ix), y(Iy), z(Iz), and w(Iw) are operand features.

2. The sets of operators Ox, Oy, Oz, and Ow are operator features.

3. The following lexicographical (lex) index differences, which indicate the relative distances between the operands of a simple computation, form the relative-distance features:

    Vxy = lex(Ix, Iy);    3.2-18
    Vxz = lex(Ix, Iz);
    Vxw = lex(Ix, Iw);
    Vyz = lex(Iy, Iz);
    Vyw = lex(Iy, Iw);
    Vzw = lex(Iz, Iw).

The relative-distance features are computed for the algorithm of Eq. 3.2-9 for N = 3 and the results are shown in Table 3.1. In Table 3.1, the simple computations are numbered c1 through c9 for easy reference. Each row in Table 3.1 indicates the set of relative-distance feature values for each simple computation. For example, Vxy1 = (0,1) is a value of feature Vxy of simple computation c1; Vxw4 = (1,-1) is a value of feature Vxw of simple computation c4.

Table 3.1. Final representation of the algorithm described by Eq. 3.2-9.
    Simple Computation (ci)                    Vxy    Vxz    Vxw     Vyz    Vyw     Vzw
    c1: x31(/)  = x32(*-) / u33( )            (0,1)  (0,2)   -      (0,1)   -       -
    c2: x32(*-) = x33(*-)                     (0,1)   -      -       -      -       -
    c3: x21(/)  = x22(*-) / u22( )            (0,1)  (0,1)   -      (0,0)   -       -
    c4: x22(*-) = x23(*-) - u23( ) * x31(/)   (0,1)  (0,1)  (1,-1)  (0,0)  (1,-2)  (1,-2)
    c5: x23(*-) = x24(*-)                     (0,1)   -      -       -      -       -
    c6: x11(/)  = x12(*-) / u11( )            (0,1)  (0,0)   -      (0,1)   -       -
    c7: x12(*-) = x13(*-) - u12( ) * x21(/)   (0,1)  (0,0)  (1,-1)  (0,1)  (1,-2)  (1,-1)
    c8: x13(*-) = x14(*-) - u13( ) * x31(/)   (0,1)  (0,0)  (2,-2)  (0,1)  (2,-3)  (2,-2)
    c9: x14(*-) = x15(*-)                     (0,1)   -      -       -      -       -

Definition 3.3 — Feature value Vxyi of simple computation ci is referred to as the dependency vector of ci. The simple computations of an algorithm are constructed such that all the dependency vectors are the same. That is, the dependency vector of ci is the same as that of cj. This means that the simple computations have one dependency vector of the form (0,1), (0,0,1), (0,0,0,1), etc. The vector size is determined from the index space of the simple computations. For example, in Table 3.1, all the values of feature Vxy are equal to (0,1). Furthermore, input operand y(Iy)(Oy) always represents a partial result. The algorithm equivalent set of simple computations and the associated algorithm features of each simple computation collectively are referred to as the algorithm final representation.

3.3 CHAPTER SUMMARY

An algorithm with uniform and bounded index space can be represented by mathematical expressions. The algorithm is then systematically transformed into a set of simple computations. The complexity of these simple computations defines the complexity of the simple cells and vice versa. Only simple computations which involve three or fewer variables and two or fewer operations are considered. This corresponds to simple cells with one, two, or three input operands and zero, one, or two operations. The method is easily extrapolated to give cells with more complex structures.
However, complex cell structures require complex cell communications, which are not recommended for systolic arrays. The simple computations of an algorithm have a unique form and define the algorithm initial representation. From these simple computations, a subset of features referred to as the algorithm features is extracted. Additional features referred to as the interaction features will be extracted in Chapter 4. The simple computations and the corresponding algorithm features define the algorithm final representation. The algorithm feature values are used repeatedly during the extraction of the design information. These feature values are determined prior to analysis (i.e. in the algorithm final representation) in order to avoid their repeated computation. Note that the algorithm feature values, computed from each simple computation, contain information about the relative distances between the operands and the set of operations needed to compute each operand. The relative operand distances will be shown to relate to the simple and regular communication and the local connectivity properties of systolic arrays. The set of operations needed to compute each operand will be shown to relate to the complexity of the cells. In the next chapter, the algorithm final representation is modeled to determine the design information necessary for constructing systolic arrays.

CHAPTER 4

ALGORITHM MODEL

This chapter presents the algorithm model from which design information is extracted and used for constructing systolic arrays. The model consists of certain local and global properties that must be satisfied if there exists a systolic solution for the algorithm. Local properties are the cell intra-connections, as indicated in Figure 3.1, and the cell operations. The legal cell operations depend on the cell complexity. Global properties include simple and regular communications and cell-to-cell connectivity.
The local and global properties are represented by feature interrelationships. A systolic solution exists if and only if all the feature interrelationships are satisfied.

4.1 MATHEMATICAL DEVELOPMENT

The simple computations of an algorithm are represented geometrically by letting the indices of the output operands indicate the location in space where they are computed. This is similar to the geometric representation considered in [14,17]. For example, consider the simple computations of Eq. 3.2-15 for N = 3, which are repeated here for convenience.

    c1: x31(/) = x32(*-) / u33( );    4.1-1
    c2: x32(*-) = x33(*-);
    c3: x21(/) = x22(*-) / u22( );
    c4: x22(*-) = x23(*-) - u23( ) * x31(/);
    c5: x23(*-) = x24(*-);
    c6: x11(/) = x12(*-) / u11( );
    c7: x12(*-) = x13(*-) - u12( ) * x21(/);
    c8: x13(*-) = x14(*-) - u13( ) * x31(/);
    c9: x14(*-) = x15(*-).

The geometric representation of Eq. 4.1-1 is shown in Figure 4.1.

Figure 4.1. The geometric representation of the output operands of Eq. 4.1-1.

Each point in Figure 4.1 indicates a simple cell and its coordinates indicate the output operand that it calculates. The cell at location (2,2), for example, computes x22 where x23, u23, and x31 are the input operands. The functions of each cell and their respective input and output operands can also be included in the geometric representation by replacing each point in Figure 4.1 by the corresponding cell. The result is shown in Figure 4.2 where the MS cells perform multiplication and subtraction operations and the D cells perform division. Assigning each simple computation its own cell is obviously not a good choice as it implies that most of the cells will be idle most of the time and many cells will be needed. (However, there are algorithms such as the solution of lower triangular linear systems that require as many cells as there are indices in the algorithm's index space).
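The geometric placement described above can be sketched as a simple lookup table: each simple computation of Eq. 4.1-1 sits at the index of its output operand, and cells are typed by the operations of that operand. The code and data-structure layout below are mine, not the dissertation's.

```python
# Geometric representation of Eq. 4.1-1: each computation is placed at the
# coordinates of its output operand (as in Figure 4.1).

computations = {
    "x31": ((3, 1), ["x32", "u33"]),
    "x32": ((3, 2), ["x33"]),
    "x21": ((2, 1), ["x22", "u22"]),
    "x22": ((2, 2), ["x23", "u23", "x31"]),
    "x23": ((2, 3), ["x24"]),
    "x11": ((1, 1), ["x12", "u11"]),
    "x12": ((1, 2), ["x13", "u12", "x21"]),
    "x13": ((1, 3), ["x14", "u13", "x31"]),
    "x14": ((1, 4), ["x15"]),
}

# a cell sits at the coordinates of the output operand it computes
cells = {coord: (out, ins) for out, (coord, ins) in computations.items()}
print(cells[(2, 2)])   # ('x22', ['x23', 'u23', 'x31'])

# cell type from the operations of the output operand: / cells are D cells,
# *- cells are MS cells (as in Figure 4.2)
ops = {"x31": "/", "x21": "/", "x11": "/",
       "x32": "*-", "x22": "*-", "x23": "*-",
       "x12": "*-", "x13": "*-", "x14": "*-"}
ms_cells = sorted(coord for coord, (out, _) in cells.items() if ops[out] == "*-")
print(ms_cells)   # [(1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (3, 2)]
```

The six MS points and three D points recovered here are the two cell populations that the projection step then overlaps separately.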
To reduce the number of cells, the geometric representation is projected onto a line or a plane in order to overlap several cells onto one cell. The geometric representation of Figure 4.2 is projected onto a line, thereby assigning several simple computations to one cell. The result is a linear array with fewer cells and a higher cell utilization. The geometric representation of Figure 4.2 consists of two different cell types. It is obvious that only cells of the same type may be overlapped. Consider the MS cells of Figure 4.2, which form an array in the geometric space. The projection of the MS cells onto the i axis is the same as overlapping all the rows. This projection is said to be along vector (0,1)^t, or for short, projection vector (0,1)^t. This vector is indicated in the space by a line between two points with coordinates (0,0) and (0,1). The direction assigned to this vector is an angular direction and is indicated by the counterclockwise angle between this vector and the i axis. The angular direction of projection vector (0,1)^t is π/2 and the projection of the MS cells along this vector, as expected, is row overlapped. The column overlapping is performed by projection vector (1,0)^t. For diagonal and reverse diagonal overlapping the projection vectors are (1,-1)^t and (1,1)^t, respectively. The MS cells can also be projected onto a line along projection vectors (1,-2)^t or (2,-1)^t. However, these projections (if possible) will produce systolic solutions consisting of too many cells. Only the projection vectors with element values 0, -1, or 1 are considered. The D cells can be projected onto a line using projection vectors (0,1)^t or (1,0)^t.

Definition 4.1 — A portion of a systolic array responsible for a subset of simple computations is referred to as a sub-systolic array. Sub-systolic arrays have the same properties as systolic arrays.
In addition they must also satisfy necessary boundary cell communication linkages with other sub-systolic arrays.

Definition 4.2 — A systolic array is said to be homogeneous if it is not separable into sub-systolic arrays. A non-homogeneous systolic array is separable into two or more sub-systolic arrays. Each sub-systolic array is responsible for a subset of the simple computations.

The following criteria are used to separate a non-homogeneous systolic array into sub-systolic arrays.

1. Different cell types — The simple cell requirement dictates that cells with non-matching sets of operations should not be overlapped.

2. Different variable names — The regular communication requirement dictates that different sets of variables should enter and leave the array from distinct input/output lines. This implies that cells with non-matching input/output variable names should not be overlapped.

Those cells which perform the same set of operations and have the same input/output variable names form a sub-systolic array. Figure 4.3 illustrates a non-homogeneous systolic array which is constructed from three sub-systolic arrays.

Figure 4.2. The geometric representation of the simple computations of Eq. 4.1-1.

These sub-systolic arrays are shown in Figure 4.4. Figure 4.5 illustrates a homogeneous systolic array. Note that this systolic array is not separable into sub-systolic arrays based on the above criteria. To construct a systolic array, the simple computations are grouped into subsets such that a sub-systolic array is constructed from each subset. This is done by geometric representation and projection of each subset of simple computations. The simple computations of Eq. 4.1-1 are grouped into two subsets C1 and C2 as follows:

    C1 = {c2, c4, c5, c7, c8, c9};    4.1-2
C2 = {c1, c3, c6}.   4.1-2

The input and output operands of the simple computations of C1 require the same type of operations, and the corresponding input and output variable names are the same. Simple computations c2, c5, and c9 of C1 each consist of one input operand. The other input operands are assumed to be null. This type of simple computation, as discussed in Chapter 3, is used to identify the initial partial results which are primary inputs. Similarly, the simple computations of C2 require the same type of operations and the corresponding input and output variable names are the same. A geometric representation is constructed from the indices of the output operands for each subset of simple computations. Each representation is separately projected onto a lower dimensional space. In order to determine which projections will produce valid systolic solutions, the valid projection vectors must be determined. The valid projection vectors must satisfy three major properties. (The third property does not apply to homogeneous systolic arrays).

[Figure 4.3. A non-homogeneous systolic array.]

1. They must preserve the properties of simple and regular communications.

2. They must preserve the local cell connectivity.

3. They must satisfy the boundary cell communications of the sub-systolic arrays.

To determine the valid projection vectors, a methodology is developed based on feature interrelationships. That is, the feature values are used to assign simple computations to a set of cells such that the above properties are satisfied. Definition 4.3 - A pair of simple computations is said to interact if they are computed in the same cell.
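Interaction is determined by the projection: two simple computations are assigned to one cell exactly when the difference of their points in the geometric representation is a rational multiple of the projection vector. The following is a minimal sketch of this grouping (the point-list encoding is an illustrative assumption, not taken from the dissertation), applied to the 3-by-3 arrangement of MS cells of Figure 4.2:

```python
from fractions import Fraction

def project(points, v):
    """Group points that differ by a rational multiple of projection vector v."""
    cells = {}
    for p in points:
        # Choose t so that p - t*v has a zero component at v's first nonzero
        # coordinate; points on the same projection line share this key.
        t = Fraction(p[0], v[0]) if v[0] != 0 else Fraction(p[1], v[1])
        key = (p[0] - t * v[0], p[1] - t * v[1])
        cells.setdefault(key, []).append(p)
    return list(cells.values())

# The 3-by-3 arrangement of MS cells of Figure 4.2.
ms_points = [(i, j) for i in range(1, 4) for j in range(1, 4)]

rows = project(ms_points, (0, 1))    # projection vector (0,1)': row overlap
diags = project(ms_points, (1, -1))  # projection vector (1,-1)': diagonal overlap
print(len(rows), len(diags))
```

With projection vector (0,1)' the nine points collapse onto three cells, one per row, matching the row overlapping described above; projection vector (1,-1)' yields five diagonal cells, illustrating why vectors with larger elements, such as (2,-1)', produce solutions with too many cells.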
An interacting pair is called a valid interactive pair if it does not violate the properties of systolic arrays. Consider simple computations c4 and c7 of Eq. 4.1-1:

c4: x22(*-) = x23(*-) - u23( ) * x31(/);   4.1-3
c7: x12(*-) = x13(*-) - u12( ) * x21(/).

Simple computation pair (c4, c7) is a valid interactive pair if the following local and global properties are satisfied. (Pair (c4, c7) satisfies these properties).

1. The variable names and the sets of operators of x22(*-) and x12(*-) must match. That is, operands x22(*-) and x12(*-) must be of the same operand type. This will ensure that operands with different variable names or with a different set of operators are not computed in the same cell.

2. There must exist a one-to-one match of variable names and operators for the input operands of c4 and c7. That is, operands x23(*-) and x13(*-) must be of the same operand type. Similarly, operands u23( ) and u12( ) and operands x31(/) and x21(/) must be of the same operand types. This means that, when x22 and x12 are computed in the same cell, then operands x23 and x13 must be computed in one cell and x31 and x21 must be computed in one cell. This is required so as not to receive x23 and x13 from two separate input lines during the calculation of x22 and x12 in the same cell. Similarly, x31 and x21 must not be received from two separate input lines.

[Figure 4.5. A homogeneous systolic array.]

3. Simple computations c4 and c7 must have unique input operands unless the common input operand is a primary input. (Primary inputs may be preloaded and remain stationary). This requirement is needed to satisfy the cell intra-connections as illustrated in Figure 3.1.

4.
The output operand of c4 must not be used as an input operand in c7, or vice versa, unless the output operand of c4 is the partial result used in c7, or vice versa. This is required to satisfy the cell intra-connections.

5. If partial results x22 and x12 are computed in the same cell, then they must not be sent as input operands to more than one cell. That is, if x22 and x12 are used in other simple computations, then these simple computations must be computed in the same cell. Otherwise, the output line of the cell which computes x22 and x12 will be connected to more than one other cell.

Properties 1 through 5 indicate a set of conditions that must be satisfied before c4 and c7 can be computed in the same cell without violating the properties of systolic arrays. Although these properties appear simple, their implementation involves certain complex feature interrelationships. Furthermore, computing two simple computations in the same cell must not violate the local and global properties in another cell in the final array. In Chapter 3, a set of features (referred to as the algorithm features) was extracted from each simple computation. Additional features, called interactive features, are extracted from each pair of simple computations and relate to the geometric projection vectors. The algorithm features and interactive features then form the set of all features extracted from a set of simple computations. They are used to represent the local and global properties in terms of the feature interrelationships. Let (ci, cj) be a pair of simple computations represented in generic form as follows:

ci: xi(Ixi)(Oxi) = yi(Iyi)(Oyi) O1i zi(Izi)(Ozi) O2i wi(Iwi)(Owi);   4.1-4
cj: xj(Ixj)(Oxj) = yj(Iyj)(Oyj) O1j zj(Izj)(Ozj) O2j wj(Iwj)(Owj).

In Eq. 4.1-4, the indices i and j are used to identify the variable names, operands' indices, and the sets of operations of different simple computations. For example, in substituting simple computation c1 of Eq.
4.1-1 into the generic form, the items are defined as follows:

x1(Ix1)(Ox1) ≡ x31(/) implies that x1 = x, Ix1 = 31, and Ox1 = /;   4.1-4a
y1(Iy1)(Oy1) ≡ x32(*-) implies that y1 = x, Iy1 = 32, and Oy1 = *-;
z1(Iz1)(Oz1) ≡ u33( ) implies that z1 = u, Iz1 = 33, and Oz1 = null;
w1(Iw1)(Ow1) ≡ null implies missing.

Similarly, substituting c4 of Eq. 4.1-1 into the generic form yields the following:

x4(Ix4)(Ox4) ≡ x22(*-) implies that x4 = x, Ix4 = 22, and Ox4 = *-;   4.1-4b
y4(Iy4)(Oy4) ≡ x23(*-) implies that y4 = x, Iy4 = 23, and Oy4 = *-;
z4(Iz4)(Oz4) ≡ u23( ) implies that z4 = u, Iz4 = 23, and Oz4 = null;
w4(Iw4)(Ow4) ≡ x31(/) implies that w4 = x, Iw4 = 31, and Ow4 = /.

Using the simple computation pair (ci, cj), interactive features are now defined.

Interactive Features

The following lexicographical (lex) index differences, which indicate the relative distances between the operands of simple computations ci and cj, are the interactive feature values of (ci, cj):

Vxxij = lex(Ixi, Ixj);   4.1-5
Vyyij = lex(Iyi, Iyj);
Vzzij = lex(Izi, Izj);
Vwwij = lex(Iwi, Iwj).

Vxx, Vyy, Vzz, and Vww denote interactive features, and Vxxij, Vyyij, Vzzij, and Vwwij denote unique values for the interactive features, respectively. The values for the interactive features of the simple computations of Eq. 4.1-1 are listed in Table 4.1.

Table 4.1. Interactive feature values of Eq. 4.1-1.

Pair (ci, cj)   Vxx       Vyy       Vzz       Vww
(c1, c3)        (1, 0)    (1, 0)    (1, 1)    -
(c1, c6)        (2, 0)    (2, 0)    (2, 2)    -
(c3, c6)        (1, 0)    (1, 0)    (1, 1)    -
(c2, c4)        (1, 0)    (1, 0)    -         -
(c2, c5)        (1, -1)   (1, -1)   -         -
(c2, c7)        (2, 0)    (2, 0)    -         -
(c2, c8)        (2, -1)   (2, -1)   -         -
(c2, c9)        (2, -2)   (2, -2)   -         -
(c4, c5)        (0, 1)    (0, 1)    -         -
(c4, c7)        (1, 0)    (1, 0)    (1, 1)    (1, 0)
(c4, c8)        (1, -1)   (1, -1)   (1, 0)    (0, 0)
(c4, c9)        (1, -2)   (1, -2)   -         -
(c5, c7)        (1, 1)    (1, 1)    -         -
(c5, c8)        (1, 0)    (1, 0)    -         -
(c5, c9)        (1, -1)   (1, -1)   -         -
(c7, c8)        (0, 1)    (0, 1)    (0, 1)    (1, 0)
(c8, c9)        (0, 1)    (0, 1)    -         -

Definition 4.4 - Two simple computations ci and cj are mapped to the same cell location if feature value Vxxij = lex(Ixi, Ixj) is linearly dependent to a projection vector, Vx. That is, ci and cj form an interactive pair if Vx = t * Vxxij where t ∈ ZR, and ZR is the set of all integer and rational numbers. (This relationship between the values of feature Vxx and projection vectors is needed to satisfy the regular communication property). For example, projection vector Vx = (1, 0)', linearly dependent to feature value Vxx47 = (1, 0), maps c4 and c7 onto one cell. Similarly, Vx = (0, 1)', which is linearly dependent to (0, 2), maps simple computations c7 and c9 onto the same cell location, where Vxx79 = (0, 2).

4.1.1 FEATURE INTERRELATIONSHIPS

In this section, the conditions for valid interactive pairs are formally discussed. These conditions test for violation of the local and global properties. Local properties consist of cell intra-connections and cell operations, while global properties consist of simple and regular communications and cell interconnections. The conditions for valid interactive pairs are expressed in the form of theorems. Consider two simple computations ci and cj in generic form:

ci: xi(Ixi)(Oxi) = yi(Iyi)(Oyi) O1i zi(Izi)(Ozi) O2i wi(Iwi)(Owi);   4.1-6
cj: xj(Ixj)(Oxj) = yj(Iyj)(Oyj) O1j zj(Izj)(Ozj) O2j wj(Iwj)(Owj).

Theorem 4.1 - Simple computation pair (ci, cj) is a valid interactive pair if the following hold: 1. Oxi = Oxj; 2. Oyi = Oyj; 3. Ozi = Ozj; 4. Owi = Owj.

Proof: The proof of condition 1 is simple; only those simple computations which require the same set of operations are allowed to be computed in the same cell. If Oxi ≠ Oxj, then the interaction of ci and cj will require the cell to perform two sets of operations, Oxi and Oxj. This violates the simple cell property. Condition 2 relates to the simple communication property. Suppose ci and cj interact in the same cell, q, and condition 2 does not hold (i.e. Oyi ≠ Oyj).
Since Oyi ≠ Oyj, operands yi(Iyi) and yj(Iyj) will be computed in two different cells. However, partial results yi(Iyi) and yj(Iyj) are used to compute ci and cj in cell q. This implies that cell q will have to receive yi(Iyi) and yj(Iyj) from two different cells. In other words, an input line of cell q must be connected to two different cells. This violates the simple communication property. The proofs of conditions 3 and 4 are similar to that of condition 2. □

The regular communication property implies that the array input/output information must have a regular format. For example, two different primary input variables must be received from two distinct input lines at the same speed. Regular communication allows modular design, so larger arrays can be made by combining proven smaller ones. The conditions that test for violation of this property are stated as Theorem 4.2.

Theorem 4.2 - Simple computation pair (ci, cj) is a valid interactive pair if the following hold: 1. The variable names of the output operands must be the same. That is, xi = xj. 2. If zi(Izi)(Ozi) ≠ null and zj(Izj)(Ozj) ≠ null, then zi = zj. 3. If wi(Iwi)(Owi) ≠ null and wj(Iwj)(Owj) ≠ null, then wi = wj.

Proof: The regular communication property requires that the operands entering and exiting each cell must have the same variable names. □

Theorems 4.1 and 4.2 are used to group the simple computations into subsets. The simple computations that belong to the same subset satisfy the conditions of Theorems 4.1 and 4.2. This initial grouping of the simple computations avoids repeated testing of Theorems 4.1 and 4.2 for each pair of simple computations of each subset. For example, consider the following simple computations of Eq. 4.1-1, repeated here for convenience:

c1: x31(/) = x32(*-) / u33( );   4.1-7
c2: x32(*-) = x33(*-);
c4: x22(*-) = x23(*-) - u23( ) * x31(/).

Simple computation pair (c2, c4) satisfies Theorems 4.1 and 4.2, but pair (c1, c2) does not.
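This initial screening by Theorems 4.1 and 4.2 can be sketched as a pairwise test. (The operand encoding below is an assumption made for illustration, not the dissertation's representation; `None` stands for a null operand, which imposes no constraint.)

```python
# Sketch of the Theorem 4.1/4.2 screening: operator sets must match
# position by position (Thm 4.1), and variable names must match wherever
# both operands are non-null (Thm 4.2).
def operand(var=None, ops=None):
    return {"var": var, "ops": ops}   # var=None models a null operand

def compatible(a, b):
    if a["var"] is None or b["var"] is None:
        return True                   # a null operand matches anything
    return a["ops"] == b["ops"] and a["var"] == b["var"]

def valid_pair(ci, cj):
    # ci, cj: [x, y, z, w] operand lists in the generic form of Eq. 4.1-4.
    x_ok = ci[0]["ops"] == cj[0]["ops"] and ci[0]["var"] == cj[0]["var"]
    return x_ok and all(compatible(a, b) for a, b in zip(ci[1:], cj[1:]))

# Eq. 4.1-7:  c1: x31(/) = x32(*-) / u33( );   c2: x32(*-) = x33(*-);
#             c4: x22(*-) = x23(*-) - u23( ) * x31(/).
c1 = [operand("x", "/"),  operand("x", "*-"), operand("u", ""), operand()]
c2 = [operand("x", "*-"), operand("x", "*-"), operand(),        operand()]
c4 = [operand("x", "*-"), operand("x", "*-"), operand("u", ""), operand("x", "/")]

print(valid_pair(c2, c4))  # True  -- (c2, c4) may be grouped together
print(valid_pair(c1, c2))  # False -- output operator sets / and *- differ
```

Pair (c2, c4) passes because the null operands of c2 impose no constraint, while (c1, c2) fails already at the output operands, in agreement with the grouping of Eq. 4.1-2.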
Note that c2, c4 ∈ C1 and c1 ∈ C2 in Eq. 4.1-2. For convenience in notation, let C be a subset of simple computations. Let (ci, cj) ∈ C indicate a simple computation pair formed by ci ∈ C and cj ∈ C.

Theorem 4.3 - Simple computation pair (ci, cj) ∈ C is a valid interactive pair if the following hold: 1. If zi(Izi) and zj(Izj) are previously computed outputs, then zi(Izi) ≠ zj(Izj). 2. If zi(Izi) and wj(Iwj) are previously computed outputs, then zi(Izi) ≠ wj(Iwj). 3. If wi(Iwi) and zj(Izj) are previously computed outputs, then wi(Iwi) ≠ zj(Izj). 4. If wi(Iwi) and wj(Iwj) are previously computed outputs, then wi(Iwi) ≠ wj(Iwj).

Proof: Suppose condition 1 does not hold. That is, zi(Izi) and zj(Izj) are both previously computed outputs and zi(Izi) = zj(Izj). Suppose computations ci and cj are computed in a cell, q. In this case operand zi(Izi), which is the same as zj(Izj), is used twice in cell q, once during the computation of ci and again during the computation of cj. This implies that a feedback loop is required at that input line of cell q. But this contradicts the structure of the simple cells as described in Figure 3.1. (Only partial results are allowed to loop back). The proofs of conditions 2, 3, and 4 are the same. □

In a simple computation such as ci, the input operands zi(Izi) or wi(Iwi) or both are either previously computed outputs or are primary inputs. Theorem 4.3 applies only for non-primary inputs, since a primary input can be preloaded into a cell and used several times. To illustrate the violated condition 4 of Theorem 4.3, consider simple computations c4 and c8 of Eq. 4.1-1, repeated here for convenience:

c4: x22(*-) = x23(*-) - u23( ) * x31(/);   4.1-8
c8: x13(*-) = x14(*-) - u13( ) * x31(/).

Simple computations c4 and c8 cannot be computed in the same cell since x31 is used by both. In systolic arrays, data items move from one cell to the next in a pipelined fashion and are not looped back.
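The Theorem 4.3 test on a pair such as (c4, c8) amounts to a set intersection. (The operand encoding below is an illustrative assumption, not the author's implementation.)

```python
# Sketch of the Theorem 4.3 test: a pair may not share a cell if it shares
# a non-primary input operand, since reusing a previously computed output
# inside one cell would require a feedback loop (cf. Figure 3.1).
def shares_computed_input(ci_inputs, cj_inputs, computed_outputs):
    """Inputs are sets of (variable, index) operands; computed_outputs are
    operands produced elsewhere in the algorithm, i.e. not primary inputs."""
    shared = ci_inputs & cj_inputs
    return any(op in computed_outputs for op in shared)

# Eq. 4.1-8: c4 and c8 both use x31, which is the output of c1.
c4_inputs = {("x", (2, 3)), ("u", (2, 3)), ("x", (3, 1))}
c8_inputs = {("x", (1, 4)), ("u", (1, 3)), ("x", (3, 1))}
outputs   = {("x", (3, 1))}   # x31 is a previously computed output

print(shares_computed_input(c4_inputs, c8_inputs, outputs))  # True -> invalid pair
```

If x31 were instead a primary input (e.g. a preloaded u operand), the shared operand would be allowed, which is why the test consults the set of previously computed outputs.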
The computation of c4 and c8 in the same cell implies that x31, which is a previously computed output, must loop back to be used in the computation of c8.

Theorem 4.4 - Simple computation pair (ci, cj) ∈ C is a valid interactive pair if the output operands of ci and cj are not used as inputs to the same cell unless they are partial results. That is, 1. xi(Ixi) ≠ zj(Izj); 2. xi(Ixi) ≠ wj(Iwj); 3. xj(Ixj) ≠ zi(Izi); 4. xj(Ixj) ≠ wi(Iwi).

Proof: Suppose condition 1 does not hold. That is, xi(Ixi) = zj(Izj). Suppose ci and cj are computed in cell q. Since xi(Ixi) = zj(Izj), the output operand of ci is used as an input in the computation of cj. This implies that there must exist a feedback loop from the output line of cell q to one of its input lines other than the input line used for the partial results. This contradicts the simple cell structures of Figure 3.1. The proofs of conditions 2, 3, and 4 are the same. □

To illustrate the violated condition 1 of Theorem 4.4, consider the following simple computations:

ca: t21(*+) = ...;   4.1-9
cb: t32(*+) = ... + t21(*+) * ...

The computation of ca and cb in the same cell implies that t21 must be fed back to be used in the calculation of cb. This is a violation of cell intra-connections as illustrated in Figure 3.1.

Theorem 4.5 - Simple computation pair (ci, cj) ∈ C is a valid interactive pair if the following hold: 1. If xi = zi and Oxi = Ozi, then Vxzi ≠ t * Vxxij where t ∈ ZR. (Set ZR is the set of all integer and rational numbers). 2. If xi = wi and Oxi = Owi, then Vxwi ≠ t * Vxxij where t ∈ ZR. 3. If xj = zj and Oxj = Ozj, then Vxzj ≠ t * Vxxij where t ∈ ZR. 4. If xj = wj and Oxj = Owj, then Vxwj ≠ t * Vxxij where t ∈ ZR. Note that in the above conditions, such as condition 1, zi(Izi) is a previously computed output if xi = zi.

Proof: Suppose condition 1 does not hold. That is, xi = zi, Oxi = Ozi, and Vxzi = t * Vxxij where t ∈ ZR.
The properties xi = zi and Oxi = Ozi imply that xi(Ixi) and zi(Izi) have the same operand type. Hence, they are computed in the same sub-systolic array. Let ck be a simple computation such that

zi(Izi)(Ozi) = xk(Ixk)(Oxk).   4.1-10

(Note that ck exists because zi(Izi) is a previously computed output). This implies that operand zi(Izi) is the output operand of ck. Consider the following feature values computed from simple computations ci and ck:

Vxzi = lex(Ixi, Izi);   4.1-11
Vxxik = lex(Ixi, Ixk).

Note that Izi = Ixk since xk(Ixk) = zi(Izi). This implies Vxxik = Vxzi. The following exists:

Vxzi = t * Vxxij where t ∈ ZR;   4.1-12a
Vxzi = Vxxik.

Hence,

Vxxij = t' * Vxxik where t' ∈ ZR.   4.1-12b

The values of feature Vxx are linearly dependent to the projection vectors. Let Vx be a projection vector linearly dependent to feature value Vxxij. From Eq. 4.1-12b, Vx is also linearly dependent to feature value Vxxik. Using the projection vector Vx, the geometric projection will map computations ci, cj, and ck onto one cell. Figure 4.6 illustrates the interaction of ci, cj, and ck. However, the interaction of ci and ck violates condition 1 of Theorem 4.4. Conditions 2, 3, and 4 can be similarly proven. □

The violated conditions of Theorem 4.5 do not exist in the simple computations of Eq. 4.1-1. However, consider the following three simple computations:

ca: l21(+*) = ...;   4.1-13
cb: l32(+*) = ... + l21(+*) * ...;
cc: l54(+*) = ...

The interaction of simple computations cb and cc violates condition 1 of Theorem 4.5. Note that Vxzb = lex[(3, 2), (2, 1)] = (1, 1) is linearly dependent to feature value Vxxbc = lex[(3, 2), (5, 4)] = (2, 2). Feature value Vxzb is calculated from the indices of operands xb(Ixb) ≡ l32 and zb(Izb) ≡ l21 of cb, and Vxxbc is calculated from the indices of operands xb(Ixb) ≡ l32 and xc(Ixc) ≡ l54 of cb and cc. Note that Vxzb = Vxxab.
Projection vector Vx = (1, 1)', which is linearly dependent to feature values Vxxbc and Vxzb, will map simple computations ca, cb, and cc onto the same cell. This means (ca, cb), (ca, cc), and (cb, cc) are interactive pairs. However, the interaction of ca and cb violates condition 1 of Theorem 4.4.

Theorem 4.6 - Simple computation pair (ci, cj) ∈ C is a valid interactive pair if the following hold: 1. If yi = zi and Oyi = Ozi, then Vyzi ≠ t * Vyyij where t ∈ ZR. 2. If yi = wi and Oyi = Owi, then Vywi ≠ t * Vyyij where t ∈ ZR. 3. If yj = zj and Oyj = Ozj, then Vyzj ≠ t * Vyyij where t ∈ ZR. 4. If yj = wj and Oyj = Owj, then Vywj ≠ t * Vyyij where t ∈ ZR.

Proof: Suppose condition 1 does not hold. That is,

ci and cj interact;   4.1-14
yi = zi;
Oyi = Ozi;
Vyzi = t * Vyyij, where t ∈ ZR.

Let ck, cl, and cm be three simple computations such that

yi(Iyi)(Oyi) = xk(Ixk)(Oxk);   4.1-15
yj(Iyj)(Oyj) = xl(Ixl)(Oxl);
zi(Izi)(Ozi) = xm(Ixm)(Oxm).

Operands xk(Ixk), xl(Ixl), and xm(Ixm) are the output operands of ck, cl, and cm, respectively. Since yi = zi and Oyi = Ozi, operands yi(Iyi) and zi(Izi) are of the same operand type, which implies xk(Ixk) and xm(Ixm) are of the same operand type. Also, since ci and cj interact, operands yi(Iyi) and yj(Iyj) are of the same operand type, which implies xk(Ixk) and xl(Ixl) are of the same operand type. Hence, simple computations ck, cl, and cm will be computed in the same sub-systolic array.

[Figure 4.6. The violated condition 1 of Theorem 4.5.]

Let Vx = t * Vxxkl, where t ∈ ZR, be a projection vector. Projection vector Vx is used to map ck and cl onto the same cell location on the projection plane. Consider the following feature values:

Vxxkl = lex(Ixk, Ixl).   4.1-16

Equations 4.1-15 and 4.1-16 imply that Vxxkl = Vyyij.   4.1-17
Furthermore, equations 4.1-14 and 4.1-17 imply that Vyzi is linearly dependent to Vxxkl and hence linearly dependent to projection vector Vx. That is, Vyzi = t * Vx where t ∈ ZR. But note that

Vyzi = lex(Iyi, Izi);   4.1-18
Vxxkm = lex(Ixk, Ixm).

Finally, equations 4.1-15 and 4.1-18 imply that Vxxkm is the same as Vyzi and hence linearly dependent to Vx. This implies that projection vector Vx will also map ck and cm onto a cell location. However, the interaction of ck and cm violates the simple communication property. Figure 4.7 illustrates this violation. In Figure 4.7, the two input lines of cell p must be connected to the single output line of cell q. This will require a DMUX and hence a complex control mechanism at the output line of cell q. Conditions 2, 3, and 4 can similarly be proven. □

To illustrate the violated condition 1 of Theorem 4.6, consider the following simple computations:

ca: l222(+*) = ...;   4.1-19
cb: l322(+*) = ...;
cc: l323(+*) = l322(+*) + l222(+*) * ...;
cd: l523(+*) = l522(+*) + ...

[Figure 4.7. The violated condition 1 of Theorem 4.6.]

Feature values Vyzc and Vyycd are linearly dependent and are computed as follows:

Vyzc = (3, 2, 2) - (2, 2, 2) = (1, 0, 0);   4.1-20
Vyycd = (5, 2, 2) - (3, 2, 2) = (2, 0, 0).

If projection vector Vx = (1, 0, 0)' is used to project the geometric representation, simple computations cc and cd will interact. Note that simple computations ca, cb, cc, and cd belong to the same geometric representation. However, the interaction of ca and cb violates the simple communication property at the cell where simple computation cc is computed.

Theorem 4.7 - Simple computation pair (ci, cj) ∈ C is a valid interactive pair if the following hold: 1. Let ck and cl be two simple computations such that xk(Ixk)(Oxk) = zi(Izi)(Ozi) and xl(Ixl)(Oxl) = wi(Iwi)(Owi).
If zi(Izi) and wi(Iwi) are previously computed outputs and zi = wi, Ozi = Owi, and projection vector Vx1 = t * Vxxij where t ∈ ZR, then a projection vector such as Vx2, which maps ck and cl onto the same cell location, must not be linearly dependent to feature value Vzwi. 2. Let ck and cl be two simple computations such that xk(Ixk)(Oxk) = zj(Izj)(Ozj) and xl(Ixl)(Oxl) = wj(Iwj)(Owj). If zj(Izj) and wj(Iwj) are previously computed outputs and zj = wj, Ozj = Owj, and projection vector Vx1 = t * Vxxij where t ∈ ZR, then a projection vector such as Vx2, which maps ck and cl onto the same cell location, must not be linearly dependent to feature value Vzwj.

Proof: Suppose condition 1 does not hold. That is,

zi = wi;   4.1-21a
Ozi = Owi;
Vx1 = t1 * Vxxij, t1 ∈ ZR;
Vx2 = t2 * Vzwi, t2 ∈ ZR;

and zi(Izi) and wi(Iwi) are previously computed outputs. Projection vector Vx1 maps simple computations ci and cj onto the same cell. Let ci and cj be computed in cell q. The following exists:

Vxxkl = Vzwi.   4.1-21b

From Eq. 4.1-21b, Vx2 is linearly dependent to Vxxkl. Therefore, projection vector Vx2 maps simple computations ck and cl onto the same cell. Let ck and cl be computed in cell p. Figure 4.8 shows cells p and q. The third and the fourth input operands of ci are computed in cell p. This would require a DMUX at the output operand line of cell p, hence violating the simple communication property. Condition 2 can similarly be proven. □

To illustrate the violated condition 1 of Theorem 4.7, consider the following simple computations:

ca: t32(*) = ... + t32(/) * t23(/);   4.1-22
cb: t23(/) = ...;
cc: t32(/) = ...

The projection of the geometric representation using projection vector Vx2 = (1, -1)', where cb and cc are in the representation, maps simple computations cb and cc onto the same cell. However, this mapping violates the simple communication property at the cell where ca is computed.
Note that Vx2 is linearly dependent to feature value Vzwa = lex[(3, 2), (2, 3)] = (1, -1).

Until now, the geometric representations were defined in terms of the indices of the output operands. The projection of these geometric representations maps (assigns) the output operands, and hence the simple computations, onto certain cell locations. Now consider the geometric representations in terms of the indices of the input operands. The projection of these representations assigns the input operands to certain cell locations. A sub-systolic array is constructed by mapping the output operands and the respective input operands onto a set of cells. To do this, a valid projection vector is needed to map the output operands, and at most three valid projection vectors (one for each existing input line) are needed to map all the input operands. Note that these valid projection vectors must map the right operands to the right cells in order to preserve the algorithm correctness. In general, however, the input operands are not unique; an input operand may be used by more than one simple computation. Therefore, the projection of a geometric representation constructed from the indices of the input operands does not define the cell locations. It only indicates those input operands that are assigned to the same cell. On the other hand, the output operands are unique. Therefore, only the projections of the geometric representations which are constructed from the indices of the output operands are used to define the individual cell locations. In order to map the right input and output operands to the right cell locations, the valid projection vectors for the output operands and the input operands are determined. To distinguish the different projection vectors from one another, the operands of a simple computation are named. Consider the generic form for simple computations, repeated here for convenience:

x(Ix)(Ox) = y(Iy)(Oy) O1 z(Iz)(Oz) O2 w(Iw)(Ow).   4.1-23
The following discussion refers to x(Ix)(Ox) as the output operand, y(Iy)(Oy) as the input-1 operand, z(Iz)(Oz) as the input-2 operand, and w(Iw)(Ow) as the input-3 operand. Based on these references, a sub-systolic array is constructed by determining the valid projection vectors which assign the correct output operands, input-1 operands, input-2 operands, and input-3 operands to a set of cell locations such that the algorithm correctness and the properties of systolic arrays are preserved.

[Figure 4.8. The violated condition 1 of Theorem 4.7.]

The projection vectors of the geometric representations constructed from the indices of the output operands are linearly dependent to the values of feature Vxx (Definition 4.4). Similarly, the projection vectors for the geometric representations constructed from the indices of the input-1, input-2, and input-3 operands are defined to be linearly dependent to the values of features Vyy, Vzz, and Vww, respectively. Furthermore, the following vectors are used to indicate these projection vectors: Vy indicates the projection vector of input-1 operands; Vz indicates the projection vector of input-2 operands; Vw indicates the projection vector of input-3 operands. Vx, which was defined earlier, indicates the projection vector of output operands. In order to determine the valid projection vectors, certain projection vector interrelationships must be satisfied.

Lemma 4.8 - Let ck and cl be two simple computations such that

xk(Ixk)(Oxk) = yi(Iyi)(Oyi);   4.1-24
xl(Ixl)(Oxl) = yj(Iyj)(Oyj).

For the simple computation pair (ci, cj) ∈ C to be a valid interactive pair the following must hold. When projection vector Vx1 = t1 * Vxxij, where t1 ∈ ZR, is used to map simple computations ci and cj onto the same cell location, then projection vector Vx2 = t2 * Vyyij, where t2 ∈ ZR, must be used to map ck and cl onto the same cell location.
Proof: Suppose the contrary, that projection vector Vx1 is linearly dependent to feature value Vxxij and projection vector Vx2 is not linearly dependent to feature value Vyyij. Then the following hold for simple computations ci and cj:

xi = xj;   4.1-25
yi = yj;
Oxi = Oxj;
Oyi = Oyj.

Let G1 be the geometric representation which includes ci and cj. Let G2 (a geometric representation) include ck and cl. From Eq. 4.1-24, Vxxkl = Vyyij, and hence Vx2 is linearly independent to Vxxkl. The projection of G1 and G2 using projection vectors Vx1 and Vx2, respectively, will map ci and cj onto the same cell and ck and cl onto two different cells. Figure 4.9 illustrates this condition. In Figure 4.9, the input-1 operands of cell q are received from two different cells, q1 and q2. This will require a MUX at the input-1 operand line of cell q, violating the simple communication property. □

To illustrate the condition of Lemma 4.8, consider simple computations c3, c6, c4, and c7 of Eq. 4.1-1, repeated here for convenience:

c3: x21(/) = x22(*-) / u22( );   4.1-26
c6: x11(/) = x12(*-) / u11( );
c4: x22(*-) = x23(*-) - u23( ) * x31(/);
c7: x12(*-) = x13(*-) - u12( ) * x21(/).

Lemma 4.8 implies that if Vx1 = (1, 0)', which is linearly dependent to feature value Vxx36, is used to map simple computations c3 and c6 onto one cell, then projection vector Vx2 = (1, 0)' must be used to map c4 and c7 onto one cell. Note that Vx2 is linearly dependent to feature value Vyy36.

[Figure 4.9. The violated condition of Lemma 4.8.]

Lemma 4.9 - Let zi(Izi) and zj(Izj) be previously computed outputs. Let ck and cl be two simple computations such that

xk(Ixk)(Oxk) = zi(Izi)(Ozi);   4.1-27
xl(Ixl)(Oxl) = zj(Izj)(Ozj).

For the simple computation pair (ci, cj) ∈ C to be a valid interactive pair the following must hold.
When projection vector Vx1 = t1 * Vxxij, where t1 ∈ ZR, is used to map simple computations ci and cj onto the same cell location, then projection vector Vx2 = t2 * Vzzij, where t2 ∈ ZR, must be used to map ck and cl onto the same cell location.

Proof: The proof is analogous to that of Lemma 4.8; letter z is substituted for letter y. □

Lemma 4.10 - Let wi(Iwi) and wj(Iwj) be previously computed outputs. Let ck and cl be two simple computations such that

xk(Ixk)(Oxk) = wi(Iwi)(Owi);   4.1-28
xl(Ixl)(Oxl) = wj(Iwj)(Owj).

For the simple computation pair (ci, cj) ∈ C to be a valid interactive pair the following must hold. When projection vector Vx1 = t1 * Vxxij, where t1 ∈ ZR, is used to map simple computations ci and cj onto the same cell location, then projection vector Vx2 = t2 * Vwwij, where t2 ∈ ZR, must be used to map ck and cl onto the same cell location.

Proof: The proof is similar to that of Lemma 4.8. Here letter w is substituted for letter y. □

Theorem 4.11 - Simple computation pair (ci, cj) ∈ C is a valid interactive pair if:

Vx = t1 * Vxxij;   4.1-29
Vy = t2 * Vyyij;
Vz = t3 * Vzzij;
Vw = t4 * Vwwij;

where t1, t2, t3, and t4 ∈ ZR.

Proof: The proof follows from Lemmas 4.8, 4.9, and 4.10. These lemmas indicate that if projection vector Vx is linearly dependent to Vxxij, then projection vectors Vy, Vz, and Vw must be linearly dependent to Vyyij, Vzzij, and Vwwij, respectively. □

Theorem 4.12 - Simple computation pair (ci, cj) ∈ C is a valid interactive pair if the following hold: 1. If xi = yi, xj = yj, Oxi = Oyi, and Oxj = Oyj, then projection vector Vx must be linearly dependent to feature values Vxxij and Vyyij. 2. If xi = zi, xj = zj, Oxi = Ozi, and Oxj = Ozj, then projection vector Vx must be linearly dependent to feature values Vxxij and Vzzij. 3. If xi = wi, xj = wj, Oxi = Owi, and Oxj = Owj, then projection vector Vx must be linearly dependent to feature values Vxxij and Vwwij.
(In the first condition of Theorem 4.12, it is not necessary to check whether xi = yi and xj = yj. In a simple computation, the variable name of the output operand is assumed to always be the same as that of the input-1 operand. The input-1 operands are always partial results. This condition is stated in this form in order to be consistent with the second and third conditions).

Proof: Consider condition 1, which is a special case of Lemma 4.8. Let ck and cl be two simple computations such that

xk(Ixk)(Oxk) = yi(Iyi)(Oyi);   4.1-30
xl(Ixl)(Oxl) = yj(Iyj)(Oyj).

Since the variable names and the sets of operators of the output operands and the input-1 operands of ci and cj are the same, simple computations ci, cj, ck, and cl belong to the same geometric representation. Therefore, projection vector Vx must be linearly dependent to both feature values Vxxij and Vxxkl. However, from Eq. 4.1-30, Ixk = Iyi and Ixl = Iyj. Hence, Vx must also be linearly dependent to Vyyij. Proofs of conditions 2 and 3 are similar to that of condition 1. □

To illustrate condition 1 of Theorem 4.12, consider the following simple computations of Eq. 4.1-1, repeated here for convenience:

c4: x22(*-) = x23(*-) - u23( ) * x31(/);   4.1-31
c7: x12(*-) = x13(*-) - u12( ) * x21(/);
c5: x23(*-) = x24(*-);
c8: x13(*-) = x14(*-) - u13( ) * x31(/).

Simple computations c4 and c7 satisfy condition 1. Note that x22(*-), x12(*-), x23(*-), and x13(*-) are of the same operand type. This implies that simple computations c4, c7, c5, and c8 belong to the same geometric representation. Therefore, a projection vector, Vx, that maps (c4, c7) onto the same cell must also map (c5, c8) onto the same cell. This implies that Vx must be linearly dependent to both Vxx47 and Vxx58. But Vxx58 is the same as Vyy47. Occasionally, in a systolic array, a primary output is computed in two or more sub-systolic arrays.
That is, a final partial result of one sub-systolic array is sent as a partial result to another sub-systolic array for further calculations. For example, consider simple computations c_6 and c_7 of Eq. 4.1-1, repeated here for convenience.

c_6: x_11(/) = x_12(*-) / u_11( );    4.1-32
c_7: x_12(*-) = x_13(*-) - u_12( ) * x_21(/).

Partial result x_12 is computed in a sub-systolic array where each cell performs multiplication and subtraction operations. Operand x_12 is then sent to another sub-systolic array, where each cell performs division, leading to the computation of x_11. It is obvious that the cell model of Figure 3.1b should not be used to implement these sub-systolic arrays. The following feature interrelationship theorem is used to eliminate those sub-systolic solutions which require the cell model of Figure 3.1b and also send results to other sub-systolic arrays.

Theorem 4.13 - Let C_1 and C_2 be two subsets of simple computations such that each subset is computed in a separate sub-systolic array and the variable names of the output operands are the same in both sets. (This means that a resultant variable is calculated by the cooperation of both sub-systolic arrays.) Let G_1 and G_2 be the geometric representations of C_1 and C_2, respectively. Projection vector Vx_1 of G_1 and projection vector Vx_2 of G_2 are valid projection vectors if Vx_1 and Vx_2 are both linearly independent to the dependency vector (i.e., values of feature Vxy as defined in Chapter 3).

Proof: Suppose, on the contrary, that Vx_1 is linearly dependent to the dependency vector and Vx_2 is not. Let S_1 and S_2 represent two sub-systolic arrays constructed from the projections of G_1 and G_2, respectively. Since Vx_1 is linearly dependent to the dependency vector, all the simple computations mapped onto the same cell location in S_1 will require the cell model of Figure 3.1b.
That is, the successive partial results will stay in the same cell, where each is used to calculate the next partial result. From the statement of the theorem, the simple computations associated with a resultant variable are members of both subsets C_1 and C_2. Therefore, the final partial results of each cell in S_1 must be sent as inputs to sub-systolic array S_2. However, since S_1 requires cells with stationary partial results (Figure 3.1b), a special controller is needed to handle such communications between sub-systolic arrays S_1 and S_2. This contradicts the simple communication property. □

To illustrate the condition of Theorem 4.13, consider the simple computations of subsets C_1 and C_2 of Eq. 4.1-2, which are repeated here for convenience.

c_2: x_32(*-) = x_33(*-);    4.1-33
c_4: x_22(*-) = x_23(*-) - u_23( ) * x_31(/);
c_5: x_23(*-) = x_24(*-);
c_7: x_12(*-) = x_13(*-) - u_12( ) * x_21(/);
c_8: x_13(*-) = x_14(*-) - u_13( ) * x_31(/);
c_9: x_14(*-) = x_15(*-).

c_1: x_31(/) = x_32(*-) / u_33( );    4.1-34
c_3: x_21(/) = x_22(*-) / u_22( );
c_6: x_11(/) = x_12(*-) / u_11( ).

The simple computations of C_1 (Eq. 4.1-33) require cells performing multiplication and subtraction operations and those of C_2 (Eq. 4.1-34) require cells performing division. If Vx_1 = (0, 1)^t is used to project the geometric representation of C_1, then simple computation c_2 is mapped onto one cell location, c_4 and c_5 are mapped onto another cell location, and c_7, c_8, and c_9 are mapped onto a third cell location. This implies that, for example, x_23 of c_5 and x_22 of c_4 are computed in the same cell, say q. Operand x_23, which is calculated in cell q, is used as the partial result in cell q in the calculation of x_22. Hence, cell q has the input/output intra-connection of Figure 3.1b. Let simple computation c_3 of C_2 be computed in cell p. Since the simple computations of Eq. 4.1-33 are backward recursive functions, c_4 is the last computation of cell q.
Therefore, the output operand of c_4, which is x_22 (and which is also the last output operand of cell q), must be sent to cell p as an input. This requires additional control communication, in violation of the simple communication property. □

Theorems 4.1 - 4.7 and Theorems 4.11 - 4.13 define the feature interrelationships. These interrelationships form a set of necessary conditions that must be satisfied before a pair of simple computations can be computed in the same cell without violating the properties of systolic arrays. The sufficient condition is stated below in Theorem 4.14.

Definition 4.5 - Simple computation pair (c_k, c_l) is called a precedence pair of simple computation pair (c_i, c_j) if the output operands of c_k and c_l form either the input-1, input-2, or input-3 operands of c_i and c_j. As an example, consider simple computations c_1, c_3, c_4, and c_7 of Eq. 4.1-1, which are repeated here for convenience.

c_1: x_31(/) = x_32(*-) / u_33( );    4.1-35
c_3: x_21(/) = x_22(*-) / u_22( );
c_4: x_22(*-) = x_23(*-) - u_23( ) * x_31(/);
c_7: x_12(*-) = x_13(*-) - u_12( ) * x_21(/).

The output operands of simple computations c_1 and c_3 are used as the input-3 operands in the calculation of c_4 and c_7, respectively. In this case, simple computation pair (c_1, c_3) is called a precedence pair of simple computation pair (c_4, c_7).

Theorem 4.14 - Simple computation pair (c_i, c_j) ∈ C is a valid interactive pair if all of its precedence pairs are valid interactive pairs. Proof: Let (c_k, c_l) be a precedence pair of (c_i, c_j). From Lemmas 4.8, 4.9, or 4.10, when (c_i, c_j) is computed in the same cell, then (c_k, c_l) must also be computed in the same cell. That is, (c_k, c_l) must also be a valid interactive pair. Similarly, all precedence pairs of (c_k, c_l) must be valid interactive pairs. □ Theorem 4.14 has a global effect. It ensures that data flow in the array does not violate the simple communication property.
To determine whether a simple computation pair is a valid interactive pair, one must determine whether every precedence pair is a valid interactive pair. This is done by forming a test tree for each simple computation pair. Pair (c_i, c_j) is considered the root of the test tree, which has at most three sons, one for each of the existing precedence pairs. Similarly, each son in the test tree may contain at most three sons of its own, and so on. Figure 4.10 shows such a test tree. In Figure 4.10 the simple computation pairs at nodes C1 and C2 are the precedence pairs of the simple computation pair at node B1. Similarly, the pair at node B1 is a precedence pair of the simple computation pair at node A1. A1 is the root-node and B1, B2, and B3 are the son-nodes of A1. It is easy to see that the leaves of the test tree contain those simple computation pairs which have no precedence pairs. Note that a precedence pair is formed from two simple computations and not from two input operands. For example, consider simple computations c_5 and c_8 of Eq. 4.1-33. Although simple computation c_8 requires the output operands of simple computations c_9 and c_1, no precedence pairs can be formed from the respective input operands of c_5 and c_8. This is because the only input operand of c_5 is an initial partial result (a primary input).

Figure 4.10. A test tree.

A simple computation pair can now be marked VIP (valid interactive pair) or NVIP (not a valid interactive pair) based on the results obtained from applying the feature interrelationships. A pair is marked NVIP as soon as one of the feature interrelationships is not satisfied. A pair is marked VIP if all the feature interrelationships are satisfied. However, a VIP may be changed to NVIP later if Theorem 4.14 is not satisfied for that simple computation pair. For each simple computation pair, the test starts at the root of the test tree.
If the simple computation pair at the root-node is unmarked, then the feature interrelationships are applied to the pair at the root-node and the test result is marked. If the pair at the root-node is marked VIP, then the process continues at the son-nodes, each son-node, in turn, becoming a new root-node. When a pair at a node is marked NVIP, then all the simple computation pairs at the preceding father-nodes are also marked NVIP. For example, using the test tree of Figure 4.10, suppose the test has been completed for the simple computation pairs at nodes A1 and B1 and both are marked VIP. Next, the test continues by taking node C1 as the new root-node. The feature interrelationships are then applied to the pair at node C1. Suppose one of the feature interrelationships fails. Then the simple computation pair at node C1 is marked NVIP. At this time, the mark of the simple computation pair at node B1, which is the father-node of C1, is changed from VIP to NVIP, and in turn the mark of the simple computation pair at node A1 (the father-node of B1) is also changed to NVIP. Once a simple computation pair is marked NVIP, its mark remains unchanged throughout the testing process. The test tree for simple computation pair (c_4, c_7) of Eq. 4.1-33 is illustrated in Figure 4.11.

Figure 4.11. The test tree of (c_4, c_7) of Eq. 4.1-33.

Next to each node is the corresponding simple computation pair. Simple computation pair (c_2, c_4) is a precedence pair of pair (c_1, c_3); pairs (c_1, c_3) and (c_5, c_8) are the precedence pairs of (c_4, c_7). In order for (c_4, c_7) to be a valid interactive pair, simple computation pairs (c_2, c_4), (c_1, c_3), and (c_5, c_8) must all be valid interactive pairs. Constructing a test tree for each pair of simple computations results in a test forest. The testing process for valid interactions is applied only once for each of the simple computation pairs and the test result is marked.
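The marking process can be sketched as a recursive traversal of the test tree. The helpers `precedence_of` and `passes_interrelationships` are hypothetical stand-ins for the precedence-pair lookup and the feature-interrelationship tests; they are not interfaces defined in the dissertation.

```python
def mark_pair(pair, precedence_of, passes_interrelationships, marks=None):
    """Mark `pair` VIP or NVIP. A pair is VIP only if it satisfies the
    feature interrelationships and every precedence pair in its test
    tree is VIP; an NVIP below propagates up to the father-nodes.
    Each pair is tested only once: the shared `marks` dictionary plays
    the role of the test forest."""
    if marks is None:
        marks = {}
    if pair in marks:                  # already tested elsewhere
        return marks[pair]
    if not passes_interrelationships(pair):
        marks[pair] = "NVIP"
        return "NVIP"
    marks[pair] = "VIP"                # provisional; may change later
    for son in precedence_of(pair):
        if mark_pair(son, precedence_of,
                     passes_interrelationships, marks) == "NVIP":
            marks[pair] = "NVIP"       # father inherits the NVIP mark
            break
    return marks[pair]
```

With the tree of Figure 4.11, `precedence_of` would map ("c4", "c7") to [("c1", "c3"), ("c5", "c8")] and ("c1", "c3") to [("c2", "c4")]; a failure at (c_2, c_4) then propagates NVIP up through (c_1, c_3) to the root (c_4, c_7).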
If a simple computation pair is a precedence pair of more than one other simple computation pair, the repeated application of the testing process to the common precedence pair is eliminated. The previous theorems indicate the necessary and sufficient conditions that must be satisfied before a pair of simple computations can be computed in the same cell without violating the properties of systolic arrays. In the next section, the values of features Vxx, Vyy, Vzz, and Vww of the valid interactive pairs are used to generate the valid projection vectors. These vectors are then translated into geometric transformation matrices. Each geometric representation is transformed into a new representation. The projection of the new representation onto a lower dimensional space produces a sub-systolic array. The final systolic array is constructed by combining the sub-systolic arrays.

4.2 VALID PROJECTION VECTORS

In this section, the values of interactive features Vxx, Vyy, Vzz, and Vww of the valid interactive pairs are used to determine the valid projection vectors for each subset of simple computations. A set of 4-tuples (Vx, Vy, Vz, Vw) of valid projection vectors is generated from each subset of simple computations. The projection vectors of each 4-tuple correspond to a sub-systolic solution. Projection vector Vx is used to project the output operands, and hence the corresponding simple computations, onto a set of cell locations. This process will also identify the total number of cells and their individual locations on the projection space. Projection vectors Vy, Vz, and Vw define the relative operand distances for each of the cell input lines. In non-homogeneous systolic arrays, these vectors also define the necessary communication linkages between one or more adjacent sub-systolic arrays.

Theorem 4.15 - Let C be a subset of simple computations.
Projection vectors Vx, Vy, Vz, and Vw are valid projection vectors for the geometric representations of C if each simple computation pair, (c_i, c_j) ∈ C, is a valid interactive pair when

Vxx_ij = t_1 * Vx,    4.2-1
Vyy_ij = t_2 * Vy,
Vzz_ij = t_3 * Vz,
Vww_ij = t_4 * Vw,

where t_1, t_2, t_3, t_4 ∈ ZR. (This is an extension to Theorem 4.11.)

Proof: Assuming the other feature interrelationships hold, from Theorem 4.11 a simple computation pair, (c_i, c_j), is a valid interactive pair if projection vectors Vx, Vy, Vz, and Vw are linearly dependent to interactive feature values Vxx_ij, Vyy_ij, Vzz_ij, and Vww_ij, respectively. This is true for each valid interactive pair. Therefore, in order to construct a sub-systolic array from C using projection vectors Vx, Vy, Vz, and Vw, Eq. 4.2-1 must hold for each pair in C which is mapped to the same cell in the sub-systolic array. □

Theorem 4.15 provides a mechanism for generating the valid projection vectors from each subset of simple computations. One way is to construct a table of all possible projection vectors from the values of features Vxx, Vyy, Vzz, and Vww. Feature value Vxx_ij, calculated from simple computation pair (c_i, c_j), is compared with the values already stored in the table. If Vxx_ij is linearly dependent to any one of the corresponding existing values, then a match has been made. (Note, Vxx_ij is linearly dependent to a projection vector, Vx.) Otherwise, feature values Vxx_ij, Vyy_ij, Vzz_ij, and Vww_ij are added as a record to the existing list in the table. Each record is a candidate which indicates certain values for projection vectors Vx, Vy, Vz, and Vw. In case of a match, feature values Vyy_ij, Vzz_ij, and Vww_ij are compared to the corresponding stored values. If any one of these feature values is not linearly dependent to the corresponding stored value, then that record is no longer a candidate record.
The record is also not a candidate record if simple computation pair (c_i, c_j) does not satisfy all the feature interrelationships. That is, if (c_i, c_j) violates any one of the feature interrelationships, then the corresponding record (which is identified by feature value Vxx_ij) is marked as a non-candidate record. The process continues until all the simple computation pairs are tested. The smallest vectors which are linearly dependent to the values of the candidate records form the set of valid projection vectors. Specifically, the following procedure is used to generate the valid projection vectors for each subset of simple computations. The valid projection vectors extracted from each subset of simple computations are referred to as the valid local projection vectors.

Procedure A: Generating Valid Local Projection Vectors

Let C be a subset of simple computations such that Theorems 4.1 and 4.2 are satisfied for each simple computation pair, (c_i, c_j), in C. (The simple computations of C are computed in a sub-systolic array.) A table, T, of records, where each record consists of values linearly dependent to feature values Vxx_ij, Vyy_ij, Vzz_ij, and Vww_ij, is constructed. Table T will contain only one copy of each different record, and the records are identified by the values of feature Vxx.

Step 1. For simple computation pair (c_i, c_j) ∈ C, compare Vxx_ij with each record in T. If Vxx_ij is not linearly dependent to the corresponding value in any of the records in T, go to Step 2. If Vxx_ij is linearly dependent to the corresponding value in a record, r, in T and r is not marked as "bad", go to Step 3. Otherwise, go to Step 5.

Step 2. Add feature values Vxx_ij, Vyy_ij, Vzz_ij, and Vww_ij as a record, r, to the existing list in T and go to Step 4.

Step 3. Compare feature values Vyy_ij, Vzz_ij, and Vww_ij to the corresponding values in record r (Theorem 4.15).
If either Vyy_ij, Vzz_ij, or Vww_ij is not linearly dependent to the corresponding value in r, then mark record r as "bad" and go to Step 5. Otherwise, go to Step 4.

Step 4. Apply the conditions of Theorems 4.3 - 4.7 and 4.12 - 4.14 to simple computation pair (c_i, c_j). If any one of these conditions is not satisfied, then mark record r as "bad" and go to Step 5.

Step 5. If all the simple computation pairs in C are exhausted, print all the records in T which are not marked "bad" and stop. Otherwise, choose a different pair, (c_i, c_j), in C and go to Step 1.

Each "good" record in T indicates valid projection vectors Vx, Vy, Vz, and Vw. If there exist no "good" records in T, then there exist no valid projection vectors. This means that either there exist no sub-systolic solutions to implement the simple computations of subset C or each simple computation of C must be computed in a separate cell. In the latter case, each point in the geometric representation indicates a cell location and no geometric projections are performed.

The simple computations of Eq. 4.1-1 are grouped into two subsets C_1 and C_2 as indicated in Eq. 4.1-2 and repeated here for convenience.

C_1 = {c_2, c_4, c_5, c_7, c_8, c_9},    4.2-2
C_2 = {c_1, c_3, c_6},

where

c_2: x_32(*-) = x_33(*-);    4.2-3
c_4: x_22(*-) = x_23(*-) - u_23( ) * x_31(/);
c_5: x_23(*-) = x_24(*-);
c_7: x_12(*-) = x_13(*-) - u_12( ) * x_21(/);
c_8: x_13(*-) = x_14(*-) - u_13( ) * x_31(/);
c_9: x_14(*-) = x_15(*-);

and

c_1: x_31(/) = x_32(*-) / u_33( );
c_3: x_21(/) = x_22(*-) / u_22( );
c_6: x_11(/) = x_12(*-) / u_11( ).

Procedure A is applied to the simple computations of subsets C_1 and C_2 of Eq. 4.2-2, and the results are shown in Table 4.2. Let G_1 and G_2 represent the geometric representations constructed from the output operands' indices of the simple computations in C_1 and C_2, respectively. Let S_1 and S_2 represent the sub-systolic arrays constructed from the projections of G_1 and G_2 onto a line, respectively. From Table 4.2, projection vector Vx = (1, 0)^t of C_1 is used to construct S_1 from G_1.
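The steps of Procedure A can be sketched as follows. Here `features` and `passes_other_tests` are hypothetical callbacks, not interfaces from the dissertation: the first supplies the interactive feature 4-tuple of a pair, and the second applies the remaining theorem checks (4.3 - 4.7 and 4.12 - 4.14).

```python
def linearly_dependent(u, v):
    """True if integer vectors u and v are scalar multiples."""
    return all(u[i] * v[j] - u[j] * v[i] == 0
               for i in range(len(u)) for j in range(i + 1, len(u)))

def procedure_a(pairs, features, passes_other_tests):
    """Return the 'good' records of table T: candidate 4-tuples
    (Vxx, Vyy, Vzz, Vww) of valid local projection vectors."""
    table = []  # records: {"vecs": (Vxx, Vyy, Vzz, Vww), "bad": bool}
    for pair in pairs:
        Vxx, Vyy, Vzz, Vww = features(pair)
        # Step 1: records are identified by their Vxx component.
        r = next((rec for rec in table
                  if linearly_dependent(Vxx, rec["vecs"][0])), None)
        if r is None:
            # Step 2: add a new candidate record.
            r = {"vecs": (Vxx, Vyy, Vzz, Vww), "bad": False}
            table.append(r)
        elif r["bad"]:
            continue                      # Step 5: take the next pair
        elif not all(linearly_dependent(v, w) for v, w in
                     zip((Vyy, Vzz, Vww), r["vecs"][1:])):
            r["bad"] = True               # Step 3 fails (Theorem 4.15)
            continue
        if not passes_other_tests(pair):
            r["bad"] = True               # Step 4 fails
    return [rec["vecs"] for rec in table if not rec["bad"]]
```

The feature values and pair names in any concrete run would come from the algorithm representation of Chapter 3; the control flow above only mirrors the five steps of the procedure.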
Projection vectors Vy = (1, 0)^t, Vz = (1, 1)^t, and Vw = (1, 0)^t of C_1 indicate the relative input-1, input-2, and input-3 operand distances of each cell in S_1, respectively.

Table 4.2. Valid local projection vectors for C_1 and C_2 of Eq. 4.2-2.

Set of Simple      Valid Local Projection Vectors
Computations       Vx         Vy         Vz         Vw
C_1                (1,0)^t    (1,0)^t    (1,1)^t    (1,0)^t
C_2                (1,0)^t    (1,0)^t    (1,1)^t    -

The simple computations of C_2 contain only the output, input-1, and input-2 operands, so Vw of C_2 is not required. Projection vector Vx = (1, 0)^t of C_2 is used to construct S_2 from G_2, while Vy = (1, 0)^t and Vz = (1, 1)^t of C_2 define the relative distances of the input-1 and input-2 operands of each cell in S_2. The projection vectors of Table 4.2 satisfy the necessary boundary communication linkages between sub-systolic arrays S_1 and S_2. Projection vector Vw = (1, 0)^t of C_1 defines the relative distances of input-3 operands x_31(/), x_21(/), and x_11(/). This implies that x_31(/), x_21(/), and x_11(/), which are calculated in sub-systolic array S_2, are supplied to S_1 with relative distances equal to Vw = (1, 0)^t. This means that projection vector Vx of C_2, as shown in Table 4.2, must be equal to Vw of C_1. Similarly, the communication linkages from S_1 to S_2 must also be satisfied. The operands x_32(*-), x_22(*-), and x_12(*-) are calculated in S_1 and supplied to S_2 with relative distances equal to Vy of C_2. This implies that Vx of C_1 must be equal to Vy of C_2, as shown in Table 4.2.

The above discussion illustrates that certain relationships must hold between valid local projection vectors in order to satisfy the necessary boundary linkages between adjacent sub-systolic arrays. Furthermore, it also indicates that certain valid local projection vectors may not satisfy the necessary boundary linkages of sub-systolic arrays. Therefore, a list of valid global projection vectors is determined from the list of all valid local projection vectors.
A procedure to determine the valid global projection vectors from the given valid local projection vectors of each subset of simple computations is described below. Let C_1, C_2, ..., C_n be the n subsets of simple computations of an algorithm. (The simple computations of each C_i satisfy Theorems 4.1 and 4.2.) Let T_1, T_2, ..., T_n be the n tables consisting of records of valid local projection vectors of C_1, C_2, ..., C_n, respectively. Each record, r ∈ T_i, consists of at most four vectors. That is, r = (Vx, Vy, Vz, Vw). For example, let T_1 and T_2 represent two tables of records consisting of the valid local projection vectors of C_1 and C_2 of Eq. 4.2-2, respectively. Tables T_1 and T_2 each contain only one record, as shown in Table 4.2. That is,

T_1 = { [(1,0)^t, (1,0)^t, (1,1)^t, (1,0)^t] },    4.2-4
T_2 = { [(1,0)^t, (1,0)^t, (1,1)^t] }.

Let a ∈ T_1 and b ∈ T_2. That is,

a = [(1,0)^t, (1,0)^t, (1,1)^t, (1,0)^t];    4.2-5
b = [(1,0)^t, (1,0)^t, (1,1)^t].

Notation 4.1 - Let notation Vx|r∈T indicate projection vector Vx of record r in table T. Similarly, Vy|r∈T, Vz|r∈T, and Vw|r∈T indicate projection vectors Vy, Vz, and Vw in record r of T, respectively. For example, from Eq. 4.2-5, Vx|a∈T1 = (1, 0)^t and Vz|b∈T2 = (1, 1)^t.

Procedure B: Generating Valid Global Projection Vectors

Step 1. If the output operands of a subset, C_i, are used either as the input-1, input-2, or input-3 operands in another subset, C_j, then Vx|a∈Ti must be linearly dependent to either Vy|b∈Tj, Vz|b∈Tj, or Vw|b∈Tj, respectively, where a and b are two records. Repeat Step 1 for all a ∈ T_i and all such C_j. Remove those records from T_i where Vx|a∈Ti is not linearly dependent to either Vy|b∈Tj, Vz|b∈Tj, or Vw|b∈Tj, depending on whether the output operands of C_i are used either as input-1, input-2, or input-3 operands in C_j, respectively.

Step 2. If all C_i's are exhausted, then print the valid global projection vectors remaining as records in tables T_1 through T_n and stop. Otherwise, choose a different C_i and go to Step 1.
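Procedure B can be sketched the same way. In this illustrative (not canonical) interface, `tables[i]` holds the records (Vx, Vy, Vz, Vw) of T_i, with None standing for a missing Vw, and `uses[(i, j)]` names the operand slot (1 for input-1, 2 for input-2, 3 for input-3) in which the outputs of C_i appear in C_j:

```python
def linearly_dependent(u, v):
    """True if integer vectors u and v are scalar multiples."""
    return all(u[a] * v[b] - u[b] * v[a] == 0
               for a in range(len(u)) for b in range(a + 1, len(u)))

def procedure_b(tables, uses):
    """Step 1 of Procedure B: keep only records whose Vx is linearly
    dependent to the Vy, Vz, or Vw (selected by the operand slot) of
    some record in the consuming subset's table. The survivors are the
    valid global projection vectors."""
    for (i, j), slot in uses.items():
        # slot 1/2/3 selects index 1/2/3 of a record (Vy, Vz, Vw).
        tables[i] = [a for a in tables[i]
                     if any(b[slot] is not None and
                            linearly_dependent(a[0], b[slot])
                            for b in tables[j])]
    return tables
```

With the data of Table 4.2, where the outputs of C_2 feed the input-3 lines of C_1 and the outputs of C_1 feed the input-1 lines of C_2, both records survive, matching the text's observation that they are valid globally as well as locally.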
Incidentally, the records of Table 4.2 are both valid local and global projection vectors of subsets C_1 and C_2 of Eq. 4.2-2. Valid global projection vector Vx of each subset of simple computations indicates a geometric projection direction that supports a sub-systolic solution. Valid global projection vectors Vy, Vz, and Vw of each subset of simple computations indicate the relative input-1, input-2, and input-3 operand distances in the sub-systolic array, respectively. In the next chapter, each Vx is translated into an integer, non-singular transformation matrix. This transformation matrix is used to transform an original geometric representation, which is constructed from the indices of the output operands, into a new geometric representation. The projection of the new representation in a known direction will provide the individual cell locations and the simple computations of each cell. The individual cell locations and the simple computations of each cell form a subset of the design information required to construct each of the sub-systolic arrays. Other design information, such as data flow timing, cell count ratio, and design dimension, is also determined from the valid global projection vectors. The local cell interconnections are performed from the nearest cell input/output information. The valid global projection vectors guarantee both local cell connectivity and simple and regular communications, the architectural properties of systolic arrays.

4.3 CHAPTER SUMMARY

In this chapter additional features, referred to as the interactive features, are extracted from each pair of simple computations. These features and the algorithm features (extracted in Chapter 3) form the set of information extracted from the simple computations. The architectural properties of systolic arrays (i.e., simple processing elements, simple and regular data and control communications, and local cell connectivity) are expressed in terms of the feature interrelationships.
A pair of simple computations is called a valid interactive pair if it can be computed in the same cell without violating the properties of systolic arrays. A pair of simple computations is a valid interactive pair if all the feature interrelationships are satisfied and if all its precedence pairs are also valid interactive pairs. A test tree is constructed from each simple computation pair and all its successive precedence pairs. Each node in the test tree contains a simple computation pair. The simple computation pair at the root-node is a valid interactive pair if the simple computation pairs at all nodes of the tree are valid interactive pairs. The feature interrelationships are applied to each test tree constructed from each pair of simple computations. The process (Procedure A) produces as output a set of valid local projection vectors for each subset of simple computations. These projection vectors will be used to construct sub-systolic solutions to implement the simple computations of each subset. The final systolic array is constructed by joining the sub-systolic arrays. The valid local projection vectors indicate valid geometric projections used to construct sub-systolic solutions. They, however, may not satisfy the necessary boundary linkages between sub-systolic arrays. Procedure B, which operates on the valid local projection vectors, produces as output a set of valid global projection vectors. These projection vectors, if they exist, correspond to those sub-systolic solutions that, when joined, will produce a systolic array. In the next chapter, the valid global projection vectors are translated into integer, non-singular geometric transformation matrices. These matrices are used to transform an original geometric representation into a new representation such that the projection of the new representation provides the cell locations and simple computations of each cell.
The valid global projection vectors are also used to determine data movement information and cell count ratio.

CHAPTER 5

ARCHITECTURAL SPECIFICATION

This chapter describes the method for the determination of systolic array architectural specifications from the valid projection vectors. The architectural specifications needed to construct systolic arrays are:

1. individual cell locations;
2. simple computations of each cell;
3. data movement;
4. data timing;
5. design dimension;
6. data input/output patterns;
7. cell interconnection patterns;
8. cell functions.

Not all of these specifications are needed to complete the design. The information of individual cell locations, data timing, and cell interconnection patterns is by far the most important subset. Certain other specifications, such as the cell interconnection patterns, can be extracted from other information; in this case, from the simple computations of each cell. The architectural specifications directly extracted from the model include the individual cell locations, simple computations of each cell, data movement, cell functions, and design dimension. The cell interconnection patterns, data timing, and data input/output patterns are extracted from the nearest neighbor cell information and data movement information. The individual cell locations are determined via geometric projections. A geometric representation constructed from the indices of the output operands is projected onto a lower dimensional space (e.g., a plane). Each point on the projection plane represents a cell location. The design dimension is determined from the dimension of the geometric representation. The data movement information is determined from the valid global projection vectors. It identifies the stationary and non-stationary operands in the array. The stationary operands are either primary inputs or outputs. An index, referred to as the cell count ratio, is also calculated from the valid global projection vectors.
This index relates to the number of cells required. The cell count ratio and data movement information are determined early in the design process, hence providing a priori information about the different systolic solutions.

5.1 INDIVIDUAL CELL LOCATIONS

The individual cell locations are determined by projecting a geometric representation, constructed from the indices of the output operands, onto a lower dimensional space. The valid global projection vectors, described in Chapter 4, indicate the projection directions which produce systolic solutions. Since the output operands are unique and each uniquely defines a simple computation, only the valid global projection vector Vx is used to define the individual cell locations. Each different Vx indicates a different geometric projection. In order to assign a unique projection direction to all the geometric projections, Vx is translated into a geometric transformation matrix, H. Let G be the geometric representation constructed from the indices of the output operands of a subset of simple computations. Matrix H transforms an index Ix ∈ G into another index Ix' ∈ G', where H * Ix = Ix' and G' is the new geometric representation. This is denoted as H: G -> G'. Matrix H is a mapping from Z^m to Z^m, where m is the dimension of G and Z is the set of all integer numbers. This means that the entries of H are integer numbers that produce integer coordinates for each point in G'. Furthermore, H must be a non-singular matrix to provide a one-to-one mapping from G onto G'. A geometric transformation is equivalent to axes rotation. The axes of the initial geometric representation are rotated by certain angles which are calculated from the valid global projection vector Vx. For a linear (1-dimensional) sub-systolic array, geometric representation G is a 2-dimensional space. For example, in Figure 5.1, two simple computations, c_k and c_l, are shown by points at locations Ix_k and Ix_l.
Let simple computation pair (c_k, c_l) be a valid interactive pair. Let Vx be the valid global projection vector which maps c_k and c_l onto the same cell location. From the feature interrelationships discussed in Chapter 4, feature value Vxx_kl = (Ix_k - Ix_l) is linearly dependent to Vx. Select c_k and c_l such that Vxx_kl = Vx, and let Vx = (v_1, v_2)^t. The rotation of the i-j plane by φ radians counterclockwise (as shown in Figure 5.1) results in a new representation, G', indicated by the i'-j' plane. Note that the projection of G' onto the j' axis maps c_k and c_l onto the same location. This axes rotation can be represented by a transformation matrix. The following discussion illustrates the determination of transformation matrices from a valid global projection vector Vx for 2-dimensional and 3-dimensional geometric representations. The projections of these representations produce 1-dimensional (1D) and 2-dimensional (2D) systolic arrays, respectively.

Figure 5.1. A 2-dimensional geometric representation with two simple computations.

5.1.1 1D AND 2D SYSTOLIC ARRAYS

This section discusses the generation of the geometric transformation matrices for 2-dimensional and 3-dimensional geometric representations. A transformation matrix in this case is an integer, non-singular matrix which transforms each point with integer coordinates in the original geometric representation into a unique point with integer coordinates in the new representation. These matrices are generated from the valid global projection vectors, Vx's. In Figure 5.1, the angle φ is calculated as follows:

φ = - tan^-1 (v_2 / v_1)    5.1-1

where Vx = (v_1, v_2)^t. The rotation of the i-j plane by φ radians counterclockwise can be indicated by the following matrix:

R = [ cos φ   - sin φ ]    5.1-2
    [ sin φ     cos φ ]

The multiplication of matrix R by Ix_k produces Ix'_k, which indicates the new location of c_k. Let Ix_k = (x_1, x_2), where x_1 and x_2 are integer numbers, and let Ix'_k = (x'_1, x'_2).
That is,

[ cos φ   - sin φ ] [ x_1 ]   [ x'_1 ]    5.1-3
[ sin φ     cos φ ] [ x_2 ] = [ x'_2 ]

which can be rewritten as

x'_1 = cos φ (x_1) - sin φ (x_2);    5.1-4
x'_2 = sin φ (x_1) + cos φ (x_2).

However, note that x'_1 and x'_2 may not be integers, since cos φ and sin φ take values in [-1, 1]. To resolve this problem, an integer transformation matrix, H, is derived from matrix R. To do this, two cases must be examined.

Case 1: v_1 ≠ 0 (this corresponds to the values of φ which are not integer multiples of π/2)

For values of φ which are not integer multiples of π/2, divide the equations of 5.1-4 by cos φ. That is,

x'_1 / cos φ = x_1 - tan φ (x_2);    5.1-5
x'_2 / cos φ = tan φ (x_1) + x_2.    5.1-6

Eq. 5.1-1 indicates that tan φ = - v_2 / v_1. For v_1 ≠ 0 we have,

x'_1 / cos φ = x_1 + (v_2 / v_1)(x_2);    5.1-7
x'_2 / cos φ = - (v_2 / v_1)(x_1) + x_2.

Multiplying both sides of Eq. 5.1-7 by v_1 yields

v_1 (x'_1) / cos φ = v_1 (x_1) + v_2 (x_2);    5.1-8
v_1 (x'_2) / cos φ = - v_2 (x_1) + v_1 (x_2).

The right hand sides of Eq. 5.1-8 are both integers. Let

x^_1 = v_1 (x'_1) / cos φ;    5.1-9
x^_2 = v_1 (x'_2) / cos φ.

x^_1 and x^_2, which are both integer numbers, can be considered as the new coordinates of the location of c_k, where the original axes (the i-j plane) are rotated by φ radians in the counterclockwise direction. Replace the left hand sides of Eq. 5.1-8 by those of Eq. 5.1-9. The results can be represented in matrix form as follows:

[ x^_1 ]   [  v_1   v_2 ] [ x_1 ]    5.1-10
[ x^_2 ] = [ -v_2   v_1 ] [ x_2 ]

where

[  v_1   v_2 ]    5.1-11
[ -v_2   v_1 ]

is the transformation matrix, H.

Case 2: v_1 = 0

In this case, φ = - tan^-1 (v_2 / 0), which implies that

φ = - tan^-1 (∞) if v_2 > 0;    5.1-12
φ =   tan^-1 (∞) if v_2 ≤ 0;

and therefore

φ = - π/2 if v_2 > 0;    5.1-13
φ =   π/2 if v_2 ≤ 0.

Hence, v_1 = 0 corresponds to values of φ which are integer multiples of π/2. In this case H is calculated from matrix R (Eq. 5.1-2):

H = [ cos(-π/2)   - sin(-π/2) ]  if v_2 > 0;
    [ sin(-π/2)     cos(-π/2) ]

or

H = [ cos(π/2)   - sin(π/2) ]  if v_2 ≤ 0.
    [ sin(π/2)     cos(π/2) ]

Thus,
    H = [  0   1 ]   if v2 > 0;                                     (5.1-14)
        [ -1   0 ]

    H = [ 0   -1 ]   if v2 <= 0.
        [ 1    0 ]

Therefore, in general, for a 2-dimensional geometric representation, the transformation matrix, H, is calculated from the projection vector Vx = (v1, v2)^t as follows:

    H = [  v1   v2 ]   if v1 != 0;                                  (5.1-15)
        [ -v2   v1 ]

    H = [  0   1 ]     if v2 > 0 and v1 = 0;
        [ -1   0 ]

    H = [ 0   -1 ]     if v2 <= 0 and v1 = 0.
        [ 1    0 ]

The new geometric representation is G^, indicated by the i^-j^ plane. Transformation matrix H is calculated such that the projection of G^ is always performed in the direction of the i^ axis onto the j^ axis. For example, consider the geometric representation of Figure 5.2 with points at locations p = (1, 1) and q = (2, 2). Using projection vector Vx = (1, 1)^t, the corresponding transformation matrix is calculated from Eq. 5.1-15 as

    H = [  1   1 ]                                                  (5.1-16)
        [ -1   1 ]

Multiplying H by the coordinates of points p and q in the i-j plane will produce the coordinates of p and q in the new geometric representation:

    H p = p^ = (2, 0)^t;    H q = q^ = (4, 0)^t.                    (5.1-17)

This is illustrated by points p^ = (2, 0) and q^ = (4, 0) in the i^-j^ plane in Figure 5.3. Note that the projection of the new geometric representation (Figure 5.3) onto the j^ axis, as expected, maps p^ and q^ onto the same cell location.

Figure 5.2. An original geometric representation.

Figure 5.3. A transformation of the geometric representation of Figure 5.2.

The results of Eq. 5.1-15 can be extended to 3-dimensional geometric representations. The valid global projection vectors for 3-dimensional geometric representations are vectors of size three. Let Vx = (v1, v2, v3)^t be the projection vector of a 3-dimensional geometric representation, G. Let G be indicated by the i-j-k space, where each simple computation is represented by a point in G. The rotation of the i-j plane by phi = -tan^{-1}(v2 / v1) radians counterclockwise is given by matrix R1:

    R1 = [ cos phi   -sin phi   0 ]                                 (5.1-18)
         [ sin phi    cos phi   0 ]
         [    0          0      1 ]

Similar to the method discussed for 2-dimensional geometric representations, an
integer, non-singular transformation matrix, H1, is constructed from R1 as follows:

    H1 = [  v1   v2   0 ]   if v1 != 0;                             (5.1-19)
         [ -v2   v1   0 ]
         [   0    0   1 ]

    H1 = [  0   1   0 ]     if v1 = 0 and v2 > 0;
         [ -1   0   0 ]
         [  0   0   1 ]

or

    H1 = [ 0   -1   0 ]     if v1 = 0 and v2 <= 0.
         [ 1    0   0 ]
         [ 0    0   1 ]

The transformation, H1: G -> G', transforms each point in G to a point in G' with integer coordinates. Let G' be indicated by the i'-j'-k' space. Recall that projection vector Vx = (v1, v2, v3)^t, which indicates a projection direction, is a line connecting points (0, 0, 0) and (v1, v2, v3). This implies that projection vector Vx of G corresponds to a projection vector, Vx', of G', where Vx' = H1 * Vx. That is,

    H1 [ v1 ]   [ v'1 ]                                             (5.1-20)
       [ v2 ] = [  0  ]
       [ v3 ]   [ v'3 ]

The rotation of the i'-k' plane by theta = -tan^{-1}(v'3 / v'1) radians counterclockwise is given by matrix R2:

    R2 = [ cos theta   0   -sin theta ]                             (5.1-21)
         [     0       1       0      ]
         [ sin theta   0    cos theta ]

The corresponding transformation matrix H2 is

    H2 = [  v'1   0   v'3 ]   if v'1 != 0;                          (5.1-22)
         [   0    1    0  ]
         [ -v'3   0   v'1 ]

    H2 = [  0   0   1 ]       if v'1 = 0 and v'3 > 0;
         [  0   1   0 ]
         [ -1   0   0 ]

or

    H2 = [ 0   0   -1 ]       if v'1 = 0 and v'3 <= 0.
         [ 0   1    0 ]
         [ 1   0    0 ]

The transformation, H2: G' -> G'', transforms each point in G' to a point in G'' with integer coordinates. Let G'' be indicated by the i''-j''-k'' space. Projection vector Vx' = (v'1, 0, v'3)^t of G' corresponds to a projection vector, Vx'', of G'', where Vx'' = H2 * Vx'. That is,

    H2 [ v'1 ]   [ v''1 ]                                           (5.1-23)
       [  0  ] = [  0   ]
       [ v'3 ]   [  0   ]

Projection vector Vx'' is a line from (0, 0, 0) to (v''1, 0, 0) in the i''-j''-k'' space, which defines the i'' axis as the projection direction and the j''-k'' plane as the projection plane. Transformation matrices H1 and H2, which are both integer and non-singular matrices, are combined to produce a single integer and non-singular transformation matrix, H, by

    H = H2 * H1.                                                    (5.1-24)

5.1.2 m-DIMENSIONAL SYSTOLIC ARRAYS

Equations 5.1-15 and 5.1-24 illustrate the translation of the valid global projection vector Vx into an integer, non-singular transformation matrix for 2-dimensional and 3-dimensional geometric representations, respectively.
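Both constructions can be exercised directly. The following Python sketch (the function names are ours, not part of the dissertation's design facility) builds H from Vx for the 2-dimensional case of Eq. 5.1-15 and the 3-dimensional case of Eq. 5.1-24, and checks that H maps Vx onto the projection axis:

```python
def h_2d(v1, v2):
    # Eq. 5.1-15: integer, non-singular H for Vx = (v1, v2)^t.
    if v1 != 0:
        return [[v1, v2], [-v2, v1]]
    return [[0, 1], [-1, 0]] if v2 > 0 else [[0, -1], [1, 0]]

def matmul3(a, b):
    # 3 x 3 integer matrix product.
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec3(a, v):
    return tuple(sum(a[i][k] * v[k] for k in range(3)) for i in range(3))

def h_3d(vx):
    # Eq. 5.1-24: H = H2 * H1, with H1 and H2 from Eqs. 5.1-19 and 5.1-22.
    v1, v2, _ = vx
    if v1 != 0:
        h1 = [[v1, v2, 0], [-v2, v1, 0], [0, 0, 1]]
    elif v2 > 0:
        h1 = [[0, 1, 0], [-1, 0, 0], [0, 0, 1]]
    else:
        h1 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
    w1, _, w3 = matvec3(h1, vx)          # Vx' = H1 * Vx = (v'1, 0, v'3)^t
    if w1 != 0:
        h2 = [[w1, 0, w3], [0, 1, 0], [-w3, 0, w1]]
    elif w3 > 0:
        h2 = [[0, 0, 1], [0, 1, 0], [-1, 0, 0]]
    else:
        h2 = [[0, 0, -1], [0, 1, 0], [1, 0, 0]]
    return matmul3(h2, h1)

# 2D: the Figure 5.2 example with Vx = (1, 1)^t.
assert h_2d(1, 1) == [[1, 1], [-1, 1]]
# 3D: H * Vx points along the i'' axis, e.g. (1, 1, 1) -> (5, 0, 0).
assert matvec3(h_3d((1, 1, 1)), (1, 1, 1)) == (5, 0, 0)
```

Because H * Vx has zeros in all but the first coordinate, dropping that coordinate projects every line parallel to Vx onto a single cell location, which is exactly what the design method requires.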
The projections of the 2-dimensional and 3-dimensional geometric representations produce 1-dimensional and 2-dimensional sub-systolic arrays, respectively. However, the method of translating projection vector Vx to a transformation matrix can be extended to higher (i.e., >3) dimensional geometric representations.

Procedure C: m-Dimensional Transformation Matrix Construction

1. Let Vx(0) = (v1(0), v2(0), ..., vm(0))^t be the initial valid global projection vector. Let H(1), ..., H(m-1) be m x m diagonal matrices where the diagonal elements are one and the non-diagonal elements are zero. Let k = 1.

2. Modify H(k) = (hij(k)) as follows:
   i. let a = v1(k-1) and b = v_{k+1}(k-1);
   ii. determine the new values of a and b from Table 5.1 (in Table 5.1, gcd stands for the greatest common divisor);
   iii. let
        h_{1,1}(k) = a;
        h_{1,k+1}(k) = b;
        h_{k+1,1}(k) = -b;
        h_{k+1,k+1}(k) = a.

3. Let Vx(k) = H(k) * Vx(k-1).

4. Let k = k + 1; if k <= m - 1 then go to 2, otherwise go to 5.

5. Let H = H(m-1) * H(m-2) * ... * H(1).

Table 5.1. Lookup Table for Transformation Matrix Construction.

              a = 0              a != 0
  b = 0       a = 1, b = 0       a = a/gcd(a, b), b = b/gcd(a, b)
  b > 0       a = 0, b = 1       a = a/gcd(a, b), b = b/gcd(a, b)
  b < 0       a = 0, b = -1      a = a/gcd(a, b), b = b/gcd(a, b)

Figure 5.7. The systolic solution for solving an upper triangular system of linear equations.

5.6 CELL FUNCTIONS

The functions of each cell are determined early in the design process. In general, before generating the simple computations (Chapter 3), the functions of each different cell type are defined. Then, the algorithm is decomposed into its equivalent set of simple computations such that each is computable in a cell. Only those cells with at most three inputs and three outputs are considered here. Each such cell can perform at most two operations during a clock cycle. However, the method can be extended to more complex cell structures performing many operations during a clock cycle.
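Procedure C (Section 5.1.2) can be prototyped in a few lines of Python. The if/else chain below encodes our reading of Table 5.1 (a != 0: divide a and b by gcd(a, b); a = 0: replace (a, b) by (1, 0), (0, 1), or (0, -1) according to the sign of b); treat this as a sketch under those assumptions, not the design facility's implementation:

```python
from math import gcd

def identity(m):
    return [[int(i == j) for j in range(m)] for i in range(m)]

def procedure_c(vx):
    # Build the m x m integer, non-singular transformation matrix H
    # from the valid global projection vector Vx (Procedure C).
    m = len(vx)
    v = list(vx)
    h = identity(m)
    for k in range(1, m):                  # steps 2-4, k = 1 .. m-1
        a, b = v[0], v[k]
        if a != 0:                         # Table 5.1, column a != 0
            g = gcd(a, b)                  # note: gcd(a, 0) = |a|
            a, b = a // g, b // g
        else:                              # Table 5.1, column a = 0
            a, b = (1, 0) if b == 0 else ((0, 1) if b > 0 else (0, -1))
        hk = identity(m)
        hk[0][0], hk[0][k] = a, b          # step 2.iii
        hk[k][0], hk[k][k] = -b, a
        v = [sum(hk[i][j] * v[j] for j in range(m)) for i in range(m)]
        h = [[sum(hk[i][t] * h[t][j] for t in range(m)) for j in range(m)]
             for i in range(m)]            # step 5, accumulated on the fly
    return h

# A 4-dimensional example: H * Vx points along the first axis.
H = procedure_c((1, 2, 0, -1))
```

Each pass zeroes one more component of Vx, so after m - 1 passes the transformed projection vector lies along the first axis and projection amounts to dropping the first coordinate.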
5.7 CHAPTER SUMMARY

This chapter presented the extraction of systolic array architectural characteristics such as the individual cell locations, data movement, the simple computations of each cell, data flow timing, design dimension, input/output data patterns, and cell count ratio. The individual cell locations and the simple computations of each cell are determined from the projections of the geometric representations. Data movement and cell count ratio are determined from the valid global projection vectors. From the nearest-cell information and the data movement information a macro cell is constructed. A sub-systolic array is constructed by repeating the macro cell at each cell location. A systolic array is constructed by joining all the component sub-systolic arrays. The data flow timing of each sub-systolic array is determined from the communication lines of the macro cell, which indicate the rate at which actual data enters and exits the sub-systolic array. The data flow timing of a systolic array is equal to the largest data flow timing value of its sub-systolic arrays. This rate and the data dependency information from the adjacent cells indicate the timing of each datum and the input/output patterns. Using the data movement information and the cell count ratio, systolic solutions with different numbers of cells and I/O limitations can be selected early in the design process.

CHAPTER 6

DESIGN TOOLS AND RESULTS

A design facility based on this method has been developed for designing and simulating systolic arrays from algorithms. Section 6.1 describes the design facility. The systolic solutions for matrix multiplication, LU decomposition, the solution of lower triangular systems of equations, a finite impulse response filter, and 2D convolution are given in Section 6.2.

6.1 DESIGN FACILITY

The block diagram of the design facility is illustrated in Figure 6.1. The routines were developed only for the purpose of research and at present they are loosely connected.
Furthermore, only those procedures which were essential for this research are implemented. In Figure 6.1, the procedures of the Near Cell Information Analyzer and the Structure Editor are not implemented. In this section the purpose of each routine is discussed.

Algorithm Decomposition — This step translates the original algorithm, described by mathematical expressions, into equivalent simple computations. This is referred to as the algorithm initial representation. The generalization of this step, however, requires the writing of a universal routine that translates a given mathematical expression into its equivalent simple computations. Mathematical expressions in the categories considered in this work are easily translated into simple computations.

Feature Extraction — This routine operates on the algorithm initial representation; it computes, for each pair of simple computations, those feature values which involve index differences. Hence, the multiple evaluation of these feature values during the generation of the valid projection vectors is eliminated.

Projection Vector Generator — This routine is the implementation of Procedures A and B of Chapter 4. At its termination, the routine lists all the valid global projection vectors for each subset of simple computations. If, for a subset of simple computations, no valid global projection vector exists, then either there exists no sub-systolic solution or one cell per simple computation is needed.

Transformation Matrix Generator — For a given projection vector, Vx, this routine computes the geometric transformation matrix. These matrices are stored for later use in the construction of larger systolic arrays. The systolic design method is applied to small problems. The geometric transformation matrices are then used to transform the geometric representations of larger problem sizes.

Geometric Transformation and Projection — This routine performs the geometric transformation and projection.
Specifically, it determines the new geometric representation, G'. The projection of G' onto the lower-dimensional space (e.g., the j'-k' plane) provides the individual cell locations and the simple computations of each cell.

Near Cell Information — From the simple computations of each cell and the data movement information, this routine determines the macro cell for each sub-systolic solution. Repeating the macro cell at each cell location produces the sub-systolic array.

Figure 6.1. The system diagram.

Structure Generator — This routine is used to construct the systolic array for simulation. Each cell is considered as a node, where the cell interconnections are performed through linked lists. Each node is a record consisting of input registers, output registers, cell functions, node type (a cell or a delay register), cell type (an inner cell or a boundary cell), and linkages. Given the individual cell locations and the macro cell (for homogeneous arrays), the sub-systolic array is constructed. For non-homogeneous arrays, the data directions may not be uniform and some structural editing may be needed.

Structure Editing — This routine is used to modify the generated array in case the systematic steps do not complete the design. Structural editing may be needed in the following two cases.

1. The systematic steps generate only those cell locations which are required to do the necessary calculations. However, in some designs (e.g., Figure 6.21), for the purpose of design regularity, additional cells are added to the design. These cells perform no useful calculations and serve only as delay registers.

2.
For non-homogeneous systolic arrays, the design process produces the cell locations and cell interconnections for each of the sub-systolic arrays. Each sub-systolic array is a homogeneous array and therefore has uniform cell interconnections. The final systolic array is then constructed by joining all the sub-systolic arrays. Often, when sub-systolic arrays are joined together to form the final systolic array, the cell interconnections between the boundary cells of the adjacent sub-systolic arrays are different from those in either of the sub-systolic arrays. In this case, structural editing is needed to modify the boundary cell interconnections.

Simulator — This routine performs the final simulation for a specific number of clock cycles. At each clock cycle the input data in the desired input patterns is supplied to the simulator. Similarly, at each clock cycle the routine lists the output data from the respective boundary cells.

The system is used to construct and simulate systolic solutions for several known problems. This system has reproduced known systolic solutions for these algorithms and in some cases has produced several new solutions.

6.2 RESULTS

This new method for constructing systolic arrays is applied to several problems and the results are reported. Matrix multiplication, LU decomposition, and the solution of lower triangular systems of equations are selected as space-indexed algorithms. The finite impulse response (FIR) filter and 2D convolution are chosen as time-indexed algorithms. The transformation of each algorithm into its equivalent set of simple computations is given. For each algorithm, the procedures are applied to a small problem size. The selected problem size is the smallest size that represents the algorithm. That is, the simple computations generated from two different problem sizes should differ only in their number and not in the number of subsets of simple computations.
For each algorithm the valid global projection vectors, transformation matrices, data flow timing, and data movement are also reported. The cell count ratio for each different systolic solution is also reported.

6.2.1 MATRIX MULTIPLICATION

Consider matrix multiplication AB = C, where A(a_ij) is N x M, B(b_ij) is M x Q, and C(c_ij) is N x Q. This algorithm is described by

    c_ij = sum_{k=1}^{M} a_ik * b_kj   for i = 1, ..., N and j = 1, ..., Q.   (6.2-1)

Each c_ij is the sum of certain specified products. The order in which the products are computed is not important. This provides considerable freedom as to how these calculations are performed. For example, Eq. 6.2-1 does not specify the order in which the products are summed. The following simple computations perform the summations in the lexicographical order of indices:

    c_{ij,0} == 0;                                                            (6.2-2)
    c_{ij,k}(*+) = c_{ij,k-1}(*+) + a_{i,k,0}( ) * b_{k,j,0}( )   for k = 1, ..., M;
    c_ij == c_{ij,M}   for i = 1, ..., N and j = 1, ..., Q.

The index spaces of the a's and b's are artificially expanded (by adding a zero to each index) to match the dimension of the index space of the c's. As indicated in Eq. 6.2-2, all the indices are now members of Z^3 (refer to Chapter 3). For M = 3, c_11 is calculated as follows:

    c_{110} == 0;                                                             (6.2-3)
    c_{111}(*+) = c_{110}(*+) + a_{110}( ) * b_{110}( );
    c_{112}(*+) = c_{111}(*+) + a_{120}( ) * b_{210}( );
    c_{113}(*+) = c_{112}(*+) + a_{130}( ) * b_{310}( );
    c_11 == c_{113}.

Each simple computation of Eq. 6.2-3 is computed in a cell with multiplication and addition operations. For matrix multiplication, there exists only one type of simple computation. Hence, the systolic solutions are homogeneous arrays. The procedures were applied to the matrix multiplication algorithm for N = 3, M = 3, and Q = 3. Table 6.1 shows the valid global projection vectors generated from 3 x 3 square matrix multiplication.

Table 6.1. The valid global projection vectors for matrix multiplication.

         Vx             Vy             Vz             Vw
  1.  (0, 0, 1)^t    (0, 0, 1)^t    (0, 1, 0)^t    (1, 0, 0)^t
  2.
      (0, 1, 0)^t    (0, 1, 0)^t    (0, 0, 0)^t    (0, 1, 0)^t
  3.  (1, 0, 0)^t    (1, 0, 0)^t    (1, 0, 0)^t    (0, 0, 0)^t
  4.  (0, 1, 1)^t    (0, 1, 1)^t    (0, 1, 0)^t    (1, 1, 0)^t
  5.  (1, 1, 0)^t    (1, 1, 0)^t    (1, 0, 0)^t    (0, 1, 0)^t
  6.  (1, 0, 1)^t    (1, 0, 1)^t    (1, 1, 0)^t    (1, 0, 0)^t
  7.  (0, 1, -1)^t   (0, 1, -1)^t   (0, 1, 0)^t    (1, -1, 0)^t
  8.  (1, -1, 0)^t   (1, -1, 0)^t   (1, 0, 0)^t    (0, 1, 0)^t
  9.  (1, 0, -1)^t   (1, 0, -1)^t   (1, -1, 0)^t   (1, 0, 0)^t
 10.  (1, 1, 1)^t    (1, 1, 1)^t    (1, 1, 0)^t    (1, 1, 0)^t
 11.  (1, 1, -1)^t   (1, 1, -1)^t   (1, -1, 0)^t   (1, -1, 0)^t
 12.  (1, -1, 1)^t   (1, -1, 1)^t   (1, 1, 0)^t    (1, -1, 0)^t
 13.  (1, -1, -1)^t  (1, -1, -1)^t  (1, -1, 0)^t   (1, 1, 0)^t

Each valid global projection vector, Vx, corresponds to a systolic solution. Each Vx is systematically translated into a geometric transformation matrix H. Procedure C, for m = 3, is applied to each projection vector, and the transformation matrices are listed in Eq. 6.2-4 (rows separated by semicolons):

    H = [ 0  0  1;  0  1  0; -1  0  0 ]   for Vx = (0, 0, 1)^t;              (6.2-4)
    H = [ 0  1  0; -1  0  0;  0  0  1 ]   for Vx = (0, 1, 0)^t;
    H = [ 1  0  0;  0  1  0;  0  0  1 ]   for Vx = (1, 0, 0)^t;
    H = [ 0  1  1; -1  0  0;  0 -1  1 ]   for Vx = (0, 1, 1)^t;
    H = [ 1  1  0; -1  1  0;  0  0  1 ]   for Vx = (1, 1, 0)^t;
    H = [ 1  0  1;  0  1  0; -1  0  1 ]   for Vx = (1, 0, 1)^t;
    H = [ 0  1 -1; -1  0  0;  0  1  1 ]   for Vx = (0, 1, -1)^t;
    H = [ 1 -1  0;  1  1  0;  0  0  1 ]   for Vx = (1, -1, 0)^t;
    H = [ 1  0 -1;  0  1  0;  1  0  1 ]   for Vx = (1, 0, -1)^t;
    H = [ 2  2  1; -1  1  0; -1 -1  2 ]   for Vx = (1, 1, 1)^t;
    H = [ 2  2 -1; -1  1  0;  1  1  2 ]   for Vx = (1, 1, -1)^t;
    H = [ 2 -2  1;  1  1  0; -1  1  2 ]   for Vx = (1, -1, 1)^t;
    H = [ 2 -2 -1;  1  1  0;  1 -1  2 ]   for Vx = (1, -1, -1)^t.

The cell count ratio for each Vx is calculated for 3 x 3 square matrix multiplication and the results are shown in Table 6.2.

Table 6.2. CCR for each Vx of matrix multiplication.

         Vx             CCR
  1.  (0, 0, 1)^t     1.5
  2.  (0, 1, 0)^t     1.5
  3.  (1, 0, 0)^t     1.5
  4.  (0, 1, 1)^t     2.25
  5.  (1, 1, 0)^t     2.25
  6.  (1, 0, 1)^t     2.25
  7.  (0, 1, -1)^t    2.25
  8.  (1, -1, 0)^t    2.25
  9.  (1, 0, -1)^t    2.25
 10.  (1, 1, 1)^t     3.375
 11.  (1, 1, -1)^t    3.375
 12.  (1, -1, 1)^t    3.375
 13.  (1, -1, -1)^t   3.375

The total number of simple computations generated for 3 x 3 square matrix multiplication is 27. CCRs are always greater than one. The closer the CCR is to one, the fewer cells are needed. This implies that those projection vectors which correspond to CCR = 1.5 require the fewest number of cells.
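The 27 simple computations of Eq. 6.2-2 and the cell counts behind Table 6.2 can be checked with a short Python sketch. Two index points fall onto the same cell exactly when their difference is parallel to Vx, so counting the distinct lines parallel to Vx through the index space gives the number of computing cells (the helper names and the line-invariant encoding are ours):

```python
def simple_computation_points(N, M, Q):
    # One index point per simple computation c_{ij,k} of Eq. 6.2-2.
    return [(i, j, k)
            for i in range(1, N + 1)
            for j in range(1, Q + 1)
            for k in range(1, M + 1)]

def matmul_by_simple_computations(A, B):
    # Evaluate C = AB via c_{ij,0} = 0; c_{ij,k} = c_{ij,k-1} + a_ik * b_kj.
    N, M, Q = len(A), len(B), len(B[0])
    C = [[0] * Q for _ in range(N)]
    for i, j, k in simple_computation_points(N, M, Q):
        C[i - 1][j - 1] += A[i - 1][k - 1] * B[k - 1][j - 1]
    return C

def cell_count(vx, N=3, M=3, Q=3):
    # Points p and q share a cell iff p - q is parallel to Vx; the tuple
    # (p[t]*vx[m] - p[m]*vx[t]) is constant exactly on such lines.
    m = next(t for t, c in enumerate(vx) if c != 0)
    cells = {tuple(p[t] * vx[m] - p[m] * vx[t] for t in range(3))
             for p in simple_computation_points(N, M, Q)}
    return len(cells)

assert len(simple_computation_points(3, 3, 3)) == 27
assert cell_count((0, 0, 1)) == 9       # Q*N cells, CCR = 1.5
assert cell_count((0, 1, 1)) == 15      # N(M+Q-1) cells, CCR = 2.25
```

For the diagonal vectors such as (1, 1, 1)^t this count gives only the cells that perform computations; the (N+M-1)(M+Q-1) layout of Table 6.4 additionally pads the array with delay-register cells for design regularity, as noted in the Structure Editing discussion of Section 6.1.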
The projection planes and the systolic solutions corresponding to the thirteen projection vectors are shown in Figures 6.2 through 6.27. The directions of the data communication lines, which are determined from the near cell information, are shown in Table 6.3.

Table 6.3. Data input directions for matrix multiplication systolic solutions.

                      Input Line Directions
         Vx            ia          ib          ic
  1.  (0, 0, 1)^t    (0, -1)     (-1, 0)     --
  2.  (0, 1, 0)^t    --          (1, 0)      (0, -1)
  3.  (1, 0, 0)^t    (-1, 0)     --          (0, -1)
  4.  (0, 1, 1)^t    (0, 1)      (1, 0)      (0, -1)
  5.  (1, 1, 0)^t    (0, 1)      (0, -1)     (-1, 0)
  6.  (1, 0, 1)^t    (-1, 0)     (0, 1)      (0, -1)
  7.  (0, 1, -1)^t   (0, 1)      (1, 0)      (0, -1)
  8.  (1, -1, 0)^t   (-1, 0)     (1, 0)      (0, -1)
  9.  (1, 0, -1)^t   (-1, 0)     (1, 0)      (0, -1)
 10.  (1, 1, 1)^t    (-1, 1)     (1, 1)      (0, -2)
 11.  (1, 1, -1)^t   (-1, -1)    (1, -1)     (0, -2)
 12.  (1, -1, 1)^t   (-1, -1)    (1, -1)     (0, -2)
 13.  (1, -1, -1)^t  (-1, 1)     (1, 1)      (0, -2)

The actual number of cells, data flow timing, and data movement information are shown in Table 6.4. In this table, data movement indicates not only the stationary and non-stationary data, but also the relative directions. In Table 6.4, the word "stay" implies stationary data. For example, in systolic solution 4, a's move south (S), b's move west (W), and c's move north (N). These directions are relative directions. In solution 4, the data directions indicate that a's and c's move in opposite directions in the array. Some systolic solutions, such as 6 and 9, are similar in the required number of cells, DFT, and data movement. The difference between these solutions is in the order of the resultant's calculation; that is, the elements of matrix C appear as output in a different order in each of these systolic arrays.

Table 6.4. Comparison of Number of Cells, DFT, and Data Movement for matrix multiplication systolic solutions.
      Vx             Number of Cells    DFT    Data Movement
                                               a's    b's    c's
  1   (0, 0, 1)^t    Q*N                1      N      E      stay
  2   (0, 1, 0)^t    M*N                1      stay   W      N
  3   (1, 0, 0)^t    M*Q                1      E      stay   N
  4   (0, 1, 1)^t    N(M+Q-1)           2      S      W      N
  5   (1, 1, 0)^t    M(N+Q-1)           2      S      N      E
  6   (1, 0, 1)^t    Q(N+M-1)           2      E      S      N
  7   (0, 1, -1)^t   N(M+Q-1)           2      S      W      N
  8   (1, -1, 0)^t   M(N+Q-1)           2      E      W      N
  9   (1, 0, -1)^t   Q(N+M-1)           2      E      S      N
  10  (1, 1, 1)^t    (N+M-1)(M+Q-1)     3      SE     SW     N
  11  (1, 1, -1)^t   (N+M-1)(M+Q-1)     1      NE     NW     N
  12  (1, -1, 1)^t   (N+M-1)(M+Q-1)     1      NE     NW     N
  13  (1, -1, -1)^t  (N+M-1)(M+Q-1)     3      SE     SW     N

The best systolic solution for matrix multiplication depends on the specific application. In order to compare these systolic solutions under similar conditions, 5 x 5 square matrix multiplication was simulated using each of the solutions. Table 6.5 shows the results of the simulations. In Table 6.5, dA is the area of a macro cell, dT is the duration of the clock cycle, and tau is the loading/unloading time.

Table 6.5. Simulation results for square (5 x 5) matrix multiplication.

      Vx             Total Area    Total Time
  1   (0, 0, 1)^t    25 dA         14 dT + tau
  2   (0, 1, 0)^t    25 dA         14 dT + tau
  3   (1, 0, 0)^t    25 dA         14 dT + tau
  4   (0, 1, 1)^t    45 dA         22 dT
  5   (1, 1, 0)^t    45 dA         18 dT
  6   (1, 0, 1)^t    45 dA         22 dT
  7   (0, 1, -1)^t   45 dA         22 dT
  8   (1, -1, 0)^t   45 dA         18 dT
  9   (1, 0, -1)^t   45 dA         22 dT
  10  (1, 1, 1)^t    81 dA         22 dT
  11  (1, 1, -1)^t   81 dA         14 dT
  12  (1, -1, 1)^t   81 dA         14 dT
  13  (1, -1, -1)^t  81 dA         22 dT

The results indicate that for minimum chip area the best solutions are 1-3. For minimum execution time, the best solutions are 11 and 12. Solutions 11 and 12 are also the best solutions in terms of throughput (DFT indicates the throughput of each pipeline in the array). Optimal solutions based on AT or AT^2 optimization can also be selected. The total required execution time for solutions 1-3 depends on the implementation of the loading/unloading process. The AT and AT^2 optimal solutions among solutions 4-13 are 5 and 8.

A solution based on total area, total time, or throughput is not necessarily the optimal solution for a specific application.
For example, consider solution 4, where a's and c's move in opposite directions (Figure 6.9). In this case, a11, a12, a13, etc. and c11, c12, c13, etc. enter and exit the same pipeline in the array. The input and output patterns in the other pipelines are also the same. b's move perpendicular to the directions of a's and c's. Solution 4 can therefore be selected as the best solution for calculating powers of a matrix. For example, to compute A^4, where A is a square matrix, the following steps using systolic solution 4 are taken:

    A * A = A^2;   A^2 * A = A^3;   A^3 * A = A^4.                            (6.2-5)

A multiplexer at each output boundary cell is required. To compute A^2, the multiplexers select the array inputs from the entries of A. To compute A^3 and A^4, the entries of A^2 and A^3 are provided directly from the array output cells, respectively. No additional delays or buffers are needed. A reconfigurable systolic array that supports all these cell interconnections will provide a rich class of systolic arrays for matrix multiplication.

Systolic solution 10 corresponds to the hexagonal systolic array given by Kung [40] (the array shown in [40] is for banded matrix multiplication). Solution 5 corresponds to the rectangular solution given by Hwang [6].

A systolic array for any size matrix multiplication described by Eq. 6.2-2 can be designed using the information obtained from 3 x 3 square matrix multiplication. The following steps are used to design a larger systolic array for matrix multiplication:

1. Select the values of N, M, and Q in Eq. 6.2-2 to define the size of the matrices.

2. Collect the indices of the output operands (i.e., (i, j, k) in Eq. 6.2-2), which indicate the cell locations in the geometric representation, G.

3. Select the desired transformation matrix, H, listed in Eq. 6.2-4. Associated with each transformation matrix, H, there exists a projection vector, Vx. Associated with each Vx, there exist the CCR, DFT, data movement, and total number of cells as indicated in Table 6.2 and Table 6.4.

4.
Transform the original geometric representation (G) into a new geometric representation (G') by multiplying H by the coordinates of each cell location in G.

5. Project G' onto the predefined plane (e.g., the j'-k' plane). Each point on the projection plane now indicates an actual cell location.

6. Associated with each H there exists a macro cell. Repeat the macro cell at each cell location on the projection plane to construct the systolic array.

The DFT, data movement, and input/output data patterns of the larger array are the same as those of the smaller array.

Figure 6.2. A projection plane for matrix multiplication; Vx = (0, 0, 1)^t.

Figure 6.3. A systolic solution for matrix multiplication; Vx = (0, 0, 1)^t.
Figure 6.4. A projection plane for matrix multiplication; Vx = (0, 1, 0)^t.

Figure 6.5. A systolic solution for matrix multiplication; Vx = (0, 1, 0)^t.

Figure 6.6. A projection plane for matrix multiplication; Vx = (1, 0, 0)^t.

Figure 6.7. A systolic solution for matrix multiplication; Vx = (1, 0, 0)^t.

Figure 6.8. A projection plane for matrix multiplication; Vx = (0, 1, 1)^t.
Figure 6.9. A systolic solution for matrix multiplication; Vx = (0, 1, 1)^t.

Figure 6.10. A projection plane for matrix multiplication; Vx = (1, 1, 0)^t.

Figure 6.11. A systolic solution for matrix multiplication; Vx = (1, 1, 0)^t.

Figure 6.12. A projection plane for matrix multiplication; Vx = (1, 0, 1)^t.

Figure 6.13. A systolic solution for matrix multiplication; Vx = (1, 0, 1)^t.

Figure 6.14. A projection plane for matrix multiplication; Vx = (0, 1, -1)^t.

Figure 6.15. A systolic solution for matrix multiplication; Vx = (0, 1, -1)^t.

Figure 6.16. A projection plane for matrix multiplication; Vx = (1, -1, 0)^t.

Figure 6.17. A systolic solution for matrix multiplication; Vx = (1, -1, 0)^t.
03 022 0221.022 0121,01022 0312, a .031 0212. 02 .031 0112. 01, 031 4 .,. ® ('3 0331 0231 c131 0322 0222 0122 0313 021301113 0320, 031, 012 0220. 021. 012 0120, 011, 0122 0311.032. 021 0211, 022. 021 0111, 012.17 3.1.. ® 0321 0221 c121 0312 0212 0112 0310. 031. 011 0210. 021. 011 0110. 011. 011 2 ... ® ® ® 0311 0211 0111 L L l \ r I r 7 -3 -2 -1 Figure 6.14. A projection plane for matrix multiplication; Vx = (0, 1, -1)‘. lNqu-UQV ifl N 8°89 L. O L§o&o&ol 161 O a a a u. ‘ONO‘H 012 011 O 013 I 00-04-000 t-3 4 5 0 7 0 023 0 032 013 0 022 0 031 0 012 O 021 00011 Figure 6.15. A systolic solution for matrix multiplication; Vx = (0, l, -1)‘. 162 0332 0330. 031. 013 0331 022 g 023 0331. 032. 023 0230. 021. 013 0321 0231 83:22 ”3’ 033 d.” 022 0321. 0322 0232 0320. 031. 012 .0”. mmvmmo 0131. 012. 023 0231 0311 m 0131 033. $23. 0223 0133 0311. 032. 021 0310131. 011 8020. 021. 812 0130. 011. 013 011 012 $5322} 6) .5 o o 0111 8:133: 3032 0132. .9“ 2. 22. 0211 0121 0120. 011 0112. 013. 031 E“) 0113 0111. 012. 021 © 0112 0110. 011. 011 i Figure 6.16. A projection plane for matrix multiplication; Vx = (l, -1, 0)‘. ”00.00% A 163 0 0 013 012 0 061 032 033 0 c - c + 0 a 0 021 0 033 0 032 0 031 0 0 ' 0 0 109 8 7t-6 00011 O i Q o 012 0 e21 l , . ... 013 0 022 0 031 1 O l 0 00230032 011 0 0 021 0 012 0 031 0 022 0 Rotated by 90 Degreee 013 O 032 033 0 Figure 6.17. A systolic solution for matrix multiplication; Vx = (l, -l, O)‘. 164 ”r 0312, 033. 031 0322. 033. 032 0332. 033, 033 6 r ('9 0313 c323 0333 0311, 832; 821 0321. 032. 822 0331, 83 023 0212. 023, 031 0222. 023, 032 0232, 02 . 033 5 ..1. ® 0312 c322 0332 c213 0223 0233 0310, 031, 011 0320, 031. 012 0330. 031. 013 0211, 022. 021 0221. 022. 022 0231, 022. 023 0112. 013, 031 0122. 013. 032 0132. 013, 033 4. 1* ® ® 0311 c321 0331 0212 0222 0232 0113 0123 0133 0210, 021. 011 0220. 021. 012 0230. 021, 013 0111, 012. 021 0121, 012, 022 0131, 012. 023 o 0211 0221 0231 0112 0122 0132 0110. 011, 011 0120, 011, 012 0130. 011. 
013 o o 0111 0121 0131 L I 4 A I I T I 1 2 3 Figure 6.18. A projection plane for matrix multiplication; Vx = (1, 0, —l)‘. 0-0+000 1.5 b 7 0 0 10 0 11 12 00 033 0 0 023 0 032 0 013 0 022 0 031 012 0 021 0 01100 165 031 033 0 032 032 0 021 031 0 0 033 023 0 0 022 022 0 0 011 021 0 0 023 013 0 0 012 012 0 0 0 011 0 0 013 0 n ...NGQCION Figure 6.19. A systolic solution for matrix multiplication; Vx = (1, 0, -1)‘. 166 k. A 0112. 013. 031 ‘ r ED 0212. 823. 831 “‘3 8122, 813. 032 o o 0222. 023. 032 0312, 033. 031 0213 0111. 012. 021 0123 0132. 013. 033 o o 0 an 033. 832 8232. 023. 033 “‘3 8211. 022. 021 $5" 0121. 012. 022 “33 ® a: a a C2 . 0 0 e311. 832. 821 312‘; 0110. 011. 811 0122 0131. 812. 823 °" (‘9 ‘ ('5 C9 0321. 032. 022 0231. 022. 023 ‘3‘: 0210, 021. 811 g 0120. 011. 812 “‘32 “I .11.. ® 0‘“ ® 0331. 032. 023 8310. 031. 811 M0211 0220. 021. 812 :3? 0130. 011. 013 4+ (‘9 ® (”9 ‘3" 0320. 031. 012 323} case. 021. 813 “3‘ o o “32‘ 0330. 031. 013 a“ -4 .1). ® 0331 L l l 1L 1 X F F f r r I -2 -1 O 1 2 Figure 6.20. A projection plane for matrix multiplication; Vx = (l, 1, 1)‘. 167 9. i 0 7 021 0 012 0 031 0 0 0 013 9 0 022 0 10 032 0 023 11 0 12 033 0130000 0000031 0230001200 0002100032 0330002200011 0110002200033 03200021 001200023 ”’1 00013 Figure 6.21. A systolic solution for matrix multiplication; Vx = (l, 1, 1)‘. 168 W I 12.1 ® W “33 W 1151.. 61-) ® 8331-1132-82- 8312—1133—1131 “‘3 .033 8132-1113-1130 o e o 6321—1132-1122 «231-822-823 8313 .332 a m 8133 .011- 6 “-013 ® 8311-1132-1121 8112-1113-1131 0131-01“” 8 «013 ® 8123 «320-831-812 «230-821-813 O ‘3" «211-402-821 3 0121-01sz “a 7d- 6 “‘3 ® . mar-812 8321 «2.31 arm-811 8212 8111-1112-4121 8122 01H1-013 o - o o ‘9" arm-811 331‘ 0120-011-012 ‘1" 3-1- 6 ‘2" 0110-011-011 “‘31 ‘db 6 0111 r J 1 1L L _\ f I IV I 1 r 7 -2 -1 o 1 2 Figure 6.22. A projection plane for matrix multiplication; Vx = (l, l, —1)‘. 170 yr 8132. C11; 033 e... 4122. 013 032 “’3 W 7.... ,2, cm 023. 032 ® «12.113831 ° 0131. 812.823 “”3 menu: 0 .. 
1 S2 «212023.831 not... 69 0121. 012. 022 g «231. m. 82.3 ‘3” w 69 3,108,, .4, e 0111. 012. 821 8.1.3 «130631. 813 g 8331. .32. 023 4.1. 6 «011. «22. 821 «021. 832. 822 “"3 8120 011. 812 3%: cm 021. 813 “m 3.... ‘ 0131 ® 0110. 811. 811 «121 8220. 021. 812 m, can. 831. 813 2.1. 6 6 °‘" 0210-1121-1011 3} 8.120. 0'51. 012 ‘3" 1..- 6’) ® ‘3" 8310. 031. 811 “3‘ odi- 8311 ‘1 L L l L x f I I I— I 1 I 2 3 4 3 0 Figure 6.24. A projection plane for matrix multiplication; Vx = (1, -l, 1)‘. 171 t-B 011 021 031 032 033' 7 012 022 023 8 013 \ o '\ o o 033 '\ 032 031 823 822 021 813 012 Figure 6.25. A systolic solution for matrix multiplication; Vx = (l, -1, l)‘. 172 1: T 8312. on 031 e. ® 0212. «23 831 ‘3“ 8322. 833. 832 7 -. 0213 ‘3‘" '3’: ”2‘ Q 8112. 013. 831 0222. 023. 832 0332. 833. 833 0.3,. 8011 022. 821 6 8321 on. 822 6 «113 1 8312 1 8333 an. «131 832 8223 «232. as. 033 “-0 ® 8310. 031. 811 ® 8212 «021. 022. 822 .322 .111. .12. 821 .123 0132.6 833 82.33 8331.6 823 4 .1.- ea10. 021. 011 0320. 031. 812 “"2 0121. 012. 822 :2; e231. «22. 823 “a 3.... 8133 S2 0220. «21. 812 Q 0110. 011. 011 0122 «131.615. 823 0232 03306031. 813 2.... “"1 8120. 011. 812 5‘}; an. «21. 813 “’1 ‘dj- 6 “‘2' 0130. 011. 813 “3' e. j. «131 .L i i J r 3 I 2 3 0 Figure 6.26. A projection plane for matrix multiplication; Vx = (1, -1, —l)‘. 173 ”'9 021 ‘3‘ 032 g 011 8 022 0 033 10 012 8 033 11 12 013 0330000 0000031 0230003200 0002100032 0130002200031 0110002200033 812008010 081200823 «1100?? Venue e-c+e-0 0 0 Figure 6.27. A systolic solution for matrix multiplication; Vx = (l, -—l, —l)‘. 
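The projection planes and solutions above all come from projecting the three-dimensional index points of the matrix-multiplication simple computations along a projection vector Vx. As a minimal sketch of that grouping step (the function name and the invariant used as a grouping key are mine, not the dissertation's notation), two index points are computed in the same cell exactly when they differ by an integer multiple of Vx:

```python
from itertools import product

def project_cells(points, v):
    """Group index points that differ by an integer multiple of the
    projection vector v; each group becomes one cell of the array."""
    m = next(i for i, c in enumerate(v) if c != 0)   # a pivot axis of v
    cells = {}
    for p in points:
        # p[i]*v[m] - p[m]*v[i] is unchanged when p is replaced by p + t*v,
        # so these values identify the line through p with direction v
        key = tuple(p[i] * v[m] - p[m] * v[i] for i in range(len(v)) if i != m)
        cells.setdefault(key, []).append(p)
    return cells

# index points (i, j, k) of the 3 x 3 matrix-multiplication simple computations
pts = list(product(range(1, 4), repeat=3))
cells = project_cells(pts, (0, 1, -1))   # Vx = (0, 1, -1)^t
```

For Vx = (0, 1, -1)^t this maps the 27 simple-computation points onto 15 cell locations, the kind of many-to-one mapping the projection planes above depict.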
6.2.2 LU DECOMPOSITION

An N x N LU decomposition algorithm, A = LU, where A = (a_ij) is an N x N nonsingular matrix, L = (l_ij) is an N x N lower triangular matrix with a unit diagonal, and U = (u_ij) is an N x N upper triangular matrix, can be described as follows:

l_ij = (a_ij - Σ_{k=1}^{j-1} l_ik * u_kj) / u_jj   for i = 2, ..., N and j = 1, ..., i-1;   6.2-6

u_ij = a_ij - Σ_{k=1}^{i-1} l_ik * u_kj   for i = 1, ..., N and j = i, ..., N.

[Figure 6.28. The projection planes of LU decomposition.]
[Figure 6.29. The systolic solution for LU decomposition.]

6.2.3 SOLUTION OF LOWER TRIANGULAR SYSTEM OF EQUATIONS

Let L be an N x N lower triangular matrix with a unit diagonal. Let x and b be two N x 1 vectors where the unknown vector, x, is the solution of the linear system Lx = b. The elements of vector x are calculated as follows:

x_i = b_i - Σ_{j=1}^{i-1} l_ij * x_j   for i = 1, 2, ..., N.   6.2-12

[Figure 6.31. The systolic solution of Lx = b.]

6.2.4 FINITE IMPULSE RESPONSE (FIR) FILTER

An FIR filter with K weights can be described by the following expression:

y_i = Σ_{k=1}^{K} w_k * x_{i+k-1}   6.2-18

for i = 1, 2, ... . The computational window index space (CWIS) for this problem is K, which is the size of the window (refer to Chapter 3).
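The recurrences above can be checked with a direct sequential evaluation. The sketch below (plain Python; the function name is mine, and the systolic solutions pipeline this same dependence structure across cells) performs the LU factorization of Eq. 6.2-6 with a unit-diagonal L:

```python
def lu_decompose(A):
    """Doolittle LU factorization with unit-diagonal L, following Eq. 6.2-6:
    l_ij = (a_ij - sum_{k<j} l_ik * u_kj) / u_jj, together with the
    companion row recurrence for U."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):                 # row i of U
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for r in range(i + 1, n):             # column i of L (Eq. 6.2-6)
            L[r][i] = (A[r][i] - sum(L[r][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U

L, U = lu_decompose([[2.0, 1.0], [4.0, 5.0]])
```

Multiplying the two factors back together reproduces A, which is the correctness condition the systolic schedule must also preserve.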
This means that in order to design systolic solutions for the FIR filter, the following problem description is used:

y_i = Σ_{k=1}^{K} w_k * x_{i+k-1}   6.2-19

for i = 1, 2, ..., K. Eq. 6.2-19 is decomposed into its simple computations as follows:

y_{i,0} ≡ 0;   6.2-20
y_{i,k} (*+) = y_{i,k-1} (*+) + w_{k,0} ( ) * x_{i+k-1,0} ( );
y_i ≡ y_{i,K}.

The index spaces of the w's and x's are artificially expanded (by adding a 0 to each index) to match the dimension of the index space of the y's. Each index in Eq. 6.2-20 is now a member of Z^2. For K = 3, the equivalent simple computations are as follows:

y10 ≡ 0;   6.2-21
y11 (*+) = y10 (*+) + w10 ( ) * x10 ( );
y12 (*+) = y11 (*+) + w20 ( ) * x20 ( );
y13 (*+) = y12 (*+) + w30 ( ) * x30 ( );
y1 ≡ y13;
y20 ≡ 0;
y21 (*+) = y20 (*+) + w10 ( ) * x20 ( );
y22 (*+) = y21 (*+) + w20 ( ) * x30 ( );
y23 (*+) = y22 (*+) + w30 ( ) * x40 ( );
y2 ≡ y23;
y30 ≡ 0;
y31 (*+) = y30 (*+) + w10 ( ) * x30 ( );
y32 (*+) = y31 (*+) + w20 ( ) * x40 ( );
y33 (*+) = y32 (*+) + w30 ( ) * x50 ( ).

The simple computations of Eq. 6.2-20 are all of the same type and form only one set of simple computations. Each simple computation is computed in a cell performing multiplication and addition operations. The systolic design procedures were applied to the simple computations of Eq. 6.2-21 and the valid global projection vectors are shown in Table 6.8.

Table 6.8. Global valid projection vectors for the FIR problem.
     Vx        Vy        Vz        Vw
1.  (0, 1)^t  (0, 1)^t  (1, 0)^t  (1, 0)^t
2.  (1, 0)^t  (1, 0)^t  (0, 0)^t  (1, 0)^t

The corresponding geometric transformation matrices are:

[ 0  1 ]
[-1  0 ]   for Vx = (0, 1)^t;   6.2-22

[ 1  0 ]
[ 0  1 ]   for Vx = (1, 0)^t.

The projection planes and the systolic solutions are shown in Figures 6.32 - 6.35. The directions of the data communication lines are shown in Table 6.9.

[Table 6.9. Data input directions for the FIR systolic solutions.]
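Evaluated sequentially, the simple computations of Eq. 6.2-20 reduce to the following sketch (0-based indices and the function name are mine); a systolic solution distributes the same accumulation steps across cells and clock cycles:

```python
def fir_simple_computations(w, x, n_out):
    """Evaluate Eq. 6.2-20 directly: y_{i,0} = 0;
    y_{i,k} = y_{i,k-1} + w_k * x_{i+k-1}; y_i = y_{i,K}
    (indices shifted to 0-based here)."""
    K = len(w)
    ys = []
    for i in range(n_out):
        acc = 0                      # y_{i,0}
        for k in range(K):
            acc += w[k] * x[i + k]   # one simple computation per step
        ys.append(acc)
    return ys

w = [1, 2, 3]
x = [1, 1, 2, 3, 5]
ys = fir_simple_computations(w, x, 3)
```

Each pass of the inner loop corresponds to one line of the K = 3 expansion in Eq. 6.2-21.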
Although DFI = 1 for Vx = (0, 1)^t, due to the delay registers one output is produced every other clock cycle even though the actual data is supplied at every cycle. In Figure 6.33, the elements of the window are supplied repeatedly. To reduce the I/O bandwidth, the elements of the window can be preloaded. This result is shown in Figure 6.36; however, a long communication line is needed in this case. These solutions are also given by Kung [5]. As before, the geometric transformation matrices can be used to design larger arrays for larger window sizes. However, in this case, the design is a linear array and larger designs can be produced simply by adding additional cells.

[Figure 6.32. The projection line of the FIR filter for Vx = (0, 1)^t.]
[Figure 6.33. The systolic solution of the FIR filter for Vx = (0, 1)^t.]
[Figure 6.34. The projection line of the FIR filter for Vx = (1, 0)^t.]
[Figure 6.35. The systolic solution of the FIR filter for Vx = (1, 0)^t.]
[Figure 6.36. An equivalent systolic solution of Figure 6.33.]

6.2.5 2D CONVOLUTION

A K x L 2D convolution can be described by the following expression:

y_ij = Σ_{k=1}^{K} Σ_{l=1}^{L} w_kl * x_{i+k-1, j+l-1}   6.2-23

for i = 1, 2, ... and j = 1, 2, ... . The computational window index space (CWIS) for 2D convolution is equal to the window size.
The systolic design procedures are applied to the problem described by

y_ij = Σ_{k=1}^{K} Σ_{l=1}^{L} w_kl * x_{i+k-1, j+l-1}   6.2-24

for i = 1, 2, ..., K and j = 1, 2, ..., L, which has bounded index spaces. Eq. 6.2-24 is decomposed into equivalent simple computations as follows:

y_{ij,0} ≡ 0;   6.2-25
y_{ij,t} (*+) = y_{ij,t-1} (*+) + w_{kl,0} ( ) * x_{i+k-1,j+l-1,0} ( );
y_ij ≡ y_{ij,K*L};

for k = 1, 2, ..., K, l = 1, 2, ..., L, t = 1, 2, ..., K*L, i = 1, 2, ..., K, and j = 1, 2, ..., L. Again the index spaces of the w's and x's are artificially extended (by adding a 0 to each index) to match the dimension of the index space of the y's. Each index in Eq. 6.2-25 is now a member of Z^3. For K = 2 and L = 2, which defines a 2 x 2 window, the simple computations are as follows:

y110 ≡ 0;   6.2-26
y111 (*+) = y110 (*+) + w110 ( ) * x110 ( );
y112 (*+) = y111 (*+) + w120 ( ) * x120 ( );
y113 (*+) = y112 (*+) + w210 ( ) * x210 ( );
y114 (*+) = y113 (*+) + w220 ( ) * x220 ( );
y11 ≡ y114;
y120 ≡ 0;
y121 (*+) = y120 (*+) + w110 ( ) * x120 ( );
y122 (*+) = y121 (*+) + w120 ( ) * x130 ( );
y123 (*+) = y122 (*+) + w210 ( ) * x220 ( );
y124 (*+) = y123 (*+) + w220 ( ) * x230 ( );
y12 ≡ y124;
y210 ≡ 0;
y211 (*+) = y210 (*+) + w110 ( ) * x210 ( );
y212 (*+) = y211 (*+) + w120 ( ) * x220 ( );
y213 (*+) = y212 (*+) + w210 ( ) * x310 ( );
y214 (*+) = y213 (*+) + w220 ( ) * x320 ( );
y21 ≡ y214;
y220 ≡ 0;
y221 (*+) = y220 (*+) + w110 ( ) * x220 ( );
y222 (*+) = y221 (*+) + w120 ( ) * x230 ( );
y223 (*+) = y222 (*+) + w210 ( ) * x320 ( );
y224 (*+) = y223 (*+) + w220 ( ) * x330 ( );
y22 ≡ y224.

The systolic design procedures were applied to the simple computations of Eq. 6.2-26 and the valid global projection vectors are shown in Table 6.10. As shown in Table 6.10, the projection vector of the w's is Vz = (0, 0, 0)^t, indicating stationary operands. This implies that the elements of the window are preloaded into the internal cell registers prior to the start of the operations.
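A direct sequential evaluation of Eq. 6.2-25 looks as follows (0-based indices and names are mine); each addition in the inner loops corresponds to one simple computation of Eq. 6.2-26 when K = L = 2:

```python
def conv2d(w, x, out_rows, out_cols):
    """Evaluate Eq. 6.2-25 directly: y_{ij,0} = 0, then one product
    w_kl * x_{i+k-1, j+l-1} is accumulated per step t (0-based here)."""
    K, L = len(w), len(w[0])
    y = [[0] * out_cols for _ in range(out_rows)]
    for i in range(out_rows):
        for j in range(out_cols):
            acc = 0                               # y_{ij,0}
            for k in range(K):
                for l in range(L):
                    acc += w[k][l] * x[i + k][j + l]
            y[i][j] = acc                          # y_ij = y_{ij,K*L}
    return y

w = [[1, 0],
     [0, 1]]
x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
y = conv2d(w, x, 2, 2)   # 2 x 2 window over a 3 x 3 image
```

The stationary w operands of Table 6.10 correspond to w[k][l] never moving in this loop nest; only x and the partial sums of y flow.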
Table 6.10. Valid global projection vectors for 2D convolution.
     Vx            Vy            Vz            Vw
1.  (0, 1, 0)^t   (0, 1, 0)^t   (0, 0, 0)^t   (0, 1, 0)^t
2.  (1, 0, 0)^t   (1, 0, 0)^t   (0, 0, 0)^t   (1, 0, 0)^t
3.  (1, 1, 0)^t   (1, 1, 0)^t   (0, 0, 0)^t   (1, 1, 0)^t
4.  (1, -1, 0)^t  (1, -1, 0)^t  (0, 0, 0)^t   (1, -1, 0)^t

The corresponding geometric transformation matrices are:

[ 0  1  0 ]
[-1  0  0 ]   for Vx = (0, 1, 0)^t;   6.2-27
[ 0  0  1 ]

[ 1  0  0 ]
[ 0  1  0 ]   for Vx = (1, 0, 0)^t;
[ 0  1  1 ]

[ 1  0  0 ]
[-1  1  0 ]   for Vx = (1, 1, 0)^t;
[ 0  0  1 ]

[ 1  0  0 ]
[ 1  1  0 ]   for Vx = (1, -1, 0)^t.
[ 0  0  1 ]

For Vx = (0, 1, 0)^t, the projection plane and its corresponding systolic solution are shown in Figures 6.37 and 6.38. As shown in Figure 6.38, the solution contains one non-uniform cell interconnection. Although the simple computations are all of the same type and correspond to a homogeneous systolic solution, the solution of Figure 6.38 should be designed as the combination of two sub-systolic arrays. The left four cells and the right four cells form the two sub-systolic arrays. This systolic array requires a high I/O bandwidth. Another systolic solution, derived from the solution of Figure 6.38, is shown in Figure 6.39. This systolic array is constructed by combining the two cell rows of Figure 6.38 into one cell row. Additionally, the systolic solution of Figure 6.39 is operated in a non-standard manner as follows: 1. xij's with odd i enter the array at the top x-data stream; 2. xij's with even i enter the array at the bottom x-data stream; 3. each cell uses an x-data stream for 4 consecutive cycles before switching to the other stream and continues alternating in this manner. The other valid global projection vectors, similar to Vx = (0, 1, 0)^t, result in systolic solutions that should be decomposed into sub-systolic arrays. By overlapping the cells in each of these solutions, the systolic solution shown in Figure 6.39 is also constructed. Figure 6.40 illustrates the systolic solution with nine weights (3 x 3 window). This array is operated as follows: 1.
xij's with odd i enter the array at the top x-data stream; 2. xij's with even i enter the array at the bottom x-data stream; 3. each cell uses an x-data stream for 6 consecutive cycles before switching to the other stream and continues alternating in this manner.

[Figure 6.37. The projection plane of 2D convolution (Eq. 6.2-26) for Vx = (0, 1, 0)^t.]

In Eq. 6.2-25, the products are computed in the order of left to right and top to bottom as the window moves from top to bottom and left to right over the image array. For example, w11 * x11 is computed before w12 * x12, and the computation of y12 cannot be started until both products w11 * x11 and w12 * x12 are known. However, the computations of the y's can be accelerated by computing the products in the order of right to left and bottom to top as the window moves from top to bottom and left to right over the image array. This means that w12 * x12 is computed before w11 * x11, and consequently the computation of y12 can start immediately after the computation of w12 * x12. This results in the following decomposition of Eq. 6.2-24:

y_{ij,0} ≡ 0;   6.2-28
y_{ij,t} (*+) = y_{ij,t-1} (*+) + w_{kl,0} ( ) * x_{i+k-1,j+l-1,0} ( );
y_ij ≡ y_{ij,K*L};

for k = K, K-1, ..., 1, l = L, L-1, ..., 1, t = 1, 2, ..., K*L, i = 1, 2, ..., K, and j = 1, 2, ..., L.
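The reordering behind Eq. 6.2-28 changes only the schedule of the partial sums, not their final values. A minimal check (0-based indices and names are mine):

```python
def conv2d_window(w, x, i, j, reverse=False):
    """One output y_ij, accumulated in the forward product order of
    Eq. 6.2-25 or, with reverse=True, in the reversed window order
    (k = K..1, l = L..1) of Eq. 6.2-28."""
    K, L = len(w), len(w[0])
    steps = [(k, l) for k in range(K) for l in range(L)]
    if reverse:
        steps.reverse()
    acc = 0                                  # y_{ij,0}
    for k, l in steps:                       # one product per step t
        acc += w[k][l] * x[i + k][j + l]
    return acc

w = [[1, 2], [3, 4]]
x = [[1, 0, 2], [0, 1, 0], [2, 0, 1]]
# both orders accumulate the same y_ij; only the schedule differs
same = conv2d_window(w, x, 0, 0) == conv2d_window(w, x, 0, 0, reverse=True)
```

What the reversed order buys is pipelining: the first product needed for y_{i,j+1} becomes available earlier, as described above.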
For a 2 x 2 window, the simple computations are as follows:

y110 ≡ 0;   6.2-29
y111 (*+) = y110 (*+) + w220 ( ) * x220 ( );
y112 (*+) = y111 (*+) + w210 ( ) * x210 ( );
y113 (*+) = y112 (*+) + w120 ( ) * x120 ( );
y114 (*+) = y113 (*+) + w110 ( ) * x110 ( );
y11 ≡ y114;
y120 ≡ 0;
y121 (*+) = y120 (*+) + w220 ( ) * x230 ( );
y122 (*+) = y121 (*+) + w210 ( ) * x220 ( );
y123 (*+) = y122 (*+) + w120 ( ) * x130 ( );
y124 (*+) = y123 (*+) + w110 ( ) * x120 ( );
y12 ≡ y124;
y210 ≡ 0;
y211 (*+) = y210 (*+) + w220 ( ) * x320 ( );
y212 (*+) = y211 (*+) + w210 ( ) * x310 ( );
y213 (*+) = y212 (*+) + w120 ( ) * x220 ( );
y214 (*+) = y213 (*+) + w110 ( ) * x210 ( );
y21 ≡ y214;
y220 ≡ 0;
y221 (*+) = y220 (*+) + w220 ( ) * x330 ( );
y222 (*+) = y221 (*+) + w210 ( ) * x320 ( );
y223 (*+) = y222 (*+) + w120 ( ) * x230 ( );
y224 (*+) = y223 (*+) + w110 ( ) * x220 ( );
y22 ≡ y224.

[Figure 6.38. The systolic solution for 2D convolution (Eq. 6.2-26) for Vx = (0, 1, 0)^t.]

The systolic design procedures were applied to the simple computations of Eq. 6.2-28 and the valid global projection vectors are the same as those shown in Table 6.10. For Vx = (0, 1, 0)^t, the projection plane and its corresponding systolic solution are shown in Figures 6.41 and 6.42. Again, the systolic solution is made of two sub-systolic arrays. The throughput has been improved compared to that of the systolic solution of Figure 6.38. However, additional delay registers are needed. This solution can be similarly improved by combining the two cell rows into one cell row. The result is shown in Figure 6.43. The operations of the systolic solution of Figure 6.43 are as follows: 1. xij's with odd i enter the array at the top x-data stream; 2.
xij's with even i enter the array at the bottom x-data stream; 3. each cell uses an x-data stream for 2 consecutive cycles before switching to the other stream and continues alternating in this manner. Similarly, the systolic solutions corresponding to the other global valid projection vectors are made of sub-systolic solutions. Again, overlapping of the cells in these solutions results in the systolic solution of Figure 6.43. Figure 6.44 illustrates the systolic array for 3 x 3 2D convolution. This array is operated as follows: 1. xij's with odd i enter the array at the top x-data stream; 2. xij's with even i enter the array at the bottom x-data stream; 3. each cell uses an x-data stream for 3 consecutive cycles before switching to the other stream and continues alternating in this manner.

[Figure 6.39. The systolic solution of 2D convolution (Eq. 6.2-26) constructed by overlapping the cells in the solution of Figure 6.38.]

A systolic solution similar to the one shown in Figure 6.44 is given by Kung et al. [41]. In their solution the products are computed in the order of bottom to top and right to left. There are no other significant differences.

[Figure 6.40. The systolic solution of Figure 6.39 with nine weights.]
[Figure 6.41. The projection plane of 2D convolution (Eq. 6.2-28) for Vx = (0, 1, 0)^t.]
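The transformation matrices of Eq. 6.2-27 are integer and non-singular, so each maps the integer index points one-to-one onto new integer points before projection. A small check using the matrix given there for Vx = (0, 1, 0)^t (the helper name is mine):

```python
def transform(T, p):
    """Apply an integer index-transformation matrix T to an index point p."""
    return tuple(sum(T[r][c] * p[c] for c in range(len(p)))
                 for r in range(len(T)))

# the matrix of Eq. 6.2-27 for Vx = (0, 1, 0)^t
T = [[0, 1, 0],
     [-1, 0, 0],
     [0, 0, 1]]

# integer and non-singular: distinct index points keep distinct images
pts = [(i, j, k) for i in range(3) for j in range(3) for k in range(3)]
images = {transform(T, p) for p in pts}
```

Because the map is one-to-one, no two simple computations collide during the transformation itself; the many-to-one collapse happens only in the subsequent projection.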
[Figure 6.42. The systolic solution for 2D convolution (Eq. 6.2-28) for Vx = (0, 1, 0)^t.]
[Figure 6.43. The systolic solution of 2D convolution (Eq. 6.2-28) constructed by overlapping the cells in the solution of Figure 6.42.]
[Figure 6.44. The systolic solution of Figure 6.43 with nine weights.]

6.3 CHAPTER SUMMARY

This chapter presented the results of applying the systolic array design method to several known problems. The known systolic solutions were reproduced. Furthermore, additional systolic solutions were produced for matrix multiplication and 2D convolution. The method was also applied to the solution of a lower triangular linear system of equations, and no valid projection vectors were found. This means either that no systolic solution exists or that one cell per simple computation is needed. In this case, a systolic solution does exist where each point in the geometric representation indicates an actual cell location (no geometric transformation and projection are performed). The systolic solutions for LU decomposition and the FIR filter were also reported.

CHAPTER 7

DISCUSSION AND CONCLUSION

Systolic arrays are known for their regular structures and simple processing elements (cells). These arrays operate synchronously, performing dependent computations through the pipelines and independent computations in parallel. The use of extensive pipelining and multiprocessing makes them suitable for the implementation of many computationally intensive problems.
Although they have many advantages in their architectural properties, their design involves complex data scheduling; the right data has to reach the right cells at the right time in order to preserve the algorithm's correctness. This involves determination of the spatial and timing distributions of data. Several attempts have been made to systematize the design of systolic arrays either directly from an algorithm or from an intuitive non-systolic design. Presented here is an introduction of a new systematic systolic array design method aimed at solving some of the existing design problems. The new method belongs to that class of design methods which map an algorithm into an architecture. The design method, in general, follows three major steps: algorithm representation, algorithm model, and architecture specification.

The algorithm representation step involves the decomposition of the algorithm computations into a set of basic, or what has been called in this work a set of simple, computations. Each resultant variable is the result of one or more dependent basic computations. Each cell in the systolic array is responsible for a subset of these basic computations. In general, it is desirable to decompose the algorithm into the simplest possible forms of basic computations, thereby allowing maximum parallelism. Some existing mapping methods start with the description of the algorithm in a high-level language with nested do-loops. They then decompose the algorithm by extending the loop indices to every variable in the loop body. With this process a dependency vector is generated for each variable. Often, in loop bodies with complex index structures, this process is ad hoc. In the new method, a space-indexed or time-indexed algorithm with uniform index spaces is described by mathematical expressions. Each resultant variable is a function of certain input variables.
These functions are decomposed into a set of simple computations where each simple computation has the desired form for computation in a simple cell. In this step a designer has complete control in defining the cell structures and hence the complexity of the simple computations. Here, only the simple cell structures with zero, one, or two operations were considered. This implies that the algorithm is decomposed into simple computations with one resultant operand and one, two, or three input operands. An index expansion is performed for the purpose of scheduling the simple computations needed to produce a resultant variable. This scheduling is done by the lexicographical ordering of the indices of the simple computations. Index expansion is performed on the primary inputs for the purpose of equalizing the dimension of all the index spaces. No special ordering is assigned to the primary inputs, which results in complete freedom to calculate the independent resultant variables in any order.
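For matrix multiplication, this representation step can be sketched as follows (the function name and tuple encoding are mine; the decomposition itself is the c_{ij,k} accumulation form used throughout Chapter 6, with the a and b indices expanded by a trailing 0 to equalize dimensions):

```python
from itertools import product

def matmul_simple_computations(N):
    """Simple computations for C = A * B in the form
    c_{ij,k} (*+) = c_{ij,k-1} (*+) + a_{ik,0} ( ) * b_{kj,0} ( ):
    one resultant operand and three input operands per computation."""
    comps = []
    for i, j, k in product(range(1, N + 1), repeat=3):
        result = ("c", (i, j, k))
        inputs = (("c", (i, j, k - 1)),   # dependent partial sum
                  ("a", (i, k, 0)),       # primary input, index expanded
                  ("b", (k, j, 0)))       # primary input, index expanded
        comps.append((result, inputs))
    return comps

comps = matmul_simple_computations(2)
# one extracted feature: the index difference between each result and its
# accumulation input is the same, (0, 0, 1), for every simple computation
diffs = {tuple(r - s for r, s in zip(res[1], ins[0][1]))
         for res, ins in comps}
```

The constancy of such index differences across all the simple computations is exactly the regularity the design rules exploit.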
Certain information, referred to as features, is systematically extracted from the simple computations. This information includes the operands of each simple computation, the set of operations required for the computation of each operand, the lexicographical index differences of each pair of operands of a simple computation, and the lexicographical index differences of the respective operands of each pair of simple computations. The simple computations are then grouped into subsets of simple computations based on the required set of operations and the same type operands. A systolic array, referred to as a sub-systolic array, is designed separately for each subset of the simple computations. The final array is constructed by joining the sub-systolic arrays. This process, therefore, provides a method by which non- homogeneous systolic arrays are designed by joining several homogeneous systolic arrays (i.e. sub-systolic arrays). The grouping of the simple computations based on the required set of operations incorporates the property of operation complexity in the design process. This grouping of the simple computations prevents two simple computations with different sets of operations from being assigned to the same cell. 210 The grouping of the simple computations based on the operand types is a requirement to satisfy the property of modularity. The prOperties of simple and regular data and control communication and local cell interconnections are represented in terms of a set of rules based on the feature interrelationships. Each feature interrelationship is based on the feature values of a pair of simple computations. Each interrelationship checks for violation of one or more architectural properties. A violation can be either local or global. A local violation is when cell intra-connections, as described by the cell models, are violated. 
A global violation occurs when a pair of simple computations computed in a cell violates an architectural property in some other cell in the design. A pair of simple computations can be computed in the same cell if and only if all the feature interrelationships are satisfied for that pair and all its dependent pairs. The feature interrelationships are translated into a procedure that accepts as input a set of simple computations and produces as output a set of information in terms of the relative distances of the operands. This information, referred to as valid projection vectors, is produced for each subset of the simple computations. These vectors are used to design each of the sub-systolic arrays. For each different sub-systolic solution up to four valid projection vectors (one for each of the operands of simple cells) are produced. Each valid projection vector defines the relative distance between the consecutive pairs of operands in all the cells in a sub-systolic array. In non-homogeneous systolic solutions, the valid projection vectors also define the necessary boundary communication linkages required by the adjacent sub-systolic arrays. In addition, a valid projection vector associated with the output operands provides information used to determine the spatial distribution of the output operands; this in turn defines the spatial distribution of the input operands. 211 The architecture specification step involves the determination of design information needed to construct systolic arrays. This information includes individual cell locations, simple computations of each cell, data movement , data timing, design dimension, cell interconnection patterns, data input/output patterns, and cell functions. Not all of this information is needed to complete the design of a systolic array. The information of individual cell locations, data timing, and cell interconnection patterns are by far the most important subset of the architecture specification. 
Certain architecture information, such as the cell interconnection patterns, can be extracted from other architectural specifications, such as the simple computations of each cell. The individual cell locations are determined by geometric representation and projection of the simple computations. For each subset of the simple computations, the index of an output operand, indicated by a point in space, represents the location of a cell where the corresponding simple. computation is computed. A geometric representation is constructed for each subset of the simple computations. The projection of a geometric representation onto a projection space maps several cell locations onto one location. Each location on the projection space now represents an actual cell location in the design. Furthermore, all the simple computations mapped to the same cell location are computed in that cell in sequence. In general, however, not all such geometric projections will correctly map simple computations to a set of cell locations. Geometric projections are determined to be valid if data communication between the cells is local, simple, and regular. Geometric representations and the valid geometric projections are determined by selecting the valid geometric transformations. An index transformation matrix is selected to transform an original representation into a new representation. The projection of the new geometric representation produces the individual cell locations for a systolic solution. Prior to this work there was no reported systematic method for constructing such matrices. (In one case, however, these matrices are constructed 212 from a set of predetermined cell interconnection patterns and others have reported certain properties of such matrices.) The new systolic array design method provides a method for systematically constructing the transformation matrices from the valid projection vectors. 
The matrices constructed here have different properties than those selected by others. These matrices are integer and non-singular, corresponding to integer and one-to—one transformations between the original and the new geometric representations. Furthermore, the tinting of the computations is determined not from these matrices, but from the order of the computations and the data flow timing which indicates when the actual primary inputs are supplied to the array input cells. The cell interconnections are performed by comparing the operands of the adjacent cells. The valid projection vectors guarantee local communication between the cells. Bach sub-systolic array, by definition, is a homogeneous array; hence the same cell interconnection patterns are used for all the cells in the array. These interconnection patterns are represented by vectors indicating the direction of the input lines. Macro cells are repeated at each cell location to produce the cell interconnections. The line directions are also used to determine the data flow timing. The various steps involved in the design method are illustrated by a Y-chart in Figure 7.1. The axes are Algorithm Representation, Algorithm Model, and Architecture Specification. This Y-chart has been used before to compare several existing systolic array design methods [22]. As indicated in the Y-chart, some of the features are extracted during the algorithm representation. With this step, the repeated extraction of these features during the determination of the valid projection vectors is eliminated. The projection vectors are extracted from the model and if no valid projection vector exists for a set of simple computations then either no systolic solution exists to implement the simple computations or one cell must be reserved for each of the simple computations. This method, therefore, can also be used to verify whether a 213 systolic solution exists for a given set of set of simple computations. 
In addition, it could also be used to verify the correctness of a given systolic solution. This method also produces all the projection vectors for a set of simple computations. This is important as there may exist many different systolic solutions with different performance parameters. In fact, an index proportional to the total number of required cells, referred to as cell count ratio, is calculated from the projection vectors. This index can be used early on in a design process to select systolic solutions which require fewest cells. Data movement information is also determined earlier in the design process to identify stationary and non-stationary operands in the array. Systolic solutions with non-stationary operands usually require higher I/O bandwidth. This method was applied to several known problems and the known systolic solutions were reproduced. For some problems, such as matrix multiplication and 2D convolution, it produced new, previously undiscovered solutions. Moreover, some are found to have better performance parameters than previously known solutions. In this method, the design rules are described in terms of the features extracted from the simple computations. As the problem size increases so does the number of the simple computations. Systolic arrays are highly structured and apply to highly .structured algorithms. Therefore, the design information, such as the geometric transformation matrices, determined from a small problem size are used to design larger arrays. The exploration of this new approach for systematically designing systolic arrays started with the goal of discovering the relationships between two different representations of a class of highly structured algorithms; mathematical expressions with bounded and uniform index spaces and their systolic solutions. It is not wholly inappropriate to view systolic arrays as a form of algorithm representation because 1) they are highly structured; and 2) they are unique for each algorithm. 
(A systolic array is unique in the sense that it is tailored to a specific application.) The hypothesis that there must exist certain inherent relationships between these two representations has been validated as a result of this research.

[Figure 7.1: Y-chart representation of the new systolic array design method, with axes Algorithm Representation, Algorithm Model, and Architecture Specification.]

LIST OF REFERENCES

[1] H. T. Kung and C. E. Leiserson, “Systolic Arrays (for VLSI),” Sparse Matrix Proceedings 1978, Society for Industrial and Applied Mathematics, pp 256-282.

[2] P. R. Cappello and K. Steiglitz, “Digital Signal Processing Applications of Systolic Algorithms,” in VLSI Systems and Computations, H. T. Kung et al. eds., Computer Science Press, Oct. 1981, pp 243-254.

[3] R. W. Priester, H. J. Whitehouse, K. Bromley and J. B. Clary, “Signal Processing with Systolic Arrays,” IEEE 1981, pp 207-215.

[4] H. T. Kung, “Let's Design Algorithms for VLSI Systems,” Caltech Conference on VLSI, Jan. 1979, pp 75-90.

[5] H. T. Kung, “Why Systolic Architectures?,” Computer Magazine 15(1), 1982, pp 37-46.

[6] K. Hwang and Y-K Cheng, “VLSI Computing Structures for Solving Large-Scale Linear System of Equations,” Conf. on Parallel Processing, 1980, pp 217-227.

[7] L. Johnsson and D. Cohen, “Computational Arrays for the Discrete Fourier Transform,” COMPCON Spring 1981, San Francisco CA, Feb. 1981, pp 236-244.

[8] M. J. Foster and H. T. Kung, “The Design of Special-Purpose VLSI Chips,” Computer, Jan. 1980, pp 26-40.

[9] H. T. Kung, L. M. Ruane and D. W. L.
Yen, “A Two-Level Pipelined Systolic Array for Convolutions,” in VLSI Systems and Computations, H. T. Kung et al. eds., Computer Science Press, Oct. 1981, pp 255-264.

[10] J. L. Bentley and H. T. Kung, “An Introduction to Systolic Algorithms and Architectures,” Nav. Res. Rev. (USA), vol. 35, no. 2, 1983, pp 3-16.

[11] L. Johnsson and D. Cohen, “A Mathematical Approach to Modeling the Flow of Data and Control in Computational Networks,” in VLSI Systems and Computations, H. T. Kung et al. eds., Computer Science Press, Oct. 1981, pp 213-225.

[12] G. J. Li and B. W. Wah, “The Design of Optimal Systolic Algorithms,” Proc. of Computer Software and Applications Conf., 1983, pp 310-319.

[13] C. E. Leiserson, F. M. Rose and J. B. Saxe, “Optimizing Synchronous Circuitry by Retiming,” Proc. Third Caltech Conf. on VLSI, ed. R. Bryant, Comp. Sci. Press, 1983, pp 87-116.

[14] P. R. Cappello and K. Steiglitz, “Unifying VLSI Array Designs with Geometric Transformations,” 1983 Conf. on Parallel Processing, 1983, pp 448-457.

[15] S. Y. Kung, “On Super Computing with Systolic/Wavefront Array Processors,” Proc. IEEE, Vol. 72, No. 7, 1984, pp 868-884.

[16] I. V. Ramakrishnan, D. S. Fussell and A. Silberschatz, “On Mapping Homogeneous Graphs on a Linear Array-Processor Model,” Proc. of 1983 Int'l Conf. on Parallel Processing, 1983, pp 440-447.

[17] R. H. Kuhn, “Transforming Algorithms for Single-Stage and VLSI Architecture,” Workshop on Interconnection Networks for Parallel and Distributed Processing, 1980, pp 11-19.

[18] W. L. Miranker and A. Winkler, “Space-time Representations of Computational Structures,” Computing, Vol. 32, 1984, pp 93-114.

[19] J. A. B. Fortes and F. P. Presicce, “Optimal Linear Schedules for the Parallel Execution of Algorithms,” 1984 Conf. on Parallel Processing, Aug. 1984, pp 322-328.

[20] P. Quinton, “Automatic Synthesis of Systolic Arrays from Uniform Recurrent Equations,” Proc. 11th Symp. Comp. Architecture, 1984, pp 208-214.

[21] D.
Gannon, “Pipelining Array Computations for MIMD Parallelism: A Function Specification,” 1982 Conf. on Parallel Processing, 1982, pp 284-286.

[22] J. A. B. Fortes, K. S. Fu and B. W. Wah, “Systematic Approaches to the Design of Algorithmically Specified Systolic Arrays,” 1985 Int'l Conf. on Acoustics, Speech, and Signal Processing, Tampa, Florida, Mar. 1985, pp 300-303.

[23] L. Johnsson, U. Weiser, D. Cohen, and A. L. Davis, “Towards a Formal Treatment of VLSI Arrays,” Caltech Conf. on VLSI, Jan. 1981, pp 375-398.

[24] D. S. Broomhead, J. G. Harp, J. G. McWhirter, K. J. Palmer and J. G. B. Roberts, “A Practical Comparison of the Systolic and Wavefront Array Processing Approach,” 1985 ICASSP, March 1985, pp 297-299.

[25] D. Cohen, “Mathematical Approach to Computational Networks,” USC Technical Report, ISI/RR-78-73, Nov. 1978.

[26] M. Lam and J. Mostow, “A Transformational Model of VLSI Systolic Design,” IFIP 6th Symp. Computer Hardware Descriptive Languages and Applications, 1983, pp 65-77.

[27] H. T. Kung and W. T. Lin, “An Algebra for VLSI Algorithm Design,” Conf. on Elliptic Problem Solvers, April 1983, pp 1-26.

[28] J. A. B. Fortes and D. I. Moldovan, “Data Broadcasting in Linearly Scheduled Array Processors,” 11th Symp. on Computer Architecture, 1984, pp 224-231.

[29] D. I. Moldovan, C. I. Wu and J. A. B. Fortes, “Mapping an Arbitrarily Large QR Algorithm into a Fixed Size VLSI Array,” Proc. of 1984 Int'l Conf. on Parallel Processing, Aug. 1984, pp 365-373.

[30] D. I. Moldovan, “On the Analysis of VLSI Algorithms,” IEEE Trans. on Computers, Vol. C-31, No. 11, Nov. 1982, pp 1121-1126.

[31] D. I. Moldovan, “On the Design of Algorithms for VLSI Systolic Arrays,” Proc. of the IEEE, Vol. 71, No. 1, Jan. 1983, pp 113-120.

[32] C. E. Leiserson and J. B. Saxe, “Optimizing Synchronous Systems,” 22nd Symp. on Foundation of Computer Science, 1981, pp 23-36.

[33] G. J. Li and B. W. Wah, “Optimal Design of Systolic Arrays for Image Processing,” Proc.
of Workshop on Computer Architecture for Pattern Analysis and Image Database Management, Oct. 1983, pp 134-141.

[34] C. E. Leiserson, “Systolic and Semi-systolic Design,” IEEE Int'l Conf. on Computer Design: VLSI in Computers, 1983, pp 627-632.

[35] D. D. Gajski and R. H. Kuhn, “Guest Editor's Introduction: New VLSI Tools,” IEEE Computer, No. 12, Dec. 1983, pp 11-14.

[36] D. I. Moldovan, “ADVIS: A Software Package for the Design of Systolic Arrays,” IEEE Trans. on Computer-Aided Design, Vol. CAD-6, No. 1, Jan. 1987, pp 33-40.

[37] A. Khurshid and P. D. Fisher, “Design of a Reconfigurable Systolic Array Using LSSD Techniques,” Int'l Conf. on Computer Design: VLSI in Computers, Oct. 1984, pp 171-175.

[38] L. Snyder, “Introduction to the Configurable Highly Parallel Computers,” IEEE Computer, Jan. 1982, pp 47-56.

[39] O. H. Ibarra, S. M. Kim, and M. A. Palis, “Designing Systolic Algorithms Using Sequential Machines,” IEEE Trans. on Computers, Vol. C-35, No. 6, June 1986, pp 531-542.

[40] C. Mead and L. Conway, “Introduction to VLSI Systems,” Addison-Wesley, Section 8.3, 1980.

[41] H. T. Kung and R. L. Picard, “Hardware Pipelines for Multi-Dimensional Convolution and Resampling,” IEEE 1981 Computer Architecture for Pattern Analysis and Image Database Management, pp 273-278.

[42] S. Y. Kung, S. C. Lo, S. N. Jean, and J. N. Hwang, “Wavefront Array Processors: Concept to Implementation,” Computer, July 1987, pp 18-33.

[43] B. L. Drake, F. T. Luk, J. M. Speiser, and J. J. Symanski, “SLAPP: A Systolic Linear Algebra Parallel Processor,” Computer, July 1987, pp 45-49.

[44] J. V. McCanny and J. G. McWhirter, “Some Systolic Array Developments in the United Kingdom,” Computer, July 1987, pp 51-63.